Speech Recognition: A Very Short Introductory Course



It is almost impossible to explain to a layperson, in the simplest possible terms, how a computer recognizes speech and converts it into text. Hardly any account of the subject gets by without complex formulas and mathematical terms. We will try to explain, as clearly and with a little simplification as possible, how your smartphone understands speech, when machines learned to recognize the human voice, and in what unexpected areas this technology is used.

A necessary warning: if you are a developer or, especially, a mathematician, you are unlikely to learn anything new from this post and may even complain about the material's lack of scientific rigor. Our goal is to introduce uninitiated readers to speech technologies in the simplest possible way and to tell how and why Toshiba took up creating its own voice AI.

Important milestones in the history of speech recognition


The history of human speech recognition by electronic machines began a little earlier than is usually thought: the countdown usually starts from 1952, but in fact one of the first devices to respond to voice commands was the Televox robot, which we have already written about. Created in 1927 in the USA, Herbert Televox was a simple device in which various relays reacted to sounds of different frequencies. The robot contained three tuning forks, each responsible for its own tone. Depending on which tuning fork responded, one or another relay was activated.

In fact, all of Televox's "insides", including the command recognition system, sat on a rack inside the body of the "robot". Its lid could not be closed, otherwise the tuning forks could not "hear" sounds correctly. Source: Acme Telepictures / Wikimedia.

Televox could be addressed both with individual whistle signals and with short verbal commands, which the tuning forks likewise broke down into sequences of sounds. The robot's creator, Roy Wensley, even staged a demonstration that was fantastic for those times, saying the command "Sesame, open", at which Televox switched the relay responsible for opening a door. No digital technology, neural networks, AI or machine learning - just analog electronics!

The next key invention that paved the way for real recognition of human speech was the Audrey machine, developed in 1952 at Bell Labs. The huge Audrey consumed a lot of electricity and was the size of a good cabinet, but all its functionality came down to recognizing spoken digits from zero to nine. Just ten words, yes, but let's not forget that Audrey was an analog machine.
Unfortunately, history has not preserved any public photographs of Audrey; only a concept drawing survives. Simple on paper, hard to build: according to contemporaries' recollections, Audrey's components occupied an entire cabinet. Source: Bell Labs

It worked like this: the speaker said digits into the microphone, pausing at least 350 ms between words; Audrey converted the sounds it heard into electrical signals and compared them with samples stored in its analog memory. Based on the comparison, the machine lit up the corresponding digit on its panel.

It was a breakthrough, but Audrey brought no real benefit: the machine recognized its creator's voice with 97% accuracy, other specially trained speakers achieved 70-80%, and strangers encountering Audrey for the first time, however hard they tried, saw their digit on the display in only 50% of cases.

Despite results that were revolutionary for its time, Audrey did not find - and could not have found - practical application. It was suggested that the system could replace telephone operators, but human operators remained more convenient, faster and far more reliable than Audrey.

A presentation of a machine similar to Audrey, only much more compact: the IBM Shoebox. The Shoebox's speed is clearly visible. The machine could also perform simple addition and subtraction.

In the early 1960s, work on speech recognition machines was under way in Japan, the UK, the USA and even the USSR, where a very important algorithm was invented - dynamic time warping (DTW), which made it possible to build a system that knew about 200 words. But all the developments resembled one another, and the recognition principle was their common drawback: words were treated as whole sound fingerprints and then checked against a base of samples (a dictionary). Any change in the speed, timbre or clarity of pronunciation significantly affected recognition quality. Scientists faced a new task: to teach the machine to hear individual sounds - phonemes or syllables - and then assemble words from them. Such an approach would level out the effect of switching speakers, where the recognition rate varied sharply depending on who was talking.
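
To give a feel for what DTW does, here is a minimal sketch in Python. It assumes each word has already been turned into a sequence of per-frame feature values (the numbers and dictionary below are purely illustrative); real systems of that era compared frames of spectral features, not single numbers.

```python
# A minimal sketch of the dynamic time warping (DTW) idea: compare a spoken
# word to a stored template even if the two were pronounced at different speeds.
# Feature values and the "dictionary" are illustrative, not real acoustic data.

def dtw_distance(template, sample):
    """Smallest accumulated distance when optimally aligning two sequences."""
    n, m = len(template), len(sample)
    INF = float("inf")
    # cost[i][j] = best cost of aligning template[:i] with sample[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - sample[j - 1])   # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],        # stretch the template
                                 cost[i][j - 1],        # stretch the sample
                                 cost[i - 1][j - 1])    # advance both
    return cost[n][m]

# Recognition = pick the dictionary template closest to what was heard.
templates = {"one": [1, 3, 4, 4, 2], "two": [2, 2, 5, 3, 1]}
heard = [1, 3, 3, 4, 4, 4, 2]          # the same word, spoken more slowly
print(min(templates, key=lambda w: dtw_distance(templates[w], heard)))  # -> "one"
```

The point of the alignment step is exactly the weakness described above: DTW compensates for speaking speed, but the comparison is still against whole-word fingerprints, so the dictionary has to contain every word you ever want to recognize.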


In 1971, the Defense Advanced Research Projects Agency (DARPA) launched a five-year program with a budget of $15 million, tasked with creating a recognition system that knew at least 1,000 words. By 1976, Carnegie Mellon University had introduced Harpy, capable of handling a dictionary of 1,011 words. Harpy did not compare whole heard words with samples but broke them down into allophones (variants of a phoneme's sound depending on the surrounding letters). This was another success, confirming that the future lay in recognizing individual phonemes rather than whole words. However, among Harpy's drawbacks was an extremely low rate of correct allophone recognition - about 47%. With such an error rate, the share of mistakes grew along with the size of the dictionary.

A description of how Harpy works. No video of the program in action has survived.

Harpy's experience showed that building up dictionaries of whole-word sound fingerprints was futile - it only increased recognition time and drastically reduced accuracy - so researchers around the world took a different path: recognizing phonemes. By the mid-1980s, the IBM Tangora machine could learn to understand the speech of any speaker with any accent, dialect or pronunciation; it only required about 20 minutes of training, during which it accumulated a base of phoneme and allophone samples. The use of the hidden Markov model also increased IBM Tangora's vocabulary to an impressive 20,000 words - 20 times more than Harpy had, and already comparable to a teenager's vocabulary.

All speech recognition systems from the 1950s to the mid-1990s could not handle a person's natural, continuous speech - words had to be pronounced separately, with pauses between them. A truly revolutionary event was the introduction of the hidden Markov model, developed in the 1980s - a statistical model that makes precise guesses about unknown elements based on the known ones. Simply put, given just a few recognized phonemes in a word, the hidden Markov model very accurately picks the missing ones, greatly increasing the accuracy of speech recognition.
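
To make the "guessing unknown elements from known ones" idea concrete, here is a toy sketch of decoding with a hidden Markov model via the Viterbi algorithm. Everything in it - the three phonemes, the acoustic labels and all the probabilities - is invented purely for illustration, not taken from any real recognizer.

```python
# A toy hidden Markov model: given uncertain acoustic "observations", find the
# most likely sequence of hidden phonemes with the Viterbi algorithm.
# Phonemes, labels and probabilities are invented for illustration only.

states = ["k", "a", "t"]                       # hidden phonemes
start = {"k": 0.8, "a": 0.1, "t": 0.1}         # P(first phoneme)
trans = {                                      # P(next phoneme | current phoneme)
    "k": {"k": 0.1, "a": 0.8, "t": 0.1},
    "a": {"k": 0.1, "a": 0.2, "t": 0.7},
    "t": {"k": 0.3, "a": 0.3, "t": 0.4},
}
emit = {                                       # P(acoustic label | phoneme)
    "k": {"K?": 0.7, "A?": 0.2, "T?": 0.1},
    "a": {"K?": 0.1, "A?": 0.8, "T?": 0.1},
    "t": {"K?": 0.1, "A?": 0.2, "T?": 0.7},
}

def viterbi(observations):
    """Return the most probable hidden phoneme sequence for the observations."""
    # best[s] = (probability, path) of the best sequence ending in state s
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((p * trans[prev][s] * emit[s][obs], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

# The middle frame sounds like "t", yet the transition probabilities still
# make "k a t" the most likely hidden sequence.
print(viterbi(["K?", "T?", "T?"]))  # -> ['k', 'a', 't']
```

The same mechanism, scaled up to thousands of states and trained on real speech, is what let systems like Tangora fill in phonemes the microphone never captured clearly.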

In 1996, the first commercial program appeared that could distinguish not just individual words but a continuous stream of natural speech - IBM MedSpeak/Radiology. It was a specialized product used in medicine to transcribe the radiologist's dictated description of X-ray results during the examination. By then, computers had finally become powerful enough to recognize individual words "on the fly". In addition, the algorithms had improved, and correct recognition of the micro-pauses between spoken words appeared.

The first universal engine for recognizing natural speech was the Dragon NaturallySpeaking program, released in 1997. The speaker (i.e. the user) did not need to undergo training or stick to a specific vocabulary, as with MedSpeak; anyone, even a child, could work with NaturallySpeaking, and the program imposed no pronunciation rules.

Despite the uniqueness of Dragon NaturallySpeaking, IT reviewers showed little enthusiasm for natural speech recognition. Among the shortcomings, they noted recognition errors and incorrect handling of commands addressed to the program itself. Source: itWeek

Notably, the recognition engine had been ready back in the 1980s, but due to insufficient computing power, Dragon Systems' product (the company is now owned by Nuance Communications) could not detect the boundaries between words on the fly, which is essential for recognizing natural speech. Without this, the words "while being treated", for example, could be heard by the computer as "crippled".

Ahead lay the growing popularity of speech recognition systems, neural networks, the arrival of Google voice search on mobile devices and, finally, the Siri voice assistant, which not only converts speech to text but also responds adequately to queries phrased in any natural way.

How to hear what was said and guess what was inaudible?


Nowadays, the best tool for building a speech recognition engine is the recurrent neural network (RNN), on which all modern services for recognizing voice, music, images, faces, objects and text are built. An RNN makes it possible to understand words with high accuracy and also to predict the most likely word from the context if one could not be recognized.

Connectionist temporal classification (CTC) picks out individual phonemes in the recorded audio stream (a word or phrase) and arranges them in the order in which they were pronounced. After repeated analysis, CTC identifies particular phonemes quite reliably, and their textual representation is compared against the neural network's word base and turned into a recognized word.
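
A minimal sketch of the final step of CTC decoding may help: the network emits one label per audio frame, including a special "blank" symbol, and repeated labels and blanks are then collapsed into the output sequence. The frame labels below are invented for illustration; real systems run a beam search over the full per-frame probability matrix rather than this greedy collapse.

```python
# A minimal sketch of CTC's "collapse" rule: merge repeated per-frame labels
# and drop the blank symbol to obtain the final symbol sequence.

BLANK = "-"

def ctc_collapse(frame_labels):
    """Collapse per-frame CTC labels into the final symbol sequence."""
    result = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            result.append(label)
        previous = label
    return result

# Ten audio frames; the blank between the two "l" frames is what allows a
# genuinely doubled letter to survive the merge.
frames = ["h", "h", "-", "e", "e", "l", "-", "l", "o", "o"]
print(ctc_collapse(frames))  # -> ['h', 'e', 'l', 'l', 'o']
```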

Neural networks are so called because the principle of their operation resembles that of the human brain, and training a neural network is much like teaching a person. For example, for a very small child to learn to recognize cars and tell them apart from motorcycles, you need to draw his attention to various cars at least a few times and say the corresponding word each time: this big red one is a car, this low black one is also a car, but these over here are motorcycles. At some point the child will discover patterns and features common to different cars and will learn to correctly tell a car from a jeep, a motorcycle or an ATV, even if he only glimpses them on an advertising poster in the street. In the same way, a neural network needs to be trained on a base of examples - to "learn" from hundreds and thousands of pronunciation variants of each word, letter and phoneme.

A recurrent neural network is good for speech recognition because, after long training on a base of varied pronunciations, it learns to pick out phonemes and assemble words from them regardless of the quality and manner of pronunciation. It can even "fill in", with high accuracy and within the context of a phrase, words that could not be recognized unambiguously because of background noise or indistinct pronunciation.

But there is a nuance with RNN predictions: a recurrent neural network can "fill in" a missing word only by relying on the closest context of about five words. Beyond that window, no analysis is done. And sometimes that wider context is exactly what is needed! Suppose we dictated the phrase "The great Russian poet Alexander Sergeyevich Pushkin", in which the word "Pushkin" was said so indistinctly that the AI could not recognize it with certainty. A recurrent neural network, drawing on the experience gained during training, can suggest that the word "Pushkin" most often appears next to the words "Russian", "poet", "Alexander" and "Sergeyevich". For an RNN trained on Russian texts this is a fairly simple task, because a very specific context allows assumptions to be made with the highest accuracy.

And if the context is vague? Take another text in which one word cannot be recognized: "Our everything, Alexander Sergeyevich Pushkin, tragically died in the prime of his life after a duel with Dantes. The Pushkin Theater Festival is named after the poet." If you remove the word "Pushkin" from the second sentence, an RNN simply cannot guess it from the context of that sentence alone, because it only mentions a theater festival and the fact that it is named after some poet - there are far too many possible options!

This is where the long short-term memory (LSTM) architecture for recurrent neural networks, created in 1997, comes into play (there is a detailed article on LSTM). It was developed specifically to give RNNs the ability to take into account context far removed from the element being processed: the results of solving previous problems (that is, of recognizing earlier words) are carried through the entire recognition process, no matter how long the monologue, and are consulted in every case of doubt. Moreover, the distance to that context has almost no effect on the architecture's effectiveness. With LSTM, when necessary, the neural network draws on all the experience available within the task: in our example, the RNN will look at the previous sentence and find that Pushkin and Dantes were mentioned earlier, so the festival "named after the poet" most likely refers to one of them. Since there is no evidence that a Dantes Theater Festival exists, it must be the Pushkin one (all the more so since the sound fingerprint of the unrecognized word is very similar) - such a festival appeared in the texts the neural network was trained on.
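
For readers who want to see what "an LSTM carrying context" looks like in code, here is a minimal PyTorch sketch of a word-level model that scores candidates for the word following a context. The tiny vocabulary, layer sizes and names are illustrative assumptions, and the model is untrained, so its output here is random; the point is only that the hidden state flows across the whole sequence, however long it is.

```python
# A minimal PyTorch sketch of an LSTM language model: the hidden state carries
# context across an arbitrarily long word sequence, so the model can score
# candidate words for a gap far from the clue that disambiguates them.
# Vocabulary, sizes and names are illustrative; a real model is trained on
# large text corpora before its predictions mean anything.

import torch
import torch.nn as nn

vocab = ["<unk>", "pushkin", "dantes", "poet", "festival", "named", "after", "the"]
word_to_id = {w: i for i, w in enumerate(vocab)}

class WordLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # scores for the next word

    def forward(self, word_ids):
        x = self.embed(word_ids)            # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                 # hidden state at every position
        return self.out(h[:, -1, :])        # predict the word after the context

model = WordLSTM(len(vocab))
context = torch.tensor([[word_to_id[w] for w in
                         ["the", "festival", "named", "after", "the"]]])
scores = model(context)                     # untrained here, so scores are random
print(vocab[scores.argmax(dim=-1).item()])  # after training: likely "pushkin"
```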

"Confession of a voice assistant." When a well-trained neural network comes into play, a voice assistant can figure out exactly what needs to be done with “green slippers”

How does speech recognition make the world a better place?


In each case the application is different. It helps some people communicate with gadgets: according to PricewaterhouseCoopers, more than half of smartphone users give voice commands to their devices, and among adults (25-49 years old) the share of those who regularly use voice interfaces is even higher than among the young (18-25) - 65% versus 59%. In Russia, at least 71% of the population has talked to Siri, Google Assistant or Alice at least once. 45 million Russians regularly communicate with Yandex's Alice, and Yandex.Maps / Yandex.Navigator account for only 30% of those requests.

For some people, speech recognition genuinely helps at work - for example, as we noted above, for doctors: since 1996 (when IBM MedSpeak came out), recognition has been used in medicine to record patient histories and to dictate findings while examining images, so the physician can keep working without being distracted by notes in a computer or on a paper chart. Incidentally, work on medical dictation is not limited to the West: in Russia there is the Voice2Med program from the "Center for Speech Technologies".

There are other examples, including our own. Toshiba's business is built around full inclusion, that is, equal rights and opportunities for people with different health conditions, including employees with hearing impairments. We have a corporate program called Universal Design Advisor System, in which people with various types of disabilities take part in the development of Toshiba products and make suggestions to improve their usability for people with disabilities - that is, we do not guess at what would work better, but rely on real experience and employee feedback.

A few years ago, at Toshiba headquarters in Japan, we faced a very interesting task that required developing a new speech recognition system. In the course of running the Universal Design Advisor System we gained an important insight: employees with hearing impairments want to take part in discussions at meetings and lectures in real time, not be limited to reading a processed transcript hours or days later. Running voice recognition on a smartphone in such cases gives very weak results, so Toshiba's specialists had to start developing a specialized recognition system. And, of course, we immediately ran into problems.

Conversation differs enormously from written speech - we do not speak the way we write, and a real conversation converted to text looks very sloppy, even unreadable. That is, even if we transcribe the morning planning meeting with high accuracy, we get an incoherent jumble teeming with filler words, interjections and thoughtful "aaa", "uh" and "mmm". To keep unnecessary sounds, words and expressions of emotion out of the transcript, we decided to develop an AI capable of recognizing, as accurately as possible, the not-always-necessary elements of colloquial speech, including the emotional coloring of certain words (for example, "yes, well" may sound like skepticism or like sincere surprise, and these are literally opposite meanings).


This is what a laptop with a set of peripherals for voice recognition using Toshiba AI looks like (left), together with the application showing the results on end devices (right). Source: Toshiba

This is where LSTM came in handy: without it, recognition accuracy was insufficient for the resulting text to be read and understood without effort. Moreover, LSTM proved useful not only for more accurate prediction of words from context but also for correct handling of mid-sentence pauses and filler interjections - for this we trained the neural network on the fillers and pauses that are natural in colloquial speech.

Does this mean the neural network can now strip interjections out of transcripts? Yes, it can, but that is not what is needed. The fact is (another insight we received) that people with hearing impairments also rely on the movements of the speaker's lips. If the lips move but no corresponding text appears on the screen, it feels as though the recognition system has missed part of the conversation. So for someone who cannot hear, it is important to get as much information about the conversation as possible, including those ill-fated pauses and interjections. That is why the Toshiba engine keeps these elements in the transcript but, in real time, dims the brightness of their letters, making it clear that these details are optional for understanding the text.
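
As a rough illustration of that rendering idea, here is a small sketch in which tokens flagged as fillers are kept in the transcript but printed dimmed. The filler list and the ANSI-escape "dimming" are illustrative assumptions for a console demo; Toshiba's actual client renders this in its own UI.

```python
# A sketch of the idea described above: filler words are kept in the
# transcript but dimmed, so lip movements always have matching text on screen.
# The filler list and ANSI "dim" rendering are illustrative only.

FILLERS = {"uh", "um", "mmm", "aaa", "well"}

def render_transcript(tokens):
    """Return a console string in which filler tokens are printed dimmed."""
    DIM, RESET = "\033[2m", "\033[0m"
    parts = []
    for word in tokens:
        if word.lower().strip(",.") in FILLERS:
            parts.append(f"{DIM}{word}{RESET}")   # keep it, but dim it
        else:
            parts.append(word)
    return " ".join(parts)

print(render_transcript(["So,", "uh", "the", "quarterly", "report", "is", "mmm", "ready"]))
```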

This is how the on-the-fly recognition result looks on the client device. The parts of the monologue that carry no meaning are shown in gray.

Toshiba's AI now works with English, Japanese and Chinese speech, and can even translate between languages on the fly. It does not have to be used only for live transcription: the AI can be adapted to work with voice assistants, which will finally learn to handle interjections, pauses and stutters adequately when a person speaks a command. In March 2019, the system was successfully used to add subtitles to the broadcast of the IPSJ National Convention in Japan. In the near future: turning the Toshiba AI into a public service and experiments with bringing voice recognition into manufacturing.
