Difficulties in raising a voice assistant: a linguist's and a developer's perspective

Working with a voice assistant is often compared to raising a child. It constantly learns something, repeating after its "elders". It gradually masters the language and the ability to communicate. Sometimes it understands everything too literally or simply blurts out something awkward. That is because language processing is a complex, lengthy process that requires the attention of more than one specialist. We asked our colleagues, linguist-developer Ivan and lead engineer Bassel, to share interesting cases from their experience with the Sky Voice Assistant. We put the same questions to both specialists to find out why mathematics alone cannot win when it comes to processing language, how voice assistants learn to joke, and why that matters.

What are you responsible for? What does your area of responsibility include?


Linguist

I am responsible for everything related to the linguistic side of the voice assistant's work: analyzing users' questions, planning the logic of the answer, and finding or writing text for it. In addition, I developed several services that were strongly tied to text (including weather, reminders, news, toasts, and word games) and collected content for training. That includes, for example, recording different voices for activating the smart speaker.

Developer

I am responsible for the brain of our chatbot. I write its logic: how it receives questions, how it answers, where it gets data from, and which services run inside it. This includes a communication service and a knowledge base, so that it can answer all sorts of questions: it can tell you the weather or the dollar exchange rate, order you a taxi, set an alarm, and so on.

Do you think working with a voice assistant is like raising a child?


Linguist

There was a very good article on Habr about children and machine learning, and in general it is a popular analogy.

But the problem is that AI has no understanding of context beyond what is included in the training set; even the most basic knowledge of the world outside a specific task, and such inherently human methods of judgment as taste and common sense, are missing. Because of this, the results are often unpredictable.

Developer

We cannot say that a voice assistant is a child, because a child has the ability to analyze and to learn on its own. A voice assistant is a rather dumb thing: you want it to do something, you set it the task, and it does exactly that, nothing more.
We cannot even consider a neural network a child: by itself, it cannot learn. We must always show it the way. Artificial intelligence comes into play only when the network can find situations similar to the ones you taught it. I do not think this is really intelligence, just great capabilities.

What funny cases come up while working with it?


Linguist

I will answer for both of us. Once we were selecting words for the Alias game, which is based on finding similar words ("associations") using a word2vec model. We chose very carefully; it was impossible to imagine that the associations for the word "navel" would be choice sexual expressions far beyond the bounds of censorship. The word seems so childish and is used in completely different contexts.
Apparently, there is something we do not know either about our assistant or about the texts collected for that use case.
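The association search that the Alias game relied on can be sketched in miniature: word2vec represents each word as a vector, and "associations" are simply the nearest neighbors by cosine similarity. The vocabulary and vectors below are invented for illustration; a real model learns them from a large corpus, which is exactly why unvetted corpus content can surface as an "association".

```python
import math

# Toy illustration of how word2vec-style "associations" are found:
# every word is a vector, and the nearest vectors by cosine similarity
# become the association candidates. These vectors are made up for the
# sketch; a real word2vec model learns them from a corpus.
EMBEDDINGS = {
    "sun":    [0.9, 0.8, 0.1],
    "summer": [0.8, 0.9, 0.2],
    "beach":  [0.7, 0.7, 0.3],
    "winter": [-0.8, 0.9, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    target = EMBEDDINGS[word]
    scores = [(other, cosine(target, vec))
              for other, vec in EMBEDDINGS.items() if other != word]
    return sorted(scores, key=lambda p: p[1], reverse=True)[:topn]

print(most_similar("sun"))  # "summer" ranks first
```

The model has no notion of which neighbors are appropriate for a children's game; filtering is up to whoever curates the training data and the output.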

One more thing. Once we decided to add analogues from different languages to the list of Russian greetings and farewells, from the usual "bonjour" to Arabic and Hebrew expressions. The new words were indexed by our search algorithm for similar expressions, but nothing was even remotely close to them! As a result, the smart speaker replied to any incomprehensible or somehow distorted request with "As-salamu alaikum wa-rahmatu-Llah". When you hear that in a clipped machine voice in response to an ordinary "where is the USA?", it is confusing.

What about homonymy, when words sound the same but are two completely different words, for example a verb and a noun?


Linguist

Yes, it is a pain for everyone involved in language processing. It happens even with whole sentences; some examples have long since been carved in stone. Take the classic "Он видел их семью своими глазами": since "семью" can be the accusative of "family" or the instrumental of "seven", it means either "he saw their family with his own eyes" or "he saw them with his own seven eyes".

A simpler example: "Эти типы стали есть в цеху." Either several types of the material "steel" ("стали") exist in the workshop, or some shady characters ("типы") began ("стали") to eat ("есть") there. So homonymy is a very big problem, not only at the level of words but also at the level of whole sentences. There is also the problem of coinciding word forms: say, the nominative and accusative cases of one word sound the same. Therefore even such a seemingly simple task as determining the form of a word requires complex analysis packages, and those packages never give a definite answer. They can only give the probability of one form or another.
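The point that analyzers return probabilities rather than a single answer can be illustrated with a toy morphological lookup. The words, tags, and scores below are invented for the sketch; real packages (pymorphy2 for Russian, for instance) behave similarly, returning every candidate parse with a score.

```python
# Toy morphological analyzer: an ambiguous word form maps to several
# candidate analyses, each with a probability estimated from a tagged
# corpus. Words, tags, and scores are invented for this sketch.
ANALYSES = {
    "stali": [  # Russian "стали": genitive of "steel" or past tense of "become"
        {"lemma": "stal",  "pos": "NOUN", "case": "gen",   "score": 0.55},
        {"lemma": "stat'", "pos": "VERB", "tense": "past", "score": 0.45},
    ],
}

def parse(word_form):
    """Return candidate analyses sorted by corpus probability.

    The caller gets a ranked list, never one definite answer:
    disambiguation must come from context, not from the form alone.
    """
    candidates = ANALYSES.get(word_form, [])
    return sorted(candidates, key=lambda a: a["score"], reverse=True)

best = parse("stali")[0]
print(best["lemma"], best["score"])  # the noun reading narrowly wins
```

Picking the top-scored parse blindly is exactly the kind of "patch" the linguist warns about below: the 0.45 reading is correct in nearly half the contexts.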

How do you solve such problems? Share some life hacks.


Linguist

There are no special tricks, really. We very carefully select the data the model is trained on, and we test everything thoroughly.

As for homonymy: if we tried right now to retrain the model so that it determines the correct form of a specific word, that would just be patching an imperfect solution. There are, of course, linguistic methods for really working with homonymy, but they are not used always and everywhere, and they are still being developed. For Russian the situation is much worse than for English, because we have significantly more word forms.

Developer

We review the dialogue and the recognition logic and see where the voice assistant did not understand something well. Sometimes you need to add a new dialogue. There can be situations when it answered a question it did not actually know the answer to. The development history helps.

Is it true that Alice in Russia works better than her predecessors? Why?


Linguist

That is quite a subjective assessment: Siri also works very well.

However, Alice is currently the most competitive voice assistant, because Yandex has a huge number of resources and services with which to expand her potential. In addition, they already allow third-party services: any developer or team can add functions of their own. This makes her capabilities truly broad.

On the one hand, it is a matter of Yandex's resources and experience: they have been processing language for a very long time and have developed many tools of their own for data extraction, parsing, and word-form analysis. Many good linguists have joined them.

On the other hand, Alice competently combines classical and neural-network algorithms so that they complement each other. That is why she can both understand clear requests and maintain a conversation about anything.

But do not forget that this, although a very good one, is still an imitation of conversation.

Developer

Of course. Google's main logic is built around the English language, and we are in Russia. At Yandex, the people working on the voice assistant are native Russian speakers. It seems to me that Alice is better now and will remain better, because Russians are working on the logic.
The question here is not the algorithm or the engineering; it is the context, the logic, and in general the soul of the product. Alice seems more natural.

Why can't mathematics alone win? How do language skills help in working on a voice assistant?


Linguist

Programmers, like philosophers, probably have an understandable but sometimes dangerous illusion that they can master any other field with their own conceptual toolkit: it is enough for them to read the documentation for some language-processing module, and they will know how to work with it. Unfortunately, this is not entirely true, because language is too complex a system. Even linguists themselves still poorly understand how it works.

If you delve into the research, it becomes clear that the cognitive side of language (the way it works in the head, how thoughts are transformed into speech) is very difficult to separate from all its other levels. To create truly smart processing systems, we will need to somehow learn to formalize that side as well.

We often had to draw on purely linguistic research. For example, we worked on a time-processing module, the one involved when a person says: "Remind me to do this tomorrow at midnight." Difficulties arose in processing the word "midnight": is "tomorrow at midnight" tomorrow at 0 o'clock or tomorrow at 24 o'clock? It is impossible to answer that without resorting to the methods of linguistics or philology; otherwise you could only read coffee grounds. The study consisted in my going through all the uses of the word "midnight" with different time references (today/tomorrow) in the Russian National Corpus and looking at what people actually meant. The margin was 60% against 40% in favor of "today at midnight" meaning tomorrow at 0 o'clock.
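The majority reading from that corpus study ("at midnight" usually means the upcoming 0:00, so "today at midnight" falls on tomorrow's date) can be encoded directly. The function below is a hypothetical sketch of such a resolver under that assumption, not the module's actual code:

```python
from datetime import datetime, timedelta

def resolve_midnight(day_word, now):
    """Resolve "<today|tomorrow> at midnight" to a concrete datetime.

    Follows the majority reading from the corpus study: "today at
    midnight" means the *upcoming* 0:00, i.e. the start of tomorrow;
    by the same logic "tomorrow at midnight" is the 0:00 after that.
    """
    start_of_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if day_word == "today":
        return start_of_today + timedelta(days=1)   # upcoming 0:00
    if day_word == "tomorrow":
        return start_of_today + timedelta(days=2)
    raise ValueError(f"unsupported day word: {day_word}")

now = datetime(2024, 5, 17, 21, 30)
print(resolve_midnight("today", now))  # start of May 18
```

A production system would presumably keep the 40% minority reading in mind too, e.g. by asking a clarifying question when the stakes are high.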

It is impossible, just by looking at some use cases and without knowing how the language works, to formulate a rule and some final list of ways to say something. Anything can be expressed in an infinite number of sentences, and trying to capture all of this with finite algorithms is very difficult. Systems that do not use linguistic analysis will never give 100% accuracy.

Developer

A linguist helps a lot. He can find a large number of variants of how people ask about something. Besides, a machine in operation is a dangerous thing: we cannot accept just any request. The linguist helps us determine what the questions will be and in what form, and helps arrange the correct answers. He also analyzes the text and removes topics that are not worth talking about: politics, racist remarks, and so on.
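The filtering step the linguist performs can end up in the pipeline as something as simple as a keyword blacklist. The categories and words below are illustrative placeholders; real systems use curated lists and classifiers rather than this minimal sketch:

```python
# Minimal sketch of a topic blacklist used to drop unsafe candidate
# replies before they reach the user. The categories and keywords are
# illustrative placeholders, not a real production list.
BLOCKED_TOPICS = {
    "politics": {"election", "parliament"},
    "insults":  {"stupid", "idiot"},
}

def is_safe(reply):
    """Return True if the reply mentions no blacklisted keyword."""
    words = set(reply.lower().split())
    return all(words.isdisjoint(keywords)
               for keywords in BLOCKED_TOPICS.values())

candidates = ["the weather is sunny today", "who won the election"]
print([c for c in candidates if is_safe(c)])  # only the weather reply survives
```

Keyword matching is crude (it misses paraphrases and flags innocent homonyms), which is why a linguist curates both the lists and the borderline cases.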

Are linguists and developers enough to create a voice assistant, or are other specialists needed? Which ones?




Linguist

Of course, language processing is an interdisciplinary problem. It has always been necessary to bring in specialists in psychology and psycholinguistics, who determine how a person understands language. At a deeper level, cognitive research is also needed now, because only now do we have technologies that let us track how the human brain works when processing errors in syntax, such as wrong word order, and errors in semantics, as when something unexpected and completely inappropriate in meaning is said. The results of these studies cast doubt on everything that was previously considered settled in linguistics, because it turns out that such errors are handled in a very similar way whether the input is language, that is, spoken information, or videos and comics, or even music and any sound sequences. In other words, the mechanism for detecting errors in structure and meaning is universal across all the information a person perceives. This suggests that syntax and semantics should be analyzed not within the framework of language alone, but within the framework of information perception in general.

Developer

Turing said: "A computer would deserve to be called intelligent if it could deceive a human into believing that it was human." In other words, a computer can be called smart only if you cannot tell that it is a machine and not a person.

This is where psychologists will help in the future. We do not depend on words alone; emotions, how a person understands them, also matter. A person has five senses, and at least two are used during a conversation, while the voice assistant has only one source, its "ears".
A psychologist can work with the developers who analyze audio signals and help us determine emotion from the voice, to understand whether the person is angry or in a good mood, and, depending on that, determine when the voice assistant should joke and when it should be serious. As programmers, we cannot control this on our own. If we tell the machine to "joke", it will do so in any situation, however inappropriate. Say we teach it to answer the question "What should I do?" with "Take off your pants and run." If before asking, the user said that his dad died or that he broke up with his girlfriend and is not in the mood, the machine will not take any of that into account and will crack the joke anyway.

Since we are talking about jokes: how do you develop a sense of humor in a voice assistant?


Linguist

A sense of humor is an inherently human phenomenon that helps us adapt to change, endure difficulties, strengthen social bonds, and much more. In that exact form, I think, AI hardly needs it. Research in this area is ongoing, but it is about understanding and simulating humor. We must somehow explain to the machine that the leather bags sometimes do things incomprehensible to it, that is, joke, and that they expect jokes in return.

Understanding is very complicated, so I will answer about imitation. There are two approaches:

  1. use jokes created by people, either specially written or mined by the system itself from a corpus of texts;
  2. try to understand what makes people laugh (hidden, parallel, and unexpected semantic connections, combinations of words from different semantic fields, inversions of situations and meanings) and implement that.

There are already technical solutions: puns, for example, are generated simply on the basis of shared letter sequences. The problem is always to evaluate the result objectively and to somehow get past the stage where only 5-10% of the examples are actually funny.

As a rule, AI jokes are either not jokes or not funny, and serious research is needed to change that.
The easiest and most reliable way to add humor to a voice assistant is simply to write scripted jokes or, at most, joke templates from which jokes can then be generated for various occasions. I am sure that this is often how it works in Yandex's Alice. Many have noticed that Alice knows the songs and jokes from The Witcher series: you can ask her something like "How do you pay the Witcher?" and she will joke back. These things are most likely written in by hand.
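The "scripted jokes plus templates" approach can be sketched as a trigger-to-response table with a generic fallback pool. The triggers and punchlines below are invented examples for illustration, not Alice's actual data:

```python
import random

# Hand-written trigger -> punchline pairs: the simplest and most
# reliable way to give an assistant "humor". All entries here are
# invented examples for this sketch.
SCRIPTED_JOKES = {
    "how to pay the witcher": "Toss a coin to your Witcher!",
    "tell me a joke": None,  # None means: pick from the generic pool
}

GENERIC_POOL = [
    "I would tell you a UDP joke, but you might not get it.",
    "I am not lazy, I am in power-saving mode.",
]

def joke_for(utterance, rng=random):
    """Return a scripted punchline for a known trigger, a random generic
    joke for an explicit joke request, or None (no joke) otherwise."""
    key = utterance.lower().strip("?!. ")
    if key not in SCRIPTED_JOKES:
        return None
    return SCRIPTED_JOKES[key] or rng.choice(GENERIC_POOL)

print(joke_for("How to pay the Witcher?"))  # scripted punchline
```

Returning None for unknown utterances is the important design choice: a scripted system stays silent rather than attempting humor it was never taught.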

Developer

The linguist is responsible for collecting answers that can be funny. He searches for them in the language corpus, and then they end up in the voice assistant's database. When we ask it to joke, it finds the right one in the database and delivers it. It can also joke spontaneously if it sees situations similar to those it was trained on. It all depends on the context.

Why do you think people want a voice assistant to joke?


Linguist

It seems to me there are many reasons why people so want to see humor in it. A sense of humor is a purely human quality, part of what makes us human. Wanting to find humanity in a chatbot, people look for a sense of humor in it. You can see this in all the examples of artificial intelligence in culture: any truly smart robot in a movie jokes.

Which voice assistant do you think is the most adult?


Linguist

If "adult" means old, it is hard to say. Voice control is almost as ancient as speech synthesis, which was invented, oddly enough, in the 18th century. People have been working on voice control since the beginning of the 20th century; the first working solutions appeared in the 1960s and have been developing ever since. Smart voice assistants were created at IBM in the 1990s and reached smartphones in 2011.

If "adult" means boring but reliable, then Siri. It seems the Russian-language answer texts were recently updated, and it gives the most correct, reputation-safe answers. Convenient for a large company, but not playful either: there is no way to chat and have plausible dialogues as with Alice. But Siri does not have that goal, because it is a voice assistant built into a smartphone (or other hardware), and its function is primarily the utilitarian one of controlling everything. I remember that at first the answers were more interesting and provocative than they are now, but apparently they decided that people had played enough with the voice assistant and it was time for it to become serious and just do its job.

Alice lives either in an application or in a separate product, the smart speaker. In both cases it is important to engage the person so that they want to buy the speaker or open the application. Purely dry voice control would seem boring.

Developer

There are no adults. All voice assistants used to have little knowledge, and now they have more. They did not learn by themselves. I remember how poorly Alice worked some 3-4 years ago, but she got better every day. Developers monitored specific situations and corrected errors, added new cases and scripts. Users helped them by pointing out nuances. And Yandex has great resources: a search engine, servers, and everything needed to store data.

Still, there is an opinion that Siri is the most adult, because it is informative but has fewer jokes, games, and so on. Do you agree?



Developer

Yes. Because they ship only what is reliable. That is better than answering 100 questions and getting 40 of them wrong. They are very careful in their design. They want the assistant always to say something correct and not be silly the way Alice used to be.

To summarize


Not everyone supports the analogy between machine learning and raising children.
Language is endless: a native speaker can express the same thought in an infinite number of utterances. Without methods of linguistic analysis, you will not get 100% accuracy.

Knowledge from other fields also helps in machine learning. Cognitive and psycholinguistic research helps us understand how the brain processes information, in particular how a person understands language, so that this knowledge can be transferred to machine learning. And psychologists will come to the rescue in resolving ethical issues.
AI jokes are usually either not jokes or not funny, but people need jokes! So research in this area continues.

The most powerful and competitive voice assistant in Russia is Alice; a conversation with her is close to a conversation with a person. And the most adult (by which we mean an emphasis not on play but on reliability and accuracy in processing requests) is Siri.
