The future is here: how voice robots work and what they can do


Robotization of routine operations, where simple but labor-intensive tasks are handed over to robots instead of people, is a very active trend. Many things are being automated, including telephone conversations with customers. Neuro.net develops technologies that expand what such voice robots can do.

In this article, the developers talk about the technologies behind recognizing an interlocutor's gender by voice and about the key elements of building a dialogue.

First a case, and then a breakdown of technology



One of the most interesting cases is replacing the call center employees of a partner company with a voice robot. Its capabilities were used not for routine situations, such as clarifying a delivery address, but to find out why some customers had become less likely to visit the company's website.

The technology was based on a full-fledged neural network rather than individual scripts. It was the neural network that allowed us to handle the situations that usually confuse robots. First of all, these are answers like "well, I don't know yet, maybe yes, although no" or even "yes, no." Phrasings that are perfectly ordinary for humans become an insurmountable obstacle for a robot.


During training, the robot began to understand what meaning is carried by a particular phrase and what the answer should be. The robot had several voices, both male and female. The main task was to "humanize" the robot so that the human interlocutor did not start probing the capabilities of the machine but conducted a dialogue according to the target scenario.

Below is an example of what happened.


The robot listens to the interlocutor and gives an answer depending on the meaning of what the client said. The total number of script branches that can be used in a conversation is more than a thousand.
The main goal of this robot was to understand the reason for the decrease in a client's activity on the site and to make each of them a relevant offer. This was one of the company's first attempts to automate the work of call centers.

Newer robots are even more capable. Here are some more examples of how robots communicate with humans: the first, second, and third examples.

Now about technology


There are three key technological features that allow the robot to work:

  • recognition of the interlocutor's gender by voice,
  • recognition of the interlocutor's age,
  • building a dialogue with a human interlocutor.


Recognizing the interlocutor's gender by voice


Why is this needed? Initially, this function was created to conduct surveys using robots. Previously, surveys were carried out by people who filled in a number of fields, for example, the gender of the interlocutor. A person does not need to ask whether they are speaking with a man or a woman to determine this parameter; in 99% of cases it is obvious. Robots are another matter: to teach them to recognize voices more or less accurately, we had to carry out large-scale work. And it was not in vain, because the technology is now used to personalize offers and voice prompts depending on gender.

An important point: a female voice is universal and works for the widest range of products, and it is especially important for products aimed at women. According to various studies, a female voice is perceived positively by any audience, so the conversion rate is higher. The exception is "male" products, for which a male voice is preferable.

How does it work? First, primary data processing is performed: the voice recordings are split into fragments lasting 20 ms. All collected fragments are pre-processed by the VAD (Voice Activity Detection) component. This is needed to separate the wheat from the chaff, that is, speech from noise. All garbage is removed, which increases the accuracy of the models.
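As an illustration, here is a minimal sketch of an energy-based VAD over 20 ms frames. The frame length matches the text above, but the energy threshold and the approach itself are assumptions for illustration, not the production component.

```python
import numpy as np

def split_speech_frames(signal: np.ndarray, sample_rate: int,
                        frame_ms: int = 20, energy_threshold: float = 1e-4):
    """Return only the frames whose energy exceeds a noise threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    speech_frames = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame.astype(np.float64) ** 2)  # average power of the frame
        if energy > energy_threshold:                     # keep speech, drop silence/noise
            speech_frames.append(frame)
    return speech_frames
```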

Recognition works in the space of cepstral coefficients together with their first and second differences. The basis is the GMM method, Gaussian Mixture Models.

On each 10-20 ms interval, the current power spectrum is calculated, after which the inverse Fourier transform of the logarithm of the spectrum is applied and the required cepstral coefficients are extracted.
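A sketch of this feature extraction step, assuming mel-frequency cepstral coefficients (MFCC) plus their first and second differences computed with librosa; the window and hop sizes are illustrative choices matching the 10-20 ms intervals mentioned above.

```python
import librosa
import numpy as np

def extract_features(path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    # ~25 ms analysis windows with a ~10 ms hop; librosa performs the
    # log-spectrum -> inverse transform chain internally when computing MFCCs
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    delta1 = librosa.feature.delta(mfcc)            # first difference
    delta2 = librosa.feature.delta(mfcc, order=2)   # second difference
    # one feature vector per frame: shape (frames, 3 * n_mfcc)
    return np.vstack([mfcc, delta1, delta2]).T
```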

Our GMM models are trained separately for male and female voices, and separate models are also used to distinguish adult and children's voices. Of course, the system cannot be trained from nothing: it needs labeled voice recordings.
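A minimal sketch of this per-class GMM setup: one mixture per label is fitted on labeled recordings, and an utterance is assigned to the class with the highest log-likelihood. The label names and the number of components are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_by_class: dict, n_components: int = 16) -> dict:
    """features_by_class maps a label ('male', 'female', 'child') to a
    (frames, n_features) array collected from labeled recordings."""
    return {label: GaussianMixture(n_components=n_components).fit(X)
            for label, X in features_by_class.items()}

def classify(gmms: dict, features: np.ndarray) -> str:
    # score() returns the average per-frame log-likelihood of the utterance
    scores = {label: gmm.score(features) for label, gmm in gmms.items()}
    return max(scores, key=scores.get)
```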

To increase the efficiency of the system, coefficients from timbre voice models are also used:

  • Timbral sharpness.
  • Timbral warmth.
  • Timbral brightness.
  • Timbral depth.
  • Timbral hardness.
  • Timbral growth.
  • Timbral unevenness.
  • Timbral reverb.

Timbre models are needed to correctly identify children's voices: all the other models classify a child's voice as female. They also help distinguish low, coarse female voices (for example, an elderly woman who smokes) from high male voices, and so on. By the way, if a person said "hello" and then coughed, the earlier models without timbre filters would classify the voice as male.



The main part of the system is the data classification module based on a multilayer perceptron (MLP). It receives the outputs of the male and female voice models together with the data from the timbral models. At the input of the module is an array of these classified values, and at the output is the gender determination result.
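A sketch of such a final stage, assuming scikit-learn's MLPClassifier: the GMM log-likelihoods and timbre coefficients are concatenated into one vector and fed to the perceptron. Layer sizes and the fixed feature order are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def build_classifier() -> MLPClassifier:
    return MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)

def make_input_vector(gmm_scores: dict, timbre_coeffs: np.ndarray) -> np.ndarray:
    # fixed key order so the MLP always sees features in the same positions
    ordered = [gmm_scores[k] for k in ("male", "female", "child")]
    return np.concatenate([ordered, timbre_coeffs])

# usage: clf = build_classifier(); clf.fit(X_train, y_train)
#        clf.predict(make_input_vector(scores, timbre).reshape(1, -1))
```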

The technology described here works both in online mode (from the client's first phrase) and in offline classification mode (after the conversation). Gender recognition accuracy is around 95%. An important point is that the delay in online mode does not exceed 120-150 ms, which is essential for humanizing the robot. Usually, pauses in communication between a robot and a person are measured not in milliseconds but in seconds, which sounds strange to a human interlocutor and immediately gives away that a digital system is on the line.

The plans include adding work with text, more precisely with word endings. In Russian, past-tense verbs have gendered endings, so if the interlocutor says "I could" with the feminine ending, it is definitely a woman. In the near future, this technique will be finalized and added to the recognition system.
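A purely hypothetical sketch of such a text cue, since the final implementation is not described in the article: a rough check for Russian feminine past-tense verb endings. The regular expression and the heuristic itself are illustrative and would produce false positives in practice.

```python
import re

# feminine past-tense verbs in Russian end in "-ла" ("могла", "хотела")
FEMININE_ENDING = re.compile(r"\b\w+ла\b")

def text_suggests_female(utterance: str) -> bool:
    """Rough, illustrative heuristic based on a verb ending; not a real classifier."""
    return bool(FEMININE_ENDING.search(utterance.lower()))
```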

Determining the age of the interlocutor


Why is this needed? First of all, so as not to offer certain products and services to minors. In addition, identifying age is useful for personalizing offers by age category.

How does it work? The same technologies are used as in the previous case. The accuracy of the system is about 90%.


Constructing Dialogs


And now we come to the most interesting part: the principle of constructing dialogs.

Why is this needed? To competently replace a person, a robot must be able to work with both linear and non-linear dialogue scenarios. The first case could be a questionnaire; the second, working with call center subscribers, company technical support lines, and so on.

How does it work? We use the NLU Engine, whose basis is semantic analysis of the text received from ASR systems. From this text it extracts recognition objects, entities and intents (intentions), which are then used in the logic of constructing the conversational flow.

Here is an example of how the technology works.

Text received from a speech recognition system (ASR):
“In general, I’m interested in your proposal, but I would like it cheaper. And now I'm a little busy, you could call me back at six o’clock tomorrow.”

Objects filled in by the NLU Engine:

Intents:
confirmation = true
objection = expensive
question = null
callback = true
wrong_time = true

Entities:
date = 01/02/2019 (suppose the call date is 01/01/2019)
time = 18:00
amount = 6

The principle of filling the objects in this example:

Intents (intentions):

  • The text “I am interested in your proposal” has been translated into intent “confirmation” with a value of “true”.
  • The text “I would like it cheaper” was translated into intent “objection” with the value “expensive”.
  • The text “I'm a little busy right now” has been translated into intent “wrong_time” with a value of “true”.
  • The text “you could call me back at six o’clock tomorrow” has been translated into intent “callback” with a value of “true”.
  • Since the interlocutor asked no question, the intent “question” was left null.

Entities:

  • The word “tomorrow” was translated into entity “date” with the value “01/02/2019”, computed as current_date + 1 (assuming the call date is 01/01/2019).
  • The phrase “at six o’clock” was translated into entity “time” with the value “18:00”.
  • The word “six” was also translated into entity “amount” with the value “6”, since the same word can fill several entities.

Each intent and entity in the list is assigned a value, and these values are then used to build the conversational flow.
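A hypothetical sketch of how the filled intents and entities could drive the next step of the flow. The intent and entity names mirror the example above; the branching rules themselves are illustrative, not the actual logic of the NLU Engine.

```python
def next_action(intents: dict, entities: dict) -> str:
    """Pick the next dialogue step from the filled intents and entities."""
    if intents.get("wrong_time") and intents.get("callback"):
        # the client asked to be called back: schedule the callback first
        return f"schedule_callback at {entities.get('date')} {entities.get('time')}"
    if intents.get("objection") == "expensive":
        return "play_discount_offer"
    if intents.get("confirmation"):
        return "play_confirmation_summary"
    return "play_clarifying_question"

# For the example above this returns: "schedule_callback at 01/02/2019 18:00"
```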

Now let's talk about the work algorithms that are supported by the NLU Engine system. It includes two levels.

The first level works on a relatively small data sample of about 600-1000 records. ML algorithms are used here. Recognition accuracy: 90-95%.

The transition to the second level happens after the project has launched and accumulated a large data sample of more than 1 million records. DL algorithms are used here. Recognition accuracy: 95-98%.
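The article does not disclose which algorithms the first level uses, so here is one reasonable assumption: a TF-IDF plus logistic regression pipeline of the kind that works well on a few hundred labeled phrases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_intent_classifier(texts, intents):
    """texts: ~600-1000 transcribed phrases; intents: their intent labels."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(texts, intents)
    return model

# usage: clf = train_intent_classifier(phrases, labels)
#        clf.predict(["I would like it cheaper"])
```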

The solution works with two subsystems:

  • subsystem of categorization and classification of text data,
  • dialogue formation subsystem.

Both subsystems work in parallel. The input of the categorization and classification subsystem is the subscriber's text recognized from the voice phrase; its output is the filled Entity and Value parameters used to form the answer.

The dialogue formation subsystem for constructing non-linear scenarios is built on a neural network. Its input is the subscriber's recognized text, and its output is the decision about what the robot should say next.

A non-linear scenario suits the first support line: the robot does not know in advance who is calling, about which product, or with what question. Here, how the dialogue unfolds depends on the client's responses.

For outgoing calls, the best solution is a linear scenario; an example was given at the very beginning of the article. Another variant of the linear scenario is a survey, where it does not matter exactly what the client answers (the answers will be analyzed later by specialists), but it is important to guide the client through all the questions on the list.

In conclusion, I want to emphasize that voice robots will not replace people. What they already do an excellent job of is routine work: calling people to ask a set of questions and listening to, recording, and analyzing the answers. Call center and technical support operators are thus relieved of the same repetitive procedures and can focus on genuinely interesting questions and challenges.
