How we taught artificial intelligence to answer support questions: the Yandex.Taxi experience

No service is perfect, so sometimes users have questions for technical support. In such cases it is hard to say which is more unpleasant: clicking through combinations of templated bot replies that cannot solve your problem, or waiting for a reply from a specialist who will get in touch with you only half a day later.

At Yandex.Taxi, out of the two options we chose a third: using machine intelligence to build technical support with a human face. My name is Tatyana Savelyeva; my group works on machine learning over unstructured data. Below, I share user insights and explain how to automate a complex process, organize the work of very different teams and, of course, apply deep learning and technical hacks in practice (where would we be without them).



Why automate anything at all?


It would seem: why invent a multi-stage support structure when you can just hire more people? That might work if support received about 10 requests a day. But when the number of user requests approaches a million (a small percentage of Yandex.Taxi trips, yet quite impressive in absolute terms), you have to think about a more reliable tactic: finding and training enough operators to cope with atypical problems at such volumes is difficult, to put it mildly.

Some time ago the industry decided to solve this problem with several levels of support. At the first level, the simplest, most predictable questions are filtered out; if no ready-made answer fits, the problem is classified and passed to a more qualified expert. Elegant, but there is a nuance.

The number of requests grows, and processing them takes more and more time. Operator throughput, the human factor: there are plenty of things that slow down a system where every minute counts. Many of these limitations can be lifted with the help of a machine: it does not get tired and start making mistakes, and it makes decisions faster.

About a year ago we started using machine learning to immediately suggest possible interaction scenarios to the operator. Customers now get their answers faster. But there is no limit to perfection!

Where to begin?


Suppose you are out of luck: the driver did not arrive and is not responding. What happens to your request to Yandex.Taxi support?



What can be optimized so that problems are solved even faster? Let's start with the first stage, where a ticket is routed to one of the two lines. Initially the choice depended on keywords in the request. This worked, but the routing accuracy was rather low; a classifier built on BERT, the classic transformer encoder model, helped to fix that.

In this task, recall is fixed for the expert lines: cases requiring investigation must not slip past them. But we also fight to raise precision: as few simple requests as possible should land on the expert line, so that the response time for truly critical cases stays within the limits of the user's patience. Classification by machine learning methods turned out to be twice as accurate as keyword analysis, and the speed of reaction to emergency situations grew 1.5 times.
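A sketch of how a fixed-recall threshold might be chosen (pure Python with toy data; the function and numbers are illustrative, not Yandex.Taxi's production code):

```python
def pick_threshold(scored, target_recall=0.99):
    """scored: list of (prob_critical, is_critical) pairs.
    Returns the highest score threshold that still keeps recall
    on critical tickets at or above target_recall."""
    total_critical = sum(1 for _, y in scored if y)
    best = 0.0
    # Walk candidate thresholds from strictest to loosest.
    for thr in sorted({p for p, _ in scored}, reverse=True):
        caught = sum(1 for p, y in scored if y and p >= thr)
        if caught / total_critical >= target_recall:
            best = thr
            break
    return best

# Toy data: 4 critical and 4 simple tickets with model scores.
scored = [(0.95, True), (0.90, True), (0.70, True), (0.60, True),
          (0.80, False), (0.40, False), (0.30, False), (0.10, False)]
thr = pick_threshold(scored, target_recall=1.0)  # -> 0.60 on this data
```

Raising the recall target always pushes the threshold down, trading precision for the guarantee that critical tickets are not missed.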

Trying to automate the work of the expert line with today's technologies is risky: the logic of what happens there is hard to systematize, and any mistake is very expensive. So let us return to the typical, well-studied first-line requests: maybe their processing can be entrusted to algorithms? Then routine tasks will be solved even faster, and employees will be able to pay more attention to controversial cases that do not fit the templates.

To test this idea, we developed a suggest feature: a hint system that offers support staff the 3 most suitable reply options for the current request.

The experiment was successful: in 70% of cases operators chose one of the proposed messages, which reduced response time by 10%. It seems it is time to fully automate the first line.
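At its core, a suggest feature of this kind reduces to taking the top-k classes from the classifier's output. A minimal illustration (the template names and probabilities are invented):

```python
def top_suggestions(probs, k=3):
    """probs: dict mapping template id -> model probability.
    Returns the k most likely templates, best first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

probs = {"refund": 0.05, "double_charge": 0.55,
         "driver_late": 0.25, "dirty_car": 0.15}
top_suggestions(probs)  # -> ["double_charge", "driver_late", "dirty_car"]
```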

We need a plan. What does a first-line employee do?

  1. Reads the text and determines the topic of the request.
  2. Examines trip information.
  3. Selects one of the prepared answers, taking into account the first two points.

An example to dig into. Given: a text request from a distressed user, some trip information, and a caring support employee.



First of all, the employee determines the topic of the request: "Double charge from the card." Next, they check the payment method, the transaction status, and the amount charged. The money was debited once, so what could be the reason? Aha, here it is: two notifications in a row.

What should an auto-answer system do?

All the same. Even the key requirements for the answers will not change:

Quality

If the user complains about the app, there is no need to promise to ask the driver to wash the car. And it is not enough to understand what exactly the problem is; you must describe in detail how to solve it.

Speed
Especially if the situation is critical and the answer is important right now.

Flexibility and scalability.

A task with an asterisk: although the support system starts with Taxi, it would be useful to carry the result over to other services, for example Yandex.Food or Yandex.Lavka. That is, when the support logic changes (response templates, request topics, and so on), we want to reconfigure the system in days, not months.

How it is implemented


Stage 1. We determine the topic of the text using ML.

First, we compiled a tree of request topics and trained a classifier to navigate it. There were about 200 possible problems: with the trip (the driver did not arrive), with the app (I can't attach my card), with the car (a dirty car), and so on.

As mentioned above, we used a pre-trained model based on BERT. To classify the request text, it must be represented as vectors in such a way that sentences similar in meaning lie close together in the resulting space.

BERT is pre-trained on two tasks over unlabeled texts. In the first, 15% of the tokens are randomly replaced with [MASK], and the network predicts the original tokens from their context; this gives the model its natural bi-directionality. The second task teaches it to determine the relationship between sentences: did these two follow each other in the text, or were they taken from different places?
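The masking step can be illustrated with a toy sketch (plain whitespace tokens stand in for BERT's real WordPiece pipeline; the function is purely illustrative):

```python
import random

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace ~rate of the tokens with [MASK]; return the masked
    sequence plus the (position -> original token) targets the
    model would have to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * rate))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]
        masked[i] = "[MASK]"
    return masked, targets

sentence = "the driver did not arrive at the pickup point".split()
masked, targets = mask_tokens(sentence)
```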

After fine-tuning the BERT architecture on a sample of Yandex.Taxi support requests, we got a network capable of predicting the topic of a message, adjusted for the specifics of our service. However, the topics and their frequencies change over time: so that the network keeps up with them, we separately retrain only the lower layers of the model on the latest data from the past few weeks. This way, knowledge of the support texts' features is preserved, while the probabilities of the possible classes stay adequate to the current day.
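Partial retraining amounts to marking most of the network as frozen and updating only the chosen layers. A toy sketch of the bookkeeping (the layer names and layout are invented; a real implementation would set the corresponding parameters' gradient flags instead):

```python
def set_trainable(layers, retrain):
    """Mark only the layers named in `retrain` as trainable;
    the rest keep their already-learned weights."""
    for layer in layers:
        layer["trainable"] = layer["name"] in retrain
    return layers

# Hypothetical layout: 12 encoder blocks plus a classifier head.
model = [{"name": f"encoder_{i}"} for i in range(12)] + [{"name": "head"}]
set_trainable(model, retrain={"encoder_0", "encoder_1", "head"})
```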

A little more about adequacy: for all of our services, including Taxi, we have developed a whole library of model architecture modules and methods for validating probability thresholds.

Stage 2. We work with trip information: we define business rules for each template.

Support staff were offered an interface where a mandatory rule had to be specified for each response template. Here is how it looks, for example, for the double payment case:

Template: “Hello! I checked everything: the trip was paid for once. The money is first ‘frozen’ on your card and only then debited, which is why the bank may report the same transaction twice. Please check your bank statement to make sure. If you see two debited amounts there, please send a scan or a photo of the statement.”

Rule: payment_type is "card" and transaction_status is "clear_success" and transaction_sum == order_cost

For customer support templates alone, our experts have already written more than 1.5 thousand rules.
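A rule like this reduces to a predicate over trip fields. Below is a minimal sketch (the dictionary layout and evaluator are illustrative assumptions, not the actual admin-panel format; the rule body mirrors the double-payment example above):

```python
# Each template maps to a predicate over trip data.
RULES = {
    "double_charge_reply": lambda t: (
        t["payment_type"] == "card"
        and t["transaction_status"] == "clear_success"
        and t["transaction_sum"] == t["order_cost"]
    ),
}

def rule_holds(template_id, trip):
    """True if the template's business rule fires for this trip."""
    return RULES[template_id](trip)

trip = {"payment_type": "card", "transaction_status": "clear_success",
        "transaction_sum": 340, "order_cost": 340}
rule_holds("double_charge_reply", trip)  # -> True for this trip
```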

Stage 3. We choose the answer: we combine the text's topic with the business rules of the templates.

Each topic is matched with suitable response templates: the topic is determined by ML methods, and the templates assigned to it are checked against the rules from the previous step. The user receives the response whose rule evaluates to "True". If there are several such options, the one most popular among support staff is selected.
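The selection logic described above can be sketched as follows (all names, rules, and popularity counts are invented for illustration):

```python
def choose_answer(topic, trip, topic_templates, rules, popularity):
    """Candidates come from the ML-predicted topic; business rules
    filter them; among the survivors the most popular wins."""
    candidates = topic_templates.get(topic, [])
    valid = [t for t in candidates if rules[t](trip)]
    if not valid:
        return None  # no template fits -> hand over to an operator
    return max(valid, key=lambda t: popularity.get(t, 0))

topic_templates = {"double_charge": ["refund_reply", "double_charge_reply"]}
rules = {
    "double_charge_reply": lambda t: t["transaction_sum"] == t["order_cost"],
    "refund_reply": lambda t: t["transaction_sum"] > t["order_cost"],
}
popularity = {"double_charge_reply": 120, "refund_reply": 45}
trip = {"transaction_sum": 340, "order_cost": 340}
choose_answer("double_charge", trip, topic_templates, rules, popularity)
# -> "double_charge_reply": its rule holds, so it is the only candidate
```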



By the way, the processes of interaction with drivers in Yandex.Taxi do not change at all: the model merely selects the right template for the operator and answers the user on its own.

Finalize


Hooray! The system is designed, the launch has taken place, the optimization shows excellent results, but it is too early to relax. The auto-answer system must run stably without constant intervention and must scale easily, on its own or in semi-manual mode. We achieved this thanks to the system's three-part structure:

  1. Offline development: models are changed and rules are prepared at this stage;
  2. Production service: a microservice that picks up updates, applies them, and responds to users in real time;
  3. Post-hoc analysis of the results, to make sure the new model works correctly and users are happy with the auto answers.



And back to examples. Here are the most popular customer wishes (and how we handle them without writing code):

Taxi has cool auto answers: I want the same in Yandex.Food

To connect any service's support to our system, four simple steps are needed:

  1. Create a topic tree for texts;
  2. Match each topic with response templates;
  3. Fill in the set of rules for the templates in our admin panel;
  4. Provide a correspondence table between user requests and support answers.

If all this is in place, we point the pipeline at the new data upload, the model learns from the received data and is pulled into our microservice together with all the defined rules (integrated with the specific ML topic). Note: no new logic is written; everything stays within the existing process!
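The four artifacts above might be bundled into a single configuration. The shape below is an invented illustration, not the actual admin-panel format, and the file path is a placeholder:

```python
# Hypothetical onboarding config for a new service's support.
SERVICE_CONFIG = {
    # 1. Topic tree for request texts.
    "topic_tree": {"trip": ["driver_late"], "payment": ["double_charge"]},
    # 2. Topic -> response templates.
    "topic_templates": {"double_charge": ["double_charge_reply"]},
    # 3. Business rules for the templates.
    "rules": {"double_charge_reply":
              'payment_type is "card" and transaction_sum == order_cost'},
    # 4. Correspondence table: past user requests with operator answers.
    "training_data": "uploads/requests_with_operator_answers.tsv",
}
```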

The support logic has changed, we want new rules

No problem: fill in the new rules in our admin panel. The system will analyze how the changes affect the percentage of auto answers*, taking into account how often each rule is triggered. If everything goes well, the completed rules are turned into a config and loaded into the ML service. Hooray! Less than an hour has passed, the business rules in production have been updated, not a single line of code has been written, and no programmers were disturbed.

* This may not be very obvious, so let's add an example to the example. Suppose the experts introduced a rule: a certain response template may be used only for orders costing more than 200 rubles. If this restriction fires, tickets for cheaper trips will remain unclosed, the share of automatically selected answers will fall, and the efficiency of the whole system will drop. To keep this from happening, it is important to catch such failed rules in time and send them back for revision.
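The impact check from this example can be sketched in a few lines (toy order costs; the function is illustrative, not the production analyzer):

```python
def auto_answer_share(tickets, min_cost=None):
    """Share of tickets a template may still close automatically
    if we additionally require order_cost > min_cost."""
    closed = [t for t in tickets
              if min_cost is None or t["order_cost"] > min_cost]
    return len(closed) / len(tickets)

tickets = [{"order_cost": c} for c in (120, 180, 250, 400, 90)]
before = auto_answer_share(tickets)               # -> 1.0
after = auto_answer_share(tickets, min_cost=200)  # -> 0.4
```

A drop this sharp (from 100% to 40% coverage) is exactly the signal that sends a rule back for revision.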

We added a new topic, we want to update the model, and everything must work tomorrow

Content specialists often want to add new topics, split existing ones, or delete irrelevant ones. No problem: they just change the correspondence between topics and response templates in the admin panel.

If the new or changed topics have already appeared in the answers of first-line support employees, the model will automatically pick up this data during its regular retraining and compute thresholds for the new topics (using data from the last week, except for the set held out for testing).

On the held-out test sample, the old and new models are compared on dedicated metrics: classification accuracy and the share of auto-closed tickets. If the changes are positive, the new model is rolled out to production.
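The roll-out gate can be sketched in a few lines (the metric names and values are invented for illustration):

```python
def should_roll_out(old, new):
    """Roll out the new model only if it loses on neither metric:
    topic accuracy and the share of auto-closed tickets."""
    return (new["accuracy"] >= old["accuracy"]
            and new["auto_share"] >= old["auto_share"])

old = {"accuracy": 0.86, "auto_share": 0.58}
new = {"accuracy": 0.88, "auto_share": 0.60}
should_roll_out(old, new)  # -> True: the new model wins on both metrics
```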

We monitor the metrics: nothing should sink, nothing should break


We focus on two criteria: the average user rating of an auto answer and the appearance of follow-up questions. Changes were monitored in an A/B experiment: there was no statistically significant drawdown in the metrics, and users often rated the model's replies highly because of the speed of response.
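One way such a check might look (we do not know the exact statistical machinery used here; a two-proportion z-test on the share of auto answers that triggered a follow-up question is a standard stand-in, and the counts below are invented):

```python
import math

def z_test_proportions(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions,
    e.g. follow-up-question rate in control vs. experiment."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value via the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Identical rates -> no significant difference.
z_equal, p_equal = z_test_proportions(100, 1000, 100, 1000)
# 10% vs. 20% follow-up rate -> clearly significant drawdown.
z_diff, p_diff = z_test_proportions(100, 1000, 200, 1000)
```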

However, no matter how hard we try, machine learning methods sometimes produce absurd responses. After one of the model updates we caught this case:

User: Thanks to the driver, the car arrived on time, the driver did great, everything went perfectly!
Support: We will punish the driver; this will not happen again.

Fortunately, that launch was a test one. The problem was this: the model had learned to respond to reviews with ratings below 4, and we sometimes mistakenly showed it reviews with 4 and 5 stars. Given that training limitation, the neural network could not answer anything more intelligent. In production such cases are rare (0.1% of the total); we track them and take appropriate measures, so the user's repeated message gets a proper response.

Conclusions and plans for the future


After connecting the automatic response system, we began to respond to user requests much faster and to pay maximum attention to the really complex cases that require detailed investigation. We hope this will help us improve the quality of Yandex.Taxi and minimize the number of unpleasant incidents.

The auto-answer model closes about 60% of first-line tickets without hurting the average user rating. We plan to develop the approach further and raise the share of auto answers on the first line to 99.9%. And, of course, we will keep helping you through the support in our apps, and keep sharing our experience here on Habré.
