The Talking Head project, or the story of developing software for a telepresence robot

My work at T-Systems began with a rather unusual project. In fact, it was largely because of this project that I joined the company. During our first phone conversation, the task was described to me as follows: we need to develop a speech translation system for a telepresence robot. I had never heard of telepresence robots before, but what could be more exciting than developing for robots? So I agreed almost immediately.



My first call with the customer. The task was posed roughly like this: we want a speech translation system that works during a conversation through the telepresence robot, that is, a person speaks Russian, and at the other end of the "wire" the robot reproduces their speech in German, and vice versa. Immediately after this statement of the problem came the question of how many people and how much time the development would take, and how much it would cost. We agreed that I would study the documentation for the robot (the robot itself was not in our office), look into translation systems, and give rough estimates in a month; in the meantime we would have calls a couple of times a week to clarify the details.

The documentation for the robot was scarce, to put it mildly. Nothing could be found apart from marketing materials, so I wrote to the developers. The existing system allowed making a video call to the telepresence robot from a web browser and controlling its movements at the remote location, and the obvious idea I had was: could a speech translation system somehow be integrated into this existing system?

In parallel, I needed to understand how the translation system could be built and, most importantly, what the translation quality would be: would it be acceptable to the customer? So, alongside the correspondence with the robot's developers, I decided to arrange a call with the customer and communicate with him through Translator, a mobile application from Microsoft that translates speech and voices the translation in different languages. The application is free, installs on Android or iOS, and uses the corresponding Microsoft service. During the call we had to find out whether it was possible to hold a conversation this way. For the purity of the experiment, I spoke Russian and the customer spoke German, that is, neither of us knew the other's language at all. We connected through a conference call, each spoke our native language into Translator, and then held the phone up to the conference handset.

In general, this did not look very convenient, but it gave a general idea of the translation quality (an inquisitive reader might say it would have been easier to call via Skype with translation turned on, but for some reason that did not occur to us at the time). After the conversation we decided that the project would go ahead and that the quality was acceptable to the customer.

So, by the end of the first month of work on the project, the following had become clear. The existing system for the telepresence robot cannot be modified, since the robot's developers do not provide an API for that. However, it turned out that the robot is essentially a Segway to which an iPad is attached, and there is an iOS SDK with which the Segway can be controlled. In other words, we had to write our own system from scratch, without counting on integration with the existing one. The first schematic of the future product appeared.

 

After discussion with the customer, it was decided that three people would work on the project: an iOS developer plus backend and frontend developers. We would divide the project into stages of 2-3 months, set a goal for each stage, and conclude a separate contract for each. At the end of each stage there would be a demonstration of the results to the customer and the project's investor. It was important for the customer to have a working POC by the new year, and since I had joined the company in July, the deadlines were quite tight.

The goal of the first two-month stage was to find two more developers and choose services for video calls and translation. At the end of the stage we had to show a simple web application that allowed making a video call with an integrated translation service.

Video calls nowadays are usually built on WebRTC, which is supported by most web browsers. However, given our limited resources, we did not dare to build on bare WebRTC and instead chose a provider offering a service on top of it. We considered two services, TokBox and Voxeet, and in the end chose TokBox. Keep in mind that these services are paid, so along with the description of the application architecture we calculated the cost of a minute of a video call with translation.
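To give an idea of how much such a provider simplifies things compared to bare WebRTC, here is a minimal sketch of joining a call with the OpenTok (TokBox) JavaScript SDK. The element IDs and the apiKey / sessionId / token values are placeholders, not the ones from our application.

```typescript
// Minimal TokBox (OpenTok) call sketch. The SDK is loaded via a <script> tag
// and exposes a global OT object; apiKey, sessionId and token would normally
// be generated per call by the backend.
declare const OT: any;

function joinCall(apiKey: string, sessionId: string, token: string): void {
  const session = OT.initSession(apiKey, sessionId);

  // Show the remote participant (robot or operator) as soon as their stream
  // appears in the session.
  session.on('streamCreated', (event: any) => {
    session.subscribe(event.stream, 'subscriber', { insertMode: 'append' });
  });

  session.connect(token, (err?: Error) => {
    if (err) {
      console.error('Failed to connect to the session', err);
      return;
    }
    // Publish our own camera and microphone into the call.
    const publisher = OT.initPublisher('publisher', { insertMode: 'append' });
    session.publish(publisher);
  });
}
```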



We also considered several translation services: TranslateYourWorld.com (it provides a single API that unites several translation providers at once, including Google and Microsoft, but the company is very small and unreliable; their site did not even open at the time of writing), Google Cloud Translation, and Microsoft Speech. In the end we settled on the MS Speech service: it has a more convenient API, we can stream audio to the service over a websocket and immediately receive audio with the translation, and the translation quality is good. In addition, Microsoft is a partner of our company, which makes cooperation with them easier.
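At the time we talked to the service over its raw websocket protocol; today the same flow is wrapped by the microsoft-cognitiveservices-speech-sdk package, which the illustrative sketch below uses. The subscription key, region and voice name are placeholders.

```typescript
import * as sdk from 'microsoft-cognitiveservices-speech-sdk';

// Illustrative only: translate Russian speech into German and receive
// synthesized audio of the translation. Key, region and voice are placeholders.
const config = sdk.SpeechTranslationConfig.fromSubscription('<key>', '<region>');
config.speechRecognitionLanguage = 'ru-RU'; // what the speaker says
config.addTargetLanguage('de');             // what the listener hears
config.voiceName = 'de-DE-KatjaNeural';     // voice used to speak the translation

const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();
const recognizer = new sdk.TranslationRecognizer(config, audioConfig);

// Partial results arrive while the person is still talking; these can drive
// the running subtitle lines described later.
recognizer.recognizing = (_s, e) =>
  console.log('partial:', e.result.text, '->', e.result.translations.get('de'));

// Synthesized audio of the translation arrives in chunks and can be played
// to the other party.
recognizer.synthesizing = (_s, e) => {
  const chunk: ArrayBuffer = e.result.audio;
  if (chunk && chunk.byteLength > 0) {
    // forward the chunk into the call's audio track
  }
};

recognizer.startContinuousRecognitionAsync();
```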

An experienced, talented full stack developer (Misha, hello!) joined me, and in early October our small team successfully closed the next stage and demonstrated a simple web application that, using the TokBox and MS Speech services, made it possible to hold video calls with translation into several languages, including Russian and German.

The next stage was planned for three months, and its goal was an MVP with the robot. At this point the third and last developer joined us. The following technology stack was chosen for the project: NodeJS + MongoDB for the backend, ReactJS for the web frontend, and Swift for the iOS application that interacts with the robot. How I drove the robot from Germany to St. Petersburg and explained, first to the German customs officers and then to ours, what kind of device was in the box deserves a separate story.





In the early stages, while we did not yet have the robot, the iOS application only allowed making a video call with translation. When the robot arrived in early December, we quickly added sending control signals to the robot through the already debugged call connection, and it became possible to drive the robot by pressing the arrow keys on the keyboard. Thus, in January we closed the third stage and demonstrated its results to the customer.
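Conceptually, the operator side can be sketched like this: arrow keys map to drive commands, which are sent through the existing TokBox session via its signal API. The 'drive' signal type and the command names below are illustrative rather than our actual protocol.

```typescript
// Sketch of the operator-side controls: arrow keys are mapped to drive
// commands and sent through the already established OpenTok session.
type DriveCommand = 'forward' | 'backward' | 'left' | 'right';

const keyMap: Record<string, DriveCommand> = {
  ArrowUp: 'forward',
  ArrowDown: 'backward',
  ArrowLeft: 'left',
  ArrowRight: 'right',
};

function attachDriveControls(session: any): void {
  window.addEventListener('keydown', (e: KeyboardEvent) => {
    const command = keyMap[e.key];
    if (!command) return;
    // The iOS app on the robot listens for this signal type and forwards
    // the command to the Segway SDK.
    session.signal({ type: 'drive', data: command }, (err?: Error) => {
      if (err) console.error('Failed to send drive command', err);
    });
  });
}
```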

The result of the penultimate stage was a debugged system with all critical bugs fixed. When calling the robot, the operator sees the room through the main camera and the floor through a secondary camera (to make it easier to steer around obstacles). The operator's face is displayed on the robot's screen. The conversation can be conducted with translation into any of the seven supported languages, including Russian and German. During translation, two lines of text run across the screen: the first is the recognized speech and the second is its translation into the other language, with karaoke-style highlighting so that it is clear which part of the phrase is currently being played at the other end of the connection.
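The karaoke highlighting itself boils down to splitting the translated phrase into an already-spoken part and a pending part as playback progresses. Here is a simplified sketch; our real implementation tracked playback more precisely than this linear mapping.

```typescript
// Given the translated phrase and playback progress (0..1), split the text
// into the part that has already been spoken and the part still to come.
// The UI renders the first part in a highlighted style.
function splitForKaraoke(translation: string, progress: number): [string, string] {
  const clamped = Math.min(Math.max(progress, 0), 1);
  const boundary = Math.round(translation.length * clamped);
  return [translation.slice(0, boundary), translation.slice(boundary)];
}

// Example: halfway through playback of the German phrase.
const [spoken, pending] = splitForKaraoke('Guten Tag, wie geht es Ihnen?', 0.5);
// spoken  -> highlighted part
// pending -> dimmed part
```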



And finally, at the last stage we added a user management system, user groups, and statistics and reports on the number and duration of calls. Thus, a year later, we successfully completed the project and handed it over to the customer.
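For the call reports, a simple aggregation over the call log is enough. Below is a sketch with the Node MongoDB driver; the database, collection and field names are assumptions for illustration, not our actual schema.

```typescript
import { MongoClient } from 'mongodb';

// Illustrative report query: number and total duration of calls per user for
// a given month. Names like 'calls', userId, startedAt and durationSec are
// placeholders for this sketch.
async function monthlyCallReport(uri: string, year: number, month: number) {
  const client = await MongoClient.connect(uri);
  try {
    const calls = client.db('talkinghead').collection('calls');
    return await calls
      .aggregate([
        {
          $match: {
            startedAt: {
              $gte: new Date(year, month - 1, 1),
              $lt: new Date(year, month, 1),
            },
          },
        },
        {
          $group: {
            _id: '$userId',
            totalCalls: { $sum: 1 },
            totalDurationSec: { $sum: '$durationSec' },
          },
        },
      ])
      .toArray();
  } finally {
    await client.close();
  }
}
```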

In the course of the project, new and interesting ideas came up. For example, putting virtual reality goggles on the operator so that he gets a full immersion effect and can control the robot by turning his head.

We also conducted a study into making the robot self-driving using the ROS framework, in order to "teach" it to avoid obstacles on its own and get from point A to point B, or to meet guests at the office entrance. All in all, the project turned out to be quite fascinating, and who knows what other interesting features we will add to the system in the future.

