Improving Google Duo Audio Quality with WaveNetEQ

Internet calls have become an integral part of millions of people's lives, simplifying their workflows and connecting them with their loved ones. To transmit a call over the Internet, the call data is split into small pieces called packets. These packets travel through the network from the sender to the receiver, where they are reassembled into a continuous video and audio stream. However, packets often arrive at the receiver in the wrong order or at the wrong time (this is usually called jitter), or are lost entirely. Such problems degrade call quality, because the receiver has to try to fill in the gaps, and this seriously affects both audio and video. For example, 99% of Google Duo calls experience packet loss, excessive jitter, or network latency. Of those calls, 20% lose more than 3% of their audio due to network problems, and 10% lose more than 8%.
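The transport problems above can be illustrated with a toy simulation. This is only an illustrative sketch, not how Duo or WebRTC actually model the network: each packet is randomly dropped or given a random delay, and the receiver sees the survivors in arrival order, so sequence numbers can be missing or shuffled.

```python
import random

def simulate_network(packets, loss_prob=0.03, seed=0):
    """Toy model of a lossy network: each packet may be dropped,
    and survivors arrive with a random per-packet delay (jitter)."""
    rng = random.Random(seed)
    arrivals = []
    for seq, payload in enumerate(packets):
        if rng.random() < loss_prob:
            continue  # packet lost in transit
        delay_ms = rng.uniform(0, 40)  # random network delay -> jitter
        arrivals.append((delay_ms, seq, payload))
    # The receiver sees packets in arrival order, not send order.
    arrivals.sort(key=lambda a: a[0])
    return [(seq, payload) for _, seq, payload in arrivals]

packets = [f"frame-{i}" for i in range(10)]
received = simulate_network(packets)
# Sequence numbers may be missing (loss) or out of order (jitter),
# which is exactly what the jitter buffer and PLC must cope with.
```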


Simplified network problem diagram

To make real-time communication more reliable, you have to deal somehow with the packets that never reach the receiver. If no continuous audio signal is produced, the listener hears gaps and stuttering, but simply repeating the same signal over and over is not an ideal solution either: it produces artifacts and reduces overall call quality. The technology for handling missing packets is called packet loss concealment (PLC). The receiver's PLC module is responsible for creating audio (or video) to fill the gaps caused by packet loss, excessive jitter, or other network problems, all of which result in missing data.
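The simplest form of PLC hinted at above can be sketched as follows. This is a deliberately naive illustration, not NetEQ's or WaveNetEQ's actual algorithm: when a frame is lost, the last good frame is repeated with decaying gain, which works for one short gap but sounds buzzy when repeated for longer.

```python
import numpy as np

def naive_plc(frames, lost):
    """Simplest possible PLC: when frame i is lost, repeat the last
    received frame with decaying gain. Acceptable for a single short
    gap; repeating the same waveform over longer gaps causes the
    robotic artifacts described in the article."""
    out = []
    last = np.zeros_like(frames[0])
    gain = 1.0
    for i, frame in enumerate(frames):
        if i in lost:
            gain *= 0.5             # attenuate each repeated copy
            out.append(last * gain)
        else:
            last, gain = frame, 1.0
            out.append(frame)
    return np.concatenate(out)

# Four 10 ms frames at 16 kHz (160 samples each); frame 2 is "lost".
frames = [np.ones(160) * v for v in (0.1, 0.2, 0.3, 0.4)]
audio = naive_plc(frames, lost={2})
```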

To deal with these audio issues, we introduced a new PLC system in Duo called WaveNetEQ. It is a generative model based on DeepMind's WaveRNN technology, trained on a large corpus of speech data to realistically continue segments of speech. It can fully synthesize the raw audio waveform of the missing speech fragments. Since Duo calls are end-to-end encrypted, all processing has to happen on the device itself. The WaveNetEQ model is fast enough to run on a phone, while still providing excellent audio quality and more natural-sounding PLC than other existing systems.

New PLC system for Duo


Like many other Internet communication programs, Duo is based on the open-source WebRTC project. To conceal the effects of packet loss, WebRTC's NetEQ component uses signal processing methods that analyze the speech and produce a smooth continuation. This works well for small losses (up to 20 ms), but it starts to sound bad when packet loss leads to gaps of 60 ms or longer. In those cases the speech becomes robotic and repetitive, a characteristic sound that is, unfortunately, all too familiar to many people who make calls over the Internet.
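A classical signal-processing PLC of the kind described above can be sketched like this. This is a simplified illustration, not NetEQ's actual implementation: it estimates the pitch period of the last good audio by autocorrelation and loops that period to fill the gap, which shows why long gaps start to sound repetitive.

```python
import numpy as np

def estimate_period(x, fs=16000, fmin=80, fmax=400):
    """Crude pitch-period estimate via autocorrelation, the kind of
    signal-processing building block a classical PLC relies on."""
    lo, hi = fs // fmax, fs // fmin
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    return lo + int(np.argmax(ac[lo:hi]))

def classical_plc(x, gap_len):
    """Fill a gap by looping the last pitch period of good audio.
    Fine for ~20 ms; longer gaps repeat the same period many times
    and start to sound robotic."""
    period = estimate_period(x)
    cycle = x[-period:]
    reps = int(np.ceil(gap_len / period))
    return np.tile(cycle, reps)[:gap_len]

fs = 16000
t = np.arange(fs // 10) / fs              # 100 ms of past context
good = np.sin(2 * np.pi * 200 * t)        # 200 Hz tone -> 80-sample period
fill = classical_plc(good, gap_len=960)   # synthesize a 60 ms gap
```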

To improve packet-loss handling, we replaced the NetEQ PLC with a modified version of WaveRNN, a recurrent neural network designed for speech synthesis that consists of two parts: an autoregressive network and a conditioning network. The autoregressive network is responsible for signal continuity and produces the short- and medium-term structure of the speech; each sample it generates depends on the network's previous outputs. The conditioning network influences the autoregressive network so that it produces audio consistent with more slowly changing input features.
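The two-part structure can be illustrated with a tiny stand-in model. This is only a schematic sketch, not WaveRNN: a single nonlinear step replaces the real recurrent network, but it shows the key property that each sample depends on the previous output plus a conditioning vector that steers generation.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyAutoregressive:
    """Minimal stand-in for the autoregressive part: each output
    sample depends on the previous output plus a conditioning
    vector (one tanh layer here, not a real WaveRNN)."""
    def __init__(self, cond_dim, hidden=16):
        self.w_x = rng.normal(0, 0.5, (hidden,))
        self.w_c = rng.normal(0, 0.5, (hidden, cond_dim))
        self.w_o = rng.normal(0, 0.5, (hidden,))

    def step(self, prev_sample, cond):
        h = np.tanh(self.w_x * prev_sample + self.w_c @ cond)
        return float(np.tanh(self.w_o @ h))

    def generate(self, n, cond, x0=0.0):
        out, x = [], x0
        for _ in range(n):
            x = self.step(x, cond)   # feed own output back in
            out.append(x)
        return np.array(out)

model = TinyAutoregressive(cond_dim=4)
cond = np.ones(4) * 0.1              # slowly varying conditioning features
samples = model.generate(160, cond)  # 10 ms of "audio" at 16 kHz
```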

However, WaveRNN, like its predecessor WaveNet, was created for converting text to speech (text-to-speech, TTS). As a TTS model, it is told what should be said and how. The conditioning network receives this information directly as input, in the form of the phonemes that make up the words, together with prosodic features (non-textual information such as pitch or intonation). In a sense, the conditioning network can "look into the future" and steer the autoregressive network toward the corresponding sounds. In the case of a PLC system operating on real-time calls, no such context is available.

A functional PLC system must both extract context from the current speech (i.e., from the past) and generate plausible audio to continue it. Our solution, WaveNetEQ, does both. It uses an autoregressive network that continues the audio during packet loss, and a conditioning network that models long-term features such as voice characteristics. A spectrogram of the past audio signal is fed into the conditioning network, which extracts a limited amount of information describing prosody and textual content. This condensed information is fed into the autoregressive network, which combines it with the recent audio to predict the next sample of the waveform.
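The spectrogram features fed to the conditioning network can be sketched as follows. This is an assumption-laden illustration, not WaveNetEQ's actual front end: frame the past audio, take log-magnitude FFT frames, and keep only a few low-frequency bins to mimic the "limited amount of information" the conditioning network receives.

```python
import numpy as np

def log_spectrogram(x, frame_len=320, hop=160, n_bins=64):
    """Compute a compact log-magnitude spectrogram of past audio.
    Keeping only a few low bins (an illustrative choice) mimics the
    condensed prosody/content features given to the conditioning
    network."""
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        mag = np.abs(np.fft.rfft(frame))[:n_bins]
        frames.append(np.log(mag + 1e-6))
    return np.stack(frames)          # shape: (n_frames, n_bins)

fs = 16000
t = np.arange(fs) / fs               # 1 s of past audio
past = np.sin(2 * np.pi * 220 * t)
features = log_spectrogram(past)
```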

This differs slightly from the procedure followed during WaveNetEQ's training, where the autoregressive network received the real audio sample as input for the next step instead of using its own previous prediction. This process, known as teacher forcing, guarantees that the model learns valuable information even in the early stages of training, when its predictions are still poor. Once the model is fully trained and used in audio or video calls, teacher forcing is only used to "warm up" the model on the first samples; after that, the model's own output is passed back in as input.
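The warm-up-then-free-running pattern can be written out generically. This is a schematic sketch with a toy smoothing step standing in for the real model: during warm-up the step function is fed the real past samples (teacher forcing), and afterwards it is fed its own previous output.

```python
import numpy as np

def run_model(step_fn, context, n_generate):
    """Teacher forcing during warm-up, then free-running generation.
    During warm-up the model is fed the *real* previous samples;
    afterwards it is fed its *own* previous output."""
    x = 0.0
    for real_prev in context:        # warm-up: teacher forcing
        x = step_fn(real_prev)       # model sees ground-truth input
    out = []
    for _ in range(n_generate):      # inference: free-running
        x = step_fn(x)               # model sees its own output
        out.append(x)
    return np.array(out)

# A toy "model": an exponential smoother standing in for WaveNetEQ.
state = {"h": 0.0}
def toy_step(prev):
    state["h"] = 0.9 * state["h"] + 0.1 * prev
    return state["h"]

context = np.sin(np.linspace(0, np.pi, 50))   # real past audio samples
generated = run_model(toy_step, context, n_generate=20)
```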


WaveNetEQ architecture. During inference, we "warm up" the autoregressive network via teacher forcing; afterwards, its own output is fed back in as input. A spectrogram computed from a longer stretch of audio is used as input to the conditioning network.

This model is applied to the audio data in Duo's jitter buffer. When the real audio signal resumes after packet loss, we seamlessly merge the synthetic and real audio streams. To blend the two signals as well as possible, the model generates slightly more output than is strictly needed, and then cross-fades from one to the other. This makes the transition smooth and virtually inaudible.
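The cross-fade described above can be sketched in a few lines. A minimal illustration, assuming a simple linear fade (the actual fade shape used in Duo is not specified in the article): the extra synthesized samples overlap the first real samples, and the two are blended so the transition has no discontinuity.

```python
import numpy as np

def crossfade(synthetic, real, overlap):
    """Blend the tail of the synthetic PLC audio into the resumed
    real stream with a linear cross-fade over `overlap` samples.
    The model generates slightly more audio than the gap precisely
    so that this overlap region exists."""
    fade_out = np.linspace(1.0, 0.0, overlap)
    fade_in = 1.0 - fade_out
    blended = synthetic[-overlap:] * fade_out + real[:overlap] * fade_in
    return np.concatenate([synthetic[:-overlap], blended, real[overlap:]])

synthetic = np.full(480, 0.5)   # 30 ms of synthesized audio @ 16 kHz
real = np.full(320, -0.5)       # real packets resume
out = crossfade(synthetic, real, overlap=80)  # 5 ms overlap
```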


Simulation of PLC events on audio with a 60 ms sliding window. The blue line is the real audio signal, including the past and future parts of the PLC event. At each time step, the orange line shows the synthetic audio that WaveNetEQ would predict if the audio were cut off at the gray vertical line.

60 ms packet loss

[Translator's note: the audio examples look clumsy because the Habr editor does not allow embedding audio files; this is how an mp4 with audio only, and no video, is displayed.]

NetEQ


WaveNetEQ


NetEQ


WaveNetEQ


120 ms packet loss

NetEQ


WaveNetEQ


NetEQ


WaveNetEQ


Ensuring reliability


One important factor for a PLC system is the network's ability to adapt to variable input signals, for example when there are multiple speakers or when the background noise changes. To ensure the model's reliability across a wide range of users, we trained WaveNetEQ on a speech dataset with more than 100 speakers in 48 different languages. This allows the model to learn the general characteristics of human speech rather than the properties of a particular language. To make sure WaveNetEQ works with background noise, for example when you answer a call on a train or in a café, we augment the data by mixing it with background noise from an extensive database.
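The noise-mixing augmentation mentioned above is a standard technique and can be sketched as follows. This is a generic illustration, not Google's actual pipeline: scale a noise clip so that it mixes with the clean speech at a chosen signal-to-noise ratio.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Data augmentation: mix a clean utterance with background
    noise at a chosen signal-to-noise ratio, so the model sees
    noisy environments during training."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale noise so that 10*log10(p_speech / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 150 * np.arange(16000) / 16000)  # 1 s tone
noise = rng.normal(0, 0.1, 16000)                            # "café" noise
noisy = mix_at_snr(speech, noise, snr_db=10)
```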

And although our model can learn to plausibly continue speech, this only works over short periods of time: it can finish a syllable, but it cannot predict whole words. For longer packet losses, we gradually fade the volume, and after 120 ms the model produces only silence. To verify that the model does not produce erroneous syllables, we evaluated audio samples from WaveNetEQ and NetEQ with the Google Cloud Speech-to-Text API and found that the model makes practically no difference to the word error rate of the transcribed text, i.e., the number of mistakes made during speech recognition. We experimented with WaveNetEQ in Duo, and it had a positive impact on call quality and user experience. WaveNetEQ already runs on all Duo calls on Pixel 4 phones, and we are now rolling it out to other phones.
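The fade-to-silence behavior can be sketched as a gain schedule. The 120 ms silence point comes from the article; the point at which the fade begins is an assumption made here purely for illustration.

```python
import numpy as np

def plc_gain(t_ms, fade_start_ms=80.0, silence_ms=120.0):
    """Gain applied to synthesized PLC audio as a gap grows:
    full volume at first, then a linear fade so that by
    `silence_ms` (120 ms, per the article) only silence remains.
    `fade_start_ms` is a hypothetical value for illustration."""
    if t_ms <= fade_start_ms:
        return 1.0
    if t_ms >= silence_ms:
        return 0.0
    return 1.0 - (t_ms - fade_start_ms) / (silence_ms - fade_start_ms)

gains = [plc_gain(t) for t in (0, 60, 100, 120, 200)]
```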
