Who is working on audio deepfakes and why they are needed

Since the beginning of the year, several new AI systems have appeared that can synthesize video of a talking person from an audio recording. We will look at who is working on such systems and why, and also discuss other tools that make it possible to edit audio recordings.


Photo by Erik-Jan Leusink / Unsplash

What they are doing


In December 2019, researchers from the Technical University of Munich and the Max Planck Institute for Informatics published a paper on the Neural Voice Puppetry system.

To generate a video, the system needs only an audio file with a person's voice and a photo of that person. The process consists of three stages. First, a recurrent neural network analyzes the speech in the recording and extracts per-frame features (logits) that reflect the characteristics of the speaker's pronunciation. These features are passed to a generalizing neural network, which computes the coefficients for a three-dimensional model of the face. Finally, a rendering module generates the resulting video.
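To make the three-stage pipeline more tangible, below is a minimal sketch in Python (PyTorch). All module names, layer sizes and the feature dimensions AUDIO_DIM and N_EXPRESSION_COEFFS are illustrative assumptions, not the authors' actual implementation.

# Stages 1-2 of the pipeline: per-frame audio features -> 3D face expression coefficients.
import torch
import torch.nn as nn

AUDIO_DIM = 29            # per-frame speech features (e.g. recognizer logits), assumed size
N_EXPRESSION_COEFFS = 76  # coefficients of a 3D face model, assumed size

class AudioToExpression(nn.Module):
    def __init__(self):
        super().__init__()
        # Stage 1: recurrent network that summarizes how the speaker pronounces phonemes
        self.rnn = nn.GRU(input_size=AUDIO_DIM, hidden_size=128, batch_first=True)
        # Stage 2: generalizing network that maps speech features to face-model coefficients
        self.to_coeffs = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_EXPRESSION_COEFFS),
        )

    def forward(self, audio_features):        # (batch, frames, AUDIO_DIM)
        hidden, _ = self.rnn(audio_features)  # (batch, frames, 128)
        return self.to_coeffs(hidden)         # (batch, frames, N_EXPRESSION_COEFFS)

# Stage 3 would be a neural renderer that turns the per-frame coefficients into video frames.
coeffs = AudioToExpression()(torch.randn(1, 100, AUDIO_DIM))  # 100 audio frames of one clip
print(coeffs.shape)  # torch.Size([1, 100, 76])

The sketch only shows how audio features flow into face-model coefficients; the rendering stage is the part that produces the final frames.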

The developers say Neural Voice Puppetry produces high-quality videos, though they still have to resolve some issues with audio-to-lip synchronization.

A similar technology is being developed by engineers from Nanyang Technological University in Singapore. Their system combines one person's speech recording with another person's video. First, it builds a 3D model of the face for each frame of the target video. A neural network then analyzes key facial landmarks and modifies the three-dimensional model so that its expressions match the phonemes of the source audio. According to the authors, their tool surpasses comparable systems in quality: in blind tests, respondents marked 55% of the generated recordings as "real".
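As an illustration of the reenactment step described above, here is a toy Python sketch: for each frame of the target video the actor's identity and head pose are kept, while the expression parameters are replaced with ones predicted from the source audio. The FaceParams structure and the reenact function are hypothetical simplifications, not the authors' code; real 3D morphable face models use far more parameters.

# Keep the target actor's identity and pose, swap in audio-driven expressions.
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass
class FaceParams:
    identity: Tuple[float, ...]    # shape of the target actor's face (kept as-is)
    pose: Tuple[float, ...]        # head rotation/translation in this frame (kept as-is)
    expression: Tuple[float, ...]  # mouth/face expression coefficients (replaced)

def reenact(target_frames: List[FaceParams],
            audio_expressions: List[Tuple[float, ...]]) -> List[FaceParams]:
    """Return per-frame face parameters whose expressions follow the source audio."""
    assert len(target_frames) == len(audio_expressions)
    return [replace(frame, expression=expr)
            for frame, expr in zip(target_frames, audio_expressions)]

# Usage: two frames of a target video driven by two audio-predicted expressions
frames = [FaceParams(identity=(0.1,), pose=(0.0,), expression=(0.2,)),
          FaceParams(identity=(0.1,), pose=(0.1,), expression=(0.3,))]
driven = reenact(frames, audio_expressions=[(0.9,), (0.4,)])
print(driven[0].expression)  # (0.9,)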

Where it can be applied


In the future, deepfakes will make it possible to create realistic video avatars: visual personas for voice assistants. In 2017, enthusiast Jarem Archer implemented the Windows 10 Cortana assistant as a hologram; AI systems for generating deepfakes will take such projects to a new level. Another application area is the gaming industry: generating facial animation from an audio track will simplify the work of game designers who fine-tune the facial expressions of virtual characters.

Developers of deepfake technology note that their systems are just a tool, and unfortunately that tool will inevitably be used for illegal purposes. The first such crime was committed in 2019: the director of a British energy company transferred $240,000 to a fraudster who used neural networks to imitate the voice of the head of the company's German parent and asked him to complete the transfer. Academic experts are therefore actively working with law enforcement agencies and policymakers to prevent such situations. For example, the University of Colorado Denver is developing tools for detecting fake audio and video recordings, and there will only be more such projects in the future.

What other projects are there


There are tools that make editing audio recordings as easy as editing ordinary text. For example, Descript offers an audio editor that transcribes the speaker's words and lets you edit them as text. You can add pauses and rearrange fragments, and all edits are synchronized with the audio recording. The developers say the system handles .m4a, .mp3, .aiff, .aac and .wav files, and its transcription accuracy exceeds 93%.
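Under the hood, this kind of text-based editing can be built on word-level timestamps: every transcribed word remembers where it starts and ends in the recording, so edits to the text translate into cuts in the audio. The Python sketch below is a simplified illustration of that idea, not Descript's actual data format.

# Map an edited word order back to (start, end) audio segments to concatenate.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Word:
    text: str
    start: float  # seconds in the source recording
    end: float

def edit_to_segments(transcript: List[Word], kept_order: List[int]) -> List[Tuple[float, float]]:
    """Turn an edited word order (indices into the transcript) into audio segments,
    merging words that are adjacent in the original recording."""
    segments: List[Tuple[float, float]] = []
    for i in kept_order:
        w = transcript[i]
        if segments and abs(segments[-1][1] - w.start) < 1e-6:
            segments[-1] = (segments[-1][0], w.end)   # extend the previous segment
        else:
            segments.append((w.start, w.end))
    return segments

# Usage: drop the filler word "um" and keep the rest in order
words = [Word("Hello", 0.0, 0.4), Word("um", 0.4, 0.7), Word("world", 0.7, 1.1)]
print(edit_to_segments(words, kept_order=[0, 2]))  # [(0.0, 0.4), (0.7, 1.1)]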


Photo by Yohann LIBOT / Unsplash

Other projects appeared around the same time as Descript. Engineers from Princeton University introduced a "Photoshop for audio", the VoCo system. It not only allows recordings to be edited as text, but can also synthesize new phrases in the speaker's voice, taking intonation into account.

In the future, such services will be useful to journalists and media companies that produce audio content. They will also help people with medical conditions who communicate using speech synthesis systems: VoCo and its counterparts will make their voices sound less "robotic".



Additional reading on our Hi-Fi World blog:

“Bitching Betty” and audio interfaces: why they speak in a female voice
Audio interfaces: sound as a source of information on the road, in the office and in the sky
The world's first “gender-neutral” voice assistant
The history of speech synthesizers: the first mechanical devices
How speech synthesis appeared on a PC


