Machine Translation: From the Cold War to the Present

Machine translation has become very widespread in recent years. Surely most of my readers have used Google Translate or Yandex.Translate at least once. It is also likely that many remember that not so long ago, some five years back, using automatic translators was very difficult: difficult in the sense that the translations they produced were of very poor quality. Under the cut is a brief and incomplete history of machine translation, which traces how the task developed along with some of its causes and consequences. First, a picture illustrating an important concept in machine translation:



This concept is called the "noisy channel", and it came from radio engineering. In various versions it is attributed to various scientists: Nyquist, Kupfmüller, Shannon. In this dispute, though, I am rooting for our compatriot Vladimir Alexandrovich Kotelnikov, who proved his famous theorem in a 1933 paper. The theorem itself is outside the scope of this article, so I refer interested readers to Wikipedia.

For us, something else matters. The noisy channel concept found an application in a new area: automatic machine translation. After the end of World War II, our overseas partners decided that the Soviet Union, having shown its strength by defeating the best army in Europe and the world, posed a serious threat. Various measures were taken against this threat, among them work on automatic translation from Russian into English. This was needed because the Soviet Union produced an enormous amount of information: television programs, radio broadcasts, books and magazines. And if you also count the communications of the Soviet allies in the future Warsaw Pact, the scale of the problem became simply frightening: training, let alone maintaining, such an army of professional translators was out of the question. And so an idea was born: let us pretend that a text in Russian is just a garbled text in English, and try to restore the "source" text algorithmically. This is exactly what Warren Weaver proposed in 1949.
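In modern notation (Weaver's memorandum itself was informal, so this is the standard later formalization), the idea reads as a Bayesian decision rule: pick the English sentence $e$ that most plausibly produced the observed Russian sentence $r$:

$$
\hat{e} \;=\; \arg\max_{e} P(e \mid r) \;=\; \arg\max_{e} \underbrace{P(r \mid e)}_{\text{translation model}} \,\underbrace{P(e)}_{\text{language model}}
$$

The translation model says how English turns into Russian in the "channel", and the language model says what fluent English looks like; the statistical systems discussed below estimate the pieces of this product in one form or another.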

Conceptually it looks beautiful; the question is how to implement it. Jumping far ahead in time, it was eventually realized in the form of so-called phrase-based translation.

But let's take things in order. What is the simplest way to translate that comes to mind? Dictionary translation: take a ready-made dictionary and replace every word in the sentence with its equivalent in the other language. This approach was proposed by the well-known IBM in 1989. It has an obvious drawback: word order in different languages can differ, sometimes dramatically. The next step is to allow permutations of words. And how can these permutations be predicted? In the same work another model was proposed (if the first one is called Model 1, the second, quite logically, is called Model 2). In this system, besides the dictionary, there is a so-called alignment model: a correspondence between the words of the two sentences. The alignment is learned from corpus statistics. The obvious drawback of this model is that preparing a corpus with alignments takes a lot of effort: professional translators must not only translate the text, but also indicate which word is a translation of which.
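To make the "dictionary learned from statistics" part concrete, here is a minimal sketch of the classic expectation-maximization procedure for the Model 1 translation table, on a made-up toy corpus (the German-English textbook example; real training data, vocabularies and many details are of course far bigger):

```python
from collections import defaultdict

# Toy parallel corpus (made up for illustration).
corpus = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]

# t[(f, e)] = P(foreign word f | english word e), uniform at the start.
f_vocab = {f for fs, _ in corpus for f in fs.split()}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):                      # EM iterations
    counts = defaultdict(float)          # expected counts c(f, e)
    totals = defaultdict(float)          # expected counts c(e)
    for fs, es in corpus:
        for f in fs.split():
            # E-step: distribute this occurrence of f over all e
            # in the sentence, proportionally to the current t(f|e).
            norm = sum(t[(f, e)] for e in es.split())
            for e in es.split():
                p = t[(f, e)] / norm
                counts[(f, e)] += p
                totals[e] += p
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in counts.items():
        t[(f, e)] = c / totals[e]

print(round(t[("haus", "house")], 2))    # should approach 1.0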

It is worth noting that besides word order there is, for example, the problem that some words get no translation at all (say, Russian has no articles), while some words need more than one word in the translation (say, a preposition plus a noun). The IBM researchers called this fertility and built a statistical model for it as well. That is Model 3 (fairly predictable, isn't it?). The same work describes several more models that develop these ideas, for instance by conditioning a word's translation on the previous word, since some words combine with each other better and therefore occur together more often. This whole group of models gave rise to so-called phrase-based translation.

This direction lived and developed; in particular, an open-source machine translation framework, Moses, appeared (its official website shows it has somewhat fallen into decay). For a while it was the workhorse of machine translation, although machine translation itself was not so widespread back then. But in 2014 a terrible thing happened: deep learning reached machine translation. A year earlier, if you remember, it had reached vector representations of words; I covered that in my article about embeddings. And in 2014 a paper by Dzmitry Bahdanau (with co-authors, one of whom was the famous Yoshua Bengio) came out, entitled Neural Machine Translation by Jointly Learning to Align and Translate. In it, Bahdanau proposed the attention mechanism for recurrent neural networks, and with its help managed to beat the aforementioned Moses by a significant margin.

Here we need to digress and talk about how machine translation quality is measured. Papineni's 2002 paper proposed the BLEU metric (bilingual evaluation understudy). At its core, the metric counts how many words of the machine translation match words in the human reference; then the same is done for combinations of two, three and four words. All these figures are averaged into a single number describing the quality of a machine translation system on a given corpus. The metric has its drawbacks: for example, there can be several different valid human translations of the same text. Yet, surprisingly, in almost 20 years nothing better for assessing translation quality has been proposed.
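Here is a minimal sketch of that computation: modified n-gram precisions for n = 1..4 combined as a geometric mean, times a brevity penalty. This toy version assumes a single reference and skips smoothing; for real evaluation a standard implementation (e.g. sacreBLEU) should be used:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: one reference, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "Modified" precision: each reference n-gram is matched at most once.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0                    # any zero precision kills the score
        log_precisions.append(math.log(overlap / total))
    # The brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat".split()
print(bleu("the cat sat on a mat".split(), ref))      # high, but below 1.0
print(bleu("the the the the the the".split(), ref))   # zero
```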

But back to the attention mechanism. It should be said that recurrent networks had been proposed some 15 years earlier and caused no furor back then. A significant problem with these networks was that they quickly forgot what they had "read". For machine translation, the attention mechanism helped to partially solve this problem. Here it is in a picture:



What does it do? It weights the words of the input sentence so as to produce a single context vector for each word of the translation. This is what made it possible to build alignment matrices automatically, from raw text with no markup. For example, like this one:

[Figure: an example alignment matrix learned by the attention mechanism]
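In code, one step of this additive ("Bahdanau") attention looks roughly as follows. This is a numpy sketch with made-up dimensions and random weights; in the real model the matrices are learned, and the context vector feeds into the decoder. The softmax weights, collected over all output steps, form exactly the kind of alignment matrix shown above:

```python
import numpy as np

rng = np.random.default_rng(0)
T, h = 5, 8                      # source length, hidden size (made up)
enc = rng.normal(size=(T, h))    # encoder states, one per source word
dec = rng.normal(size=h)         # current decoder state

# Additive scoring: a small feed-forward net over each (enc, dec) pair.
W_enc = rng.normal(size=(h, h))
W_dec = rng.normal(size=(h, h))
v = rng.normal(size=h)
scores = np.tanh(enc @ W_enc + dec @ W_dec) @ v   # shape (T,)

# Softmax turns scores into weights over the source words.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The context vector is the weighted sum of encoder states;
# the weights themselves form one row of the alignment matrix.
context = weights @ enc          # shape (h,)
print(weights.round(2), context.shape)
```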

After everyone saw that this was possible, enormous effort poured into machine translation, and it became the fastest-developing area of natural language processing. Significant quality improvements were achieved, including for distant language pairs such as English-Chinese or English-Russian. Recurrent networks ruled the roost for quite a long time by modern standards: almost 4 years. But at the end of 2017 trumpets sounded, announcing the arrival of a new king of the hill: a paper called Attention Is All You Need (a paraphrase of the famous Beatles song "All You Need Is Love"). It presented the transformer, an architecture consisting almost entirely of attention mechanisms. I wrote about it in more detail in my article on the results of 2017, so I won't repeat myself.
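The transformer's core building block, scaled dot-product attention, fits in a few lines. Below is a numpy sketch with made-up sizes; the real architecture adds learned projections, multiple heads, residual connections and much more:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))   # 6 tokens of dimension 16 (made up)
# In self-attention, queries, keys and values all come from the same
# sequence; the learned projections are omitted in this sketch.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)               # (6, 16)
```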

Since then a lot of water has flowed under the bridge, yet much more remains to be done. For example, two years ago, at the beginning of 2018, Microsoft researchers announced achieving parity with human translation on news documents translated from Chinese into English. The paper drew a lot of criticism, mainly along the lines that equal BLEU numbers say more about the inadequacy of the BLEU metric than about actual parity. But the hype was generated.

Another interesting direction in machine translation is translation without parallel data. As you remember, neural networks allowed us to drop the alignment markup of translated texts when training a translation model. The authors of Unsupervised Machine Translation Using Monolingual Corpora Only presented a system that managed to translate from English to French with reasonable quality (lower, of course, than the best results of the time, but only by about 10%). Interestingly, later that same year the same authors improved their approach using ideas from phrase-based translation.

Finally, the last thing I would like to highlight is so-called non-autoregressive translation. What is that? All models, starting with IBM Model 3, rely on previously translated words when producing the next one. The authors of the paper on non-autoregressive machine translation tried to get rid of this dependence. The quality again came out slightly lower, but such translation can be tens of times faster than autoregressive models. Given that modern models can be very large and slow, this is a significant gain, especially under heavy load.
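The difference is easiest to see side by side. In the sketch below, `model` and its methods are hypothetical placeholders; the point is only that the autoregressive loop is inherently sequential, while the non-autoregressive variant predicts all positions in one parallel pass:

```python
# Hypothetical decoder interfaces, just to contrast the two regimes.

def translate_autoregressive(model, src, max_len=50):
    """Each output token is conditioned on all previous ones,
    so decoding is an inherently sequential loop."""
    out = ["<bos>"]
    for _ in range(max_len):
        token = model.predict_next(src, out)   # depends on `out` so far
        if token == "<eos>":
            break
        out.append(token)
    return out[1:]

def translate_non_autoregressive(model, src):
    """All output positions are predicted independently and in
    parallel, given only the source: one pass instead of a loop."""
    length = model.predict_length(src)         # target length is modeled too
    return model.predict_all(src, length)      # one batched call
```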

It goes without saying that the field does not stand still, and new ideas keep appearing: for example, so-called back-translation, where monolingual data translated by the model itself is used for its further training (sketched below); convolutional networks, which are also faster than the standard transformer; or the use of pre-trained large language models (I have a separate article about them). Unfortunately, there is no way to list everything.
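As an illustration of the back-translation idea just mentioned, here is a sketch of its data flow; `reverse_model`, its `translate` method and the final `train` call are hypothetical placeholders:

```python
# Hypothetical helpers; only the data flow matters here.

def back_translate(reverse_model, mono_target_sentences):
    """Build synthetic (source, target) pairs from monolingual
    target-language text using a reverse-direction model."""
    synthetic_pairs = []
    for tgt in mono_target_sentences:
        src = reverse_model.translate(tgt)   # machine-generated source side
        synthetic_pairs.append((src, tgt))   # human-written target side
    return synthetic_pairs

# The forward model is then trained on real plus synthetic data:
# train(forward_model, real_pairs + back_translate(reverse_model, mono))
```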

Our company employs one of the leading scientists in machine translation, Professor Qun Liu. Professor Liu and I teach a course on natural language processing in which substantial attention is paid specifically to machine translation. If you are interested in this area, you can still join our course, which started a month ago.

And if you feel up to the challenge, we will be glad to see you among the participants of our Chinese-to-Russian translation competition! It starts on April 14 and will last exactly one month. We hope that the participants will achieve new results on this task and manage to push the whole field of machine translation forward. The competition will be held on the MLBootCamp platform, and we are very grateful to the MLBootCamp team, and personally to Dmitry Sannikov, for their help in organizing it.

Competition Link
