Artificial intelligence will, for the first time, create a corpus of ancient Slavic manuscripts


“In days of doubt, in days of painful reflection on the fate of the motherland,” days that have felt especially heavy in recent weeks, what is our support and mainstay? :) That's right: the great and mighty Russian language. And while exchange rates and the pandemic relentlessly occupy the public mind, scientists keep working. Who will create the corpus, a unique "DBMS" of ancient Slavic manuscripts, and why? Read on in our news.

A collaboration of scientists from NUST MISiS, the V.V. Vinogradov Russian Language Institute of the Russian Academy of Sciences (RAS) and HSE, supported by the Commission for Work with Universities and the Scientific Community under the Diocesan Council of Moscow, has launched a large-scale project to build a unique database of ancient Slavic manuscripts, a corpus, using artificial intelligence and machine learning. A corpus of the Old Slavic language will give linguists and historians a powerful tool for studying all modern national Slavic languages and cultures, and a unique key to understanding their heritage.

A corpus is a structured language database: an information and reference system built on a collection of electronic texts in a particular language. It is a hand-picked and specially processed (annotated) set of texts used as the basis for studying the vocabulary and grammar of the language.
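To make the idea of an annotated collection concrete, here is a minimal sketch of how one might model it. It is not the project's actual schema: the `Token` and `Document` classes, their fields, the helper functions and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    form: str   # the word as it appears in the text
    lemma: str  # dictionary form assigned by the annotation
    pos: str    # part-of-speech tag assigned by the annotation


@dataclass
class Document:
    title: str    # hypothetical identifier of a manuscript
    century: int  # rough dating, e.g. 11 for the 11th century
    tokens: List[Token] = field(default_factory=list)


def lemma_frequencies(docs: List[Document]) -> Counter:
    """Count how often each lemma occurs across all documents."""
    counts: Counter = Counter()
    for doc in docs:
        for tok in doc.tokens:
            counts[tok.lemma] += 1
    return counts


def find_by_lemma(docs: List[Document], lemma: str) -> List[Tuple[str, str]]:
    """Return (document title, word form) pairs for every hit of a lemma."""
    return [
        (doc.title, tok.form)
        for doc in docs
        for tok in doc.tokens
        if tok.lemma == lemma
    ]


# Illustrative data only; not taken from any real manuscript.
sample = [
    Document("example-A", 11, [Token("бѣ", "быти", "VERB"),
                               Token("слово", "слово", "NOUN")]),
    Document("example-B", 14, [Token("словеси", "слово", "NOUN")]),
]
hits = find_by_lemma(sample, "слово")
```

The annotation is what makes the lemma search possible: a plain library of texts could only be searched by the exact spelling of a form, whereas here different forms are grouped under one lemma.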


Ancient Slavic texts are a varied body of manuscript monuments from the 11th to 17th centuries, the foundation of all modern national Slavic languages and cultures. Building a systematic corpus of the language is laborious, delicate and painstaking work that requires the combined efforts of professionals from various fields and, according to the scientists, is a task of national importance.

Hieromonk Rodion (Larionov), Deputy Chairman of the Commission for Work with Universities and the Scientific Community under the Diocesan Council of Moscow:
«[...]»
Artificial intelligence will take in this entire gigantic array of data, systematize it, and create algorithms for applying linguistic markup, the defining feature of a corpus. It is this markup that distinguishes a corpus from a simple library.
Projects that apply digital approaches to the analysis of cultural heritage are being actively developed in European countries and are an excellent example of interdisciplinary cooperation.

For written monuments, two main lines of work can be distinguished: converting scanned images into "machine-readable" form, and building language models that simplify the analysis and understanding of the texts. For Slavic texts, whose letterforms (graphemes) are ornate and make wide use of diacritics, no such systematic work has yet been undertaken.
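As a small illustration of why diacritics complicate machine processing, the sketch below builds a diacritic-insensitive search key for a word. It is only an assumption about one possible preprocessing step, not the project's method; the function name `search_key` and the chosen set of marks are hypothetical.

```python
import unicodedata

# Combining marks to ignore when matching words (Unicode names in comments).
# The selection is an illustrative assumption, not an exhaustive list.
STRIP_MARKS = {
    "\u0483",  # COMBINING CYRILLIC TITLO
    "\u0484",  # COMBINING CYRILLIC PALATALIZATION
    "\u0485",  # COMBINING CYRILLIC DASIA PNEUMATA
    "\u0486",  # COMBINING CYRILLIC PSILI PNEUMATA
    "\u0487",  # COMBINING CYRILLIC POKRYTIE
    "\u0300",  # COMBINING GRAVE ACCENT
    "\u0301",  # COMBINING ACUTE ACCENT
}


def search_key(word: str) -> str:
    """Return a lowercase form of `word` with the listed marks removed."""
    # Decompose first so that marks fused into precomposed characters
    # become separate code points and can be filtered.
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch not in STRIP_MARKS)
    # Recompose so letters whose marks were kept (e.g. the breve of "й")
    # return to their usual single-code-point form.
    return unicodedata.normalize("NFC", kept).lower()
```

Only an explicit list of marks is removed, rather than every combining character, so that letters such as "й" are not silently turned into "и". The original spelling would still be stored alongside the key, since the marks themselves carry information a researcher may need.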


[...], MegaScience [...]:
«[...]»


The first stage of the project will be the digitization and annotation of the complex of Old Slavic Menaia of the 11th to 17th centuries in Old Russian, Bulgarian and Serbian. These are liturgical books containing the order of services for every day of the church year, and their manuscripts are held in the collections of the State Historical Museum, the Russian National Library, the Russian State Library, the Russian State Archive of Ancient Acts and the Holy Trinity-St. Sergius Lavra.

Some will ask: who cares about ancient Slavic manuscripts when the world is in complete lockdown? However, it is worth remembering that, after all, “in the beginning was the Word” ...
