Lemmatize it faster (PyMorphy2, PyMystem3 and some magic)

I work as a programmer, and part of my work is machine learning applied to text analysis. When processing natural language, documents have to be prepared beforehand, and one of the preparation steps is lemmatization - reducing every word of the text to its normal form while taking the context into account.

Recently we ran into the problem of how long this step takes. In one particular task there were more than 100,000 documents with an average length of about 1,000 characters, and the processing had to run on an ordinary local computer rather than on our computation server. We could not find a ready-made solution online, so we came up with one ourselves, and I would like to share it by presenting a comparative analysis of the two most popular lemmatization libraries in this article.



PyMorphy2


One of the most popular libraries is PyMorphy2 - it appears in almost every solution you can find online. We used this library as well, and it served us perfectly until we needed to lemmatize the entire database (as mentioned above, more than 100 thousand short documents). Analyzing that volume of documents would have taken PyMorphy2 almost 10 hours, with the processor load averaging only about 30% the whole time (Intel Core i7 7740X).
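
For reference, the straightforward per-word approach with PyMorphy2 looks roughly like the sketch below (the regular-expression tokenization is a simplifying assumption for illustration, not our exact preprocessing):

import re
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

def lemmatize_pymorphy(text):
    # Naive tokenization into word tokens; the real pipeline did more cleanup.
    words = re.findall(r'\w+', text)
    # parse() returns candidate analyses; take the normal form of the most probable one.
    return [morph.parse(word)[0].normal_form for word in words]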

PyMystem3


Looking for another solution, we evaluated the PyMystem3 library from Yandex, but the result was almost twice as bad (in time) as PyMorphy2: processing 100 thousand documents would have taken about 16 hours.
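
For comparison, the per-document baseline we measured was essentially the following sketch (mystem.lemmatize is the actual PyMystem3 call; the surrounding loop and filtering are our own simplification):

from pymystem3 import Mystem

mystem = Mystem()

def lemmatize_one_by_one(texts):
    # One lemmatize() call per document; on short texts the per-call
    # overhead dominates, which is why the CPU sits almost idle.
    return [[w for w in mystem.lemmatize(text) if w.strip()] for text in texts]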

Some magic


It seemed strange to us that the processor load was almost zero. It was also strange that getting the result for a single text, even a long one (3-4 thousand characters), took PyMystem3 about 1 second. So we decided to combine the texts, inserting a separator between them that would let us restore the structure of our list of documents afterwards, and feed the combined text in for lemmatization.

Python solution code:

from pymystem3 import Mystem

mystem = Mystem()

def checkExecTimeMystemOneText(texts):
    # Split the list of documents into chunks of 1000 texts each.
    lol = lambda lst, sz: [lst[i:i + sz] for i in range(0, len(lst), sz)]
    txtpart = lol(texts, 1000)
    res = []
    for txtp in txtpart:
        # Join the chunk into one string, marking document boundaries with ' br '.
        alltexts = ' '.join([txt + ' br ' for txt in txtp])

        # A single Mystem call per chunk instead of one call per document.
        words = mystem.lemmatize(alltexts)
        doc = []
        for txt in words:
            if txt != '\n' and txt.strip() != '':
                if txt == 'br':
                    # Separator reached: the current document is complete.
                    res.append(doc)
                    doc = []
                else:
                    doc.append(txt)
    return res

We combined 1,000 documents at a time, using "br" as the separator (note that the texts are in Russian, and we had already removed Latin characters and special characters). This solution sped lemmatization up considerably: on average it took about 25 minutes for all 100 thousand documents, with the processor load at 20-40%. Admittedly, this solution uses more RAM - about 3-7 GB on average - but that is an acceptable trade-off. If there is not enough memory, you can reduce the number of documents combined per chunk; processing will slow down, but it will still be much faster than one text at a time. Conversely, if memory allows, you can increase this value and get the result even faster.
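
A hypothetical usage of the function above (docs here stands for the already preprocessed list of Russian texts; the name is illustrative):

# docs is assumed to be the preprocessed list of documents.
lemmatized_docs = checkExecTimeMystemOneText(docs)

# To reduce peak RAM, lower the chunk size inside the function,
# e.g. txtpart = lol(texts, 500); with enough memory, raise it
# to get the result even faster.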

The table below compares the libraries when run on a local computer (Intel Core i7 7740X, 32 GB of RAM):
Method                             Time          RAM (MB)       CPU load
PyMorphy2                          ~ 9.5 hours   500 - 600      25 - 30%
PyMystem3, one text at a time      ~ 16 hours    500 - 700      0 - 1%
PyMystem3, 1000 texts at a time    26 minutes    2000 - 7000    25 - 40%

We would be interested to hear other specialists' opinions. Perhaps someone has found an even more time-efficient way of lemmatizing short texts?

Mikhail Kabakov, Senior Software Engineer,
Anna Mikhailova, Head of Data Mining,
Codex Consortium
