Comparing open source Python libraries for named entity recognition

Introduction

At our company we are building a service that lets you automatically create, manage, and safely store license agreements and other contracts between freelancers and their clients.

To solve this problem, I tried dozens of natural language processing solutions, including open source ones, and would like to share my experience with open source Python libraries for named entity recognition.

Named Entity Recognition

A few words about the problem itself. Named Entity Recognition (NER) is a branch of natural language processing whose software implementations find words and phrases that belong to predefined categories in speech and text. At first these categories were geographical names, names of people and organizations, and addresses. The concept has since been extended considerably, and NER is now also used to find relative and absolute dates, numbers, and so on.
Identifying named entities is a “gateway” to human language: it lets you identify and process a person's intentions and connect the words they use to the real world.
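
To make the shape of the task concrete, here is a minimal sketch of a toy, rule-based extractor. It is not how any of the libraries below work internally: the two regular expressions and the `toy_ner` helper are hypothetical and exist only to show that the result of NER is a list of text spans with category labels.

```python
import re

# Hypothetical hand-written patterns; real NER libraries learn such rules from data.
PATTERNS = {
    "DATE": re.compile(r"\b\d+ (?:days|months)\b"),
    "URL": re.compile(r"https?://\S+"),
}

def toy_ner(text):
    """Return (start, end, text, label) tuples sorted by position in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), m.group(), label))
    return sorted(found)

sample = "I want this site https://example.com ready in 10 days."
entities = toy_ner(sample)
for start, end, value, label in entities:
    print(f"{value!r} [{start}:{end}] -> {label}")
```

Every library compared below returns the same kind of information, only with learned models in place of hand-written patterns.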

Language inequality

To begin with, I would like to draw attention to the obvious inequality in software solutions for different languages. Most projects (including those created by Russian programmers) work with English. Finding ready-made models for Bahasa, Hindi or Arabic is a thankless task.

European languages are at least represented in the most popular libraries; African languages are practically absent from modern natural language processing. Meanwhile, I know from my own experience that the African continent is a huge and rich market, and this neglect is most likely just market inertia.

There are several solutions for the Russian language whose quality is surprisingly good; however, they lack the commercial backing and academic potential of the mature libraries built to process English.

Text to be processed

I took several sentences from different sources and combined them into a somewhat hypnotic text to test how well the selected libraries do their job.

english_text = ''' I want a person available 7 days and with prompt response all most every time. Only Indian freelancer need I need PHP developer who have strong experience in Laravel and Codeigniter framework for daily 4 hours. I need this work by Monday 27th Jan. should be free from plagiarism . 
Need SAP FICO consultant for support project needs to be work on 6 months on FI AREAWe.  Want a same site to be created as the same as this https://www.facebook.com/?ref=logo, please check the site before contacting to me and i want this site to be ready in 10 days. They will be ready at noon tomorrow .'''

russian_text = '''   110     ,     .        https://www.sobyanin.ru/  , 1 .     .51 (   :  , )  ?     2107   47     24,    . 
 c        10  1970 ,     -, . ,  5/1 8 000 ( )  00  .               .              - .'''

NLTK library

NLTK is a classic library for natural language processing. It is easy to use, does not take long to learn, and covers 99% of the tasks that come up in student projects.

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
for sent in nltk.sent_tokenize(english_text):
   for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
      if hasattr(chunk, 'label'):
         print(chunk)

NLTK did a good job, but to make the result richer we would have to train our own tagger (or choose another one from a fairly wide list). Is that worth it in 2020 when there are simpler solutions?

Output:

(GPE Indian/JJ)
(ORGANIZATION PHP/NNP)
(GPE Laravel/NNP)
(PERSON Need/NNP)
(ORGANIZATION SAP/NNP)
(ORGANIZATION FI/NNP)

Stanford CoreNLP

One way to extend the capabilities of NLTK is to use the classic Java library from Stanford, Stanford CoreNLP, together with the classic Python library. Quality improves significantly, with relatively low requirements.

from nltk.tag.stanford import StanfordNERTagger
jar = "stanford-ner-2015-04-20/stanford-ner-3.5.2.jar"
model = "stanford-ner-2015-04-20/classifiers/" 
st_3class = StanfordNERTagger(model + "english.all.3class.distsim.crf.ser.gz", jar, encoding='utf8') 
st_4class = StanfordNERTagger(model + "english.conll.4class.distsim.crf.ser.gz", jar, encoding='utf8') 
st_7class = StanfordNERTagger(model + "english.muc.7class.distsim.crf.ser.gz", jar, encoding='utf8')
for i in [st_3class.tag(english_text.split()), st_4class.tag(english_text.split()), st_7class.tag(english_text.split())]:
  for b in i:
    if b[1] != 'O':
        print(b)

The quality of the output has improved significantly, and given its speed and ease of use, NLTK looks quite suitable for industrial applications as well.

Output:

('PHP', 'ORGANIZATION')
('Laravel', 'LOCATION')
('Indian', 'MISC')
('PHP', 'ORGANIZATION')
('Laravel', 'LOCATION')
('Codeigniter', 'PERSON')
('SAP', 'ORGANIZATION')
('FICO', 'ORGANIZATION')
('PHP', 'ORGANIZATION')
('Laravel', 'LOCATION')
('Monday', 'DATE')
('27th', 'DATE')
('Jan.', 'DATE')
('SAP', 'ORGANIZATION')
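
The tagger labels each token separately, so a date such as "Monday 27th Jan." arrives as three tuples. A minimal sketch of one way to join neighbouring tokens that share a tag is shown below. The helper `merge_tagged_tokens` is hypothetical and not part of NLTK or Stanford NER, and the input is a hand-made fragment in the same (token, tag) shape as the output above.

```python
from itertools import groupby

def merge_tagged_tokens(tagged, outside="O"):
    """Join runs of adjacent tokens that share a tag into (phrase, tag) pairs."""
    spans = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            spans.append((" ".join(token for token, _ in group), tag))
    return spans

# Hand-made fragment in the (token, tag) shape returned by the tagger.
tagged = [
    ("work", "O"), ("by", "O"),
    ("Monday", "DATE"), ("27th", "DATE"), ("Jan.", "DATE"),
    ("Need", "O"), ("SAP", "ORGANIZATION"), ("FICO", "ORGANIZATION"),
]
merged = merge_tagged_tokens(tagged)
print(merged)
```

Because the tags carry no begin/inside markers, two different entities of the same type that stand next to each other are merged into one phrase; that is a limitation of this simple approach.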

spaCy

spaCy is an open source Python library for natural language processing, published under the MIT license (!). It was created and is developed by Matthew Honnibal and Ines Montani, founders of the software company Explosion.
As a rule, everyone who needs to solve natural language processing problems learns about this library sooner or later. Most of its functions work “out of the box”, and the developers take care to keep the library easy to use.
spaCy offers 18 tags for named entities, as well as a simple way to retrain your own model. Add excellent documentation, a huge community, and good support, and it becomes clear why this solution has become so popular in the last couple of years.

import spacy

# Assumes the model was installed with: python -m spacy download en_core_web_lg
model_sp = spacy.load("en_core_web_lg")
for ent in model_sp(english_text).ents:
    print(ent.text.strip(), ent.label_)

The result is much better, and the code is much simpler and easier to understand. The downsides are large models, slow operation, relatively illogical "tags", and the lack of models for many languages, including Russian (although there are multilingual models).

Output:

7 days DATE
New York GPE
Indian NORP
Laravel LOC
Codeigniter NORP
4 hours TIME
Monday 27th Jan. DATE
FICO ORG
6 months DATE
10 days DATE
noon TIME
tomorrow DATE
Iceland GPE
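
The libraries in this comparison name their categories differently (ORGANIZATION, ORG, I-ORG and so on), which makes results hard to compare side by side. Below is a minimal sketch of mapping labels onto a small common set. The `COMMON_LABELS` table and the `normalize_label` helper are hypothetical, and the choice of targets (for example folding GPE into LOC) is an assumption made only for this illustration.

```python
# Hypothetical mapping onto a small common label set; the targets
# (for example GPE -> LOC) are an assumption made for this illustration.
COMMON_LABELS = {
    "PERSON": "PER", "PER": "PER",
    "ORGANIZATION": "ORG", "ORG": "ORG",
    "LOCATION": "LOC", "LOC": "LOC", "GPE": "LOC",
}

def normalize_label(label):
    """Strip a BIO/BILOU prefix such as 'B-' or 'I-' and map onto the common set."""
    prefix, sep, rest = label.partition("-")
    if sep and prefix in {"B", "I", "L", "U"}:
        label = rest
    # Labels without a mapping (DATE, NORP, ...) pass through unchanged.
    return COMMON_LABELS.get(label.upper(), label.upper())

for raw in ["ORGANIZATION", "GPE", "I-LOC", "B-ORG", "NORP"]:
    print(raw, "->", normalize_label(raw))
```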

Flair

Flair offers a much deeper immersion in the subject area; the library was created, in fact, for research. The documentation is decent, though with some gaps; there is integration with a large number of other libraries, and the code is clear, logical, and readable.
The library has a well-developed community that is not oriented only toward English. Thanks to the large number of available models, Flair is significantly more democratic in its choice of languages than spaCy.

from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
from flair.data import Sentence
s = Sentence(english_text)
tagger.predict(s)
for entity in s.get_spans('ner'):
    print(entity)

The pretrained model did not perform at its best. However, few people use Flair “out of the box”: it is primarily a library for building your own tools. The same, with reservations, can be said about the next library.

Output:

Span [6,7]: "7 days" [− Labels: DATE (0.9329)]
Span [17]: "Indian" [− Labels: NORP (0.9994)]
Span [35,36]: "4 hours." [− Labels: TIME (0.7594)]
Span [42,43,44]: "Monday 27th Jan." [− Labels: DATE (0.9109)]
Span [53]: "FICO" [− Labels: ORG (0.6987)]
Span [63,64]: "6 months" [− Labels: DATE (0.9412)]
Span [98,99]: "10 days." [− Labels: DATE (0.9320)]
Span [105,106]: "noon tomorrow" [− Labels: TIME (0.8667)]

DeepPavlov

DeepPavlov is an open source library built on TensorFlow and Keras.

The developers suggest using it primarily for “conversational” systems, chatbots, and so on, but the library is also excellent for research. Using it “in production” without serious work to customize and finish the solution is something that, it seems, even its creators at MIPT do not suggest.
The approach to code architecture is strange and illogical and contradicts the Zen of Python, but it nevertheless brings good results if you spend enough time getting to grips with it.

from deeppavlov import build_model, configs

# Downloads the OntoNotes BERT NER model on first use
ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)
result = ner_model([english_text])
# result[0] holds the tokens for each input text, result[1] the matching tags
tokens, tags = result[0][0], result[1][0]
for token, tag in zip(tokens, tags):
    if tag != 'O':
        print(token, tag)

The result is predictable, understandable, detailed, and one of the best. The model itself can also be used directly in Hugging Face Transformers, which in many ways removes the complaints about the code architecture.

Output:

7 B-DATE
days I-DATE
Indian B-NORP
Laravel B-PRODUCT
Codeigniter B-PRODUCT
daily B-DATE
4 B-TIME
hours I-TIME
Monday B-DATE
27th I-DATE
Jan I-DATE
6 B-DATE
months I-DATE
FI B-PRODUCT
AREAWe I-PRODUCT
10 B-DATE
days I-DATE
noon B-TIME
tomorrow B-DATE
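
The tags here use the BIO scheme: B- opens an entity and I- continues it. A minimal sketch of collapsing such tags into whole phrases follows. The `bio_to_spans` helper is hypothetical and not part of DeepPavlov, and the input is a short hand-made fragment modelled on the output above.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-X, I-X, ... into (phrase, X) pairs."""
    spans, current, current_label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            current.append(token)
            continue
        # Anything else closes the open span, if there is one.
        if current:
            spans.append((" ".join(current), current_label))
        if prefix in ("B", "I"):  # a stray I- tag is treated as a span start
            current, current_label = [token], label
        else:
            current, current_label = [], None
    if current:
        spans.append((" ".join(current), current_label))
    return spans

# Hand-made fragment modelled on the tagged output above.
tokens = ["by", "Monday", "27th", "Jan", "for", "daily", "4", "hours"]
tags = ["O", "B-DATE", "I-DATE", "I-DATE", "O", "B-DATE", "B-TIME", "I-TIME"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Unlike plain per-token labels, the B- marker keeps two adjacent entities apart, which is why "daily" and "4 hours" stay separate in this fragment.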

deepmipt / ner

This is, in fact, the library with which DeepPavlov began. It is useful for understanding the direction of the developers' thinking and the progress they have made.

import ner
example = russian_text
def deepmint_ner(text):
  extractor = ner.Extractor()
  for m in extractor(text):
     print(m)
deepmint_ner(example)

Output:

Match(tokens=[Token(span=(7, 13), text='')], span=Span(start=7, end=13), type='LOC')
Match(tokens=[Token(span=(492, 499), text='')], span=Span(start=492, end=499), type='PER')
Match(tokens=[Token(span=(511, 520), text=''), Token(span=(521, 525), text='')], span=Span(start=511, end=525), type='PER')
Match(tokens=[Token(span=(591, 600), text='')], span=Span(start=591, end=600), type='LOC')
Match(tokens=[Token(span=(814, 820), text=''), Token(span=(821, 829), text='')], span=Span(start=814, end=829), type='PER')

Polyglot

One of the oldest libraries; its speed and the large number of supported languages keep it popular. On the other hand, the viral GPLv3 license does not allow it to be used fully in commercial development.

from polyglot.text import Text
for ent in Text(english_text).entities:
 print(ent[0],ent.tag)

Output:

Laravel I-LOC
SAP I-ORG
FI I-ORG

And for the Russian language:

!polyglot download embeddings2.ru ner2.ru
for ent in Text(russian_text).entities:
 print(ent[0],ent.tag)

The result is not the best, but the speed and good support mean it can be improved with some effort.

Output:

24 I-ORG
I-PER
I-LOC
I-PER
I-ORG
I-PER

AdaptNLP

Another new library with an extremely low barrier to entry for researchers.
AdaptNLP allows users, from students to experienced data engineers, to use modern NLP models and training methods.
The library is built on top of the popular Flair and Hugging Face Transformers libraries.

from adaptnlp import EasyTokenTagger
tagger = EasyTokenTagger()
sentences = tagger.tag_text(
    text = english_text, model_name_or_path = "ner-ontonotes"
)
spans = sentences[0].get_spans("ner")
for sen in sentences:
    for entity in sen.get_spans("ner"):
        print(entity)

The result is acceptable, and because the library lets you plug in a variety of models, it can be improved considerably with some effort (but why bother, if you have Flair and Hugging Face Transformers directly?). Nevertheless, its simplicity, wide range of tasks, good architecture, and the systematic efforts of its developers give hope that the library has a future.

Output:

DATE-span [6,7]: "7 days"
NORP-span [18]: "Indian"
PRODUCT-span [30]: "Laravel"
TIME-span [35,36,37]: "daily 4 hours"
DATE-span [44,45,46]: "Monday 27th Jan."
ORG-span [55]: "FICO"
DATE-span [65,66]: "6 months"
DATE-span [108,109]: "10 days"
TIME-span [116,117]: "noon tomorrow"

Stanza

Stanza from the Stanford NLP Group is Stanford University's gift to developers in 2020. It offers what spaCy lacked: multilingualism and a deep immersion in the language combined with ease of use.
If the community supports this library, it has every chance of becoming one of the most popular.

import stanza
stanza.download('en')
def stanza_nlp(text):
  nlp = stanza.Pipeline(lang='en', processors='tokenize,ner')
  doc = nlp(text)
  print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')
stanza_nlp(english_text)

Output:

entity: 7 days type: DATE
entity: Indian type: NORP
entity: Laravel type: ORG
entity: Codeigniter type: PRODUCT
entity: daily 4 hours type: TIME
entity: Monday 27th Jan. type: DATE
entity: SAP type: ORG
entity: FICO type: ORG
entity: 6 months type: DATE
entity: FI AREAWe type: ORG
entity: 10 days type: DATE
entity: noon tomorrow type: TIME

And for the Russian language:

import stanza
stanza.download('ru')
def stanza_nlp_ru(text):
  nlp = stanza.Pipeline(lang='ru', processors='tokenize,ner')
  doc = nlp(text)
  print(*[f'entity: {ent.text}\ttype: {ent.type}' for sent in doc.sentences for ent in sent.ents], sep='\n')
stanza_nlp_ru(russian_text)

Stanza works fast, the code is clean, and the result is good.

Output:

2020-05-15 08:01:18 INFO: Use device: cpu
2020-05-15 08:01:18 INFO: Loading: tokenize
2020-05-15 08:01:18 INFO: Loading: ner
2020-05-15 08:01:19 INFO: Done loading processors!
entity: type: LOC
entity: type: LOC
entity: type: PER
entity: 2107 type: MISC
entity: 47 type: MISC
entity: 24 type: MISC
entity: type: PER
entity: - type: LOC
entity: . type: LOC
entity: type: LOC
entity: type: LOC
entity: type: PER

AllenNLP

A research library built on PyTorch.
On the one hand, it has a simple architecture and runs fast; on the other hand, the developers are constantly changing something in the architecture, which affects how the library works as a whole.

from allennlp.predictors.predictor import Predictor
import allennlp_models.ner.crf_tagger
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/ner-model-2020.02.10.tar.gz")
allen_result = predictor.predict(
  sentence=english_text
)
for i in zip(allen_result['tags'], allen_result['words']):
    if (i[0]) != 'O':
      print(i)

The module works quickly, but the result is unacceptably poor.

Output:

('U-MISC', 'Indian')
('U-MISC', 'PHP')
('U-MISC', 'Laravel')
('U-MISC', 'Codeigniter')
('B-ORG', 'SAP')
('L-ORG', 'FICO')
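
The tags in this output follow the BILOU scheme: U- marks a single-token entity, while B-, I- and L- mark the beginning, inside and last token of a longer one. A minimal sketch of decoding such tags into phrases is given below. The `bilou_to_spans` helper is hypothetical and not part of AllenNLP, and the input is a hand-made fragment in the same (tag, word) shape as the output.

```python
def bilou_to_spans(pairs):
    """Decode (tag, word) pairs in BILOU notation into (phrase, label) pairs."""
    spans, current = [], []
    for tag, word in pairs:
        prefix, _, label = tag.partition("-")
        if prefix == "U":                  # unit: a single-token entity
            spans.append((word, label))
            current = []
        elif prefix == "B":                # begin a multi-token entity
            current = [word]
        elif prefix == "I" and current:    # continue the open entity
            current.append(word)
        elif prefix == "L" and current:    # last token closes the entity
            spans.append((" ".join(current + [word]), label))
            current = []
        else:                              # "O" or a malformed sequence
            current = []
    return spans

# Hand-made fragment in the (tag, word) shape of the output.
pairs = [("U-MISC", "Indian"), ("O", "need"), ("B-ORG", "SAP"), ("L-ORG", "FICO")]
decoded = bilou_to_spans(pairs)
print(decoded)
```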

HanLP

HanLP is an open source library from developers in China: a smart, well-developed, active project that, it seems to me, will find its niche beyond China's borders.
It is an NLP library for researchers and companies, built on TensorFlow 2.0.
HanLP comes with pretrained models for different languages, including English, Chinese, and many others.
The only problem is that the quality of the output “jumps” after each update of the library.

import hanlp

recognizer = hanlp.load(hanlp.pretrained.ner.MSRA_NER_BERT_BASE_ZH)
recognizer([list('上海华安工业（集团）公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。'),
                list('萨哈夫说，伊拉克将同联合国销毁伊拉克大规模杀伤性武器特别委员会继续保持合作。')])

Output:


[[('上海华安工业（集团）公司', 'NT', 0, 12), ('谭旭光', 'NR', 15, 18),
('张晚霞', 'NR', 21, 24),
('美国', 'NS', 26, 28),
('纽约现代艺术博物馆', 'NS', 28, 37)],
[('萨哈夫', 'NR', 0, 3),
('伊拉克', 'NS', 5, 8),
('联合国销毁伊拉克大规模杀伤性武器特别委员会', 'NT', 10, 31)]]

import hanlp
tokenizer = hanlp.utils.rules.tokenize_english
testing = tokenizer('Need SAP FICO consultant for support project needs to be work on 6 months on FI AREAWe')
recognizer = hanlp.load(hanlp.pretrained.ner.CONLL03_NER_BERT_BASE_UNCASED_EN)
recognizer(testing)

For English the result is unstable, but this can be solved by using the tokenizer from NLTK.

Output:

[('SAP FICO', 'ORG', 1, 3)]

PullEnti

A C# library for NER in Russian. In 2016 it took first place in the factRuEval-2016 competition. In 2018 the author ported the code to Java and Python.
Probably the most elegant solution for the Russian language.
It is fast and thorough, with attention to detail. The solution is rule-based, which naturally limits its development, but its autonomy, speed, and results give hope for the project's future.
There is a Python wrapper for the library, although it looks “abandoned”.

from pullenti_wrapper.processor import (
    Processor,
    MONEY,
    URI,
    PHONE,
    DATE,
    KEYWORD,
    DEFINITION,
    DENOMINATION,
    MEASURE,
    BANK,
    GEO,
    ADDRESS,
    ORGANIZATION,
    PERSON,
    MAIL,
    TRANSPORT,
    DECREE,
    INSTRUMENT,
    TITLEPAGE,
    BOOKLINK,
    BUSINESS,
    NAMEDENTITY,
    WEAPON,
)

processor = Processor([PERSON, ORGANIZATION, GEO, DATE, MONEY])
text = russian_text
result = processor(text)
result.graph

Output:

Natasha

Natasha seems to be one of the main NLP projects for the Russian language. It has a long history: it began as a rule-based solution built on the popular Yargy parser, and it now handles the main NLP tasks for Russian: tokenization, sentence segmentation, lemmatization, phrase normalization, parsing, NER tagging, and fact extraction.

from natasha import (
    Segmenter,
    MorphVocab,
    
    NewsEmbedding,
    NewsMorphTagger,
    NewsSyntaxParser,
    NewsNERTagger,
    
    PER,
    NamesExtractor,

    Doc
)

segmenter = Segmenter()
morph_vocab = MorphVocab()

emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)
syntax_parser = NewsSyntaxParser(emb)
ner_tagger = NewsNERTagger(emb)

names_extractor = NamesExtractor(morph_vocab)

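doc = Doc(russian_text)
doc.segment(segmenter)    # split the text into sentences and tokens
doc.tag_ner(ner_tagger)   # annotate named entities
doc.ner.print()           # print the text with entity spans marked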

The result, unfortunately, is not stable, unlike the custom rules of the Yargy parser from the same developer. However, the project is actively developing and shows results decent enough for commercial use.

Output:

110 ,
LOC───
.
https://www.sobyanin.ru/ , 1 .
.51 ( :
LOC───────────────
, ) ?
ORG PER────────────
2107 47
24, .
ORG──
c
PER────────────────
10 1970 ,
─────────
-, . , 5/1 8 000 ( )
LOC──────────── PER─────────
00
LO
.

PER────────────
- .

ner-d

The last module is a private project, not particularly popular, built on top of spaCy and Thinc. It is nevertheless worth attention for its approach to architecture, which emphasizes ease of use.

from nerd import ner
doc_nerd_d = ner.name(english_text)
text_label = [(X.text, X.label_) for X in doc_nerd_d]
print(text_label)

Output:

[('7 days', 'DATE'), ('Indian', 'NORP'), ('PHP', 'ORG'), ('Laravel', 'GPE'), ('daily 4 hours', 'DATE'), ('Monday 27th Jan.', 'DATE'), ('Need SAP FICO', 'PERSON'), ('6 months', 'DATE'), ('10 days', 'DATE'), ('noon', 'TIME'), ('tomorrow', 'DATE')]

Of all the projects, Stanza from the Stanford NLP Group seems to me the most “balanced” and convenient, with acceptable results and ease of use. Support for most languages out of the box, high-quality academic work, and the backing of the university's own research community make it the most promising project, in my opinion. Next time I will share my experience with "closed" solutions and the APIs they offer for natural language processing. All the code is available in Google Colab.
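
The comparison in this article is made by eye. One way to put a rough number on how much two libraries agree is the Jaccard overlap of their (text, label) pairs, sketched below with the entities copied from the Stanza and ner-d outputs in this article. The `jaccard` helper is hypothetical, and agreement between two systems is not accuracy: no hand-labelled reference is involved.

```python
def jaccard(a, b):
    """Share of (text, label) pairs both systems produced, out of all pairs either produced."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Entities copied from the Stanza and ner-d outputs shown in this article.
stanza_ents = [
    ("7 days", "DATE"), ("Indian", "NORP"), ("Laravel", "ORG"),
    ("Codeigniter", "PRODUCT"), ("daily 4 hours", "TIME"),
    ("Monday 27th Jan.", "DATE"), ("SAP", "ORG"), ("FICO", "ORG"),
    ("6 months", "DATE"), ("FI AREAWe", "ORG"), ("10 days", "DATE"),
    ("noon tomorrow", "TIME"),
]
nerd_ents = [
    ("7 days", "DATE"), ("Indian", "NORP"), ("PHP", "ORG"), ("Laravel", "GPE"),
    ("daily 4 hours", "DATE"), ("Monday 27th Jan.", "DATE"),
    ("Need SAP FICO", "PERSON"), ("6 months", "DATE"), ("10 days", "DATE"),
    ("noon", "TIME"), ("tomorrow", "DATE"),
]

shared = set(stanza_ents) & set(nerd_ents)
score = jaccard(stanza_ents, nerd_ents)
print(sorted(shared))
print(f"agreement: {score:.2f}")
```

Exact matching is strict: "noon tomorrow" as one TIME span and "noon" plus "tomorrow" as two spans count as a disagreement, so the score is best read as a lower bound on how similar the outputs are.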
