In deep learning fields such as image processing, the Keras library plays a key role: it greatly simplifies transfer learning and the use of pre-trained models. In natural language processing (NLP), solving fairly complex problems such as question answering or intent classification requires combining a series of models. In this article, we show how the DeepPavlov library simplifies building model pipelines for NLP. Using DeepPavlov and Azure ML, we will build a question answering neural network trained on a COVID-19 dataset.

, , , BERT. BERT, , , , .
BERT , , BERT . , , , , — , — , TF-IDF ( ) .
DeepPavlov
DeepPavlov . :
- ;
- , config-;
- , ;
- Python SDK .
NLP . REST API Microsoft Bot Framework. , DeepPavlov , Keras .
DeepPavlov can be tried out online at demo.deeppavlov.ai.
BERT DeepPavlov
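DeepPavlov ships a ready-made config, sentiment_twitter_bert_emb.json, for classifying the sentiment of Twitter posts on top of BERT embeddings. The chainer section of this config defines the pipeline, which consists of the following steps: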
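- simple_vocab: turns the class labels (y) into numeric indices (y_ids);
- transformers_bert_preprocessor: converts the input text x into the format BERT expects;
- transformers_bert_embedder: computes BERT embeddings for the text;
- one_hotter: converts y_ids into one-hot encoding, which is needed to train the classifier;
- keras_classification_model: the classifier itself, a CNN;
- proba2labels: converts the predicted probabilities into a class label.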
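To make the idea of a chained pipeline concrete, here is a minimal sketch in plain Python. It is not DeepPavlov's implementation: `run_chain` and the three `toy_*` functions are hypothetical stand-ins, and no BERT or Keras model is involved. It only shows how each step reads named inputs from a shared memory and writes named outputs back, which is the pattern the steps above follow.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One pipeline step: (input names, output names, function).
Step = Tuple[Sequence[str], Sequence[str], Callable]


def run_chain(steps: List[Step], memory: Dict[str, object], out: Sequence[str]):
    """Run steps in order; each reads named values from memory and writes results back."""
    for in_names, out_names, fn in steps:
        result = fn(*[memory[name] for name in in_names])
        if len(out_names) == 1:
            result = (result,)
        memory.update(zip(out_names, result))
    return [memory[name] for name in out]


# Hypothetical stand-ins for the real components.
labels = ["negative", "positive"]


def toy_embedder(texts):
    # One "feature" per text: the number of exclamation marks.
    return [[t.count("!")] for t in texts]


def toy_classifier(features):
    # Fake class probabilities derived from that single feature.
    return [[0.2, 0.8] if f[0] > 0 else [0.7, 0.3] for f in features]


def toy_proba2labels(probas):
    # Pick the label with the highest probability.
    return [labels[max(range(len(p)), key=p.__getitem__)] for p in probas]


steps = [
    (["x"], ["emb"], toy_embedder),
    (["emb"], ["y_pred_probas"], toy_classifier),
    (["y_pred_probas"], ["y_pred_labels"], toy_proba2labels),
]

predictions = run_chain(steps, {"x": ["Great news!", "This is bad"]}, ["y_pred_labels"])
print(predictions)
```

Because every step only names what it consumes and produces, replacing one component (for example, a different embedder) does not require touching the others.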
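Besides chainer, the config contains other sections:
- dataset_reader: describes where the data comes from and how it is read;
- train: the training parameters.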
Once the config is ready, the model can be trained from the command line:
python -m deeppavlov install sentiment_twitter_bert_emb.json
python -m deeppavlov download sentiment_twitter_bert_emb.json
python -m deeppavlov train sentiment_twitter_bert_emb.json
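The install command installs the libraries the config needs (for example, Keras and transformers), and download fetches the data and pre-trained model files.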
After training, the model can be queried interactively from the command line:
python -m deeppavlov interact sentiment_twitter_bert_emb.json
or through the Python SDK:
from deeppavlov import build_model, configs
model = build_model(configs.classifiers.sentiment_twitter_bert_emb)
result = model(["This is input tweet that I want to analyze"])
: ODQA
, BERT, , , ODQA (Open Domain Question Answering). ODQA , Wikipedia. , , . BERT .
, ODQA :

ODQA DeepPavlov , R-NET, BERT. , , ODQA BERT. "" COVID-19 OpenResearch Dataset, 52 000 COVID-19. , .
Azure ML
Azure Machine Learning, Notebooks. AzureML — Dataset. COVID-19 Semantic Scholar. , JSON-.
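First we create an Azure ML Dataset. In the Azure ML Portal, open Datasets and choose the from web files option. Select the file dataset type rather than tabular, since the data is an archive and not a table. Then provide the URL of the archive.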

, , . , notebook compute, . ODQA , Azure ML NC12 112 . .
:
from azureml.core import Workspace, Dataset
workspace = Workspace.from_config()
dataset = Dataset.get_by_name(workspace, name='COVID-NC')
The dataset is a .tar.gz archive. To unpack it, we mount the dataset as a directory and run the UNIX tar command:
mnt_ctx = dataset.mount('data')
mnt_ctx.start()
!tar -xvzf ./data/noncomm_use_subset.tar.gz
mnt_ctx.stop()
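The archive unpacks into the noncomm_use_subset directory, which holds a set of .json files. In each file, the article text is stored in the abstract and body_text fields. To extract the text into separate files, we use a short Python script: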
import glob
import json
import os
from os.path import basename

def get_text(s):
    # Join the 'text' field of every paragraph in a section
    return ' '.join([x['text'] for x in s])

os.makedirs('text', exist_ok=True)
for fn in glob.glob('noncomm_use_subset/pdf_json/*'):
    with open(fn) as f:
        x = json.load(f)
    # Write one .txt file per article, named after the source .json file
    nfn = os.path.join('text', basename(fn).replace('.json', '.txt'))
    with open(nfn, 'w') as f:
        f.write(get_text(x['abstract']))
        f.write(get_text(x['body_text']))
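The text directory now contains all the articles as plain text files, so the original directory can be removed: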
!rm -fr noncomm_use_subset
ODQA
First, let's check how the pre-trained ODQA model in DeepPavlov works. We install it using the en_odqa_infer_wiki config:
import sys
!{sys.executable} -m pip --quiet install deeppavlov
!{sys.executable} -m deeppavlov install en_odqa_infer_wiki
!{sys.executable} -m deeppavlov download en_odqa_infer_wiki
. , , , . !
, :
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model
odqa = build_model(configs.odqa.en_odqa_infer_wiki)
answers = odqa([ "Where did guinea pigs originate?",
"When did the Lynmouth floods happen?" ])
:
['Andes of South America', '1804']
These answers come from Wikipedia. If we ask the same model about COVID-19, we get:
- What is coronavirus? — a strain of a particular virus
- What is COVID-19? — nest on roofs or in church towers
- Where did COVID-19 originate? — northern coast of Appat
- When was the last pandemic? — 1968
, … , , . — .
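The component that has to be retrained on our documents is the ranker. The ODQA pipeline takes it from the en_ranker_tfidf_wiki config, so we load that config and change data_path, along with a few related settings, to point at our text files: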
from deeppavlov.core.common.file import read_json
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = os.path.join(os.getcwd(),"text")
model_config["dataset_reader"]["dataset_format"] = "txt"
model_config["train"]["batch_size"] = 1000
, .
, :
from deeppavlov import train_model
doc_retrieval = train_model(model_config)
doc_retrieval(['hydroxychloroquine'])
, .
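To illustrate what a TF-IDF ranker does, here is a minimal sketch in plain Python. It is a simplification and not DeepPavlov's ranker: the function names, the smoothed IDF formula, and the three sample documents are assumptions made for the example. It scores each document by cosine similarity between TF-IDF vectors and returns the best matches for a query.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """docs maps a document name to its text. Returns (tf-idf vectors, idf table)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    n_docs = len(docs)
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = {name: {t: n * idf[t] for t, n in c.items()} for name, c in counts.items()}
    return vectors, idf


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, vectors, idf, top_n=2):
    """Return up to top_n document names with a non-zero similarity to the query."""
    q_counts = Counter(tokenize(query))
    q_vec = {t: n * idf[t] for t, n in q_counts.items() if t in idf}
    scored = sorted(((cosine(q_vec, v), name) for name, v in vectors.items()), reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]


# Hypothetical miniature corpus standing in for the extracted article files.
docs = {
    "paper_a.txt": "Hydroxychloroquine was evaluated as a treatment in a small trial.",
    "paper_b.txt": "Coronavirus transmission in bats and other animal reservoirs.",
    "paper_c.txt": "Influenza admissions in King County hospitals.",
}
vectors, idf = build_index(docs)
top = rank("hydroxychloroquine treatment", vectors, idf)
print(top)
```

In the full pipeline, the documents returned by the ranker are what the question answering model then reads to find an answer.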
ODQA , :
squad = build_model(configs.squad.multi_squad_noans_infer, download = True)
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download = False)
odqa(["what is coronavirus?","is hydroxychloroquine suitable?"])
:
['an imperfect gold standard for identifying King County influenza admissions',
'viral hepatitis']
…
BERT Q&A
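DeepPavlov provides two pre-trained question answering models trained on the Stanford Question Answering Dataset (SQuAD): R-NET and BERT. So far we have used R-NET. Let's now switch to BERT. The squad_bert_infer config is a good starting point for using BERT: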
!{sys.executable} -m deeppavlov install squad_bert_infer
bsquad = build_model(configs.squad.squad_bert_infer, download = True)
ODQA, :
{
"class_name": "logit_ranker",
"squad_model":
{"config_path": ".../multi_squad_noans_infer.json"},
"in": ["chunks","questions"],
"out": ["best_answer","best_answer_score"]
}
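A component wired this way receives several text chunks and one question, and produces a single best answer with its score. The sketch below shows that selection step in plain Python under stated assumptions: `pick_best_answer` and `toy_reader` are hypothetical names, and the reader here is a word-overlap heuristic, not a SQuAD model. It is meant only to illustrate choosing the highest-scoring answer across chunks.

```python
def pick_best_answer(chunks, question, reader):
    """Run the reader on every chunk and keep the answer with the highest score."""
    best_answer, best_score = "", float("-inf")
    for chunk in chunks:
        answer, score = reader(chunk, question)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer, best_score


def toy_reader(chunk, question):
    # Hypothetical reader: the score counts chunk words that also occur in the
    # question, and the "answer" is simply the first sentence of the chunk.
    q_words = set(question.lower().rstrip("?").split())
    c_words = chunk.lower().replace(".", " ").split()
    score = float(sum(1 for w in c_words if w in q_words))
    return chunk.split(".")[0], score


chunks = [
    "Influenza admissions rose in winter. Data came from hospitals.",
    "A coronavirus is a virus that infects mammals and birds. It has a crown-like shape.",
]
best_answer, best_score = pick_best_answer(chunks, "What is a coronavirus?", toy_reader)
print(best_answer, best_score)
```

Since the reader is passed in as an argument, a different question answering model can be used without changing the selection logic.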
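As we can see, the question answering model is referenced through the multi_squad_noans_infer config. To make ODQA use BERT, we change the config_path of squad_model so that it points to squad_bert_infer: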
odqa_config = read_json(configs.odqa.en_odqa_infer_wiki)
# The last pipe element is the logit_ranker; point it at the BERT-based SQuAD config
odqa_config['chainer']['pipe'][-1]['squad_model']['config_path'] = (
    '{CONFIGS_PATH}/squad/squad_bert_infer.json'
)
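Indexing the pipe with `[-1]` relies on the logit_ranker being the last element. A slightly more defensive variant, sketched below, looks the component up by its class_name and patches a copy of the config. The helper `set_squad_config` and the trimmed-down `toy_config` are hypothetical and exist only for illustration; the real config has more components and fields.

```python
import copy


def set_squad_config(config, new_path):
    """Return a copy of config whose logit_ranker points at new_path."""
    patched = copy.deepcopy(config)
    for component in patched["chainer"]["pipe"]:
        if component.get("class_name") == "logit_ranker":
            component["squad_model"]["config_path"] = new_path
            return patched
    raise KeyError("no logit_ranker component found in chainer.pipe")


# Hypothetical, trimmed-down stand-in for the ODQA config.
toy_config = {
    "chainer": {
        "pipe": [
            {"class_name": "some_ranker", "in": ["questions"], "out": ["chunks"]},
            {
                "class_name": "logit_ranker",
                "squad_model": {"config_path": "old_reader.json"},
                "in": ["chunks", "questions"],
                "out": ["best_answer", "best_answer_score"],
            },
        ]
    }
}

patched = set_squad_config(toy_config, "{CONFIGS_PATH}/squad/squad_bert_infer.json")
print(patched["chainer"]["pipe"][-1]["squad_model"]["config_path"])
```

Working on a deep copy keeps the original config intact, which makes it easy to build both variants and compare their answers.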
, :
odqa = build_model(odqa_config, download = False)
odqa(["what is coronavirus?",
"is hydroxychloroquine suitable?",
"which drugs should be used?"])
, :
, Azure Machine Learning NLP DeepPavlov - . DeepPavlov , , , . COVID Kaggle , , DeepPavlov Azure Machine Learning. , DeepPavlov – .
Azure ML DeepPavlov . , . . Data Science , , , !