In some areas of deep learning, such as image processing, the Keras library plays a key role by drastically simplifying transfer learning and the use of pre-trained models. In natural language processing (NLP), solving fairly complex tasks such as question answering or intent classification requires combining several models. In this article, we show how the DeepPavlov library simplifies building model pipelines for NLP. Using DeepPavlov and Azure ML, we will build a question answering neural network trained on a COVID-19 dataset.

, , , BERT. BERT, , , , .
BERT , , BERT . , , , , — , — , TF-IDF ( ) .
, . , :
.
DeepPavlov
DeepPavlov . :
- ;
- , config-;
- , ;
- Python SDK .
NLP . REST API Microsoft Bot Framework. , DeepPavlov , Keras .
DeepPavlov - demo.deeppavlov.ai.
BERT DeepPavlov
BERT. DeepPavlov , , Twitter. chainer
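The pipeline consists of the following components:
- simple_vocab converts the class labels (y) into numeric indices (y_ids);
- transformers_bert_preprocessor converts the input text x into the format BERT expects;
- transformers_bert_embedder computes BERT embeddings;
- one_hotter converts y_ids into one-hot encoding;
- keras_classification_model is the classifier itself, a CNN;
- proba2labels converts the network's output probabilities into labels.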
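To make the idea of chaining components concrete, here is a minimal sketch in plain Python of steps that exchange values by name. It only mimics the naming idea; `run_chain`, `toy_classifier` and `toy_proba2labels` are hypothetical helpers invented for this illustration, not DeepPavlov code, and the variable names `y_pred_probas` and `y_pred_labels` are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One step = (function, names of its inputs, names of its outputs).
Step = Tuple[Callable[..., tuple], Sequence[str], Sequence[str]]


def run_chain(steps: List[Step], memory: Dict[str, object],
              wanted: Sequence[str]) -> list:
    """Run the steps in order, passing values between them by name."""
    for func, in_names, out_names in steps:
        results = func(*(memory[name] for name in in_names))
        memory.update(zip(out_names, results))
    return [memory[name] for name in wanted]


LABELS = ["negative", "positive"]


def toy_classifier(texts):
    # Fake probabilities: the word "good" pushes a text toward "positive".
    return ([[0.1, 0.9] if "good" in t.lower() else [0.8, 0.2] for t in texts],)


def toy_proba2labels(probas):
    # Keep the label with the highest probability for every input.
    return ([LABELS[max(range(len(p)), key=p.__getitem__)] for p in probas],)


steps = [
    (toy_classifier, ["x"], ["y_pred_probas"]),
    (toy_proba2labels, ["y_pred_probas"], ["y_pred_labels"]),
]
labels = run_chain(steps, {"x": ["Good morning!", "Awful day"]},
                   ["y_pred_labels"])[0]
print(labels)
```

Because every step declares what it reads and what it writes, a step can be swapped for another with the same inputs and outputs without touching the rest of the chain.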
:
dataset_reader
— ;train
— ;- .
, :
python -m deeppavlov install sentiment_twitter_bert_emb.json
python -m deeppavlov download sentiment_twitter_bert_emb.json
python -m deeppavlov train sentiment_twitter_bert_emb.json
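The install command installs the libraries the config needs (for example, Keras, transformers, etc.), and download fetches the files the config refers to.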
, :
python -m deeppavlov interact sentiment_twitter_bert_emb.json
Python SDK:
from deeppavlov import build_model, configs

model = build_model(configs.classifiers.sentiment_twitter_bert_emb)
result = model(["This is input tweet that I want to analyze"])
: ODQA
, BERT, , , ODQA (Open Domain Question Answering). ODQA , Wikipedia. , , . BERT .
, ODQA :

ODQA DeepPavlov , R-NET, BERT. , , ODQA BERT. "" COVID-19 OpenResearch Dataset, 52 000 COVID-19. , .
Azure ML
Azure Machine Learning, Notebooks. AzureML — Dataset. COVID-19 Semantic Scholar. , JSON-.
Azure ML Dataset. Azure ML Portal, Datasets from web files. file, . tabular, . URL, .

, , . , notebook compute, . ODQA , Azure ML NC12 112 . .
:
from azureml.core import Workspace, Dataset
workspace = Workspace.from_config()
dataset = Dataset.get_by_name(workspace, name='COVID-NC')
.tar.gz
. , UNIX:
mnt_ctx = dataset.mount('data')
mnt_ctx.start()
!tar -xvzf ./data/noncomm_use_subset.tar.gz
mnt_ctx.stop()
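Where the `!tar` shell escape is not available, the same unpacking step can be sketched with the standard-library `tarfile` module. This is an alternative under stated assumptions, not what the article runs; `extract_archive` is a hypothetical helper, and the demo builds its own throwaway archive instead of touching the mounted dataset.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path: str, target_dir: str) -> list:
    """Unpack a .tar.gz archive into target_dir and return the member names."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only do this with archives from a source you trust.
        archive.extractall(target_dir)
    return names


# Self-contained demo: create a tiny archive, then unpack it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "paper.json")
    with open(src, "w") as f:
        f.write("{}")
    archive_path = os.path.join(tmp, "sample.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="sample/paper.json")
    extracted = extract_archive(archive_path, os.path.join(tmp, "out"))
    unpacked_ok = os.path.exists(
        os.path.join(tmp, "out", "sample", "paper.json"))
```

For the real dataset, the call would presumably look like `extract_archive('./data/noncomm_use_subset.tar.gz', '.')` while the mount is still active.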
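The noncomm_use_subset directory now holds .json files whose text sits in the abstract and body_text fields. A short Python script extracts it into plain-text files: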
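import glob
import json
import os
from os.path import basename

def get_text(s):
    # Join the 'text' field of every section into one string
    return ' '.join([x['text'] for x in s])

os.makedirs('text', exist_ok=True)
for fn in glob.glob('noncomm_use_subset/pdf_json/*'):
    with open(fn) as f:
        x = json.load(f)
    # One .txt file per source .json file
    nfn = os.path.join('text', basename(fn).replace('.json', '.txt'))
    with open(nfn, 'w') as f:
        f.write(get_text(x['abstract']))
        f.write(' ')  # keep the abstract and the body from running together
        f.write(get_text(x['body_text']))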
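The extraction logic can be checked without the dataset. The sketch below runs `get_text` on a made-up record that assumes the shape the script relies on: `abstract` and `body_text` are lists of objects with a `text` field. The record contents are invented for illustration only.

```python
def get_text(sections):
    # Join the 'text' field of every section into one string.
    return ' '.join(section['text'] for section in sections)


# Hypothetical record with the same layout as the dataset's .json files.
sample_paper = {
    "abstract": [{"text": "Short summary."}],
    "body_text": [{"text": "First paragraph."}, {"text": "Second paragraph."}],
}

abstract = get_text(sample_paper["abstract"])
body = get_text(sample_paper["body_text"])
empty = get_text([])  # a paper without an abstract yields an empty string

print(abstract)
print(body)
```

An empty section list gives an empty string rather than an error, so papers with a missing abstract still produce a text file.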
The text directory now contains the extracted files, so the original directory can be deleted:
!rm -fr noncomm_use_subset
ODQA
First, install the pre-trained ODQA model from DeepPavlov. It is described by the en_odqa_infer_wiki config:
import sys
!{sys.executable} -m pip --quiet install deeppavlov
!{sys.executable} -m deeppavlov install en_odqa_infer_wiki
!{sys.executable} -m deeppavlov download en_odqa_infer_wiki
. , , , . !
, :
from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model
odqa = build_model(configs.odqa.en_odqa_infer_wiki)
answers = odqa([ "Where did guinea pigs originate?",
"When did the Lynmouth floods happen?" ])
:
['Andes of South America', '1804']
, Wikipedia. , :
- What is coronavirus? — a strain of a particular virus
- What is COVID-19? — nest on roofs or in church towers
- Where did COVID-19 originate? — northern coast of Appat
- When was the last pandemic? — 1968
, … , , . — .
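To fix this, we retrain the document ranker on our own texts. ODQA uses the en_ranker_tfidf_wiki config for ranking, so we load it and change data_path to point at the directory with the extracted texts: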
import os

from deeppavlov.core.common.file import read_json
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = os.path.join(os.getcwd(),"text")
model_config["dataset_reader"]["dataset_format"] = "txt"
model_config["train"]["batch_size"] = 1000
, .
, :
from deeppavlov import train_model

doc_retrieval = train_model(model_config)
doc_retrieval(['hydroxychloroquine'])
, .
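For intuition about what a TF-IDF ranker does, here is a toy version in plain Python. It is not DeepPavlov's implementation; the tokenizer, the smoothed IDF formula, the cosine scoring and the three invented documents are all assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Return per-document TF-IDF vectors and the IDF table."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter(term for c in counts.values() for term in c)
    n_docs = len(docs)
    # Smoothed IDF: rare terms weigh more than terms found everywhere.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    vectors = {name: {t: n * idf[t] for t, n in c.items()}
               for name, c in counts.items()}
    return vectors, idf


def rank(query, vectors, idf, top_n=2):
    """Return up to top_n document names, best cosine similarity first."""
    q_vec = {t: n * idf[t]
             for t, n in Counter(tokenize(query)).items() if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))

    def cosine(doc_vec):
        dot = sum(w * doc_vec.get(t, 0.0) for t, w in q_vec.items())
        norm = q_norm * math.sqrt(sum(w * w for w in doc_vec.values()))
        return dot / norm if norm else 0.0

    scored = sorted(((cosine(v), name) for name, v in vectors.items()),
                    reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]


# Invented toy documents, for illustration only.
docs = {
    "a.txt": "This toy document mentions hydroxychloroquine twice: hydroxychloroquine.",
    "b.txt": "This toy document is about bats.",
    "c.txt": "This toy document is about masks.",
}
vectors, idf = build_index(docs)
print(rank("hydroxychloroquine", vectors, idf))
```

Documents that share no terms with the query score zero and are dropped, which is why a ranker trained on one corpus is of little use for questions about another.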
ODQA , :
squad = build_model(configs.squad.multi_squad_noans_infer, download=True)
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download=False)
odqa(["what is coronavirus?", "is hydroxychloroquine suitable?"])
:
['an imperfect gold standard for identifying King County influenza admissions',
'viral hepatitis']
…
BERT Q&A
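DeepPavlov offers two question answering models trained on the Stanford Question Answering Dataset (SQuAD): R-NET and BERT. So far we have used R-NET. Now let's switch to BERT. The squad_bert_infer config describes the BERT-based model: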
!{sys.executable} -m deeppavlov install squad_bert_infer
bsquad = build_model(configs.squad.squad_bert_infer, download = True)
ODQA, :
{
"class_name": "logit_ranker",
"squad_model":
{"config_path": ".../multi_squad_noans_infer.json"},
"in": ["chunks","questions"],
"out": ["best_answer","best_answer_score"]
}
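The "in" and "out" names of this fragment suggest the general idea: a reader model is asked the question against each retrieved chunk, and the highest-scoring answer wins. A minimal sketch of that idea follows; it assumes a reader that returns an (answer, score) pair, and `toy_reader` is a hypothetical stand-in that scores by word overlap, not a SQuAD model. The real component works on batches and may differ in detail.

```python
from typing import Callable, List, Tuple

Reader = Callable[[str, str], Tuple[str, float]]


def best_answer(chunks: List[str], question: str,
                reader: Reader) -> Tuple[str, float]:
    """Ask the reader about every chunk and keep the highest-scoring answer."""
    candidates = [reader(chunk, question) for chunk in chunks]
    return max(candidates, key=lambda pair: pair[1])


def toy_reader(chunk: str, question: str) -> Tuple[str, float]:
    # Stand-in scoring: the number of words shared by question and chunk.
    q_words = set(question.lower().rstrip("?").split())
    score = float(len(q_words & set(chunk.lower().split())))
    # Stand-in answer: the first sentence of the chunk.
    return chunk.split(".")[0], score


# Invented toy chunks, for illustration only.
chunks = [
    "Masks reduce droplet spread. More text.",
    "The virus is a coronavirus. It spreads by droplets.",
]
answer, score = best_answer(chunks, "what is the virus?", toy_reader)
print(answer, score)
```

Since the final answer is only as good as the reader's per-chunk scores, swapping in a stronger reader changes the result without changing the ranking logic around it.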
As we can see, it refers to multi_squad_noans_infer. To change the ODQA pipeline, we replace the squad_model path with squad_bert_infer:
odqa_config = read_json(configs.odqa.en_odqa_infer_wiki)
odqa_config['chainer']['pipe'][-1]['squad_model']['config_path'] = (
    '{CONFIGS_PATH}/squad/squad_bert_infer.json'
)
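Indexing the pipeline with `[-1]` works only while the component in question stays last. A slightly more defensive variant, sketched below on a toy dictionary, looks the component up by its class_name and leaves the original config untouched. `swap_reader` and `toy_config` are hypothetical names for this illustration; the toy config only imitates the shape of the fragment shown earlier.

```python
import copy


def swap_reader(config: dict, new_config_path: str) -> dict:
    """Return a copy of config whose logit_ranker uses new_config_path."""
    patched = copy.deepcopy(config)
    for component in patched["chainer"]["pipe"]:
        if component.get("class_name") == "logit_ranker":
            component["squad_model"]["config_path"] = new_config_path
            return patched
    raise KeyError("no logit_ranker component found in chainer.pipe")


# Toy config imitating the structure of the real one (illustration only).
toy_config = {
    "chainer": {
        "pipe": [
            {"class_name": "some_ranker"},
            {
                "class_name": "logit_ranker",
                "squad_model": {"config_path": "old_reader.json"},
                "in": ["chunks", "questions"],
                "out": ["best_answer", "best_answer_score"],
            },
        ]
    }
}

patched = swap_reader(toy_config, "{CONFIGS_PATH}/squad/squad_bert_infer.json")
print(patched["chainer"]["pipe"][1]["squad_model"]["config_path"])
```

Working on a deep copy keeps the loaded config reusable, which helps when comparing the answers of two readers side by side.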
, :
odqa = build_model(odqa_config, download = False)
odqa(["what is coronavirus?",
"is hydroxychloroquine suitable?",
"which drugs should be used?"])
, :
, Azure Machine Learning NLP DeepPavlov - . DeepPavlov , , , . COVID Kaggle , , DeepPavlov Azure Machine Learning. , DeepPavlov – .
Azure ML DeepPavlov . , . . Data Science , , , !