DeepPavlov: Keras for Natural Language Processing Helps Answer Questions about COVID-19

In deep learning areas such as image processing, the Keras library plays a key role: it drastically simplifies transfer learning and the use of pretrained models. In natural language processing (NLP), solving reasonably complex problems, such as question answering or intent classification, requires combining a series of models. In this article, we show how the DeepPavlov library simplifies building model pipelines for NLP. Using DeepPavlov and Azure ML, we will build a question-answering neural network trained on a COVID-19 dataset.



, , , BERT. BERT, , , , .


BERT , , BERT . , , , , — , — , TF-IDF ( ) .


, . , :


  • BERT,
  • ,

.


DeepPavlov


DeepPavlov . :


  • ;
  • , config-;
  • , ;
  • Python SDK .

NLP . REST API Microsoft Bot Framework. , DeepPavlov , Keras .


You can try DeepPavlov in the online demo at demo.deeppavlov.ai.


BERT DeepPavlov


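Let's start with a BERT-based example. DeepPavlov includes a config for sentiment classification of Twitter messages. The chainer section of that config describes a pipeline made of the following steps: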


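  • simple_vocab converts the expected output (y), i.e. the class name, into a numeric id (y_ids);
  • transformers_bert_preprocessor takes the input text x and produces the data that BERT expects;
  • transformers_bert_embedder produces BERT embeddings for the input text;
  • one_hotter encodes y_ids into the one-hot encoding needed by the classifier's final layer;
  • keras_classification_model is the classification model, a CNN-based classifier with the parameters given in the config;
  • proba2labels is the final step, which converts the network output into the corresponding label.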
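
The idea of a pipeline whose steps read named inputs and write named outputs can be sketched in a few lines of plain Python. This is a toy illustration only, not DeepPavlov's chainer implementation: run_pipe, the lambda steps, and the two-label vocabulary are all hypothetical stand-ins.

```python
# Toy illustration only: NOT DeepPavlov's chainer implementation.
# Each step declares which named values it reads ("in") and writes ("out").

def run_pipe(pipe, inputs):
    """Run steps in order, passing values through a shared name -> value dict."""
    memory = dict(inputs)
    for step in pipe:
        args = [memory[name] for name in step["in"]]
        result = step["fn"](*args)
        # A step with a single output returns one value; wrap it for zip().
        if len(step["out"]) == 1:
            result = (result,)
        memory.update(zip(step["out"], result))
    return memory

# Hypothetical stand-ins for a preprocessor, a classifier, and a label lookup.
labels = ["negative", "positive"]

pipe = [
    {"in": ["x"], "out": ["tokens"],
     "fn": lambda xs: [s.lower().split() for s in xs]},
    {"in": ["tokens"], "out": ["y_ids"],
     "fn": lambda ts: [int("good" in t) for t in ts]},
    {"in": ["y_ids"], "out": ["y_pred"],
     "fn": lambda ids: [labels[i] for i in ids]},
]

state = run_pipe(pipe, {"x": ["Good movie", "Bad plot"]})
print(state["y_pred"])
```

Because every step only names its inputs and outputs, a step can be swapped for another one with the same interface without touching the rest of the pipeline.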

:


  • dataset_reader: describes where the data comes from and how to read it;
  • train: training parameters;
  • .

, :


python -m deeppavlov install sentiment_twitter_bert_emb.json
python -m deeppavlov download sentiment_twitter_bert_emb.json
python -m deeppavlov train sentiment_twitter_bert_emb.json

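The install command installs the libraries the model needs (in this case Keras, transformers, etc.), download fetches the files the config depends on, and the last command runs the training.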


, :


python -m deeppavlov interact sentiment_twitter_bert_emb.json

Python SDK:


from deeppavlov import build_model, configs
model = build_model(configs.classifiers.sentiment_twitter_bert_emb)
result = model(["This is input tweet that I want to analyze"])

: ODQA


, BERT, , , ODQA (Open Domain Question Answering). ODQA , Wikipedia. , , . BERT .


, ODQA :


  • , .. ;
  • BERT-, — .

Image from the DeepPavlov blog


ODQA DeepPavlov , R-NET, BERT. , , ODQA BERT. "" COVID-19 OpenResearch Dataset, 52 000 COVID-19. , .


Azure ML


Azure Machine Learning, Notebooks. AzureML — Dataset. COVID-19 Semantic Scholar. , JSON-.


Azure ML Dataset. Azure ML Portal, Datasets from web files. file, . tabular, . URL, .


COVID dataset


, , . , notebook compute, . ODQA , Azure ML NC12 112 . .


:


from azureml.core import Workspace, Dataset
workspace = Workspace.from_config()
dataset = Dataset.get_by_name(workspace, name='COVID-NC')

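The dataset is a compressed .tar.gz archive. To unpack it, we mount the dataset as a directory and run a UNIX command: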


mnt_ctx = dataset.mount('data')
mnt_ctx.start()
!tar -xvzf ./data/noncomm_use_subset.tar.gz
mnt_ctx.stop()
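
Where a tar binary is not available, the same unpacking step can be sketched with Python's standard tarfile module. The helper name extract_archive and all paths below are hypothetical; the demo builds a throwaway archive so that it runs on its own.

```python
import os
import tarfile
import tempfile

def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return names

# Self-contained demo on a throwaway archive (all paths are hypothetical).
work = tempfile.mkdtemp()
sample = os.path.join(work, "paper.json")
with open(sample, "w") as f:
    f.write("{}")
archive = os.path.join(work, "sample.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(sample, arcname="subset/paper.json")

names = extract_archive(archive, os.path.join(work, "out"))
print(names)
```

Only use extractall on archives from a source you trust, since archive members can carry unexpected paths.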

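The archive unpacks into the noncomm_use_subset directory, which contains .json files with abstract and body_text fields. To extract the text into separate files, we use a short Python script: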


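import glob
import json
import os
from os.path import basename

def get_text(s):
    # Join the 'text' field of every section into one string
    return ' '.join([x['text'] for x in s])

os.makedirs('text', exist_ok=True)

# Convert every paper from JSON to a plain-text file in ./text
for fn in glob.glob('noncomm_use_subset/pdf_json/*'):
    with open(fn) as f:
        x = json.load(f)
    nfn = os.path.join('text', basename(fn).replace('.json', '.txt'))
    with open(nfn, 'w') as f:
        f.write(get_text(x['abstract']))
        f.write('\n')  # keep abstract and body from running together
        f.write(get_text(x['body_text']))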
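
To see what get_text does to a single paper, here is a minimal sketch on a made-up record. The sample dictionary is hypothetical and only mimics the two fields the script reads (abstract and body_text, each a list of sections with a text key).

```python
def get_text(sections):
    """Join the 'text' field of each section with single spaces."""
    return ' '.join(x['text'] for x in sections)

# Hypothetical record that only mimics the fields used by the script.
paper = {
    "abstract": [{"text": "Coronaviruses are enveloped RNA viruses."}],
    "body_text": [{"text": "Introduction."}, {"text": "Methods."}],
}

print(get_text(paper["abstract"]))
print(get_text(paper["body_text"]))
print(repr(get_text([])))  # a paper without an abstract yields an empty string
```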

We now have a text directory that contains the papers as plain text. The original directory can be removed:


!rm -fr noncomm_use_subset

ODQA


First, let's make sure the pretrained ODQA model from DeepPavlov works. We use the en_odqa_infer_wiki config:


import sys
!{sys.executable} -m pip --quiet install deeppavlov
!{sys.executable} -m deeppavlov install en_odqa_infer_wiki
!{sys.executable} -m deeppavlov download en_odqa_infer_wiki

. , , , . !


, :


from deeppavlov import configs
from deeppavlov.core.commands.infer import build_model
odqa = build_model(configs.odqa.en_odqa_infer_wiki)
answers = odqa([ "Where did guinea pigs originate?",
                 "When did the Lynmouth floods happen?" ])

:


['Andes of South America', '1804']

, Wikipedia. , :


  • What is coronavirus? — a strain of a particular virus
  • What is COVID-19? — nest on roofs or in church towers
  • Where did COVID-19 originate? — northern coast of Appat
  • When was the last pandemic? — 1968

, … , , . — .



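The first stage of the pipeline is a document ranker, which has to be retrained on our own documents. DeepPavlov makes this straightforward. The ODQA config uses en_ranker_tfidf_wiki as the ranker, so we load that config and change data_path to point at our text files: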


from deeppavlov.core.common.file import read_json
model_config = read_json(configs.doc_retrieval.en_ranker_tfidf_wiki)
model_config["dataset_reader"]["data_path"] = os.path.join(os.getcwd(),"text")
model_config["dataset_reader"]["dataset_format"] = "txt"
model_config["train"]["batch_size"] = 1000

, .


, :


from deeppavlov import train_model
doc_retrieval = train_model(model_config)
doc_retrieval(['hydroxychloroquine'])

, .
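
To make the idea behind a TF-IDF ranker concrete, here is a minimal pure-Python sketch. It is not DeepPavlov's implementation: the function names, the whitespace tokenizer, the smoothed IDF formula, and the three sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real rankers do more)."""
    return text.lower().split()

def build_index(docs):
    """Return per-document TF-IDF vectors and the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def rank(query, vectors, idf, top_n=2):
    """Return indices of the top_n documents by dot product with the query."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:top_n]

docs = [
    "hydroxychloroquine was well tolerated in the trial",
    "coronaviruses are enveloped rna viruses",
    "the incubation period is several days",
]
vectors, idf = build_index(docs)
print(rank("hydroxychloroquine trial", vectors, idf, top_n=1))
```

The ranker only narrows the collection down to a few candidate documents; extracting the actual answer from them is left to the reading-comprehension model.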


ODQA , :


# Download all the SQuAD models
squad = build_model(configs.squad.multi_squad_noans_infer, download = True)
# Do not download the ODQA models, we've just trained it
odqa = build_model(configs.odqa.en_odqa_infer_wiki, download = False)
odqa(["what is coronavirus?","is hydroxychloroquine suitable?"])

:


['an imperfect gold standard for identifying King County influenza admissions',
 'viral hepatitis']


BERT Q&A


DeepPavlov , Stanford Question AnsweringDataset (SQuAD): R-NET BERT. R-NET. BERT. squad_bert_infer - BERT:


!{sys.executable} -m deeppavlov install squad_bert_infer
bsquad = build_model(configs.squad.squad_bert_infer, download = True)

ODQA, :


{
   "class_name": "logit_ranker",
   "squad_model": 
    {"config_path": ".../multi_squad_noans_infer.json"},
   "in": ["chunks","questions"],
   "out": ["best_answer","best_answer_score"]
}

As this fragment shows, the component refers to multi_squad_noans_infer. To switch ODQA to BERT, we change squad_model so that it points at squad_bert_infer:


odqa_config = read_json(configs.odqa.en_odqa_infer_wiki)
odqa_config['chainer']['pipe'][-1]['squad_model']['config_path'] = (
    '{CONFIGS_PATH}/squad/squad_bert_infer.json')
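
Indexing the pipeline with [-1] works only while the logit_ranker component stays last. A slightly more defensive variant, sketched below, looks the component up by its class_name instead. The helper find_component and the trimmed config dictionary (including the some_ranker entry and old.json path) are hypothetical and only mimic the fields shown in the fragment above.

```python
def find_component(config, class_name):
    """Return the first pipeline component with the given class_name."""
    for component in config["chainer"]["pipe"]:
        if component.get("class_name") == class_name:
            return component
    raise KeyError(f"no component with class_name={class_name!r}")

# Hypothetical, heavily trimmed config that only mimics the fields used above.
config = {
    "chainer": {
        "pipe": [
            {"class_name": "some_ranker"},
            {"class_name": "logit_ranker",
             "squad_model": {"config_path": "old.json"}},
        ]
    }
}

ranker = find_component(config, "logit_ranker")
ranker["squad_model"]["config_path"] = "{CONFIGS_PATH}/squad/squad_bert_infer.json"
print(config["chainer"]["pipe"][-1]["squad_model"]["config_path"])
```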

, :


odqa = build_model(odqa_config, download = False)
odqa(["what is coronavirus?",
      "is hydroxychloroquine suitable?",
      "which drugs should be used?"])

, :


Q: what is coronavirus?  A: respiratory tract infection
Q: is hydroxychloroquine suitable?  A: well tolerated
Q: which drugs should be used?  A: antibiotics, lactulose, probiotics
Q: what is incubation period?  A: 3-5 days
Q: is patient infectious during incubation period?  A: MERS is not contagious
Q: how to contaminate virus?  A: helper-cell-based rescue system cells
Q: what is coronavirus type?  A: enveloped single stranded RNA viruses
Q: what are covid symptoms?  A: insomnia, poor appetite, fatigue, and attention deficit
Q: what is reproductive number?  A: 5.2
Q: what is the lethality?  A: 10%
Q: where did covid-19 originate?  A: uveal melanocytes
Q: is antibiotics therapy effective?  A: less effective
Q: what are effective drugs?  A: M2, neuraminidase, polymerase, attachment and signal-transduction inhibitors
Q: what is effective against covid?  A: Neuraminidase inhibitors
Q: is covid similar to sars?  A: All coronaviruses share a very similar organization in their functional and structural genes
Q: what is covid similar to?  A: thrombogenesis
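
A listing like the one above can be produced by calling the model once on a batch of questions and pairing each question with its answer. The sketch below assumes the model returns one answer string per question, in order, as in the earlier Wikipedia example. print_answers and fake_odqa are hypothetical names; fake_odqa is only a stand-in so the sketch runs without DeepPavlov installed.

```python
def print_answers(model, questions):
    """Call the model once on a batch and print question/answer pairs."""
    answers = model(questions)
    for question, answer in zip(questions, answers):
        print(f"Q: {question}  A: {answer}")
    return dict(zip(questions, answers))

# Hypothetical stand-in for the real model; in the notebook you would
# pass the odqa object built above instead.
def fake_odqa(questions):
    return [f"answer {i}" for i, _ in enumerate(questions)]

pairs = print_answers(fake_odqa, ["what is coronavirus?",
                                  "which drugs should be used?"])
```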


, Azure Machine Learning NLP DeepPavlov - . DeepPavlov , , , . COVID Kaggle , , DeepPavlov Azure Machine Learning. , DeepPavlov – .


Azure ML DeepPavlov . , . . Data Science , , , !

