Overview of the SpaCy NLP Library

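Natural language processing (NLP) has become very popular, because it is undoubtedly easier for people to communicate with machines the same way they communicate with each other.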




This article looks at NLP tasks using the Python library SpaCy. The project site is https://spacy.io/.


NLTK


NLTK (Natural Language ToolKit) is a long-established Python library for NLP.


SpaCy covers much of the same ground as NLTK. It is written in Cython, which makes it fast. It can be thought of as the numpy of NLP.


SpaCy and NLTK also differ in their APIs. SpaCy can be used together with Tensorflow and PyTorch.



, SpaCy , NLTK , .





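The central data structures in SpaCy are Doc and Vocab. The Doc owns the sequence of tokens and all their annotations. The Vocab owns a set of lookup tables shared across documents, so strings, word vectors, and lexical attributes are stored only once.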


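The Doc owns the data, while Span and Token are views that point into it, so there is a single source of truth. The Doc is constructed by the Tokenizer and then modified in-place by the components of the pipeline. The Language object coordinates these components: it takes raw text, sends it through the pipeline, and returns an annotated document.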
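The relationship between Doc, Token, Span, and Vocab can be checked directly. The following is a minimal sketch, assuming SpaCy is installed; it uses spacy.blank("en") (a tokenizer-only pipeline) so that no trained model is needed, and the sentence is just an illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")
doc = nlp("South Korea is still host to most of its factories.")

# Token and Span are views into the Doc rather than copies of its data.
token = doc[0]
span = doc[0:2]
print(token.text)
print(span.text)

# Both views point back to the Doc that owns them.
print(token.doc is doc, span.doc is doc)

# The Doc shares its Vocab with the Language object that created it.
print(doc.vocab is nlp.vocab)

# Strings are stored once in the Vocab and referenced by hash:
# token.orth is the hash, and the StringStore maps it back to text.
korea = doc[1]
print(korea.orth, nlp.vocab.strings[korea.orth])
```

Because Token and Span only point into the Doc, slicing a document is cheap and annotations set by pipeline components are visible through every view.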





For a deeper look at NLP with SpaCy, there is an online course: https://course.spacy.io/en



Colab .


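SpaCy is installed like any other Python package (with pip or conda). The commands below were used with Python 3.8.2 on Pop!_OS 20.04 (an Ubuntu-based distribution).
System dependencies on Ubuntu: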


sudo apt-get install build-essential python3-dev git

Install SpaCy and en_core_web_sm, a small English model (https://spacy.io/models/en):


pip3 install -U spacy
pip3 install -U spacy-lookups-data
python3 -m spacy download en_core_web_sm

SpaCy can also run on a GPU through CUDA. Google Colab is another option for running it. For Russian, there are SpaCy models at: https://github.com/buriy/spacy-ru



Colab: https://colab.research.google.com/drive/1BmOAjjYt-t_lT9suZNnf1j5ykDX5IYT0?usp=sharing
Load the SpaCy model into an nlp object:


import spacy
nlp = spacy.load("en_core_web_sm")

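Tokenization, lemmatization, and POS-tagging are available out of the box in SpaCy:
(token text, lemma, part of speech, stop-word flag)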


doc = nlp("While Samsung has expanded overseas, South Korea is still host to most of its factories and research engineers.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

While while SCONJ True
Samsung Samsung PROPN False
has have AUX True
expanded expand VERB False
overseas overseas ADV False
, , PUNCT False
South South PROPN False
Korea Korea PROPN False
is be AUX True
still still ADV True
host host NOUN False
to to ADP True
most most ADJ True
of of ADP True
its -PRON- DET True
factories factory NOUN False
and and CCONJ True
research research NOUN False
engineers engineer NOUN False
. . PUNCT False
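The token attributes printed above are commonly used for filtering. The following is a minimal sketch, assuming SpaCy is installed: it keeps only tokens that are neither stop words nor punctuation. It uses spacy.blank("en") because is_stop and is_punct are lexical attributes that do not need a trained model; lemma_ and pos_ do need one, so they are left out here.

```python
import spacy

# Blank English pipeline: tokenizer plus language defaults (stop-word list),
# so no trained model is required for this sketch.
nlp = spacy.blank("en")
doc = nlp(
    "While Samsung has expanded overseas, South Korea is still host "
    "to most of its factories and research engineers."
)

# Keep tokens that are neither stop words nor punctuation.
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(content_words)
```

The same list comprehension works unchanged on a Doc produced by a full model such as en_core_web_sm, where t.lemma_ could be collected instead of t.text.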


Dependency parsing:
(token text, dependency label (Universal Dependency), head token)


for token in doc:
    print(token.text, token.dep_, token.head)

While mark expanded
Samsung nsubj expanded
has aux expanded
expanded advcl is
overseas advmod expanded
, punct is
South compound Korea
Korea nsubj is
is ROOT is
still advmod is
host attr is
to prep host
most pobj to
of prep most
its poss factories
factories pobj of
and cc factories
research compound engineers
engineers conj factories
. punct is
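The (text, label, head) rows printed above are enough to rebuild the tree. The following is a minimal sketch in plain Python that uses a hand-copied subset of those rows (the main clause only); the names rows, children, and show are illustrative and not part of SpaCy. In SpaCy itself, the same navigation is available through token.head and token.children.

```python
from collections import defaultdict

# (token text, dependency label, head text), copied from the output above.
# Matching heads by text is a simplification that works here only because
# every word in this subset is unique.
rows = [
    ("South", "compound", "Korea"),
    ("Korea", "nsubj", "is"),
    ("is", "ROOT", "is"),
    ("still", "advmod", "is"),
    ("host", "attr", "is"),
    ("to", "prep", "host"),
    ("most", "pobj", "to"),
]

# The root is the token labelled ROOT; it is its own head.
root = next(text for text, dep, head in rows if dep == "ROOT")

# Group every non-root token under its head.
children = defaultdict(list)
for text, dep, head in rows:
    if text != head:
        children[head].append((text, dep))


def show(word, label="ROOT", depth=0):
    """Print the subtree under `word`, indented by depth."""
    print("  " * depth + f"{word} ({label})")
    for child, dep in children[word]:
        show(child, dep, depth + 1)


show(root)
```

Reading the indented output from the top gives the root verb first, then its subject and complements, which is the same structure the arcs in a dependency diagram encode.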


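A parse is easier to read as a picture. SpaCy includes a visualizer for this, displacy. For example, for the first 11 tokens: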


from spacy import displacy
displacy.render(doc[:11], style='dep', jupyter=True)

[Image: displacy dependency visualization of the first 11 tokens]


Named entity recognition:


doc2 = nlp("Nasa administrator Jim Bridenstine says at the moment of launch, he was praying.")
for ent in doc2.ents:
    print(ent.text, ent.label_)
displacy.render(doc2, style='ent', jupyter=True)

Nasa ORG
Jim Bridenstine PERSON
[Image: displacy named-entity visualization]







See also the SpaCy Universe: https://spacy.io/universe, a catalog of extensions and related projects, including ones built on PyTorch.



, . SpaCy , NLP state-of-the-art . , !

