fastText is one of the most effective word vector representations for the Russian language. In practice, however, its size (several gigabytes) gets in the way. In this article we show how to shrink a fastText model from 2.7 gigabytes to 28 megabytes at a modest cost in quality (3-4%). Spoiler: quantization and feature selection work well, but matrix decompositions do not. We also publish a Python package for this compression, along with example compact models for Russian words.

Why this is needed and what it is about
, fastText: fastText , . unsupervised β n-. navec β glove- . .
: ? , β , , (, 300-), - . , ( ). , , , , , . , , (, ) . , , "" .
β , ELMO BERT. , fastText. fastText' β ( , ) n- ( ) . , , , , . fastText , n- .
fastText was developed at Facebook AI Research. In pseudocode, the vector of a word is computed roughly like this:
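def embed(word, model):
    # In-vocabulary words start from their own vector; OOV words start from zeros.
    if word in model.vocab:
        result = model.vectors_vocab[word].copy()  # copy so that += does not modify the model
        n = 1
    else:
        result = zeros()
        n = 0
    # Add the vectors of all character n-grams, looked up by their hash.
    for ngram in get_ngrams(word, model.min_n, model.max_n):
        result += model.vectors_ngrams[hash(ngram)]
        n += 1
    # Average over everything that was summed.
    return result / n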
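The pseudocode above leaves get_ngrams, hash, and zeros undefined. Below is a minimal runnable sketch of the same lookup on toy data. The helper names (get_ngrams, bucket_of), the FNV-1a hash, and the toy matrices are assumptions made for illustration; in particular, the hash is not guaranteed to reproduce the bucket indices of the real fastText or Gensim implementations.

```python
import numpy as np

def get_ngrams(word, min_n=3, max_n=5):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def bucket_of(ngram, num_buckets):
    """Hashing trick: map an n-gram to a row index (illustrative FNV-1a hash)."""
    h = 2166136261
    for byte in ngram.encode("utf-8"):
        h = ((h ^ byte) * 16777619) % 2**32
    return h % num_buckets

def embed(word, vocab, vectors_vocab, vectors_ngrams, min_n=3, max_n=5):
    """Average of the word's own vector (if known) and its n-gram vectors."""
    if word in vocab:
        result = vectors_vocab[vocab[word]].astype(np.float64)  # astype makes a copy
        n = 1
    else:
        result = np.zeros(vectors_ngrams.shape[1])
        n = 0
    for ngram in get_ngrams(word, min_n, max_n):
        result += vectors_ngrams[bucket_of(ngram, len(vectors_ngrams))]
        n += 1
    return result / max(n, 1)  # guard against words that yield no n-grams

# Toy model: 2 known words, 50 hash buckets, 8 dimensions (all hypothetical).
rng = np.random.default_rng(0)
vocab = {"кот": 0, "дом": 1}
vectors_vocab = rng.normal(size=(2, 8)).astype(np.float32)
vectors_ngrams = rng.normal(size=(50, 8)).astype(np.float32)

known = embed("кот", vocab, vectors_vocab, vectors_ngrams)
unknown = embed("котик", vocab, vectors_vocab, vectors_ngrams)
print(known.shape, unknown.shape)
```

Because several n-grams can land in the same bucket, the n-gram matrix has a fixed number of rows regardless of how many distinct n-grams occur in the corpus.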
: β "" ( ), n-. , , ,
β -
,
,
, n-. , , , , .
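There are two common implementations of fastText: the original fastText library (C++ with a Python wrapper) and Gensim (Python). In what follows we work with Gensim.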
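In the Gensim implementation, the embeddings are stored in three matrices: model.vectors_vocab, model.vectors_ngrams, and model.vectors. The last one holds the "ready-made" vectors of in-vocabulary words, with their n-grams already taken into account. model.vectors_vocab holds the raw word vectors, and model.vectors_ngrams holds the n-gram vectors.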
FastText ( ) . : , n-, . n- , , fastText hashing trick: , n-, n-. , ( , ), . , ruscorpora_none_fasttextskipgram_300_2_2019 RusVectores 2 , 330 .
, β n- β fastText . 2 500 , . " + n- ", ; . , , . " ", . 16 2 , 94% , n-, ( gensim
).
, fastText, β " " , ( , ) . "" n- . (, , self-supervized) .
, . fasttext β 300. (SVD), n*300
n*k
k*300
. k
β . ( , , ), , ( 300 β ).
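The factorization of an n*300 matrix into an n*k and a k*300 matrix can be sketched with plain NumPy. This is only an illustration on random data: the function name factorize and the toy sizes are assumptions, and random Gaussian data compresses much worse than real embeddings would.

```python
import numpy as np

def factorize(matrix, k):
    """Split an (n x d) matrix into (n x k) and (k x d) factors via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    left = u[:, :k] * s[:k]   # n x k, singular values folded into the left factor
    right = vt[:k]            # k x d
    return left, right

rng = np.random.default_rng(0)
matrix = rng.normal(size=(1000, 300)).astype(np.float32)

left, right = factorize(matrix, k=100)
approx = left @ right
rel_error = np.linalg.norm(matrix - approx) / np.linalg.norm(matrix)
print(left.shape, right.shape, float(rel_error))
```

Only the two factors need to be stored, so the saving is meaningful when k is much smaller than 300 and n is large.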
β "" , . 32- . 16-, , . float' Python , , . 8 256 . , 256 , , . ( ) , .
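As an illustration of storing each number in 8 bits (256 possible values) instead of 32, here is a minimal sketch of uniform scalar quantization. The choice of a uniform grid between the minimum and maximum value is an assumption made for this example; the text does not specify how the 256 values are chosen.

```python
import numpy as np

def quantize_uint8(matrix):
    """Map every value to one of 256 evenly spaced levels between min and max."""
    lo, hi = float(matrix.min()), float(matrix.max())
    scale = (hi - lo) / 255
    if scale == 0:
        scale = 1.0  # constant matrix: any scale works
    codes = np.clip(np.round((matrix - lo) / scale), 0, 255).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Recover approximate float32 values from the 8-bit codes."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
matrix = rng.normal(size=(1000, 300)).astype(np.float32)

codes, lo, scale = quantize_uint8(matrix)
restored = dequantize(codes, lo, scale)
print(matrix.nbytes, codes.nbytes, float(np.abs(matrix - restored).max()))
```

The codes take a quarter of the original memory, and the reconstruction error of each value is bounded by about half a quantization step.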
, 300 32- 300 8- . ? , , β ! , 300- 100 3- , 3- . , , 3- , 3- . product quantization, . , navec , glove- , 25 50 . , fasttext . - .
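Product quantization, as described above, splits each 300-dimensional vector into 100 three-dimensional chunks and replaces every chunk with the index of its nearest centroid. The sketch below is a toy implementation with a hand-written k-means; the function names, the number of iterations, and the use of 64 centroids (to keep the demo fast) are assumptions, not the settings of navec or of the package discussed in this article.

```python
import numpy as np

def nearest(points, centroids):
    """Index of the closest centroid for every point (squared Euclidean distance)."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(points, n_centroids, n_iter, rng):
    """Plain Lloyd's algorithm, initialised from randomly chosen points."""
    start = rng.choice(len(points), size=n_centroids, replace=False)
    centroids = points[start].copy()
    for _ in range(n_iter):
        labels = nearest(points, centroids)
        for c in range(n_centroids):
            members = points[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_fit_encode(matrix, n_chunks, n_centroids=256, n_iter=5, seed=0):
    """Return uint8 codes (n x n_chunks) and codebooks (n_chunks x n_centroids x sub_dim)."""
    n, dim = matrix.shape
    if dim % n_chunks or n_centroids > 256:
        raise ValueError("dim must divide into chunks and codes must fit in a byte")
    rng = np.random.default_rng(seed)
    chunks = matrix.reshape(n, n_chunks, dim // n_chunks)
    codes, codebooks = [], []
    for j in range(n_chunks):
        centroids = kmeans(chunks[:, j, :], n_centroids, n_iter, rng)
        codebooks.append(centroids)
        codes.append(nearest(chunks[:, j, :], centroids))
    return np.stack(codes, axis=1).astype(np.uint8), np.stack(codebooks)

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    chunk_ids = np.arange(codebooks.shape[0])[None, :]
    return codebooks[chunk_ids, codes].reshape(len(codes), -1)

rng = np.random.default_rng(0)
matrix = rng.normal(size=(1000, 300)).astype(np.float32)

codes, codebooks = pq_fit_encode(matrix, n_chunks=100, n_centroids=64, n_iter=3)
restored = pq_decode(codes, codebooks)
rel_error = np.linalg.norm(matrix - restored) / np.linalg.norm(matrix)
print(codes.shape, codebooks.shape, float(rel_error))
```

Each vector is now stored as 100 bytes instead of 1200, plus a small shared codebook.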
. ruscorpora_none_fasttextskipgram_300_2_2019 c 300- , 165K 2000K n- (n 3 5), . " " 2.7 . gensim, ( gensim==3.8.1
). , n- ( , ).
: , adjust_vectors
, n-. , . ruscorpora_none_fasttextskipgram_300_2_2019
gensim
, - , adjust_vectors
. : intrinsic evalution ( ) . . , , , , gensim
. : , , .
, sys.getsizeof
( , numpy
-), numpy.ndarray.nbytes
( , ), gc.get_referents
"" . , ( save
gensim
, , , pickle
) , , , .
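One possible way to combine the three tools mentioned above (sys.getsizeof, numpy.ndarray.nbytes, and gc.get_referents) into a single memory estimate is sketched below. The function deep_size and the Toy class are hypothetical, and the result is only an approximation: array headers are ignored, and views that share a buffer would be counted more than once.

```python
import gc
import sys
from types import FunctionType, ModuleType

import numpy as np

# Classes, modules and functions are shared program-wide, so we do not count them.
SKIPPED_TYPES = (type, ModuleType, FunctionType)

def deep_size(root):
    """Approximate number of bytes reachable from root."""
    seen = set()
    stack = [root]
    total = 0
    while stack:
        obj = stack.pop()
        if isinstance(obj, SKIPPED_TYPES) or id(obj) in seen:
            continue
        seen.add(id(obj))
        if isinstance(obj, np.ndarray):
            total += obj.nbytes  # size of the data buffer itself
        else:
            total += sys.getsizeof(obj)
            stack.extend(gc.get_referents(obj))  # walk into nested objects
    return total

class Toy:
    """Stand-in for a model object holding a matrix and a vocabulary."""
    def __init__(self, matrix, words):
        self.matrix = matrix
        self.words = words

words = [f"word{i}" for i in range(100)]
model32 = Toy(np.zeros((1000, 300), dtype=np.float32), words)
model16 = Toy(np.zeros((1000, 300), dtype=np.float16), words)
print(deep_size(model32), deep_size(model16))
```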
, , . 80 , ( 10 ) . pymorphy2, , ; . , 54 fastText, 26 β . .
, . intrinsic evaluation: , . , : , , NER, .. , .. . .
intrinsic : hj
, ae
rt
RUSSE, simlex965
( sl
) β RusVectores ( ). hj
sl
, . ae
rt
, 2*ROC_AUC-1
, ROC AUC . , ae
rt
precision, . .
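The 2*ROC_AUC-1 score can be computed from ranks without any extra dependencies. The sketch below is one way to do it; the function names are assumptions, and tied scores receive average ranks.

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney) formula, with average ranks for ties."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest 1-based rank in each tie group
    avg_rank = last_rank - (counts - 1) / 2     # average rank of the tie group
    ranks = avg_rank[inverse]
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def rescaled_auc(labels, scores):
    """2 * ROC_AUC - 1: 0 for a random ranking, 1 for a perfect one."""
    return 2 * roc_auc(labels, scores) - 1

print(rescaled_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

The rescaling puts the score on the same scale as a correlation: a ranking no better than chance scores 0 rather than 0.5.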
: vectors_vocab
( , ), vectors
vectors_ngrams
32 16 . , 2.7 1.28 . , . n- (1.14 ) (136 ).
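The effect of casting from 32 to 16 bits is easy to check on a toy matrix. The sketch below uses a random stand-in for vectors_ngrams rather than a real Gensim model; the helper to_half and the toy sizes are assumptions.

```python
import numpy as np

def to_half(matrix):
    """Cast a float32 matrix to float16, halving its memory footprint."""
    return matrix.astype(np.float16)

rng = np.random.default_rng(0)
# Random stand-in for an n-gram matrix; real embeddings have a different distribution.
vectors_ngrams = rng.normal(scale=0.1, size=(20000, 300)).astype(np.float32)

half = to_half(vectors_ngrams)
max_err = float(np.abs(vectors_ngrams - half.astype(np.float32)).max())
print(vectors_ngrams.nbytes / 2**20, half.nbytes / 2**20, max_err)
```

The memory is halved exactly, while the rounding error stays small relative to the magnitude of the values.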
: TruncatedSVD
scikit-learn
. : 8% . .
, , . navec
, ( ). : ( float int) 99.6% . , 96%. . : , , (!) , 256 . 12 ( 25 ), 94 , 75%. .
β n-, .. . , . , 128 (x10, 16- ) 95%, 25 β 82%. , .
n- ? . , . : - , n- ( ), . ( ) , "" 450 ( ). 45- 93.6% .
, : , fastText-, " " . (20 , 100 n-, 100- ), 28 ( 100 !), 96.15% . . , 36 .
. β , β . , : . , , , n-.

, "" : - .
( 36 15 -, ). , , .

!
RAM 80 . ?
, , . . , , n- . , (.. n-) , . , , , , , n- β , .

, , . .

, 80 . : . , , , β n-, , . .

, intrinsic evaluation. , . β n-. , OOV , -.

, intrinsic evaluation . , , . , , , intrinsic evaluation .

, , .
Fasttext β , , -, , . β β 100 , . 96% , 3% .
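The package is available on PyPI. Compressed models of 13, 28, 51, and 180 megabytes are published as well, all derived from the ruscorpora_none_fasttextskipgram_300_2_2019 model from RusVectores.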
. , -, ODS.