How to compress a fastText model 100 times

The fastText model is one of the most effective vector representations of words for the Russian language. However, its practical use is hampered by the size of the model: several gigabytes. In this article we show how to shrink a fastText model from 2.7 gigabytes to 28 megabytes while losing only 3-4% of its quality. Spoiler: quantization and feature selection work well, but matrix decompositions do not. We also publish a Python package for this compression and examples of compact models for Russian words.



Why, and what this is about


, fastText: fastText , . unsupervised β€” n-. navec β€” glove- . .


: ? , β€” , , (, 300-), - . , ( ). , , , , , . , , (, ) . , , "" .


β€” , ELMO BERT. , fastText. fastText' β€” ( , ) n- ( ) . , , , , . fastText , n- .


fastText was developed at Facebook AI Research. Roughly, it computes the vector of a word like this:


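import numpy as np

def embed(word, model):
    # get_ngrams and hash are placeholders: the n-gram splitter and
    # fastText's n-gram hash reduced to a bucket index.
    if word in model.vocab:
        # Known word: start from its own vector. The final result can be
        # precomputed, in which case the shortcut would be:
        # return model.vectors[word]
        result = model.vectors_vocab[word].copy()  # copy: keep the model's matrix intact
        n = 1
    else:
        # Unknown word: only the n-gram vectors contribute.
        result = np.zeros(model.vector_size)
        n = 0
    for ngram in get_ngrams(word, model.min_n, model.max_n):
        result += model.vectors_ngrams[hash(ngram)]
        n += 1
    return result / max(n, 1)  # average; the guard avoids 0/0 for a word without n-grams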

That is, the embedding of a word is the average of its "own" vector (if the word is in the vocabulary) and the vectors of its n-grams.
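The pseudocode relies on a helper that splits a word into character n-grams. Below is a minimal sketch of such a helper. It assumes the usual fastText convention of wrapping the word in `<` and `>` before slicing; the name `get_ngrams` is borrowed from the pseudocode, and the ordering of the returned n-grams is an illustrative choice.

```python
def get_ngrams(word, min_n, max_n):
    """Return all character n-grams of length min_n..max_n (sketch)."""
    # Assumption: word boundaries are marked with '<' and '>'.
    extended = "<" + word + ">"
    ngrams = []
    # Never ask for n-grams longer than the extended word itself.
    for n in range(min_n, min(len(extended), max_n) + 1):
        for start in range(len(extended) - n + 1):
            ngrams.append(extended[start:start + n])
    return ngrams


print(get_ngrams("cat", 3, 5))
# ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```

Because the boundary markers are part of the n-grams, a prefix such as `<ca` and a word-internal `ca` would get different vectors.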


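There are two main implementations of fastText: the original fastText library (with a Python wrapper) and Gensim (written in Python). In this article we work with the Gensim version.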


In Gensim, besides the matrices model.vectors_vocab and model.vectors_ngrams there is also model.vectors, which stores the precomputed final vectors of the vocabulary words, with their n-grams already averaged in.



FastText ( ) . : , n-, . n- , , fastText hashing trick: , n-, n-. , ( , ), . , ruscorpora_none_fasttextskipgram_300_2_2019 RusVectores 2 , 330 .
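The hashing trick mentioned here can be sketched as follows: instead of keeping a dictionary of all n-grams, each n-gram is hashed into one of a fixed number of buckets, and the bucket index is used as a row number in the n-gram matrix. The sketch uses a plain 32-bit FNV-1a hash as a stand-in; the hash functions in fastText and Gensim differ in details, the names `fnv1a_32` and `bucket_index` are hypothetical, and the bucket count is an illustrative value.

```python
def fnv1a_32(data: bytes) -> int:
    """Plain 32-bit FNV-1a hash (a stand-in, not the exact fastText hash)."""
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) % 2**32
    return h


def bucket_index(ngram: str, n_buckets: int) -> int:
    """Map an n-gram to a row of the n-gram matrix."""
    return fnv1a_32(ngram.encode("utf-8")) % n_buckets


n_buckets = 2_000_000  # illustrative number of rows in the n-gram matrix
for ngram in ["<ca", "cat", "at>"]:
    print(ngram, bucket_index(ngram, n_buckets))
```

The matrix size is fixed in advance regardless of how many distinct n-grams occur, at the price of collisions: different n-grams may land in the same bucket and share a vector.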


, β€” n- β€” fastText . 2 500 , . " + n- ", ; . , , . " ", . 16 2 , 94% , n-, ( gensim).


, fastText, β€” " " , ( , ) . "" n- . (, , self-supervized) .


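One option is matrix decomposition. The dimension of fastText vectors is 300. With a singular value decomposition (SVD), an n*300 matrix can be approximated by the product of an n*k matrix and a k*300 matrix, where k is the reduced dimension.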
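A minimal sketch of the n*300 into n*k and k*300 factorization, assuming NumPy is available. It uses `numpy.linalg.svd` on random stand-in data rather than a real embedding matrix, and the helper name `factorize` is hypothetical.

```python
import numpy as np


def factorize(matrix, k):
    """Approximate an n x d matrix by the product of n x k and k x d factors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    left = u[:, :k] * s[:k]   # n x k, singular values folded into the left factor
    right = vt[:k, :]         # k x d
    return left, right


rng = np.random.default_rng(0)
# Random stand-in for an embedding matrix with 1000 rows of dimension 300.
vectors = rng.normal(size=(1000, 300)).astype(np.float32)

left, right = factorize(vectors, k=100)
approx = left @ right
rel_error = np.linalg.norm(vectors - approx) / np.linalg.norm(vectors)
print(left.shape, right.shape, "relative error:", rel_error)
```

Storing the two factors costs n*k + k*300 numbers instead of n*300, so the saving grows as k shrinks, while the approximation error grows with it.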


β€” "" , . 32- . 16-, , . float' Python , , . 8 256 . , 256 , , . ( ) , .
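The two precision levels discussed here, 16-bit floats and 8-bit codes with 256 possible values, can be sketched with NumPy as below. The uniform 8-bit scheme is only one simple choice, shown under the assumption that a single value range is shared by the whole matrix; the function names are hypothetical and the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for 32-bit embedding vectors.
vectors = rng.normal(scale=0.1, size=(1000, 300)).astype(np.float32)

# 16-bit floats: half the memory, small rounding error.
half = vectors.astype(np.float16)


def quantize_uint8(matrix):
    """Map every value to one of 256 evenly spaced levels."""
    low, high = float(matrix.min()), float(matrix.max())
    scale = (high - low) / 255.0 or 1.0  # guard against a constant matrix
    codes = np.round((matrix - low) / scale).astype(np.uint8)
    return codes, low, scale


def dequantize_uint8(codes, low, scale):
    """Restore approximate float values from the 8-bit codes."""
    return codes.astype(np.float32) * scale + low


codes, low, scale = quantize_uint8(vectors)
restored = dequantize_uint8(codes, low, scale)
print(vectors.nbytes, half.nbytes, codes.nbytes)
print("max abs error (8 bit):", float(np.abs(vectors - restored).max()))
```

With evenly spaced levels the rounding error is bounded by half a step, but the step is set by the extreme values of the matrix, so a few outliers make all other values coarser.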


, 300 32- 300 8- . ? , , β€” ! , 300- 100 3- , 3- . , , 3- , 3- . product quantization, . , navec , glove- , 25 50 . , fasttext . - .
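Product quantization, as named in this paragraph, splits each vector into short sub-vectors and replaces every sub-vector with the index of its nearest centroid. Below is a small sketch under stated assumptions: a naive k-means written in NumPy, random data, 3-dimensional sub-vectors, and hypothetical helper names. It is not the implementation used by navec or any other library.

```python
import numpy as np


def kmeans(points, k, n_iter=10, seed=0):
    """Naive k-means; returns a (k, dim) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_fit(matrix, sub_dim=3, k=16):
    """Train one codebook per group of sub_dim consecutive columns."""
    n_sub = matrix.shape[1] // sub_dim
    return [kmeans(matrix[:, i * sub_dim:(i + 1) * sub_dim], k, seed=i)
            for i in range(n_sub)]


def pq_encode(matrix, codebooks, sub_dim=3):
    """Replace every sub-vector with the index of its nearest centroid."""
    # uint8 codes assume at most 256 centroids per codebook.
    codes = np.empty((matrix.shape[0], len(codebooks)), dtype=np.uint8)
    for i, codebook in enumerate(codebooks):
        chunk = matrix[:, i * sub_dim:(i + 1) * sub_dim]
        dists = ((chunk[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dists.argmin(axis=1)
    return codes


def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the chosen centroids."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])


rng = np.random.default_rng(0)
matrix = rng.normal(size=(200, 12)).astype(np.float32)  # random stand-in data
codebooks = pq_fit(matrix, sub_dim=3, k=16)
codes = pq_encode(matrix, codebooks, sub_dim=3)
restored = pq_decode(codes, codebooks)
print(codes.shape, codes.dtype, restored.shape)
```

Each row is now stored as one byte per sub-vector plus a shared set of small codebooks, which is where the saving over storing one number per dimension comes from.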



The experiments use the ruscorpora_none_fasttextskipgram_300_2_2019 model with 300-dimensional vectors, 165K words and 2000K n-grams (n from 3 to 5). It takes 2.7 GB. The model is loaded with gensim (gensim==3.8.1).


: , adjust_vectors, n-. , . ruscorpora_none_fasttextskipgram_300_2_2019 gensim, - , adjust_vectors . : intrinsic evalution ( ) . . , , , , gensim. : , , .


, sys.getsizeof ( , numpy-), numpy.ndarray.nbytes ( , ), gc.get_referents "" . , ( save gensim, , , pickle) , , , .
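A sketch of how the three tools named here, sys.getsizeof, numpy.ndarray.nbytes and gc.get_referents, could be combined to estimate the in-memory size of a nested object. The function name `deep_size` and the rule for objects exposing `nbytes` are assumptions made for illustration; this is not necessarily how the measurements in the article were taken.

```python
import gc
import sys
from types import FunctionType, ModuleType


def deep_size(obj):
    """Approximate total size in bytes of obj and everything it references."""
    seen = set()
    total = 0
    stack = [obj]
    while stack:
        current = stack.pop()
        # Skip objects already counted, and classes, modules and functions,
        # which would drag in large parts of the interpreter.
        if id(current) in seen or isinstance(current, (type, ModuleType, FunctionType)):
            continue
        seen.add(id(current))
        nbytes = getattr(current, "nbytes", None)
        if isinstance(nbytes, int):
            # Array-like object (e.g. a numpy array): its data buffer may not
            # be visible to sys.getsizeof, so take the larger estimate.
            total += max(nbytes, sys.getsizeof(current))
        else:
            total += sys.getsizeof(current)
        stack.extend(gc.get_referents(current))
    return total


nested = {"a": [1.5, 2.5], "b": "x" * 1000}
print(sys.getsizeof(nested), deep_size(nested))
```

The shallow number covers only the dictionary itself, while the deep one also includes the list, the floats and the strings it points to.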


, , . 80 , ( 10 ) . pymorphy2, , ; . , 54 fastText, 26 β€” . .


, . intrinsic evaluation: , . , : , , NER, .. , .. . .


intrinsic : hj, ae rt RUSSE, simlex965 ( sl) β€” RusVectores ( ). hj sl , . ae rt , 2*ROC_AUC-1, ROC AUC . , ae rt precision, . .
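The quantity 2*ROC_AUC-1 mentioned here rescales ROC AUC so that random scoring gives 0 and a perfect ranking gives 1. A small pure-Python sketch follows; the function names are hypothetical, and the pairwise computation is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def rescaled_auc(labels, scores):
    """2 * ROC_AUC - 1: 0 for random scoring, 1 for a perfect ranking."""
    return 2 * roc_auc(labels, scores) - 1


labels = [0, 0, 1, 1]
print(rescaled_auc(labels, [0.1, 0.2, 0.8, 0.9]))  # 1.0
print(rescaled_auc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.0
```

After the rescaling, scores on binary tasks live on a scale where 0 means no signal, which makes them easier to put next to correlation-based scores.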



: vectors_vocab ( , ), vectors vectors_ngrams 32 16 . , 2.7 1.28 . , . n- (1.14 ) (136 ).


: TruncatedSVD scikit-learn. : 8% . .


, , . navec, ( ). : ( float int) 99.6% . , 96%. . : , , (!) , 256 . 12 ( 25 ), 94 , 75%. .


β€” n-, .. . , . , 128 (x10, 16- ) 95%, 25 β€” 82%. , .


n- ? . , . : - , n- ( ), . ( ) , "" 450 ( ). 45- 93.6% .


, : , fastText-, " " . (20 , 100 n-, 100- ), 28 ( 100 !), 96.15% . . , 36 .


. β€” , β€” . , : . , , , n-.



, "" : - .


( 36 15 -, ). , , .



!


RAM 80 . ?


, , . . , , n- . , (.. n-) , . , , , , , n- β€” , .



, , . .



, 80 . : . , , , β€” n-, , . .



, intrinsic evaluation. , . β€” n-. , OOV , -.



, intrinsic evaluation . , , . , , , intrinsic evaluation .



, , .



Fasttext β€” , , -, , . β€” β€” 100 , . 96% , 3% .


The package is available on PyPI. The compact models of 13, 28, 51 and 180 MB are all compressed versions of ruscorpora_none_fasttextskipgram_300_2_2019 from RusVectores.


. , -, ODS.

