Two-week news aggregator

On November 18, Telegram launched the Data Clustering Contest. The task was to build your own news aggregator in two weeks. The restrictions set for this competition scared off many people, but not me and my colleagues. I will describe how we approached it, what choices we made, and what difficulties we ran into. The solution we submitted processed 1000 documents in 3.5 seconds, took up 150 MB, and placed 6th in the public vote and 3rd in the final results. We made many mistakes that cost us a higher place; most of them are now fixed. All code and all models can be found in the repository. All model training scripts have been moved to Colab.


Top of the public vote


Task


.


5 :


  • ,
  • 7
  • , ,



, . , : //, . , 5 : , , . .


. , , . , “” , , , . , .


:


  • : 200 ( 1.5 )
  • 1000
  • 2
  • Debian GNU/Linux 10.1,

, 1000 , . : 200 , ( word2vec, fasttext, GloVe, ) ULMFiT/ELMo/BERT. , . , 2 .


, , . .



. , Python ( ). . , , . , .


, Go, , , . C++, , . , . C++11, - .


C NLP, 2016 FastText’. , TF-IDF, , , , , . FastText — word2vec n-, . ELMo 197 , BERT — 632 , ( ). , FastText C++ .


- . , ( , !). OpenNMT, . , C++, Python, . .


, , DBSCAN, . DBSCAN MLPack, MLPack Debian’. . , , DBSCAN’ . MLPack .


- : TensorFlow, Torch, MXNet. “TensorFlow C++, ” — . -, , . -, 200 . Tensorflow Lite, . .


. , , . , Eigen, . Keras, , , Torch ( ). .


:


  • : C++, FastText, OpenNMT Tokenizer, Eigen
  • : Python, FastText, OpenNMT Tokenizer, Keras

..



FastText’ .


, , . , .


. . , , . 3 2/3, 5 4/5. , . . 60$ . , 327 1176 , . 3-4 .


, . , BBC News categories. .


, FastText . : ; BBC, All the news, News categories . , .


supervised FastText’ ( autotune). Supervised — , , .


Classifiers


2 : . , . ( ) . , , , , . , . , — ( hard-negative ). ( ) . triplet loss. Keras’, Torch’. .
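The paragraph above names triplet loss and hard negatives. As a minimal sketch of that idea only, assuming Euclidean distance and a margin of 1.0 (both are assumptions, not values from the contest solution), the loss for one triplet and the choice of the hardest negative could look like this. The function names `triplet_loss` and `hardest_negative` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is.
    The margin value is an assumption for illustration."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hardest_negative(anchor, candidates):
    """Pick the candidate negative closest to the anchor,
    i.e. the one that produces the largest loss."""
    return min(candidates, key=lambda c: euclidean(anchor, c))


anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negatives = [[3.0, 4.0], [1.0, 0.0]]

easy = triplet_loss(anchor, positive, negatives[0])
hard = triplet_loss(anchor, positive, hardest_negative(anchor, negatives))
```

An easy negative that is already far from the anchor contributes no loss, which is why mining hard negatives matters: only triplets with a non-zero loss move the embedding.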


Model for learning vectors


, , , . , , BERT. , . unsupervised ELMo. , ELMo .


SLINK: O(n^2) . , — : , , , .
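The text mentions SLINK, a single-linkage agglomerative clustering algorithm with O(n^2) running time. The sketch below is not Sibson's pointer-based SLINK. It is a simpler stand-in that also evaluates O(n^2) pairwise distances and yields the same flat clusters for a fixed distance threshold, because cutting a single-linkage dendrogram at a threshold gives the connected components of the "distance ≤ threshold" graph. The function name, the threshold, and the toy one-dimensional data are assumptions for illustration.

```python
def single_linkage_clusters(points, threshold, distance):
    """Return a cluster label per point. Two points share a cluster if
    they are linked by a chain of pairs with distance <= threshold."""
    n = len(points)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Walk to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Examine every pair once: O(n^2) distance evaluations.
    for i in range(n):
        for j in range(i + 1, n):
            if distance(points[i], points[j]) <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    # Relabel roots as consecutive integers in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


points = [0.0, 1.0, 1.5, 10.0, 10.4]
clusters = single_linkage_clusters(points, 1.0, lambda a, b: abs(a - b))
```

Note the chaining behaviour typical of single linkage: 0.0 and 1.5 end up together even though they are farther apart than the threshold, because 1.0 links them.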


Agglomerative Clustering
. .


O(n^2) — . , : . , 10000 2000 . , . , .


3 : , , . — , . 99 . , . PageRank , .
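The paragraph above mentions PageRank, but the surviving text does not say which graph it was computed on. Purely as an illustration of the algorithm itself, here is a minimal power-iteration sketch over a small hypothetical link graph. The node names, the damping factor of 0.85, and the iteration count are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.
    `links` maps each node to the list of nodes it links to.
    Nodes without outgoing links spread their rank evenly over all nodes."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by dangling nodes is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" both link to "c", and "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph the node with the most incoming links gets the highest score, and the node nobody links to gets the lowest.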


3 : , , .



  1. .
  2. “” , - “”.
  3. , - , .
  4. , - .
  5. , - , .
  6. , - .
  7. std::sort std::stable_sort, - .
  8. , - .
  9. , - .


, . .


?


-, rss- . Telegram — , Instant View, . . .


-, . -: , , , . .


, , README .


You can see the current version here, and the version from the contest here.

Source: https://habr.com/ru/post/undefined/

