FastText: code recipe

Good day, friends! I present to you an amateur translation of the original article: FastText: stepping through the code by Maria Mestre.

A small warning: some of the information provided may not be completely accurate, due to the passage of time and occasional mistakes by the author. In any case, any feedback is welcome!

You may have come across FastText as a tool for vectorizing your corpus of texts, but did you know that FastText can also classify them? Or maybe you knew, but do you know how it does it? Let's look at it from the inside... I mean, through the screen.

The FastText library was primarily developed by the Facebook team for text classification, but it can also be used to train word embeddings. Since FastText became available to everyone (2016), it has been widely used thanks to its good training speed and excellent performance.

Reading the official documentation (which is rather sparse on explanations), I realized that a fairly large part of the algorithms is not entirely transparent, so it was decided to figure out independently how it all works and what it is made of. I started by reading the main article by the creators and skimming through the Stanford deep learning course on NLP. In the process, my questions only multiplied. For example: both the Stanford lectures and one part of the article mention the use of N-grams, but say nothing about the particulars of their application. Moreover, the command-line parameters include a certain bucket, the only comment on which reads "the number of containers (buckets)". How are these N-grams computed? What is a bucket? It's not clear... Only one option remains: look at the code on GitHub.

After going through the code, the following was found out:


  • FastText makes use of character N-grams just as readily as word N-grams.
  • FastText supports classification with multiple labels per example, which may not be obvious at first.
  • The number of model parameters turns out to be surprisingly large.

Model Introduction



According to the article written by the developers, the model is a simple neural network with one hidden layer. The text, represented as a bag of words (BOW), is passed through the first layer, where it is transformed into word embeddings. The resulting embeddings are then averaged over the whole text, reducing them to a single vector for the entire text. This hidden layer thus works with n_words * dim parameters, where dim is the embedding size and n_words is the size of the dictionary used for the text. The averaged vector then passes through what is essentially a linear classifier: a linear mapping from the hidden layer to the output layer, followed by the softmax function. The linear transformation is a matrix of dimension dim * n_output, where n_output is the number of real classes. In the original article the final probability is expressed as a log-likelihood (a small sketch of this forward pass follows the list of symbols below):

-(1/N) * sum_{n=1..N} y_n * log(f(B A x_n))

where:

  • x_n is the N-gram representation of a word.
  • A is the look-up matrix that extracts the embedding of a word.
  • B is the linear transformation of the output layer.
  • f is the softmax function itself.
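
To make this concrete, here is a minimal standalone sketch of that forward pass (my own illustration, not the library's API; the names forward, A, B and input_indices are made up for the example):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal sketch of the forward pass: average the embeddings of the input
// components (rows of A), apply the linear transformation B, then softmax.
std::vector<float> forward(const std::vector<std::vector<float>>& A,  // (n_words + bucket) x dim
                           const std::vector<std::vector<float>>& B,  // n_output x dim
                           const std::vector<int>& input_indices) {   // indices x_n of words / N-grams
    const std::size_t dim = A[0].size();
    std::vector<float> hidden(dim, 0.0f);
    for (int idx : input_indices)
        for (std::size_t d = 0; d < dim; ++d)
            hidden[d] += A[idx][d];
    for (std::size_t d = 0; d < dim; ++d)
        hidden[d] /= input_indices.size();          // average over the text

    std::vector<float> out(B.size(), 0.0f);
    float max = -1e30f, z = 0.0f;
    for (std::size_t i = 0; i < B.size(); ++i) {
        for (std::size_t d = 0; d < dim; ++d)
            out[i] += B[i][d] * hidden[d];          // linear transformation B
        max = std::max(max, out[i]);
    }
    for (std::size_t i = 0; i < out.size(); ++i) {  // numerically stable softmax
        out[i] = std::exp(out[i] - max);
        z += out[i];
    }
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] /= z;
    return out;                                     // probability of each class
}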

All in all, not so bad, but let's still take a look at the code:

The code as a translation of the idea


The input file for our classifier must follow a specific format: __label__0 (text).

As examples:
__label__cat This text is about cats.
__label__dog This text is about dogs.
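
As an illustration, a small hypothetical parser (my own sketch, not FastText's code) that splits such a line into labels and words using the default __label__ prefix might look like this:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Split one training line into labels and plain words.
// "__label__" is FastText's default label prefix (configurable via -label).
void parseLine(const std::string& line,
               std::vector<std::string>& labels,
               std::vector<std::string>& words) {
    const std::string prefix = "__label__";
    std::istringstream iss(line);
    std::string token;
    while (iss >> token) {
        if (token.compare(0, prefix.size(), prefix) == 0)
            labels.push_back(token.substr(prefix.size()));
        else
            words.push_back(token);
    }
}

int main() {
    std::vector<std::string> labels, words;
    parseLine("__label__cat This text is about cats.", labels, words);
    std::cout << "label: " << labels[0] << ", words: " << words.size() << "\n";
    // prints: label: cat, words: 5
}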

The source file is passed as an argument to the train function, which starts by initializing and then filling the dict_ variable.

void FastText::train(const Args args) {
  args_ = std::make_shared<Args>(args);
  dict_ = std::make_shared<Dictionary>(args_);
  if (args_->input == "-") {
    // manage expectations
    throw std::invalid_argument("Cannot use stdin for training!");
  }
  std::ifstream ifs(args_->input);
  if (!ifs.is_open()) {
    throw std::invalid_argument(
        args_->input + " cannot be opened for training!");
  }
  dict_->readFromFile(ifs);

dict_ is an instance of the Dictionary class.


Dictionary::Dictionary(std::shared_ptr<Args> args) : args_(args),
  word2int_(MAX_VOCAB_SIZE, -1), size_(0), nwords_(0), nlabels_(0),
  ntokens_(0), pruneidx_size_(-1) {}

After reading and parsing the sentences of the source file, we fill two vectors:

  • words_, containing the unique words extracted from the text;
  • word2int_, indexed by the hash of a word and storing its position in the words_ vector. This is actually very important, because it determines which rows will be used when looking up embeddings in matrix A.

The words_ vector contains instances of entry. Each entry can be of type word or label, and carries a counter of its occurrences. There is also a subwords field, but we will look at it a bit later.

struct entry {
  std::string word;
  int64_t count;
  entry_type type;
  std::vector<int32_t> subwords;
};

When adding words or labels to the word2int_ variable, collisions are necessarily resolved in such a way that we never end up with two different words referring to the same index. There simply won't be any.

int32_t Dictionary::find(const std::string& w, uint32_t h) const {
  int32_t word2intsize = word2int_.size();
  int32_t id = h % word2intsize;
  while (word2int_[id] != -1 && words_[word2int_[id]].word != w) {
    id = (id + 1) % word2intsize;
  }
  return id;
}

int32_t Dictionary::find(const std::string& w) const {
  return find(w, hash(w));
}

void Dictionary::add(const std::string& w) {
  int32_t h = find(w);
  ntokens_++;
  if (word2int_[h] == -1) {
    entry e;
    e.word = w;
    e.count = 1;
    e.type = getType(w);
    words_.push_back(e);
    word2int_[h] = size_++;
  } else {
    words_[word2int_[h]].count++;
  }
}
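
To see the collision handling in isolation, here is a toy re-implementation of the same open-addressing idea with a deliberately weak hash (the string length), so a collision is guaranteed; the real code hashes the word itself (FNV-1a) over a table of MAX_VOCAB_SIZE slots:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Toy version of Dictionary::find/add: word2int maps a probed slot to an
// index into words; -1 marks an empty slot.
struct ToyDict {
    std::vector<int32_t> word2int = std::vector<int32_t>(8, -1);
    std::vector<std::string> words;

    // Deliberately weak hash so that "cat" and "dog" collide.
    uint32_t hash(const std::string& w) const { return static_cast<uint32_t>(w.size()); }

    int32_t find(const std::string& w) const {
        int32_t id = static_cast<int32_t>(hash(w) % word2int.size());
        // Linear probing: walk forward until we hit the word or an empty slot.
        while (word2int[id] != -1 && words[word2int[id]] != w)
            id = static_cast<int32_t>((id + 1) % word2int.size());
        return id;
    }

    void add(const std::string& w) {
        int32_t h = find(w);
        if (word2int[h] == -1) {
            words.push_back(w);
            word2int[h] = static_cast<int32_t>(words.size() - 1);
        }
    }
};

int main() {
    ToyDict d;
    d.add("cat");   // lands in slot 3, index 0
    d.add("dog");   // same toy hash as "cat", probed forward to slot 4, index 1
    std::cout << d.find("cat") << " " << d.find("dog") << "\n";  // prints: 3 4
}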

Both vectors are then filtered so that only the words and labels occurring the required minimum number of times (minCount and minCountLabel) are kept. Next, still inside the readFromFile function, initNgrams is called. We have almost reached the mystery of the N-grams.


void Dictionary::readFromFile(std::istream& in) {
  std::string word;
  int64_t minThreshold = 1;
  while (readWord(in, word)) {
    add(word);
    if (ntokens_ % 1000000 == 0 && args_->verbose > 1) {
      std::cerr << "\rRead " << ntokens_  / 1000000 << "M words" << std::flush;
    }
    if (size_ > 0.75 * MAX_VOCAB_SIZE) {
      minThreshold++;
      threshold(minThreshold, minThreshold);
    }
  }
  threshold(args_->minCount, args_->minCountLabel);
  initTableDiscard();
  initNgrams();

The initNgrams function computes all the character N-gram combinations of a word and adds them to its ngram (subwords) vector. The hashes of these N-grams are computed modulo the bucket size and stored in the vector offset by the number of words; in other words, the character N-gram indices come after the word indices. As a result, they do not overlap with word indices, but they can overlap with each other.

In general, for each word in the text we can build its subwords, i.e. its character N-grams.
The embedding matrix A is laid out as follows:

  • the first nwords_ rows contain the embeddings of every word in the dictionary built from the text;
  • the next bucket rows contain the embeddings of the character N-grams.


void Dictionary::initNgrams() {
  for (size_t i = 0; i < size_; i++) {
    std::string word = BOW + words_[i].word + EOW;
    words_[i].subwords.clear();
    words_[i].subwords.push_back(i);
    if (words_[i].word != EOS) {
      computeSubwords(word, words_[i].subwords);
    }
  }
}

void Dictionary::computeSubwords(const std::string& word,
                               std::vector<int32_t>& ngrams) const {
  for (size_t i = 0; i < word.size(); i++) {
    std::string ngram;
    if ((word[i] & 0xC0) == 0x80) continue;
    for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
      ngram.push_back(word[j++]);
      while (j < word.size() && (word[j] & 0xC0) == 0x80) {
        ngram.push_back(word[j++]);
      }
      if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
        int32_t h = hash(ngram) % args_->bucket;
        pushHash(ngrams, h);
      }
    }
  }
}

void Dictionary::pushHash(std::vector<int32_t>& hashes, int32_t id) const {
  if (pruneidx_size_ == 0 || id < 0) return;
  if (pruneidx_size_ > 0) {
    if (pruneidx_.count(id)) {
      id = pruneidx_.at(id);
    } else {
      return;
    }
  }
  hashes.push_back(nwords_ + id);
}
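
As a rough illustration of which rows of A a single word ends up touching, here is a simplified, ASCII-only sketch (the real computeSubwords above also handles multi-byte UTF-8 characters and a few edge cases; the hash constants are the 32-bit FNV-1a ones FastText uses for tokens, as far as I can tell, and nwords/bucket are assumed values):

#include <cstdint>
#include <iostream>
#include <string>

// 32-bit FNV-1a, the hash applied to tokens and n-grams.
uint32_t fnv1a(const std::string& s) {
    uint32_t h = 2166136261u;
    for (char c : s) {
        h ^= static_cast<uint32_t>(static_cast<int8_t>(c));
        h *= 16777619u;
    }
    return h;
}

int main() {
    const std::string word = "<where>";           // word wrapped in BOW/EOW markers
    const int minn = 3, maxn = 6;                 // character n-gram lengths
    const int nwords = 500000, bucket = 2000000;  // assumed sizes, for illustration

    for (std::size_t i = 0; i < word.size(); ++i) {
        for (int n = minn; n <= maxn && i + n <= word.size(); ++n) {
            std::string ngram = word.substr(i, n);
            // The n-gram lands in row nwords + (hash mod bucket) of matrix A.
            int32_t row = nwords + fnv1a(ngram) % bucket;
            std::cout << ngram << " -> row " << row << "\n";
        }
    }
    // e.g. "<wh", "whe", "her", "ere", "re>", "<whe", ... each map to a row
    // beyond the first nwords rows, which are reserved for whole words.
}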

So we have unravelled the mystery of the character N-grams. Now let's deal with the word N-grams.

Returning to the train function , the following instructions are executed:

  if (args_->pretrainedVectors.size() != 0) {
    loadVectors(args_->pretrainedVectors);
  } else {
    input_ = std::make_shared<Matrix>(dict_->nwords()+args_->bucket, args_->dim);
    input_->uniform(1.0 / args_->dim);
  }

  if (args_->model == model_name::sup) {
    output_ = std::make_shared<Matrix>(dict_->nlabels(), args_->dim);
  } else {
    output_ = std::make_shared<Matrix>(dict_->nwords(), args_->dim);
  }
  output_->zero();
  startThreads();

This is where the embedding matrix A is initialized. It should be pointed out that if pretrainedVectors are passed to this function, they are loaded and this variable is considered filled. If not, the matrix is initialized with random numbers drawn uniformly between -1/dim and 1/dim, where dim is the size of our embedding. The dimensions of matrix A are (nwords_ + bucket; dim), i.e. we are going to train an embedding for every word and every N-gram bucket. The output matrix is also initialized (to zero) at this step.
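
A minimal sketch of this initialization step with toy sizes (illustrative only; the real Matrix class is FastText's own, and a real model would use far larger dimensions):

#include <cstddef>
#include <random>
#include <vector>

int main() {
    // Toy sizes for illustration; a real model might use nwords ~ 500,000,
    // bucket = 2,000,000 and dim = 100.
    const int nwords = 1000, bucket = 2000, dim = 100, nlabels = 2;

    std::mt19937 rng(42);
    std::uniform_real_distribution<float> uniform(-1.0f / dim, 1.0f / dim);

    // Input matrix A: one row per word plus one row per n-gram bucket.
    std::vector<float> input(static_cast<std::size_t>(nwords + bucket) * dim);
    for (float& v : input) v = uniform(rng);

    // Output matrix: one row per label, zero-initialized.
    std::vector<float> output(static_cast<std::size_t>(nlabels) * dim, 0.0f);
    (void)output;
}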

Finally, we get to the part where the model is actually trained.


void FastText::trainThread(int32_t threadId) {
  std::ifstream ifs(args_->input);
  utils::seek(ifs, threadId * utils::size(ifs) / args_->thread);

  Model model(input_, output_, args_, threadId);
  if (args_->model == model_name::sup) {
    model.setTargetCounts(dict_->getCounts(entry_type::label));
  } else {
    model.setTargetCounts(dict_->getCounts(entry_type::word));
  }

  const int64_t ntokens = dict_->ntokens();
  int64_t localTokenCount = 0;
  std::vector<int32_t> line, labels;
  while (tokenCount_ < args_->epoch * ntokens) {
    real progress = real(tokenCount_) / (args_->epoch * ntokens);
    real lr = args_->lr * (1.0 - progress);
    if (args_->model == model_name::sup) {
      localTokenCount += dict_->getLine(ifs, line, labels);
      supervised(model, lr, line, labels);
    } else if (args_->model == model_name::cbow) {
      localTokenCount += dict_->getLine(ifs, line, model.rng);
      cbow(model, lr, line);
    } else if (args_->model == model_name::sg) {
      localTokenCount += dict_->getLine(ifs, line, model.rng);
      skipgram(model, lr, line);
    }
    if (localTokenCount > args_->lrUpdateRate) {
      tokenCount_ += localTokenCount;
      localTokenCount = 0;
      if (threadId == 0 && args_->verbose > 1)
        loss_ = model.getLoss();
    }
  }
  if (threadId == 0)
    loss_ = model.getLoss();
  ifs.close();
}
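
Note the learning-rate schedule inside the loop above: the rate decays linearly with the overall training progress, reaching zero by the end of the last epoch:

lr = lr_0 * (1 - tokenCount / (epoch * ntokens))

So with an initial rate of, say, 0.1, halfway through training the updates are already made with a rate of 0.05.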

There are two main points in this part of the code. First, the getLine function is called on the dict_ variable. Second, the supervised function is called.

int32_t Dictionary::getLine(std::istream& in,
                            std::vector<int32_t>& words,
                            std::vector<int32_t>& labels) const {
  std::vector<int32_t> word_hashes;
  std::string token;
  int32_t ntokens = 0;

  reset(in);
  words.clear();
  labels.clear();
  while (readWord(in, token)) {
    uint32_t h = hash(token);
    int32_t wid = getId(token, h);
    entry_type type = wid < 0 ? getType(token) : getType(wid);

    ntokens++;
    if (type == entry_type::word) {
      addSubwords(words, token, wid);
      word_hashes.push_back(h);
    } else if (type == entry_type::label && wid >= 0) {
      labels.push_back(wid - nwords_);
    }
    if (token == EOS) break;
  }
  addWordNgrams(words, word_hashes, args_->wordNgrams);
  return ntokens;
}

In the function above, we read the text from the input, look up the index of each word one by one using word2int_, add the subwords (character N-grams) of those words to the words vector, as shown in the code, and finally add the labels to the labels vector.

Once the text (the sentences of the corpus) has been fully read and added to the vectors created for it, we reach the part of the code that handles word N-grams: the addWordNgrams function.

void Dictionary::addWordNgrams(std::vector<int32_t>& line,
                               const std::vector<int32_t>& hashes,
                               int32_t n) const {
  for (int32_t i = 0; i < hashes.size(); i++) {
    uint64_t h = hashes[i];
    for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {
      h = h * 116049371 + hashes[j];
      pushHash(line, h % args_->bucket);
    }
  }
}

Let's look further. The hashes variable contains the hash of every word in the text, while line contains the indices of the words of the sentence together with the indices of the character N-grams used. The parameter n is the wordNgrams parameter and gives the maximum length of the word N-grams. Each word N-gram gets its own hash, computed by the recursive formula

h = h * 116049371 + hashes[j]

This formula is the FNV hashing scheme applied to a string of word hashes: it takes the hash of each word in the word N-gram and combines them, so that a set of (almost) unique hashes is obtained. The resulting value, which can turn out to be quite large, is finally taken modulo the bucket size.

Thus, word N-grams are computed in roughly the same way as character N-grams, with one small difference: we do not hash the concrete word string itself, but a combination of word hashes. A neat trick.
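
As a hedged sketch, this is how a word bigram such as "this text" would end up with a row in A (the 116049371 multiplier comes from the code above, the FNV-1a token hash is the one FastText uses as far as I can tell, and nwords/bucket are assumed values):

#include <cstdint>
#include <iostream>
#include <string>

// Token hash (32-bit FNV-1a).
uint32_t fnv1a(const std::string& s) {
    uint32_t h = 2166136261u;
    for (char c : s) {
        h ^= static_cast<uint32_t>(static_cast<int8_t>(c));
        h *= 16777619u;
    }
    return h;
}

int main() {
    const int64_t nwords = 500000, bucket = 2000000;  // assumed sizes

    // Hash of each word in the bigram "this text", combined as in addWordNgrams.
    uint64_t h = fnv1a("this");
    h = h * 116049371 + fnv1a("text");

    int64_t row = nwords + h % bucket;  // offset past the rows reserved for words
    std::cout << "bigram row in A: " << row << "\n";
}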

After a sentence has been read, the supervised function is called. If the sentence has several labels, one of them is picked at random.


void FastText::supervised(
    Model& model,
    real lr,
    const std::vector<int32_t>& line,
    const std::vector<int32_t>& labels) {
  if (labels.size() == 0 || line.size() == 0) return;
  std::uniform_int_distribution<> uniform(0, labels.size() - 1);
  int32_t i = uniform(model.rng);
  model.update(line, labels[i], lr);
}

And finally, the model update function.

void Model::computeHidden(const std::vector<int32_t>& input, Vector& hidden) const {
  assert(hidden.size() == hsz_);
  hidden.zero();
  for (auto it = input.cbegin(); it != input.cend(); ++it) {
    if(quant_) {
      hidden.addRow(*qwi_, *it);
    } else {
      hidden.addRow(*wi_, *it);
    }
  }
  hidden.mul(1.0 / input.size());
}

void Model::update(const std::vector<int32_t>& input, int32_t target, real lr) {
  assert(target >= 0);
  assert(target < osz_);
  if (input.size() == 0) return;
  computeHidden(input, hidden_);
  if (args_->loss == loss_name::ns) {
    loss_ += negativeSampling(target, lr);
  } else if (args_->loss == loss_name::hs) {
    loss_ += hierarchicalSoftmax(target, lr);
  } else {
    loss_ += softmax(target, lr);
  }
  nexamples_ += 1;

  if (args_->model == model_name::sup) {
    grad_.mul(1.0 / input.size());
  }
  for (auto it = input.cbegin(); it != input.cend(); ++it) {
    wi_->addRow(grad_, *it, 1.0);
  }
}

The input that reaches the supervised function is the list of indices of all the components of the sentence (words, word N-grams and character N-grams); the target is the output class. The computeHidden function looks up the embedding of every input component in the wi_ matrix and averages them (addRow sums the rows, after which we divide by the input size). Once the hidden_ vector has been computed, it is passed through the softmax activation to determine a label.

The next section of code shows the use of the softmax activation function and how the log-loss is computed.

void Model::computeOutputSoftmax(Vector& hidden, Vector& output) const {
  if (quant_ && args_->qout) {
    output.mul(*qwo_, hidden);
  } else {
    output.mul(*wo_, hidden);
  }
  real max = output[0], z = 0.0;
  for (int32_t i = 0; i < osz_; i++) {
    max = std::max(output[i], max);
  }
  for (int32_t i = 0; i < osz_; i++) {
    output[i] = exp(output[i] - max);
    z += output[i];
  }
  for (int32_t i = 0; i < osz_; i++) {
    output[i] /= z;
  }
}

void Model::computeOutputSoftmax() {
  computeOutputSoftmax(hidden_, output_);
}

real Model::softmax(int32_t target, real lr) {
  grad_.zero();
  computeOutputSoftmax();
  for (int32_t i = 0; i < osz_; i++) {
    real label = (i == target) ? 1.0 : 0.0;
    real alpha = lr * (label - output_[i]);
    grad_.addRow(*wo_, i, alpha);
    wo_->addRow(hidden_, i, alpha);
  }
  return -log(output_[target]);
}
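
For reference, the update in Model::softmax above is plain gradient descent on the log-loss. Writing p = softmax(wo_ * hidden_) and L = -log(p_target), the gradient with respect to the i-th output logit is

dL/d(logit_i) = p_i - y_i

where y_i is 1 for the target class and 0 otherwise. That is exactly why alpha = lr * (label - output_[i]): the calls grad_.addRow(*wo_, i, alpha) and wo_->addRow(hidden_, i, alpha) move the hidden-layer gradient and the output weights one step against that gradient.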

With this approach, the number of distinct word and character N-grams cannot grow without bound: it is capped by the number of available buckets.

By default FastText does not use N-grams, but the following typical settings are recommended:

  • bucket = 2,000,000; wordNgrams > 1 or maxn > 0
  • dim = 100
  • n_output = 2
  • n_words = 500,000

In total, we get a fairly large number of parameters to train:

(500,000 + 2,000,000) * 100 + (2 * 100) ≈ 250,000,000


Quite a lot, isn't it? As we can see, even with such a simple approach we end up with a very large number of parameters, which is quite typical of deep learning methods. A very rough estimate is used (for example, the number of N-grams is taken to be 2,000,000) just to show the order of magnitude. The size of the model can be reduced by tuning some hyperparameters, such as the bucket size or the sampling threshold.

Some links may be useful:
research.fb.com/fasttext
arxiv.org/pdf/1607.01759.pdf and arxiv.org/pdf/1607.04606v1.pdf
www.youtube.com/watch?v=isPiE-DBagM&index=5&list=PLU40WL8Ol94IJzQtileLTqGL_XtG (from 47:39)
