Multi-label classification

Hello, Habr readers! We decided to share an excerpt from Andrei Burkov's book Machine Learning Without Extra Words, dedicated to classification.

Several labels can be used simultaneously to describe the image in the figure: “coniferous forest”, “mountains”, “road”. If the number of possible label values is large, but they are all of the same nature, like tags, then each labeled example can be converted into several labeled examples, one per label. All of these new examples have the same feature vector and only one label each. The task then becomes a multiclass classification problem, which can be solved using the one-versus-rest strategy. The only difference from the usual multiclass problem is the appearance of a new hyperparameter: a threshold. If the prediction score for some label is above the threshold, that label is assigned to the input feature vector. In this scenario, multiple labels can be assigned to one feature vector. The threshold value is chosen using the validation set.

Algorithms that naturally extend to the multiclass case (decision trees, logistic regression, neural networks, and so on) can be applied to the multilabel classification problem in a similar way. Because they return a score for each class, we can define a threshold and then assign to one feature vector every label whose score exceeds that threshold.
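As an illustration (not from the book), here is a minimal sketch of this thresholding idea using scikit-learn's OneVsRestClassifier; the toy data and the 0.5 threshold are assumptions made for the example, with the threshold normally tuned on the validation set.

```python
# A minimal sketch of multilabel prediction with a score threshold.
# The data, the base classifier, and the 0.5 threshold are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

np.random.seed(0)
X = np.random.rand(100, 5)                  # 100 feature vectors
Y = np.random.randint(0, 2, size=(100, 3))  # 3 binary labels per example

clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

threshold = 0.5                             # would be chosen on the validation set
scores = clf.predict_proba(X[:2])           # one score per label
predicted_labels = (scores >= threshold).astype(int)
print(predicted_labels)                     # several labels may be 1 at once
```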

Neural networks can naturally be trained for multilabel classification using binary cross-entropy as the cost function. In this case, the output layer of the neural network has one node per label, and each output node has a sigmoid activation function. Accordingly, each label l is binary ($y_{i,l} \in \{0, 1\}$), where l = 1, ..., L and i = 1, ..., N. The binary cross-entropy of the predicted probability $\hat{y}_{i,l}$ that example $\mathbf{x}_i$ has label l is defined as
$-\big(y_{i,l}\ln(\hat{y}_{i,l}) + (1 - y_{i,l})\ln(1 - \hat{y}_{i,l})\big).$

The minimization criterion is simply the average of all binary cross-entropy terms over all training examples and all their labels.
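A small sketch of this cost (the function name and toy arrays are my own, not the book's) computed directly with NumPy:

```python
# Binary cross-entropy averaged over all N examples and L labels.
import numpy as np

def multilabel_bce(y_true, y_prob, eps=1e-12):
    """y_true, y_prob: arrays of shape (N, L); y_prob are sigmoid outputs."""
    y_prob = np.clip(y_prob, eps, 1 - eps)            # avoid log(0)
    per_term = -(y_true * np.log(y_prob)
                 + (1 - y_true) * np.log(1 - y_prob)) # one term per example and label
    return per_term.mean()                            # average over the N * L terms

y_true = np.array([[1, 0, 1], [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.3, 0.8]])
print(multilabel_bce(y_true, y_prob))
```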

In cases where the number of possible label values is small, you can try to convert the multilabel classification problem into a multiclass classification problem. Imagine the following problem. You need to assign labels of two types to images. Labels of the first type can take two possible values: {photo, painting}; labels of the second type can take three possible values: {portrait, landscape, other}. For each combination of the two original classes you can create a new dummy class, for example:

Dummy class 1: {photo, portrait}
Dummy class 2: {photo, landscape}
Dummy class 3: {photo, other}
Dummy class 4: {painting, portrait}
Dummy class 5: {painting, landscape}
Dummy class 6: {painting, other}

Now we have the same labeled examples, but we have replaced the set of true labels with one dummy label whose values range from 1 to 6. In practice, this approach works well when there are not too many possible combinations of classes. Otherwise, much more training data is needed to compensate for the increased number of classes.
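A sketch of this conversion (the variable names are hypothetical; the numbering of dummy classes is one arbitrary but consistent choice):

```python
# Convert two label sets into a single dummy label with values 1..6
# by enumerating their combinations.
from itertools import product

type1 = ["photo", "painting"]
type2 = ["portrait", "landscape", "other"]

# enumerate() assigns dummy classes 1..6, one per (type1, type2) combination
combo_to_class = {combo: k for k, combo in enumerate(product(type1, type2), start=1)}

print(combo_to_class[("photo", "landscape")])   # -> 2
print(combo_to_class[("painting", "other")])    # -> 6
```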

The main advantage of this latter approach is that the labels remain correlated, unlike in the methods described above, which predict each label independently of the others. In many tasks, correlation between labels can be a significant factor. For example, imagine you want to classify email as spam or not spam, and at the same time as ordinary or important. You would probably want to exclude predictions such as [spam, important].

7.5. Ensemble Learning


The fundamental algorithms that we covered in Chapter 3 have their limitations. Because of their simplicity, they sometimes cannot produce a model that is accurate enough for your task. In such cases you could try deep neural networks; however, in practice deep neural networks require a significant amount of labeled data, which you may not have. Another way to increase the effectiveness of simple learning algorithms is ensemble learning.

Ensemble learning is a learning paradigm based on training not one super-accurate model, but a large number of low-accuracy models, and then combining the predictions made by those weak models to obtain a more accurate meta-model.

Low-accuracy models are usually trained by weak learning algorithms, which cannot learn complex models and are therefore fast at both training and prediction time. The decision tree learning algorithm is most often used as the weak algorithm; it usually stops splitting the training set after just a few iterations. The resulting trees are small and not particularly accurate, but, as the idea of ensemble learning suggests, if the trees are not identical and each tree is at least slightly better than random guessing, we can obtain high accuracy by combining a large number of such trees.

To obtain the final prediction for an input x, the predictions of all weak models are combined using some form of weighted voting. The specific form of vote weighting depends on the algorithm, but the essence is the same: if, collectively, the weak models predict that the email is spam, we assign the label spam to the example x. The two main methods for building ensembles are boosting and bagging (bootstrap aggregating). The Russian translations of these terms are imprecise and have not caught on, so the English terms are used here.



7.5.1. Boosting and Bagging


The idea of boosting is to use the original training data and iteratively build several models with a weak algorithm.

Each new model differs from the previous ones in that, when building it, the weak algorithm tries to “fix” the errors made by the previous models. The final ensemble model is a combination of these many iteratively built weak models.

The essence of bagging is to create many “copies” of the training data (each copy slightly different from the others), apply the weak algorithm to each copy to obtain several weak models, and then combine them. A widely used and effective machine learning algorithm based on the idea of bagging is random forest.

7.5.2. Random Forest


The “classic” bagging algorithm works as follows. B random samples $S_b$ (for each b = 1, ..., B) are created from the existing training set, and a decision tree model $f_b$ is built using each sample $S_b$. To obtain a sample $S_b$ for some b, sampling with replacement is used. That is, an empty sample is created first, and then an example is picked at random from the training set and its exact copy is placed in $S_b$, while the example itself remains in the original training set. Examples keep being picked at random until $|S_b| = N$.

As a result of training, B decision trees are obtained. The prediction for a new example x, in the case of regression, is determined as the average of the B predictions

$\hat{y} \leftarrow \hat{f}(\mathbf{x}) \;\stackrel{\text{def}}{=}\; \frac{1}{B}\sum_{b=1}^{B} f_b(\mathbf{x}),$

or by a majority vote in case of classification.
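Below is a sketch of classic bagging for regression under illustrative assumptions (toy data, default tree settings, B = 50): B bootstrap samples, one decision tree per sample, and the final prediction as the average of the B tree predictions.

```python
# Classic bagging for regression, written out explicitly.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

B = 50
trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))    # sample S_b with replacement, |S_b| = N
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

x_new = rng.random((1, 3))
prediction = np.mean([t.predict(x_new)[0] for t in trees])   # average of the B predictions
print(prediction)
```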

Random forest differs from classic bagging in only one way: it uses a modified tree learning algorithm that, at each split during learning, examines a random subset of the features. This is done to decorrelate the trees: if one or a few features have strong predictive ability, many trees will choose them for splitting the data, which leads to a large number of correlated trees in the “forest”. Correlated trees prevent prediction accuracy from improving, because the high effectiveness of an ensemble of models relies on good models being likely to agree on the same prediction while bad models are likely to disagree and give different predictions. Correlation makes the bad models more likely to agree, which distorts the voting pattern or skews the average.

The most important hyperparameters to tune are the number of trees B and the size of the random subset of features to consider at each split.
Random forest is one of the most widely used ensemble learning algorithms. What makes it so effective? The reason is that, by using several samples of the original dataset, we reduce the variance of the final model. Remember that low variance means a low propensity to overfit. Overfitting happens when the model tries to explain small variations in the dataset, because the dataset is just a small sample of all possible examples of the phenomenon we are trying to model. If the training set was formed in an unfortunate way, it can contain some undesirable (but unavoidable) artifacts: noise, outliers, and over- or under-represented data. By creating several random samples with replacement from the training set, we reduce the effect of these artifacts.
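As an illustration (not from the book), these two hyperparameters map to n_estimators and max_features in scikit-learn's RandomForestClassifier; the dataset and the values shown are arbitrary.

```python
# Tuning knobs from the text: number of trees B and the feature-subset size per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,    # B: number of bootstrapped trees
    max_features="sqrt"  # size of the random subset of features tried at each split
).fit(X, y)

print(forest.predict(X[:5]))   # majority vote over the 300 trees
```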

7.5.3. Gradient Boosting


Another effective ensemble learning algorithm, based on the idea of boosting, is gradient boosting. First, consider the use of gradient boosting in regression. To build an effective regression model, we start with a constant model $f = f_0(\mathbf{x})$ (the same as we did in ID3):

$f = f_0(\mathbf{x}) \;\stackrel{\text{def}}{=}\; \frac{1}{N}\sum_{i=1}^{N} y_i.$

Then we modify the labels of all examples i = 1, ..., N in the training set:

$\hat{y}_i \leftarrow y_i - f(\mathbf{x}_i), \qquad (7.2)$

where $\hat{y}_i$, called the residual, is the new label of example $\mathbf{x}_i$.

Now we use the modified training set, with residuals instead of the original labels, to build a new decision tree model, $f_1$. The boosting model is now defined as $f \leftarrow f_0 + \alpha f_1$, where α is the learning rate (a hyperparameter).

Then we recompute the residuals using Equation 7.2, replace the labels in the training data once again, train a new decision tree model $f_2$, redefine the boosting model as $f \leftarrow f_0 + \alpha f_1 + \alpha f_2$, and repeat the process until the predefined maximum number M of trees is combined.

Let's build an intuition for what is happening here. By computing the residuals, we determine how well (or poorly) the target of each training example is predicted by the current model f. We then train another tree to fix the errors of the current model (which is why we use residuals instead of the actual labels) and add this new tree to the existing model with some weight α. As a result, each new tree added to the model partially corrects the errors made by the previous trees. The process continues until the maximum number M (another hyperparameter) of trees is combined.
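A simplified sketch of this loop (not the book's code; the data, M, α, and the tree depth are illustrative assumptions):

```python
# Gradient boosting for regression: start from the mean, then repeatedly
# fit shallow trees to the residuals and add them with learning rate alpha.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.random((300, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.standard_normal(300)

M, alpha = 100, 0.1
f0 = y.mean()                                   # constant initial model f_0
prediction = np.full_like(y, f0)
trees = []

for m in range(M):
    residuals = y - prediction                  # Equation 7.2: the new labels
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    prediction += alpha * tree.predict(X)       # f <- f + alpha * f_m

def predict(X_new):
    return f0 + alpha * sum(t.predict(X_new) for t in trees)

print(predict(X[:3]), y[:3])
```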

Now let's answer the question of why this algorithm is called gradient boosting. In gradient boosting we do not compute a gradient, unlike what we did in Chapter 4 when solving the linear regression problem. To see the similarity between gradient boosting and gradient descent, recall why we computed the gradient in linear regression: to find the direction in which to move the parameter values so as to minimize the MSE cost function. The gradient shows the direction but not how far to go in that direction, so at each iteration we took a small step and then determined the direction again. The same happens in gradient boosting, except that instead of computing the gradient directly we use its estimate in the form of residuals: they show how the model should be adjusted to reduce the error (the residual).

In gradient boosting, three main hyperparameters are available for tuning: the number of trees, the learning rate, and the depth of the trees. All three affect the accuracy of the model. The depth of the trees also affects the speed of training and prediction: the smaller the depth, the faster.
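For illustration (values are arbitrary, not recommendations), these three hyperparameters map directly onto scikit-learn's GradientBoostingRegressor:

```python
# Number of trees, learning rate, and tree depth as constructor arguments.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=200,    # number of trees M
    learning_rate=0.05,  # alpha
    max_depth=3          # depth of each tree
).fit(X, y)
```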

It can be shown that training on residuals optimizes the overall model f for the mean squared error criterion. Here you can see the difference from bagging: boosting reduces bias (or underfitting) rather than variance. As a result, boosting is prone to overfitting. However, overfitting can largely be avoided by tuning the depth and the number of trees.

Gradient boosting for classification is similar, but the steps are slightly different. Consider the case of binary classification. Suppose there are M regression decision trees. By analogy with logistic regression, the prediction of the ensemble of decision trees is modeled using the sigmoid function:

$\Pr(y = 1 \mid \mathbf{x}, f) \;\stackrel{\text{def}}{=}\; \frac{1}{1 + e^{-f(\mathbf{x})}},$

where $f(\mathbf{x}) \stackrel{\text{def}}{=} \sum_{m=1}^{M} f_m(\mathbf{x})$ and $f_m$ is a regression tree.

And again, as in logistic regression, the principle of maximum likelihood is applied when trying to find a model f that maximizes $\Pr(y = 1 \mid \mathbf{x}, f)$. Similarly, to avoid numerical overflow, we maximize the sum of log-likelihoods rather than the product of likelihoods.
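Written out in the notation above (one standard way to express it; the symbol $L_i$ is introduced here for the per-example term), the quantity being maximized is

$\sum_{i=1}^{N} L_i, \qquad L_i \;\stackrel{\text{def}}{=}\; y_i \ln\big[\Pr(y_i = 1 \mid \mathbf{x}_i, f)\big] + (1 - y_i)\ln\big[1 - \Pr(y_i = 1 \mid \mathbf{x}_i, f)\big].$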

The algorithm starts with the initial constant model $f = f_0 = \ln\frac{p}{1 - p}$, where $p = \frac{1}{N}\sum_{i=1}^{N} y_i$. (It can be shown that such an initialization is optimal for the sigmoid function.) Then, at each iteration m, a new tree $f_m$ is added to the model. To find the best tree $f_m$, the partial derivative $g_i$ of the current model is first computed for each i = 1, ..., N:
$g_i = \frac{\partial L_i}{\partial f},$

where f is the ensemble classifier model built at the previous iteration m − 1. To calculate $g_i$, we need to find the derivatives of $\ln[\Pr(y_i = 1 \mid \mathbf{x}_i, f)]$ with respect to f for all i. Note that $\ln[\Pr(y_i = 1 \mid \mathbf{x}_i, f)] \stackrel{\text{def}}{=} \ln\!\left[\frac{1}{1 + e^{-f(\mathbf{x}_i)}}\right]$, and the derivative with respect to f of the right-hand term in this equation is
$\frac{1}{e^{f(\mathbf{x}_i)} + 1}.$
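As a quick check of that value (a reconstruction of the intermediate step, not the book's own derivation):

$\frac{d}{df}\,\ln\!\left[\frac{1}{1 + e^{-f(\mathbf{x}_i)}}\right] = \frac{d}{df}\Big(-\ln\big(1 + e^{-f(\mathbf{x}_i)}\big)\Big) = \frac{e^{-f(\mathbf{x}_i)}}{1 + e^{-f(\mathbf{x}_i)}} = \frac{1}{e^{f(\mathbf{x}_i)} + 1}.$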

Then the training set is transformed by replacing the original label $y_i$ with the corresponding partial derivative $g_i$, and a new tree $f_m$ is built using the transformed training set. Next, the optimal update step $\rho_m$ is found as
$\rho_m \leftarrow \arg\max_{\rho}\, L(f + \rho f_m),$
where L denotes the likelihood being maximized.

At the end of iteration m, we update the ensemble model f by adding the new tree $f_m$:
$f \leftarrow f + \alpha \rho_m f_m.$

Iterations continue until the condition m = M is fulfilled, after which the training stops and the ensemble model f is obtained.
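Below is a simplified sketch of this classification loop under explicit assumptions: the step size $\rho_m$ is fixed to 1 instead of being found by a line search, and the data and hyperparameters are illustrative.

```python
# Simplified gradient boosting for binary classification (rho_m fixed to 1).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((400, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

M, alpha = 50, 0.3
p0 = y.mean()
f = np.full(len(y), np.log(p0 / (1 - p0)))    # constant initial model f_0 (log-odds)
trees = []

for m in range(M):
    g = y - sigmoid(f)                         # partial derivatives dL_i/df
    tree = DecisionTreeRegressor(max_depth=2).fit(X, g)   # fit a tree to the g_i
    trees.append(tree)
    f += alpha * tree.predict(X)               # f <- f + alpha * rho_m * f_m, with rho_m = 1

prob = sigmoid(f)                              # Pr(y = 1 | x, f) on the training set
print(((prob > 0.5) == y).mean())              # training accuracy
```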

Gradient boosting is one of the most powerful machine learning algorithms, not only because it produces very accurate models, but also because it can handle huge datasets with millions of examples and features. As a rule, it outperforms random forest in accuracy, but because of its sequential nature it can be significantly slower to train.
