The book "Machine Learning without words"

Hello, khabrozhiteli! Everything you really need to know about machine learning fits into a couple of hundred pages.

Let's start with a simple truth: machines do not learn. What is typically called machine learning is finding a mathematical formula that, when applied to a set of input data (called training data), produces the desired results.

Andriy Burkov has tried to include everything a person needs to become an excellent modern analyst or machine learning specialist. What fits into a couple of hundred pages here is stretched out over thousands of pages in other books. Typical books on machine learning are conservative and academic; here the emphasis is on algorithms and methods that are useful in everyday work.

Excerpt. 9.2.3. Determining the number of clusters


The most important question is: how many clusters are there in the dataset? When the feature vectors are one-, two- or three-dimensional, you can plot the data and see the "clouds" of points in the feature space. Each cloud is a potential cluster. However, for D-dimensional data with D > 3, drawing such a plot is problematic.

One way to determine a reasonable number of clusters is based on the idea of prediction strength. The idea is to split the data into training and test sets, as is done in supervised learning. Having selected the training and test sets Str of size Ntr and Ste of size Nte, you fix the number of clusters k, run the clustering algorithm C on Str and Ste, and obtain the clustering results C(Str, k) and C(Ste, k).
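To make the procedure concrete, here is a minimal sketch in Python. The book does not prescribe a library; scikit-learn, the synthetic data and the variable names below are assumptions made for illustration only.

```python
# A minimal sketch: split the data and cluster both halves for a fixed k.
# The data X here is a random placeholder, not the book's dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

X = np.random.RandomState(0).randn(500, 2)            # placeholder feature matrix (N x D)
S_tr, S_te = train_test_split(X, test_size=0.5, random_state=0)

k = 4                                                  # the number of clusters to evaluate
C_tr = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S_tr)   # C(Str, k)
C_te = KMeans(n_clusters=k, n_init=10, random_state=0).fit(S_te)   # C(Ste, k)
```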

Let A be the clustering C(Str, k) obtained for the training set. The clusters in A can be viewed as regions: if an example falls into one of these regions, it belongs to the corresponding cluster. For example, if we apply the k-means algorithm to some dataset, the result is a partition of the feature space into k polygonal regions, as shown in Fig. 9.2.

We define an Nte × Nte co-membership matrix D[A, Ste], whose element D[A, Ste](i, i′) = 1 if and only if the examples xi and xi′ from the test set belong to the same cluster according to the partition A. Otherwise, D[A, Ste](i, i′) = 0.
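Continuing the sketch above, the co-membership matrix might be computed as follows; the helper name and the use of KMeans.predict to assign test examples to the regions of A are assumptions, not part of the book's text.

```python
# Sketch of the co-membership matrix D[A, Ste].
import numpy as np

def co_membership(labels):
    """D(i, i') = 1 iff examples i and i' received the same cluster label."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

# Assign each test example to a region of the training partition A = C(Str, k),
# then check which pairs of test examples land in the same region.
labels_te_under_A = C_tr.predict(S_te)
D_A_Ste = co_membership(labels_te_under_A)             # Nte x Nte matrix
```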

Now let's pause and see what we have. We built a partition A of the training data into k clusters. Then we constructed a co-membership matrix that indicates whether two examples from the test set belong to the same cluster according to A.

Obviously, if k is reasonable, then two examples that belong to the same cluster in the clustering C(Ste, k) will most likely also belong to the same cluster in the clustering C(Str, k). On the other hand, if k is not reasonable (too high or too low), then the partitions based on the training and test data are likely to be less consistent.

Fig. 9.3 shows the data used, and Fig. 9.4 illustrates the idea. The plots in Fig. 9.4a and 9.4b show the results of C(Str, 4) and C(Ste, 4) with the corresponding cluster regions. Fig. 9.4c shows the test data plotted over the cluster regions obtained by clustering the training data. In Fig. 9.4c you can see that the orange test examples no longer all belong to the same cluster according to the regions obtained from the training data. As a result, many zeros appear in the matrix D[A, Ste], which in turn suggests that k = 4 is probably not the best number of clusters.

More formally, the prediction strength of the number of clusters k is defined as

ps(k) = min_{j = 1, …, k} 1/(|Aj|(|Aj| − 1)) Σ_{i, i′ ∈ Aj, i ≠ i′} D[A, Ste](i, i′),

where A = C(Str, k), Aj is the jth cluster from the partition C(Ste, k) and |Aj| is the number of examples in the cluster Aj.
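The definition translates directly into code. Below is a sketch that follows the formula above, building on the earlier snippets; the function name and loop structure are illustrative assumptions.

```python
# Sketch of ps(k): the minimum, over test clusters Aj, of the fraction of pairs
# in Aj that also share a cluster under the training partition A (encoded in D_A_Ste).
import numpy as np

def prediction_strength(D_A_Ste, labels_te):
    strengths = []
    for j in np.unique(labels_te):
        idx = np.flatnonzero(labels_te == j)           # indices of test examples in cluster Aj
        n_j = len(idx)
        if n_j < 2:                                    # a singleton cluster has no pairs
            continue
        block = D_A_Ste[np.ix_(idx, idx)]
        same_pairs = block.sum() - n_j                 # drop the diagonal (i == i') entries
        strengths.append(same_pairs / (n_j * (n_j - 1)))
    return min(strengths)

ps_k = prediction_strength(D_A_Ste, C_te.labels_)      # prediction strength for the chosen k
```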


In other words, given the partition C(Str, k), for each test cluster we compute the fraction of pairs of examples in that cluster that also fall into the same cluster according to the partition determined by the centroids of the training set. The prediction strength is the minimum of this value over the k test clusters.

Experiments show that a reasonable number of clusters is the largest k for which ps(k) is above 0.8. Figure 9.5 shows examples of determining the prediction strength for different values of k for data consisting of two, three and four clusters.
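A sketch of the resulting model-selection loop is given below: it scans a range of k and picks the largest one with ps(k) above 0.8. The range of k, the helper functions and the fallback to k = 1 are assumptions, not part of the book's text.

```python
# Sketch: choose the largest k with ps(k) > 0.8, reusing the helpers defined above.
from sklearn.cluster import KMeans

def ps_for_k(S_tr, S_te, k, seed=0):
    C_tr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(S_tr)   # C(Str, k)
    C_te = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(S_te)   # C(Ste, k)
    D = co_membership(C_tr.predict(S_te))                                 # D[A, Ste]
    return prediction_strength(D, C_te.labels_)

scores = {k: ps_for_k(S_tr, S_te, k) for k in range(2, 9)}
best_k = max((k for k, s in scores.items() if s > 0.8), default=1)
```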

For non-deterministic clustering algorithms, such as k-means, which can produce different partitions depending on the initial positions of the centroids, it is recommended to perform several runs of the clustering algorithm for the same k and compute the average prediction strength over those runs.
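For completeness, averaging over several runs might look like the following sketch, which simply reruns the helper above with different random seeds; the number of runs is an arbitrary assumption.

```python
# Sketch: average ps(k) over several k-means runs to smooth out the dependence
# on the initial centroid positions.
import numpy as np

def average_prediction_strength(S_tr, S_te, k, n_runs=10):
    return np.mean([ps_for_k(S_tr, S_te, k, seed=r) for r in range(n_runs)])
```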


Another effective method for estimating the number of clusters is the gap statistic. Other, less automated methods that some analysts still use include the elbow method and the average silhouette method.

» More information about the book can be found on the publisher's website
» Contents
» Excerpt

For khabrozhiteli, a 25% discount with the coupon: Machine Learning

When you pay for the paper version of the book, the e-book is sent to your e-mail.
