Taking the EM algorithm apart into small bricks



In this article, as you have probably already guessed, we will talk about how the EM algorithm works. First of all, the article may be of interest to those who are gradually joining the community of data scientists. The material will be most useful to those who have recently started the third course, "Searching for data structures", within the "Machine Learning and Data Analysis" specialization from MIPT and Yandex.

The material presented here is, in a sense, a supplement to the first week of the aforementioned course: it helps answer some important questions about how the EM algorithm actually works. To get the most out of it, our esteemed reader should be able to perform basic matrix operations (matrix multiplication, finding the determinant and the inverse of a matrix), understand the basics of probability theory and mathematical statistics, and, of course, have at least a rough idea of the basic clustering algorithms and of what clustering means in machine learning. That said, you can read the article without this background; quite a bit of it will still be clear :)

Also, following an old tradition, the article will not contain deep theoretical research but will be filled with simple and understandable examples. Each subsequent example explains the operation of the EM algorithm a little more deeply than the previous one, which ultimately leads us straight to the analysis of the algorithm itself. Code is provided for each example. All the code is written in Python 2.7, for which I apologize in advance; it so happens that I am currently using this particular version, but after switching to Python 3 I will try to update the code in the article.



Let us get acquainted with the outline of the article:

1) We consider, in general terms, how the EM algorithm works.

2) We recall the main points of Bayes' theorem (formula):

$$P_{x_i}(B_j) = \frac{P(B_j)\,P_{B_j}(x_i)}{P(x_i)}$$



3) We consider the first example based on Bayes' theorem, in which all elements of the formula (more precisely, the probabilities $P(B_j)$ and $P_{B_j}(x_i)$) to the right of the equals sign are known. In this case, we only need to understand correctly which numbers go where and then perform elementary operations.

4) We consider the second example based on Bayes' theorem. To solve the problem, we will additionally need to calculate the probability density of a certain event $x_i$ given the hypothesis $B_j$, that is, $\rho_{B_j}(x_i)$. For the calculations we will be given the parameters of the random variable, namely the mathematical expectation μ and the standard deviation σ. The probability density is calculated by the following formula:

$$\rho_{B_j}(x_i) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right)$$


Using the above formula assumes that the random variable is one-dimensional and normally distributed (in other words, follows the Gaussian, or Gauss-Laplace, distribution).

5) We consider the third example, which differs from the previous one only in that it deals with a multivariate (two-dimensional in our case) normal distribution of the random variable. In this case, the density is calculated by the following formula:

$$\rho_{B_j}(x_i) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\right)$$



6) Finally, we modify the third example in such a way as to clearly demonstrate the operation of the EM algorithm.

7) In conclusion, we compare the quality of the algorithm we have built with the quality of the EM algorithm built into the sklearn library (the sklearn.mixture.GaussianMixture class).

If, instead of the formulas given in the outline, you saw something resembling Japanese characters, please do not be scared: after Bayes' theorem (the second item of the outline) we will work through examples that you will almost certainly understand, at least intuitively. Well, let's go!

EM algorithm in general


In the course on which this article is based, the EM algorithm is presented as one of the clustering methods. In other words, it is an unsupervised machine learning method, used when we do not know the true answers in advance. In this article we will also consider the algorithm as part of a method for clustering a normally distributed random variable. It should be noted, however, that the algorithm has a much wider range of applications. The EM algorithm (expectation-maximization) is a general method for finding maximum likelihood estimates in models with hidden variables; it allows complex probability distributions to be built (approximated) from a mixture of distributions.
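In the Gaussian mixture setting considered below, this means the approximated density can be written (using the notation introduced in the outline above) as

$$p(x) = \sum_{j=1}^{k} w_j\,\rho_{B_j}(x),$$

where $w_j$ are the weights (probabilities) of the hypotheses and $\rho_{B_j}$ are normal densities whose parameters the algorithm estimates.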

The EM algorithm in clustering problems is used as an iterative algorithm, which performs two steps at each iteration:

E-step. At the first E-step we choose the hidden variables in some way, for example randomly; in our case these will be the mathematical expectation μ and the standard deviation σ. Using the chosen values, we calculate the probability of assigning each object to a particular cluster. On subsequent E-steps we use the hidden variables recomputed at the M-step.

M-step. At the M-step, in accordance with the probabilities of assigning each object to a particular cluster obtained at the E-step, we recalculate the hidden variables μ and σ.

Iterations are repeated until convergence. We can assume that convergence has occurred when the values of the hidden variables stop changing significantly, for example within a given tolerance. In the last example considered in the article we will not set such a constant and, accordingly, will not track the change of the hidden variables at each iteration. We will do it more simply: we will limit our algorithm to a fixed number of steps.

The M-step is simple enough to understand, and we will see this in the last example. We will focus on the E-step in the article.
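Before we do, here is a minimal, self-contained sketch of the whole E/M loop for a one-dimensional mixture of two Gaussians, written purely as an illustration for this section (the parameter values are the ones used in the examples below; the full two-dimensional version is built step by step later):

import numpy as np

np.random.seed(42)
# a mixture of two one-dimensional Gaussians: 6000 values around 64 and 4000 around 52
X = np.concatenate([np.random.normal(64, 3.5, 6000), np.random.normal(52, 2.0, 4000)])

w = np.array([0.5, 0.5])       # initial weights of the hypotheses
mu = np.array([60.0, 55.0])    # initial means, chosen roughly
sigma = np.array([1.0, 1.0])   # initial standard deviations, taken "from the ceiling"

for step in range(15):
    # E-step: probability of each object belonging to each cluster (the Bayes formula)
    dens = np.array([w[j] / (sigma[j] * np.sqrt(2 * np.pi)) *
                     np.exp(-(X - mu[j]) ** 2 / (2 * sigma[j] ** 2)) for j in range(2)])
    proba = dens / dens.sum(axis=0)
    # M-step: re-estimate the parameters from the probabilities just obtained
    w = proba.sum(axis=1) / X.shape[0]
    mu = (proba * X).sum(axis=1) / proba.sum(axis=1)
    sigma = np.sqrt((proba * (X - mu[:, None]) ** 2).sum(axis=1) / proba.sum(axis=1))
    # instead of a fixed number of steps one could stop once mu and sigma stop changing

print mu     # should end up close to [64, 52]
print sigma  # should end up close to [3.5, 2.0]
print w      # should end up close to [0.6, 0.4]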

The E-step is based on the Bayesian theorem, on the basis of which we determine the probability of each object to be assigned to one or another cluster. Let's remember what this theorem is about.

Recall the Bayes formula


$$P_{x_i}(B_j) = \frac{P(B_j)\,P_{B_j}(x_i)}{P(x_i)}$$



where $P_{x_i}(B_j)$ is the probability that the hypothesis $B_j$ holds given that the event $x_i$ has occurred;

$P(B_j)$ is the probability of the hypothesis $B_j$;

$P_{B_j}(x_i)$ is the probability of the event $x_i$ occurring given the hypothesis $B_j$;

$P(x_i)$ is the probability of the event $x_i$ occurring under any of the hypotheses (calculated by the total probability formula).

Provided that the event $x_i$ has already happened, the probability of the hypothesis $B_j$ is re-evaluated using the above formula. We can say that $P(B_j)$ is the a priori probability of the hypothesis (before the trial), and $P_{x_i}(B_j)$ is the posterior probability of the same hypothesis, re-evaluated after the event $x_i$, that is, given the fact that the event $x_i$ has reliably occurred.
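Written out for two hypotheses, the total probability formula that appears in the denominator is

$$P(x_i) = P(B_1)\,P_{B_1}(x_i) + P(B_2)\,P_{B_2}(x_i)$$

(in general, a sum over all hypotheses $B_j$). We will use exactly this computation in the examples below.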

First example using Bayes' theorem


Imagine that we received parts that were produced on two different machines. The following characteristics of the products received at the warehouse are known:

In total there are 10,000 parts ($N = 10{,}000$), of which 6,000 were produced on the first machine ($N_1 = 6000$) and 4,000 on the second ($N_2 = 4000$).

The proportion of standard (i.e., non-defective) products manufactured on the first machine is 0.9, the share of standard products manufactured on the second machine is 0.8.

We randomly extracted one part from the received parts, and it turned out to be standard.

We need to determine the probability that:

1) the part was produced on machine No. 1;

2) the part was produced on machine No. 2.

Solution

1) The event $x_i$ in our example is the extraction of a standard product.

2) Now let's define the hypotheses $B_j$. We have two batches of parts, which means two hypotheses:

Hypothesis $B_1$: the randomly drawn product was produced on machine No. 1.

Hypothesis $B_2$: the randomly drawn product was produced on machine No. 2.

Accordingly, the probability of drawing a product manufactured on the first machine is $P(B_1) = N_1/N = 6000/10{,}000 = 0.6$.

The probability of drawing a product manufactured on the second machine is $P(B_2) = N_2/N = 0.4$. At this point we do not care whether the product is standard or defective; what matters is which batch it came from.

3) The probability of drawing a standard product from the products manufactured on machine No. 1 (that is, under the first hypothesis) corresponds to the proportion of standard products produced on machine No. 1 and equals 0.9: $P_{B_1}(x_i) = 0.9$.

The probability of drawing a standard product under hypothesis No. 2 is $P_{B_2}(x_i) = 0.8$.

4) We determine the probability of drawing a standard product from the entire set of products, $P(x_i)$. In accordance with the total probability formula, $P(x_i) = P(B_1)P_{B_1}(x_i) + P(B_2)P_{B_2}(x_i) = 0.6 \cdot 0.9 + 0.4 \cdot 0.8 = 0.54 + 0.32 = 0.86$. Note that the remaining 0.14 is the probability of drawing a defective part from all the parts that arrived at the warehouse, and together the two sum to 1 (0.86 + 0.14 = 1).

5) Now we have all the data in order to solve the problem.

5.1) Determine the probability that a randomly drawn standard part was produced on machine No. 1:

$$P_{x_i}(B_1) = \frac{P(B_1)\,P_{B_1}(x_i)}{P(x_i)} = \frac{0.6 \cdot 0.9}{0.86} \approx 0.63$$



5.2) Determine the probability that a randomly drawn standard part was produced on machine No. 2:

$$P_{x_i}(B_2) = \frac{P(B_2)\,P_{B_2}(x_i)}{P(x_i)} = \frac{0.4 \cdot 0.8}{0.86} \approx 0.37$$



Thus, we have re-evaluated the hypotheses $B_1$ and $B_2$. After the re-evaluation, the hypotheses still form a complete group of events: $0.63 + 0.37 = 1$.
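For those who like to double-check such arithmetic in code, here is a tiny verification of the example above (an illustration written for this article, not part of the course code):

# priors of the hypotheses and the probabilities of a standard part on each machine
P_B1, P_B2 = 0.6, 0.4
P_x_B1, P_x_B2 = 0.9, 0.8

# total probability of drawing a standard part
P_x = P_B1 * P_x_B1 + P_B2 * P_x_B2
print P_x                               # 0.86
print round(P_B1 * P_x_B1 / P_x, 2)     # 0.63 - posterior probability of machine No. 1
print round(P_B2 * P_x_B2 / P_x, 2)     # 0.37 - posterior probability of machine No. 2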

Well, now is the time to move on to the second example.

Second example using Bayes' theorem with the parameters of the normal distribution of a random variable: the mathematical expectation μ and the standard deviation σ



Let's slightly modify the conditions of the previous example. Suppose that we do not know the proportion of standard products manufactured on machines No. 1 and No. 2, but that instead we know the average diameters of the products and the standard deviation of the diameter for each machine.

Machine No. 1 produces parts with a mean diameter of 64 mm and a standard deviation of 3.5 mm.

Machine No. 2 produces parts with a mean diameter of 52 mm and a standard deviation of 2 mm.

An important condition is that the entire set of products is described by the normal distribution or Gaussian distribution.

Otherwise, the conditions are the same, we write them:

$N_1 = 6000$, $\mu_1 = 64$, $\sigma_1 = 3.5$

$N_2 = 4000$, $\mu_2 = 52$, $\sigma_2 = 2$

We assume that in the process of product acceptance there was a small incident, as a result of which all products were mixed.

Our task is to sort the parts and, for each one, determine the probability that it was produced on machine No. 1 or machine No. 2. We will also assume that a part was produced on the machine for which this probability is higher.

Solution

First of all, let's analyze the solution approach. We could easily calculate the probability of each hypothesis, but we already know their values from the previous example: $P(B_1) = 0.6$, $P(B_2) = 0.4$. But how do we calculate the probability of drawing a particular part under each of the available hypotheses? It's simple! Instead of a probability, we will use the probability density of the drawn part, evaluated at its diameter value, separately for each hypothesis.

To do this, we use the well-known formula:

$$\rho_{B_j}(x_i) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right)$$



where $x_i$ is the random variable (in our case, the actual diameter of the product);

$\mu_j$ is the mathematical expectation of the random variable under the $j$-th hypothesis (in our case, the mean diameter of products produced on the $j$-th machine);

$\sigma_j$ is the standard deviation of the random variable under the $j$-th hypothesis (in our case, the deviation from the mean diameter of products produced on the $j$-th machine).

Thus, we make a clever substitution: probability density in place of probability in the numerator of the Bayes formula, and we make the analogous replacement in the denominator.

We adapt the formulas specifically for our case.

The probability that a part with diameter $x_i$ was produced on machine No. 1 is determined by the formula:

$$P_{x_i}(B_1) = \frac{w_1\,\rho_{B_1}(x_i)}{w_1\,\rho_{B_1}(x_i) + w_2\,\rho_{B_2}(x_i)}$$



The probability that a part with diameter $x_i$ was produced on machine No. 2:

$$P_{x_i}(B_2) = \frac{w_2\,\rho_{B_2}(x_i)}{w_1\,\rho_{B_1}(x_i) + w_2\,\rho_{B_2}(x_i)}$$



Note that we have replaced the notation for the probability of a hypothesis: $P(B_j)$ is now written as $w_j$. This matches the notation used in the course mentioned above, and we will stick to it from here on.

Now we are fully ready to solve the problem.
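Before running the full code, here is a tiny worked example for a single part (an illustration for this article): suppose a part with a diameter of 60 mm was drawn; its posterior probabilities are then computed like this:

import math

w1, w2 = 0.6, 0.4
mu_1, sigma_1 = 64., 3.5
mu_2, sigma_2 = 52., 2.

# one-dimensional Gaussian density, the same formula as above
def density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x = 60.
p1 = w1 * density(x, mu_1, sigma_1)
p2 = w2 * density(x, mu_2, sigma_2)
print round(p1 / (p1 + p2), 3)   # probability that the part came from machine No. 1
print round(p2 / (p1 + p2), 3)   # probability that the part came from machine No. 2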

We look at the code

Import libraries
# import the libraries we will need
import numpy as np
import random
import math
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score


Functions Used in the Second Example
# function for calculating the one-dimensional Gaussian probability density
# of a random variable given its parameters: mathematical expectation and standard deviation
def gaus_func_01(mu,sigma,x):
    return math.e**(-(x-mu)**2/(2*sigma**2)) / (sigma*(2*math.pi)**0.5)

# function for calculating the probability of each part belonging to each machine
def proba_x(X, w1, w2, mu_1, mu_2, sigma_1, sigma_2):
    proba_X = []
    for i in X:
        P1_x = gaus_func_01(mu_1,sigma_1,i)
        P2_x = gaus_func_01(mu_2,sigma_2,i)
        P_x = w1*P1_x + w2*P2_x
        P_x_1 = (w1*P1_x)/P_x
        P_x_2 = (w2*P2_x)/P_x
        proba_temp = []
        proba_temp.append(P_x_1)
        proba_temp.append(P_x_2)
        proba_X.append(proba_temp)
    return proba_X

# function for assigning each part to a machine based on the calculated probabilities
def pred_x(proba_X, limit_proba):
    pred_X = []
    for x in proba_X:
        if x[0] >= limit_proba:
            pred_X.append(1)
        else:
            pred_X.append(2)
    return np.array(pred_X)

# function for plotting the results
def graph_01(X, pred_X, mu_1, sigma_1, mu_2, sigma_2):
    true_pred = []
    false_pred_1 = []
    false_pred_2 = []
    for i in range(X.shape[0]):
        if pred_X[i] == y[i]:
            true_pred.append([X[i], -0.025])
        else:
            if y[i] == 1:
                false_pred_1.append([X[i], -0.0075])
            else:
                false_pred_2.append([X[i], -0.015])

    false_pred_1 = np.array(false_pred_1)            
    false_pred_2 = np.array(false_pred_2)
    true_pred = np.array(true_pred)

    x_theory = np.linspace(42, 85, 20000)
    y_theory_1 = []
    for x in x_theory:
        y_theory_1.append(gaus_func_01(mu_1,sigma_1,x))
    y_theory_2 = []
    for x in x_theory:
        y_theory_2.append(gaus_func_01(mu_2,sigma_2,x))

    plt.figure(figsize=(18, 8))    
    plt.plot(
        x_theory, y_theory_1, color = 'green', lw = 2, label = 'Theoretical probability density for machine 1')
    plt.plot(
        x_theory, y_theory_2, color = 'firebrick', lw = 2, label = 'Theoretical probability density for machine 2')
    plt.hist(
        X[:N1], bins = 'auto', color='#539caf', normed = True, alpha = 0.35, label = 'machine tool products 1')
    plt.hist(
        X[N1:N], bins = 'auto', color='sandybrown', normed = True, alpha = 0.75, label = 'machine tool products 2')
    plt.plot(mu_1, 0, 'o', markersize = 11, color = 'blue', label = 'Mu 1')
    plt.plot(mu_2, 0, 'o', markersize = 11, color = 'red', label = 'Mu 2')

    plt.plot([mu_1 - sigma_1, mu_1 - sigma_1], [0,0.85*np.max(y_theory_1)],
             ':', lw = 3, color = 'blue', alpha = 0.55, label = 'Mu1 - sigma1')
    plt.plot([mu_1 + sigma_1, mu_1 + sigma_1], [0,0.85*np.max(y_theory_1)],
             ':', lw = 3, color = 'blue', alpha = 0.55, label = 'Mu1 + sigma1')
    plt.plot([mu_2 - sigma_2, mu_2 - sigma_2], [0,0.85*np.max(y_theory_2)],
             ':', lw = 3, color = 'red', alpha = 0.55, label = 'Mu2 - sigma2')
    plt.plot([mu_2 + sigma_2, mu_2 + sigma_2], [0,0.85*np.max(y_theory_2)],
             ':', lw = 3, color = 'red', alpha = 0.55, label = 'Mu2 + sigma2')

    plt.plot([mu_1 - 2 * sigma_1, mu_1 - 2 * sigma_1], [0, 0.9*0.5 * np.max(y_theory_1)],
             ':', lw = 2.5, color = 'blue', alpha = 0.35, label = 'Mu1 - 2*sigma1')
    plt.plot([mu_1 + 2 * sigma_1, mu_1 + 2 * sigma_1], [0, 0.9*0.5 * np.max(y_theory_1)],
             ':', lw = 2.5, color = 'blue', alpha = 0.35, label = 'Mu1 + 2*sigma1')
    plt.plot([mu_2 - 2 * sigma_2, mu_2 - 2 * sigma_2], [0, 0.9*0.5 * np.max(y_theory_2)],
             ':', lw = 2.5, color = 'red', alpha = 0.35, label = 'Mu2 - 2*sigma2')
    plt.plot([mu_2 + 2 * sigma_2, mu_2 + 2 * sigma_2], [0, 0.9*0.5 * np.max(y_theory_2)],
             ':', lw = 2.5, color = 'red', alpha = 0.35, label = 'Mu2 + 2*sigma2')

    plt.plot(false_pred_1[:,0], false_pred_1[:,1], 'o', markersize = 2.5, color = 'blue', alpha = 0.2, label = 'errors1')
    plt.plot(false_pred_2[:,0], false_pred_2[:,1], 'o', markersize = 2.5, color = 'red', alpha = 0.3, label = 'errors2')
    plt.plot(true_pred[:,0], true_pred[:,1], 'o', markersize = 3, color = 'green', alpha = 0.2, label = 'right answers')

    plt.xlabel('Caliber')
    plt.ylabel('Probability density')
    plt.legend()
    plt.show()


Specify the parameters
# specify the initial parameters
# number of parts produced on machine No. 1
N1 = 6000
# number of parts produced on machine No. 2
N2 = 4000
# total number of parts
N = N1+N2

# mean diameter of parts produced on machine No. 1
mu_1 = 64.
# standard deviation of the diameter of parts produced on machine No. 1
sigma_1 = 3.5

# mean diameter of parts produced on machine No. 2
mu_2 = 52
# standard deviation of the diameter of parts produced on machine No. 2
sigma_2 = 2.


We will calculate
X = np.zeros((N))
np.random.seed(seed=42)
# generate the diameters of parts produced on machine No. 1
X[:N1] = np.random.normal(loc=mu_1, scale=sigma_1, size=N1)
# diameters of parts produced on machine No. 2
X[N1:N] = np.random.normal(loc=mu_2, scale=sigma_2, size=N2)

# true answers (the machine label for each part)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))

# probability of the hypothesis that a part was produced on machine No. 1
w1 = float(N1)/N
# probability of the hypothesis that a part was produced on machine No. 2
w2 = float(N2)/N

# determine the probability of each part being produced on machine No. 1 or No. 2
proba_X = []
proba_X = proba_x(X, w1, w2, mu_1, mu_2, sigma_1, sigma_2)

# probability threshold: if the probability for machine No. 1 is at least this value, the part is assigned to machine No. 1
limit_proba = 0.5

# assign each part to a machine
pred_X = []
pred_X = pred_x(proba_X, limit_proba)

# proportion of correct answers
print 'Proportion of correct answers:', round(accuracy_score(y, pred_X),3)
print

print 'Chart No. 1'
graph_01(X, pred_X, mu_1, sigma_1, mu_2, sigma_2)


Chart # 1



What have we just done? We generated, in a pseudo-random way, 10,000 values described by the normal distribution: 6,000 values with a mean of 64 mm and a standard deviation of 3.5 mm, and 4,000 values with a mean of 52 mm and a standard deviation of 2 mm. Then, in accordance with the algorithm described above, we determined for each part the probability that it was produced on machine No. 1 or No. 2, and chose the hypothesis about the machine with the higher probability. Finally, we compared the results of our algorithm with the true answers.

At the output we got a proportion of correct answers of 0.986. In general, this is very good, given that we did the "training" without using the true answers.

Let's look at the chart. What should we pay attention to? Look at where the products that the algorithm identified incorrectly are located.

Firstly, we see algorithm errors only in the area where products manufactured on different machines have the same diameter. It looks quite logical.

Secondly, we see that the algorithm is mistaken only on those objects that are farthest from the true mean of their own cluster and, at the same time, quite close to the center of the other cluster.

But the most interesting thing is that the problems begin mainly beyond approximately μ ± 2σ; specifically, in our case, near the intersection of μ1 − 2σ1 and μ2 + 2σ2 (that is, around 64 − 2·3.5 = 57 mm and 52 + 2·2 = 56 mm).

Those who wish can "twist" the parameters and see how the quality of the algorithm changes. This will be useful for better assimilation of the material.

Well, we move on to the next example

The third example. Two-dimensional case


This time we will consider the case where we have data not only on the diameter of the products but also, for example, on the weight of each product. Let machine No. 1 produce parts weighing 14 g with a standard deviation of 1 g, and machine No. 2 produce parts weighing 9.5 g with a standard deviation of 0.7 g.

For the rest, we will use the data from the previous example; let us write it all down:

$N_1 = 6000$, $\mu_{11} = 64$, $\sigma_{11} = 3.5$, $\mu_{12} = 14$, $\sigma_{12} = 1$

$N_2 = 4000$, $\mu_{21} = 52$, $\sigma_{21} = 2$, $\mu_{22} = 9.5$, $\sigma_{22} = 0.7$

The situation is the same: all the parts were mixed into one big pile, and we need to sort them out and determine for each part the probability that it was produced on machine No. 1 or machine No. 2.

Solution

In general, the solution approach does not change from the previous example, except for the formula for the probability density of the random variable. For the multivariate case, the following formula is used:

$$\rho_{B_j}(x_i) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\right)$$



where $\Sigma_j$ is the covariance matrix of the random variables under the $j$-th hypothesis (in our case, the covariance matrix of the parameters of products produced on the $j$-th machine);

$|\Sigma_j|$ is the determinant of that covariance matrix;

$\mu_j$ is the mathematical expectation of the random variables under the $j$-th hypothesis (in our case, the mean values of products manufactured on the $j$-th machine);

$x_i$ is the random variable (in our case, the parameters of the $i$-th product).

This time, to understand the code, our respected reader will need some knowledge of matrix operations.
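As a quick sanity check of this formula (an optional aside for this article; it assumes the scipy library is available and is not part of the course code), one can compare a direct calculation with scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

mu_j = np.array([64., 14.])
sigma_j = np.array([[3.5 ** 2, 0.], [0., 1. ** 2]])   # covariance matrix (diagonal here)
x_i = np.array([63., 13.5])
k = 2                                                 # dimensionality of the data

direct = (1. / ((2 * np.pi) ** (k / 2.) * np.linalg.det(sigma_j) ** 0.5) *
          np.exp(-0.5 * (x_i - mu_j).dot(np.linalg.inv(sigma_j)).dot(x_i - mu_j)))
print direct
print multivariate_normal(mean=mu_j, cov=sigma_j).pdf(x_i)   # should print the same value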

Functions Used in the Third Example
# function for calculating the multivariate Gaussian probability density
# of a random variable given its parameters: the vector of means mu and the matrix sigma
def gaus_func_02(k, m, x, mu, sigma):
    pj_xi = []
    for j in range(k):
        det_sigma_j = np.linalg.det(sigma[j])
        factor_1 = 1 / (((2 * math.pi)**(k/2)) * ((det_sigma_j)**0.5))
        factor_2 = []
        for i in x:
#    the matrix product on the next line yields a 1x1 matrix,
#    so we cast it to float to obtain a scalar exponent;
#    np.matrix is used to keep the notation close to the formula
#    (row vector times inverse matrix times column vector)
            factor_2.append(math.e**float(
            -0.5 * np.matrix(i - mu[j]) * np.matrix(np.linalg.inv(sigma[j])) * np.matrix(i - mu[j]).T))
        pj_xi.append(factor_1 * np.array(factor_2))
    return np.array(pj_xi)

# function for calculating the probability of each part being produced on machine No. 1 or No. 2
def proba_func_02(pjxi, w, k):
    # total probability density of drawing the i-th part (total probability formula)
    P_X = []
    for j in range(k):
        P_X.append(w[j] * pjxi[j])
    P_X = np.sum(np.array(P_X), axis = 0)
    # probability that the part was produced on machine No. 1 or No. 2
    P_J_X = []
    for j in range(k):
        P_J_X.append(w[j] * pjxi[j] / P_X)
    return np.array(P_J_X)

# function for assigning each part to a machine based on the probabilities
def pred_x_02(proba_X, limit_proba):
    pred_X = []
    for x in proba_X[0]:
        if x >= limit_proba:
            pred_X.append(1)
        else:
            pred_X.append(2)
    return np.array(pred_X)

# function for plotting the distribution of parts according to the algorithm's answers
def graph_02_algorithm(pred_X, mu):
    # convert to an array
    pred_X = np.array(pred_X)

    # split the parts into two lists according to the algorithm's answers
    answers_1 = []
    answers_2 = []

    for i in range(pred_X.shape[0]):
        if pred_X[i] == 1:
            answers_1.append(X[i])
        else:
            answers_2.append(X[i])
    
    print ' "     "'
    plt.figure(figsize=(16, 6))  
    plt.plot(
        np.array(answers_1)[:,0], np.array(answers_1)[:,1], 'o', alpha = 0.7, color='sandybrown', label = 'Produced on machine #1')
    plt.plot(
        np.array(answers_2)[:,0], np.array(answers_2)[:,1], 'o', alpha = 0.45, color = 'darkblue', label = 'Produced on machine #2')
    plt.plot(mu[0][0], mu[0][1], 'o', markersize = 16, color = 'red', label = 'Mu 1')
    plt.plot(mu[1][0], mu[1][1], 'o',  markersize = 16, color = 'slateblue', label = 'Mu 2')
    plt.xlabel('Diameter')
    plt.ylabel('Weight')
    plt.legend()
    plt.show()
    
# function for plotting the true distribution of parts
def graph_02_true(X, mu):
    print ' "  "'
    plt.figure(figsize=(16, 6))  
    plt.plot(
        X[0:N1,0], X[0:N1,1], 'o', alpha = 0.7, color='sandybrown', label = 'Produced on machine #1')
    plt.plot(
        X[N1:N,0], X[N1:N,1], 'o', alpha = 0.45, color = 'darkblue', label = 'Produced on machine #2')
    plt.plot(mu[0][0], mu[0][1], 'o', markersize = 16, color = 'red', label = 'Mu 1')
    plt.plot(mu[1][0], mu[1][1], 'o',  markersize = 16, color = 'slateblue', label = 'Mu 2')
    plt.xlabel('Diameter')
    plt.ylabel('Weight')
    plt.legend()
    plt.show()


Specify the parameters
# number of clusters
k = 2

# number of parts produced on machine No. 1
N1 = 6000
# number of parts produced on machine No. 2
N2 = 4000
N = N1+N2

# number of features (diameter and weight)
m = 2

# mean diameter of parts from machine No. 1
mu_1_1 = 64.
# mean weight of parts from machine No. 1
mu_1_2 = 14.
# standard deviations of the features for machine No. 1
sigma_1_1 = 3.5
sigma_1_2 = 1.

# mean diameter of parts from machine No. 2
mu_2_1 = 52.
# mean weight of parts from machine No. 2
mu_2_2 = 9.5
# standard deviations of the features for machine No. 2
sigma_2_1 = 2.
sigma_2_2 = 0.7


We will calculate
X = np.zeros((N, m))
np.random.seed(seed=42)
# generate the parameters of parts produced on machine No. 1
X[:N1, 0] = np.random.normal(loc=mu_1_1, scale=sigma_1_1, size=N1)
X[:N1, 1] = np.random.normal(loc=mu_1_2, scale=sigma_1_2, size=N1)
# parts produced on machine No. 2
X[N1:N, 0] = np.random.normal(loc=mu_2_1, scale=sigma_2_1, size=N2)
X[N1:N, 1] = np.random.normal(loc=mu_2_2, scale=sigma_2_2, size=N2)

# true answers (used only to evaluate the quality, not in training)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))

# mathematical expectations of the features for each machine (known in this example)
mu  = np.array(([mu_1_1, mu_1_2], [mu_2_1, mu_2_2]))

# standard deviations of the features for each machine, written as matrices (known in this example)
sigma = np.array(([sigma_1_1, 0.],[0., sigma_1_2], [sigma_2_1, 0.],[0., sigma_2_2]))
sigma = sigma.reshape(k, m, m)

# probabilities of the hypotheses that a part was produced on machine No. 1 or No. 2
w = np.array([float(1./k), float(1./k)])

# run the calculations
pj_xi = gaus_func_02(k, m, X, mu, sigma)
proba_X = proba_func_02(pj_xi, w, k)

# probability threshold: if the probability for machine No. 1 is at least this value, the part is assigned to machine No. 1
limit_proba = 0.5

pred_X = pred_x_02(proba_X, limit_proba)
        
# proportion of correct answers
print 'Proportion of correct answers:', round(accuracy_score(y, pred_X),3)
print

graph_02_algorithm(pred_X, mu)

graph_02_true(X, mu)


Graph No. 2.1 “Distribution of products in accordance with the algorithm”


Graph No. 2.2 “True distribution of products”


By analogy with the previous example, we generated 10,000 values in accordance with the parameters μ and σ given above, wrote several functions for our algorithm, and launched it. The fundamental difference between this code and the code from the previous example is that this time we used matrix expressions for the calculations.

As a result, our algorithm achieves a proportion of correct answers equal to 0.998, which is really quite good.

The problems, as we can see, are the same: errors on parts that were produced on different machines but have similar dimensions and weights.

The code is written so that the reader can substitute any values of the parameters μ and σ and see how the algorithm behaves: in which cases the quality deteriorates and in which it improves. The main thing is not to forget to study the chart. Well, we move on to the final point of our analysis: the EM algorithm itself.

Meet the EM algorithm


We continue our example with the parts that arrived at the warehouse. But this time we only know that the products were manufactured on two different machines, that there are 10,000 of them, and that each part has a diameter and a weight; we know nothing else. The task has not changed: as before, for every part in the large, randomly mixed pile, we need to determine which machine it belongs to.

At first glance this sounds almost unrealistic, but in fact we have a powerful toolkit in our hands: the Bayes formula and the probability density formula for a random variable. Let's put all this goodness to use.

Solution

What will we do? As it should be in the EM algorithm, we first initialize the parameters.

The probability of the hypothesis that a part was produced on machine No. 1 is $w_1$, and the probability of the hypothesis that a part was produced on machine No. 2 is $w_2$. There are only two hypotheses, so at the first step each of them is set to 0.5.

We define the mathematical expectations μ of the random variables as follows: we shuffle all the products with a random function, divide the set into two equal parts, and for each part compute the mean value of each feature (diameter, weight).

The standard deviations σ we take, as they say, out of thin air: we set them equal to one for every feature and write them in covariance matrix format.

We are ready to make the first E-step of the algorithm. Using the initialized parameters of random variables, we determine the probability of each part being assigned to machine No. 1 or machine No. 2.

Actually, in this way, we took the first E-step.

Now it is the M-step's turn. Everything is simple here. After we have determined the probability of each part being produced on a particular machine, we can recalculate the probability of each hypothesis, $w_1$ and $w_2$, as well as μ and σ.
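For reference, the standard re-estimation formulas of a Gaussian mixture look as follows (the code below follows essentially this scheme, with two small deviations: it divides by the sum of the probabilities minus one and then works with element-wise square roots of the covariance matrices):

$$w_j = \frac{1}{N}\sum_{i=1}^{N} P_{x_i}(B_j), \qquad \mu_j = \frac{\sum_{i=1}^{N} P_{x_i}(B_j)\,x_i}{\sum_{i=1}^{N} P_{x_i}(B_j)}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} P_{x_i}(B_j)\,(x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} P_{x_i}(B_j)}$$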

We will perform 15 such iterations, each consisting of these two steps.

We look at the code.
Functions Used in the EM Algorithm
# function for the E-step
def e_step(x, k, m, n, w, mu, sigma):
    # probability density of the i-th part under the j-th hypothesis
    pj_xi = []
    for j in range(k):
        det_sigma_j = np.linalg.det(sigma[j])
        factor_1 = 1 / (((2 * math.pi)**(k/2)) * ((det_sigma_j)**0.5))
        factor_2 = []
        for i in x:
            factor_2.append(math.e**float(
                -0.5 * np.matrix(i - mu[j]) * np.matrix(np.linalg.inv(sigma[j])) * np.matrix(i - mu[j]).T))
        pj_xi.append(factor_1 * np.array(factor_2))
    pj_xi = np.array(pj_xi)
    
    # numerator of the Bayes formula: density times the weight of the j-th hypothesis
    pj_xi_w = []
    for j in range(k):
        pj_xi_w.append(pj_xi[j] * w[j])
    pj_xi_w = np.array(pj_xi_w)
    
    # total density of the i-th part over all hypotheses (denominator of the Bayes formula)
    sum_pj_xi_w = np.sum(pj_xi_w, axis = 0)
    
    # probability of assigning each part to each machine
    proba_xi = []
    for j in range(k):
        proba_xi.append(pj_xi_w[j] / sum_pj_xi_w)
    
    return np.array(proba_xi)

# function that, based on the calculated probabilities, assigns each part to a machine
# and records the algorithm's answers
def x_new(proba_xi):
    X1_new_ind = []
    X2_new_ind = []
    X_answers = []

    count = 0
    for x in proba_xi[0]:
        if x >= 0.5:
            X1_new_ind.append(count)
            X_answers.append(1)
        else:
            X2_new_ind.append(count)
            X_answers.append(2)
        count += 1
    # return the indices of parts assigned to machine No. 1 and No. 2, plus the answers
    return X1_new_ind, X2_new_ind, X_answers


# function for the M-step
def m_step(x, proba_xi,N):
    w_new = np.sum(proba_xi, axis = 1) / N
    
    # new mathematical expectations
    mu_new = (np.array((np.matrix(proba_xi) * np.matrix(X))).T / np.sum(proba_xi, axis = 1)).T
    
    # new covariances
    cov_new = []
    for mu in range(mu_new.shape[0]):
        X_cd = []
        X_cd_proba = []
        count = 0
        for x_i in x:
            cd = np.array(x_i - mu_new[mu])
            X_cd.append(cd)
            X_cd_proba.append(cd * proba_xi[mu][count])
            count += 1
        X_cd = np.array(X_cd)
        X_cd = X_cd.reshape(N, m)
        X_cd_proba = np.array(X_cd_proba)
        X_cd_proba = X_cd_proba.reshape(N, m)

        cov_new.append(np.matrix(X_cd.T) * np.matrix(X_cd_proba))
    cov_new = np.array((np.array(cov_new) / (np.sum(proba_xi, axis = 1)-1)))
    # off-diagonal covariances occasionally come out negative; taking the square root of a
    # negative number on the next step would break the calculation, so as a simplification we reset them to zero
    if cov_new[0][0][1] < 0:
        cov_new[0][0][1] = 0
    if cov_new[0][1][0] < 0:
        cov_new[0][1][0] = 0
    
    if cov_new[1][0][1] < 0:
        cov_new[1][0][1] = 0
    if cov_new[1][1][0] < 0:
        cov_new[1][1][0] = 0
    
    # new standard deviations (element-wise square root of the covariances)
    sigma_new = cov_new**0.5
    return w_new, mu_new, sigma_new


Specify the parameters
# specify the initial parameters
# number of parts produced on machine No. 1 (cluster 1)
N1 = 6000
# number of parts produced on machine No. 2 (cluster 2)
N2 = 4000
# total number of parts
N = N1 + N2

# number of clusters
k = 2

# mean diameter of parts from machine No. 1
mu_samples_1_1 = 64.
# mean weight of parts from machine No. 1
mu_samples_1_2 = 14.

# mean diameter of parts from machine No. 2
mu_samples_2_1 = 52.
# mean weight of parts from machine No. 2
mu_samples_2_2 = 9.5

# standard deviation of the diameter of parts from machine No. 1
sigma_samples_1_1 = 3.5
# standard deviation of the weight of parts from machine No. 1
sigma_samples_1_2 = 1.

# standard deviation of the diameter of parts from machine No. 2
sigma_samples_2_1 = 2.
# standard deviation of the weight of parts from machine No. 2
sigma_samples_2_2 = 0.7


We will calculate
X = np.zeros((N, 2))

np.random.seed(seed=42)
# generate the parameters of parts produced on machine No. 1
X[:N1, 0] = np.random.normal(loc=mu_samples_1_1, scale=sigma_samples_1_1, size=N1)
X[:N1, 1] = np.random.normal(loc=mu_samples_1_2, scale=sigma_samples_1_2, size=N1)
# generate the parameters of parts produced on machine No. 2
X[N1:N, 0] = np.random.normal(loc=mu_samples_2_1, scale=sigma_samples_2_1, size=N2)
X[N1:N, 1] = np.random.normal(loc=mu_samples_2_2, scale=sigma_samples_2_2, size=N2)

# number of features
m = X.shape[1]

# number of objects
n = X.shape[0]

# true answers about which machine produced each part (used only to evaluate the quality)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))

# initialize the probabilities of the hypotheses that a part was produced on machine No. 1 or No. 2
w = np.array([float(1./k), float(1./k)])

np.random.seed(seed = None)
# initialize the mathematical expectations as the means of two random halves of the sample
mu  = np.array(
    (np.mean(X[np.random.choice(n, n/k)], axis = 0), np.mean(X[np.random.choice(n, n/k)], axis = 0)))
# mu = np.array(([mu_samples_1_1, mu_samples_1_2],[mu_samples_2_1, mu_samples_2_2]))

# initialize the standard deviations as identity matrices (taken "from the ceiling")
sigma = np.array(([1., 0.],[0., 1.], [1., 0.],[0., 1.]))
# sigma = np.array(([sigma_samples_1_1, 0.],[0., sigma_samples_1_2], [sigma_samples_2_1, 0.],[0., sigma_samples_2_2]))
sigma = sigma.reshape(k, m, m)

# number of iterations of the EM algorithm
steps = 15
# run the EM algorithm
for i in range(steps):
    proba_xi = e_step(X, k, m, n, w, mu, sigma)
    w, mu, sigma = m_step(X, proba_xi,N)
    X1_new_ind, X2_new_ind, X_answers = x_new(proba_xi)
    print 'Iteration No.', i+1
    print
    print 'Mathematical expectations'
    print mu
    print
    print 'Standard deviations'
    print sigma
    print
    print 'Proportion of correct answers'
    print round(accuracy_score(y, X_answers),3)
    
    plt.figure(figsize=(16, 6))  
    plt.plot(
        X[X1_new_ind,0], X[X1_new_ind,1], 'o', alpha = 0.7, color='sandybrown', label = 'Produced on machine #1')
    plt.plot(
        X[X2_new_ind,0], X[X2_new_ind,1], 'o', alpha = 0.45, color = 'darkblue', label = 'Produced on machine #2')
    plt.plot(mu[0][0], mu[0][1], 'o', markersize = 16, color = 'red', label = 'Mu 1')
    plt.plot(mu[1][0], mu[1][1], 'o',  markersize = 16, color = 'purple', label = 'Mu 2')
    plt.xlabel('Diameter')
    plt.ylabel('Weight')
    plt.legend()
    plt.show()


Chart 3.1 “Distribution of parts at each iteration of the EM-algorithm”

All iterations
















What happened in the end? It turned out that, having no prior information about the probabilities of the hypotheses that a part was produced on machine No. 1 ($w_1$) or machine No. 2 ($w_2$), nor about the mathematical expectations and standard deviations of the random variables (μ and σ), we achieved quality comparable to the previous example, in which all of those parameters were known.

Let's compare the parameters that we set when generating the set of products (the so-called true parameters) with the parameters that the EM algorithm determined for us:

True parameters

$\mu_{11} = 64$, $\mu_{12} = 14$, $\sigma_{11} = 3.5$, $\sigma_{12} = 1$

$\mu_{21} = 52$, $\mu_{22} = 9.5$, $\sigma_{21} = 2.0$, $\sigma_{22} = 0.7$

Parameters determined by the algorithm

$\mu_{11} \approx 64.03$, $\mu_{12} \approx 13.99$, $\sigma_{11} \approx 3.44$, $\sigma_{12} \approx 1.23$

$\mu_{21} \approx 52.06$, $\mu_{22} \approx 9.54$, $\sigma_{21} \approx 1.65$, $\sigma_{22} \approx 0.77$

And with that, the article comes to an end. In conclusion, I will add that for data analysis purposes it makes no sense to reinvent the wheel and write the EM algorithm yourself; you can simply use the ready-made function from the sklearn library.

Let's see how sklearn's GaussianMixture handles our example.

Import module
from sklearn.mixture import GaussianMixture as GMM


Specify the parameters
# specify the initial parameters
# number of parts produced on machine No. 1 (cluster 1)
N1 = 6000
# number of parts produced on machine No. 2 (cluster 2)
N2 = 4000
# total number of parts
N = N1 + N2

# mean diameter of parts from machine No. 1
mu_samples_1_1 = 64.
# mean weight of parts from machine No. 1
mu_samples_1_2 = 14.

# mean diameter of parts from machine No. 2
mu_samples_2_1 = 52.
# mean weight of parts from machine No. 2
mu_samples_2_2 = 9.5

# standard deviation of the diameter of parts from machine No. 1
sigma_samples_1_1 = 3.5
# standard deviation of the weight of parts from machine No. 1
sigma_samples_1_2 = 1.

# standard deviation of the diameter of parts from machine No. 2
sigma_samples_2_1 = 2.
# standard deviation of the weight of parts from machine No. 2
sigma_samples_2_2 = 0.7


# number of clusters
k = 2

X = np.zeros((N, 2))

np.random.seed(seed=42)
# generate the parameters of parts produced on machine No. 1
X[:N1, 0] = np.random.normal(loc=mu_samples_1_1, scale=sigma_samples_1_1, size=N1)
X[:N1, 1] = np.random.normal(loc=mu_samples_1_2, scale=sigma_samples_1_2, size=N1)
# generate the parameters of parts produced on machine No. 2
X[N1:N, 0] = np.random.normal(loc=mu_samples_2_1, scale=sigma_samples_2_1, size=N2)
X[N1:N, 1] = np.random.normal(loc=mu_samples_2_2, scale=sigma_samples_2_2, size=N2)

# true answers about which machine produced each part (used only to evaluate the quality)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))


Run the EM algorithm
np.random.seed(1)
model = GMM(n_components=k, covariance_type='full')
model.fit(X)


temp_predict_X = model.predict(X)
X_answers = []
for i in range(X.shape[0]):
    if temp_predict_X[i] == 0:
        X_answers.append(1)
    else:
        X_answers.append(2)
        

print 'Proportion of correct answers'
print round(accuracy_score(y, X_answers),3)
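You can also look at the parameters recovered by sklearn and compare them with the true ones; weights_, means_ and covariances_ are standard attributes of a fitted GaussianMixture object (this check is a small addition to the example above):

# mixture weights, analogous to w1 and w2
print model.weights_
# cluster means, analogous to mu
print model.means_
# full covariance matrices of the clusters
print model.covariances_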


It worked with a bang! No mistakes! As you would expect, sklearn's GaussianMixture works faster and better than our hand-written algorithm. However, the purpose of the article, as we recall, was to understand how the EM algorithm works. I hope that goal has been achieved :)

Sources of information


Literature

1) Applied statistics: Classifications and dimensionality reduction, S.A. Ayvazyan, V.M. Buchstaber, I.S. Enyukov, L.D. Meshalkin, Moscow, Finance and Statistics, 1989

Internet resources

1) Article 'A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models', International Computer Science Institute, Jeff A. Bilmes, 1998

2) Presentation 'Mixture Models & EM-algorithm', NY University, David Sontag

3) Course 'Search for data structures'

4) Formula for full probability and Bayes formulas

5) Function 'sklearn.mixture.GaussianMixture'

6) Normal distribution, Wikipedia

7) Multidimensional normal distribution, Wikipedia

8) github.com/diefimov/MTH594_MachineLearning/blob/master/ipython/Lecture10.ipynb

9) An article about the EM algorithm

10) Another article about the EM algorithm

11) Github github.com/AlexanderPetrenko83/Articles
