Taking the EM algorithm apart into small bricks



In this article, as you have probably already guessed, we will talk about how the EM algorithm works. First of all, the article may be of interest to those who are gradually joining the community of data scientists. The material will be most useful to those who have recently started the third course, "Searching for data structures", within the "Machine Learning and Data Analysis" specialization from MIPT and Yandex.

The material presented here is, in a sense, a supplement to the first week of the aforementioned course: it helps answer some important questions about how the EM algorithm actually works. To get the most out of it, our esteemed reader should be able to perform basic matrix operations (matrix multiplication, finding the determinant and the inverse of a matrix), understand the basics of probability theory and mathematical statistics, and, of course, have at least a rough idea of the basic clustering algorithms and of what clustering means in machine learning. That said, you can read the article without this background; quite a bit of it will still be clear :)

Also, following an old tradition, the article will not contain deep theoretical research but will be filled with simple and understandable examples. Each subsequent example explains the operation of the EM algorithm a little more deeply than the previous one, which ultimately leads us straight to the analysis of the algorithm itself. Code is provided for each example. All the code is written in Python 2.7, for which I apologize in advance; it so happens that I am currently using this particular version, but after switching to Python 3 I will try to update the code in the article.



Let us get acquainted with the outline of the article:

1) We consider, in general terms, how the EM algorithm works.

2) We recall the main points of Bayes' theorem (formula):

$$P_{x_i}(B_j) = \frac{P(B_j)\,P_{B_j}(x_i)}{P(x_i)}$$



3) We consider the first example based on Bayes' theorem, in which all elements of the formula (more precisely, the probabilities $P(B_j)$ and $P_{B_j}(x_i)$) to the right of the equals sign are known. In this case, we only need to understand correctly which numbers go where and then perform elementary operations.

4) We consider the second example based on Bayes' theorem. To solve the problem, we will additionally need to calculate the probability density of a certain event $x_i$ given the hypothesis $B_j$, that is, $\rho_{B_j}(x_i)$. For the calculations we will be given the parameters of the random variable, namely the mathematical expectation μ and the standard deviation σ. The probability density is calculated by the following formula:

$$\rho_{B_j}(x_i) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right)$$


Using the above formula assumes that the random variable is one-dimensional and normally distributed (in other words, follows the Gaussian, or Gauss-Laplace, distribution).

5) We consider the third example, which differs from the previous one only in that it deals with a multivariate (two-dimensional in our case) normal distribution of the random variable. In this case, the density is calculated by the following formula:

$$\rho_{B_j}(x_i) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\right)$$



6) Finally, we modify the third example in such a way as to clearly demonstrate the operation of the EM algorithm.

7) In conclusion, we compare the quality of the algorithm we have built with the quality of the EM algorithm built into the sklearn library (the sklearn.mixture.GaussianMixture class).

If, instead of the formulas given in the outline, you saw something resembling Japanese characters, please do not be scared: after Bayes' theorem (the second item of the outline) we will work through examples that you will almost certainly understand, at least intuitively. Well, let's go!

EM algorithm in general


In the course on which this article is based, the EM algorithm is presented as one of the clustering methods. In other words, it is an unsupervised machine learning method, used when we do not know the true answers in advance. In this article we will also consider the algorithm as part of a method for clustering a normally distributed random variable. It should be noted, however, that the algorithm has a much wider range of applications. The EM algorithm (expectation-maximization) is a general method for finding maximum likelihood estimates in models with hidden variables; it allows complex probability distributions to be built (approximated) from a mixture of distributions.
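In the Gaussian mixture setting considered below, this means the approximated density can be written (using the notation introduced in the outline above) as

$$p(x) = \sum_{j=1}^{k} w_j\,\rho_{B_j}(x),$$

where $w_j$ are the weights (probabilities) of the hypotheses and $\rho_{B_j}$ are normal densities whose parameters the algorithm estimates.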

The EM algorithm in clustering problems is used as an iterative algorithm, which performs two steps at each iteration:

E-step. At the first E-step we choose the hidden variables in some way, for example randomly; in our case these will be the mathematical expectation μ and the standard deviation σ. Using the chosen values, we calculate the probability of assigning each object to a particular cluster. On subsequent E-steps we use the hidden variables recomputed at the M-step.

M-step. At the M-step, in accordance with the probabilities of assigning each object to a particular cluster obtained at the E-step, we recalculate the hidden variables μ and σ.

Iterations are repeated until convergence. We can assume that convergence has occurred when the values of the hidden variables stop changing significantly, for example within a given tolerance. In the last example considered in the article we will not set such a constant and, accordingly, will not track the change of the hidden variables at each iteration. We will do it more simply: we will limit our algorithm to a fixed number of steps.

The M-step is simple enough to understand, and we will see this in the last example. We will focus on the E-step in the article.
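Before we do, here is a minimal, self-contained sketch of the whole E/M loop for a one-dimensional mixture of two Gaussians, written purely as an illustration for this section (the parameter values are the ones used in the examples below; the full two-dimensional version is built step by step later):

import numpy as np

np.random.seed(42)
# a mixture of two one-dimensional Gaussians: 6000 values around 64 and 4000 around 52
X = np.concatenate([np.random.normal(64, 3.5, 6000), np.random.normal(52, 2.0, 4000)])

w = np.array([0.5, 0.5])       # initial weights of the hypotheses
mu = np.array([60.0, 55.0])    # initial means, chosen roughly
sigma = np.array([1.0, 1.0])   # initial standard deviations, taken "from the ceiling"

for step in range(15):
    # E-step: probability of each object belonging to each cluster (the Bayes formula)
    dens = np.array([w[j] / (sigma[j] * np.sqrt(2 * np.pi)) *
                     np.exp(-(X - mu[j]) ** 2 / (2 * sigma[j] ** 2)) for j in range(2)])
    proba = dens / dens.sum(axis=0)
    # M-step: re-estimate the parameters from the probabilities just obtained
    w = proba.sum(axis=1) / X.shape[0]
    mu = (proba * X).sum(axis=1) / proba.sum(axis=1)
    sigma = np.sqrt((proba * (X - mu[:, None]) ** 2).sum(axis=1) / proba.sum(axis=1))
    # instead of a fixed number of steps one could stop once mu and sigma stop changing

print mu     # should end up close to [64, 52]
print sigma  # should end up close to [3.5, 2.0]
print w      # should end up close to [0.6, 0.4]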

The E-step is based on the Bayesian theorem, on the basis of which we determine the probability of each object to be assigned to one or another cluster. Let's remember what this theorem is about.

Recall the Bayes formula


$$P_{x_i}(B_j) = \frac{P(B_j)\,P_{B_j}(x_i)}{P(x_i)}$$



where $P_{x_i}(B_j)$ is the probability that the hypothesis $B_j$ holds given that the event $x_i$ has occurred;

$P(B_j)$ is the probability of the hypothesis $B_j$;

$P_{B_j}(x_i)$ is the probability of the event $x_i$ occurring given the hypothesis $B_j$;

$P(x_i)$ is the probability of the event $x_i$ occurring under any of the hypotheses (calculated by the total probability formula).

Provided that the event $x_i$ has already happened, the probability of the hypothesis $B_j$ is re-evaluated using the above formula. We can say that $P(B_j)$ is the a priori probability of the hypothesis (before the trial), and $P_{x_i}(B_j)$ is the posterior probability of the same hypothesis, re-evaluated after the event $x_i$, that is, given the fact that the event $x_i$ has reliably occurred.
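Written out for two hypotheses, the total probability formula that appears in the denominator is

$$P(x_i) = P(B_1)\,P_{B_1}(x_i) + P(B_2)\,P_{B_2}(x_i)$$

(in general, a sum over all hypotheses $B_j$). We will use exactly this computation in the examples below.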

First example using Bayes' theorem


Imagine that we received parts that were produced on two different machines. The following characteristics of the products received at the warehouse are known:

In total there are 10,000 parts ($N = 10{,}000$), of which 6,000 were produced on the first machine ($N_1 = 6000$) and 4,000 on the second ($N_2 = 4000$).

The proportion of standard (i.e., non-defective) products manufactured on the first machine is 0.9, the share of standard products manufactured on the second machine is 0.8.

We randomly extracted one part from the received parts, and it turned out to be standard.

We need to determine the probability that:

1) the part was produced on machine No. 1;

2) the part was produced on machine No. 2.

Solution

1) The event $x_i$ in our example is the extraction of a standard product.

2) Now let's define the hypotheses $B_j$. We have two batches of parts, which means two hypotheses:

Hypothesis $B_1$: the randomly drawn product was produced on machine No. 1.

Hypothesis $B_2$: the randomly drawn product was produced on machine No. 2.

Accordingly, the probability of drawing a product manufactured on the first machine is $P(B_1) = N_1/N = 6000/10{,}000 = 0.6$.

The probability of drawing a product manufactured on the second machine is $P(B_2) = N_2/N = 0.4$. At this point we do not care whether the product is standard or defective; what matters is which batch it came from.

3) The probability of drawing a standard product from the products manufactured on machine No. 1 (that is, under the first hypothesis) corresponds to the proportion of standard products produced on machine No. 1 and equals 0.9: $P_{B_1}(x_i) = 0.9$.

The probability of drawing a standard product under hypothesis No. 2 is $P_{B_2}(x_i) = 0.8$.

4) We determine the probability of drawing a standard product from the entire set of products, $P(x_i)$. In accordance with the total probability formula, $P(x_i) = P(B_1)P_{B_1}(x_i) + P(B_2)P_{B_2}(x_i) = 0.6 \cdot 0.9 + 0.4 \cdot 0.8 = 0.54 + 0.32 = 0.86$. Note that the remaining 0.14 is the probability of drawing a defective part from all the parts that arrived at the warehouse, and together the two sum to 1 (0.86 + 0.14 = 1).

5) Now we have all the data in order to solve the problem.

5.1) Determine the probability that a randomly drawn standard part was produced on machine No. 1:

$$P_{x_i}(B_1) = \frac{P(B_1)\,P_{B_1}(x_i)}{P(x_i)} = \frac{0.6 \cdot 0.9}{0.86} \approx 0.63$$



5.2) Determine the probability that a randomly drawn standard part was produced on machine No. 2:

$$P_{x_i}(B_2) = \frac{P(B_2)\,P_{B_2}(x_i)}{P(x_i)} = \frac{0.4 \cdot 0.8}{0.86} \approx 0.37$$



Thus, we have re-evaluated the hypotheses $B_1$ and $B_2$. After the re-evaluation, the hypotheses still form a complete group of events: $0.63 + 0.37 = 1$.
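For those who like to double-check such arithmetic in code, here is a tiny verification of the example above (an illustration written for this article, not part of the course code):

# priors of the hypotheses and the probabilities of a standard part on each machine
P_B1, P_B2 = 0.6, 0.4
P_x_B1, P_x_B2 = 0.9, 0.8

# total probability of drawing a standard part
P_x = P_B1 * P_x_B1 + P_B2 * P_x_B2
print P_x                               # 0.86
print round(P_B1 * P_x_B1 / P_x, 2)     # 0.63 - posterior probability of machine No. 1
print round(P_B2 * P_x_B2 / P_x, 2)     # 0.37 - posterior probability of machine No. 2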

Well, now is the time to move on to the second example.

Second example using Bayes' theorem with the parameters of the normal distribution of a random variable: the mathematical expectation μ and the standard deviation σ



Let's slightly modify the conditions of the previous example. Suppose that we do not know the proportion of standard products manufactured on machines No. 1 and No. 2, but that instead we know the average diameters of the products and the standard deviation of the diameter for each machine.

Machine No. 1 produces parts with a mean diameter of 64 mm and a standard deviation of 3.5 mm.

Machine No. 2 produces parts with a mean diameter of 52 mm and a standard deviation of 2 mm.

An important condition is that the entire set of products is described by the normal distribution or Gaussian distribution.

Otherwise, the conditions are the same, we write them:

$N_1 = 6000$, $\mu_1 = 64$, $\sigma_1 = 3.5$

$N_2 = 4000$, $\mu_2 = 52$, $\sigma_2 = 2$

We assume that in the process of product acceptance there was a small incident, as a result of which all products were mixed.

Our task is to sort the parts and, for each one, determine the probability that it was produced on machine No. 1 or machine No. 2. We will also assume that a part was produced on the machine for which this probability is higher.

Solution

First of all, let's analyze the solution approach. We could easily calculate the probability of each hypothesis, but we already know their values from the previous example: $P(B_1) = 0.6$, $P(B_2) = 0.4$. But how do we calculate the probability of drawing a particular part under each of the available hypotheses? It's simple! Instead of a probability, we will use the probability density of the drawn part, evaluated at its diameter value, separately for each hypothesis.

To do this, we use the well-known formula:

$$\rho_{B_j}(x_i) = \frac{1}{\sigma_j \sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}\right)$$



where $x_i$ is the random variable (in our case, the actual diameter of the product);

$\mu_j$ is the mathematical expectation of the random variable under the $j$-th hypothesis (in our case, the mean diameter of products produced on the $j$-th machine);

$\sigma_j$ is the standard deviation of the random variable under the $j$-th hypothesis (in our case, the deviation from the mean diameter of products produced on the $j$-th machine).

Thus, we make a clever substitution: probability density in place of probability in the numerator of the Bayes formula, and we make the analogous replacement in the denominator.

We adapt the formulas specifically for our case.

The probability that a part with diameter $x_i$ was produced on machine No. 1 is determined by the formula:

$$P_{x_i}(B_1) = \frac{w_1\,\rho_{B_1}(x_i)}{w_1\,\rho_{B_1}(x_i) + w_2\,\rho_{B_2}(x_i)}$$



The probability that a part with diameter $x_i$ was produced on machine No. 2:

$$P_{x_i}(B_2) = \frac{w_2\,\rho_{B_2}(x_i)}{w_1\,\rho_{B_1}(x_i) + w_2\,\rho_{B_2}(x_i)}$$



Note that we have replaced the notation for the probability of a hypothesis: $P(B_j)$ is now written as $w_j$. This matches the notation used in the course mentioned above, and we will stick to it from here on.

Now we are fully ready to solve the problem.
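Before running the full code, here is a tiny worked example for a single part (an illustration for this article): suppose a part with a diameter of 60 mm was drawn; its posterior probabilities are then computed like this:

import math

w1, w2 = 0.6, 0.4
mu_1, sigma_1 = 64., 3.5
mu_2, sigma_2 = 52., 2.

# one-dimensional Gaussian density, the same formula as above
def density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x = 60.
p1 = w1 * density(x, mu_1, sigma_1)
p2 = w2 * density(x, mu_2, sigma_2)
print round(p1 / (p1 + p2), 3)   # probability that the part came from machine No. 1
print round(p2 / (p1 + p2), 3)   # probability that the part came from machine No. 2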

We look at the code

Import libraries
# import the libraries we will need
import numpy as np
import random
import math
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import accuracy_score


Functions Used in the Second Example
# function for calculating the one-dimensional Gaussian probability density
# of a random variable given its parameters: mathematical expectation and standard deviation
def gaus_func_01(mu,sigma,x):
    return math.e**(-(x-mu)**2/(2*sigma**2)) / (sigma*(2*math.pi)**0.5)

# function for calculating the probability of each part belonging to each machine
def proba_x(X, w1, w2, mu_1, mu_2, sigma_1, sigma_2):
    proba_X = []
    for i in X:
        P1_x = gaus_func_01(mu_1,sigma_1,i)
        P2_x = gaus_func_01(mu_2,sigma_2,i)
        P_x = w1*P1_x + w2*P2_x
        P_x_1 = (w1*P1_x)/P_x
        P_x_2 = (w2*P2_x)/P_x
        proba_temp = []
        proba_temp.append(P_x_1)
        proba_temp.append(P_x_2)
        proba_X.append(proba_temp)
    return proba_X

# function for assigning each part to a machine based on the calculated probabilities
def pred_x(proba_X, limit_proba):
    pred_X = []
    for x in proba_X:
        if x[0] >= limit_proba:
            pred_X.append(1)
        else:
            pred_X.append(2)
    return np.array(pred_X)

# function for plotting the results
def graph_01(X, pred_X, mu_1, sigma_1, mu_2, sigma_2):
    true_pred = []
    false_pred_1 = []
    false_pred_2 = []
    for i in range(X.shape[0]):
        if pred_X[i] == y[i]:
            true_pred.append([X[i], -0.025])
        else:
            if y[i] == 1:
                false_pred_1.append([X[i], -0.0075])
            else:
                false_pred_2.append([X[i], -0.015])

    false_pred_1 = np.array(false_pred_1)            
    false_pred_2 = np.array(false_pred_2)
    true_pred = np.array(true_pred)

    x_theory = np.linspace(42, 85, 20000)
    y_theory_1 = []
    for x in x_theory:
        y_theory_1.append(gaus_func_01(mu_1,sigma_1,x))
    y_theory_2 = []
    for x in x_theory:
        y_theory_2.append(gaus_func_01(mu_2,sigma_2,x))

    plt.figure(figsize=(18, 8))    
    plt.plot(
        x_theory, y_theory_1, color = 'green', lw = 2, label = 'Theoretical probability density for machine 1')
    plt.plot(
        x_theory, y_theory_2, color = 'firebrick', lw = 2, label = 'Theoretical probability density for machine 2')
    plt.hist(
        X[:N1], bins = 'auto', color='#539caf', normed = True, alpha = 0.35, label = 'machine tool products 1')
    plt.hist(
        X[N1:N], bins = 'auto', color='sandybrown', normed = True, alpha = 0.75, label = 'machine tool products 2')
    plt.plot(mu_1, 0, 'o', markersize = 11, color = 'blue', label = 'Mu 1')
    plt.plot(mu_2, 0, 'o', markersize = 11, color = 'red', label = 'Mu 2')

    plt.plot([mu_1 - sigma_1, mu_1 - sigma_1], [0,0.85*np.max(y_theory_1)],
             ':', lw = 3, color = 'blue', alpha = 0.55, label = 'Mu1 - sigma1')
    plt.plot([mu_1 + sigma_1, mu_1 + sigma_1], [0,0.85*np.max(y_theory_1)],
             ':', lw = 3, color = 'blue', alpha = 0.55, label = 'Mu1 + sigma1')
    plt.plot([mu_2 - sigma_2, mu_2 - sigma_2], [0,0.85*np.max(y_theory_2)],
             ':', lw = 3, color = 'red', alpha = 0.55, label = 'Mu2 - sigma2')
    plt.plot([mu_2 + sigma_2, mu_2 + sigma_2], [0,0.85*np.max(y_theory_2)],
             ':', lw = 3, color = 'red', alpha = 0.55, label = 'Mu2 + sigma2')

    plt.plot([mu_1 - 2 * sigma_1, mu_1 - 2 * sigma_1], [0, 0.9*0.5 * np.max(y_theory_1)],
             ':', lw = 2.5, color = 'blue', alpha = 0.35, label = 'Mu1 - 2*sigma1')
    plt.plot([mu_1 + 2 * sigma_1, mu_1 + 2 * sigma_1], [0, 0.9*0.5 * np.max(y_theory_1)],
             ':', lw = 2.5, color = 'blue', alpha = 0.35, label = 'Mu1 + 2*sigma1')
    plt.plot([mu_2 - 2 * sigma_2, mu_2 - 2 * sigma_2], [0, 0.9*0.5 * np.max(y_theory_2)],
             ':', lw = 2.5, color = 'red', alpha = 0.35, label = 'Mu2 - 2*sigma2')
    plt.plot([mu_2 + 2 * sigma_2, mu_2 + 2 * sigma_2], [0, 0.9*0.5 * np.max(y_theory_2)],
             ':', lw = 2.5, color = 'red', alpha = 0.35, label = 'Mu2 + 2*sigma2')

    plt.plot(false_pred_1[:,0], false_pred_1[:,1], 'o', markersize = 2.5, color = 'blue', alpha = 0.2, label = 'errors1')
    plt.plot(false_pred_2[:,0], false_pred_2[:,1], 'o', markersize = 2.5, color = 'red', alpha = 0.3, label = 'errors2')
    plt.plot(true_pred[:,0], true_pred[:,1], 'o', markersize = 3, color = 'green', alpha = 0.2, label = 'right answers')

    plt.xlabel('Caliber')
    plt.ylabel('Probability density')
    plt.legend()
    plt.show()


Specify the parameters
# specify the initial parameters
# number of parts produced on machine No. 1
N1 = 6000
# number of parts produced on machine No. 2
N2 = 4000
# total number of parts
N = N1+N2

# mean diameter of parts produced on machine No. 1
mu_1 = 64.
# standard deviation of the diameter of parts produced on machine No. 1
sigma_1 = 3.5

# mean diameter of parts produced on machine No. 2
mu_2 = 52
# standard deviation of the diameter of parts produced on machine No. 2
sigma_2 = 2.


We will calculate
X = np.zeros((N))
np.random.seed(seed=42)
# generate the diameters of parts produced on machine No. 1
X[:N1] = np.random.normal(loc=mu_1, scale=sigma_1, size=N1)
# diameters of parts produced on machine No. 2
X[N1:N] = np.random.normal(loc=mu_2, scale=sigma_2, size=N2)

# true answers (the machine label for each part)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))

# probability of the hypothesis that a part was produced on machine No. 1
w1 = float(N1)/N
# probability of the hypothesis that a part was produced on machine No. 2
w2 = float(N2)/N

# determine the probability of each part being produced on machine No. 1 or No. 2
proba_X = []
proba_X = proba_x(X, w1, w2, mu_1, mu_2, sigma_1, sigma_2)

# probability threshold: if the probability for machine No. 1 is at least this value, the part is assigned to machine No. 1
limit_proba = 0.5

# assign each part to a machine
pred_X = []
pred_X = pred_x(proba_X, limit_proba)

# proportion of correct answers
print 'Proportion of correct answers:', round(accuracy_score(y, pred_X),3)
print

print 'Chart No. 1'
graph_01(X, pred_X, mu_1, sigma_1, mu_2, sigma_2)


Chart # 1



What have we just done? We generated, in a pseudo-random way, 10,000 values described by the normal distribution: 6,000 values with a mean of 64 mm and a standard deviation of 3.5 mm, and 4,000 values with a mean of 52 mm and a standard deviation of 2 mm. Then, in accordance with the algorithm described above, we determined for each part the probability that it was produced on machine No. 1 or No. 2, and chose the hypothesis about the machine with the higher probability. Finally, we compared the results of our algorithm with the true answers.

At the output we got a proportion of correct answers of 0.986. In general, this is very good, given that we did the "training" without using the true answers.

Let's look at the chart. What should we pay attention to? Look at where the products that the algorithm identified incorrectly are located.

Firstly, we see algorithm errors only in the area where products manufactured on different machines have the same diameter. It looks quite logical.

Secondly, we see that the algorithm is mistaken only on those objects that are farthest from the true mean of their own cluster and, at the same time, quite close to the center of the other cluster.

But the most interesting thing is that the problems begin mainly beyond approximately μ ± 2σ; specifically, in our case, near the intersection of μ1 − 2σ1 and μ2 + 2σ2 (that is, around 64 − 2·3.5 = 57 mm and 52 + 2·2 = 56 mm).

Those who wish can "twist" the parameters and see how the quality of the algorithm changes. This will be useful for better assimilation of the material.

Well, we move on to the next example

The third example. Two-dimensional case


This time we will consider the case where we have data not only on the diameter of the products but also, for example, on the weight of each product. Let machine No. 1 produce parts weighing 14 g with a standard deviation of 1 g, and machine No. 2 produce parts weighing 9.5 g with a standard deviation of 0.7 g.

For the rest, we will use the data from the previous example; let us write it all down:

$N_1 = 6000$, $\mu_{11} = 64$, $\sigma_{11} = 3.5$, $\mu_{12} = 14$, $\sigma_{12} = 1$

$N_2 = 4000$, $\mu_{21} = 52$, $\sigma_{21} = 2$, $\mu_{22} = 9.5$, $\sigma_{22} = 0.7$

The situation is the same: all the parts were mixed into one big pile, and we need to sort them out and determine for each part the probability that it was produced on machine No. 1 or machine No. 2.

Solution

In general, the solution approach does not change from the previous example, except for the formula for the probability density of the random variable. For the multivariate case, the following formula is used:

$$\rho_{B_j}(x_i) = \frac{1}{(2\pi)^{k/2}\,|\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\right)$$



where $\Sigma_j$ is the covariance matrix of the random variables under the $j$-th hypothesis (in our case, the covariance matrix of the parameters of products produced on the $j$-th machine);

$|\Sigma_j|$ is the determinant of that covariance matrix;

$\mu_j$ is the mathematical expectation of the random variables under the $j$-th hypothesis (in our case, the mean values of products manufactured on the $j$-th machine);

$x_i$ is the random variable (in our case, the parameters of the $i$-th product).

This time, to understand the code, our respected reader will need some knowledge of matrix operations.
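As a quick sanity check of this formula (an optional aside for this article; it assumes the scipy library is available and is not part of the course code), one can compare a direct calculation with scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

mu_j = np.array([64., 14.])
sigma_j = np.array([[3.5 ** 2, 0.], [0., 1. ** 2]])   # covariance matrix (diagonal here)
x_i = np.array([63., 13.5])
k = 2                                                 # dimensionality of the data

direct = (1. / ((2 * np.pi) ** (k / 2.) * np.linalg.det(sigma_j) ** 0.5) *
          np.exp(-0.5 * (x_i - mu_j).dot(np.linalg.inv(sigma_j)).dot(x_i - mu_j)))
print direct
print multivariate_normal(mean=mu_j, cov=sigma_j).pdf(x_i)   # should print the same value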

Functions Used in the Third Example
# function for calculating the multivariate Gaussian probability density
# of a random variable given its parameters: the vector of means mu and the matrix sigma
def gaus_func_02(k, m, x, mu, sigma):
    pj_xi = []
    for j in range(k):
        det_sigma_j = np.linalg.det(sigma[j])
        factor_1 = 1 / (((2 * math.pi)**(k/2)) * ((det_sigma_j)**0.5))
        factor_2 = []
        for i in x:
#    the matrix product on the next line yields a 1x1 matrix,
#    so we cast it to float to obtain a scalar exponent;
#    np.matrix is used to keep the notation close to the formula
#    (row vector times inverse matrix times column vector)
            factor_2.append(math.e**float(
            -0.5 * np.matrix(i - mu[j]) * np.matrix(np.linalg.inv(sigma[j])) * np.matrix(i - mu[j]).T))
        pj_xi.append(factor_1 * np.array(factor_2))
    return np.array(pj_xi)

# function for calculating the probability of each part being produced on machine No. 1 or No. 2
def proba_func_02(pjxi, w, k):
    # total probability density of drawing the i-th part (total probability formula)
    P_X = []
    for j in range(k):
        P_X.append(w[j] * pjxi[j])
    P_X = np.sum(np.array(P_X), axis = 0)
    # probability that the part was produced on machine No. 1 or No. 2
    P_J_X = []
    for j in range(k):
        P_J_X.append(w[j] * pjxi[j] / P_X)
    return np.array(P_J_X)

# function for assigning each part to a machine based on the probabilities
def pred_x_02(proba_X, limit_proba):
    pred_X = []
    for x in proba_X[0]:
        if x >= limit_proba:
            pred_X.append(1)
        else:
            pred_X.append(2)
    return np.array(pred_X)

# function for plotting the distribution of parts according to the algorithm's answers
def graph_02_algorithm(pred_X, mu):
    # convert to an array
    pred_X = np.array(pred_X)

    # split the parts into two lists according to the algorithm's answers
    answers_1 = []
    answers_2 = []

    for i in range(pred_X.shape[0]):
        if pred_X[i] == 1:
            answers_1.append(X[i])
        else:
            answers_2.append(X[i])
    
    print ' "     "'
    plt.figure(figsize=(16, 6))  
    plt.plot(
        np.array(answers_1)[:,0], np.array(answers_1)[:,1], 'o', alpha = 0.7, color='sandybrown', label = 'Produced on machine #1')
    plt.plot(
        np.array(answers_2)[:,0], np.array(answers_2)[:,1], 'o', alpha = 0.45, color = 'darkblue', label = 'Produced on machine #2')
    plt.plot(mu[0][0], mu[0][1], 'o', markersize = 16, color = 'red', label = 'Mu 1')
    plt.plot(mu[1][0], mu[1][1], 'o',  markersize = 16, color = 'slateblue', label = 'Mu 2')
    plt.xlabel('Diameter')
    plt.ylabel('Weight')
    plt.legend()
    plt.show()
    
# function for plotting the true distribution of parts
def graph_02_true(X, mu):
    print ' "  "'
    plt.figure(figsize=(16, 6))  
    plt.plot(
        X[0:N1,0], X[0:N1,1], 'o', alpha = 0.7, color='sandybrown', label = 'Produced on machine #1')
    plt.plot(
        X[N1:N,0], X[N1:N,1], 'o', alpha = 0.45, color = 'darkblue', label = 'Produced on machine #2')
    plt.plot(mu[0][0], mu[0][1], 'o', markersize = 16, color = 'red', label = 'Mu 1')
    plt.plot(mu[1][0], mu[1][1], 'o',  markersize = 16, color = 'slateblue', label = 'Mu 2')
    plt.xlabel('Diameter')
    plt.ylabel('Weight')
    plt.legend()
    plt.show()


Specify the parameters
# number of clusters
k = 2

# number of parts produced on machine No. 1
N1 = 6000
# number of parts produced on machine No. 2
N2 = 4000
N = N1+N2

# number of features (diameter and weight)
m = 2

# mean diameter of parts from machine No. 1
mu_1_1 = 64.
# mean weight of parts from machine No. 1
mu_1_2 = 14.
# standard deviations of the features for machine No. 1
sigma_1_1 = 3.5
sigma_1_2 = 1.

# mean diameter of parts from machine No. 2
mu_2_1 = 52.
# mean weight of parts from machine No. 2
mu_2_2 = 9.5
# standard deviations of the features for machine No. 2
sigma_2_1 = 2.
sigma_2_2 = 0.7


We will calculate
X = np.zeros((N, m))
np.random.seed(seed=42)
# generate the parameters of parts produced on machine No. 1
X[:N1, 0] = np.random.normal(loc=mu_1_1, scale=sigma_1_1, size=N1)
X[:N1, 1] = np.random.normal(loc=mu_1_2, scale=sigma_1_2, size=N1)
# parts produced on machine No. 2
X[N1:N, 0] = np.random.normal(loc=mu_2_1, scale=sigma_2_1, size=N2)
X[N1:N, 1] = np.random.normal(loc=mu_2_2, scale=sigma_2_2, size=N2)

# true answers (used only to evaluate the quality, not in training)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))

# mathematical expectations of the features for each machine (known in this example)
mu  = np.array(([mu_1_1, mu_1_2], [mu_2_1, mu_2_2]))

# standard deviations of the features for each machine, written as matrices (known in this example)
sigma = np.array(([sigma_1_1, 0.],[0., sigma_1_2], [sigma_2_1, 0.],[0., sigma_2_2]))
sigma = sigma.reshape(k, m, m)

# probabilities of the hypotheses that a part was produced on machine No. 1 or No. 2
w = np.array([float(1./k), float(1./k)])

# run the calculations
pj_xi = gaus_func_02(k, m, X, mu, sigma)
proba_X = proba_func_02(pj_xi, w, k)

# probability threshold: if the probability for machine No. 1 is at least this value, the part is assigned to machine No. 1
limit_proba = 0.5

pred_X = pred_x_02(proba_X, limit_proba)
        
# proportion of correct answers
print 'Proportion of correct answers:', round(accuracy_score(y, pred_X),3)
print

graph_02_algorithm(pred_X, mu)

graph_02_true(X, mu)


Graph No. 2.1 “Distribution of products in accordance with the algorithm”


Graph No. 2.2 “True distribution of products”


By analogy with the previous example, we generated 10,000 values in accordance with the parameters μ and σ given above, wrote several functions for our algorithm, and launched it. The fundamental difference between this code and the code from the previous example is that this time we used matrix expressions for the calculations.

As a result, our algorithm achieves a proportion of correct answers equal to 0.998, which is really quite good.

The problems, as we can see, are the same: errors on parts that were produced on different machines but have similar dimensions and weights.

The code is written so that the reader can substitute any values of the parameters μ and σ and see how the algorithm behaves: in which cases the quality deteriorates and in which it improves. The main thing is not to forget to study the chart. Well, we move on to the final point of our analysis: the EM algorithm itself.

Meet the EM algorithm


We continue our example with the parts that arrived at the warehouse. But this time we only know that the products were manufactured on two different machines, that there are 10,000 of them, and that each part has a diameter and a weight; we know nothing else. The task has not changed: as before, for every part in the large, randomly mixed pile, we need to determine which machine it belongs to.

At first glance this sounds almost unrealistic, but in fact we have a powerful toolkit in our hands: the Bayes formula and the probability density formula for a random variable. Let's put all this goodness to use.

Solution

What will we do? As it should be in the EM algorithm, we first initialize the parameters.

The probability of the hypothesis that a part was produced on machine No. 1 is $w_1$, and the probability of the hypothesis that a part was produced on machine No. 2 is $w_2$. There are only two hypotheses, so at the first step each of them is set to 0.5.

We define the mathematical expectations μ of the random variables as follows: we shuffle all the products with a random function, divide the set into two equal parts, and for each part compute the mean value of each feature (diameter, weight).

The standard deviations σ we take, as they say, out of thin air: we set them equal to one for every feature and write them in covariance matrix format.

We are ready to make the first E-step of the algorithm. Using the initialized parameters of random variables, we determine the probability of each part being assigned to machine No. 1 or machine No. 2.

Actually, in this way, we took the first E-step.

Now it is the M-step's turn. Everything is simple here. After we have determined the probability of each part being produced on a particular machine, we can recalculate the probability of each hypothesis, $w_1$ and $w_2$, as well as μ and σ.
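For reference, the standard re-estimation formulas of a Gaussian mixture look as follows (the code below follows essentially this scheme, with two small deviations: it divides by the sum of the probabilities minus one and then works with element-wise square roots of the covariance matrices):

$$w_j = \frac{1}{N}\sum_{i=1}^{N} P_{x_i}(B_j), \qquad \mu_j = \frac{\sum_{i=1}^{N} P_{x_i}(B_j)\,x_i}{\sum_{i=1}^{N} P_{x_i}(B_j)}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} P_{x_i}(B_j)\,(x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} P_{x_i}(B_j)}$$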

We will perform 15 such iterations, each consisting of these two steps.

We look at the code.
Functions Used in the EM Algorithm
# function for the E-step
def e_step(x, k, m, n, w, mu, sigma):
    # probability density of the i-th part under the j-th hypothesis
    pj_xi = []
    for j in range(k):
        det_sigma_j = np.linalg.det(sigma[j])
        factor_1 = 1 / (((2 * math.pi)**(k/2)) * ((det_sigma_j)**0.5))
        factor_2 = []
        for i in x:
            factor_2.append(math.e**float(
                -0.5 * np.matrix(i - mu[j]) * np.matrix(np.linalg.inv(sigma[j])) * np.matrix(i - mu[j]).T))
        pj_xi.append(factor_1 * np.array(factor_2))
    pj_xi = np.array(pj_xi)
    
    # numerator of the Bayes formula: density times the weight of the j-th hypothesis
    pj_xi_w = []
    for j in range(k):
        pj_xi_w.append(pj_xi[j] * w[j])
    pj_xi_w = np.array(pj_xi_w)
    
    # total density of the i-th part over all hypotheses (denominator of the Bayes formula)
    sum_pj_xi_w = np.sum(pj_xi_w, axis = 0)
    
    # probability of assigning each part to each machine
    proba_xi = []
    for j in range(k):
        proba_xi.append(pj_xi_w[j] / sum_pj_xi_w)
    
    return np.array(proba_xi)

# function that, based on the calculated probabilities, assigns each part to a machine
# and records the algorithm's answers
def x_new(proba_xi):
    X1_new_ind = []
    X2_new_ind = []
    X_answers = []

    count = 0
    for x in proba_xi[0]:
        if x >= 0.5:
            X1_new_ind.append(count)
            X_answers.append(1)
        else:
            X2_new_ind.append(count)
            X_answers.append(2)
        count += 1
    # return the indices of parts assigned to machine No. 1 and No. 2, plus the answers
    return X1_new_ind, X2_new_ind, X_answers


# function for the M-step
def m_step(x, proba_xi,N):
    w_new = np.sum(proba_xi, axis = 1) / N
    
    # new mathematical expectations
    mu_new = (np.array((np.matrix(proba_xi) * np.matrix(X))).T / np.sum(proba_xi, axis = 1)).T
    
    # new covariances
    cov_new = []
    for mu in range(mu_new.shape[0]):
        X_cd = []
        X_cd_proba = []
        count = 0
        for x_i in x:
            cd = np.array(x_i - mu_new[mu])
            X_cd.append(cd)
            X_cd_proba.append(cd * proba_xi[mu][count])
            count += 1
        X_cd = np.array(X_cd)
        X_cd = X_cd.reshape(N, m)
        X_cd_proba = np.array(X_cd_proba)
        X_cd_proba = X_cd_proba.reshape(N, m)

        cov_new.append(np.matrix(X_cd.T) * np.matrix(X_cd_proba))
    cov_new = np.array((np.array(cov_new) / (np.sum(proba_xi, axis = 1)-1)))
    # off-diagonal covariances occasionally come out negative; taking the square root of a
    # negative number on the next step would break the calculation, so as a simplification we reset them to zero
    if cov_new[0][0][1] < 0:
        cov_new[0][0][1] = 0
    if cov_new[0][1][0] < 0:
        cov_new[0][1][0] = 0
    
    if cov_new[1][0][1] < 0:
        cov_new[1][0][1] = 0
    if cov_new[1][1][0] < 0:
        cov_new[1][1][0] = 0
    
    # new standard deviations (element-wise square root of the covariances)
    sigma_new = cov_new**0.5
    return w_new, mu_new, sigma_new


Specify the parameters
# specify the initial parameters
# number of parts produced on machine No. 1 (cluster 1)
N1 = 6000
# number of parts produced on machine No. 2 (cluster 2)
N2 = 4000
# total number of parts
N = N1 + N2

# number of clusters
k = 2

# mean diameter of parts from machine No. 1
mu_samples_1_1 = 64.
# mean weight of parts from machine No. 1
mu_samples_1_2 = 14.

# mean diameter of parts from machine No. 2
mu_samples_2_1 = 52.
# mean weight of parts from machine No. 2
mu_samples_2_2 = 9.5

# standard deviation of the diameter of parts from machine No. 1
sigma_samples_1_1 = 3.5
# standard deviation of the weight of parts from machine No. 1
sigma_samples_1_2 = 1.

# standard deviation of the diameter of parts from machine No. 2
sigma_samples_2_1 = 2.
# standard deviation of the weight of parts from machine No. 2
sigma_samples_2_2 = 0.7


We will calculate
X = np.zeros((N, 2))

np.random.seed(seed=42)
# generate the parameters of parts produced on machine No. 1
X[:N1, 0] = np.random.normal(loc=mu_samples_1_1, scale=sigma_samples_1_1, size=N1)
X[:N1, 1] = np.random.normal(loc=mu_samples_1_2, scale=sigma_samples_1_2, size=N1)
# generate the parameters of parts produced on machine No. 2
X[N1:N, 0] = np.random.normal(loc=mu_samples_2_1, scale=sigma_samples_2_1, size=N2)
X[N1:N, 1] = np.random.normal(loc=mu_samples_2_2, scale=sigma_samples_2_2, size=N2)

# number of features
m = X.shape[1]

# number of objects
n = X.shape[0]

# true answers about which machine produced each part (used only to evaluate the quality)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))

# initialize the probabilities of the hypotheses that a part was produced on machine No. 1 or No. 2
w = np.array([float(1./k), float(1./k)])

np.random.seed(seed = None)
# initialize the mathematical expectations as the means of two random halves of the sample
mu  = np.array(
    (np.mean(X[np.random.choice(n, n/k)], axis = 0), np.mean(X[np.random.choice(n, n/k)], axis = 0)))
# mu = np.array(([mu_samples_1_1, mu_samples_1_2],[mu_samples_2_1, mu_samples_2_2]))

# initialize the standard deviations as identity matrices (taken "from the ceiling")
sigma = np.array(([1., 0.],[0., 1.], [1., 0.],[0., 1.]))
# sigma = np.array(([sigma_samples_1_1, 0.],[0., sigma_samples_1_2], [sigma_samples_2_1, 0.],[0., sigma_samples_2_2]))
sigma = sigma.reshape(k, m, m)

# number of iterations of the EM algorithm
steps = 15
# run the EM algorithm
for i in range(steps):
    proba_xi = e_step(X, k, m, n, w, mu, sigma)
    w, mu, sigma = m_step(X, proba_xi,N)
    X1_new_ind, X2_new_ind, X_answers = x_new(proba_xi)
    print 'Iteration No.', i+1
    print
    print 'Mathematical expectations'
    print mu
    print
    print 'Standard deviations'
    print sigma
    print
    print 'Proportion of correct answers'
    print round(accuracy_score(y, X_answers),3)
    
    plt.figure(figsize=(16, 6))  
    plt.plot(
        X[X1_new_ind,0], X[X1_new_ind,1], 'o', alpha = 0.7, color='sandybrown', label = 'Produced on machine #1')
    plt.plot(
        X[X2_new_ind,0], X[X2_new_ind,1], 'o', alpha = 0.45, color = 'darkblue', label = 'Produced on machine #2')
    plt.plot(mu[0][0], mu[0][1], 'o', markersize = 16, color = 'red', label = 'Mu 1')
    plt.plot(mu[1][0], mu[1][1], 'o',  markersize = 16, color = 'purple', label = 'Mu 2')
    plt.xlabel('Diameter')
    plt.ylabel('Weight')
    plt.legend()
    plt.show()


Chart 3.1 “Distribution of parts at each iteration of the EM-algorithm”

All iterations
















What happened in the end? It turned out that, having no prior information about the probabilities of the hypotheses that a part was produced on machine No. 1 ($w_1$) or machine No. 2 ($w_2$), nor about the mathematical expectations and standard deviations of the random variables (μ and σ), we achieved quality comparable to the previous example, in which all of those parameters were known.

Let's compare the parameters that we set when generating the set of products (the so-called true parameters) with the parameters that the EM algorithm determined for us:

True parameters

$\mu_{11} = 64$, $\mu_{12} = 14$, $\sigma_{11} = 3.5$, $\sigma_{12} = 1$

$\mu_{21} = 52$, $\mu_{22} = 9.5$, $\sigma_{21} = 2.0$, $\sigma_{22} = 0.7$

Parameters determined by the algorithm

$\mu_{11} \approx 64.03$, $\mu_{12} \approx 13.99$, $\sigma_{11} \approx 3.44$, $\sigma_{12} \approx 1.23$

$\mu_{21} \approx 52.06$, $\mu_{22} \approx 9.54$, $\sigma_{21} \approx 1.65$, $\sigma_{22} \approx 0.77$

And with that, the article comes to an end. In conclusion, I will add that for data analysis purposes it makes no sense to reinvent the wheel and write the EM algorithm yourself; you can simply use the ready-made function from the sklearn library.

Let's see how sklearn's GaussianMixture handles our example.

Import module
from sklearn.mixture import GaussianMixture as GMM


Specify the parameters
# specify the initial parameters
# number of parts produced on machine No. 1 (cluster 1)
N1 = 6000
# number of parts produced on machine No. 2 (cluster 2)
N2 = 4000
# total number of parts
N = N1 + N2

# mean diameter of parts from machine No. 1
mu_samples_1_1 = 64.
# mean weight of parts from machine No. 1
mu_samples_1_2 = 14.

# mean diameter of parts from machine No. 2
mu_samples_2_1 = 52.
# mean weight of parts from machine No. 2
mu_samples_2_2 = 9.5

# standard deviation of the diameter of parts from machine No. 1
sigma_samples_1_1 = 3.5
# standard deviation of the weight of parts from machine No. 1
sigma_samples_1_2 = 1.

# standard deviation of the diameter of parts from machine No. 2
sigma_samples_2_1 = 2.
# standard deviation of the weight of parts from machine No. 2
sigma_samples_2_2 = 0.7


# number of clusters
k = 2

X = np.zeros((N, 2))

np.random.seed(seed=42)
# generate the parameters of parts produced on machine No. 1
X[:N1, 0] = np.random.normal(loc=mu_samples_1_1, scale=sigma_samples_1_1, size=N1)
X[:N1, 1] = np.random.normal(loc=mu_samples_1_2, scale=sigma_samples_1_2, size=N1)
# generate the parameters of parts produced on machine No. 2
X[N1:N, 0] = np.random.normal(loc=mu_samples_2_1, scale=sigma_samples_2_1, size=N2)
X[N1:N, 1] = np.random.normal(loc=mu_samples_2_2, scale=sigma_samples_2_2, size=N2)

# true answers about which machine produced each part (used only to evaluate the quality)
y = np.zeros((N))
y[:N1] = np.array((1))
y[N1:N] = np.array((2))


Run the EM algorithm
np.random.seed(1)
model = GMM(n_components=k, covariance_type='full')
model.fit(X)


temp_predict_X = model.predict(X)
X_answers = []
for i in range(X.shape[0]):
    if temp_predict_X[i] == 0:
        X_answers.append(1)
    else:
        X_answers.append(2)
        

print 'Proportion of correct answers'
print round(accuracy_score(y, X_answers),3)
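You can also look at the parameters recovered by sklearn and compare them with the true ones; weights_, means_ and covariances_ are standard attributes of a fitted GaussianMixture object (this check is a small addition to the example above):

# mixture weights, analogous to w1 and w2
print model.weights_
# cluster means, analogous to mu
print model.means_
# full covariance matrices of the clusters
print model.covariances_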


It worked with a bang! No mistakes! As you would expect, sklearn's GaussianMixture works faster and better than our hand-written algorithm. However, the purpose of the article, as we recall, was to understand how the EM algorithm works. I hope that goal has been achieved :)

Sources of information


Literature

1) Applied statistics: Classifications and dimensionality reduction, S.A. Ayvazyan, V.M. Buchstaber, I.S. Enyukov, L.D. Meshalkin, Moscow, Finance and Statistics, 1989

Internet resources

1) Article 'A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models', International Computer Science Institute, Jeff A. Bilmes, 1998

2) Presentation 'Mixture Models & EM-algorithm', NY University, David Sontag

3) Course 'Search for data structures'

4) Formula for full probability and Bayes formulas

5) Function 'sklearn.mixture.GaussianMixture'

6) Normal distribution, Wikipedia

7) Multidimensional normal distribution, Wikipedia

8) github.com/diefimov/MTH594_MachineLearning/blob/master/ipython/Lecture10.ipynb

9) An article about the EM algorithm

10) Another article about the EM algorithm

11) Github github.com/AlexanderPetrenko83/Articles
