EM algorithm for clustering

The EM algorithm is a useful data modeling tool when maximizing the likelihood head-on, by direct differentiation, is not possible. Clustering is one of the tasks where this algorithm comes to the rescue. This article provides a general derivation of the EM algorithm for clustering.


Task


A set of points $X = \{x_i,\ i \in 1..N\}$ needs to be partitioned into $K$ clusters.


Solution idea


We compose a probabilistic model of the distribution of points across clusters. We then find the model parameters for which the probability of observing the set $X$ is maximal. With these parameters we will be able to determine which cluster a point $x$ most likely belongs to.


Data model


We introduce the following notation.


$p(x)$ is the probability of observing a point $x$.


$p(X) = \prod_{i=1}^{N} p(x_i)$ is the probability of observing the set $X$ (the points are assumed independent).


$p_j(x) = \varphi(x; \theta_j)$ is the probability of observing a point $x$ in cluster $j$. This distribution is parameterized by a parameter (or parameter vector) $\theta_j$ specific to cluster $j$.


$w_j$ is the probability of cluster $j$, i.e. the probability that a randomly selected point belongs to cluster $j$. A randomly selected point belongs to exactly one cluster, so $\sum_{j=1}^{K} w_j = 1$.


From the definitions above it follows that $p(x) = \sum_{j=1}^{K} w_j p_j(x) = \sum_{j=1}^{K} w_j \varphi(x; \theta_j)$, i.e. the distribution of points is modeled as a mixture of the cluster distributions.


As a result, the probabilistic model of the set of points $X$ is:


$$p(X) = \prod_{i=1}^{N} \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right)$$
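
For concreteness, here is a minimal sketch of how $\log p(X)$ could be computed for given parameters. It assumes one-dimensional Gaussian clusters for $\varphi$ (an illustrative choice, not something fixed by the article), and the helper name log_likelihood and the sample numbers are equally illustrative:

import numpy as np
from scipy.stats import norm

def log_likelihood(x, w, mu, sigma):
    """log p(X) = sum_i log sum_j w_j * phi(x_i; theta_j), with Gaussian phi."""
    # phi[i, j] = phi(x_i; theta_j), where theta_j = (mu_j, sigma_j)
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(np.log(phi @ w))

x = np.array([-2.1, -1.9, 0.2, 2.0, 2.3])   # points to cluster
w = np.array([0.5, 0.5])                    # cluster weights, sum to 1
mu = np.array([-2.0, 2.0])                  # cluster means
sigma = np.array([1.0, 1.0])                # cluster standard deviations
print(log_likelihood(x, w, mu, sigma))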


Parameter Search


The model parameters $w$ and $\theta$, as discussed above, should maximize the probability of our data:


$$w, \theta = \arg\max\, p(X) = \arg\max\, \log p(X) = \arg\max_{w, \theta} \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right)$$


The sum under the logarithm prevents us from solving the problem analytically. The constraint $\sum_{j=1}^{K} w_j = 1$ also gets in the way of a straightforward numerical optimization (for example, with gradient-based frameworks such as TensorFlow or PyTorch).


The EM algorithm gets around this by maximizing the likelihood iteratively. In pseudocode, the idea looks like this:


L := a lower bound on log p(X)
while log p(X) has not converged:
    tighten L so that it touches log p(X)
    w, theta = argmax L

In other words, instead of maximizing $\log p(X)$ directly, we work with its lower bound $L$. The requirements on $L$ are:


  1. $L$ can be "tightened": at the current parameter values it can be made to touch $\log p(X)$.
  2. Maximizing $L$ with respect to $w$ and $\theta$ is easy.

, "" , .


Constructing the lower bound L


We start from the quantity we want to maximize:


$$\log p(X) = \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right)$$


We introduce an auxiliary distribution $g_i$ over clusters for each point $x_i$:


$$g_i(j) \approx p(\text{cluster } j \mid \text{point } x_i)$$


We multiply and divide each term inside the inner sum by $g_i(j)$:


$$\sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right) = \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} g_i(j)\, \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right)$$


Now we use Jensen's inequality. For the concave logarithm it states:


$$\log \left( \sum_i q_i x_i \right) \ge \sum_i q_i \log x_i$$


where the weights $q_i$ are non-negative and sum to 1.
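
A quick numeric sanity check of the inequality, with arbitrary illustrative weights and values:

import numpy as np

q = np.array([0.2, 0.5, 0.3])     # non-negative weights summing to 1
v = np.array([1.0, 4.0, 9.0])     # arbitrary positive values
lhs = np.log(np.sum(q * v))       # log of the weighted sum
rhs = np.sum(q * np.log(v))       # weighted sum of the logs
print(lhs >= rhs)                 # True: the log of an average is >= the average of the logs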


The distributions $g_i(j)$ satisfy this condition by construction: $\sum_{j=1}^{K} g_i(j) = 1$. Applying Jensen's inequality to the sum under the logarithm:


$$\sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} g_i(j)\, \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) \ge \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \left( \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right)$$


The right-hand side is exactly the lower bound we will work with:


$$L(g, w, \theta) \equiv \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \left( \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right)$$
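
Continuing the illustrative Gaussian sketch from above, the bound can be computed for any valid $g$ (rows summing to 1), and it indeed never exceeds $\log p(X)$; the helper name lower_bound is again only illustrative:

def lower_bound(x, g, w, mu, sigma):
    """L(g, w, theta) = sum_i sum_j g_i(j) * log( w_j * phi(x_i; theta_j) / g_i(j) )."""
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(g * np.log(w[None, :] * phi / g))

g = np.full((len(x), len(w)), 1.0 / len(w))   # e.g. the uniform distribution over clusters
print(lower_bound(x, g, w, mu, sigma) <= log_likelihood(x, w, mu, sigma))   # True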


Tightening L (the E-step)


We have constructed a lower bound $L(g, w, \theta) \le \log p(X)$. With $w$ and $\theta$ fixed, we now tighten $L$ toward $\log p(X)$ by varying $g$.


To see how to do that, let us write out the difference between $\log p(X)$ and $L$ and transform it:


$$\log p(X) - L(g, w, \theta) = \sum_{i=1}^{N} \log p(x_i) - \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \left( \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) =$$


$$= \sum_{i=1}^{N} \left( \log p(x_i) \sum_{j=1}^{K} g_i(j) - \sum_{j=1}^{K} g_i(j) \log \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) = \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \frac{p(x_i)\, g_i(j)}{w_j \varphi(x_i; \theta_j)}$$


By Bayes' rule, the posterior probability of cluster $j$ given a point $x_i$ is:


$$p(j \mid x_i) = \frac{w_j \varphi(x_i; \theta_j)}{p(x_i)}$$


Substituting this into the last expression:


$$\sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \frac{p(x_i)\, g_i(j)}{w_j \varphi(x_i; \theta_j)} = \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \frac{g_i(j)}{p(j \mid x_i)} = \sum_{i=1}^{N} \mathbb{E}_{g_i} \log \frac{g_i}{p(j \mid x_i)}$$


The resulting expression has a familiar form: it is the Kullback-Leibler divergence (KL divergence) between the auxiliary distribution $g_i$ and the "true" posterior distribution over clusters.


Thus, the difference between $\log p(X)$ and $L$ is a sum of KL divergences:


$$\log p(X) - L(g, w, \theta) = \sum_{i=1}^{N} \mathrm{KL}\big(g_i \,\|\, p(j \mid x_i)\big)$$
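
Continuing the sketch, this identity can be checked numerically for the uniform $g$ defined earlier; the posterior is computed directly from Bayes' rule:

phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
posterior = (w[None, :] * phi) / (w[None, :] * phi).sum(axis=1, keepdims=True)   # p(j | x_i)
kl_sum = np.sum(g * np.log(g / posterior))                                       # sum_i KL(g_i || p(j | x_i))
print(np.isclose(log_likelihood(x, w, mu, sigma) - lower_bound(x, g, w, mu, sigma), kl_sum))   # True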


The KL divergence is non-negative, and it equals zero if and only if the two distributions coincide. Hence the recipe: to make the KL divergence vanish, so that the lower bound touches $\log p(X)$, we set $g_i(j)$ equal to $p(j \mid x_i)$:


$$g_i(j) = p(j \mid x_i) = \frac{w_j \varphi(x_i; \theta_j)}{p(x_i)}$$


With this choice of $g_i(j)$ the lower bound $L$ coincides with $\log p(X)$ at the current parameters.
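
In the running sketch, the E-step is exactly this computation of posterior probabilities (often called responsibilities); with that $g$ the bound becomes tight, up to floating-point error. The helper name e_step is illustrative:

def e_step(x, w, mu, sigma):
    """g_i(j) = w_j * phi(x_i; theta_j) / p(x_i): the posterior cluster probabilities."""
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    joint = w[None, :] * phi                           # w_j * phi(x_i; theta_j)
    return joint / joint.sum(axis=1, keepdims=True)    # divide by p(x_i)

g = e_step(x, w, mu, sigma)
print(np.allclose(lower_bound(x, g, w, mu, sigma), log_likelihood(x, w, mu, sigma)))   # True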


Maximizing L (the M-step)


The second step of the iteration is maximizing the lower bound. At this step:


  • the distributions $g$ are fixed;
  • the parameters $w$ and $\theta$ are subject to optimization.

Before optimizing, we simplify $L$:


$$L(g, w, \theta) = \sum_{i=1}^{N} \left( \sum_{j=1}^{K} g_i(j) \log \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) =$$


$$= \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \big( w_j \varphi(x_i; \theta_j) \big) - \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log g_i(j)$$


The second term does not depend on the parameters $w$ and $\theta$, so from now on we optimize only the first term:


$$w, \theta = \arg\max_{w, \theta} \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \big( w_j \varphi(x_i; \theta_j) \big)$$


We expand the logarithm of the product into a sum of logarithms, and the problem splits into two independent ones:


$$w = \arg\max_{w} \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log w_j, \qquad \sum_{j=1}^{K} w_j = 1$$


$$\theta_j = \arg\max_{\theta_j} \sum_{i=1}^{N} g_i(j) \log \varphi(x_i; \theta_j)$$


The first problem is solved with the method of Lagrange multipliers. The result:


$$w_j = \frac{1}{N} \sum_{i=1}^{N} g_i(j)$$


The solution of the second problem depends on the specific form of the cluster distribution $\varphi(x_i; \theta_j)$. Note that it no longer contains a sum under the logarithm, so, for example, for Gaussian cluster distributions the solution can be written analytically.
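
For the illustrative Gaussian case used in the sketches above, both subproblems of the M-step have closed-form answers: the weight update just derived, and weighted maximum-likelihood estimates of the means and variances (the Gaussian formulas are standard results, not derived in this article; the helper name m_step is illustrative):

def m_step(x, g):
    """Update w, mu, sigma for 1-D Gaussian clusters, with responsibilities g fixed."""
    n_j = g.sum(axis=0)                                              # "effective" number of points in cluster j
    w = n_j / len(x)                                                 # w_j = (1/N) * sum_i g_i(j)
    mu = (g * x[:, None]).sum(axis=0) / n_j                          # weighted means
    var = (g * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_j    # weighted variances
    return w, mu, np.sqrt(var)

w, mu, sigma = m_step(x, g)   # one M-step using the responsibilities from the E-step above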


Summary


We have examined the essence of the iterations of the EM algorithm for clustering and seen how their formulas are derived in a general setting.
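
To tie everything together, here is a minimal self-contained sketch of the full algorithm for a one-dimensional Gaussian mixture; the synthetic data, the number of iterations and the stopping threshold are all illustrative choices:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])   # synthetic data
K = 2

# initial parameters
w = np.full(K, 1.0 / K)
mu = rng.choice(x, size=K, replace=False)
sigma = np.full(K, x.std())

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities g_i(j) = w_j * phi(x_i; theta_j) / p(x_i)
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    joint = w[None, :] * phi
    g = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound with g fixed
    n_j = g.sum(axis=0)
    w = n_j / len(x)
    mu = (g * x[:, None]).sum(axis=0) / n_j
    sigma = np.sqrt((g * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_j)

    # monitor log p(X); EM guarantees it does not decrease
    ll = np.sum(np.log(joint.sum(axis=1)))
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

labels = g.argmax(axis=1)   # hard assignment: the most probable cluster for each point
print(w, mu, sigma)

The final line of the E-step is exactly the tightening of the bound, the M-step formulas are the weighted estimates discussed above, and labels gives the hard cluster assignment mentioned in the "Solution idea" section.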

