EM algorithm for clustering

The EM algorithm is a useful data modeling tool when maximizing the likelihood head-on, by direct differentiation, is not possible. Clustering is one of the tasks where this algorithm comes to the rescue. This article provides a general derivation of the EM algorithm for clustering.


Task


A set of points $X = \{x_i,\ i \in 1..N\}$ needs to be partitioned into $K$ clusters.


Solution idea


We compose a probabilistic model of the distribution of points across clusters. We then find the model parameters for which the probability of observing the set $X$ is maximal. With these parameters we will be able to determine which cluster a point $x$ most likely belongs to.


Data model


We introduce the following notation.


$p(x)$ is the probability of observing a point $x$.


$p(X) = \prod_{i=1}^{N} p(x_i)$ is the probability of observing the set $X$ (the points are assumed independent).


$p_j(x) = \varphi(x; \theta_j)$ is the probability of observing a point $x$ in cluster $j$. This distribution is parameterized by a parameter (or parameter vector) $\theta_j$ specific to cluster $j$.


$w_j$ is the probability of cluster $j$, i.e. the probability that a randomly selected point belongs to cluster $j$. A randomly selected point belongs to exactly one cluster, so $\sum_{j=1}^{K} w_j = 1$.


From the definitions above it follows that $p(x) = \sum_{j=1}^{K} w_j p_j(x) = \sum_{j=1}^{K} w_j \varphi(x; \theta_j)$, i.e. the distribution of points is modeled as a mixture of the cluster distributions.


As a result, the probabilistic model of the set of points $X$ is:


$$p(X) = \prod_{i=1}^{N} \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right)$$
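
For concreteness, here is a minimal sketch of how $\log p(X)$ could be computed for given parameters. It assumes one-dimensional Gaussian clusters for $\varphi$ (an illustrative choice, not something fixed by the article), and the helper name log_likelihood and the sample numbers are equally illustrative:

import numpy as np
from scipy.stats import norm

def log_likelihood(x, w, mu, sigma):
    """log p(X) = sum_i log sum_j w_j * phi(x_i; theta_j), with Gaussian phi."""
    # phi[i, j] = phi(x_i; theta_j), where theta_j = (mu_j, sigma_j)
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(np.log(phi @ w))

x = np.array([-2.1, -1.9, 0.2, 2.0, 2.3])   # points to cluster
w = np.array([0.5, 0.5])                    # cluster weights, sum to 1
mu = np.array([-2.0, 2.0])                  # cluster means
sigma = np.array([1.0, 1.0])                # cluster standard deviations
print(log_likelihood(x, w, mu, sigma))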


Parameter Search


The model parameters $w$ and $\theta$, as discussed above, should maximize the probability of our data:


$$w, \theta = \arg\max\, p(X) = \arg\max\, \log p(X) = \arg\max_{w, \theta} \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right)$$


The sum under the logarithm prevents us from solving the problem analytically. The constraint $\sum_{j=1}^{K} w_j = 1$ also gets in the way of a straightforward numerical optimization (for example, with gradient-based frameworks such as TensorFlow or PyTorch).


The EM algorithm gets around this by maximizing the likelihood iteratively. In pseudocode, the idea looks like this:


L := a lower bound on log p(X)
while log p(X) has not converged:
    tighten L so that it touches log p(X)
    w, theta = argmax L

In other words, instead of maximizing $\log p(X)$ directly, we work with its lower bound $L$. The requirements on $L$ are:


  1. $L$ can be "tightened": at the current parameter values it can be made to touch $\log p(X)$.
  2. Maximizing $L$ with respect to $w$ and $\theta$ is easy.

, "" , .


Constructing the lower bound L


We start from the quantity we want to maximize:


$$\log p(X) = \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right)$$


We introduce an auxiliary distribution $g_i$ over clusters for each point $x_i$:


$$g_i(j) \approx p(\text{cluster } j \mid \text{point } x_i)$$


We multiply and divide each term inside the inner sum by $g_i(j)$:


$$\sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} w_j \varphi(x_i; \theta_j) \right) = \sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} g_i(j)\, \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right)$$


Now we use Jensen's inequality. For the concave logarithm it states:


$$\log \left( \sum_i q_i x_i \right) \ge \sum_i q_i \log x_i$$


where the weights $q_i$ are non-negative and sum to 1.
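
A quick numeric sanity check of the inequality, with arbitrary illustrative weights and values:

import numpy as np

q = np.array([0.2, 0.5, 0.3])     # non-negative weights summing to 1
v = np.array([1.0, 4.0, 9.0])     # arbitrary positive values
lhs = np.log(np.sum(q * v))       # log of the weighted sum
rhs = np.sum(q * np.log(v))       # weighted sum of the logs
print(lhs >= rhs)                 # True: the log of an average is >= the average of the logs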


The distributions $g_i(j)$ satisfy this condition by construction: $\sum_{j=1}^{K} g_i(j) = 1$. Applying Jensen's inequality to the sum under the logarithm:


$$\sum_{i=1}^{N} \log \left( \sum_{j=1}^{K} g_i(j)\, \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) \ge \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \left( \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right)$$


The right-hand side is exactly the lower bound we will work with:


$$L(g, w, \theta) \equiv \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \left( \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right)$$
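
Continuing the illustrative Gaussian sketch from above, the bound can be computed for any valid $g$ (rows summing to 1), and it indeed never exceeds $\log p(X)$; the helper name lower_bound is again only illustrative:

def lower_bound(x, g, w, mu, sigma):
    """L(g, w, theta) = sum_i sum_j g_i(j) * log( w_j * phi(x_i; theta_j) / g_i(j) )."""
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    return np.sum(g * np.log(w[None, :] * phi / g))

g = np.full((len(x), len(w)), 1.0 / len(w))   # e.g. the uniform distribution over clusters
print(lower_bound(x, g, w, mu, sigma) <= log_likelihood(x, w, mu, sigma))   # True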


Tightening L (the E-step)


We have constructed a lower bound $L(g, w, \theta) \le \log p(X)$. With $w$ and $\theta$ fixed, we now tighten $L$ toward $\log p(X)$ by varying $g$.


To see how to do that, let us write out the difference between $\log p(X)$ and $L$ and transform it:


$$\log p(X) - L(g, w, \theta) = \sum_{i=1}^{N} \log p(x_i) - \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \left( \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) =$$


$$= \sum_{i=1}^{N} \left( \log p(x_i) \sum_{j=1}^{K} g_i(j) - \sum_{j=1}^{K} g_i(j) \log \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) = \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \frac{p(x_i)\, g_i(j)}{w_j \varphi(x_i; \theta_j)}$$


By Bayes' rule, the posterior probability of cluster $j$ given a point $x_i$ is:


$$p(j \mid x_i) = \frac{w_j \varphi(x_i; \theta_j)}{p(x_i)}$$


Substituting this into the last expression:


$$\sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \frac{p(x_i)\, g_i(j)}{w_j \varphi(x_i; \theta_j)} = \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \frac{g_i(j)}{p(j \mid x_i)} = \sum_{i=1}^{N} \mathbb{E}_{g_i} \log \frac{g_i}{p(j \mid x_i)}$$


The resulting expression has a familiar form: it is the Kullback-Leibler divergence (KL divergence) between the auxiliary distribution $g_i$ and the "true" posterior distribution over clusters.


Thus, the difference between $\log p(X)$ and $L$ is a sum of KL divergences:


$$\log p(X) - L(g, w, \theta) = \sum_{i=1}^{N} \mathrm{KL}\big(g_i \,\|\, p(j \mid x_i)\big)$$
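
Continuing the sketch, this identity can be checked numerically for the uniform $g$ defined earlier; the posterior is computed directly from Bayes' rule:

phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
posterior = (w[None, :] * phi) / (w[None, :] * phi).sum(axis=1, keepdims=True)   # p(j | x_i)
kl_sum = np.sum(g * np.log(g / posterior))                                       # sum_i KL(g_i || p(j | x_i))
print(np.isclose(log_likelihood(x, w, mu, sigma) - lower_bound(x, g, w, mu, sigma), kl_sum))   # True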


The KL divergence is non-negative, and it equals zero if and only if the two distributions coincide. Hence the recipe: to make the KL divergence vanish, so that the lower bound touches $\log p(X)$, we set $g_i(j)$ equal to $p(j \mid x_i)$:


$$g_i(j) = p(j \mid x_i) = \frac{w_j \varphi(x_i; \theta_j)}{p(x_i)}$$


With this choice of $g_i(j)$ the lower bound $L$ coincides with $\log p(X)$ at the current parameters.
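
In the running sketch, the E-step is exactly this computation of posterior probabilities (often called responsibilities); with that $g$ the bound becomes tight, up to floating-point error. The helper name e_step is illustrative:

def e_step(x, w, mu, sigma):
    """g_i(j) = w_j * phi(x_i; theta_j) / p(x_i): the posterior cluster probabilities."""
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    joint = w[None, :] * phi                           # w_j * phi(x_i; theta_j)
    return joint / joint.sum(axis=1, keepdims=True)    # divide by p(x_i)

g = e_step(x, w, mu, sigma)
print(np.allclose(lower_bound(x, g, w, mu, sigma), log_likelihood(x, w, mu, sigma)))   # True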


Maximizing L (the M-step)


The second step of the iteration is maximizing the lower bound. At this step:


  • the distributions $g$ are fixed;
  • the parameters $w$ and $\theta$ are subject to optimization.

Before optimizing, we simplify $L$:


$$L(g, w, \theta) = \sum_{i=1}^{N} \left( \sum_{j=1}^{K} g_i(j) \log \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) =$$


$$= \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \big( w_j \varphi(x_i; \theta_j) \big) - \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log g_i(j)$$


The second term does not depend on the parameters $w$ and $\theta$, so from now on we optimize only the first term:


$$w, \theta = \arg\max_{w, \theta} \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log \big( w_j \varphi(x_i; \theta_j) \big)$$


We expand the logarithm of the product into a sum of logarithms, and the problem splits into two independent ones:


$$w = \arg\max_{w} \sum_{i=1}^{N} \sum_{j=1}^{K} g_i(j) \log w_j, \qquad \sum_{j=1}^{K} w_j = 1$$


$$\theta_j = \arg\max_{\theta_j} \sum_{i=1}^{N} g_i(j) \log \varphi(x_i; \theta_j)$$


The first problem is solved with the method of Lagrange multipliers. The result:


$$w_j = \frac{1}{N} \sum_{i=1}^{N} g_i(j)$$


The solution of the second problem depends on the specific form of the cluster distribution $\varphi(x_i; \theta_j)$. Note that it no longer contains a sum under the logarithm, so, for example, for Gaussian cluster distributions the solution can be written analytically.
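
For the illustrative Gaussian case used in the sketches above, both subproblems of the M-step have closed-form answers: the weight update just derived, and weighted maximum-likelihood estimates of the means and variances (the Gaussian formulas are standard results, not derived in this article; the helper name m_step is illustrative):

def m_step(x, g):
    """Update w, mu, sigma for 1-D Gaussian clusters, with responsibilities g fixed."""
    n_j = g.sum(axis=0)                                              # "effective" number of points in cluster j
    w = n_j / len(x)                                                 # w_j = (1/N) * sum_i g_i(j)
    mu = (g * x[:, None]).sum(axis=0) / n_j                          # weighted means
    var = (g * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_j    # weighted variances
    return w, mu, np.sqrt(var)

w, mu, sigma = m_step(x, g)   # one M-step using the responsibilities from the E-step above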


Summary


We have examined the essence of the iterations of the EM algorithm for clustering and seen how their formulas are derived in a general setting.
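
To tie everything together, here is a minimal self-contained sketch of the full algorithm for a one-dimensional Gaussian mixture; the synthetic data, the number of iterations and the stopping threshold are all illustrative choices:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])   # synthetic data
K = 2

# initial parameters
w = np.full(K, 1.0 / K)
mu = rng.choice(x, size=K, replace=False)
sigma = np.full(K, x.std())

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities g_i(j) = w_j * phi(x_i; theta_j) / p(x_i)
    phi = norm.pdf(x[:, None], loc=mu[None, :], scale=sigma[None, :])
    joint = w[None, :] * phi
    g = joint / joint.sum(axis=1, keepdims=True)

    # M-step: maximize the lower bound with g fixed
    n_j = g.sum(axis=0)
    w = n_j / len(x)
    mu = (g * x[:, None]).sum(axis=0) / n_j
    sigma = np.sqrt((g * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_j)

    # monitor log p(X); EM guarantees it does not decrease
    ll = np.sum(np.log(joint.sum(axis=1)))
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

labels = g.argmax(axis=1)   # hard assignment: the most probable cluster for each point
print(w, mu, sigma)

The final line of the E-step is exactly the tightening of the bound, the M-step formulas are the weighted estimates discussed above, and labels gives the hard cluster assignment mentioned in the "Solution idea" section.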

