The EM algorithm is a useful data-modeling tool when the likelihood cannot be maximized "head-on" by differentiation. Clustering is one of the tasks where this algorithm comes to the rescue. This article gives a general derivation of the EM algorithm for clustering.

Problem

A set of points $X= \{ x_i, i\in1..N \}$ must be partitioned into $K$ clusters.

Solution idea

We build a probabilistic model of how the points are distributed across the clusters, then find the model parameters for which the probability of observing the set $X$ is maximal. With these parameters we can determine which cluster a point $x$ most probably belongs to.

Data model

We introduce some notation, borrowed from the course.

$p(x)$ - the probability of observing a point $x$.

$p(X) = \prod_{i=1}^{N}p(x_i)$ - the probability of observing the set $X$.

$p_j (x) = \varphi(x; \theta_j)$ - the probability of encountering a point $x$ in cluster $j$. This distribution is parameterized by a parameter (or parameter vector) $\theta_j$ specific to cluster $j$.

$w_j$ - the probability of cluster $j$, i.e. the probability that a randomly chosen point belongs to cluster $j$. A randomly chosen point belongs to exactly one cluster, so $\sum_{j=1}^K w_j = 1$.

It follows from these definitions that $p(x) = \sum_{j=1}^K w_j p_j(x) = \sum_{j=1}^K w_j \varphi(x; \theta_j)$, i.e. the distribution of points is modeled as a mixture of the cluster distributions.

Accordingly, the probabilistic model of the point set $X$ is:

$p(X) = \prod_{i=1}^{N}\left(\sum_{j=1}^K w_j \varphi(x_i; \theta_j)\right)$

Parameter search

As stated above, the model parameters $w$ and $\theta$ should give the maximum probability for our data:

$w, \theta = \textrm{argmax} \ p(X) = \textrm{argmax} \ \log p(X) = \textrm{argmax}_{w, \theta} \sum_{i=1}^{N} \log \left(\sum_{j=1}^K w_j \varphi(x_i; \theta_j)\right)$
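To make the objective concrete, here is a minimal sketch of the mixture density and the log-likelihood. It assumes one-dimensional Gaussian clusters with $\theta_j = (\mu_j, \sigma_j)$; that choice, the function names and the sample values are illustrative assumptions, since the article leaves $\varphi$ unspecified.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution; stands in for phi(x; theta_j)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, thetas):
    """p(x) = sum_j w_j * phi(x; theta_j)."""
    return sum(w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, thetas))

def log_likelihood(points, weights, thetas):
    """log p(X) = sum_i log(sum_j w_j * phi(x_i; theta_j))."""
    return sum(math.log(mixture_pdf(x, weights, thetas)) for x in points)

# Hypothetical data: two groups of points around -2 and 2.
points = [-2.1, -1.9, 0.1, 1.8, 2.2]
weights = [0.5, 0.5]                 # w_j, must sum to 1
thetas = [(-2.0, 1.0), (2.0, 1.0)]   # (mean, std) for each cluster

print(log_likelihood(points, weights, thetas))
```

Parameters that place the clusters near the data give a larger value of `log_likelihood` than parameters that place them far away, which is exactly what the argmax above selects for.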

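The sum under the logarithm prevents solving the problem analytically. The constraint $\sum_{j=1}^K w_j = 1$ gets in the way of solving it with automatic differentiation tools (for example, TensorFlow or PyTorch). We therefore resort to an iterative scheme:

L := lower bound on log p(X)
while log p(X) is still increasing noticeably:
    tighten L toward log p(X)
    w, theta = argmax L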

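In other words, we do not optimize $\log p(X)$ itself, since that is hard. Instead we take a lower bound $\mathcal{L}$ on it and repeat two steps:

Tighten $\mathcal{L}$: "pull" the bound up so that it is as close as possible to $\log p(X)$.
Find the parameters $w$ and $\theta$ that maximize $\mathcal{L}$.

The hope is that by repeatedly "pulling" the bound up to the target function and then maximizing the bound, we also increase the target function.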

Lower bound $\mathcal{L}$

$\log p(X) = \sum_{i=1}^{N} \log \left(\sum_{j=1}^K w_j \varphi(x_i; \theta_j)\right)$

We need a lower bound on this expression. For each point $x_i$ we introduce an auxiliary distribution $g_i$ over the clusters:

$g_i(j) \equiv p(\textrm{being in cluster} \ j| \textrm{this is point} \ i)$

$\sum_{i=1}^{N} \log \left(\sum_{j=1}^K w_j \varphi(x_i; \theta_j)\right) =\sum_{i=1}^{N} \log \left(\sum_{j=1}^K \frac{ g_i(j) }{ g_i(j) } w_j \varphi(x_i; \theta_j)\right)$

Multiplying and dividing by $g_i(j)$ changes nothing. Next we use Jensen's inequality for the logarithm:

$\log \left(\sum_i q_i x_i \right) \geq \sum_i q_i \log x_i$

which holds when the $q_i$ are non-negative and sum to $1$.
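A quick numerical check of the inequality may help build intuition. The sketch below uses made-up values of $q$ and $x$; the helper name `jensen_sides` is hypothetical.

```python
import math

def jensen_sides(q, x):
    """Return (log(sum_i q_i x_i), sum_i q_i log x_i) for weights q summing to 1."""
    lhs = math.log(sum(qi * xi for qi, xi in zip(q, x)))
    rhs = sum(qi * math.log(xi) for qi, xi in zip(q, x))
    return lhs, rhs

# Arbitrary positive values: the left-hand side is strictly larger.
lhs, rhs = jensen_sides([0.2, 0.5, 0.3], [1.0, 4.0, 9.0])
print(lhs, rhs)

# When all x_i are equal, the two sides coincide (up to rounding).
lhs_eq, rhs_eq = jensen_sides([0.2, 0.5, 0.3], [2.0, 2.0, 2.0])
print(lhs_eq, rhs_eq)
```

The equality case is what the E-step exploits later: the bound becomes tight when the ratio inside the logarithm is the same for every cluster.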

The values $g_i(j)$ satisfy exactly this condition: $\sum_{j=1}^K g_i(j) = 1$. Taking $g_i(j)$ as the weights therefore gives:

$\sum_{i=1}^{N} \log \left(\sum_{j=1}^K \frac{ g_i(j) }{ g_i(j) } w_j \varphi(x_i; \theta_j)\right) \geq \sum_{i=1}^{N} \sum_{j=1}^K g_i(j) \log \left(\frac{ w_j \varphi(x_i; \theta_j) }{ g_i(j) }\right)$

$\mathcal{L}(g, w, \theta) \equiv \sum_{i=1}^{N} \sum_{j=1}^K g_i(j) \log \left(\frac{ w_j \varphi(x_i; \theta_j) }{ g_i(j) }\right)$
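The bound can be checked numerically. The following sketch assumes one-dimensional Gaussian clusters (an illustrative choice, as are the data and function names) and evaluates $\mathcal{L}$ for a deliberately naive $g$ that assigns every point to both clusters with probability $0.5$.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution; stands in for phi(x; theta_j)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def log_likelihood(points, weights, thetas):
    """log p(X) = sum_i log(sum_j w_j * phi(x_i; theta_j))."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, thetas)))
        for x in points
    )

def lower_bound(points, g, weights, thetas):
    """L(g, w, theta) = sum_i sum_j g_i(j) * log(w_j * phi(x_i; theta_j) / g_i(j))."""
    total = 0.0
    for x, g_i in zip(points, g):
        for g_ij, w, (m, s) in zip(g_i, weights, thetas):
            if g_ij > 0.0:  # terms with g_i(j) = 0 contribute nothing
                total += g_ij * math.log(w * gaussian_pdf(x, m, s) / g_ij)
    return total

points = [-2.1, -1.9, 0.1, 1.8, 2.2]
weights = [0.5, 0.5]
thetas = [(-2.0, 1.0), (2.0, 1.0)]
g_uniform = [[0.5, 0.5] for _ in points]  # a valid but uninformed choice of g

bound = lower_bound(points, g_uniform, weights, thetas)
log_p = log_likelihood(points, weights, thetas)
print(bound, log_p)  # the first number never exceeds the second
```

Any other rows in `g` that sum to 1 also give a value below `log_p`; how far below depends on how well $g$ is chosen.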

Tightening $\mathcal{L}$ (E-step)

$\mathcal{L}(g, w, \theta)$ is a lower bound on $\log p(X)$. With $w$ and $\theta$ held fixed, we tighten $\mathcal{L}$ by choosing $g$.

Write out the gap between $\log p(X)$ and $\mathcal{L}$, which we want to make as small as possible:

$\log p(X) - \mathcal{L}(g, w, \theta) = \sum_{i=1}^N \log p(x_i) - \sum_{i=1}^{N} \sum_{j=1}^K g_i(j) \log \left(\frac{ w_j \varphi(x_i; \theta_j) }{ g_i(j) }\right)=$

$= \sum_{i=1}^N \left(\log p(x_i) \sum_{j=1}^K g_i(j) - \sum_{j=1}^K g_i(j) \log \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) = \sum_{i=1}^N \sum_{j=1}^K g_i(j) \log \frac{p(x_i) g_i(j)}{w_j \varphi(x_i; \theta_j)}$

Note that, by Bayes' rule, the posterior probability of cluster $j$ given the point $x_i$ is:

$p(j|x_i) = \frac{\varphi(x_i; \theta_j) w_j}{p(x_i)}$

Substituting this into the gap gives:

$\sum_{i=1}^N \sum_{j=1}^K g_i(j) \log \frac{p(x_i) g_i(j)}{w_j \varphi(x_i; \theta_j)} = \sum_{i=1}^N \sum_{j=1}^K g_i(j) \log \frac{g_i(j)}{p(j|x_i)}= \sum_{i=1}^N \mathbb{E}_{g_i} \log \frac{g_i(j)}{p(j|x_i)}$

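The result is a notable expression: the expectation of the logarithm of a ratio of two distributions. It is called the Kullback-Leibler divergence (KL divergence) and serves as a "distance" between probability distributions.

So the gap between $\log p(X)$ and $\mathcal{L}$ is a sum of KL divergences: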

$\log p(X) - \mathcal{L}(g, w, \theta) = \sum_{i=1}^N KL(g_i || p(j|x_i))$
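The identity can be verified on a single point. In this sketch the values of $w_j \varphi(x_i; \theta_j)$ are simply made-up numbers for $K = 3$ clusters, and `kl` is a hypothetical helper implementing the usual discrete KL divergence.

```python
import math

def kl(g, p):
    """KL(g || p) = sum_j g(j) * log(g(j) / p(j)) for discrete distributions."""
    return sum(gj * math.log(gj / pj) for gj, pj in zip(g, p) if gj > 0.0)

# Hypothetical values of w_j * phi(x_i; theta_j) for one point x_i.
joint = [0.10, 0.03, 0.07]
p_x = sum(joint)                      # p(x_i)
posterior = [v / p_x for v in joint]  # p(j | x_i)

g = [0.5, 0.25, 0.25]                 # an arbitrary distribution over clusters

# Contribution of this point to L, and the resulting gap to log p(x_i).
bound = sum(gj * math.log(v / gj) for gj, v in zip(g, joint))
gap = math.log(p_x) - bound

print(gap, kl(g, posterior))          # the two numbers agree up to rounding
```

Replacing `g` with `posterior` makes both the gap and the KL divergence vanish.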

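The KL divergence is non-negative, so the gap is smallest when every KL divergence in the sum is zero. A KL divergence is zero exactly when the two distributions coincide. We therefore set $g_i(j)$ equal to $p(j|x_i)$: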

$g_i(j) = p(j|x_i) = \frac{w_j \varphi(x_i; \theta_j)}{p(x_i)}$

With this choice of $g_i(j)$, the bound $\mathcal{L}$ coincides with $\log p(X)$.
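The E-step can be sketched in a few lines. The code below again assumes one-dimensional Gaussian clusters; the function names and data are illustrative. It computes $g_i(j)$ and confirms that the gap between $\log p(X)$ and $\mathcal{L}$ closes.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution; stands in for phi(x; theta_j)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def joint_terms(x, weights, thetas):
    """Return [w_j * phi(x; theta_j)] for every cluster j."""
    return [w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, thetas)]

def e_step(points, weights, thetas):
    """g_i(j) = w_j * phi(x_i; theta_j) / p(x_i) for every point and cluster."""
    g = []
    for x in points:
        joint = joint_terms(x, weights, thetas)
        p_x = sum(joint)  # p(x_i)
        g.append([v / p_x for v in joint])
    return g

def lower_bound(points, g, weights, thetas):
    """L(g, w, theta) as defined in the text."""
    total = 0.0
    for x, g_i in zip(points, g):
        for g_ij, v in zip(g_i, joint_terms(x, weights, thetas)):
            if g_ij > 0.0:
                total += g_ij * math.log(v / g_ij)
    return total

points = [-2.1, 0.0, 2.2]
weights = [0.5, 0.5]
thetas = [(-2.0, 1.0), (2.0, 1.0)]

g = e_step(points, weights, thetas)
log_p = sum(math.log(sum(joint_terms(x, weights, thetas))) for x in points)
gap = log_p - lower_bound(points, g, weights, thetas)
print(g, gap)
```

The point at $-2.1$ is assigned almost entirely to the first cluster, while the point at $0$, equidistant from both means, is split evenly.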

Maximizing $\mathcal{L}$ over the parameters (M-step)

Now for the second part of the iteration: maximizing the bound over the parameters. In this step:

$g$ is fixed;
the parameters $w$ and $\theta$ are optimized.

Before optimizing, simplify $\mathcal{L}$:

$\mathcal{L}(g, w, \theta) = \sum_{i=1}^N\left( \sum_{j=1}^K g_i(j) \log \frac{w_j \varphi(x_i; \theta_j)}{g_i(j)} \right) =$

$= \sum_{i=1}^N \sum_{j=1}^K g_i(j) \log \left( w_j \varphi(x_i; \theta_j) \right) -\sum_{i=1}^N \sum_{j=1}^K g_i(j) \log g_i(j)$

The second term does not depend on the parameters $w$ and $\theta$, so from here on we optimize only the first term:

$w, \theta = \textrm{argmax}_{w, \theta}\sum_{i=1}^N \sum_{j=1}^K g_i(j) \log \left( w_j \varphi(x_i; \theta_j) \right)$

Splitting the logarithm of the product into a sum of logarithms gives two independent problems:

$w = \textrm{argmax}_{w}\sum_{i=1}^N \sum_{j=1}^K g_i(j) \log w_j, \quad \textrm{subject to } \sum_{j=1}^K w_j = 1$

$\theta_j = \textrm{argmax}_{\theta_j} \sum_{i=1}^N g_i(j) \log \varphi (x_i; \theta_j)$

The first problem is solved with the method of Lagrange multipliers. The result:

$w_j = \frac{1}{N} \sum_{i=1}^N g_i(j)$
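For readers who want the intermediate steps, the Lagrange multiplier computation can be sketched as follows. The multiplier $\lambda$ and the symbol $\Lambda$ are introduced here for illustration and do not appear in the original text.

```latex
\Lambda(w, \lambda) = \sum_{i=1}^N \sum_{j=1}^K g_i(j) \log w_j
                      + \lambda \Bigl( 1 - \sum_{j=1}^K w_j \Bigr)

\frac{\partial \Lambda}{\partial w_j}
  = \frac{1}{w_j} \sum_{i=1}^N g_i(j) - \lambda = 0
  \quad \Longrightarrow \quad
  w_j = \frac{1}{\lambda} \sum_{i=1}^N g_i(j)

\sum_{j=1}^K w_j = 1
  \quad \Longrightarrow \quad
  \lambda = \sum_{i=1}^N \sum_{j=1}^K g_i(j) = \sum_{i=1}^N 1 = N
```

Substituting $\lambda = N$ back into the expression for $w_j$ gives the result stated above.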

The solution to the second problem depends on the specific form of the cluster distribution $\varphi (x_i; \theta_j)$. Note that solving it no longer involves a sum under the logarithm, so, for example, for Gaussian distributions the solution can be written analytically.
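As an illustration of the Gaussian case, here is a minimal sketch of the M-step assuming one-dimensional normal clusters with $\theta_j = (\mu_j, \sigma_j)$. For that family the maximizers are the $g$-weighted mean and variance; the function name and test data are hypothetical.

```python
import math

def m_step(points, g):
    """Return (weights, thetas) maximizing L for fixed g, for 1-D Gaussian clusters."""
    n = len(points)
    k = len(g[0])
    weights, thetas = [], []
    for j in range(k):
        n_j = sum(g_i[j] for g_i in g)  # effective number of points in cluster j
        mean = sum(g_i[j] * x for g_i, x in zip(g, points)) / n_j
        var = sum(g_i[j] * (x - mean) ** 2 for g_i, x in zip(g, points)) / n_j
        weights.append(n_j / n)         # w_j = (1/N) * sum_i g_i(j)
        thetas.append((mean, math.sqrt(var)))
    return weights, thetas

# Hard assignments make the result easy to check by hand.
points = [0.0, 2.0, 10.0, 14.0]
g = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, thetas = m_step(points, g)
print(weights, thetas)
```

With these hard assignments each cluster simply receives the ordinary mean and standard deviation of its own two points. The sketch does not guard against a cluster whose total responsibility `n_j` is zero.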

Summary

We have worked out what the iterations of the EM algorithm for clustering consist of and seen how their formulas are derived in the general case.
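Putting the two steps together, a complete iteration loop might look like the sketch below. It assumes one-dimensional Gaussian clusters, a fixed number of iterations instead of a convergence test, and a floor `min_std` on the standard deviation to avoid degenerate clusters; all of these, along with the names and data, are illustrative assumptions rather than part of the derivation.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution; stands in for phi(x; theta_j)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def em_gaussian_1d(points, means, n_iter=50, min_std=1e-3):
    """Run EM for a 1-D Gaussian mixture, starting from the given initial means."""
    k, n = len(means), len(points)
    weights = [1.0 / k] * k
    thetas = [(m, 1.0) for m in means]
    history = []  # log p(X) at the start of each iteration
    for _ in range(n_iter):
        # E-step: g_i(j) = w_j * phi(x_i; theta_j) / p(x_i)
        g, log_p = [], 0.0
        for x in points:
            joint = [w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, thetas)]
            p_x = sum(joint)
            log_p += math.log(p_x)
            g.append([v / p_x for v in joint])
        history.append(log_p)
        # M-step: closed-form updates for the weights and Gaussian parameters
        new_weights, new_thetas = [], []
        for j in range(k):
            n_j = sum(g_i[j] for g_i in g)
            mean = sum(g_i[j] * x for g_i, x in zip(g, points)) / n_j
            var = sum(g_i[j] * (x - mean) ** 2 for g_i, x in zip(g, points)) / n_j
            new_weights.append(n_j / n)
            new_thetas.append((mean, max(math.sqrt(var), min_std)))
        weights, thetas = new_weights, new_thetas
    return weights, thetas, history

# Hypothetical data: two groups of points around -2 and 2.
points = [-2.3, -2.1, -1.9, -1.7, 1.6, 1.9, 2.1, 2.4]
weights, thetas, history = em_gaussian_1d(points, means=[-1.0, 1.0])
print(weights, thetas)
```

The `history` list records $\log p(X)$ across iterations, which makes it easy to see that the likelihood does not decrease from one iteration to the next.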
