Evaluating Clustering Quality: Properties, Metrics, GitHub Code

Clustering is a kind of magic: it turns a large volume of unstructured data into a set of clusters that can be inspected, and analyzing those clusters lets us draw conclusions about what the data contains.

Clustering methods have many applications. For example, we group search queries to improve the generalization ability of ranking algorithms: any statistic computed over a group of similar queries is more reliable than the same statistic computed for a single query. Clustering therefore helps improve quality for queries with rarely seen wording. Another clear example is Yandex.News, which compiles news automatically.

Back in 2013 I was lucky enough to take part in developing a very sophisticated clustering algorithm. It had to group hundreds of thousands of objects with very high quality and do so quickly: in tenths of a second on a single machine. The first step was to build a quality evaluation system, and that is what this article is about.

, , , .

, : , , , /- . , . — .

1.

, . .

The properties below follow Amigó et al., 2009, "A Comparison of Extrinsic Clustering Evaluation Metrics Based on Formal Constraints".

1.1. Cluster Homogeneity

1.2. Cluster Completeness

, . .

1.3. Rag Bag

: , , («»), . , , , .

, . , . «» : , .

1.4. Size vs Quantity

2.

$D=\{d_1,...,d_N\}$ . . , $E \subseteq 2^D$ — ,

$\forall e_1, e_2 \in E:(e_1 \cap e_2 \ne \emptyset) \Leftrightarrow (e_1 = e_2)$
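The condition above can be checked mechanically. Here is a minimal C++ sketch, assuming a clustering is stored as a list of sets of integer element ids; the names `Cluster` and `IsValidClustering` are illustrative and not taken from any existing code.

```cpp
#include <set>
#include <vector>

// One cluster is a set of element ids; a clustering is a list of clusters.
using Cluster = std::set<int>;

// Returns true when the clusters are non-empty and pairwise disjoint.
// Taking e1 = e2 in the condition shows that a cluster may not be empty;
// taking e1 != e2 shows that two different clusters may not share an element.
// Assumption: the list does not contain the same cluster twice.
bool IsValidClustering(const std::vector<Cluster>& clusters) {
    std::set<int> seen;
    for (const Cluster& cluster : clusters) {
        if (cluster.empty()) {
            return false;
        }
        for (int element : cluster) {
            // insert().second is false if an earlier cluster already holds the element.
            if (!seen.insert(element).second) {
                return false;
            }
        }
    }
    return true;
}
```

The check runs in time proportional to the total number of elements, since each element is inserted into `seen` at most once.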

: . , .

, : $T=\{t_1,...,t_n\}$ $C=\{c_1,...,c_m\}$ . $t(d)$ $c(d)$ , , , $d$ .

2.1. , ()

. , . , .

. 3 («»), 3 («») 2 («»). : «» , «» «».

$8 \cdot 8 = 64$ , :
— $3 \cdot 3 + 3 \cdot 3 + 2 \cdot 2 = 22$ ;
— $64 - 22 = 42$ .

«» «» , - «» .

, :
— $TP = 18$ ;
— $FP = 6$ , ;
— $TN = 36$ ;
— $FN = 4$ , .

, :

$P = TP / (TP + FP) = 0.75$

$R = TP / (TP + FN) = 0.82$

$F_1 = \frac{2PR}{P + R} = 0.78$

, : .

. , : .

pairwise- — . - pairwise- .

2.2. ,

$P(c, t) = \frac{ | c \cap t | }{| c |}$

$R(c, t) = \frac{ | c \cap t | }{| t |}$

: «» (y) «» (v). 25 10 . 25/35, — 10/35. . , 30 12 , 25/30, — 10/12.

: , . :

$P(c, t) = R(t, c)$

F-:

$F_1(c, t) = \frac{2\cdot P(c, t) \cdot R(c,t)}{ P(c, t) + R(c,t) }$

. «» , Purity:

$Purity = \sum_{i=1}^{m}{\frac{|c_i|}{N}{\max_{1 \le j \le n}{P(c_i, t_j)}}}$

, :

$InversePurity = \sum_{j=1}^{n}{\frac{|t_j|}{N}{\max_{1 \le i \le m}{P(t_j, c_i)}}}$

$Purity$ , . $InversePurity$ : . , ? , . :

$|c_2| \approx |t_2| \gg |t_1| \gg |c_1|$

$|c_2| = 1000$ , $|c_1| = 2$ , $|t_1| = 20$ , $|t_2| = 982$ . $c_1$ $t_1$ , :

$|c_1 \cap t_1| = 2, |c_1 \cap t_2| = 0,$

$|c_2 \cap t_1| = 18, |c_2 \cap t_2| = 982$

$P(c_1, t_1) = |c_1 \cap t_1| / |c_1| = 2/2 = 1$

$P(c_2, t_2) = |c_2 \cap t_2| / |c_2| = 982/1000 = 0.982$

$P(t_1, c_2) = |t_1 \cap c_2| / |t_1| = 18/20 = 0.9$

$P(t_2, c_2) = |c_2 \cap t_2| / |t_2| = 982/982 = 1$

, Purity InversePurity . :

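$Purity = \frac{2 \cdot 1 + 1000 \cdot 0.982}{1002} \approx 0.9820$

$InversePurity = \frac{20 \cdot 0.9 + 982 \cdot 1}{1002} \approx 0.9980$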

, $t_1$ . , . . Purity $t_1$ $c_1$ , InversePurity — $c_2$ . !

F-:

$F = \sum_{j=1}^{n}{\frac{|t_j|}{N}{\max_{1 \le i \le m}{F_1(c_i, t_j)}}}$

, , . , , , .

2.3. BCubed-

. , . , , ? ?

, BCubed- . :

$BCP(d) = \frac{|c(d) \cap t(d)|}{|c(d)|}$

$BCR(d) = \frac{|c(d) \cap t(d)|}{|t(d)|}$

$BCP = \frac{1}{N}\sum_{d \in D}{BCP(d)}$

$BCR = \frac{1}{N}\sum_{d \in D}{BCR(d)}$

BCubed Precision satisfies Cluster Homogeneity and Rag Bag, while BCubed Recall satisfies Cluster Completeness and Size vs Quantity. Their combination, the F-measure, satisfies all four:

$BCF = \frac{2 \cdot BCP \cdot BCR}{BCP + BCR}$

3.

, . , .

, , . $D=\{a,b,c,d,e,f,g,h,i\}$ , $t_1 = \{a,b,c,d,e\}$ $t_2=\{f,g,h,i\}$ («» «» ).

: $c_1 = \{a,b,c,d,g\}$ , $c_2=\{e,f,h,i\}$ .

3.1. BCubed

BCubed- :

$BCP(t_i)=\frac{1}{|t_i|} \sum_{d \in t_i}{BCP(d)}$

$BCR(t_i)=\frac{1}{|t_i|} \sum_{d \in t_i}{BCR(d)}$

$BCP(a)=BCP(b)=BCP(c)=BCP(d)=\frac{|t_1 \cap c_1|}{|c_1|}=0.8, BCP(e)=\frac{|t_1 \cap c_2|}{|c_2|}=0.25$

$BCP(f)=BCP(h)=BCP(i)=\frac{|t_2 \cap c_2|}{|c_2|}=0.75, BCP(g)=\frac{|t_2 \cap c_1|}{|c_1|}=0.2$

$BCP(t_1)=\frac{BCP(a)+BCP(b)+BCP(c)+BCP(d)+BCP(e)}{5}=\frac{4\cdot 0.8 + 0.25}{5}=0.69$

$BCP(t_2)=\frac{BCP(f)+BCP(g)+BCP(h)+BCP(i)}{4}=\frac{3\cdot 0.75 + 0.2}{4}=0.6125$

$BCR(a)=BCR(b)=BCR(c)=BCR(d)=\frac{|t_1 \cap c_1|}{|t_1|}=0.8, BCR(e)=\frac{|t_1 \cap c_2|}{|t_1|}=0.2$

$BCR(f)=BCR(h)=BCR(i)=\frac{|t_2 \cap c_2|}{|t_2|}=0.75, BCR(g)=\frac{|t_2 \cap c_1|}{|t_2|}=0.25$

$BCR(t_1)=\frac{BCR(a)+BCR(b)+BCR(c)+BCR(d)+BCR(e)}{5}=\frac{4\cdot 0.8 + 0.2}{5}=0.68$

$BCR(t_2)=\frac{BCR(f)+BCR(g)+BCR(h)+BCR(i)}{4}=\frac{3\cdot 0.75 + 0.25}{4}=0.625$

BCP BCR :

$BCP=\frac{BCP(t_1) + BCP(t_2)}{2}=0.65125$

$BCR=\frac{BCR(t_1) + BCR(t_2)}{2}=0.6525$

3.2. Expected Cluster Completeness

. , , , , - .

— F- — . , ? .

, , «». . — . : , , .

— :

$f:C \rightarrow T$

$f$ (cluster completeness):

${CC}_f(t) = \max_{c \in f^{-1}(t)}{\frac{|c \cap t|}{|t|}}$

${CC}_f(T) = \frac{1}{|T|} \sum_{t \in T} {CC}_f(t)$

— , $f$ . : .

. , $f(c)=t$ ,

$P\Big(f(c)=t\Big) = P(c, t) = \frac{|c \cap t|}{|c|}$

$f$ :

$P(f) = \prod_{c \in C}{\frac{|c \cap f(c)|}{|c|}}$

$ECC(t) = \sum_{f: C \rightarrow T}{P(f) \cdot CC_f(t)}$

$ECC = \frac{1}{|T|} \sum_{t \in T}{ECC(t)}$

$ECC = \sum_{f: C \rightarrow T}{P(f) \cdot {CC}_f(T)}$

. $t$ $c_1$ , $c_2$ , ..., $c_k$ , :

$|t \cap c_1| \ge |t \cap c_2| \ge ... \ge |t \cap c_k|$

$c_1$ $t$

$P\Big(f(c_1) = t\Big) = \frac{|c_1 \cap t|}{|c_1|}$

$R(c_1, t) = \frac{ | c_1 \cap t | }{| t |}$

, , . $R(c_1, t)$ .

$c_1$ , $P(c_2, t)$ $t$ $c_2$ . $R(c_2, t)$ . , ,

$P(c_1, t) \cdot R(c_1, t) + P(c_2, t) \cdot R (c_2, t) \cdot \Big(1 - P(c_1,t)\Big) + ...$

ECC . sortedMatchings , , . Precision() Recall() . :

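// Computes ECC(t) for one reference cluster t.
// sortedMatchings holds the matchings of t with the clusters it intersects,
// ordered by decreasing overlap |t ∩ c|.
double ExpectedClusterCompleteness(const TMatchings& sortedMatchings) {
    double expectedClusterCompleteness = 0.;
    // Probability that none of the earlier (larger-overlap) clusters was mapped to t.
    double probability = 1.;
    for (const TMatching& matching : sortedMatchings) {
        // This cluster is mapped to t with probability Precision() and then contributes Recall().
        expectedClusterCompleteness += matching.Recall() * matching.Precision() * probability;
        probability *= 1. - matching.Precision();
    }
    return expectedClusterCompleteness;
}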

, $F_1$ -, !

ECC , 3. $t_1$ $c_1$ , $t_2$ — $c_2$ . :

$ECC(t_1) = R(c_1, t_1) \cdot P(c_1, t_1) + R(c_2, t_1) \cdot P(c_2, t_1) \cdot \Big(1 - P(c_1, t_1)\Big)$
$ECC(t_1) = {0.8}^2 + 0.2 \cdot 0.25 \cdot (1 - 0.8) = 0.65$

$ECC(t_2) = R(c_2, t_2) \cdot P(c_2, t_2) + R(c_1, t_2) \cdot P(c_1, t_2) \cdot \Big(1 - P(c_2, t_2)\Big)$
$ECC(t_2) = {0.75}^2 + 0.25 \cdot 0.2 \cdot (1 - 0.75) = 0.575$

, :

$ECC = \frac{ECC(t_1) + ECC(t_2)}{2} = 0.6125$

4.

The metrics from section 3 are implemented in C++ and published on GitHub under the MIT license. The code builds on Linux and Windows.

git clone https://github.com/yandex/cluster_metrics/ .
cmake .
cmake --build .

, 3. !

./cluster_metrics samples/sample_markup.tsv samples/sample_clusters.tsv
ECC   0.61250 (0.61250)
BCP   0.65125 (0.65125)
BCR   0.65250 (0.65250)
BCF1  0.65187 (0.65187)

«» , , . , , . , , . — .

5. ,

, , , .

. , , . , , :

: , . , , . , , , — . — ECC!

, , — . .

, , , . .

, , , ECC! ECC - , , , 100% . .

, 1 . : — .

^{_{Cluster Completeness}}

, , . , Cluster Homogeneity, Cluster Completeness, Rag Bag , Size vs Quantity .

, : , , , .

, 3, , .

, , : . , , .
