A simple example of cluster analysis of alcohol preferences by country in R

Hello Habr! Today I want to give a small example of how to carry out a cluster analysis. You will not find neural networks or other fashionable techniques here. The example can serve as a starting point for running a small but complete cluster analysis on other data. If you are interested, read on.

A caveat up front: this article makes no claim to academic rigor, to the uniqueness of its results, or to complete coverage of the topic. It is meant to demonstrate the basic steps of a classical cluster analysis that can be used for a simple and illustrative study (possibly ahead of a more detailed one). Corrections, comments, and substantive additions are welcome.

The data are a sample of per-capita alcohol consumption by country, broken down by type of alcoholic beverage (beer, wine, spirits, other) for 2010, expressed as a percentage of per-capita alcohol consumption. The data also include the average daily intake per capita in grams of pure alcohol and the total (recorded + unrecorded) alcohol consumption per capita (drinkers only, in litres of pure alcohol).

Each country is also provisionally assigned to one of three geographic groups: east, center, and west. The split is quite arbitrary and debatable for a number of reasons, but we will work with what we have. Data source: Global status report on alcohol and health 2014, pp. 289-364.

(Drawn by hand, so there may be mistakes, but I think the general idea is clear.)

Preliminary exploration

Load the libraries we use.

library(rgl)
library(heplots)
library(MVN)
library(klaR)
library('Morpho')
library(caret)
library(mclust)
library(ggplot2)
library(GGally)
library(plyr)
library(psych)
library(GPArotation)
library(ggpubr)

Load the data and prepare it for analysis.

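# Read the data
data <- read.table("alcohol_data.csv", header = TRUE, sep = ",")
# Use the first column (country names) as row names
rownames(data) <- make.names(data[, 1], unique = TRUE)
# Drop the country-name column and any rows with missing values
data <- data[, -1]
data <- na.omit(data)
# Show the first rows
head(data)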

	Beer	Wine	Spirit	Other	Total	Average_daily	Group
Albania	31.8	19.8	48.4	0.0	13.0	27.5	center
Armenia	9.7	5.3	84.9	0.0	8.3	17.9	east
Austria	50.4	35.5	14.0	0.0	13.8	29.6	center
Azerbaijan	28.7	7.6	63.3	0.0	5.2	11.1	east
Belarus	17.3	5.2	46.6	30.9	22.1	48.0	east
Belgium	49.2	36.3	14.4	0.1	12.8	27.7	center
...	...	...	...	...	...	...	...
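
Since the four beverage columns are shares of per-capita consumption, each row should add up to roughly 100. Here is a quick sanity check on the six rows printed above. The data frame name `shares` is made up for this sketch, and the values are copied by hand from that output, so treat it as an illustration rather than part of the original analysis.

```r
# Six rows copied by hand from the head(data) output (beverage shares, in %)
shares <- data.frame(
  Beer   = c(31.8,  9.7, 50.4, 28.7, 17.3, 49.2),
  Wine   = c(19.8,  5.3, 35.5,  7.6,  5.2, 36.3),
  Spirit = c(48.4, 84.9, 14.0, 63.3, 46.6, 14.4),
  Other  = c( 0.0,  0.0,  0.0,  0.0, 30.9,  0.1),
  row.names = c("Albania", "Armenia", "Austria",
                "Azerbaijan", "Belarus", "Belgium")
)

# Sum of the four shares for each country
share_sums <- rowSums(shares)
print(share_sums)

# Flag rows that deviate from 100 by more than half a percentage point
# (the 0.5 threshold is an arbitrary choice for this sketch)
suspicious <- names(share_sums)[abs(share_sums - 100) > 0.5]
print(suspicious)
```

The same check can be run on the full table with `rowSums(data[, c("Beer", "Wine", "Spirit", "Other")])`.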

summary(data)

Next, plot the countries in three dimensions (Beer, Wine, Spirit), colored by geographic group.

options(rgl.useNULL=TRUE)
open3d()
mfrow3d(2,2)
levelColors <- c('west'='blue', 'east'='red', 'center'='yellow')
plot3d(data$Beer, data$Wine, data$Spirit, xlab="Beer", ylab="Wine", zlab="Spirit", col = levelColors[data$Group], size=3)

widget <- rglwidget()
widget

Now look at the pairwise scatterplots, correlations, and densities of all variables by group.

ggpairs(
  data,
  mapping = ggplot2::aes(color = Group),
  upper = list(continuous = wrap("cor", alpha = 0.5), combo = "box"),
  lower = list(continuous = wrap("points", alpha = 0.3), combo = wrap("dot", alpha = 0.4)),
  diag = list(continuous = wrap("densityDiag",alpha = 0.5)),
  title = "Alcohol"
)

Average_daily carries essentially the same information as Total, so we drop Average_daily.

data <- data[, -6]
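
The redundancy between the two columns can already be seen in the six rows printed by `head(data)`. The sketch below copies those rows by hand and computes their correlation and ratio. The name `sample_rows` is hypothetical, and six rows are of course only a hint, not a proof for the whole table.

```r
# Total (litres of pure alcohol) and Average_daily (grams per day),
# copied by hand from the head(data) output
sample_rows <- data.frame(
  Total         = c(13.0,  8.3, 13.8,  5.2, 22.1, 12.8),
  Average_daily = c(27.5, 17.9, 29.6, 11.1, 48.0, 27.7),
  row.names = c("Albania", "Armenia", "Austria",
                "Azerbaijan", "Belarus", "Belgium")
)

# Pearson correlation between the two columns
r <- cor(sample_rows$Total, sample_rows$Average_daily)
print(round(r, 4))

# If one column is just a rescaled version of the other,
# the ratio should be nearly constant across countries
ratio <- sample_rows$Average_daily / sample_rows$Total
print(round(ratio, 2))
```

On the full data the same check is `cor(data$Total, data$Average_daily)`, run before the column is dropped.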

Let us look at the outliers. First, countries where wine makes up more than 60% of consumption.

data[data$Wine>60,]

	Beer	Wine	Spirit	Other	Total	Group
Italy	23	65.6	11.5	0	9.9	west

Likewise for spirits: countries where the share is above 70% or below 10%.

data[data$Spirit>70,]
data[data$Spirit<10,]

	Beer	Wine	Spirit	Other	Total	Group
Armenia	9.7	5.3	84.9	0	8.3	east

	Beer	Wine	Spirit	Other	Total	Group
Slovenia	44.5	46.9	8.6	0	17.2	west

Now look at the data split by geographic group.

split(data[,1:5],data$Group)

$center

	Beer	Wine	Spirit	Other	Total
Albania	31.8	19.8	48.4	0.0	13.0
Austria	50.4	35.5	14.0	0.0	13.8
Belgium	49.2	36.3	14.4	0.1	12.8
Bosnia.and.Herzegovina	73.3	9.7	17.0	0.0	12.3
Cyprus	40.9	24.7	33.7	0.7	10.8
Czech.Republic	53.5	20.5	26.0	0.0	14.6
Denmark	37.7	48.2	14.1	0.0	12.9
Finland	46.0	17.5	24.0	12.6	18.1
Germany	53.6	27.8	18.6	0.0	14.7
Hungary	36.3	29.4	34.3	0.0	16.3
Iceland	61.8	21.2	16.5	0.5	10.4
Ireland	48.1	26.1	18.7	7.7	14.7
Malta	39.4	32.7	27.2	0.7	11.5
Netherlands	46.8	36.4	16.9	0.0	11.2
Norway	44.2	34.7	19.0	2.1	9.0
Poland	55.1	9.3	35.5	0.0	24.2
Romania	50.0	28.9	21.1	0.0	21.3
Serbia	51.5	23.9	24.6	0.0	19.0
Sweden	37.0	46.6	15.1	1.4	13.3
Switzerland	31.8	49.4	17.6	1.2	12.1
Turkey	63.6	8.6	27.9	0.0	17.3
UK	36.9	33.8	21.8	7.5	13.8

$east

	Beer	Wine	Spirit	Other	Total
Armenia	9.7	5.3	84.9	0.0	8.3
Azerbaijan	28.7	7.6	63.3	0.0	5.2
Belarus	17.3	5.2	46.6	30.9	22.1
Bulgaria	39.3	16.5	44.1	0.1	16.9
Estonia	41.2	11.1	36.8	10.9	15.7
Georgia	17.0	49.8	33.2	0.1	21.2
Israel	44.0	6.2	49.5	0.3	5.4
Latvia	46.9	10.7	37.0	5.4	18.1
Lithuania	46.5	7.8	34.1	11.6	23.6
Republic.of.Moldova	30.4	5.1	64.5	0.0	25.4
Russian.Federation	37.6	11.4	51.0	0.0	22.3
Slovakia	30.1	18.3	46.2	5.5	19.8
Ukraine	40.5	9.0	48.0	2.6	20.3

$west

	Beer	Wine	Spirit	Other	Total
Croatia	39.5	44.8	15.4	0.2	15.1
France	18.8	56.4	23.1	1.7	12.9
Greece	28.1	47.3	24.2	0.4	15.6
Italy	23.0	65.6	11.5	0.0	9.9
Luxembourg	36.2	42.8	21.0	0.0	12.7
Portugal	30.8	55.5	10.9	2.8	22.6
Slovenia	44.5	46.9	8.6	0.0	17.2
Spain	49.7	20.1	28.2	1.8	16.4
Republic.of.Macedonia	47.4	39.9	12.6	0.0	11.7

ggpairs(
  data,
  mapping = ggplot2::aes(color = Group),
  diag = list(continuous = wrap("barDiag", alpha = 0.4))
)

For the clustering itself we keep only the Beer, Wine, and Spirit shares and drop the Other and Total columns.

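# Keep the geographic group as a factor for coloring later
data.group <- factor(data$Group)
# Cluster on the three main beverage shares only (drops Other, Total, Group)
data <- data[, c("Beer", "Wine", "Spirit")]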

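To choose the number of clusters we use the elbow method: for each number of clusters k we compute the total within-cluster sum of squares W(K) and look for the point where the curve bends.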

library(factoextra)
fviz_nbclust(data, kmeans, method = "wss") +
  labs(subtitle = "Elbow method") +
  geom_vline(xintercept = 4, linetype = 2)
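
The elbow plot is built from the total within-cluster sum of squares for each k. Below is a minimal base-R sketch of that quantity, without `factoextra`. It runs on a hand-copied subset of twelve countries from the tables above, so the numbers will differ from those for the full data. The names `beverages`, `ks`, and `wss` are made up for the sketch, and `nstart = 25` is my own choice.

```r
# Beer / Wine / Spirit shares for twelve countries, copied from the tables above
beverages <- data.frame(
  Beer   = c(50.4, 53.6, 53.5, 55.1,  9.7, 28.7, 30.4, 37.6, 18.8, 23.0, 30.8, 28.1),
  Wine   = c(35.5, 27.8, 20.5,  9.3,  5.3,  7.6,  5.1, 11.4, 56.4, 65.6, 55.5, 47.3),
  Spirit = c(14.0, 18.6, 26.0, 35.5, 84.9, 63.3, 64.5, 51.0, 23.1, 11.5, 10.9, 24.2),
  row.names = c("Austria", "Germany", "Czech.Republic", "Poland",
                "Armenia", "Azerbaijan", "Republic.of.Moldova", "Russian.Federation",
                "France", "Italy", "Portugal", "Greece")
)

set.seed(42)  # k-means starts from random centers, so fix the seed
ks <- 1:6

# W(k): total within-cluster sum of squares for each number of clusters
wss <- sapply(ks, function(k) {
  kmeans(beverages, centers = k, nstart = 25)$tot.withinss
})
names(wss) <- ks
print(round(wss, 1))
```

Plotting `wss` against `ks` (for example with `plot(ks, wss, type = "b")`) gives the same kind of curve that `fviz_nbclust(..., method = "wss")` draws.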

data.dist <- dist((data))
hc <- hclust(data.dist, method = "ward.D2")
plot(hc, cex = 0.7)

Color the leaf labels of the dendrogram by geographic group.

colors=c('green', 'red', 'blue')
hcd = as.dendrogram(hc)
clusMember = cutree(hc, 4)
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- colors[data.group[n]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
clusDendro = dendrapply(hcd, colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")

rect.hclust(hc, k = 4)

Now cut the same tree into 3 clusters instead of 4.

plot(clusDendro, main = "Cool Dendrogram", type = "triangle")
data.hclas_group <- factor(cutree(hc, k = 3))

rect.hclust(hc, k = 3)
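
One simple way to compare the three hierarchical clusters with the geographic groups is a cross-tabulation. The sketch below repeats the Ward clustering on a hand-copied subset of twelve countries (four per group) and tabulates cluster against group. The names `beverages`, `group`, and `hc_sketch` are hypothetical, and a subset will not necessarily reproduce the tree obtained from the full data.

```r
# Beer / Wine / Spirit shares for twelve countries, copied from the tables above
beverages <- data.frame(
  Beer   = c(50.4, 53.6, 53.5, 55.1,  9.7, 28.7, 30.4, 37.6, 18.8, 23.0, 30.8, 28.1),
  Wine   = c(35.5, 27.8, 20.5,  9.3,  5.3,  7.6,  5.1, 11.4, 56.4, 65.6, 55.5, 47.3),
  Spirit = c(14.0, 18.6, 26.0, 35.5, 84.9, 63.3, 64.5, 51.0, 23.1, 11.5, 10.9, 24.2),
  row.names = c("Austria", "Germany", "Czech.Republic", "Poland",
                "Armenia", "Azerbaijan", "Republic.of.Moldova", "Russian.Federation",
                "France", "Italy", "Portugal", "Greece")
)
# Geographic group of each country, as given in the article's tables
group <- factor(rep(c("center", "east", "west"), each = 4))

# Ward clustering on Euclidean distances, as in the article
hc_sketch <- hclust(dist(beverages), method = "ward.D2")

# Cut the tree into three clusters and compare with the geographic groups
clusters <- cutree(hc_sketch, k = 3)
tab <- table(cluster = clusters, group = group)
print(tab)
```

With the article's objects the equivalent call would be `table(data.hclas_group, data.group)`, assuming `data.group` holds the geographic group.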

Project the countries onto the first two principal components and color them by hierarchical cluster.

library(FactoMineR)
res.pca <- PCA(data, scale.unit = TRUE, graph = FALSE)
fviz_pca_biplot(res.pca,
                col.ind = data.hclas_group, palette = "jco",
                label = "var",
                ellipse.level = 0.8,
                addEllipses = TRUE,
                col.var = "black",
                legend.title = "groups")
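
To see how much of the variation the two plotted components capture, one can look at the explained variance. The sketch below uses base R's `prcomp` instead of `FactoMineR::PCA` (a swapped-in function, with scaling switched on to mirror `scale.unit`) and runs on a hand-copied subset of twelve countries. The names `beverages`, `pca_sketch`, and `explained` are made up for the example. Because the three shares nearly add up to 100, I would expect the third component to be close to empty.

```r
# Beer / Wine / Spirit shares for twelve countries, copied from the tables above
beverages <- data.frame(
  Beer   = c(50.4, 53.6, 53.5, 55.1,  9.7, 28.7, 30.4, 37.6, 18.8, 23.0, 30.8, 28.1),
  Wine   = c(35.5, 27.8, 20.5,  9.3,  5.3,  7.6,  5.1, 11.4, 56.4, 65.6, 55.5, 47.3),
  Spirit = c(14.0, 18.6, 26.0, 35.5, 84.9, 63.3, 64.5, 51.0, 23.1, 11.5, 10.9, 24.2),
  row.names = c("Austria", "Germany", "Czech.Republic", "Poland",
                "Armenia", "Azerbaijan", "Republic.of.Moldova", "Russian.Federation",
                "France", "Italy", "Portugal", "Greece")
)

# PCA on centered and standardized variables
pca_sketch <- prcomp(beverages, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component
explained <- pca_sketch$sdev^2 / sum(pca_sketch$sdev^2)
names(explained) <- colnames(pca_sketch$rotation)
print(round(explained, 4))

# Coordinates of the countries on the first two components
scores <- pca_sketch$x[, 1:2]
print(round(head(scores, 3), 2))
```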

Next, cluster the same data with k-means, using k-means++ initialization.

library(flexclust)
data.kk <- kcca(data, k=3, family=kccaFamily("kmeans"),
control=list(initcent="kmeanspp"))

fviz_pca_biplot(res.pca, 
                col.ind =as.factor(data.kk@cluster), palette = "jco", 
                label = "var",
                ellipse.level = 0.8,
                 addEllipses = T,
                col.var = "black", repel = TRUE,
                legend.title = "clusters")
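
For readers who wonder what the `kmeanspp` initialization does, here is a rough base-R sketch of k-means++ seeding: the first center is a random point, and each further center is drawn with probability proportional to its squared distance from the nearest center already chosen. This is my own illustration, not the `flexclust` implementation. The function name `kmeanspp_seeds` and the twelve-country subset `beverages` are hypothetical.

```r
# Beer / Wine / Spirit shares for twelve countries, copied from the tables above
beverages <- data.frame(
  Beer   = c(50.4, 53.6, 53.5, 55.1,  9.7, 28.7, 30.4, 37.6, 18.8, 23.0, 30.8, 28.1),
  Wine   = c(35.5, 27.8, 20.5,  9.3,  5.3,  7.6,  5.1, 11.4, 56.4, 65.6, 55.5, 47.3),
  Spirit = c(14.0, 18.6, 26.0, 35.5, 84.9, 63.3, 64.5, 51.0, 23.1, 11.5, 10.9, 24.2),
  row.names = c("Austria", "Germany", "Czech.Republic", "Poland",
                "Armenia", "Azerbaijan", "Republic.of.Moldova", "Russian.Federation",
                "France", "Italy", "Portugal", "Greece")
)

# Hypothetical helper: pick k starting centers in the k-means++ manner
kmeanspp_seeds <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Squared Euclidean distance from every row of x to row i
  sq_dist_to <- function(i) {
    rowSums((x - matrix(x[i, ], n, ncol(x), byrow = TRUE))^2)
  }
  chosen <- integer(k)
  chosen[1] <- sample.int(n, 1)          # first center: uniformly at random
  d2 <- sq_dist_to(chosen[1])            # distance to the nearest chosen center
  for (j in seq_len(k)[-1]) {
    # Points far from all current centers are more likely to be picked;
    # already chosen points have weight 0 and cannot be picked again
    chosen[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, sq_dist_to(chosen[j]))
  }
  x[chosen, , drop = FALSE]
}

set.seed(1)
seeds <- kmeanspp_seeds(beverages, k = 3)
print(seeds)

# Ordinary k-means started from these seeds
fit <- kmeans(beverages, centers = seeds)
print(fit$cluster)
```

The point of this seeding is to spread the starting centers out, which tends to make k-means less sensitive to an unlucky random start.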


One could also try model-based clustering with information criteria (described here), as well as classical discriminant analysis, on this dataset. If this article proves useful, I plan to publish a follow-up.
