A simple example of a cluster analysis of alcohol preferences by country for R

Hello, Habr! Today I want to share a small example of how to conduct cluster analysis. In this example, the reader will not find neural networks and other fashionable directions. This example can serve as a reference point in order to make a small and complete cluster analysis for other data. Anyone interested - welcome to cat.


Immediately make a reservation, this article in no way claims to be academic in its entirety, uniqueness of the results obtained, or completeness of coverage of the issue. The article is intended to demonstrate the basic steps of classical cluster analysis, which can be used for simple and meaningful (possibly preceding a more detailed) study. Any corrections, comments and additions on the merits are welcome.


The data is a sample of alcohol consumption by country per capita by type of alcoholic beverages (beer, wine, spirits, etc.) for 2010 as a percentage of per capita alcohol consumption. The data also contains: average daily alcohol consumption per capita in grams of pure alcohol and all (recorded + unaccounted) alcohol consumption per capita (only drinkers in liters of pure alcohol).


At the same time, each country conditionally belongs to one of the geographical groups: east, center and west. The division is very arbitrary and very controversial for various reasons, but we will proceed from what we have. Data Source - Global status report on alcohol and health 2014, S. 289-364



(Hand-painted, there may be errors, but the general idea, I think, is understandable)


Preliminary analysis


Connect the libraries used.


library(rgl)
library(heplots)
library(MVN)
library(klaR)
library('Morpho')
library(caret)
library(mclust)
library(ggplot2)
library(GGally)
library(plyr)
library(psych)
library(GPArotation)
library(ggpubr)

, .


#    
data <- read.table("alcohol_data.csv", header=TRUE,  sep=",")
#      
rownames(data) <- make.names(data[,1], unique = TRUE)
#     ,   
data <- data[,-1]
data <- na.omit(data)
#    
head(data)

BeerWineSpiritOtherTotalAverage_dailyGroup
Albania31.819.848.40.013.027.5center
Armenia9.75.384.90.08.317.9east
Austria50.435.514.00.013.829.6center
Azerbaijan28.77.663.30.05.211.1east
Belarus17.35.246.630.922.148.0east
Belgium49.236.314.40.112.827.7center
........................


summary(data)


, . , Other , , , , . , , , , . , . - .


, , , .


options(rgl.useNULL=TRUE)
open3d()
mfrow3d(2,2)
levelColors <- c('west'='blue', 'east'='red', 'center'='yellow')
plot3d(data$Beer, data$Wine, data$Spirit, xlab="Beer", ylab="Wine", zlab="Spirit", col = levelColors[data$Group], size=3)

widget <- rglwidget()
widget

, . , .




ggpairs(
  data,
  mapping = ggplot2::aes(color = data$Group),
  upper = list(continuous = wrap("cor", alpha = 0.5), combo = "box"),
  lower = list(continuous = wrap("points", alpha = 0.3), combo = wrap("dot", alpha = 0.4)),
  diag = list(continuous = wrap("densityDiag",alpha = 0.5)),
  title = "Alcohol"
)


Average Total , Average.


data <- data[, -6]

, , , , . .


data[data$Wine>60,]

BeerWineSpiritOtherTotalGroup
Italy2365.611.509.9west

, , , , - , , .


data[data$Spirit>70,]
data[data$Spirit<10,]

BeerWineSpiritOtherTotalGroup
Armenia9.75.384.908.3east

BeerWineSpiritOtherTotalGroup
Slovenia44.546.98.6017.2west

, , .


,


split(data[,1:5],data$Group)

$center


BeerWineSpiritOtherTotal
Albania31.819.848.40.013.0
Austria50.435.514.00.013.8
Belgium49.236.314.40.112.8
Bosnia.and.Herzegovina73.39.717.00.012.3
Cyprus40.924.733.70.710.8
Czech.Republic53.520.526.00.014.6
Denmark37.748.214.10.012.9
Finland46.017.524.012.618.1
Germany53.627.818.60.014.7
Hungary36.329.434.30.016.3
Iceland61.821.216.50.510.4
Ireland48.126.118.77.714.7
Malta39.432.727.20.711.5
Netherlands46.836.416.90.011.2
Norway44.234.719.02.19.0
Poland55.19.335.50.024.2
Romania50.028.921.10.021.3
Serbia51.523.924.60.019.0
Sweden37.046.615.11.413.3
Switzerland31.849.417.61.212.1
Turkey63.68.627.90.017.3
UK36.933.821.87.513.8

$east


BeerWineSpiritOtherTotal
Armenia9.75.384.90.08.3
Azerbaijan28.77.663.30.05.2
Belarus17.35.246.630.922.1
Bulgaria39.316.544.10.116.9
Estonia41.211.136.810.915.7
Georgia17.049.833.20.121.2
Israel44.06.249.50.35.4
Latvia46.910.737.05.418.1
Lithuania46.57.834.111.623.6
Republic.of.Moldova30.45.164.50.025.4
Russian.Federation37.611.451.00.022.3
Slovakia30.118.346.25.519.8
Ukraine40.59.048.02.620.3

$west


BeerWineSpiritOtherTotal
Croatia39.544.815.40.215.1
France18.856.423.11.712.9
Greece28.147.324.20.415.6
Italy23.065.611.50.09.9
Luxembourg36.242.821.00.012.7
Portugal30.855.510.92.822.6
Slovenia44.546.98.60.017.2
Spain49.720.128.21.816.4
Republic.of.Macedonia47.439.912.60.011.7

ggpairs(
  data,
  mapping = ggplot2::aes(color = data$Group),
  diag=list(continuous="bar", alpha=0.4)
)


, , . Other, : , , , ( 10-12 , 45, , ). . , , , (). , , . Other .


, , — , — . , — , .
Total Other, . .


, Beer, Spirit Wine . , , , . , , , , , .



Total. , — .


data.group = data[,5]
data <- data[,-5]
data<- data[,-4]

Elbow method (“ ”, “ ”). , k, – W(K), .


library(factoextra)
fviz_nbclust(data, kmeans, method = "wss") +
  labs(subtitle = "Elbow method") +
  geom_vline(xintercept = 4, linetype = 2)


data.dist <- dist((data))
hc <- hclust(data.dist, method = "ward.D2")
plot(hc, cex = 0.7)


. .


colors=c('green', 'red', 'blue')
hcd = as.dendrogram(hc)
clusMember = cutree(hc, 4)
colLab <- function(n) {
    if (is.leaf(n)) {
        a <- attributes(n)
        labCol <- colors[data.group[n]]
        attr(n, "nodePar") <- c(a$nodePar, lab.col = labCol)
    }
    n
}
clusDendro = dendrapply(hcd, colLab)
plot(clusDendro, main = "Cool Dendrogram", type = "triangle")

rect.hclust(hc, k = 4)


. , .
, , , 4 .


plot(clusDendro, main = "Cool Dendrogram", type = "triangle")
data.hclas_group <- factor(cutree(hc, k = 3))

rect.hclust(hc, k = 3)


, , .


library(FactoMineR)
res.pca <- PCA(data,scale.unit=T, graph = F)
fviz_pca_biplot(res.pca, 
                col = colors[data.hclas_group], palette = "jco", 
                label = "var",
                ellipse.level = 0.8,
                 addEllipses = T,
                col.var = "black",
                legend.title = "groups4")


, , . , , , , . , , , k-++.


library(flexclust)
data.kk <- kcca(data, k=3, family=kccaFamily("kmeans"),
control=list(initcent="kmeanspp"))

fviz_pca_biplot(res.pca, 
                col.ind =as.factor(data.kk@cluster), palette = "jco", 
                label = "var",
                ellipse.level = 0.8,
                 addEllipses = T,
                col.var = "black", repel = TRUE,
                legend.title = "clusters")


, k- . , , .


, , hclust. .



, , . . , .


. . , , , . , , . , .


It would be possible to carry out clustering based on the assumption of cluster models using information criteria ( here is the description ), as well as try the classical discriminant analysis for this data set. If this article was useful, I plan to publish a sequel.


All Articles