pacman::p_load(dplyr, ggplot2, readr, FactoMineR, factoextra, dendextend)
Wholesales dataset
W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10)
summary(W)
 Channel     Region         Fresh             Milk           Grocery
 Ch1:298    Reg1: 77    Min.   :0.477    Min.   :1.74    Min.   :0.477
 Ch2:142    Reg2: 47    1st Qu.:3.495    1st Qu.:3.19    1st Qu.:3.333
            Reg3:316    Median :3.930    Median :3.56    Median :3.677
                        Mean   :3.792    Mean   :3.53    Mean   :3.666
                        3rd Qu.:4.229    3rd Qu.:3.86    3rd Qu.:4.028
                        Max.   :5.050    Max.   :4.87    Max.   :4.968
     Frozen        Detergents_Paper     Delicassen
 Min.   :1.40      Min.   :0.477      Min.   :0.477
 1st Qu.:2.87      1st Qu.:2.409      1st Qu.:2.611
 Median :3.18      Median :2.912      Median :2.985
 Mean   :3.17      Mean   :2.947      Mean   :2.895
 3rd Qu.:3.55      3rd Qu.:3.594      3rd Qu.:3.260
 Max.   :4.78      Max.   :4.611      Max.   :4.681
Clustering: Group
The most commonly used methods of clustering:
💡 Steps of Hierarchical Cluster Analysis:
■ scale() : Standardize the variables
■ dist() : Calculate the distance matrix
■ hclust() : Run the hierarchical clustering
■ plot() : Draw the dendrogram
■ rect.hclust() : Cut the dendrogram
■ cutree() : Obtain the clustering vector
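The steps above can be sketched end to end on a dataset that ships with R. This is only a minimal illustration: the built-in `USArrests` data, the object names `hc_demo`, `k_demo` and `grp`, and the choice of four groups are all assumptions made for the example, not part of the wholesales analysis.

```r
# Minimal sketch of the six steps on the built-in USArrests data
# (dataset, object names and k are assumptions for illustration).
X <- scale(USArrests)               # 1. standardize each variable (mean 0, sd 1)
D <- dist(X)                        # 2. Euclidean distance matrix between rows
hc_demo <- hclust(D)                # 3. hierarchical clustering (default: complete linkage)
plot(hc_demo, cex = 0.6)            # 4. draw the dendrogram
k_demo <- 4                         #    illustrative number of groups
rect.hclust(hc_demo, k = k_demo, border = "red")  # 5. mark the k groups on the dendrogram
grp <- cutree(hc_demo, k = k_demo)  # 6. clustering vector: one group label per row
table(grp)                          # group sizes
```

Each step feeds the next, which is why the whole analysis can also be written as a single pipeline.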
For simplicity, let’s start with two clustering variables
hc = W[,3:4] %>% scale %>% dist %>% hclust
The result of the clustering analysis is returned and kept in the data object hc.
Making and Interpreting the Dendrogram
Determining the Number of Groups and Cutting the Dendrogram
plot(hc)
k=6; rect.hclust(hc, k=k, border="red")
Obtain and Save the Clustering Vector
W$group = cutree(hc, k=k) %>% factor
Save it as a categorical variable, so it won’t be interpreted as numeric.
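To see why the conversion to a factor matters, here is a small sketch. It assumes the built-in `USArrests` data and three groups purely for illustration; the object names are hypothetical. `cutree()` returns a numeric vector of group labels, and plotting functions would otherwise treat those labels as a continuous quantity.

```r
# Illustrative only: USArrests and k = 3 are assumptions, not the wholesales analysis.
hc_demo <- hclust(dist(scale(USArrests)))
g_int <- cutree(hc_demo, k = 3)   # group labels 1..k, named by row
g_fac <- factor(g_int)            # categorical version of the same labels

is.numeric(g_int)                 # TRUE: ggplot2 would map this to a continuous color scale
is.factor(g_fac)                  # TRUE: mapped to discrete colors, one per group
nlevels(g_fac)                    # number of groups
table(g_fac)                      # group sizes
```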
Plot the subjects in the Variable Space
ggplot(W, aes(x=Fresh, y=Milk, col=group)) +
geom_point(size=3, alpha=0.5)
hc = W[,3:7] %>% scale %>% dist %>% hclust
plot(hc)
k = 6; rect.hclust(hc, k, border="red")
W$group = cutree(hc, k) %>% factor
For better looks …
hc %>% as.dendrogram %>% color_branches(k) %>% color_labels(k,col='white') %>% plot
Dimension Reduction: Compress a space of many variables into a low-dimensional space that is easier to observe
The most commonly used methods of dimension reduction:
PCA()
cmdscale()
W[,3:8] %>% PCA(graph=FALSE) %>% fviz_pca_biplot()
W[,3:8] %>% PCA(graph=FALSE) %>% fviz_pca_biplot(
  col.ind=W$group,   # color the points by cluster group
  label="var", pointshape=19, mean.point=FALSE,
  addEllipses=TRUE, ellipse.level=0.7,
  ellipse.type = "convex", palette="ucscgb",
  repel=TRUE
  )
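The two methods listed above are closely related. As a hedged sketch, using base R's `prcomp()` in place of `FactoMineR::PCA()` and the built-in `USArrests` data (both are assumptions for illustration), classical scaling with `cmdscale()` on Euclidean distances gives the same two-dimensional coordinates as the first two principal components, up to a sign flip of each axis.

```r
# Illustrative only: prcomp() and USArrests stand in for PCA() and the wholesales data.
X   <- scale(USArrests)             # standardize the variables
pc  <- prcomp(X)$x[, 1:2]           # scores on the first two principal components
mds <- cmdscale(dist(X), k = 2)     # classical multidimensional scaling in 2 dimensions

# The axes may differ in sign, so compare absolute values.
max(abs(abs(pc) - abs(mds)))        # close to zero (numerical error only)

# Plot the subjects in the reduced space
plot(mds, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(mds, labels = rownames(USArrests), cex = 0.6)
```

The practical difference is the input: a PCA starts from the variables themselves, while `cmdscale()` only needs a distance matrix, so it can also be used when only pairwise distances are available.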
💡 Key Learnings:
■ The concept and purpose of clustering analysis
■ Clustering analysis in a pipeline
■ df %>% scale %>% dist %>% hclust
■ Making, interpreting and cutting a dendrogram
■ The concept and purpose of dimension reduction
■ Combining dimension reduction and clustering analysis
■ Visualizing the subjects/groups in a reduced variable space