UNIT07A：Clustering Analysis & Dimension Reduction

pacman::p_load(dplyr, ggplot2, readr, FactoMineR, factoextra, dendextend)

【A】The `wholesales` dataset

W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10) 
summary(W)

 Channel    Region        Fresh            Milk         Grocery     
 Ch1:298   Reg1: 77   Min.   :0.477   Min.   :1.74   Min.   :0.477  
 Ch2:142   Reg2: 47   1st Qu.:3.495   1st Qu.:3.19   1st Qu.:3.333  
           Reg3:316   Median :3.930   Median :3.56   Median :3.677  
                      Mean   :3.792   Mean   :3.53   Mean   :3.666  
                      3rd Qu.:4.229   3rd Qu.:3.86   3rd Qu.:4.028  
                      Max.   :5.050   Max.   :4.87   Max.   :4.968  
     Frozen     Detergents_Paper   Delicassen   
 Min.   :1.40   Min.   :0.477    Min.   :0.477  
 1st Qu.:2.87   1st Qu.:2.409    1st Qu.:2.611  
 Median :3.18   Median :2.912    Median :2.985  
 Mean   :3.17   Mean   :2.947    Mean   :2.895  
 3rd Qu.:3.55   3rd Qu.:3.594    3rd Qu.:3.260  
 Max.   :4.78   Max.   :4.611    Max.   :4.681
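
The minimum of 0.477 in several columns is consistent with a raw sales amount of 3, since log10(3) ≈ 0.477. As a minimal sketch of the same column-wise transform, assuming a made-up data frame (`toy` and its values are hypothetical and not taken from wholesales.csv):

```r
# Hypothetical toy data: one ID column and two right-skewed sales columns
toy <- data.frame(id    = 1:4,
                  Fresh = c(3, 100, 10000, 100000),
                  Milk  = c(10, 1000, 1000, 10000))

# Apply log10 to the sales columns only; the data frame structure is kept
toy[2:3] <- lapply(toy[2:3], log, base = 10)

toy$Fresh               # log10 of 3, 100, 10000, 100000
round(toy$Fresh[1], 3)  # log10(3), rounded to three decimals
```

Taking logs compresses the long right tail of the sales amounts, so a few very large stores do not dominate the distance calculations later on.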

【B】Clustering Analysis

Clustering: Group similar subjects for easier observation and operation

Subjects: each row is a retail store
Variables: each column is the sales amount (log10-transformed) of a product category

The most commonly used clustering methods:

Hierarchical Clustering
K-means Clustering
…
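
K-means is listed above but not demonstrated in this unit. A minimal sketch with base R's `kmeans()`, assuming the built-in `USArrests` data as a stand-in for the wholesale data and an arbitrary choice of 4 groups:

```r
# Assumption: built-in USArrests stands in for the wholesale data
X <- scale(USArrests)                      # standardize each variable

set.seed(1)                                # k-means starts from random centers
km <- kmeans(X, centers = 4, nstart = 25)  # try 25 random starts, keep the best

table(km$cluster)                          # number of subjects in each group
km$centers                                 # group centers in standardized units
```

Unlike hierarchical clustering, the number of groups has to be chosen before running `kmeans()`, and the result can vary with the random start, which is why a seed and several starts are used here.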

💡 Steps of Hierarchical Cluster Analysis：
■ scale() : Standardize the Variables
■ dist() : Calculate the Distance Matrix
■ hclust() : Call the hclust Function
■ plot() : Make the Dendrogram
■ rect.hclust() : Cut the Dendrogram
■ cutree() : Obtain the Clustering Vector
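
The six steps can be sketched one at a time with intermediate objects. This is only an illustration, assuming the built-in `USArrests` data instead of the wholesale data and a hypothetical choice of `k = 4`:

```r
# Assumption: built-in USArrests stands in for the wholesale data
X  <- scale(USArrests)    # 1. standardize the variables
d  <- dist(X)             # 2. distance matrix (Euclidean by default)
hc <- hclust(d)           # 3. hierarchical clustering (complete linkage by default)

plot(hc)                  # 4. draw the dendrogram
k <- 4                    # hypothetical number of groups
rect.hclust(hc, k = k, border = "red")  # 5. mark the k groups on the dendrogram
grp <- cutree(hc, k = k)  # 6. integer vector of group membership

table(grp)                # number of subjects in each group
```

Keeping `k` in one variable and passing it to both `rect.hclust()` and `cutree()` ensures the rectangles on the dendrogram and the clustering vector describe the same grouping.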

For simplicity, let’s start with two clustering variables

hc = W[,3:4] %>% scale %>% dist %>% hclust

The result of the clustering analysis is returned and kept in the data object hc.

Make and Interpret the Dendrogram, Determine the Number of Groups, and Cut the Dendrogram

plot(hc)
k=6; rect.hclust(hc, k=k, border="red")

Obtain and Save the Clustering Vector

W$group = cutree(hc, k=k) %>% factor

Save it as a categorical variable (a factor), so it won’t be interpreted as numeric.
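
A short sketch of why the conversion matters, again assuming the built-in `USArrests` data and an arbitrary `k = 3`:

```r
# Assumption: built-in USArrests stands in for the wholesale data
hc  <- hclust(dist(scale(USArrests)))
grp <- cutree(hc, k = 3)

class(grp)            # integer: a plot would map it to a continuous color scale
grp_f <- factor(grp)  # convert the group numbers to categories
class(grp_f)          # factor: mapped to discrete colors instead
table(grp_f)          # number of subjects in each group
```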

Plot the subjects in the Variable Space

ggplot(W, aes(x=Fresh, y=Milk, col=group)) +
  geom_point(size=3, alpha=0.5)

【C】Clustering with 5 Variables

hc = W[,3:7] %>% scale %>% dist %>% hclust
plot(hc)
k = 6; rect.hclust(hc, k, border="red")

W$group = cutree(hc, k) %>% factor
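
With more than two variables the groups can no longer be read off a single scatter plot, so one simple way to describe them is to compare group means. A minimal sketch, assuming the built-in `USArrests` data and a hypothetical `k = 4` (the object name `profile` is also just illustrative):

```r
# Assumption: built-in USArrests stands in for the wholesale data
X   <- as.data.frame(scale(USArrests))   # standardized variables
hc  <- hclust(dist(X))
grp <- factor(cutree(hc, k = 4))

# Mean of every standardized variable within each group
profile <- aggregate(X, by = list(group = grp), FUN = mean)
profile
```

Because the variables are standardized, a positive mean indicates a group that is above the overall average on that variable and a negative mean a group below it.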

For better looks …

hc %>% as.dendrogram %>% color_branches(k) %>% color_labels(k,col='white') %>% plot

【D】Dimension Reduction

Dimension Reduction: Compress the space of many variables into a low-dimensional space for easier observation

Why do Dimension Reduction?
The underlying logic of Dimension Reduction
The pros and cons of Dimension Reduction
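
One way to make the trade-off concrete is to check how much of the total variance the first few components retain. A minimal sketch with base R's `prcomp()`, assuming the built-in `USArrests` data rather than the wholesale data:

```r
# Assumption: built-in USArrests stands in for the wholesale data
pca <- prcomp(USArrests, scale. = TRUE)     # PCA on standardized variables

var_share <- pca$sdev^2 / sum(pca$sdev^2)   # share of total variance per component
round(var_share, 3)
round(cumsum(var_share), 3)                 # cumulative share retained
```

The cumulative share at the second component is the part of the variation that a two-dimensional plot can show; the remainder is the information given up in exchange for easier observation.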

The most commonly used Methods of Dimension Reduction:

Principal Component Analysis - PCA()
Multi-Dimensional Scaling - cmdscale()
…
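
`cmdscale()` is listed above but not demonstrated in this unit. A minimal sketch of classical multi-dimensional scaling, assuming the built-in `USArrests` data; it uses base R's `prcomp()` for the comparison instead of the `PCA()` function from FactoMineR:

```r
# Assumption: built-in USArrests stands in for the wholesale data
X   <- scale(USArrests)
mds <- cmdscale(dist(X), k = 2)   # classical MDS down to 2 dimensions
pca <- prcomp(X)                  # PCA on the same standardized data

# With Euclidean distances, classical MDS coordinates match the PCA scores
# up to a sign flip, so the absolute correlations are essentially 1
abs(cor(mds[, 1], pca$x[, 1]))
abs(cor(mds[, 2], pca$x[, 2]))

plot(mds, xlab = "Dim 1", ylab = "Dim 2")   # subjects in the reduced space
```

The two methods differ when another distance measure is passed to `cmdscale()`, since MDS only needs the distance matrix and not the original variables.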

W[,3:8] %>% PCA(graph=F) %>% fviz_pca_biplot()

W[,3:8] %>% PCA(graph=FALSE) %>% fviz_pca_biplot(
  col.ind=W$group,  # color each subject by its cluster group
  label="var", pointshape=19, mean.point=FALSE,
  addEllipses=TRUE, ellipse.level=0.7,
  ellipse.type="convex", palette="ucscgb",
  repel=TRUE
  )

💡 Key Learnings：
■ The Concept and Purpose of Clustering Analysis
■ Clustering Analysis in a pipeline
■ df %>% scale %>% dist %>% hclust
■ Making, Interpreting and Cutting Dendrogram

■ The Concept and Purpose of Dimension Reduction
■ Combining Dimension Reduction and Clustering Analysis
■ Visualize the subjects/groups in a Reduced Variable Space.

Tony Chuo, NSYSU Taiwan

2022-10-27 10:08:21
