pacman::p_load(dplyr, ggplot2, readr, FactoMineR, factoextra, dendextend)


【A】The wholesales dataset

W = read.csv('data/wholesales.csv')
W$Channel = factor( paste0("Ch",W$Channel) )
W$Region = factor( paste0("Reg",W$Region) )
W[3:8] = lapply(W[3:8], log, base=10) 
summary(W)
 Channel    Region        Fresh            Milk         Grocery     
 Ch1:298   Reg1: 77   Min.   :0.477   Min.   :1.74   Min.   :0.477  
 Ch2:142   Reg2: 47   1st Qu.:3.495   1st Qu.:3.19   1st Qu.:3.333  
           Reg3:316   Median :3.930   Median :3.56   Median :3.677  
                      Mean   :3.792   Mean   :3.53   Mean   :3.666  
                      3rd Qu.:4.229   3rd Qu.:3.86   3rd Qu.:4.028  
                      Max.   :5.050   Max.   :4.87   Max.   :4.968  
     Frozen     Detergents_Paper   Delicassen   
 Min.   :1.40   Min.   :0.477    Min.   :0.477  
 1st Qu.:2.87   1st Qu.:2.409    1st Qu.:2.611  
 Median :3.18   Median :2.912    Median :2.985  
 Mean   :3.17   Mean   :2.947    Mean   :2.895  
 3rd Qu.:3.55   3rd Qu.:3.594    3rd Qu.:3.260  
 Max.   :4.78   Max.   :4.611    Max.   :4.681  


【B】Clustering Analysis

Clustering: group similar subjects together so they are easier to observe and act on

The most commonly used methods of clustering:

💡 Steps of Hierarchical Cluster Analysis:
  ■ scale() : Standardize the Variable
  ■ dist() : Calculate Distance Matrix
  ■ hclust() : Call hclust Function
  ■ plot() : Draw the Dendrogram
  ■ rect.hclust() : Cut the Dendrogram
  ■ cutree() : Obtain the Clustering Vector
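
The steps above can be run one at a time. This is only a minimal sketch: it assumes the built-in USArrests data as a stand-in (not the wholesales data), an arbitrary k = 4, and the illustrative names X, D, hc_demo and g.

```r
# Stand-in data so the example runs without external files (assumption)
X <- scale(USArrests)         # standardize: each column gets mean 0, sd 1
D <- dist(X)                  # pairwise distances (Euclidean by default)
hc_demo <- hclust(D)          # agglomerative clustering (complete linkage by default)
plot(hc_demo)                 # draw the dendrogram
rect.hclust(hc_demo, k = 4, border = "red")  # box the 4 clusters on the plot
g <- cutree(hc_demo, k = 4)   # one integer cluster label per row
table(g)                      # cluster sizes
```

Standardizing first keeps a variable with a large numeric range from dominating the distance calculation.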

For simplicity, let’s start with two clustering variables

hc = W[,3:4] %>% scale %>% dist %>% hclust

The result of the clustering analysis is returned and kept in the data object hc.
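
To see what such an object holds, its components can be inspected directly. A small sketch, again assuming USArrests as stand-in data and hc_demo as an illustrative name:

```r
hc_demo <- hclust(dist(scale(USArrests)))  # stand-in data (assumption)

class(hc_demo)        # "hclust"
hc_demo$method        # linkage rule used; hclust() defaults to "complete"
hc_demo$dist.method   # distance used; dist() defaults to "euclidean"
head(hc_demo$merge)   # which observations/clusters were joined at each step
head(hc_demo$height)  # the distance at which each merge happened
```

Because the pipeline above calls dist and hclust without arguments, it relies on these two defaults.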

Make and interpret the dendrogram, determine the number of groups, and cut the dendrogram

plot(hc)
k=6; rect.hclust(hc, k=k, border="red")
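
One informal aid for choosing k, beyond reading the dendrogram by eye, is to look at how far apart the last few merge heights are. This is a heuristic sketch only, assuming USArrests as stand-in data; h_rev, gaps and the cutoff of 10 are illustrative choices, not a rule.

```r
hc_demo <- hclust(dist(scale(USArrests)))  # stand-in data (assumption)

h_rev <- rev(hc_demo$height)   # merge heights, last (highest) merge first
# h_rev[j] is the height of the merge that reduces j+1 clusters to j,
# so a cut giving k clusters lies between h_rev[k] and h_rev[k-1].
gaps <- -diff(h_rev)           # gaps[i] = room available for a cut with k = i + 1
data.frame(k = 2:10, gap = round(gaps[1:9], 3))
```

A large gap for some k means the dendrogram can be cut at that k over a wide range of heights, which is one sign that the grouping is stable.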

Obtain and Save the Clustering Vector

W$group = cutree(hc, k=k) %>% factor   # reuse the k chosen above so the groups match the red boxes

Save it as a categorical variable (a factor), so it won’t be interpreted as a number.
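
The difference between the raw labels and the factor can be checked directly. A minimal sketch, assuming USArrests as stand-in data and k = 4; g_int and g_fac are illustrative names.

```r
g_int <- cutree(hclust(dist(scale(USArrests))), k = 4)  # stand-in data (assumption)
g_fac <- factor(g_int)

is.numeric(g_int)   # TRUE: cutree() returns integer labels
is.factor(g_fac)    # TRUE: labels are now categories
levels(g_fac)       # "1" "2" "3" "4"
table(g_fac)        # number of subjects in each group
```

In ggplot2 a numeric column mapped to colour gets a continuous gradient, while a factor gets one distinct colour per group, which is what a cluster plot needs.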

Plot the subjects in the Variable Space

ggplot(W, aes(x=Fresh, y=Milk, col=group)) +
  geom_point(size=3, alpha=0.5)
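
Besides plotting, the groups can be profiled by their mean on each variable. A small sketch in base R, assuming USArrests as stand-in data and k = 4; X and centers are illustrative names.

```r
X <- as.data.frame(scale(USArrests))               # stand-in data (assumption)
X$group <- factor(cutree(hclust(dist(X)), k = 4))  # dist() sees only the numeric columns here

# Mean of every standardized variable within each group
centers <- aggregate(. ~ group, data = X, FUN = mean)
centers
```

Since the variables are standardized, a positive value means the group sits above the overall average on that variable and a negative value means below.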



【C】Clustering with 5 Variables

hc = W[,3:7] %>% scale %>% dist %>% hclust
plot(hc)
k = 6; rect.hclust(hc, k, border="red")