pacman::p_load(dplyr)


【A】The Airline Customer dataset

A = read.csv('data/AirlinesCluster.csv')
summary(A)
    Balance          QualMiles       BonusMiles       BonusTrans  
 Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0  
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0  
 Median :  43097   Median :    0   Median :  7171   Median :12.0  
 Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6  
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0  
 Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0  
  FlightMiles     FlightTrans    DaysSinceEnroll
 Min.   :    0   Min.   : 0.00   Min.   :   2   
 1st Qu.:    0   1st Qu.: 0.00   1st Qu.:2330   
 Median :    0   Median : 0.00   Median :4096   
 Mean   :  460   Mean   : 1.37   Mean   :4119   
 3rd Qu.:  311   3rd Qu.: 1.00   3rd Qu.:5790   
 Max.   :30817   Max.   :53.00   Max.   :8296   

Variables in each Mileage Account:
  ■ Balance: miles currently in the account that can be redeemed for award travel
  ■ QualMiles: miles that count toward qualifying for top-tier (elite) status
  ■ BonusMiles: mileage obtained by non-flight transactions in the past 12 months
  ■ BonusTrans: number of non-flight transactions in the past 12 months
  ■ FlightMiles: mileage obtained by flight in the past 12 months
  ■ FlightTrans: number of flights in the past 12 months
  ■ DaysSinceEnroll: days since enrolled

【B】Standardization

💡 scale(df) standardizes: it re-scales each numeric variable so that its mean is 0 and its standard deviation is 1
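
To make the definition concrete, here is a minimal sketch on a small made-up data frame (the `toy` values are hypothetical, not taken from the airline data). It standardizes each column by hand and checks that the result matches `scale()`.

```r
# Toy data frame (hypothetical values, not the airline data)
toy = data.frame(miles = c(1000, 5000, 20000, 74000),
                 trans = c(1, 3, 12, 20))

# Standardize by hand: subtract the column mean, divide by the column sd
manual = sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() does the same thing column by column
auto = scale(toy)

# Compare the numbers only; scale() also stores the means and sds as attributes
all.equal(manual, auto, check.attributes = FALSE)
```

The centering and scaling values that `scale()` used are kept in the attributes `scaled:center` and `scaled:scale`, which is handy when the standardized values need to be converted back.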

🗿 Q: Why do we standardize before clustering and/or dimension reduction?
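
One way to explore this question is to compare distances before and after standardization. The sketch below uses three made-up accounts (hypothetical values; only the column names are borrowed from the dataset): `a` and `b` have similar balances but very different flight activity, while `a` and `c` fly about equally often but differ in balance.

```r
# Three hypothetical accounts
toy = data.frame(Balance     = c(10000, 10500, 90000),
                 FlightTrans = c(0, 50, 1))
rownames(toy) = c("a", "b", "c")

# Euclidean distances on the original scale: Balance dominates
d_raw = as.matrix(dist(toy))
round(d_raw, 1)

# Euclidean distances after standardization: both variables contribute
d_std = as.matrix(dist(scale(toy)))
round(d_std, 2)
```

On the original scale, `a` and `b` look almost identical because the 50-flight gap is tiny next to differences in Balance. After standardization, the a-b and a-c distances are of similar size, so flight activity is no longer ignored.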


Standardize the variables in A and save the result in AN

AN = scale(A) %>% data.frame

Every variable in AN now has a mean of 0 and a standard deviation of 1.

sapply(AN, mean) %>% round(4)
        Balance       QualMiles      BonusMiles      BonusTrans     FlightMiles 
              0               0               0               0               0 
    FlightTrans DaysSinceEnroll 
              0               0 
sapply(AN, sd)
        Balance       QualMiles      BonusMiles      BonusTrans     FlightMiles 
              1               1               1               1               1 
    FlightTrans DaysSinceEnroll 
              1               1 



【C】Hierarchical Clustering

  1. Distance Matrix
d = dist(AN, method="euclidean")
  2. Hierarchical Clustering
hc = hclust(d, method='ward.D')
  3. Make Dendrogram
plot(hc)
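
A side note on the linkage method, sketched on the built-in USArrests data (a stand-in chosen for illustration, not the airline data). According to the R documentation for `hclust`, `"ward.D"` does not implement Ward's (1963) criterion when it is given plain Euclidean distances, whereas `"ward.D2"` does, because it squares the dissimilarities before updating the clusters. The sketch only checks that relationship; it does not change the analysis in this section.

```r
X = scale(USArrests)
d = dist(X, method = "euclidean")

hc_D   = hclust(d,   method = "ward.D")    # same kind of call as in this section
hc_D2  = hclust(d,   method = "ward.D2")   # Ward's criterion on plain distances
hc_Dsq = hclust(d^2, method = "ward.D")    # ward.D fed with squared distances

# ward.D on squared distances should reproduce ward.D2,
# with merge heights that differ only by a square root
all.equal(sqrt(hc_Dsq$height), hc_D2$height)
identical(cutree(hc_Dsq, k = 4), cutree(hc_D2, k = 4))

# ward.D on plain distances may assign some rows differently
table(ward.D = cutree(hc_D, k = 4), ward.D2 = cutree(hc_D2, k = 4))
```

Either method can give usable segments; the point is simply to know which criterion is being optimized when reporting the result.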


In a Dendrogram …

🗿 Q: How do we determine the number of groups by observing a dendrogram?


  4. Split the dendrogram and create the clustering vector
kg = cutree(hc, k=5)
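
The cut can also be inspected visually. The following is a minimal sketch on the built-in USArrests data (a stand-in for AN, and `k = 4` is an arbitrary choice for the illustration): it lists the heights of the last few merges, where a large jump hints at a natural place to cut, and then outlines the resulting groups on the dendrogram with `rect.hclust()`.

```r
# Built-in data used as a stand-in for AN (an assumption for illustration)
X  = scale(USArrests)
hc = hclust(dist(X, method = "euclidean"), method = "ward.D")

# Heights of the last few merges; a big jump suggests where to cut
tail(hc$height, 6)

# Cut into k groups and draw a box around each group on the dendrogram
k  = 4
kg = cutree(hc, k = k)
plot(hc, labels = FALSE, hang = -1)
rect.hclust(hc, k = k, border = "red")

# Group sizes
table(kg)
```

With the airline data, the same two lines after `plot(hc)`, using `k = 5`, would frame the five groups created above.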



【D】Examine the Group Characteristics

First we count the group size …

table(kg)
kg
   1    2    3    4    5 
 776  519  494  868 1342 

Then we examine the group characteristics through the group averages.

sapply(split(A,kg), colMeans) %>% round(2)   # original scale 
                       1         2         3        4        5
Balance         57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles           0.64   1065.98     30.35     4.85     2.51
BonusMiles      10360.12  22881.76  55795.86 20788.77  2264.79
BonusTrans         10.82     18.23     19.66    17.09     2.97
FlightMiles        83.18   2613.42    327.68   111.57   119.32
FlightTrans         0.30      7.40      1.07     0.34     0.44
DaysSinceEnroll  6235.36   4402.41   5615.71  2840.82  3060.08

The original scale is easier to understand in a table.
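
If a layout with one row per group is preferred, base R's `aggregate()` gives the same averages. This is a minimal sketch on made-up numbers (`toy` and `grp` are hypothetical stand-ins for A and kg), with the group size added as an extra column.

```r
# Hypothetical stand-ins for A and kg
toy = data.frame(Balance    = c(100, 300, 1000, 3000),
                 BonusTrans = c(1, 3, 10, 30))
grp = c(1, 1, 2, 2)

# Same approach as above: variables in rows, groups in columns
wide = sapply(split(toy, grp), colMeans)
wide

# Alternative: one row per group, plus the group size
long = aggregate(toy, by = list(group = grp), FUN = mean)
long$size = as.vector(table(grp))
long
```

Both tables hold the same means; the second form is often easier to sort, filter, or pass on to a plotting function.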

sapply(split(AN,kg), colMeans) %>% round(2)  # standardized scale 
                    1    2     3     4     5
Balance         -0.16 0.37  1.24 -0.21 -0.37
QualMiles       -0.19 1.19 -0.15 -0.18 -0.18
BonusMiles      -0.28 0.24  1.60  0.15 -0.62
BonusTrans      -0.08 0.69  0.84  0.57 -0.90
FlightMiles     -0.27 1.54 -0.09 -0.25 -0.24
FlightTrans     -0.28 1.59 -0.08 -0.27 -0.25
DaysSinceEnroll  1.03 0.14  0.72 -0.62 -0.51

The standardized variables make it possible to compare variables of different scales in a single chart.

par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=TRUE,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
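
As a complement to the chart, the most distinctive variable of each group can be read off programmatically. The sketch below retypes the standardized group means from the table above into a matrix (the name `m` is introduced here only for illustration) and picks, for each group, the variable whose mean lies furthest from 0.

```r
# Standardized group means, copied from the table above
vars = c("Balance", "QualMiles", "BonusMiles", "BonusTrans",
         "FlightMiles", "FlightTrans", "DaysSinceEnroll")
m = matrix(c(-0.16, 0.37,  1.24, -0.21, -0.37,
             -0.19, 1.19, -0.15, -0.18, -0.18,
             -0.28, 0.24,  1.60,  0.15, -0.62,
             -0.08, 0.69,  0.84,  0.57, -0.90,
             -0.27, 1.54, -0.09, -0.25, -0.24,
             -0.28, 1.59, -0.08, -0.27, -0.25,
              1.03, 0.14,  0.72, -0.62, -0.51),
           nrow = 7, byrow = TRUE,
           dimnames = list(vars, 1:5))

# For each group (column), the variable with the largest absolute mean
top = apply(m, 2, function(x) rownames(m)[which.max(abs(x))])
top
```

This is only a first hint for naming the groups: the sign of the mean matters too (group 5 stands out for unusually low BonusTrans, for example), and several variables may be jointly distinctive.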


🗿 DISCUSSION 1 :
  ■ What do the group means stand for?
  ■ What are the pros and cons of the ‘original’ versus the ‘standardized’ scales?
  ■ When should we use the ‘original’ scale?
  ■ When should we use the ‘standardized’ scales?


🗿 DISCUSSION 2 :
  ■ In business practice, is distance/similarity the only criterion for clustering?
  ■ When doing segmentation, what else should we consider?
  ■ Practically, what are the criteria for a good clustering?


🗿 DISCUSSION 3:
  ■ Based on their characteristics, pick a name for each of these 5 groups.
  ■ Design a marketing strategy for each of them.