pacman::p_load(dplyr)
A = read.csv('data/AirlinesCluster.csv')
summary(A)
    Balance          QualMiles       BonusMiles       BonusTrans
 Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0
 Median :  43097   Median :    0   Median :  7171   Median :12.0
 Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0
 Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0
  FlightMiles     FlightTrans    DaysSinceEnroll
 Min.   :    0   Min.   : 0.00   Min.   :   2
 1st Qu.:    0   1st Qu.: 0.00   1st Qu.:2330
 Median :    0   Median : 0.00   Median :4096
 Mean   :  460   Mean   : 1.37   Mean   :4119
 3rd Qu.:  311   3rd Qu.: 1.00   3rd Qu.:5790
 Max.   :30817   Max.   :53.00   Max.   :8296
Variables in each Mileage Account:
■ Balance: the regular mileage remaining
■ QualMiles: the high-class mileage remaining
■ BonusMiles: mileage obtained by non-flight transactions in the past 12 months
■ BonusTrans: number of non-flight transactions in the past 12 months
■ FlightMiles: mileage obtained by flight in the past 12 months
■ FlightTrans: number of flights in the past 12 months
■ DaysSinceEnroll: days since enrolled
💡 scale(df)
Standardize: rescale each numeric variable so that its mean is 0 and its standard deviation is 1
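As a minimal sketch of what scale() does, the toy data frame below (hypothetical values, not taken from AirlinesCluster.csv) is standardized once by hand and once with scale(). Under these assumptions the two results should agree up to floating-point error.

```r
# Toy data frame (hypothetical values, not from AirlinesCluster.csv)
df = data.frame(miles = c(100, 2500, 40000, 73000),
                trans = c(1, 3, 12, 20))

# Manual standardization: subtract the column mean, divide by the column sd
manual = as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))

# scale() returns a matrix, so convert it back to a data frame
scaled = as.data.frame(scale(df))

# Largest absolute difference between the two results (should be near zero)
max(abs(as.matrix(manual) - as.matrix(scaled)))
```

Note that sd() uses the sample standard deviation (divisor n - 1), which is also what scale() uses by default.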
🗿 Q:
Why do we standardize before clustering and/or dimension reduction?
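One way to explore this question is a small experiment. The sketch below uses three hypothetical customers (made-up values) with one large-scale variable and one small-scale variable, and compares Euclidean distances before and after standardization.

```r
# Three hypothetical customers; Balance is in the tens of thousands,
# BonusTrans is a small count
toy = data.frame(Balance    = c(10000, 10500, 90000),
                 BonusTrans = c(2, 20, 3))

raw = as.matrix(dist(toy))          # distances on the original scale
std = as.matrix(dist(scale(toy)))   # distances after standardization

# Distance between customers 1 and 2, relative to customers 1 and 3
raw[1, 2] / raw[1, 3]   # tiny: the gap in BonusTrans is drowned out by Balance
std[1, 2] / std[1, 3]   # no longer tiny: the gap in BonusTrans now counts
```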
Standardize the variables in A and save the result in AN
AN = scale(A) %>% data.frame
The mean and standard deviation of every variable in AN are 0 and 1, respectively.
sapply(AN, mean) %>% round(4)
        Balance       QualMiles      BonusMiles      BonusTrans     FlightMiles
              0               0               0               0               0
    FlightTrans DaysSinceEnroll
              0               0
sapply(AN, sd)
        Balance       QualMiles      BonusMiles      BonusTrans     FlightMiles
              1               1               1               1               1
    FlightTrans DaysSinceEnroll
              1               1
d = dist(AN, method="euclidean")
hc = hclust(d, method='ward.D')
plot(hc)
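To make the objects returned by dist() and hclust() more concrete, here is a minimal sketch on 10 randomly generated observations (hypothetical data; the names X, d_toy and hc_toy are chosen for illustration only). It follows the same steps as above: standardize, compute pairwise distances, then cluster.

```r
set.seed(1)                                   # reproducible toy data
X = data.frame(x = rnorm(10), y = rnorm(10))  # 10 hypothetical observations
X = as.data.frame(scale(X))                   # standardize first

d_toy  = dist(X, method = "euclidean")        # pairwise distances
hc_toy = hclust(d_toy, method = "ward.D")     # agglomerative clustering

length(d_toy)        # number of pairs: 10 * 9 / 2 = 45
nrow(hc_toy$merge)   # number of merge steps: 10 - 1 = 9
hc_toy$height        # height at which each merge happened
```

The heights in hc_toy$height are what the vertical axis of the dendrogram displays.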
In a Dendrogram …
🗿 Q:
How do we determine the number of groups by looking at a dendrogram?
kg = cutree(hc, k=5)
First, we count the group sizes …
table(kg)
kg
1 2 3 4 5
776 519 494 868 1342
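The sketch below illustrates, on simulated data, how cutree() and the merge heights relate. It assumes three well-separated blobs of 20 points each (hypothetical data, not the airline accounts), so the "right" number of groups is known in advance.

```r
set.seed(2)
# Three hypothetical, well-separated blobs of 20 points each
X = rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
          matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
          matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))
hc_toy = hclust(dist(X), method = "ward.D")

g_k = cutree(hc_toy, k = 3)   # ask for exactly 3 groups
table(g_k)                    # group sizes

# Jumps between consecutive merge heights; the last few merges are the
# ones that join whole blobs, so their jumps tend to be the largest
tail(diff(hc_toy$height), 4)
```

On real data the jumps are rarely this clear-cut, so the heights are best read as one input among several when choosing k.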
Then we examine the group characteristics through the group averages.
sapply(split(A,kg), colMeans) %>% round(2) # original scale
                       1         2         3        4        5
Balance         57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles           0.64   1065.98     30.35     4.85     2.51
BonusMiles      10360.12  22881.76  55795.86 20788.77  2264.79
BonusTrans         10.82     18.23     19.66    17.09     2.97
FlightMiles        83.18   2613.42    327.68   111.57   119.32
FlightTrans         0.30      7.40      1.07     0.34     0.44
DaysSinceEnroll  6235.36   4402.41   5615.71  2840.82  3060.08
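The split / sapply / colMeans pattern used above can be checked on a tiny example. The sketch below uses a hypothetical data frame toy and a hypothetical group vector g, and compares the pattern with base R's aggregate(), which returns the same means with one row per group instead of one column per group.

```r
# Toy data (hypothetical values) with a known group label
toy = data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c(10, 20, 30, 40, 50, 60))
g = c(1, 1, 2, 2, 3, 3)

# The pattern used above: variables in rows, one column of means per group
m1 = sapply(split(toy, g), colMeans)

# Base-R alternative: one row of means per group
m2 = aggregate(toy, by = list(group = g), FUN = mean)

m1
m2
```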
The original scale is easier to understand in a table.
sapply(split(AN,kg), colMeans) %>% round(2) # standardized scale
                    1     2     3     4     5
Balance         -0.16  0.37  1.24 -0.21 -0.37
QualMiles       -0.19  1.19 -0.15 -0.18 -0.18
BonusMiles      -0.28  0.24  1.60  0.15 -0.62
BonusTrans      -0.08  0.69  0.84  0.57 -0.90
FlightMiles     -0.27  1.54 -0.09 -0.25 -0.24
FlightTrans     -0.28  1.59 -0.08 -0.27 -0.25
DaysSinceEnroll  1.03  0.14  0.72 -0.62 -0.51
The standardized variables make it possible to compare variables of different scales in a single chart.
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=TRUE,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
🗿 DISCUSSION 1 :
■ What do the group means stand for?
■ What are the pros and cons of the ‘original’ versus the ‘standardized’ scales?
■ When should we use the ‘original’ scale?
■ When should we use the ‘standardized’ scales?
🗿 DISCUSSION 2 :
■ In business practice, is distance/similarity the only criterion for clustering?
■ When doing segmentation, what else should we consider?
■ In practice, what are the criteria for a good clustering?
🗿 DISCUSSION 3:
■ Based on their characteristics, pick a name for each of these 5 groups.
■ Design a marketing strategy for each of them.