💡 Key Learnings:
  ■ The Concept of Dimension Reduction
    § Variance as Information
    § Variance Decomposition
    § Principal Components
    § Eigenvalue & Variance Decomposition
  ■ The Tool for Principal Component Analysis (FactoMineR)
    § Coordinates of Variables and Subjects
    § Cos2 - Level of Representation
  ■ PCA Visualization
    § Visualization Tool (factoextra)
  ■ The Application of PCA
  ■ The Synergy of PCA and Clustering


REFERENCE: Statistical tools for high-throughput data analysis


pacman::p_load(dplyr, FactoMineR, factoextra, heatmaply)
§ The Decathlon dataset
D = decathlon2
head(D)
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
          Pole.vault Javeline X1500m Rank Points Competition
SEBRLE          5.02    63.19  291.7    1   8217    Decastar
CLAY            4.92    60.15  301.5    2   8122    Decastar
BERNARD         5.32    62.77  280.1    4   8067    Decastar
YURKOV          4.72    63.44  276.4    5   8036    Decastar
ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
McMULLEN        4.42    56.37  285.1    8   7995    Decastar


【A】PCA - Principal Component Analysis

Compared with the hclust tool, the PCA() function is more user-friendly.

pca = PCA(D[,1:10])
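
By default, PCA() also draws the variable and individual maps while it runs. If only the fitted object is wanted, the graph argument turns the plots off (a minimal sketch):

pca = PCA(D[,1:10], graph=FALSE)   # fit only, no automatic plots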

§ the pca object
  • PCA() returns an object of class PCA. We name it pca.
  • pca is a list composed of many sub-elements.
pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 27 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"          


§ Information in Variables

Variance can be treated as Information.

D[,1:10] %>% sapply(var) 
       X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
   0.0793410    0.0866567    0.6995923    0.0091387    0.9550439    0.2210692 
      Discus   Pole.vault     Javeline       X1500m 
  11.7989311    0.0678872   27.3651718  104.1582652 

The scale() function standardizes the variables to unit variance, so that every variable gets the same weight in the analysis.

D[,1:10] %>% scale %>% apply(2,var)
       X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
           1            1            1            1            1            1 
      Discus   Pole.vault     Javeline       X1500m 
           1            1            1            1 
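
In fact, PCA() standardizes the data for us: scale.unit=TRUE is its default, so manual scaling is not required. A minimal sketch making the default explicit:

PCA(D[,1:10], scale.unit=TRUE, graph=FALSE)   # identical fit; scaling is built in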

Correlations imply that some of the information in the variables overlaps.

D[,1:10] %>% cor %>% heatmaply_cor(
  cexRow=0.8, cexCol=0.8, k_row=3, k_col=3,
  hide_colorbar=T)


§ Principal Components and their Eigenvalues

Principal Components - Dimensions without Correlation

  • Performing PCA on 10 variables produces 10 Principal Components (PCs).
  • Variables usually correlate, i.e., the information they carry overlaps.
  • PCs are independent of each other; there is no redundant information in the PC space.
  • In other words, PCA removes the redundant information among the variables.
  • So we can compare the subjects across all the variables in a low-dimensional space.

Eigenvalues - The Amount of Information in the PCs

  • Eigenvalues represent the amount of information carried by each PC.
  • PCs are sorted in descending order of eigenvalue.
  • The first two PCs carry the maximum amount of information that can be displayed in a 2-D space.
get_eigenvalue(pca)
       eigenvalue variance.percent cumulative.variance.percent
Dim.1     3.74997          37.4997                      37.500
Dim.2     1.74517          17.4517                      54.951
Dim.3     1.51783          15.1783                      70.130
Dim.4     1.03220          10.3220                      80.452
Dim.5     0.61784           6.1784                      86.630
Dim.6     0.42829           4.2829                      90.913
Dim.7     0.32591           3.2591                      94.172
Dim.8     0.27938           2.7938                      96.966
Dim.9     0.19111           1.9111                      98.877
Dim.10    0.11230           1.1230                     100.000

By default (ncp=5), PCA() keeps the first 5 PCs, which carry 86.63% of all the information.
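
These eigenvalues are exactly the eigenvalues of the correlation matrix of the ten variables, which base R can confirm in one line (a sketch):

eigen(cor(D[,1:10]))$values %>% round(5)

A scree plot of the same numbers is available via factoextra:

fviz_eig(pca, addlabels=TRUE)   # bar chart of variance.percent per Dim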

pca$ind$coord %>% dim  
[1] 27  5

PCA maps the subjects into a space where the correlations among their coordinates equal 0.

pca$ind$coord %>% cor  %>% round(8)
      Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Dim.1     1     0     0     0     0
Dim.2     0     1     0     0     0
Dim.3     0     0     1     0     0
Dim.4     0     0     0     1     0
Dim.5     0     0     0     0     1
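
Each Dim's variance also equals its eigenvalue. Note that FactoMineR weights individuals by 1/n, so the check below uses the population variance (a sketch under that assumption):

pca$ind$coord %>% apply(2, function(x) mean((x - mean(x))^2)) %>% round(5)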


【B】Variables in the PC Space

§ pca$var$coord: coordinates of the variables
pca$var$coord
                 Dim.1     Dim.2      Dim.3     Dim.4    Dim.5
X100m        -0.818952  0.342779  0.1008645  0.101342 -0.21981
Long.jump     0.758899 -0.381493 -0.0062613 -0.185424  0.26371
Shot.put      0.715078  0.282117  0.4738546  0.036104 -0.27864
High.jump     0.608493  0.611354  0.0046060  0.071244  0.30059
X400m        -0.643848  0.148422  0.5157594  0.269785  0.19924
X110m.hurdle -0.716420  0.297552  0.4164510 -0.159781  0.16102
Discus        0.716888  0.204398  0.2703222  0.397623 -0.33949
Pole.vault   -0.221417 -0.737548  0.4030836 -0.251549 -0.26259
Javeline      0.355176  0.098531  0.6954337 -0.485559  0.13342
X1500m        0.069712 -0.568120  0.3527578  0.652461  0.25368
§ pca$var$cos2: the fraction of each variable's information carried in each PC
pca$var$cos2
                 Dim.1     Dim.2       Dim.3     Dim.4    Dim.5
X100m        0.6706825 0.1174973 0.010173655 0.0102702 0.048315
Long.jump    0.5759270 0.1455370 0.000039203 0.0343821 0.069545
Shot.put     0.5113370 0.0795898 0.224538173 0.0013035 0.077639
High.jump    0.3702640 0.3737540 0.000021215 0.0050756 0.090352
X400m        0.4145404 0.0220292 0.266007740 0.0727841 0.039696
X110m.hurdle 0.5132580 0.0885371 0.173431450 0.0255299 0.025928
Discus       0.5139286 0.0417785 0.073074093 0.1581041 0.115255
Pole.vault   0.0490256 0.5439768 0.162476407 0.0632771 0.068955
Javeline     0.1261497 0.0097083 0.483627983 0.2357675 0.017802
X1500m       0.0048598 0.3227600 0.124438033 0.4257058 0.064353
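
Since the variable coordinates are correlations, cos2 is simply their square; a quick check (a sketch):

(pca$var$coord^2) %>% round(7) %>% head(3)   # first rows match pca$var$cos2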
§ fviz_pca_var: plot variables in the PC space
fviz_pca_var(pca)

  • The correlations between the variables and the Dims are used as the coordinates to draw the variables (verified below).
  • The closer a variable is to the unit circle, the better its information is represented in this space.
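
The first bullet can be verified directly: correlating the raw variables with the individual coordinates reproduces pca$var$coord (a sketch):

cor(D[,1:10], pca$ind$coord) %>% round(6)   # equals pca$var$coord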


【C】Observations in the PC Space

§ pca$ind$coord: coordinates of the observations
pca$ind$coord
                Dim.1     Dim.2     Dim.3     Dim.4     Dim.5
SEBRLE       0.277958 -0.536434  1.585239  0.105823  1.074623
CLAY         0.904854 -2.094280  0.840685  1.850718 -0.408645
BERNARD     -1.372266 -1.348116  0.961932 -1.493072 -0.182667
YURKOV      -0.928205  2.281744  1.942688  0.096823  0.190927
ZSIVOCZKY   -0.103817  1.089822 -2.098908  0.071906 -0.032938
McMULLEN     0.239858  0.939092 -0.818136  1.201893  1.830199
MARTINEAU   -2.537291  1.801094  0.051975  0.374306 -2.285411
HERNU       -1.902843 -0.330277  1.288682  0.766505  0.239465
BARRAS      -1.805625  0.302590 -0.592810  0.656526 -0.244039
NOOL        -2.881737  0.863854 -1.402448 -1.491195  1.358726
BOURGUIGNON -4.505530 -0.485422  1.202704  0.951363  0.508865
Sebrle       3.567756  0.068007  1.911216 -1.042363 -0.300596
Clay         3.472177 -0.705599  1.607029 -0.696108  0.741815
Karpov       4.328761  0.160789 -1.152529  0.407689 -0.772485
Macey        1.944475  2.523948 -0.260304 -0.079809 -0.025024
Warners      1.552082 -1.488634 -1.414196 -0.549665  0.102190
Zsivoczky    0.475153  1.971763  0.900183 -0.725288  0.171696
Hernu        0.280841  0.822696 -0.905794 -0.782389 -0.771389
Bernard      1.533280  1.085832 -1.245717  0.534722  1.042879
Schwarzl    -0.677974 -1.134257 -0.422180 -0.609851 -0.100686
Pogorelov   -0.077879 -0.333658  0.607951  1.446999  0.204623
Schoenbeck  -0.487405 -0.860688  0.866712 -0.173226 -0.432985
Barras      -0.413081  1.366893  0.227296 -0.753324 -0.861778
KARPOV       0.967748 -0.995599 -0.476019  2.501069 -0.661349
WARNERS     -0.280043 -0.912158 -1.416982 -0.147161  0.036162
Nool        -0.535389 -2.135638  0.616053 -1.808079 -0.319954
Drews       -1.035856 -1.917364 -2.404320 -0.614812 -0.102226
§ pca$ind$cos2: the fraction of each observation's information carried in each PC
pca$ind$cos2
                Dim.1      Dim.2      Dim.3      Dim.4       Dim.5
SEBRLE      0.0154465 0.05753138 0.50241310 0.00223886 0.230878821
CLAY        0.0655742 0.35127414 0.05660346 0.27431969 0.013374224
BERNARD     0.2322366 0.22413434 0.11411498 0.27492584 0.004115041
YURKOV      0.0848122 0.51251263 0.37151534 0.00092284 0.003588447
ZSIVOCZKY   0.0015087 0.16625497 0.61666612 0.00072377 0.000151861
McMULLEN    0.0082682 0.12674108 0.09619490 0.20760230 0.481390643
MARTINEAU   0.4048095 0.20397763 0.00016986 0.00880975 0.328426736
HERNU       0.3626087 0.01092418 0.16631227 0.05883861 0.005742729
BARRAS      0.6179591 0.01735462 0.06660942 0.08169748 0.011288151
NOOL        0.5144396 0.04622816 0.12184266 0.13775107 0.114364103
BOURGUIGNON 0.8414496 0.00976733 0.05995894 0.03751710 0.010733484
Sebrle      0.6742810 0.00024499 0.19349512 0.05755578 0.004786484
Clay        0.6864011 0.02834588 0.14703531 0.02758845 0.031330374
Karpov      0.8267406 0.00114065 0.05860654 0.00733332 0.026328251
Macey       0.3497653 0.58929505 0.00626806 0.00058922 0.000057928
Warners     0.3365404 0.30958768 0.27940052 0.04220890 0.001458908
Zsivoczky   0.0354688 0.61078646 0.12730376 0.08264196 0.004631289
Hernu       0.0247329 0.21224209 0.25728350 0.19195436 0.186594955
Bernard     0.3368420 0.16893075 0.22234247 0.04096756 0.155830032
Schwarzl    0.1319041 0.36919405 0.05114794 0.10672826 0.002909200
Pogorelov   0.0010328 0.01895803 0.06294013 0.35655465 0.007130132
Schoenbeck  0.0643656 0.20070808 0.20352752 0.00813015 0.050794887
Barras      0.0272368 0.29823283 0.00824647 0.09058360 0.118543191
KARPOV      0.0868773 0.09194984 0.02101993 0.58027427 0.040573584
WARNERS     0.0166843 0.17700968 0.42715467 0.00460727 0.000278197
Nool        0.0312477 0.49720414 0.04137291 0.35638092 0.011159741
Drews       0.0963567 0.33013525 0.51911955 0.03394434 0.000938440
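
Summing cos2 over the five kept Dims shows how much of each athlete's information survives the dimension reduction (a sketch):

pca$ind$cos2 %>% rowSums %>% round(3) %>% sort %>% head(5)   # least well represented athletes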
§ fviz_pca_ind: plot the observations in the PC space
fviz_pca_ind(pca)



【D】Visualize PCA in Biplots

§ fviz_pca_biplot: plot variables and observations in the same chart
fviz_pca_biplot(
  pca, pointsize="cos2", repel=T, labelsize=3,
  col.var="red", col.ind="#E7B800", alpha.ind=0.3)

  • The longer a variable's arrow (the bigger an observation's point), the better it is represented in this 2-D space.

【E】Clustering Analysis with Kmeans

§ Kmeans vs Hierarchical Clustering

Hierarchical Clustering

  • needs a distance matrix
  • so it cannot handle too many observations
  • does not require the number of groups to be specified in advance
  • lets us inspect the dendrogram before deciding the number of groups

Kmeans Clustering

  • a distance matrix is not required
  • can handle a large number of observations
  • requires the number of groups to be specified
  • may give a different result on each run (see the nstart sketch below)
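
Because of the random initialization, a common remedy (a sketch, beyond the original call below) is the nstart argument, which runs several random starts and keeps the best solution:

kmeans(scale(D[,1:10]), 5, nstart=25)   # 25 random starts, best one kept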
§ Running Kmeans Clustering
set.seed(444)    # set a random seed so the result is reproducible
kmg = kmeans(scale(D[,1:10]), 5)$cluster %>% factor
table(kmg)
kmg
1 2 3 4 5 
7 6 6 3 5 
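
For comparison, the same grouping with hierarchical clustering, cross-tabulated against the kmeans groups (a sketch; hcg is a name introduced here):

hcg = scale(D[,1:10]) %>% dist %>% hclust %>% cutree(k=5) %>% factor
table(kmg, hcg)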
§ Visualize PCA with Clustering
fviz_pca_biplot(
  pca, repel=T, col.var="black", labelsize=3, col.ind=kmg, 
  alpha.ind=0.6, pointshape=16, mean.point=F, 
  addEllipses = T, ellipse.type="convex") + 
  theme(legend.position = "none")


💡 KEY LEARNINGS We can read three essential pieces of information from a PCA biplot …
  ■ The correlation among the variables
  ■ The similarity among the subjects
  ■ The relationship between the observations and the variables



§ Choosing the Dimensions
fviz_pca_biplot(
  pca, axes=c(1,3),   # plotting the space of Dim1-Dim3
  repel=T, col.var="black", labelsize=3, col.ind=kmg, 
  alpha.ind=0.6, pointshape=16, mean.point=F, 
  addEllipses = T, ellipse.type="convex") + 
  theme(legend.position = "none")
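
To decide which Dims deserve a look, fviz_cos2() from factoextra plots how well each variable is represented on a chosen set of axes (a sketch):

fviz_cos2(pca, choice="var", axes=c(1,3))   # representation quality in the Dim1-Dim3 space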