💡 Key Learnings:
  ■ The Concept of Dimension Reduction
    § Variance as Information
    § Variance Decomposition
    § Principal Components
    § Eigenvalue & Variance Decomposition
  ■ The Tool for Principal Component Analysis (FactoMineR)
    § Coordinates of Variables and Subjects
    § Cos2 - Level of Representation
  ■ PCA Visualization
    § Visualization Tool (factoextra)
  ■ The Application of PCA
  ■ The Synergy of PCA and Clustering


REFERENCE: Statistical tools for high-throughput data analysis


pacman::p_load(dplyr, FactoMineR, factoextra, heatmaply)
§ The Decathlon dataset
D = decathlon2
head(D)
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
          Pole.vault Javeline X1500m Rank Points Competition
SEBRLE          5.02    63.19  291.7    1   8217    Decastar
CLAY            4.92    60.15  301.5    2   8122    Decastar
BERNARD         5.32    62.77  280.1    4   8067    Decastar
YURKOV          4.72    63.44  276.4    5   8036    Decastar
ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
McMULLEN        4.42    56.37  285.1    8   7995    Decastar


【A】PCA - Principal Component Analysis

Compared with the hclust tool, the PCA() function is more user-friendly.

pca = PCA(D[,1:10])
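
By default, PCA() also draws the variable and individual maps while it runs. If only the fitted object is wanted, the graph argument turns the plots off (a minimal sketch):

pca = PCA(D[,1:10], graph=FALSE)   # fit only, no automatic plots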

§ the pca object
  • PCA() returns an object of class PCA. We name it pca.
  • pca is a list composed of many sub-elements.
pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 27 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"          


§ Information in Variables

Variance can be treated as Information.

D[,1:10] %>% sapply(var) 
       X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
   0.0793410    0.0866567    0.6995923    0.0091387    0.9550439    0.2210692 
      Discus   Pole.vault     Javeline       X1500m 
  11.7989311    0.0678872   27.3651718  104.1582652 

The scale() function standardizes the variables to unit variance, so that every variable gets the same weight in the analysis.

D[,1:10] %>% scale %>% apply(2,var)
       X100m    Long.jump     Shot.put    High.jump        X400m X110m.hurdle 
           1            1            1            1            1            1 
      Discus   Pole.vault     Javeline       X1500m 
           1            1            1            1 
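
In fact, PCA() standardizes the data for us: scale.unit=TRUE is its default, so manual scaling is not required. A minimal sketch making the default explicit:

PCA(D[,1:10], scale.unit=TRUE, graph=FALSE)   # identical fit; scaling is built in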

Correlations imply that some of the information in the variables overlaps.

D[,1:10] %>% cor %>% heatmaply_cor(
  cexRow=0.8, cexCol=0.8, k_row=3, k_col=3,
  hide_colorbar=T)


§ Principal Components and their Eigenvalues

Principal Components - Dimensions without Correlation

  • Performing PCA on 10 variables produces 10 Principal Components (PCs).
  • Variables usually correlate, i.e., the information they carry overlaps.
  • PCs are independent of each other; there is no redundant information in the PC space.
  • In other words, PCA removes the redundant information among the variables.
  • So we can compare the subjects across all the variables in a low-dimensional space.

Eigenvalues - The Amount of Information in the PCs

  • Eigenvalues represent the amount of information carried by each PC.
  • PCs are sorted in descending order of eigenvalue.
  • The first two PCs carry the maximum amount of information that can be displayed in a 2-D space.
get_eigenvalue(pca)
       eigenvalue variance.percent cumulative.variance.percent
Dim.1     3.74997          37.4997                      37.500
Dim.2     1.74517          17.4517                      54.951
Dim.3     1.51783          15.1783                      70.130
Dim.4     1.03220          10.3220                      80.452
Dim.5     0.61784           6.1784                      86.630
Dim.6     0.42829           4.2829                      90.913
Dim.7     0.32591           3.2591                      94.172
Dim.8     0.27938           2.7938                      96.966
Dim.9     0.19111           1.9111                      98.877
Dim.10    0.11230           1.1230                     100.000

By default (ncp=5), PCA() keeps the first 5 PCs, which carry 86.63% of all the information.
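
These eigenvalues are exactly the eigenvalues of the correlation matrix of the ten variables, which base R can confirm in one line (a sketch):

eigen(cor(D[,1:10]))$values %>% round(5)

A scree plot of the same numbers is available via factoextra:

fviz_eig(pca, addlabels=TRUE)   # bar chart of variance.percent per Dim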

pca$ind$coord %>% dim  
[1] 27  5

PCA maps the subjects into a space where the correlations among their coordinates equal 0.

pca$ind$coord %>% cor  %>% round(8)
      Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Dim.1     1     0     0     0     0
Dim.2     0     1     0     0     0
Dim.3     0     0     1     0     0
Dim.4     0     0     0     1     0
Dim.5     0     0     0     0     1
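
Each Dim's variance also equals its eigenvalue. Note that FactoMineR weights individuals by 1/n, so the check below uses the population variance (a sketch under that assumption):

pca$ind$coord %>% apply(2, function(x) mean((x - mean(x))^2)) %>% round(5)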


【B】Variables in the PC Space

§ pca$var$coord: coordinates of the variables
pca$var$coord
                 Dim.1     Dim.2      Dim.3     Dim.4    Dim.5
X100m        -0.818952  0.342779  0.1008645  0.101342 -0.21981
Long.jump     0.758899 -0.381493 -0.0062613 -0.185424  0.26371
Shot.put      0.715078  0.282117  0.4738546  0.036104 -0.27864
High.jump     0.608493  0.611354  0.0046060  0.071244  0.30059
X400m        -0.643848  0.148422  0.5157594  0.269785  0.19924
X110m.hurdle -0.716420  0.297552  0.4164510 -0.159781  0.16102
Discus        0.716888  0.204398  0.2703222  0.397623 -0.33949
Pole.vault   -0.221417 -0.737548  0.4030836 -0.251549 -0.26259
Javeline      0.355176  0.098531  0.6954337 -0.485559  0.13342
X1500m        0.069712 -0.568120  0.3527578  0.652461  0.25368
§ pca$var$cos2: the fraction of each variable's information carried in each PC
pca$var$cos2
                 Dim.1     Dim.2       Dim.3     Dim.4    Dim.5
X100m        0.6706825 0.1174973 0.010173655 0.0102702 0.048315
Long.jump    0.5759270 0.1455370 0.000039203 0.0343821 0.069545
Shot.put     0.5113370 0.0795898 0.224538173 0.0013035 0.077639
High.jump    0.3702640 0.3737540 0.000021215 0.0050756 0.090352
X400m        0.4145404 0.0220292 0.266007740 0.0727841 0.039696
X110m.hurdle 0.5132580 0.0885371 0.173431450 0.0255299 0.025928
Discus       0.5139286 0.0417785 0.073074093 0.1581041 0.115255
Pole.vault   0.0490256 0.5439768 0.162476407 0.0632771 0.068955
Javeline     0.1261497 0.0097083 0.483627983 0.2357675 0.017802
X1500m       0.0048598 0.3227600 0.124438033 0.4257058 0.064353
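
Since the variable coordinates are correlations, cos2 is simply their square; a quick check (a sketch):

(pca$var$coord^2) %>% round(7) %>% head(3)   # first rows match pca$var$cos2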
§ fviz_pca_var: plot variables in the PC space
fviz_pca_var(pca)

  • The correlations between the variables and the Dims are used as the coordinates to draw the variables (verified below).
  • The closer a variable is to the unit circle, the better its information is represented in this space.
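
The first bullet can be verified directly: correlating the raw variables with the individual coordinates reproduces pca$var$coord (a sketch):

cor(D[,1:10], pca$ind$coord) %>% round(6)   # equals pca$var$coord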


【C】Observations in the PC Space

§ pca$ind$coord: coordinates of the observations
pca$ind$coord
                Dim.1     Dim.2     Dim.3     Dim.4     Dim.5
SEBRLE       0.277958 -0.536434  1.585239  0.105823  1.074623
CLAY         0.904854 -2.094280  0.840685  1.850718 -0.408645
BERNARD     -1.372266 -1.348116  0.961932 -1.493072 -0.182667
YURKOV      -0.928205  2.281744  1.942688  0.096823  0.190927
ZSIVOCZKY   -0.103817  1.089822 -2.098908  0.071906 -0.032938
McMULLEN     0.239858  0.939092 -0.818136  1.201893  1.830199
MARTINEAU   -2.537291  1.801094  0.051975  0.374306 -2.285411
HERNU       -1.902843 -0.330277  1.288682  0.766505  0.239465
BARRAS      -1.805625  0.302590 -0.592810  0.656526 -0.244039
NOOL        -2.881737  0.863854 -1.402448 -1.491195  1.358726
BOURGUIGNON -4.505530 -0.485422  1.202704  0.951363  0.508865
Sebrle       3.567756  0.068007  1.911216 -1.042363 -0.300596
Clay         3.472177 -0.705599  1.607029 -0.696108  0.741815
Karpov       4.328761  0.160789 -1.152529  0.407689 -0.772485
Macey        1.944475  2.523948 -0.260304 -0.079809 -0.025024
Warners      1.552082 -1.488634 -1.414196 -0.549665  0.102190
Zsivoczky    0.475153  1.971763  0.900183 -0.725288  0.171696
Hernu        0.280841  0.822696 -0.905794 -0.782389 -0.771389
Bernard      1.533280  1.085832 -1.245717  0.534722  1.042879
Schwarzl    -0.677974 -1.134257 -0.422180 -0.609851 -0.100686
Pogorelov   -0.077879 -0.333658  0.607951  1.446999  0.204623
Schoenbeck  -0.487405 -0.860688  0.866712 -0.173226 -0.432985
Barras      -0.413081  1.366893  0.227296 -0.753324 -0.861778
KARPOV       0.967748 -0.995599 -0.476019  2.501069 -0.661349
WARNERS     -0.280043 -0.912158 -1.416982 -0.147161  0.036162
Nool        -0.535389 -2.135638  0.616053 -1.808079 -0.319954
Drews       -1.035856 -1.917364 -2.404320 -0.614812 -0.102226
§ pca$ind$cos2: the fraction of each observation's information carried in each PC
pca$ind$cos2
                Dim.1      Dim.2      Dim.3      Dim.4       Dim.5
SEBRLE      0.0154465 0.05753138 0.50241310 0.00223886 0.230878821
CLAY        0.0655742 0.35127414 0.05660346 0.27431969 0.013374224
BERNARD     0.2322366 0.22413434 0.11411498 0.27492584 0.004115041
YURKOV      0.0848122 0.51251263 0.37151534 0.00092284 0.003588447
ZSIVOCZKY   0.0015087 0.16625497 0.61666612 0.00072377 0.000151861
McMULLEN    0.0082682 0.12674108 0.09619490 0.20760230 0.481390643
MARTINEAU   0.4048095 0.20397763 0.00016986 0.00880975 0.328426736
HERNU       0.3626087 0.01092418 0.16631227 0.05883861 0.005742729
BARRAS      0.6179591 0.01735462 0.06660942 0.08169748 0.011288151
NOOL        0.5144396 0.04622816 0.12184266 0.13775107 0.114364103
BOURGUIGNON 0.8414496 0.00976733 0.05995894 0.03751710 0.010733484
Sebrle      0.6742810 0.00024499 0.19349512 0.05755578 0.004786484
Clay        0.6864011 0.02834588 0.14703531 0.02758845 0.031330374
Karpov      0.8267406 0.00114065 0.05860654 0.00733332 0.026328251
Macey       0.3497653 0.58929505 0.00626806 0.00058922 0.000057928
Warners     0.3365404 0.30958768 0.27940052 0.04220890 0.001458908
Zsivoczky   0.0354688 0.61078646 0.12730376 0.08264196 0.004631289
Hernu       0.0247329 0.21224209 0.25728350 0.19195436 0.186594955
Bernard     0.3368420 0.16893075 0.22234247 0.04096756 0.155830032
Schwarzl    0.1319041 0.36919405 0.05114794 0.10672826 0.002909200
Pogorelov   0.0010328 0.01895803 0.06294013 0.35655465 0.007130132
Schoenbeck  0.0643656 0.20070808 0.20352752 0.00813015 0.050794887
Barras      0.0272368 0.29823283 0.00824647 0.09058360 0.118543191
KARPOV      0.0868773 0.09194984 0.02101993 0.58027427 0.040573584
WARNERS     0.0166843 0.17700968 0.42715467 0.00460727 0.000278197
Nool        0.0312477 0.49720414 0.04137291 0.35638092 0.011159741
Drews       0.0963567 0.33013525 0.51911955 0.03394434 0.000938440
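
Summing cos2 over the five kept Dims shows how much of each athlete's information survives the dimension reduction (a sketch):

pca$ind$cos2 %>% rowSums %>% round(3) %>% sort %>% head(5)   # least well represented athletes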
§ fviz_pca_ind: plot the observations in the PC space
fviz_pca_ind(pca)



【D】Visualize PCA in Biplots

§ fviz_pca_biplot: plot variables and observations in the same chart
fviz_pca_biplot(
  pca, pointsize="cos2", repel=T, labelsize=3,
  col.var="red", col.ind="#E7B800", alpha.ind=0.3)

  • The longer a variable's arrow (the bigger an observation's point), the better it is represented in this 2-D space.

【E】Clustering Analysis with Kmeans

§ Kmeans vs Hierarchical Clustering

Hierarchical Clustering

  • needs a distance matrix
  • so it cannot handle too many observations
  • does not require the number of groups to be specified in advance
  • lets us inspect the dendrogram before deciding the number of groups

Kmeans Clustering

  • a distance matrix is not required
  • can handle a large number of observations
  • requires the number of groups to be specified
  • may give a different result on each run (see the nstart sketch below)
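
Because of the random initialization, a common remedy (a sketch, beyond the original call below) is the nstart argument, which runs several random starts and keeps the best solution:

kmeans(scale(D[,1:10]), 5, nstart=25)   # 25 random starts, best one kept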
§ Running Kmeans Clustering
set.seed(444)    # set a random seed so the result is reproducible
kmg = kmeans(scale(D[,1:10]), 5)$cluster %>% factor
table(kmg)
kmg
1 2 3 4 5 
7 6 6 3 5 
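
For comparison, the same grouping with hierarchical clustering, cross-tabulated against the kmeans groups (a sketch; hcg is a name introduced here):

hcg = scale(D[,1:10]) %>% dist %>% hclust %>% cutree(k=5) %>% factor
table(kmg, hcg)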
§ Visualize PCA with Clustering
fviz_pca_biplot(
  pca, repel=T, col.var="black", labelsize=3, col.ind=kmg, 
  alpha.ind=0.6, pointshape=16, mean.point=F, 
  addEllipses = T, ellipse.type="convex") + 
  theme(legend.position = "none")


💡 KEY LEARNINGS We can read three essential pieces of information from a PCA biplot …
  ■ The correlation among the variables
  ■ The similarity among the subjects
  ■ The relationship between the observations and the variables



§ Choosing the Dimensions
fviz_pca_biplot(
  pca, axes=c(1,3),   # plotting the space of Dim1-Dim3
  repel=T, col.var="black", labelsize=3, col.ind=kmg, 
  alpha.ind=0.6, pointshape=16, mean.point=F, 
  addEllipses = T, ellipse.type="convex") + 
  theme(legend.position = "none")
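
To decide which Dims deserve a look, fviz_cos2() from factoextra plots how well each variable is represented on a chosen set of axes (a sketch):

fviz_cos2(pca, choice="var", axes=c(1,3))   # representation quality in the Dim1-Dim3 space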