💡 Key Learnings:
■ The Concept of Dimension Reduction
§ Variance as Information
§ Variance Decomposition
§ Principal Components?
§ Eigenvalue & Variance Decomposition
■ The Tool for Principal Component Analysis (FactoMineR)
§ Coordinates of Variables and Subjects
§ cos2 - Level of Representation
■ PCA Visualization
§ Visualization Tool (factoextra)
■ The Application of PCA
■ The Synergy of PCA and Clustering
REFERENCE: Statistical tools for high-throughput data analysis
pacman::p_load(dplyr, FactoMineR, factoextra)
D = decathlon2
head(D)
X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75
CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41
Pole.vault Javeline X1500m Rank Points Competition
SEBRLE 5.02 63.19 291.7 1 8217 Decastar
CLAY 4.92 60.15 301.5 2 8122 Decastar
BERNARD 5.32 62.77 280.1 4 8067 Decastar
YURKOV 4.72 63.44 276.4 5 8036 Decastar
ZSIVOCZKY 4.42 55.37 268.0 7 8004 Decastar
McMULLEN 4.42 56.37 285.1 8 7995 Decastar
Compared with the hclust tool, the PCA function is friendlier to use.
PCA(): the default arguments are usually good enough.
pca = PCA(D[,1:10])
The pca object
PCA() returns an object of class PCA, which we name pca. pca is a list composed of many sub-elements.
pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 27 individuals, described by 10 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
Variance can be treated as Information.
D[,1:10] %>% sapply(var)
X100m Long.jump Shot.put High.jump X400m X110m.hurdle
0.0793410 0.0866567 0.6995923 0.0091387 0.9550439 0.2210692
Discus Pole.vault Javeline X1500m
11.7989311 0.0678872 27.3651718 104.1582652
The scale() function standardizes every variable to unit variance, so that all variables carry the same weight in the analysis.
D[,1:10] %>% scale %>% apply(2,var)
X100m Long.jump Shot.put High.jump X400m X110m.hurdle
1 1 1 1 1 1
Discus Pole.vault Javeline X1500m
1 1 1 1
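As a minimal sketch of what standardization does, the block below computes it by hand and compares the result with scale(). It uses the built-in mtcars data as a stand-in for the decathlon data, and the names X, Z_manual and Z_scale are illustrative assumptions, not part of the original analysis.

```r
# Stand-in data: four columns of the built-in mtcars data set
# (an assumption for illustration; the text uses decathlon2)
X <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Standardize by hand: subtract the column mean, divide by the column sd
Z_manual <- sapply(X, function(v) (v - mean(v)) / sd(v))

# scale() performs the same centering and scaling in one call
Z_scale <- scale(X)

raw_var <- sapply(X, var)          # variances differ by orders of magnitude
std_var <- apply(Z_scale, 2, var)  # every variance becomes 1

print(round(raw_var, 2))
print(round(std_var, 2))
```

Without this step, the variables measured on the largest scale would dominate the analysis simply because of their units.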
Correlation among the variables implies that some of their information overlaps.
# heatmaply_cor() comes from the heatmaply package, which is not loaded above
D[,1:10] %>% cor %>% heatmaply::heatmaply_cor(
  cexRow=0.8, cexCol=0.8, k_row=3, k_col=3,
  hide_colorbar=T)
Principal Components - Dimensions without Correlation
Eigenvalues - The amount of information in the PCs
get_eigenvalue(pca)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 3.74997 37.4997 37.500
Dim.2 1.74517 17.4517 54.951
Dim.3 1.51783 15.1783 70.130
Dim.4 1.03220 10.3220 80.452
Dim.5 0.61784 6.1784 86.630
Dim.6 0.42829 4.2829 90.913
Dim.7 0.32591 3.2591 94.172
Dim.8 0.27938 2.7938 96.966
Dim.9 0.19111 1.9111 98.877
Dim.10 0.11230 1.1230 100.000
By default, PCA() keeps 5 PCs (ncp = 5), which together carry 86.63% of the total information.
pca$ind$coord %>% dim
[1] 27 5
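The structure of the eigenvalue table can be sketched in base R: for standardized data, the eigenvalues are those of the correlation matrix, they sum to the number of variables, and each percentage is the eigenvalue divided by that sum. The block uses mtcars as a stand-in data set, and eig_table is an illustrative name; it mimics the layout of get_eigenvalue() rather than calling it.

```r
# Stand-in data (assumption): five columns of the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Eigenvalues of the correlation matrix, returned largest first
ev <- eigen(cor(X))$values

# Same three columns as the eigenvalue table in the text
eig_table <- data.frame(
  eigenvalue                  = ev,
  variance.percent            = 100 * ev / sum(ev),
  cumulative.variance.percent = 100 * cumsum(ev) / sum(ev),
  row.names                   = paste0("Dim.", seq_along(ev))
)

print(round(eig_table, 3))
print(sum(ev))  # equals ncol(X): each standardized variable contributes 1
```

This is why, with 10 standardized variables, an eigenvalue of 3.74997 corresponds to 37.4997% of the information.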
PCA maps the subjects into a space in which the correlations among their coordinates are all equal to 0.
pca$ind$coord %>% cor %>% round(8)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
Dim.1 1 0 0 0 0
Dim.2 0 1 0 0 0
Dim.3 0 0 1 0 0
Dim.4 0 0 0 1 0
Dim.5 0 0 0 0 1
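The same property can be checked with base R's prcomp(), used here in place of FactoMineR's PCA() so that the sketch has no package dependencies. The mtcars columns are a stand-in for the decathlon variables, and fit, scores and score_cor are illustrative names.

```r
# Stand-in data (assumption): five columns of the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# Base R PCA on centered, standardized variables
fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Coordinates of the observations on the principal components
scores <- fit$x

# Correlations among the PC coordinates: identity matrix up to rounding
score_cor <- round(cor(scores), 8)
print(score_cor)

# The variance of each PC equals its eigenvalue
print(round(apply(scores, 2, var), 4))
print(round(fit$sdev^2, 4))
```

The original variables are correlated, but the new coordinates are not, so no information is counted twice across PCs.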
pca$var$coord: coordinates of the variables
pca$var$coord
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
X100m -0.818952 0.342779 0.1008645 0.101342 -0.21981
Long.jump 0.758899 -0.381493 -0.0062613 -0.185424 0.26371
Shot.put 0.715078 0.282117 0.4738546 0.036104 -0.27864
High.jump 0.608493 0.611354 0.0046060 0.071244 0.30059
X400m -0.643848 0.148422 0.5157594 0.269785 0.19924
X110m.hurdle -0.716420 0.297552 0.4164510 -0.159781 0.16102
Discus 0.716888 0.204398 0.2703222 0.397623 -0.33949
Pole.vault -0.221417 -0.737548 0.4030836 -0.251549 -0.26259
Javeline 0.355176 0.098531 0.6954337 -0.485559 0.13342
X1500m 0.069712 -0.568120 0.3527578 0.652461 0.25368
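One way to read these coordinates: when the variables are standardized, the coordinate of a variable on a PC is its correlation with that PC, which also equals the loading multiplied by the square root of the eigenvalue. The sketch below checks this with base R's prcomp() on mtcars, both of which are stand-ins (assumptions) for PCA() and decathlon2; var_coord and var_cor are illustrative names.

```r
# Stand-in data (assumption): five columns of the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Variable coordinates: loading of each variable times sqrt(eigenvalue) of each PC
var_coord <- sweep(fit$rotation, 2, fit$sdev, "*")

# Correlation between every original variable and every PC score
var_cor <- cor(X, fit$x)

print(round(var_coord, 4))
print(round(var_cor, 4))  # same numbers as var_coord
```

Because they are correlations, all coordinates lie between -1 and 1, which is why the variables fall inside the unit circle in the variable plot.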
pca$var$cos2: the fraction of each variable's information carried by each PC
pca$var$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
X100m 0.6706825 0.1174973 0.010173655 0.0102702 0.048315
Long.jump 0.5759270 0.1455370 0.000039203 0.0343821 0.069545
Shot.put 0.5113370 0.0795898 0.224538173 0.0013035 0.077639
High.jump 0.3702640 0.3737540 0.000021215 0.0050756 0.090352
X400m 0.4145404 0.0220292 0.266007740 0.0727841 0.039696
X110m.hurdle 0.5132580 0.0885371 0.173431450 0.0255299 0.025928
Discus 0.5139286 0.0417785 0.073074093 0.1581041 0.115255
Pole.vault 0.0490256 0.5439768 0.162476407 0.0632771 0.068955
Javeline 0.1261497 0.0097083 0.483627983 0.2357675 0.017802
X1500m 0.0048598 0.3227600 0.124438033 0.4257058 0.064353
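As a hedged sketch of how cos2 relates to the coordinates: for standardized variables, cos2 is the squared coordinate, each row sums to 1 when all PCs are kept, and each column sums to the eigenvalue of that PC. The check below uses prcomp() and mtcars as stand-ins (assumptions); var_cos2, row_total and col_total are illustrative names.

```r
# Stand-in data (assumption): five columns of the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Variable coordinates, then cos2 as their squares
var_coord <- sweep(fit$rotation, 2, fit$sdev, "*")
var_cos2  <- var_coord^2

row_total <- rowSums(var_cos2)  # 1 per variable when all PCs are kept
col_total <- colSums(var_cos2)  # eigenvalue of each PC

print(round(var_cos2, 4))
print(round(row_total, 4))
print(round(col_total, 4))
```

In the table above the rows sum to less than 1 because only the first 5 of the 10 PCs are kept, while the Dim.1 column sums to its eigenvalue, about 3.75.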
fviz_pca_var: plot the variables in the PC space
fviz_pca_var(pca)
pca$ind$coord: coordinates of each observation
pca$ind$coord
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.277958 -0.536434 1.585239 0.105823 1.074623
CLAY 0.904854 -2.094280 0.840685 1.850718 -0.408645
BERNARD -1.372266 -1.348116 0.961932 -1.493072 -0.182667
YURKOV -0.928205 2.281744 1.942688 0.096823 0.190927
ZSIVOCZKY -0.103817 1.089822 -2.098908 0.071906 -0.032938
McMULLEN 0.239858 0.939092 -0.818136 1.201893 1.830199
MARTINEAU -2.537291 1.801094 0.051975 0.374306 -2.285411
HERNU -1.902843 -0.330277 1.288682 0.766505 0.239465
BARRAS -1.805625 0.302590 -0.592810 0.656526 -0.244039
NOOL -2.881737 0.863854 -1.402448 -1.491195 1.358726
BOURGUIGNON -4.505530 -0.485422 1.202704 0.951363 0.508865
Sebrle 3.567756 0.068007 1.911216 -1.042363 -0.300596
Clay 3.472177 -0.705599 1.607029 -0.696108 0.741815
Karpov 4.328761 0.160789 -1.152529 0.407689 -0.772485
Macey 1.944475 2.523948 -0.260304 -0.079809 -0.025024
Warners 1.552082 -1.488634 -1.414196 -0.549665 0.102190
Zsivoczky 0.475153 1.971763 0.900183 -0.725288 0.171696
Hernu 0.280841 0.822696 -0.905794 -0.782389 -0.771389
Bernard 1.533280 1.085832 -1.245717 0.534722 1.042879
Schwarzl -0.677974 -1.134257 -0.422180 -0.609851 -0.100686
Pogorelov -0.077879 -0.333658 0.607951 1.446999 0.204623
Schoenbeck -0.487405 -0.860688 0.866712 -0.173226 -0.432985
Barras -0.413081 1.366893 0.227296 -0.753324 -0.861778
KARPOV 0.967748 -0.995599 -0.476019 2.501069 -0.661349
WARNERS -0.280043 -0.912158 -1.416982 -0.147161 0.036162
Nool -0.535389 -2.135638 0.616053 -1.808079 -0.319954
Drews -1.035856 -1.917364 -2.404320 -0.614812 -0.102226
pca$ind$cos2: the fraction of each observation's information carried by each PC
pca$ind$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.0154465 0.05753138 0.50241310 0.00223886 0.230878821
CLAY 0.0655742 0.35127414 0.05660346 0.27431969 0.013374224
BERNARD 0.2322366 0.22413434 0.11411498 0.27492584 0.004115041
YURKOV 0.0848122 0.51251263 0.37151534 0.00092284 0.003588447
ZSIVOCZKY 0.0015087 0.16625497 0.61666612 0.00072377 0.000151861
McMULLEN 0.0082682 0.12674108 0.09619490 0.20760230 0.481390643
MARTINEAU 0.4048095 0.20397763 0.00016986 0.00880975 0.328426736
HERNU 0.3626087 0.01092418 0.16631227 0.05883861 0.005742729
BARRAS 0.6179591 0.01735462 0.06660942 0.08169748 0.011288151
NOOL 0.5144396 0.04622816 0.12184266 0.13775107 0.114364103
BOURGUIGNON 0.8414496 0.00976733 0.05995894 0.03751710 0.010733484
Sebrle 0.6742810 0.00024499 0.19349512 0.05755578 0.004786484
Clay 0.6864011 0.02834588 0.14703531 0.02758845 0.031330374
Karpov 0.8267406 0.00114065 0.05860654 0.00733332 0.026328251
Macey 0.3497653 0.58929505 0.00626806 0.00058922 0.000057928
Warners 0.3365404 0.30958768 0.27940052 0.04220890 0.001458908
Zsivoczky 0.0354688 0.61078646 0.12730376 0.08264196 0.004631289
Hernu 0.0247329 0.21224209 0.25728350 0.19195436 0.186594955
Bernard 0.3368420 0.16893075 0.22234247 0.04096756 0.155830032
Schwarzl 0.1319041 0.36919405 0.05114794 0.10672826 0.002909200
Pogorelov 0.0010328 0.01895803 0.06294013 0.35655465 0.007130132
Schoenbeck 0.0643656 0.20070808 0.20352752 0.00813015 0.050794887
Barras 0.0272368 0.29823283 0.00824647 0.09058360 0.118543191
KARPOV 0.0868773 0.09194984 0.02101993 0.58027427 0.040573584
WARNERS 0.0166843 0.17700968 0.42715467 0.00460727 0.000278197
Nool 0.0312477 0.49720414 0.04137291 0.35638092 0.011159741
Drews 0.0963567 0.33013525 0.51911955 0.03394434 0.000938440
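For the observations, cos2 can be sketched as the squared coordinate on a PC divided by the observation's squared distance from the origin in the standardized space. The block below uses prcomp() and mtcars as stand-ins (assumptions) for PCA() and decathlon2; dist2 and ind_cos2 are illustrative names.

```r
# Stand-in data (assumption): five columns of the built-in mtcars data set
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
fit <- prcomp(X, center = TRUE, scale. = TRUE)

scores <- fit$x                 # coordinates of the observations on all PCs
dist2  <- rowSums(scale(X)^2)   # squared distance of each observation from the origin

# Each row of scores^2 is divided by that observation's own squared distance
ind_cos2 <- scores^2 / dist2

print(round(head(ind_cos2), 4))
print(round(head(rowSums(ind_cos2)), 4))  # 1 when all PCs are kept
```

An observation with a high cos2 on the plotted dimensions is well represented in that plot; one with a low cos2 lies mostly in the dimensions that are not shown.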
fviz_pca_ind: plot the observations in the PC space
fviz_pca_ind(pca)
fviz_pca_biplot: plot the variables and the observations in the same chart
fviz_pca_biplot(
  pca, pointsize="cos2", repel=T, labelsize=3,
  col.var="red", col.ind="#E7B800", alpha.ind=0.3)
Hierarchical Clustering
Kmeans Clustering
set.seed(444) # fix the random seed so that the result is reproducible
kmg = kmeans(scale(D[,1:10]), 5)$cluster %>% factor
table(kmg)
kmg
1 2 3 4 5
7 6 6 3 5
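The seed matters because kmeans() starts from randomly chosen centers. The sketch below illustrates this on mtcars, a stand-in (assumption) for the decathlon data; the choice of 3 clusters, the nstart = 25 restarts, and the names Z, g1, g2 and km_group are illustrative and not part of the original analysis.

```r
# Stand-in data (assumption): standardized columns of the built-in mtcars data set
Z <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Same seed, same call: the partition is reproduced exactly
set.seed(444)
g1 <- kmeans(Z, centers = 3, nstart = 25)$cluster
set.seed(444)
g2 <- kmeans(Z, centers = 3, nstart = 25)$cluster
print(identical(g1, g2))

# Store the groups as a factor so they can be used as a color variable in a plot
km_group <- factor(g1)
print(table(km_group))
```

Using several random starts (nstart) is one common way to make the result less sensitive to a single unlucky initialization.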
fviz_pca_biplot(
  pca, repel=T, col.var="black", labelsize=3, col.ind=kmg,
  alpha.ind=0.6, pointshape=16, mean.point=F,
  addEllipses = T, ellipse.type="convex") +
  theme(legend.position = "none")
💡 KEY LEARNINGS We can see three essential pieces of information in a PCA biplot …
■ The correlation among the variables
■ The similarity among the subjects
■ The relationship between the observations and the variables
fviz_pca_biplot(
  pca, axes=c(1,3), # plot the Dim1-Dim3 plane
  repel=T, col.var="black", labelsize=3, col.ind=kmg,
  alpha.ind=0.6, pointshape=16, mean.point=F,
  addEllipses = T, ellipse.type="convex") +
  theme(legend.position = "none")