💡 Key learning points:
■ Basic concepts of dimension reduction
§ Variance as Information
§ Variance (Orthogonal) Decomposition
§ Principal Components
§ Eigenvalue & Variance Decomposition
■ The PCA tool: Principal Component Analysis Tool (FactoMineR)
§ Coordinates of the data points and the dimensions
§ Representativeness of the data points and the dimensions (cos2, Level of Representation)
■ Visualizing a PCA: PCA Visualization
§ Visualization Tool (factoextra)
■ Applications of PCA
■ Combining PCA with cluster analysis
pacman::p_load(dplyr, FactoMineR, factoextra)
D = decathlon2
head(D)
X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75
CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72
BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87
YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26
ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67
McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41
Pole.vault Javeline X1500m Rank Points Competition
SEBRLE 5.02 63.19 291.7 1 8217 Decastar
CLAY 4.92 60.15 301.5 2 8122 Decastar
BERNARD 5.32 62.77 280.1 4 8067 Decastar
YURKOV 4.72 63.44 276.4 5 8036 Decastar
ZSIVOCZKY 4.42 55.37 268.0 7 8004 Decastar
McMULLEN 4.42 56.37 285.1 8 7995 Decastar
The FactoMineR package provides an enhanced PCA() function; the default parameters usually suffice:
pca = PCA(D[,1:10])
Contents of the pca object
PCA() returns a PCA object, which we name pca. pca is a list that contains many sub-objects:
pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 27 individuals, described by 10 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. for the variables"
4 "$var$cor" "correlations variables - dimensions"
5 "$var$cos2" "cos2 for the variables"
6 "$var$contrib" "contributions of the variables"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "summary statistics"
12 "$call$centre" "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w" "weights for the individuals"
15 "$call$col.w" "weights for the variables"
pca$eig: the amount of information carried by each principal component
get_eigenvalue(pca)
eigenvalue variance.percent cumulative.variance.percent
Dim.1 3.74997 37.4997 37.500
Dim.2 1.74517 17.4517 54.951
Dim.3 1.51783 15.1783 70.130
Dim.4 1.03220 10.3220 80.452
Dim.5 0.61784 6.1784 86.630
Dim.6 0.42829 4.2829 90.913
Dim.7 0.32591 3.2591 94.172
Dim.8 0.27938 2.7938 96.966
Dim.9 0.19111 1.9111 98.877
Dim.10 0.11230 1.1230 100.000
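Rather than reading this table, we can draw the eigenvalues as a scree plot; a minimal sketch using factoextra's fviz_eig() (already loaded above):
fviz_eig(pca, addlabels = TRUE)   # bar heights = variance.percent of each component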
💡 Key learning points:
Basic concepts of PCA:
§ PCA finds a set of orthogonal dimensions (the principal components) within the variable space
§ Variance can represent the information carried by a variable
§ \(Var(A+B) = Var(A) + Var(B) + 2\,Cov(A,B)\)
§ The original dimensions are usually correlated (\(Cov(A,B) \neq 0\)), so the information they carry partly overlaps
§ The principal components are uncorrelated (\(Cov(A,B) = 0\)), so the information they carry does not overlap (verified numerically in the sketch after this list)
§ The plane spanned by the first two principal components is the plane with the largest information content
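Both covariance claims are easy to verify on the data we just analyzed (a sketch; the pair X100m/Long.jump is an arbitrary choice):
# the variance-decomposition identity holds exactly:
var(D$X100m + D$Long.jump)
var(D$X100m) + var(D$Long.jump) + 2*cov(D$X100m, D$Long.jump)
# original variables are correlated, principal components are not:
round(cor(D$X100m, D$Long.jump), 2)
round(cor(pca$ind$coord), 2)   # (near-)identity matrix: uncorrelated scores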
pca$var$coord: the coordinates of each variable on each dimension
pca$var$coord
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
X100m -0.818952 0.342779 0.1008645 0.101342 -0.21981
Long.jump 0.758899 -0.381493 -0.0062613 -0.185424 0.26371
Shot.put 0.715078 0.282117 0.4738546 0.036104 -0.27864
High.jump 0.608493 0.611354 0.0046060 0.071244 0.30059
X400m -0.643848 0.148422 0.5157594 0.269785 0.19924
X110m.hurdle -0.716420 0.297552 0.4164510 -0.159781 0.16102
Discus 0.716888 0.204398 0.2703222 0.397623 -0.33949
Pole.vault -0.221417 -0.737548 0.4030836 -0.251549 -0.26259
Javeline 0.355176 0.098531 0.6954337 -0.485559 0.13342
X1500m 0.069712 -0.568120 0.3527578 0.652461 0.25368
pca$var$cos2: the share of each variable's information represented on each dimension
pca$var$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
X100m 0.6706825 0.1174973 0.010173655 0.0102702 0.048315
Long.jump 0.5759270 0.1455370 0.000039203 0.0343821 0.069545
Shot.put 0.5113370 0.0795898 0.224538173 0.0013035 0.077639
High.jump 0.3702640 0.3737540 0.000021215 0.0050756 0.090352
X400m 0.4145404 0.0220292 0.266007740 0.0727841 0.039696
X110m.hurdle 0.5132580 0.0885371 0.173431450 0.0255299 0.025928
Discus 0.5139286 0.0417785 0.073074093 0.1581041 0.115255
Pole.vault 0.0490256 0.5439768 0.162476407 0.0632771 0.068955
Javeline 0.1261497 0.0097083 0.483627983 0.2357675 0.017802
X1500m 0.0048598 0.3227600 0.124438033 0.4257058 0.064353
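Since a variable's coordinates are its correlations with the components, cos2 is simply the squared coordinate, and each row sums to the share of that variable's variance captured by the stored components; a quick check (a sketch):
max(abs(pca$var$coord^2 - pca$var$cos2))   # ~ 0: cos2 = coord^2 for variables
rowSums(pca$var$cos2)   # share captured by the 5 components kept by default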
fviz_pca_var(pca)
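Coloring the variable arrows by cos2 shows at a glance how well each variable is represented on the plotted plane (a sketch; the gradient colors are an arbitrary choice):
fviz_pca_var(pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)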
pca$ind$coord: the coordinates of each individual on each dimension
pca$ind$coord
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.277958 -0.536434 1.585239 0.105823 1.074623
CLAY 0.904854 -2.094280 0.840685 1.850718 -0.408645
BERNARD -1.372266 -1.348116 0.961932 -1.493072 -0.182667
YURKOV -0.928205 2.281744 1.942688 0.096823 0.190927
ZSIVOCZKY -0.103817 1.089822 -2.098908 0.071906 -0.032938
McMULLEN 0.239858 0.939092 -0.818136 1.201893 1.830199
MARTINEAU -2.537291 1.801094 0.051975 0.374306 -2.285411
HERNU -1.902843 -0.330277 1.288682 0.766505 0.239465
BARRAS -1.805625 0.302590 -0.592810 0.656526 -0.244039
NOOL -2.881737 0.863854 -1.402448 -1.491195 1.358726
BOURGUIGNON -4.505530 -0.485422 1.202704 0.951363 0.508865
Sebrle 3.567756 0.068007 1.911216 -1.042363 -0.300596
Clay 3.472177 -0.705599 1.607029 -0.696108 0.741815
Karpov 4.328761 0.160789 -1.152529 0.407689 -0.772485
Macey 1.944475 2.523948 -0.260304 -0.079809 -0.025024
Warners 1.552082 -1.488634 -1.414196 -0.549665 0.102190
Zsivoczky 0.475153 1.971763 0.900183 -0.725288 0.171696
Hernu 0.280841 0.822696 -0.905794 -0.782389 -0.771389
Bernard 1.533280 1.085832 -1.245717 0.534722 1.042879
Schwarzl -0.677974 -1.134257 -0.422180 -0.609851 -0.100686
Pogorelov -0.077879 -0.333658 0.607951 1.446999 0.204623
Schoenbeck -0.487405 -0.860688 0.866712 -0.173226 -0.432985
Barras -0.413081 1.366893 0.227296 -0.753324 -0.861778
KARPOV 0.967748 -0.995599 -0.476019 2.501069 -0.661349
WARNERS -0.280043 -0.912158 -1.416982 -0.147161 0.036162
Nool -0.535389 -2.135638 0.616053 -1.808079 -0.319954
Drews -1.035856 -1.917364 -2.404320 -0.614812 -0.102226
pca$ind$cos2: the share of each individual's information represented on each dimension
pca$ind$cos2
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
SEBRLE 0.0154465 0.05753138 0.50241310 0.00223886 0.230878821
CLAY 0.0655742 0.35127414 0.05660346 0.27431969 0.013374224
BERNARD 0.2322366 0.22413434 0.11411498 0.27492584 0.004115041
YURKOV 0.0848122 0.51251263 0.37151534 0.00092284 0.003588447
ZSIVOCZKY 0.0015087 0.16625497 0.61666612 0.00072377 0.000151861
McMULLEN 0.0082682 0.12674108 0.09619490 0.20760230 0.481390643
MARTINEAU 0.4048095 0.20397763 0.00016986 0.00880975 0.328426736
HERNU 0.3626087 0.01092418 0.16631227 0.05883861 0.005742729
BARRAS 0.6179591 0.01735462 0.06660942 0.08169748 0.011288151
NOOL 0.5144396 0.04622816 0.12184266 0.13775107 0.114364103
BOURGUIGNON 0.8414496 0.00976733 0.05995894 0.03751710 0.010733484
Sebrle 0.6742810 0.00024499 0.19349512 0.05755578 0.004786484
Clay 0.6864011 0.02834588 0.14703531 0.02758845 0.031330374
Karpov 0.8267406 0.00114065 0.05860654 0.00733332 0.026328251
Macey 0.3497653 0.58929505 0.00626806 0.00058922 0.000057928
Warners 0.3365404 0.30958768 0.27940052 0.04220890 0.001458908
Zsivoczky 0.0354688 0.61078646 0.12730376 0.08264196 0.004631289
Hernu 0.0247329 0.21224209 0.25728350 0.19195436 0.186594955
Bernard 0.3368420 0.16893075 0.22234247 0.04096756 0.155830032
Schwarzl 0.1319041 0.36919405 0.05114794 0.10672826 0.002909200
Pogorelov 0.0010328 0.01895803 0.06294013 0.35655465 0.007130132
Schoenbeck 0.0643656 0.20070808 0.20352752 0.00813015 0.050794887
Barras 0.0272368 0.29823283 0.00824647 0.09058360 0.118543191
KARPOV 0.0868773 0.09194984 0.02101993 0.58027427 0.040573584
WARNERS 0.0166843 0.17700968 0.42715467 0.00460727 0.000278197
Nool 0.0312477 0.49720414 0.04137291 0.35638092 0.011159741
Drews 0.0963567 0.33013525 0.51911955 0.03394434 0.000938440
fviz_pca_ind(pca)
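The same cos2 coloring works for individuals (a sketch):
fviz_pca_ind(pca, col.ind = "cos2", repel = TRUE)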
fviz_pca_biplot(
  pca, col.var="red", col.ind="#E7B800", alpha.ind=0.3,
  pointsize="cos2", repel=T, labelsize=3)
set.seed(444)
kmg = kmeans(scale(D[,1:10]), 5)$cluster %>% factor
table(kmg)
kmg
1 2 3 4 5
7 6 6 3 5
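The number of clusters (5) is preset by us; a common sanity check is an elbow plot of the within-cluster sum of squares, a sketch using factoextra's fviz_nbclust():
fviz_nbclust(scale(D[,1:10]), kmeans, method = "wss")   # look for the "elbow"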
💡 Key learning points:
■ Hierarchical Clustering (see the sketch after this list)
§ Requires computing a distance matrix first
§ The number of data points cannot be too large
§ No need to preset the number of clusters
§ Produces the same clustering result every time
■ KMeans Clustering
§ Does not require a distance matrix
§ Far less restricted by the number of data points
§ Requires presetting the number of clusters
§ The clustering result may differ from run to run
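For comparison, a minimal hierarchical-clustering sketch on the same scaled data (the ward.D2 linkage is our assumption, not prescribed above):
hc = hclust(dist(scale(D[,1:10])), method = "ward.D2")   # needs a distance matrix
plot(hc)                    # dendrogram: no preset number of clusters required
hcg = cutree(hc, k = 5)     # cut into 5 groups after the fact
table(hcg, kmg)             # cross-tabulate against the kmeans grouping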
fviz_pca_biplot(
  pca, col.ind=kmg, alpha.ind=0.6, pointshape=16,
  repel=T, col.var="black", labelsize=3,
  addEllipses = T, ellipse.level = 0.65,
  mean.point = FALSE) +
  theme(legend.position = "none")
Too few points to calculate an ellipse
💡 Within the principal component space, we can observe simultaneously:
■ the correlations among the variables
■ the similarities among the individuals
■ the characteristics of each individual (or cluster)
fviz_pca_biplot(
  pca, axes=c(1,3),
  repel=T, col.var="black", labelsize=3,
  col.ind=kmg, alpha.ind=0.6, pointshape=16,
  addEllipses = T, ellipse.level = 0.65,
  mean.point = FALSE) +
  theme(legend.position = "none")
Too few points to calculate an ellipse
The FactoMineR and factoextra packages are remarkably powerful. Beyond continuous variables, they can also run principal component analysis on categorical and even mixed variables. Their plotting functions are equally flexible: besides the variables and individuals in the analysis itself, supplementary continuous or categorical variables outside the analysis, and even new data points not in the original data, can all be projected into the principal component space.
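For instance, decathlon2's last three columns can ride along as supplementary elements without influencing the fit (a sketch following the package documentation's quanti.sup/quali.sup usage):
# Rank and Points as supplementary quantitative variables,
# Competition as a supplementary qualitative variable:
pca2 = PCA(D, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)
fviz_pca_var(pca2, repel = TRUE)   # supplementary variables drawn in a distinct color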