💡 Key Points:
  ■ Basic concepts of dimension reduction
    § Variance as Information
    § Variance (Orthogonal) Decomposition
    § Principal Components
    § Eigenvalue & Variance Decomposition
  ■ Principal Component Analysis Tool (FactoMineR)
    § Coordinates of the data points and the scales
    § Representativeness of the data points and the scales: cos2, Level of Representation
  ■ PCA Visualization
    § Visualization Tool (factoextra)
  ■ Applications of PCA
  ■ Combining PCA with Cluster Analysis


pacman::p_load(dplyr, FactoMineR, factoextra)
§ The decathlon dataset
D = decathlon2
head(D)
          X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
          Pole.vault Javeline X1500m Rank Points Competition
SEBRLE          5.02    63.19  291.7    1   8217    Decastar
CLAY            4.92    60.15  301.5    2   8122    Decastar
BERNARD         5.32    62.77  280.1    4   8067    Decastar
YURKOV          4.72    63.44  276.4    5   8036    Decastar
ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
McMULLEN        4.42    56.37  285.1    8   7995    Decastar


【A】Principal Component Analysis

pca = PCA(D[,1:10])   # standardized PCA on the ten event scores (scale.unit=TRUE by default)

§ Contents of the pca object
  • PCA() returns a PCA object, which we call pca
  • pca is a list containing many sub-objects
pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 27 individuals, described by 10 variables
*The results are available in the following objects:

   name               description                          
1  "$eig"             "eigenvalues"                        
2  "$var"             "results for the variables"          
3  "$var$coord"       "coord. for the variables"           
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"             
6  "$var$contrib"     "contributions of the variables"     
7  "$ind"             "results for the individuals"        
8  "$ind$coord"       "coord. for the individuals"         
9  "$ind$cos2"        "cos2 for the individuals"           
10 "$ind$contrib"     "contributions of the individuals"   
11 "$call"            "summary statistics"                 
12 "$call$centre"     "mean of the variables"              
13 "$call$ecart.type" "standard error of the variables"    
14 "$call$row.w"      "weights for the individuals"        
15 "$call$col.w"      "weights for the variables"          


§ pca$eig: the "information content" of each principal component
  • A PCA on 10 variables produces 10 principal components (mutually orthogonal scales)
  • Each eigenvalue is the amount of information (variance) carried by that principal component
  • The first principal component has the largest eigenvalue, and the eigenvalues decrease from there
  • The eigenvalues sum exactly to the number of variables (verified in the check below the table)
get_eigenvalue(pca)
       eigenvalue variance.percent cumulative.variance.percent
Dim.1     3.74997          37.4997                      37.500
Dim.2     1.74517          17.4517                      54.951
Dim.3     1.51783          15.1783                      70.130
Dim.4     1.03220          10.3220                      80.452
Dim.5     0.61784           6.1784                      86.630
Dim.6     0.42829           4.2829                      90.913
Dim.7     0.32591           3.2591                      94.172
Dim.8     0.27938           2.7938                      96.966
Dim.9     0.19111           1.9111                      98.877
Dim.10    0.11230           1.1230                     100.000
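
As a quick check of the points above, the eigenvalues can be read straight out of the pca object; a minimal sketch, assuming the pca object fitted earlier:

# the eigenvalues sum to the number of (standardized) variables
sum(pca$eig[, 1])       # 10
# information carried by the first two components together
sum(pca$eig[1:2, 2])    # about 55%, matching the cumulative column above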

💡 Key Points:
Basic concepts of principal component analysis:
  § PCA searches the variable space for a set of orthogonal scales (the principal components)
  § Variance can stand in for the information carried by a variable
  § \(Var(A+B) = Var(A) + Var(B) + 2\,Cov(A,B)\) (checked numerically below)
  § The original scales are usually correlated (\(Cov(A,B) \neq 0\)), so the information they carry partially overlaps
  § The principal components are uncorrelated (\(Cov(A,B) = 0\)), so the information they carry does not overlap
  § The plane spanned by the first two principal components is the plane with the largest information content
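
A minimal numerical sketch of the two claims above, using two event scores from D (any pair would do):

# Var(A+B) = Var(A) + Var(B) + 2*Cov(A,B) holds exactly
A = D$Long.jump; B = D$Shot.put
var(A + B)                        # left-hand side
var(A) + var(B) + 2*cov(A, B)     # right-hand side, identical
# the principal component scores are uncorrelated by construction
round(cor(pca$ind$coord), 3)      # (near-)identity correlation matrix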



【B】Variables in the Reduced Space

§ pca$var$coord: coordinates of each variable on each scale
pca$var$coord
                 Dim.1     Dim.2      Dim.3     Dim.4    Dim.5
X100m        -0.818952  0.342779  0.1008645  0.101342 -0.21981
Long.jump     0.758899 -0.381493 -0.0062613 -0.185424  0.26371
Shot.put      0.715078  0.282117  0.4738546  0.036104 -0.27864
High.jump     0.608493  0.611354  0.0046060  0.071244  0.30059
X400m        -0.643848  0.148422  0.5157594  0.269785  0.19924
X110m.hurdle -0.716420  0.297552  0.4164510 -0.159781  0.16102
Discus        0.716888  0.204398  0.2703222  0.397623 -0.33949
Pole.vault   -0.221417 -0.737548  0.4030836 -0.251549 -0.26259
Javeline      0.355176  0.098531  0.6954337 -0.485559  0.13342
X1500m        0.069712 -0.568120  0.3527578  0.652461  0.25368
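For a standardized PCA (the default, scale.unit=TRUE), a variable's coordinate on a component is its correlation with that component's scores; a minimal check, assuming the objects above:

# variable coordinates = correlations between variables and component scores
round(cor(D[,1:10], pca$ind$coord), 4)   # matches pca$var$coord (and pca$var$cor)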
§ pca$var$cos2: the proportion of each variable's information shown on each scale
pca$var$cos2
                 Dim.1     Dim.2       Dim.3     Dim.4    Dim.5
X100m        0.6706825 0.1174973 0.010173655 0.0102702 0.048315
Long.jump    0.5759270 0.1455370 0.000039203 0.0343821 0.069545
Shot.put     0.5113370 0.0795898 0.224538173 0.0013035 0.077639
High.jump    0.3702640 0.3737540 0.000021215 0.0050756 0.090352
X400m        0.4145404 0.0220292 0.266007740 0.0727841 0.039696
X110m.hurdle 0.5132580 0.0885371 0.173431450 0.0255299 0.025928
Discus       0.5139286 0.0417785 0.073074093 0.1581041 0.115255
Pole.vault   0.0490256 0.5439768 0.162476407 0.0632771 0.068955
Javeline     0.1261497 0.0097083 0.483627983 0.2357675 0.017802
X1500m       0.0048598 0.3227600 0.124438033 0.4257058 0.064353
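cos2 is simply the squared coordinate, and across all 10 components each variable's cos2 values sum to 1 (the default keeps only ncp=5 components, so the check below refits with all 10). A minimal sketch:

# cos2 = coord^2
max(abs(pca$var$coord^2 - pca$var$cos2))              # essentially 0
# keeping all 10 components, each variable's cos2 sums to 1
rowSums(PCA(D[,1:10], ncp=10, graph=FALSE)$var$cos2)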
§ Project the variables onto the principal component space
fviz_pca_var(pca)



【C】Individuals in the Reduced Space

§ pca$ind$coord: coordinates of each individual on each scale
pca$ind$coord
                Dim.1     Dim.2     Dim.3     Dim.4     Dim.5
SEBRLE       0.277958 -0.536434  1.585239  0.105823  1.074623
CLAY         0.904854 -2.094280  0.840685  1.850718 -0.408645
BERNARD     -1.372266 -1.348116  0.961932 -1.493072 -0.182667
YURKOV      -0.928205  2.281744  1.942688  0.096823  0.190927
ZSIVOCZKY   -0.103817  1.089822 -2.098908  0.071906 -0.032938
McMULLEN     0.239858  0.939092 -0.818136  1.201893  1.830199
MARTINEAU   -2.537291  1.801094  0.051975  0.374306 -2.285411
HERNU       -1.902843 -0.330277  1.288682  0.766505  0.239465
BARRAS      -1.805625  0.302590 -0.592810  0.656526 -0.244039
NOOL        -2.881737  0.863854 -1.402448 -1.491195  1.358726
BOURGUIGNON -4.505530 -0.485422  1.202704  0.951363  0.508865
Sebrle       3.567756  0.068007  1.911216 -1.042363 -0.300596
Clay         3.472177 -0.705599  1.607029 -0.696108  0.741815
Karpov       4.328761  0.160789 -1.152529  0.407689 -0.772485
Macey        1.944475  2.523948 -0.260304 -0.079809 -0.025024
Warners      1.552082 -1.488634 -1.414196 -0.549665  0.102190
Zsivoczky    0.475153  1.971763  0.900183 -0.725288  0.171696
Hernu        0.280841  0.822696 -0.905794 -0.782389 -0.771389
Bernard      1.533280  1.085832 -1.245717  0.534722  1.042879
Schwarzl    -0.677974 -1.134257 -0.422180 -0.609851 -0.100686
Pogorelov   -0.077879 -0.333658  0.607951  1.446999  0.204623
Schoenbeck  -0.487405 -0.860688  0.866712 -0.173226 -0.432985
Barras      -0.413081  1.366893  0.227296 -0.753324 -0.861778
KARPOV       0.967748 -0.995599 -0.476019  2.501069 -0.661349
WARNERS     -0.280043 -0.912158 -1.416982 -0.147161  0.036162
Nool        -0.535389 -2.135638  0.616053 -1.808079 -0.319954
Drews       -1.035856 -1.917364 -2.404320 -0.614812 -0.102226
§ pca$ind$cos2: the proportion of each individual's information shown on each scale
pca$ind$cos2
                Dim.1      Dim.2      Dim.3      Dim.4       Dim.5
SEBRLE      0.0154465 0.05753138 0.50241310 0.00223886 0.230878821
CLAY        0.0655742 0.35127414 0.05660346 0.27431969 0.013374224
BERNARD     0.2322366 0.22413434 0.11411498 0.27492584 0.004115041
YURKOV      0.0848122 0.51251263 0.37151534 0.00092284 0.003588447
ZSIVOCZKY   0.0015087 0.16625497 0.61666612 0.00072377 0.000151861
McMULLEN    0.0082682 0.12674108 0.09619490 0.20760230 0.481390643
MARTINEAU   0.4048095 0.20397763 0.00016986 0.00880975 0.328426736
HERNU       0.3626087 0.01092418 0.16631227 0.05883861 0.005742729
BARRAS      0.6179591 0.01735462 0.06660942 0.08169748 0.011288151
NOOL        0.5144396 0.04622816 0.12184266 0.13775107 0.114364103
BOURGUIGNON 0.8414496 0.00976733 0.05995894 0.03751710 0.010733484
Sebrle      0.6742810 0.00024499 0.19349512 0.05755578 0.004786484
Clay        0.6864011 0.02834588 0.14703531 0.02758845 0.031330374
Karpov      0.8267406 0.00114065 0.05860654 0.00733332 0.026328251
Macey       0.3497653 0.58929505 0.00626806 0.00058922 0.000057928
Warners     0.3365404 0.30958768 0.27940052 0.04220890 0.001458908
Zsivoczky   0.0354688 0.61078646 0.12730376 0.08264196 0.004631289
Hernu       0.0247329 0.21224209 0.25728350 0.19195436 0.186594955
Bernard     0.3368420 0.16893075 0.22234247 0.04096756 0.155830032
Schwarzl    0.1319041 0.36919405 0.05114794 0.10672826 0.002909200
Pogorelov   0.0010328 0.01895803 0.06294013 0.35655465 0.007130132
Schoenbeck  0.0643656 0.20070808 0.20352752 0.00813015 0.050794887
Barras      0.0272368 0.29823283 0.00824647 0.09058360 0.118543191
KARPOV      0.0868773 0.09194984 0.02101993 0.58027427 0.040573584
WARNERS     0.0166843 0.17700968 0.42715467 0.00460727 0.000278197
Nool        0.0312477 0.49720414 0.04137291 0.35638092 0.011159741
Drews       0.0963567 0.33013525 0.51911955 0.03394434 0.000938440
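For individuals, cos2 is the squared coordinate divided by the individual's squared distance from the centre of the cloud (returned as pca$ind$dist in FactoMineR); a minimal check:

# individual cos2 = coord^2 / squared distance to the centre
round(pca$ind$coord^2 / pca$ind$dist^2, 4)   # matches pca$ind$cos2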
§ Project the individuals onto the principal component space
fviz_pca_ind(pca)



【D】Projecting Individuals and Variables Together (Biplot)

§ Project the individuals and the variables onto the principal component space
fviz_pca_biplot(
  pca, pointsize="cos2", repel=T, labelsize=3,
  col.var="red", col.ind="#E7B800", alpha.ind=0.3)

【E】Clustering the Individuals

§ K-means cluster analysis
set.seed(444)   # kmeans starts from random centres; fix the seed for reproducibility
kmg = kmeans(scale(D[,1:10]), 5)$cluster %>% factor   # cluster the standardized scores into 5 groups
table(kmg)
kmg
1 2 3 4 5 
7 6 6 3 5 
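
The choice of 5 clusters here is a judgment call; factoextra also offers a quick diagnostic for picking the number of clusters, for example the within-group sum-of-squares ("elbow") plot. A minimal sketch:

# elbow plot: look for the k where the curve stops dropping sharply
fviz_nbclust(scale(D[,1:10]), kmeans, method="wss")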

💡 Key Points:
  ■ Hierarchical Clustering
    § Requires a distance matrix to be computed first
    § The number of data points cannot be too large
    § Does not require the number of clusters to be set in advance
    § Produces the same clustering result every run (see the sketch below)
  ■ KMeans Clustering
    § Does not require a distance matrix
    § Far less restricted by the number of data points
    § Requires the number of clusters to be set in advance
    § The clustering result may differ from run to run
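
A minimal hierarchical-clustering sketch for comparison, assuming the same standardized data (the linkage method and k=5 are illustrative choices):

# deterministic: no seed needed; cutree() picks the number of clusters afterwards
hcg = scale(D[,1:10]) %>% dist %>% hclust(method="ward.D2") %>% cutree(k=5) %>% factor
table(hcg, kmg)   # cross-tabulate against the k-means grouping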


§ Project the individuals and the variables onto the principal component space
fviz_pca_biplot(
  pca, repel=T, col.var="black", labelsize=3,
  col.ind=kmg, alpha.ind=0.6, pointshape=16, 
  addEllipses = T, ellipse.level = 0.65, 
  mean.point = FALSE) + 
  theme(legend.position = "none")
Too few points to calculate an ellipse


💡 From the principal component space we can simultaneously observe:
  ■ the correlations among the variables
  ■ the similarities among the individuals
  ■ the characteristics of each individual (or cluster)



§ Choosing the principal component plane (here Dim.1 vs Dim.3)
fviz_pca_biplot(
  pca, axes=c(1,3),
  repel=T, col.var="black", labelsize=3,
  col.ind=kmg, alpha.ind=0.6, pointshape=16, 
  addEllipses = T, ellipse.level = 0.65, 
  mean.point = FALSE) + 
  theme(legend.position = "none")
Too few points to calculate an ellipse


FactoMineR and factoextra are both very powerful packages. Beyond continuous variables, they can also run principal component analyses on categorical and even mixed variables. Their plotting functions are equally flexible: besides the variables and individuals of the analysis itself, supplementary continuous or categorical variables, and even new data points that were not in the original data, can all be projected into the principal component space.
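
A minimal sketch of that idea on this dataset, treating Rank and Points (columns 11-12) as supplementary quantitative variables and Competition (column 13) as a supplementary qualitative variable; PCA() also accepts ind.sup= for supplementary individuals:

pca2 = PCA(D, quanti.sup=11:12, quali.sup=13, graph=FALSE)
fviz_pca_var(pca2)    # Rank and Points appear as supplementary arrows
fviz_pca_ind(pca2, habillage=13, addEllipses=TRUE)   # colour individuals by Competition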