🦋 學習重點


🌷 清理、彙總與統計


🌷 分群統計與比較


🌷 多變量分析




rm(list=ls(all=TRUE))
pacman::p_load(ggplot2,dplyr,heatmaply)
load("data/tf0.rdata")
sapply(list(cust=A0,tid=X0,items=Z0), nrow)
  cust    tid  items 
 32241 119328 817182 


品項與品類 的 營收與獲利

Z0 %>% summarise_at(vars(cust,prod,cat),n_distinct)
   cust  prod  cat
1 32256 23789 2007
品類分析

獲利貢獻(profit)最大的100個品類(cat)

col6 = c('seagreen','gold','orange',rep('red',3))
gg= group_by(Z0, cat) %>% summarise(
  solds = n(), qty = sum(qty), rev = sum(price), cost = sum(cost), 
  profit = rev - cost, margin = 100*profit/rev
  ) %>% 
  top_n(100, profit) %>% 
  ggplot(aes(x=margin, y=rev, col=profit, label=cat)) + 
  geom_point(size=2,alpha=0.8) + scale_y_log10() + 
  scale_color_gradientn(colors=col6) +
  theme_bw()
ggplotly(gg)
10203030000010000003000000
100000200000300000profitmarginrev

🚴 練習
利用以上的程式 …

  • 用圖形呈現,營收貢獻(rev)最大的100個品類(cat),排除負毛利的品類
  • 一般而言,營收較大的品類,獲利也比較大,對嗎?
  • 這兩個圖形的樣態是類似的嗎?
102030000010000003000000
100000200000300000profitmarginrev



品項分析

🚴 練習
利用以上的程式 …

  • 用圖形呈現,獲利貢獻(profit)最大的300個『品項(prod)』
  • 營收(rev)和獲利率之間有相關性嗎?
  • 用圖形呈現,營收貢獻(rev)最大的300個『品項(prod)』
  • 這兩個圖形的樣態是類似的嗎?
top-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profittop-300 profit102030100001000001000000top-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 revtop-300 rev0102030100001000001000000
200004000060000profit200004000060000profit



顧客 的 營收與獲利

Top500顧客
a500 = A0 %>% top_n(500, raw)
g = ggplot(a500, aes(x=m, y=f, col=raw)) + 
  geom_point(size=2, alpha=0.8) +
  scale_x_log10() + scale_color_gradientn(colors=col6) + 
  theme_bw()
ggplotly(g)



顧客族群 的 營收與獲利

依購買行為分群

找出最多人買的M=50個品類

M=50
cm = Z0 %>% group_by(cat) %>% summarise(r = n_distinct(cust)/nrow(A0)) %>% 
  arrange(desc(r)) %>% pull(cat) %>% head(M)

做出 顧客x品類 矩陣 - x

x = xtabs(~cust+cat, filter(Z0, cat%in%cm)) %>% as.data.frame.matrix
x = x[,order(-colSums(x))]
dim(x)
[1] 29474    50

用k-means做分群 - K=160

K=160
set.seed(1111)
kg = kmeans(x, K, iter.max=30)$cluster 
table(kg) %>% sort
kg
  25   35   38    6   49   75   62  129   26   33   43    1   32   89  124   73 
   1    1    1    3    4    4    5    5    7    7    7    8   12   13   14   15 
  88  102   27   50   93  125    7  105  138   96  146   92   94   36  128  149 
  15   15   17   17   17   17   20   21   22   24   24   26   26   27   27   27 
  24    9  117   23   70  107  130  103  110    2  150  152  122   11   57   76 
  29   31   31   33   33   33   33   34   34   35   35   35   36   37   37   37 
  58  160   61   80   97   40   68   46   64   79   84   98  148  147  123  154 
  38   38   41   41   41   42   42   47   49   49   50   52   53   54   56   57 
  81  112  156   85  109   71   20   90   48   29   31   44  155   42  120  126 
  61   61   64   66   69   72   74   76   78   83   84   86   86   87   89   92 
  82   39   52   37  131   56  100  132   34  108   10    8  143  101   15  157 
  94   95   99  108  117  118  118  118  119  119  120  122  122  125  127  127 
  51  137   17   65    3   77   16  142   19  104   66  141   95   21  111  134 
 128  133  134  134  140  140  142  143  146  149  152  155  156  158  164  168 
  22  121   41   28  127  139  115  153  151  136   99  159   13   55   53   45 
 173  174  181  190  202  210  213  214  218  220  244  268  285  289  294  307 
  86   78  140  145  106   30   67  116    4  133  144   63   54   69  158   12 
 312  328  335  366  376  377  406  409  412  422  425  430  444  467  476  483 
  60   72   83    5   59   47  135  113  114   14  119  118   74   91   87   18 
 490  503  521  522  535  553  565  616  644  690  735  746  752  794  814 2684 

用互動圖表找出重點族群

ckg = tibble(cust=rownames(x),kg=kg,by="cust")
gdf = inner_join(A0, ckg) %>% 
  group_by(kg) %>% summarise(
  gsize = n(), ttRev = sum(rev), ttProfit = sum(raw),
  avRev = mean(rev), avProfit = mean(raw),
  avRecent = mean(r), avFreq = mean(f), avMoney = mean(m)
  )  
filter(gdf, gsize >= 200, gsize <= 1000) %>% 
  ggplot(aes(avMoney,avFreq,col=ttProfit,size=gsize,label=kg)) + 
  geom_point(alpha=0.6) +
  scale_color_gradientn(colors=c("seagreen","gold","tomato")) +
  theme_bw() -> g
ggplotly(g)
5001000150020002345
80000120000160000200000240000280000ttProfitavMoneyavFreq

用熱圖找出各族群的購買樣態

color9 = c("darkblue","green","gold","orange",rep("red",5))
hmap1 = function(x, ...) { heatmaply(
  as.data.frame.matrix(x), cexRow=0.7, cexCol=0.7, 
  grid_color='gray70', ...)
  }  
g = filter(gdf, gsize >= 200, gsize <= 800) %>% pull(kg)
a = sapply(split(x[kg %in% g,1:30], kg[kg %in% g]), colMeans)  
hmap1(a, col=color9, show_dendrogram=c(F,F))
15856012754721196374151140159914769141133083106116458611813314445367136997859131451151215313913511455120103110401100205100312100102560402560201530101100201100301500202530114530110530105110106130204110507110136100507100505500201500210110117110217130315130106120106130205130206110411
01234
RFM矩陣 規則分群

在顧客資料框加入規則分群欄位

bm = c(0, quantile(A0$m,c(.25,0.5,.75)), max(A0$m)+100)
bf = c(0, quantile(A0$f,c(.25,0.5,.75)), max(A0$f)+100)
A = A0 %>% mutate(
  mx = cut(A0$m, bm, labels=paste0('M',1:4)),
  fx = cut(A0$f, bf, labels=paste0('F',1:4)),
  MF = paste0(mx, fx)
  )
table(A$mx, A$fx)
    
       F1   F2   F3   F4
  M1 3465 1477 1379 1748
  M2 2470 1475 1571 2536
  M3 2569 1557 1807 2134
  M4 3388 1790 1695 1180

找出營收最大的品類

cat100 = count(Z0, cat, wt=price, sort=T) %>% mutate(
  pc=n/sum(n), cum.pc=cumsum(pc)) %>% head(100)
cat100[c(1:5,96:100), ]
       cat       n        pc   cum.pc
1   560201 4329366 0.0422026 0.042203
2   560402 3634174 0.0354259 0.077629
3   500201 2204325 0.0214877 0.099116
4   110217 2201258 0.0214578 0.120574
5   320402 1481172 0.0144385 0.135013
96  100504  229815 0.0022402 0.547202
97  110106  227899 0.0022216 0.549424
98  100418  226905 0.0022119 0.551636
99  100407  224486 0.0021883 0.553824
100 110402  221145 0.0021557 0.555980

做出 顧客族群x品類 購買金額矩陣

Z = inner_join(Z0, A[,c('cust','MF')])
Joining, by = "cust"
mx0 = xtabs(price~MF+cat, filter(Z, cat %in% cat100$cat[1:30]))
dim(mx0)
[1] 16 30

依購買金額矩陣製作熱圖

hmap1(mx0, col=cool_warm)
560201560402320402470105470103530404530114520457530105500210100102110401530403100516500804110333500202100311110507100505110117120103130206130204100401100205530101530110500201110217M1F1M1F2M1F3M2F1M2F2M3F1M2F3M3F2M1F4M3F3M4F1M4F2M2F4M3F4M4F3M4F4
0200000400000600000800000

🌷 正規化 - 購買比例矩陣

mx1 = mx0/rowSums(mx0)
hmap1(mx1, col=cool_warm)
100205130204130206110401100102500210100401530110530101530403500202100505110117120103530105520457530114530404100311110507500804100516110333470103470105320402560402560201500201110217M1F1M1F2M1F4M1F3M2F1M2F2M2F3M2F4M3F4M3F3M3F1M3F2M4F1M4F2M4F3M4F4
0.000.050.100.15

熱圖的分群功能

mx2 = xtabs(price~MF+cat, filter(Z, cat %in% cat100$cat[1:20]))
mx3 = 100*mx2/rowSums(mx2)
hmap1(mx3, col=cool_warm, show_dendrogram=c(T,F),k_row=5)
130204130206100205100401530110530101500210100102110401530403530105520457530114530404470105320402560402560201500201110217M1F1M1F2M1F3M1F4M2F2M2F1M2F3M2F4M3F4M3F3M3F1M3F2M4F1M4F2M4F3M4F4
05101520



學習重點

💡 「矩陣」與「熱圖」
■ 依兩類別變數做分類統計,就會產生矩陣
■ 熱圖是矩陣資料的視覺化工具
■ 熱圖不只是用顏色代表數值而已
■ 它可以對矩陣的欄與列做集群分析,分別將兩個類別變數之中相似的分類撿在一起
■ 對數轉換可以降低極端(離群))值的影響,讓熱圖的顏色更有區辨效果
■ 為了建立比較基礎加強視覺效果,有時我們需要先對矩陣的欄、列或整個矩陣做轉化


💡 「比較基礎」和「數值散佈」
■ 分析其實就是做比較,而比較需要有:「比較基礎」和「可比較性」
■ 資料轉換通常是為了解決「比較基礎」和「數值散佈」這兩個問題
■ 當要比較(或視覺化)的數值之間大小相距很大的時候,可以考慮:
    § 將數值轉化為比率
    § 做對數轉換 (log10())
    § 設定數值範圍 (pmin(),pmax())
    § 標準化 (standardization)
    § 正規化 (normalization)
    § 標準化殘差矩陣 (standardization)


💡 「正規化」和「標準化」
基準化有兩種作法:
  ■ 正規化(Normalization)比較重視比例,它的值是單向的(從0到1);
  ■ 標準化(Standardization)比較重視變異,它的值是雙向的,以0為基準、以標準差為單位