pacman::p_load(latex2exp, tidyr, readr, caTools, ggplot2, dplyr, magrittr, vcd, d3heatmap, Matrix)

1. Data preprocessing: Z

1.1 Read the data
Parsed with column specification:
cols(
  TRANSACTION_DT = col_character(),
  CUSTOMER_ID = col_character(),
  AGE_GROUP = col_character(),
  PIN_CODE = col_character(),
  PRODUCT_SUBCLASS = col_double(),
  PRODUCT_ID = col_character(),
  AMOUNT = col_double(),
  ASSET = col_double(),
  SALES_PRICE = col_double()
)
1.2 Handle outliers
       qty   cost  price
99%      6  858.0 1014.0
99.9%   14 2722.0 3135.8
99.95%  24 3799.3 3999.0
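
The quantile table above can be reproduced with a few lines of base R. The sketch below uses a synthetic stand-in for Z, and the filtering rule (drop any row above the 99.95% quantile in any column) is an assumption for illustration, not necessarily the cutoff used in the original analysis.

```r
set.seed(1)
# Synthetic stand-in for the transaction table Z; values are made up
Z <- data.frame(
  qty   = rpois(10000, 1) + 1,
  cost  = round(rexp(10000, 1 / 100)),
  price = round(rexp(10000, 1 / 120))
)

# Upper-tail quantiles of each column, one row per probability
probs <- c(0.99, 0.999, 0.9995)
q <- sapply(Z, quantile, probs = probs)
print(q)

# Assumed rule: keep rows at or below the 99.95% quantile in every column
upper <- q["99.95%", ]
keep <- Z$qty <= upper["qty"] & Z$cost <= upper["cost"] & Z$price <= upper["price"]
Z_clean <- Z[keep, ]

# Number of rows removed as outliers
print(nrow(Z) - nrow(Z_clean))
```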

2. Transaction records: X

2.2 Handle outliers
         items pieces money costs   gross gross_margin
0.05%    1.000   1.00    10     9 -284.94     -0.92308
99.99%  82.000 130.68 16267 13817 3263.31      0.53125
99.995% 91.681 147.91 20198 16332 3660.70      0.58494

3. Customer data: A

1. Simple bubble chart

2. Exploring associations between variables

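Use a mosaic plot to examine associations in a contingency table (Association between Categorical Variables):
+ The size of each tile represents the count of that category combination
+ Red (blue) means the count of that combination is significantly smaller (larger) than its expected value
+ The expected value is the product of the marginal probabilities (shown in the bar charts above), scaled by the total count
+ The p-value of the chi-squared test (a test of association between categorical variables) is shown at the bottom of the plot
+ p-value < 2.22e-16: age and area are significantly associated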
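
The reading rules for the mosaic plot can be checked numerically. The sketch below uses a small, made-up contingency table (not the age-by-area data) to show how expected counts follow from the marginal probabilities and how the sign of the Pearson residual corresponds to the red/blue shading described above.

```r
# Hypothetical 2 x 3 contingency table; counts are made up for illustration
tab <- matrix(c(30, 10, 20,
                10, 30, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(age = c("young", "old"),
                              area = c("A", "B", "C")))

n <- sum(tab)
# Expected count = row marginal probability * column marginal probability * n
expected <- outer(rowSums(tab) / n, colSums(tab) / n) * n

# Pearson residuals: negative means fewer than expected (red),
# positive means more than expected (blue)
resid <- (tab - expected) / sqrt(expected)
print(round(resid, 2))

# Chi-squared test of independence; its p-value is what the plot reports
test <- chisq.test(tab)
print(test$p.value)
```

With the real data, the same table could be passed to a shaded mosaic plot, for example `vcd::mosaic(tab, shade = TRUE)`.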

2.1 Association between age and geographic area

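💡 Key findings:
※ "Age" and "area" are strongly associated
    § Nangang (z115): the share of customers aged 30~40 is relatively low, and the share aged 50 and over is relatively high
    § Xizhi (z221), Neihu (z114), and Others (zOthers): the share of customers aged 30~40 is relatively high, and the share aged 45~55 is relatively low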



2.2 Association between age and shopping day

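💡 Key findings:
※ "Age" and "shopping day" are associated
    § On weekdays, customers aged 65-70 show up more often, while customers aged 30-40 almost never come
    § Conversely, on Sundays, customers aged 30-40 show up more often, while customers aged 50 and over almost never come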



2.3 Association between area and shopping day

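💡 Key findings:
※ "Area" and "shopping day" are strongly associated
    § On weekdays, Nangang (z115) does brisk business, while the other areas see little business
    § On weekends, business in Nangang (z115) is poor, and customers instead shop in Xinyi (z110)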



3. Hierarchical cluster analysis

3.3 Standardizing customer variables
               seniority                  recency                 frquency 
 0.000000000000000027527 -0.000000000000000019501 -0.000000000000000023395 
                 weekday                   pieces                 monetary 
-0.000000000000000102981 -0.000000000000000054087 -0.000000000000000025898 
            gross_margin 
-0.000000000000000036200 
   seniority      recency     frquency      weekday       pieces     monetary 
           1            1            1            1            1            1 
gross_margin 
           1 
                   1       2        3       4
seniority     94.914 104.765   77.263  35.331
recency       83.418  14.280   43.362  23.538
frquency       1.582   6.932    2.163   1.619
weekday        0.558   0.647    0.443   0.629
pieces         7.952   8.372   28.325   7.957
monetary     712.222 751.530 2940.579 761.342
gross_margin   0.106   0.105    0.173   0.126
                  1      2      3      4
seniority     0.417  0.706 -0.102 -1.334
recency       1.375 -0.686  0.181 -0.410
frquency     -0.458  0.732 -0.328 -0.449
weekday      -0.098  0.131 -0.394  0.085
pieces       -0.282 -0.236  1.957 -0.281
monetary     -0.298 -0.257  1.983 -0.247
gross_margin -0.086 -0.093  0.367  0.047
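
The four blocks of output above (column means near 0, standard deviations of 1, group means on the raw scale, group means on the standardized scale) fit the usual standardize-then-cluster workflow. A minimal sketch on synthetic data follows; the Ward linkage, the Euclidean distance, and the subset of columns are assumptions, since the original clustering call is not shown.

```r
set.seed(2)
# Synthetic customer table; column names mirror the output above, values are made up
A <- data.frame(
  seniority = runif(200, 0, 120),
  recency   = runif(200, 0, 120),
  frquency  = rpois(200, 2) + 1,
  monetary  = rexp(200, 1 / 800)
)

# Standardize: each column gets mean 0 and standard deviation 1
S <- scale(A)
print(round(colMeans(S), 10))
print(apply(S, 2, sd))

# Hierarchical clustering (assumed: Ward linkage on Euclidean distances), cut into 4 groups
hc <- hclust(dist(S), method = "ward.D2")
grp <- cutree(hc, k = 4)

# Group profiles on the raw scale and on the standardized scale
raw_means <- sapply(split(A, grp), colMeans)
std_means <- sapply(split(as.data.frame(S), grp), colMeans)
print(round(raw_means, 3))
print(round(std_means, 3))
```

Standardizing first keeps a large-valued column such as monetary from dominating the distance calculation.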



Check Quantiles and Remove Outliers
        items pieces   money   costs  gross gross_margin
99.9%  56.000  84.00  9378.7  7797.4 1883.2      0.40035
99.95% 64.000  98.00 11261.8  9151.6 2317.1      0.43164
99.99% 85.646 137.65 17699.3 14108.5 3389.6      0.53125

Preparing the Target Variables (Y)

The Target for Regression - A$amount

Simply a Left Join
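
A left join keeps every customer from the predictor table and attaches the target where one exists. The sketch below uses base R `merge` with two tiny hypothetical tables; the key name `cust` and the rule that a missing amount means a non-buyer are assumptions for illustration.

```r
# Hypothetical predictor table (one row per customer) and target table (buyers only)
A <- data.frame(cust = c("c1", "c2", "c3"), frquency = c(3, 1, 5))
Y <- data.frame(cust = c("c1", "c3"), amount = c(520, 1310))

# Left join: keep every row of A; customers absent from Y get NA in amount
AY <- merge(A, Y, by = "cust", all.x = TRUE)

# Assumed rule: a customer with no purchase in the target period is a non-buyer
AY$buy <- !is.na(AY$amount)
print(AY)
```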

Classification Model

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Call:
glm(formula = buy ~ ., family = binomial(), data = TR[, c(2:10, 
    12)])

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.613  -0.862  -0.650   0.991   1.885  

Coefficients:
                Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)   -1.1972523   0.1295743   -9.24 < 0.0000000000000002 ***
agea29        -0.0156346   0.0888963   -0.18              0.86039    
agea34         0.0693388   0.0819900    0.85              0.39772    
agea39         0.1105934   0.0812170    1.36              0.17329    
agea44         0.0512668   0.0837627    0.61              0.54051    
agea49         0.0309601   0.0871720    0.36              0.72247    
agea54        -0.0743826   0.0957263   -0.78              0.43714    
agea59         0.1346176   0.1153689    1.17              0.24327    
agea64         0.1202517   0.1212237    0.99              0.32121    
agea69         0.2046125   0.1075123    1.90              0.05702 .  
agea99       -16.6760779 106.7462650   -0.16              0.87586    
areaz106      -0.0291434   0.1351816   -0.22              0.82931    
areaz110      -0.2223265   0.1054748   -2.11              0.03504 *  
areaz114      -0.0353269   0.1130661   -0.31              0.75470    
areaz115       0.2089803   0.0983745    2.12              0.03364 *  
areaz221       0.0900806   0.0991870    0.91              0.36378    
areazOthers   -0.0678827   0.1057391   -0.64              0.52088    
areazUnknown -27.5223923 116.5778841   -0.24              0.81337    
recency       -0.0135307   0.0009220  -14.68 < 0.0000000000000002 ***
seniority      0.0105512   0.0009403   11.22 < 0.0000000000000002 ***
frquency       0.2970780   0.0163838   18.13 < 0.0000000000000002 ***
pieces         0.0096962   0.0027934    3.47              0.00052 ***
monetary      -0.0000563   0.0000345   -1.63              0.10265    
revenue       -0.0000122   0.0000125   -0.97              0.32975    
gross_margin  -0.6880819   0.1132861   -6.07         0.0000000012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27448  on 20005  degrees of freedom
Residual deviance: 22049  on 19981  degrees of freedom
AIC: 22099

Number of Fisher Scoring iterations: 16
       predict
actual  FALSE TRUE
  FALSE  3923  877
  TRUE   1574 2201
[1] 0.55977 0.71417
                  [,1]
FALSE vs. TRUE 0.77885
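
The two numbers printed after the confusion matrix can be recomputed from the matrix itself. The first matches the share of actual FALSE cases (the accuracy of always predicting the majority class) and the second matches the overall accuracy. Reading the first figure as a baseline is an inference from the arithmetic, not something the output states. The `auc` helper is a hypothetical base-R stand-in for the rank-based AUC, shown on made-up scores.

```r
# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm <- matrix(c(3923,  877,
               1574, 2201),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predict = c("FALSE", "TRUE")))

n <- sum(cm)
baseline <- sum(cm["FALSE", ]) / n   # always predict the majority class (FALSE)
accuracy <- sum(diag(cm)) / n        # share of correct predictions
print(round(c(baseline, accuracy), 5))   # 0.55977 0.71417

# AUC as the probability that a random positive outranks a random negative
# (rank-sum formula; the toy scores below are made up)
auc <- function(score, label) {
  r <- rank(score)
  n1 <- sum(label)
  n0 <- sum(!label)
  (sum(r[label]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
print(auc(c(0.9, 0.8, 0.4, 0.3), c(TRUE, FALSE, TRUE, FALSE)))   # 0.75
```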


Regression Model


Call:
lm(formula = amount ~ ., data = TR2[, c(2:11)])

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9841 -0.2311  0.0465  0.2809  1.5430 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   1.405737   0.056689   24.80 < 0.0000000000000002 ***
agea29        0.044212   0.025291    1.75               0.0805 .  
agea34        0.102705   0.023243    4.42    0.000010046777673 ***
agea39        0.126071   0.022889    5.51    0.000000037318791 ***
agea44        0.103045   0.023437    4.40    0.000011122781602 ***
agea49        0.082340   0.024272    3.39               0.0007 ***
agea54        0.073336   0.026761    2.74               0.0061 ** 
agea59        0.057401   0.031235    1.84               0.0661 .  
agea64        0.045274   0.032362    1.40               0.1619    
agea69       -0.033785   0.028674   -1.18               0.2387    
areaz106      0.026646   0.041911    0.64               0.5249    
areaz110     -0.008423   0.034046   -0.25               0.8046    
areaz114     -0.031301   0.035996   -0.87               0.3846    
areaz115     -0.033412   0.031347   -1.07               0.2865    
areaz221     -0.009636   0.031560   -0.31               0.7601    
areazOthers  -0.020780   0.033897   -0.61               0.5399    
recency       0.000330   0.000318    1.04               0.2994    
seniority     0.000114   0.000322    0.35               0.7229    
frquency      0.026753   0.002104   12.71 < 0.0000000000000002 ***
pieces        0.005315   0.000775    6.86    0.000000000007288 ***
monetary      0.391573   0.043044    9.10 < 0.0000000000000002 ***
revenue       0.047772   0.038784    1.23               0.2181    
gross_margin  0.290456   0.037517    7.74    0.000000000000011 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.42 on 8784 degrees of freedom
Multiple R-squared:  0.296, Adjusted R-squared:  0.294 
F-statistic:  168 on 22 and 8784 DF,  p-value: <0.0000000000000002
[1] 27048

In B, there is a record for each customer. B$Buy is the probability of buying in March.

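💡: When predicting the purchase amount, remember the log and exponential transformations: a model fit on log-transformed amounts returns predictions on the log scale, so they must be exponentiated back!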
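
A minimal sketch of that round trip on synthetic data follows. It assumes a base-10 log; if the target was transformed with the natural log, `exp()` would replace `10^`. The variable names and the data are made up.

```r
set.seed(3)
# Synthetic data: log10(amount) is roughly linear in x
x <- runif(500, 1, 10)
amount <- 10^(2 + 0.1 * x + rnorm(500, sd = 0.2))
D <- data.frame(x = x, amount = amount)

# Fit on the log10 scale (assumption: the target was log10-transformed)
fit <- lm(log10(amount) ~ x, data = D)

# Predictions come back on the log10 scale ...
pred_log <- predict(fit, newdata = data.frame(x = c(2, 8)))

# ... so raise 10 to that power to recover amounts in the original units
pred_amount <- 10^pred_log
print(pred_amount)
```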

A cost-benefit function with parameters

Best strategy for group 3

   x eR.ALL    N eR.SEL
1 65 340291 3166 340810

Best strategy for group 4

   x eR.ALL    N eR.SEL
1 30 425431 7577 429626
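
Tables of this shape can come from scanning a parameterized cost-benefit function over a grid of spends. Everything in the sketch below is hypothetical: the logistic `lift` function, its parameters `m`, `b`, `a`, the margin rate, and the synthetic data. The column meanings are also inferred: x as the spend per customer, eR.ALL as the expected return when targeting everyone, N as the number of customers with positive expected return, and eR.SEL as the expected return when targeting only those N.

```r
# Hypothetical effect function: the lift in purchase probability grows with the
# spend x along a logistic curve. m = maximum lift, b = midpoint, a = spread.
lift <- function(x, m = 0.20, b = 30, a = 40) m * plogis((10 / a) * (x - b))

set.seed(4)
# Synthetic per-customer predictions (stand-ins for B$Buy and a predicted revenue)
B <- data.frame(Buy = runif(1000, 0, 1), Rev = rexp(1000, 1 / 1000))
margin <- 0.15  # assumed gross margin rate

# Expected net return per customer when spending x on each of them
net_return <- function(x) {
  dp <- pmin(1 - B$Buy, lift(x))   # purchase probability cannot exceed 1
  margin * dp * B$Rev - x
}

# Evaluate a grid of spends and summarise each one
grid <- seq(10, 120, by = 5)
res <- do.call(rbind, lapply(grid, function(x) {
  r <- net_return(x)
  data.frame(x = x, eR.ALL = sum(r), N = sum(r > 0), eR.SEL = sum(r[r > 0]))
}))

# Best spend when targeting only customers with positive expected return
best <- res[which.max(res$eR.SEL), ]
print(best)
```

Because eR.SEL sums only the positive terms, it can never fall below eR.ALL, which agrees with both tables above.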