background-image: url("pic/slide-front-page.jpg")
class: center,middle

# Advanced Statistics

## Machine Learning and Prediction

<!--- chakra: libs/remark-latest.min.js --->

### 胡华平

### Northwest A&F University

### College of Economics and Management

### huhuaping01@hotmail.com

### 2019-12-24

---
class: inverse, center, middle

## Introduction

## Supervised Learning

### DT / RF / KNN / NB / ANN

## Unsupervised Learning

### K-means / DR

---
# Introduction

How can we describe the relationships between variables?

- Exploratory data analysis
- Regression and related statistical techniques
- Machine learning
    - Supervised learning: classifiers
    - Unsupervised learning: clustering
- Data sets
    - Training data set
    - Test data set

---
class: center, middle, inverse

.small[
## Supervised Learning
]

---
# Supervised Learning

Basic concept:

The basic goal of supervised learning is to find a function that accurately describes how different measured explanatory variables can be combined to make a prediction about a response variable.

Why build such a function? A function represents a relationship between inputs and an output.

- Predict the output given an input.
- Determine which variables are useful inputs.
- Generate hypotheses.
- Understand how a system works.

---
## Case Study: High Earners in the United States

The 1994 United States Census provides records on 32,561 adults, including a binary variable indicating whether each person makes more or less than \$50,000 per year. The specifics of the data are described below.
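First, a minimal sketch of how the data might be loaded (the UCI "adult" URL and the column names below are assumptions, chosen to be consistent with the variable structure shown on the next slide):

```r
# Read the 1994 census ("adult") extract from the UCI repository (assumed source)
census <- read.csv(
  "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
  header = FALSE)

# Assign the variable names documented with the data set
names(census) <- c("age", "workclass", "fnlwgt", "education",
  "education.num", "marital.status", "occupation", "relationship",
  "race", "sex", "capital.gain", "capital.loss", "hours.per.week",
  "native.country", "income")

str(census)
```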
---
## Case Study: High Earners in the United States

The full variable structure is:

```
'data.frame':	32561 obs. of  15 variables:
 $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
 $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
 $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
 $ education.num : int  13 13 9 7 13 14 5 9 14 13 ...
 $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation    : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
 $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ capital.gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hours.per.week: int  40 13 40 40 40 40 16 45 50 40 ...
 $ native.country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
 $ income        : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
```

---
## Case Study: High Earners in the United States

We **randomly** split the original data 80/20 into two subsets, a training set and a test set.
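A minimal sketch of the split (the seed value is an assumption; any fixed value makes the split reproducible):

```r
set.seed(364)                                      # assumed seed
n <- nrow(census)
test_idx <- sample.int(n, size = round(0.2 * n))   # hold out 20% for testing
train <- census[-test_idx, ]
test  <- census[test_idx, ]
nrow(train)   # 26,049 rows, matching the outputs on the following slides
```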
---
## Decision Trees

Basic concepts:

.large[
- A decision tree is a tree-like flowchart that assigns class labels to individual observations.
- Each branch of the tree separates the records in the data set into increasingly "pure" (i.e., homogeneous) subsets, in the sense that they are more likely to share the same class label.
]

---
## Decision Trees

How is a decision tree constructed?

- Global vs. local optimality
- Hunt's algorithm
- Evaluating candidate splits into child nodes
    - Gini index
    - Information gain (entropy)

`$$\begin{align} Gini(t) & = 1- \sum_{i=1}^{2}{\left ( w_i(t) \right) ^2} \\ Entropy(t) & = - \sum_{i=1}^{2}{\left ( w_i(t) \cdot \log_2 w_i(t) \right)} \end{align}$$`

where `\(w_i(t)\)` is the fraction of records in class `\(i\)` at node `\(t\)`.

---
## Decision Trees

In the high-earners case, the income distribution in the **training set** is:

```
income
 <=50K   >50K 
 76.18  23.82 
```

Using the R package `rpart`, we can fit a decision tree:

```
n= 26049 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 26049 6206 <=50K (0.76176 0.23824)  
  2) capital.gain< 5119 24805 5030 <=50K (0.79722 0.20278) *
  3) capital.gain>=5119 1244 68 >50K (0.05466 0.94534) *
```

---
## Decision Trees

Considering only the capital gains variable `capital.gain`, we obtain the following partition:

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
## Decision Trees

Using all of the variables, the decision tree for high earners becomes:

```
n= 26049 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 26049 6206 <=50K (0.76176 0.23824)  
   2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 14310 940 <=50K (0.93431 0.06569)  
     4) capital.gain< 7074 14055 694 <=50K (0.95062 0.04938) *
     5) capital.gain>=7074 255 9 >50K (0.03529 0.96471) *
   3) relationship= Husband, Wife 11739 5266 <=50K (0.55141 0.44859)  
     6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 8199 2717 <=50K (0.66862 0.33138)  
      12) capital.gain< 5096 7796 2321 <=50K (0.70228 0.29772) *
      13) capital.gain>=5096 403 7 >50K (0.01737 0.98263) *
     7) education= Bachelors, Doctorate, Masters, Prof-school 3540 991 >50K (0.27994 0.72006) *
```

---
## Decision Trees

Based on this analysis, we can draw the tree as follows:

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---
## Decision Trees

For this tree, we can zoom in on the corresponding partition of the data:

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---
## Decision Trees

The complexity parameter (CP) table for the fitted tree is:

```
Classification tree:
rpart(formula = form, data = train)

Variables actually used in tree construction:
[1] capital.gain education    relationship

Root node error: 6206/26049 = 0.24

n= 26049 

      CP nsplit rel error xerror   xstd
1  0.126      0      1.00   1.00 0.0111
2  0.063      2      0.75   0.75 0.0100
3  0.038      3      0.69   0.69 0.0096
4  0.010      4      0.65   0.65 0.0094
```

---
## Decision Trees

.large[
Of course, we can also compute the confusion matrix for this tree:

```
            income
income_dtree  <=50K  >50K
      <=50K   18836  3015
      >50K     1007  3191
```
]

---
## Random Forests

.large[
Basic concepts:

- A random forest is a natural extension of a decision tree.
- A random forest is a collection of decision trees that are aggregated by majority rule.
]

---
## Random Forests

.large[
Procedure (see the sketch below):

1. Choose the number of decision trees to grow (controlled by the `ntree` argument) and the number of variables to consider in each tree (`mtry`).
2. Randomly select the rows of the data frame with replacement.
3. Randomly select `mtry` variables from the data frame.
4. Build a decision tree on the resulting data set.
5. Repeat this procedure `ntree` times.
]
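A minimal sketch of these steps with the `randomForest` package. The formula `form` below is our assumption about which predictors were used (it matches the variables appearing in the tree output and avoids the high-cardinality `native.country` factor); it reproduces the call whose output is printed on the next slide:

```r
library(randomForest)

# Assumed model formula: income against the main predictors
form <- as.formula(
  "income ~ age + workclass + education + marital.status +
   occupation + relationship + race + sex + capital.gain +
   capital.loss + hours.per.week")

# Grow 201 trees, considering 3 randomly chosen variables at each split
mod_forest <- randomForest(form, data = train, ntree = 201, mtry = 3)
```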
---
## Random Forests

Setting the control parameters to `ntree = 201` and `mtry = 3`, the random forest classification gives:

```
Call:
 randomForest(formula = form, data = train, ntree = 201, mtry = 3) 
               Type of random forest: classification
                     Number of trees: 201
No. of variables tried at each split: 3

        OOB estimate of  error rate: 13.44%
Confusion matrix:
       <=50K >50K class.error
 <=50K 18548 1295     0.06526
 >50K   2207 3999     0.35562
```

The overall rate of correct prediction is:

```
[1] 0.8656
```

---
## Random Forests

Going further, we can measure the relative importance of each variable for prediction.
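A minimal sketch of extracting variable importance (assuming the `mod_forest` object from the previous slide; `MeanDecreaseGini` is the importance measure that `randomForest` reports for classification):

```r
library(dplyr)
library(tibble)

importance(mod_forest) %>%
  as.data.frame() %>%
  rownames_to_column("variable") %>%
  arrange(desc(MeanDecreaseGini)) %>%   # most important variables first
  head()
```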
---
## K-Nearest Neighbors (KNN)

Basic idea: if the majority of the k samples closest to a given point in feature space belong to a particular class, then that point is assigned to the same class. The method classifies each new observation using only the class labels of its nearest neighbors.

Roughly, the idea looks like this (use your imagination!):

<img src="pic/KNN.png" width="293" style="display: block; margin: auto;" />

---
## K-Nearest Neighbors (KNN)

With `k = 10` nearest neighbors, KNN yields the following confusion matrix:

```
          income
income_knn  <=50K  >50K
     <=50K  18997  2977
     >50K     846  3229
```

The corresponding rate of correct classification is:

```
[1] 0.8532
```

---
## K-Nearest Neighbors (KNN)

We can also try a wider range of settings, `k = c(1:15, 20, 30, 40, 50)`. The figure below shows how the **number of neighbors k** relates to **classification accuracy**:

<img src="pic/knn_plot.png" width="1120" style="display: block; margin: auto;" />

---
## Naive Bayes

Basic idea: the naive Bayes classifier is built on Bayes' rule.

`$$\begin{align} Pr(Y|X) & = \frac{Pr(X,Y)}{Pr(X)} \\ & = \frac{Pr(X|Y)Pr(Y)}{Pr(X)} \end{align}$$`

---
## Naive Bayes

Consider the first person in the training set:

```
               1               
age            "39"            
workclass      " State-gov"    
fnlwgt         "77516"         
education      " Bachelors"    
education.num  "13"            
marital.status " Never-married"
occupation     " Adm-clerical" 
relationship   " Not-in-family"
race           " White"        
sex            " Male"         
capital.gain   "2174"          
capital.loss   "0"             
hours.per.week "40"            
native.country " United-States"
income         " <=50K"        
hi_cap_gains   "FALSE"         
husband_or_wife "FALSE"        
college_degree "FALSE"         
income_dtree   " <=50K"        
```

---
## Naive Bayes

From this information, we can compute the probability that he is a high earner (try the calculation yourself!):

`$$\begin{align} Pr(>50k|male) & = \frac{Pr(male|>50k)Pr(>50k)}{Pr(male)} \\ & = \frac{0.845 \cdot 0.243}{0.670} \\ & = 0.306 \end{align}$$`

---
## Naive Bayes

Similarly, using the R package `e1071` we can run the naive Bayes classifier, obtaining the confusion matrix:

```
          income
income_nb  <=50K  >50K
    <=50K  18724  3591
    >50K    1119  2615
```

along with the corresponding rate of correct classification:

```
[1] 0.8192
```

---
## Artificial Neural Networks (ANN)

.large[
Basic idea (to keep it simple!):

While the impetus for the artificial neural network comes from a biological understanding of the brain, the implementation here is entirely mathematical.
]
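A minimal sketch of fitting such a network with `nnet` (the single hidden layer with `size = 5` units is our assumption; it is consistent with the `# weights:  296` line in the trace on the next slide):

```r
library(nnet)

# One hidden layer with 5 units, fit to the same assumed formula as before
mod_nn <- nnet(form, data = train, size = 5)
```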
---
## Artificial Neural Networks (ANN)

Using the R package `nnet`, we obtain the following iteration trace:

```
# weights:  296
initial  value 27329.589383 
iter  10 value 13247.925560
final  value 13247.921773 
converged
```

---
## Artificial Neural Networks (ANN)

Similarly, the confusion matrix under the neural network is:

```
          income
income_nn  <=50K  >50K
    <=50K  18424  4272
    >50K    1419  1934
```

and the corresponding rate of correct classification is:

```
[1] 0.7815
```

---
## Ensemble Methods

Basic idea: each of the preceding methods has its merits, but none is always effective, so we can **combine** them. Consider the three classifiers KNN / NB / ANN with error rates `\(\epsilon_i\)`. If their errors are independent, the probability that all three are wrong, `\(\prod \epsilon_i\)`, is smaller than any single error rate, so the chance that at least one is right exceeds any single classifier's accuracy:

`$$\begin{align} & \epsilon_i < 1, \quad i = 1, 2, 3 \\ & \prod_{i=1}^{3} \epsilon_i < \epsilon_i <1 \\ & Pr(\text{at least one correct}) = 1- \prod_{i=1}^{3} \epsilon_i \end{align}$$`

---
## Ensemble Methods

Combining KNN / NB / ANN by majority vote gives the following confusion matrix:

```r
income_ensemble <- ifelse((income_knn == " >50K") +
                          (income_nb == " >50K") +
                          (income_nn == " >50K") >= 2, " >50K", " <=50K")
confusion <- tally(income_ensemble ~ income, data = train,
                   format = "count")
confusion
```

```
               income
income_ensemble  <=50K  >50K
         <=50K   18857  3673
         >50K      986  2533
```

and a classification accuracy of:

```r
sum(diag(confusion)) / nrow(train)
```

```
[1] 0.8211
```

---
## Evaluating Models

- Cross-validation
    - 80/20 scheme
    - 2-fold cross-validation
    - k-fold cross-validation
- Measuring prediction error

---
## Evaluating Models

- Root mean squared error (RMSE)

`$$\begin{equation} RMSE_{(Y, \hat{Y})} = \sqrt{\frac{1}{n} \sum_{i=1}^{n}{(Y-\hat{Y})^2}} \end{equation}$$`

- Mean absolute error (MAE)

`$$\begin{equation} MAE_{(Y, \hat{Y})} = \frac{1}{n} \sum_{i=1}^{n}{|Y-\hat{Y}|} \end{equation}$$`

---
## Evaluating Models

- Correlation
    - Spearman's `\(\rho\)`
    - Kendall's `\(\tau\)`
- Confusion matrix

---
## Evaluating Models

- ROC curve
    - A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
    - The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
    - The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning.
    - The false positive rate is also known as the fall-out or probability of false alarm.

---
## Evaluating Models

- Bias-variance trade-off
    - A complicated model will have less bias, but will in general have higher variance.
    - A simple model can reduce variance, but at the cost of increased bias.

---
## Evaluating Models

Let us illustrate the ROC computation with the case data. For the naive Bayes results above, we have:

```r
require("e1071")
income_probs <- mod_nb %>%
  predict(newdata = train, type = "raw") %>%
  as.data.frame()
head(income_probs, 3)
```

```
   <=50K     >50K
1 0.9879 0.012097
2 0.8561 0.143901
3 0.9998 0.000239
```

---
## Evaluating Models

.large[
With the threshold set at `0.5`, about 14% are predicted to be high earners:

```
 ` >50K` > 0.5
 TRUE FALSE 
14.33 85.67 
```

Lowering the threshold to `0.24` raises that share to about 19%:

```
 ` >50K` > 0.24
 TRUE FALSE 
19.32 80.68 
```
]

---
## Evaluating Models

Finally, we can draw the ROC curve:

```
          income
income_nb  <=50K  >50K
    <=50K  18724  3591
    >50K    1119  2615
```

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

---
## Evaluating Models

We can also illustrate the bias-variance trade-off graphically:

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" />

---
## Case Study: Overview and Comparison

Descriptive statistics for the training and test sets show that the variable `capital.gain` has a very similar distribution in both.
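A minimal sketch of that comparison using `mosaic::favstats()` (assuming the `train` and `test` splits defined earlier):

```r
library(mosaic)

favstats(~ capital.gain, data = train)  # summary statistics, training set
favstats(~ capital.gain, data = test)   # nearly identical on the test set
```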
---
## Case Study: Overview and Comparison

The fitted models belong to the following classes, each of which has its own `predict()` method:

```
[[1]]
[1] "glm" "lm" 

[[2]]
[1] "rpart"

[[3]]
[1] "randomForest.formula" "randomForest"        

[[4]]
[1] "nnet.formula" "nnet"        

[[5]]
[1] "naiveBayes"
```

```
[1] "predict.glm"          "predict.glmmPQL"     
[3] "predict.glmtree"      "predict.naiveBayes"  
[5] "predict.nnet"         "predict.randomForest"
[7] "predict.rpart"       
```

The models can then be compared side by side, as sketched below.
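A minimal sketch of such a comparison (the object names `mod_tree`, `mod_forest`, `mod_nn`, and `mod_nb` are assumptions for the models fitted on earlier slides; note that each `predict()` method takes a different `type` argument):

```r
# Test-set accuracy of each classifier
acc <- c(
  tree   = mean(predict(mod_tree,   newdata = test, type = "class")    == test$income),
  forest = mean(predict(mod_forest, newdata = test, type = "response") == test$income),
  nnet   = mean(predict(mod_nn,     newdata = test, type = "class")    == test$income),
  nb     = mean(predict(mod_nb,     newdata = test)                    == test$income)
)
round(acc, 4)
```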
---
## Case Study: Overview and Comparison

The ROC curves of the different classifiers compare as follows:

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" />

---
class: center, middle, inverse

.small[
## Unsupervised Learning
]

---
## Clustering

<!--- the evolutionary tree of mammals --->

<div class="figure" style="text-align: center">
<img src="pic/mamals.jpg" alt="The evolutionary tree of mammals" width="476" />
<p class="caption">The evolutionary tree of mammals</p>
</div>

---
## Case Study: Automobile Data

.large[
Background:

The United States Department of Energy maintains automobile characteristics for thousands of cars: miles per gallon, engine size, number of cylinders, number of gears, etc.
]

---
## Case Study: Automobile Data

```
Observations: 75
Variables: 7
$ make         <chr> "Toyota", "Toyota", "Toyota",...
$ model        <chr> "FR-S", "RC 200t", "RC 300 AW...
$ displacement <dbl> 2.0, 2.0, 3.5, 3.5, 3.5, 5.0,...
$ cylinders    <dbl> 4, 4, 6, 6, 6, 8, 4, 4, 8, 4,...
$ gears        <dbl> 6, 8, 6, 8, 6, 8, 6, 1, 8, 8,...
$ city_mpg     <dbl> 25, 22, 19, 19, 19, 16, 33, 4...
$ hwy_mpg      <dbl> 34, 32, 26, 28, 26, 25, 42, 4...
```

- As a large automaker, Toyota has a diverse lineup of cars, trucks, SUVs, and hybrid vehicles.
- Can we use unsupervised learning to categorize these vehicles in a sensible way with only the data we have been given?

---
## Hierarchical Clustering

The pairwise distances among six of the Toyota vehicles are:

```
            FR-S RC 200t RC 300 AWD RC 350 RC 350 AWD  RC F
FR-S        0.00    4.88      12.20  10.73      12.20 16.35
RC 200t     4.88    0.00       8.79   6.61       8.79 12.41
RC 300 AWD 12.20    8.79       0.00   3.35       0.00  5.32
RC 350     10.73    6.61       3.35   0.00       3.35  5.83
RC 350 AWD 12.20    8.79       0.00   3.35       0.00  5.32
RC F       16.35   12.41       5.32   5.83       5.32  0.00
```

---

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />

---
## K-Means Clustering

Basic idea: k-means is a classic clustering method, one of the top ten algorithms in data mining. It takes a parameter k and partitions the n data objects into k clusters, so that similarity is high within each cluster and low between different clusters.

The algorithm clusters around k points in space as centers, assigns each object to its nearest center, and iteratively updates the center values until the best clustering is obtained.

In outline:

1. Choose k suitable initial cluster centers.
2. At each iteration, compute the distance from every sample to each of the k centers, and assign the sample to the class of its nearest center.
3. Update each class center, e.g., as the mean of its assigned samples.
4. If none of the k cluster centers changes after an update, stop; otherwise iterate again.

---
## K-Means Clustering

.large[
The world cities data set:

The WorldCities data doesn't have any notion of continents. For simplicity, consider the 4,000 biggest cities in the world and their longitudes and latitudes.

```
Observations: 4,000
Variables: 2
$ longitude <dbl> 121.458, -58.377, 72.883, -99.12...
$ latitude  <dbl> 31.222, -34.613, 19.073, 19.428,...
```
]

---
## K-Means Clustering

Now for the picture (surprised or not?):

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" />

---
## Dimension Reduction

.large[
The Scottish Parliament voting case:

Consider votes in a parliament or congress. Specifically, consider the Scottish Parliament in 2008. Legislators often vote together in pre-organized blocs, and thus the pattern of "ayes" and "nays" on particular ballots may indicate which members are affiliated (i.e., members of the same political party).

```
'data.frame':	103582 obs. of  3 variables:
 $ bill: Factor w/ 773 levels "S1M-1","S1M-1007.1",..: 1 657 658 637 677 161 391 225 639 645 ...
 $ name: chr  "Canavan, Dennis" "Canavan, Dennis" "Canavan, Dennis" "Canavan, Dennis" ...
 $ vote: int  1 1 1 -1 -1 -1 -1 1 -1 -1 ...
```
]
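A minimal sketch of preparing these votes for dimension reduction (the data frame name `Votes` and the use of `tidyr::pivot_wider()` are assumptions): each legislator becomes a row and each ballot a column, giving a matrix we can decompose.

```r
library(dplyr)
library(tidyr)

# Reshape from long (one row per vote) to wide (one row per legislator),
# filling ballots a member did not vote on with 0
vote_wide <- Votes %>%
  pivot_wider(names_from = bill, values_from = vote, values_fill = 0)

vote_mat <- as.matrix(vote_wide[, -1])   # drop the name column
rownames(vote_mat) <- vote_wide$name

# Singular value decomposition of the legislator-by-ballot matrix
vote_svd <- svd(vote_mat)
str(vote_svd$u[, 1:2])   # the first two left singular vectors position legislators
```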
---
## Dimension Reduction

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" />

---
### Intuitive Approaches

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />

---
### Intuitive Approaches

<img src="stat-learning-slide_files/figure-html/unnamed-chunk-52-1.png" style="display: block; margin: auto;" />

---
### Singular Value Decomposition

We will skip the details here.

---
layout: false
background-image: url("pic/thank-you-gif-funny-fan.gif")
class: inverse, center

# End of Chapter