How do we describe relationships between variables?
Exploratory data analysis
Regression and related statistical techniques
Machine learning
The dataset
Basic concepts
The basic goal of supervised learning is to find a function that accurately describes how different measured explanatory variables can be combined to make a prediction about a response variable.
Why build such a function?
A function represents a relationship between inputs and an output.
The 1994 United States Census provides records on 32,561 adults, including a binary variable indicating whether each person makes more or less than \$50,000.
The data are summarized below.
The structure of all the variables is as follows:
```
'data.frame':	32561 obs. of  15 variables:
 $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
 $ workclass     : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
 $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
 $ education     : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
 $ education.num : int  13 13 9 7 13 14 5 9 14 13 ...
 $ marital.status: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
 $ occupation    : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
 $ relationship  : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
 $ race          : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
 $ sex           : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
 $ capital.gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
 $ capital.loss  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hours.per.week: int  40 13 40 40 40 40 16 45 50 40 ...
 $ native.country: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
 $ income        : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
```
Basic concept:
A decision tree is a tree-like flowchart that assigns class labels to individual observations.
Each branch of the tree separates the records in the data set into increasingly "pure" (i.e., homogeneous) subsets, in the sense that they are more likely to share the same class label.
How do we construct a decision tree?
Global vs. local optimality
Hunt's algorithm
Evaluating child nodes
$$\mathrm{Gini}(t) = 1 - \sum_{i=1}^{2} \big(w_i(t)\big)^2 \qquad \mathrm{Entropy}(t) = -\sum_{i=1}^{2} w_i(t) \cdot \log_2 w_i(t)$$

where $w_i(t)$ is the fraction of records in class $i$ at node $t$.
In the income example, the distribution of income (in percent) in the training set is:
```
income
 <=50K   >50K 
 76.18  23.82 
```
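For example, plugging the root-node proportions 0.7618 and 0.2382 into the formulas above gives:

$$\mathrm{Gini}(\text{root}) = 1 - (0.7618^2 + 0.2382^2) \approx 0.363 \qquad \mathrm{Entropy}(\text{root}) = -(0.7618 \log_2 0.7618 + 0.2382 \log_2 0.2382) \approx 0.792$$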
Using the R package rpart, we can fit and inspect a decision tree; a sketch of the call comes first:
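A minimal sketch, with the single-predictor formula inferred from the printed tree below:

```r
library(rpart)
# fit a decision tree for income using capital.gain as the only predictor
mod_tree1 <- rpart(income ~ capital.gain, data = train)
mod_tree1
```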
```
n= 26049 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 26049 6206 <=50K (0.76176 0.23824)  
  2) capital.gain< 5119 24805 5030 <=50K (0.79722 0.20278) *
  3) capital.gain>=5119 1244 68 >50K (0.05466 0.94534) *
```
Considering only the variable capital.gain (capital gains), we obtain the following classification plot:
If we include all of the variables, the decision tree for the high-income group becomes:
```
n= 26049 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 26049 6206 <=50K (0.76176 0.23824)  
   2) relationship= Not-in-family, Other-relative, Own-child, Unmarried 14310 940 <=50K (0.93431 0.06569)  
     4) capital.gain< 7074 14055 694 <=50K (0.95062 0.04938) *
     5) capital.gain>=7074 255 9 >50K (0.03529 0.96471) *
   3) relationship= Husband, Wife 11739 5266 <=50K (0.55141 0.44859)  
     6) education= 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, HS-grad, Preschool, Some-college 8199 2717 <=50K (0.66862 0.33138)  
      12) capital.gain< 5096 7796 2321 <=50K (0.70228 0.29772) *
      13) capital.gain>=5096 403 7 >50K (0.01737 0.98263) *
     7) education= Bachelors, Doctorate, Masters, Prof-school 3540 991 >50K (0.27994 0.72006) *
```
Based on this analysis, we can draw the decision tree as the following diagram:
For this classification, we can also zoom in on the corresponding partition plot:
The complexity-parameter (CP) table for the fitted tree is shown below; a sketch of the call comes first:
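A sketch, assuming the full tree was stored as mod_tree (the formula matches the Call: line in the output):

```r
# complexity-parameter table for the tree fit with all predictors,
# assuming mod_tree <- rpart(form, data = train)
printcp(mod_tree)
```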
```
Classification tree:
rpart(formula = form, data = train)

Variables actually used in tree construction:
[1] capital.gain education    relationship

Root node error: 6206/26049 = 0.24

n= 26049 

     CP nsplit rel error xerror   xstd
1 0.126      0      1.00   1.00 0.0111
2 0.063      2      0.75   0.75 0.0100
3 0.038      3      0.69   0.69 0.0096
4 0.010      4      0.65   0.65 0.0094
```
Of course, we can also compute the confusion matrix for this tree; a sketch of the computation comes first:
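A sketch using the same tally() idiom that appears later for the ensemble (income_dtree also shows up as a column of the training data below):

```r
library(mosaic)   # provides tally()
library(dplyr)
# attach the tree's predicted class to the training data, then cross-tabulate
train <- train %>%
  mutate(income_dtree = predict(mod_tree, type = "class"))
tally(income_dtree ~ income, data = train, format = "count")
```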
```
            income
income_dtree  <=50K  >50K
       <=50K  18836  3015
       >50K    1007  3191
```
Basic concept:
A random forest is a natural extension of a decision tree.
A random forest is a collection of decision trees that are aggregated by majority rule.
The procedure involves:
- Choosing the number of decision trees to grow (controlled by the ntree argument) and the number of variables to consider in each tree (mtry)
- Randomly selecting the rows of the data frame with replacement
- Randomly selecting mtry variables from the data frame
- Building a decision tree on the resulting data set
- Repeating this procedure ntree times
Setting the control parameters ntree = 201 and mtry = 3, the random forest classification result is shown below, preceded by the call itself:
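The call, echoed in the output's Call: line:

```r
library(randomForest)
# grow 201 trees, trying 3 randomly chosen variables at each split
mod_forest <- randomForest(form, data = train, ntree = 201, mtry = 3)
mod_forest
```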
```
Call:
 randomForest(formula = form, data = train, ntree = 201, mtry = 3) 
               Type of random forest: classification
                     Number of trees: 201
No. of variables tried at each split: 3

        OOB estimate of  error rate: 13.44%
Confusion matrix:
       <=50K >50K class.error
<=50K  18548 1295     0.06526
>50K    2207 3999     0.35562
```
We can then compute the overall accuracy:
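A sketch of that computation from the OOB confusion matrix above:

```r
# diagonal of the OOB confusion matrix (dropping the class.error column)
# over the number of training rows: (18548 + 3999) / 26049
sum(diag(mod_forest$confusion[, 1:2])) / nrow(train)
```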
```
[1] 0.8656
```
Basic idea: if most of the k samples most similar to a given sample (i.e., its nearest neighbors in feature space) belong to a particular class, then that sample belongs to the class as well. The method assigns the class of a new sample based only on the classes of its one or several nearest neighbors.
Roughly, the idea looks like the figure below (use your imagination!):
With k = 10 (the number of nearest neighbors considered, not the number of classes), the K-nearest-neighbors method yields the confusion matrix below. A sketch of the fit comes first:
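A minimal sketch; the exact predictor set train_q is an assumption (KNN needs quantitative inputs):

```r
library(class)
# hypothetical choice: the five quantitative variables in the census data
train_q <- train[, c("age", "education.num", "capital.gain",
                     "capital.loss", "hours.per.week")]
# classify every training row by majority vote among its 10 nearest neighbors
income_knn <- knn(train = train_q, test = train_q, cl = train$income, k = 10)
```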
```
          income
income_knn  <=50K  >50K
     <=50K  18997  2977
     >50K     846  3229
```
From this, the proportion of correct predictions is:
```
[1] 0.8532
```
We can also try a wider range of neighborhood sizes, k = c(1:15, 20, 30, 40, 50); the figure below compares the number of neighbors k with classification accuracy:
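A sketch of the sweep, reusing the hypothetical train_q from above:

```r
# training-set accuracy of a k-NN classifier for a given k
knn_accuracy <- function(k) {
  preds <- class::knn(train = train_q, test = train_q,
                      cl = train$income, k = k)
  mean(preds == train$income)
}
ks  <- c(1:15, 20, 30, 40, 50)
acc <- sapply(ks, knn_accuracy)
plot(ks, acc, type = "b", xlab = "number of neighbors k", ylab = "accuracy")
```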
Basic idea: naive Bayes classification is based on Bayes' rule.
$$\Pr(Y \mid X) = \frac{\Pr(X \cap Y)}{\Pr(X)} = \frac{\Pr(X \mid Y)\,\Pr(Y)}{\Pr(X)}$$
Consider the first person in the training dataset:
1 age "39" workclass " State-gov" fnlwgt "77516" education " Bachelors" education.num "13" marital.status " Never-married"occupation " Adm-clerical" relationship " Not-in-family"race " White" sex " Male" capital.gain "2174" capital.loss "0" hours.per.week "40" native.country " United-States"income " <=50K" hi_cap_gains "FALSE" husband_or_wife "FALSE" college_degree "FALSE" income_dtree " <=50K"
From these values we can compute the probability that he is a high earner (try it yourself!):
$$\Pr(>50\mathrm{K} \mid \text{male}) = \frac{\Pr(\text{male} \mid >50\mathrm{K}) \cdot \Pr(>50\mathrm{K})}{\Pr(\text{male})} = \frac{0.845 \cdot 0.243}{0.670} = 0.306$$
Similarly, we can use the e1071 package in R to run naive Bayes and obtain the confusion matrix below; a sketch of the fit comes first:
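A sketch of the fit (mod_nb is the object name that reappears in the ROC section below):

```r
library(e1071)
# naive Bayes with the same model formula, then class predictions on train
mod_nb <- naiveBayes(form, data = train)
income_nb <- predict(mod_nb, newdata = train)
```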
```
         income
income_nb  <=50K  >50K
    <=50K  18724  3591
    >50K    1119  2615
```
And the corresponding accuracy:
```
[1] 0.8192
```
Basic idea (keeping it simple!):
While the impetus for the artificial neural network comes from a biological understanding of the brain, the implementation here is entirely mathematical.
Using the nnet package in R, we obtain the iteration trace below; a sketch of the call comes first:
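A sketch of the call; the hidden-layer size is an assumption, since the 296 weights in the trace depend on how the factors expand into dummy variables:

```r
library(nnet)
# single-hidden-layer network; size = 5 is an illustrative choice
mod_nn <- nnet(form, data = train, size = 5)
income_nn <- predict(mod_nn, newdata = train, type = "class")
```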
```
# weights:  296
initial  value 27329.589383 
iter  10 value 13247.925560
final  value 13247.921773 
converged
```
Similarly, the confusion matrix under the artificial neural network is:
```
         income
income_nn  <=50K  >50K
    <=50K  18424  4272
    >50K    1419  1934
```
And its classification accuracy:
```
[1] 0.7815
```
Basic idea: each of the preceding methods is reasonable, but none is guaranteed to work best in every case, so we can combine them into an ensemble. Consider the three methods KNN / NB / ANN, each with error rate $\epsilon_i$. The basic logic is:
If the three classifiers err independently, each with error rate $\epsilon_i < 1$, then the probability that all three are wrong is

$$\prod_{i=1}^{3} \epsilon_i < \epsilon_i < 1,$$

so at least one of them is correct with probability $1 - \prod_{i=1}^{3} \epsilon_i$, which exceeds any single method's accuracy $1 - \epsilon_i$.
Combining KNN / NB / ANN by majority vote therefore gives the following confusion matrix:
```r
income_ensemble <- ifelse((income_knn == " >50K") +
                          (income_nb == " >50K") +
                          (income_nn == " >50K") >= 2,
                          " >50K", " <=50K")
confusion <- tally(income_ensemble ~ income, data = train, format = "count")
confusion
```
```
               income
income_ensemble  <=50K  >50K
         <=50K   18857  3673
         >50K      986  2533
```
And the resulting prediction accuracy:
```r
sum(diag(confusion)) / nrow(train)
```

```
[1] 0.8211
```
Cross-validation methods (a sketch of the simplest scheme follows the list):
80/20 scheme
2-fold cross-validation
k-fold cross-validation
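A minimal sketch of the 80/20 scheme; the data-frame name census is an assumption (the split reproduces the 26,049-row training set used above):

```r
set.seed(364)                                    # arbitrary seed
n <- nrow(census)
test_idx <- sample.int(n, size = round(0.2 * n)) # hold out 20% for testing
train <- census[-test_idx, ]
test  <- census[test_idx, ]
```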
Measuring prediction error
$$\mathrm{RMSE}(Y, \hat{Y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y - \hat{Y})^2}$$

$$\mathrm{MAE}(Y, \hat{Y}) = \frac{1}{n} \sum_{i=1}^{n} |Y - \hat{Y}|$$
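Both measures translate directly into R; a minimal sketch, assuming numeric vectors y and y_hat:

```r
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2)) # root mean squared error
mae  <- function(y, y_hat) mean(abs(y - y_hat))      # mean absolute error
```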
Correlation
Confusion matrix
ROC curve (receiver operating characteristic)
A receiver operating characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning.
The false-positive rate is also known as the fall-out or probability of false alarm.
Bias-variance trade-off
A complicated model will have less bias, but will in general have higher variance.
A simple model can reduce variance, but at the cost of increased bias.
Let us illustrate the computation of the ROC curve with the case data.
For the naive Bayes results above, we have:
require("e1071")income_probs <- mod_nb %>% predict(newdata = train, type = "raw") %>% as.data.frame()head(income_probs, 3)
```
   <=50K     >50K
1 0.9879 0.012097
2 0.8561 0.143901
3 0.9998 0.000239
```
If we set the threshold at 0.5, about 14% are predicted to earn a high income:
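A sketch of the tabulation behind this output (note the leading space in the column name):

```r
library(mosaic)
# percentage of training rows with predicted Pr(>50K) above the threshold
tally(~ ` >50K` > 0.5, data = income_probs, format = "percent")
```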
```
` >50K` > 0.5
 TRUE FALSE 
14.33 85.67 
```
Lowering the threshold to 0.24, about 19% are predicted to earn a high income:
```
` >50K` > 0.24
 TRUE FALSE 
19.32 80.68 
```
Finally, we can draw the ROC curve.
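A sketch of one way to draw it; the choice of the ROCR package is an assumption:

```r
library(ROCR)
# scores are the predicted probabilities of the positive class " >50K"
pred <- prediction(income_probs$` >50K`, train$income)
perf <- performance(pred, "tpr", "fpr")  # TPR vs. FPR over all thresholds
plot(perf, col = "blue", lwd = 2)
abline(0, 1, lty = 3)                    # reference line: random classifier
```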
We also show an evaluation plot for the bias-variance trade-off.
To compare all of the fitted models systematically, we can inspect their classes and the predict() methods available for them:

```
[[1]]
[1] "glm" "lm" 

[[2]]
[1] "rpart"

[[3]]
[1] "randomForest.formula" "randomForest"        

[[4]]
[1] "nnet.formula" "nnet"        

[[5]]
[1] "naiveBayes"
```

```
[1] "predict.glm"          "predict.glmmPQL"     
[3] "predict.glmtree"      "predict.naiveBayes"  
[5] "predict.nnet"         "predict.randomForest"
[7] "predict.rpart"       
```
The summary comparison of all models is given below:
| | model | train_accuracy | test_accuracy | test_tpr | test_fpr |
|---|---|---|---|---|---|
| 1 | mod_forest | 0.8656 | 0.8610 | 0.6324 | 0.0623 |
| 2 | mod_tree | 0.8456 | 0.8401 | 0.5101 | 0.0492 |
| 3 | mod_nb | 0.8192 | 0.8123 | 0.4367 | 0.0617 |
| 4 | mod_nn | 0.7815 | 0.7726 | 0.3156 | 0.0742 |
| 5 | mod_null | 0.7618 | 0.7489 | 0.0000 | 0.0000 |
The ROC curves for the models are compared below:
The evolutionary tree of mammals
Case background:
The United States Department of Energy maintains automobile characteristics for thousands of cars: miles per gallon, engine size, number of cylinders, number of gears, etc.
```
Observations: 75
Variables: 7
$ make         <chr> "Toyota", "Toyota", "Toyota",...
$ model        <chr> "FR-S", "RC 200t", "RC 300 AW...
$ displacement <dbl> 2.0, 2.0, 3.5, 3.5, 3.5, 5.0,...
$ cylinders    <dbl> 4, 4, 6, 6, 6, 8, 4, 4, 8, 4,...
$ gears        <dbl> 6, 8, 6, 8, 6, 8, 6, 1, 8, 8,...
$ city_mpg     <dbl> 25, 22, 19, 19, 19, 16, 33, 4...
$ hwy_mpg      <dbl> 34, 32, 26, 28, 26, 25, 42, 4...
```
As a large automaker, Toyota has a diverse lineup of cars, trucks, SUVs, and hybrid vehicles.
Can we use unsupervised learning to categorize these vehicles in a sensible way with only the data we have been given?
The pairwise ("point-to-point") distances among six of these Toyota-built models are:
```
            FR-S RC 200t RC 300 AWD RC 350 RC 350 AWD  RC F
FR-S        0.00    4.88      12.20  10.73      12.20 16.35
RC 200t     4.88    0.00       8.79   6.61       8.79 12.41
RC 300 AWD 12.20    8.79       0.00   3.35       0.00  5.32
RC 350     10.73    6.61       3.35   0.00       3.35  5.83
RC 350 AWD 12.20    8.79       0.00   3.35       0.00  5.32
RC F       16.35   12.41       5.32   5.83       5.32  0.00
```
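A matrix like this comes from dist(); a sketch with hypothetical data-frame and variable names, standardizing the numeric columns first:

```r
# Euclidean distances between vehicles on the standardized numeric variables
vehicle_vars <- vehicles[, c("displacement", "cylinders", "gears",
                             "city_mpg", "hwy_mpg")]
vehicle_dist <- dist(scale(vehicle_vars))
round(as.matrix(vehicle_dist)[1:6, 1:6], 2)
```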
Basic idea: K-means is a classic clustering algorithm, one of the ten classic data-mining algorithms. It takes a parameter k and partitions the n data objects into k clusters, such that similarity within a cluster is high while similarity between different clusters is low.
Algorithm: take k points in the space as initial cluster centers, assign each object to its nearest center, and iteratively update each cluster center until the clustering stabilizes.
Algorithm description:
A dataset of the world's largest cities:
WorldCities data doesn't have any notion of continents. For simplicity, consider the 4,000 biggest cities in the world and their longitudes and latitudes.
```
Observations: 4,000
Variables: 2
$ longitude <dbl> 121.458, -58.377, 72.883, -99.12...
$ latitude  <dbl> 31.222, -34.613, 19.073, 19.428,...
```
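A sketch of a k-means fit on these coordinates; the data-frame name big_cities and the choice of six clusters are assumptions:

```r
library(ggplot2)
set.seed(15)                                  # k-means starts at random centers
city_clusts <- kmeans(big_cities, centers = 6)
big_cities$cluster <- as.factor(city_clusts$cluster)
ggplot(big_cities, aes(x = longitude, y = latitude, color = cluster)) +
  geom_point(alpha = 0.5)
```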
And now, friends, the map itself (surprised or not?!):
Case study: voting in the Scottish Parliament.
Consider votes in a parliament or congress. Specifically, consider the Scottish Parliament in 2008.
Legislators often vote together in pre-organized blocs, and thus the pattern of "ayes" and "nays" on particular ballots may indicate which members are affiliated (i.e., members of the same political party).
```
'data.frame':	103582 obs. of  3 variables:
 $ bill: Factor w/ 773 levels "S1M-1","S1M-1007.1",..: 1 657 658 637 677 161 391 225 639 645 ...
 $ name: chr  "Canavan, Dennis" "Canavan, Dennis" "Canavan, Dennis" "Canavan, Dennis" ...
 $ vote: int  1 1 1 -1 -1 -1 -1 1 -1 -1 ...
```
Let's skip the details here...