Lab 3 Local Regression (Simulated Data)
3.1 Lab Instructions
3.2 Data Set Description
For easier validation, we continue to use the Monte Carlo simulated data set from earlier:
\[\begin{align} Y_i &= m(X_i) + e_i =\frac{\sin\left(\frac{\pi}{4}(X_i-2)\right)}{\frac{\pi}{4}(X_i-2)} + e_i\\ X_i &\sim U(0,10)\\ e_i &\sim N(0,\ 0.25^2)\\ n &=100 \end{align}\]
Here we enjoy a God's-eye view: we know the data generating process (DGP) exactly, and in particular we know that the true model is nonlinear.
# simulation data set
library(tidyverse)   # tibble, dplyr, stringr and ggplot2 are used throughout
library(DT)          # datatable(), formatRound()
library(scales)      # number()
n <- 100                                  # sample size
xm <- 10                                  # upper bound of X
x <- seq(0, xm, .01)                      # evaluation grid (N = 1001 points)
xn <- length(x)
set.seed(180)
xdat <- runif(n, 0, xm)                   # X_i ~ U(0, 10)
a <- pi/4
m <- sin((xdat - 2)*a)/((xdat - 2)*a)     # true CEF m(X_i)
ydat <- m + rnorm(n)/4                    # add noise e_i ~ N(0, 0.25^2)
dt <- tibble(index = 1:length(xdat), X = xdat, Y = ydat)
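Because the DGP is known, we can also evaluate the true CEF on the evaluation grid. The following optional sketch (the object name m_true is ours, not part of the original code) is convenient later for visually comparing the estimators with the truth:
# true CEF m(x) on the evaluation grid (optional, for later comparison)
m_true <- sin((x - 2)*a)/((x - 2)*a)
m_true[is.nan(m_true)] <- 1   # sin(u)/u -> 1 as u -> 0 (the grid contains x = 2)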
3.2.1 Data Presentation
dt %>%
  DT::datatable(
    caption = paste0("Simulated sample data set (n = ", n, ")"),
    options = list(dom = "tip", pageLength = 8)) %>%
  formatRound(c(2:3), digits = 4)
- Descriptive statistics of the sample data:
     index              X               Y
 Min.   :  1      Min.   :0.1     Min.   :-0.65
 1st Qu.: 26      1st Qu.:2.3     1st Qu.:-0.06
 Median : 50      Median :4.6     Median : 0.27
 Mean   : 50      Mean   :4.9     Mean   : 0.38
 3rd Qu.: 75      3rd Qu.:7.5     3rd Qu.: 0.81
 Max.   :100      Max.   :9.9     Max.   : 1.46
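These figures match the output of base R's summary() applied to the tibble; they can be reproduced with the one-liner below:
# descriptive statistics of the simulated sample
summary(dt)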
3.3 Local Regression Estimation
3.3.1 Basic Principles
- For the binned local regression estimator, the number of bins is 5, and within each bin
\[\widehat{m}(x)=\widehat{Y}_i \quad \text{(within-bin OLS)}\]
Note: the purpose of the OLS regressions here is to produce fitted values on the evaluation grid (N = 1001 points). We therefore first estimate the intercept and slope of the fitted line in each bin from the sample data (n = 100), and then predict on the new grid (N = 1001).
- For the rolling local regression estimator, the number of (rolling) windows equals the number of evaluation points, N = 1001. We use a rectangular (uniform) kernel and locally weighted OLS to estimate the intercept of an intercept-only model, based on the approximation \(Y=m(X)+e \simeq m(x)+e\):
\[\begin{align} \widehat{m}_{\mathrm{nw}}(x)=\underset{m}{\operatorname{argmin}} \sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right)\left(Y_{i}-m\right)^{2} \end{align}\]
- For the kernel local linear estimator, there are again N = 1001 evaluation points. We use a Gaussian kernel and local OLS to estimate \(\beta_0 = m(x)\) in
\[\begin{align} Y &= m(X) +e \\ &\simeq m(x) + m^{\prime}(x)(X-x) +e \\ & = \beta_0 + \beta_1\cdot(X-x) +e \end{align}\]
Locally, this can be treated as a simple linear regression model with \(\beta_0 = m(x)\) and \(\beta_1 = m^{\prime}(x)\).
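Solving these weighted least-squares problems in closed form makes the link to the code below explicit (a standard derivation, restated here for reference). The intercept-only problem yields the Nadaraya-Watson estimator, and the local linear problem yields a weighted least-squares coefficient vector whose first element is \(\widehat{m}(x)\):
\[\begin{align} \widehat{m}_{\mathrm{nw}}(x) &= \frac{\sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right) Y_{i}}{\sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right)} \\ \widehat{\beta}(x) &= \left(\sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right) Z_{i} Z_{i}^{\prime}\right)^{-1} \sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right) Z_{i} Y_{i}, \qquad Z_{i}=\begin{pmatrix} 1 \\ X_{i}-x \end{pmatrix} \end{align}\]
with \(\widehat{m}(x)=\widehat{\beta}_0(x)\). In the code below this is exactly the line solve(t(zk)%*%z, t(zk)%*%ydat) followed by taking the first coefficient.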
3.3.2 Computing the Local Regressions
3.3.2.1 Rolling and kernel local linear regression
The rolling and kernel local linear estimators are essentially the same procedure; they differ only in the kernel function. The computation is as follows:
#### calculate ####
# set bandwidths
h1 <- 1            # half-width of the rectangular window
h2 <- h1/sqrt(3)   # Gaussian bandwidth with matching standard deviation
# Rolling LL and Kernel LL estimation
mg <- matrix(0, xn, 1)   # kernel (Gaussian) LL estimates
mr <- matrix(0, xn, 1)   # rolling (rectangular) LL estimates
for (j in 1:xn){
  xj <- xdat - x[j]                 # centered regressor X_i - x
  z  <- cbind(matrix(1, n, 1), xj)  # local linear design matrix (1, X_i - x)
  k1 <- abs(xj) < h1                # Rectangle Kernel weights
  k2 <- dnorm(xj/h2)                # Gaussian Kernel weights
  zk1 <- z*k1
  zk2 <- z*k2
  betar <- solve(t(zk1)%*%z, t(zk1)%*%ydat)   # weighted LS, rectangular kernel
  betag <- solve(t(zk2)%*%z, t(zk2)%*%ydat)   # weighted LS, Gaussian kernel
  mr[j] <- betar[1]   # rolling LL: intercept = m-hat(x[j])
  mg[j] <- betag[1]   # kernel LL: intercept = m-hat(x[j])
}
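To underline that the two estimators differ only through the kernel, the loop above can be wrapped in a small helper function. This is an optional sketch, not part of the original lab code; the function name loclin and its arguments are ours:
# optional helper: local linear estimate at each grid point, for a
# user-supplied kernel function kern(u) evaluated at u = (X_i - x0)/h
loclin <- function(xdat, ydat, grid, h, kern) {
  sapply(grid, function(x0) {
    xj <- xdat - x0
    k  <- kern(xj/h)                            # kernel weights
    z  <- cbind(1, xj)                          # local linear design matrix
    zk <- z*k
    beta <- solve(t(zk) %*% z, t(zk) %*% ydat)  # locally weighted least squares
    beta[1]                                     # intercept = m-hat(x0)
  })
}
# reproduces the values stored in mr and mg above
mr_chk <- loclin(xdat, ydat, x, h1, function(u) as.numeric(abs(u) < 1))  # rectangular kernel
mg_chk <- loclin(xdat, ydat, x, h2, dnorm)                               # Gaussian kernel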
3.3.2.2 Binned local regression
As described above, binned local regression involves two steps: we first use the sample data (n = 100) and plain OLS to obtain the intercept and slope of the fitted line within each bin, and then use these coefficients to predict on the new grid (N = 1001).
The code for binned local regression is as follows:
# Binned LL estimation
x0 <- c(1,3,5,7,9)
## for bins 1
x1 <- as.matrix(subset(xdat,xdat<2))
y1 <- as.matrix(subset(ydat,xdat<2))
z1 <- cbind(matrix(1,length(y1),1),x1)
beta1 <- solve(t(z1)%*%z1,t(z1)%*%y1) # get OLS coefficients
m1 <- beta1[1]+beta1[2]*x0[1] # fitted value at the bin center x0[1]
x1 <- as.matrix(subset(x,x<2))
f1 <- beta1[1]+x1*beta1[2] # forecast with new data
## for bins 2
x2 <- as.matrix(subset(xdat,(xdat>=2)&(xdat<4)))
y2 <- as.matrix(subset(ydat,(xdat>=2)&(xdat<4)))
z2 <- cbind(matrix(1,length(y2),1),x2)
beta2 <- solve(t(z2)%*%z2,t(z2)%*%y2)
m2 <- beta2[1]+beta2[2]*x0[2]
x2 <- as.matrix(subset(x,(x>=2)&(x<4)))
f2 <- beta2[1]+x2*beta2[2]
## for bins 3
x3 <- as.matrix(subset(xdat,(xdat>=4)&(xdat<6)))
y3 <- as.matrix(subset(ydat,(xdat>=4)&(xdat<6)))
z3 <- cbind(matrix(1,length(y3),1),x3)
beta3 <- solve(t(z3)%*%z3,t(z3)%*%y3)
m3 <- beta3[1]+beta3[2]*x0[3]
x3 <- as.matrix(subset(x,(x>=4)&(x<6)))
f3 <- beta3[1]+x3*beta3[2]
## for bins 4
x4 <- as.matrix(subset(xdat,(xdat>=6)&(xdat<8)))
y4 <- as.matrix(subset(ydat,(xdat>=6)&(xdat<8)))
z4 <- cbind(matrix(1,length(y4),1),x4)
beta4 <- solve(t(z4)%*%z4,t(z4)%*%y4)
m4 <- beta4[1]+beta4[2]*x0[4]
x4 <- as.matrix(subset(x,(x>=6)&(x<8)))
f4 <- beta4[1]+x4*beta4[2]
## for bins 5
x5 <- as.matrix(subset(xdat,xdat>=8))
y5 <- as.matrix(subset(ydat,xdat>=8))
z5 <- cbind(matrix(1,length(y5),1),x5)
beta5 <- solve(t(z5)%*%z5,t(z5)%*%y5)
m5 <- beta5[1]+beta5[2]*x0[5]
x5 <- as.matrix(subset(x,(x>=8)))
f5 <- beta5[1]+x5*beta5[2]
## combine as data.frame
tbl_a0 <- cbind(beta1, beta2,beta3, beta4,beta5) %>%
as_tibble(.name_repair = "unique") %>%
rename_all(~paste0("bd",1:5)) %>%
add_column(par= c("intercept", "slope"), .before = "bd1")
## binned LL CEF estimation
m0 <- c(m1,m2,m3,m4,m5) ## fitted values at the bin centers
f0 <- c(f1, f2,f3, f4, f5) ## binned LL fit over the grid
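The five nearly identical blocks above can also be written as a single loop. The following optional sketch reproduces the same m0 and f0 values (the names breaks, m0_chk, and f0_chk are ours):
# optional refactor: the binned LL fit written as one loop
breaks <- c(0, 2, 4, 6, 8, Inf)   # bin edges (last bin open-ended, as above)
m0_chk <- numeric(5)              # fitted value at each bin center
f0_chk <- numeric(0)              # fitted lines evaluated on the grid
for (b in 1:5) {
  sel <- xdat >= breaks[b] & xdat < breaks[b + 1]      # sample points in bin b
  zb  <- cbind(1, xdat[sel])
  betab <- solve(t(zb) %*% zb, t(zb) %*% ydat[sel])    # within-bin OLS
  m0_chk[b] <- betab[1] + betab[2]*x0[b]               # fit at the bin center
  gx <- x[x >= breaks[b] & x < breaks[b + 1]]          # grid points in bin b
  f0_chk <- c(f0_chk, betab[1] + betab[2]*gx)          # prediction on the grid
}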
3.4 Combining the Results
3.4.1 Results on the evaluation grid (N = 1001)
Based on the computations above, we now combine the results of the three local regression methods into a single table:
#### combine all results ####
tbl_result <- tibble(id = 1:length(x),
                     x = x,
                     x_upr = x + 0.01) %>%
  mutate(bd = cut_interval(x, length = 2, right = FALSE)) %>%   # 5 bins of width 2
  mutate(
    lwr = as.numeric(str_extract(bd, "(\\d{1})(?=\\,)")),       # lower bin edge
    upr = as.numeric(str_extract(bd, "(?<=\\,)(\\d{1,2})")),    # upper bin edge
    mid = 0.5*(lwr + upr)                                       # bin center
  ) %>%
  mutate(lft = x - h1,                                          # rolling window bounds
         rgt = x + h1,
         bins = str_c('[', number(lft, 0.01), ',',
                      number(rgt, 0.01), ')')) %>%
  mutate(my = rep(m0, times = c(200, 200, 200, 200, 201)),  # bin-center fit, repeated over the grid
         m0 = as.vector(f0),   # note: the column m0 now holds the grid fit f0, not the bin centers
         m1 = as.vector(mr),   # rolling LL fit
         m2 = as.vector(mg))   # kernel LL fit
The resulting table is shown below:
3.4.2 Matching the sample data to the bins (N × n = 100100)
Since every sample data point is "covered" by several bins/windows, we now match the sample data (n = 100) to the evaluation bins (N = 1001), which yields a computation table with N × n = 100100 rows.
The code is as follows:
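One possible way to build tbl_match (the table used below) is the sketch that follows, under the assumption that isbins flags whether a sample point X falls inside the window [lft, rgt) attached to grid point x; the details of the original construction may differ:
# one possible construction of tbl_match (assumption: isbins = 1 when the
# sample point X lies in the window [lft, rgt) around grid point x)
tbl_match <- dt %>%
  mutate(.k = 1L) %>%
  left_join(mutate(tbl_result, .k = 1L), by = ".k") %>%   # Cartesian product: 100 x 1001 rows
  select(-.k) %>%
  mutate(isbins = as.integer(X >= lft & X < rgt))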
3.4.3 Computation tables for the three methods
We now build the computation table for each of the three methods:
# binned LL
tbl_m0 <- tbl_match %>%
  filter(isbins==1) %>%
  select(index, X, Y, bd, bins, lwr, upr, mid, x, my, m0, isbins) %>%
  unique() %>%
  arrange(lwr, index) %>%
  group_by(x) %>%
  mutate(sum_k = sum(isbins)) %>%
  ungroup()
# rolling LL
tbl_m1 <- tbl_match %>%
  filter(isbins==1) %>%
  select(index, X, Y, bd, bins, lft, rgt, x, m1, isbins) %>%
  unique() %>%
  arrange(lft, index) %>%
  group_by(x) %>%
  mutate(sum_k = sum(isbins)) %>%
  ungroup() %>%
  rename("lwr" = "lft", "upr" = "rgt")
# kernel LL
tbl_m2 <- tbl_match %>%
  filter(isbins==1) %>%
  select(index, X, Y, bd, bins, lft, rgt, x, m2, isbins) %>%
  unique() %>%
  arrange(lft, index) %>%
  group_by(x) %>%
  mutate(sum_k = sum(isbins)) %>%
  ungroup() %>%
  rename("lwr" = "lft", "upr" = "rgt")
As an example, here is the computation table for the rolling local linear estimator:
3.5 Plotting the Results
3.5.1 Base plot
#### draw plot ####
# basic plot
fsize <- 16
p0 <- ggplot() +
  geom_point(aes(X, Y), data = dt, pch = 21) +
  labs(x = "Independent variable X", y = "Dependent variable Y") +
  scale_x_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
  scale_y_continuous(breaks = seq(-1, 2, 0.5), limits = c(-1, 2)) +
  theme_bw() +
  theme(text = element_text(size = fsize))
p00 <- p0 +
  geom_vline(xintercept = x0, lty = "dashed") +
  geom_point(aes(X, Y, color = as.factor(bd)),
             data = tbl_m0 %>% select(X, Y, bd) %>% unique(),
             pch = 21) +
  geom_rect(aes(ymin = -1, ymax = 2,
                xmin = lwr, xmax = upr,
                fill = as.factor(bd)),
            data = tbl_m0 %>% select(bd, lwr, upr) %>% unique(),
            alpha = 0.05, inherit.aes = FALSE) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.2 Binned regression plot
### binned LL plot
p1 <- p00 +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(),
             pch = 15, color = "black") +
  geom_line(aes(x = x, y = m0,
                color = as.factor(bd)),
            data = tbl_m0 %>% select(x, m0, bd) %>% unique(),
            lty = "solid", lwd = 0.8) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.3 Rolling regression plot
## rolling LL plot
p2 <- p0 +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(),
             pch = 15, color = "black") +
  geom_line(aes(x = x, y = m1,
                color = as.factor(bd)),
            data = tbl_m1 %>% select(x, m1, bd) %>% unique(),
            lty = "solid", lwd = 0.8) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.4 Kernel local linear regression plot
## Kernel LL plot
p3 <- p0 +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(),
             pch = 15, color = "black") +
  geom_line(aes(x = x, y = m2,
                color = as.factor(bd)),
            data = tbl_m2 %>% select(x, m2, bd) %>% unique(),
            lty = "solid", lwd = 0.8) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.5 All three methods
## all three plot
p_all <- p0 +
  geom_vline(xintercept = x0, lty = "dashed") +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(),
             pch = 15, color = "black", size = 2) +
  geom_line(aes(x = x, y = m0, color = "binned reg"),
            data = tbl_m0 %>% select(x, m0, bd) %>% unique(),
            lty = "solid", lwd = 0.8) +
  geom_line(aes(x, m1, color = "rolling reg"),
            data = tbl_m1 %>% select(x, m1, bd) %>% unique(),
            lty = "solid", lwd = 0.8) +
  geom_line(aes(x, m2, color = "LL reg"),
            data = tbl_m2 %>% select(x, m2, bd) %>% unique(),
            lty = "solid", lwd = 0.8) +
  scale_color_manual(
    name = "mx reg",
    breaks = c("binned reg", "rolling reg", "LL reg"),
    values = c("green", "red", "blue")) +
  theme(legend.position = "right",
        text = element_text(size = fsize))
3.6 Presenting the Results
3.6.1 Binned local regression results
Using the binned local linear regression formula, we obtain the CEF estimate for each bin:
tbl_m0 %>%
  select(bd, my, bins, x, m0) %>%
  unique() %>%
  arrange(x) %>%
  DT::datatable(options = list(dom = "tip", pageLength = 6)) %>%
  formatRound(c(2, 4:5), c(4))
- The column x denotes the evaluation points \(x_i \in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05,\cdots)\); there are \(N=1001\) of them.
- The column m0 holds \(\widehat{m}(x_i)=\widehat{Y}_i|x_i\); as before, the data are partitioned into 5 bins.
- Plotting the fitted pairs \((x_i,\widehat{m}(x_i))\) from the table above gives the binned linear regression estimate:
3.6.2 Rolling local regression results
We can compute the CEF estimate at each evaluation point:
tbl_m1 %>%
  select(bd, bins, x, m1) %>%
  unique() %>%
  arrange(x) %>%
  DT::datatable(options = list(dom = "tip", pageLength = 6)) %>%
  formatRound(c(3:4), c(4))
- The column x denotes the evaluation points \(x_i \in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05,\cdots)\); there are \(N=1001\) of them.
- The column m1 holds \(\widehat{m}(x_i)\). Plotting the fitted pairs \((x_i,\widehat{m}(x_i))\) from the table above gives the rolling local linear regression estimate:
3.6.3 Local linear regression results
Using the local linear regression formula, we obtain the estimate at each evaluation point:
tbl_m2 %>%
  select(bd, bins, x, m2) %>%
  unique() %>%
  arrange(x) %>%
  DT::datatable(options = list(dom = "tip", pageLength = 6)) %>%
  formatRound(c(3:4), c(4))
- The column x denotes the evaluation points \(x_i \in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05,\cdots)\); there are \(N=1001\) of them.
- The column m2 holds \(\widehat{m}(x_i)\). Plotting the fitted pairs \((x_i,\widehat{m}(x_i))\) from the table above gives the local linear regression estimate: