Lab 3 Local Regression (Simulated Data)
3.1 Lab Instructions
3.2 Dataset Description
To validate the methods, we continue with the Monte Carlo simulated dataset used earlier:
\[\begin{align} Y_i &= m(X_i) +e_i =\frac{\sin\left(\frac{\pi}{4}(X_i-2)\right)}{\frac{\pi}{4}(X_i-2)} +e_i\\ X_i &\sim U(0,10)\\ e_i &\sim N(0, 0.25^2)\\ n &=100 \end{align}\]
Here we take a god's-eye view: we know the data generating process (DGP), and hence we know that the true model is nonlinear.
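The true CEF is a shifted sinc function peaking at \(X=2\). A quick numerical check of its removable singularity (a sketch; `m_fun` is a helper name introduced here for illustration):

```r
a <- pi/4
# sinc-type CEF; define the removable singularity at x = 2 by its limit
m_fun <- function(x) ifelse(x == 2, 1, sin((x - 2)*a)/((x - 2)*a))
m_fun(2)          # 1, since sin(u)/u -> 1 as u -> 0
m_fun(2 + 1e-8)   # numerically indistinguishable from 1
```

With a continuous \(U(0,10)\) draw, hitting \(X_i = 2\) exactly has probability zero, so the simulation code below can use the raw ratio safely.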
# simulated dataset
library(tidyverse)
n <- 100
xm <- 10
x <- seq(0, xm, 0.01)     # evaluation grid, N = 1001 points
xn <- length(x)
set.seed(180)
xdat <- runif(n, 0, xm)
a <- pi/4
m <- sin((xdat - 2)*a)/((xdat - 2)*a)   # true CEF m(x)
ydat <- m + rnorm(n)/4                  # noise with sd = 0.25
dt <- tibble(index = 1:length(xdat), X = xdat, Y = ydat)
3.2.1 Data Presentation
dt %>%
  DT::datatable(
    caption = paste0("Simulated sample dataset (n = ", n, ")"),
    options = list(dom = "tip", pageLength = 8)) %>%
  formatRound(c(2:3), digits = 4)
- Descriptive statistics of the sample data:
 
     index           X             Y        
 Min.   :  1   Min.   :0.1   Min.   :-0.65  
 1st Qu.: 26   1st Qu.:2.3   1st Qu.:-0.06  
 Median : 50   Median :4.6   Median : 0.27  
 Mean   : 50   Mean   :4.9   Mean   : 0.38  
 3rd Qu.: 75   3rd Qu.:7.5   3rd Qu.: 0.81  
 Max.   :100   Max.   :9.9   Max.   : 1.46  
3.3 Local Regression Estimation
3.3.1 Basic Principles
- Binned local regression: the number of bins is 5. Within each bin, the estimate is the OLS fitted value:
\[\widehat{m}(x)={\widehat{Y}_i} \quad \text{(OLS)}\]
Note: the within-bin OLS regressions serve to produce predicted values at the N = 1001 grid points. We therefore first estimate the intercept and slope of each bin's fitted line from the sample data (n = 100), and then predict at the new grid points (N = 1001).
- Rolling binned local regression: the number of bins is N = 1001 (one per grid point). We use a rectangular kernel and locally weighted OLS to compute, for each bin, the intercept of an intercept-only regression \(Y=m(X)+e \simeq m(x)+e\):
\[\begin{align} \widehat{m}_{\mathrm{nw}}(x)=\underset{m}{\operatorname{argmin}} \sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right)\left(Y_{i}-m\right)^{2} \end{align}\]
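Since the objective is quadratic in the constant \(m\), setting its derivative to zero gives the familiar weighted-average (Nadaraya-Watson) form:

\[\begin{align} \widehat{m}_{\mathrm{nw}}(x)=\frac{\sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right) Y_{i}}{\sum_{i=1}^{n} K\left(\frac{X_{i}-x}{h}\right)} \end{align}\]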
- Kernel local linear regression: the number of bins is again N = 1001. We use a Gaussian kernel and local OLS to obtain \(\beta_0 =m(x)\) in the expansion below:
\[\begin{align} Y &= m(X) +e \\ &\simeq m(x) + m^{\prime}(x)(X-x) +e \\ & = \beta_0 + \beta_1\cdot(X-x) +e \end{align}\]
The model above can be treated as a simple linear regression, with \(\beta_0 =m(x)\) and \(\beta_1=m^{\prime}(x)\).
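Writing \(Z_i=(1, X_i-x)^{\prime}\) and kernel weights \(K_i=K\left(\frac{X_i-x}{h}\right)\), the local linear estimator is the weighted least-squares solution (this is exactly the normal-equation solve used in the computation code):

\[\begin{align} \widehat{\beta}(x) = \left(\sum_{i=1}^{n} K_i Z_i Z_i^{\prime}\right)^{-1} \sum_{i=1}^{n} K_i Z_i Y_i, \qquad \widehat{m}(x)=\widehat{\beta}_0(x) \end{align}\]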
3.3.2 Local Regression Computation
3.3.2.1 Rolling local regression and kernel local linear regression
Rolling local regression and kernel local linear regression are essentially the same procedure; they differ only in the kernel used. The code is:
#### calculate ####
# set bandwidths
h1 <- 1              # rectangular-kernel bandwidth
h2 <- h1/sqrt(3)     # Gaussian sd matched to the rectangle: sd of U[-h1, h1] is h1/sqrt(3)
# rolling LL and kernel LL estimation
mg <- matrix(0, xn, 1)
mr <- matrix(0, xn, 1)
for (j in 1:xn){
  xj <- xdat - x[j]                 # center the regressor at the grid point
  z  <- cbind(matrix(1, n, 1), xj)  # local design matrix [1, X_i - x]
  k1 <- abs(xj) < h1      # rectangular kernel weights
  k2 <- dnorm(xj/h2)      # Gaussian kernel weights
  zk1 <- z*k1
  zk2 <- z*k2
  betar <- solve(t(zk1)%*%z, t(zk1)%*%ydat)   # weighted LS, rectangular kernel
  betag <- solve(t(zk2)%*%z, t(zk2)%*%ydat)   # weighted LS, Gaussian kernel
  mr[j] <- betar[1]   # rolling LL estimate at x[j]
  mg[j] <- betag[1]   # kernel LL estimate at x[j]
}
3.3.2.2 Binned local regression
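The explicit normal-equation solve in the loop is equivalent to base R's weighted `lm()`. A self-contained check at one grid point (the point \(x_0=5\) is chosen arbitrarily here; the simulation is regenerated with the same seed):

```r
set.seed(180)
n <- 100; a <- pi/4
xdat <- runif(n, 0, 10)
ydat <- sin((xdat - 2)*a)/((xdat - 2)*a) + rnorm(n)/4
x0 <- 5                       # an arbitrary grid point
h2 <- 1/sqrt(3)               # Gaussian bandwidth, as in the loop above
xj <- xdat - x0
k2 <- dnorm(xj/h2)            # Gaussian kernel weights
z  <- cbind(1, xj)            # local design matrix
beta_solve <- solve(t(z*k2) %*% z, t(z*k2) %*% ydat)   # manual weighted LS
beta_lm    <- coef(lm(ydat ~ xj, weights = k2))        # built-in weighted LS
all.equal(as.numeric(beta_solve), as.numeric(beta_lm)) # TRUE
```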
As described above, binned local regression proceeds in two steps: first, fit an OLS line within each bin using the sample data (n = 100) to obtain the bin's intercept and slope; second, predict at the new grid points (N = 1001).
The code for binned local regression is:
# binned LL estimation
x0 <- c(1, 3, 5, 7, 9)   # bin centers
## for bins 1
x1 <- as.matrix(subset(xdat,xdat<2))
y1 <- as.matrix(subset(ydat,xdat<2))
z1 <- cbind(matrix(1,length(y1),1),x1)
beta1 <- solve(t(z1)%*%z1, t(z1)%*%y1)   # OLS coefficients for bin 1
m1 <- beta1[1] + beta1[2]*x0[1]          # fitted value at the bin center
x1 <- as.matrix(subset(x,x<2))
f1 <- beta1[1]+x1*beta1[2]     # forecast with new data
## for bins 2
x2 <- as.matrix(subset(xdat,(xdat>=2)&(xdat<4)))
y2 <- as.matrix(subset(ydat,(xdat>=2)&(xdat<4)))
z2 <- cbind(matrix(1,length(y2),1),x2)
beta2 <- solve(t(z2)%*%z2,t(z2)%*%y2)
m2 <- beta2[1]+beta2[2]*x0[2]          
x2 <- as.matrix(subset(x,(x>=2)&(x<4)))
f2 <- beta2[1]+x2*beta2[2]
## for bins 3
x3 <- as.matrix(subset(xdat,(xdat>=4)&(xdat<6)))
y3 <- as.matrix(subset(ydat,(xdat>=4)&(xdat<6)))
z3 <- cbind(matrix(1,length(y3),1),x3)
beta3 <- solve(t(z3)%*%z3,t(z3)%*%y3)
m3 <- beta3[1]+beta3[2]*x0[3]
x3 <- as.matrix(subset(x,(x>=4)&(x<6)))
f3 <- beta3[1]+x3*beta3[2]
## for bins 4
x4 <- as.matrix(subset(xdat,(xdat>=6)&(xdat<8)))
y4 <- as.matrix(subset(ydat,(xdat>=6)&(xdat<8)))
z4 <- cbind(matrix(1,length(y4),1),x4)
beta4 <- solve(t(z4)%*%z4,t(z4)%*%y4)
m4 <- beta4[1]+beta4[2]*x0[4]
x4 <- as.matrix(subset(x,(x>=6)&(x<8)))
f4 <- beta4[1]+x4*beta4[2]
## for bins 5
x5 <- as.matrix(subset(xdat,xdat>=8))
y5 <- as.matrix(subset(ydat,xdat>=8))
z5 <- cbind(matrix(1,length(y5),1),x5)
beta5 <- solve(t(z5)%*%z5,t(z5)%*%y5)
m5 <- beta5[1]+beta5[2]*x0[5]
x5 <- as.matrix(subset(x,(x>=8)))
f5 <- beta5[1]+x5*beta5[2]
## combine as data.frame
tbl_a0 <- cbind(beta1, beta2,beta3, beta4,beta5) %>%
  as_tibble(.name_repair = "unique") %>%
  rename_all(~paste0("bd",1:5)) %>%
  add_column(par= c("intercept", "slope"), .before = "bd1")
##  binned LL CEF estimation
m0 <- c(m1,m2,m3,m4,m5)        ## mean centers
f0 <- c(f1, f2, f3, f4, f5)    ## LL estimation
3.4 Combining the Computation Results
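The five nearly identical bin blocks above can be collapsed into a loop. A self-contained sketch, using `lm()`/`predict()` in place of the explicit matrix algebra (the names `lower`/`upper`/`in_smp`/`in_grd` are introduced here):

```r
set.seed(180)
n <- 100; a <- pi/4
xdat <- runif(n, 0, 10)
ydat <- sin((xdat - 2)*a)/((xdat - 2)*a) + rnorm(n)/4
x <- seq(0, 10, 0.01)
lower <- c(0, 2, 4, 6, 8); upper <- c(2, 4, 6, 8, 10)
m0 <- numeric(5)   # fitted values at the bin centers
f0 <- NULL         # fitted line over each bin's grid points
for (b in 1:5) {
  # the last bin is closed on the right so that x = 10 is included
  in_smp <- xdat >= lower[b] & (xdat < upper[b] | b == 5)
  in_grd <- x    >= lower[b] & (x    < upper[b] | b == 5)
  fit <- lm(Y ~ X, data = data.frame(X = xdat[in_smp], Y = ydat[in_smp]))
  m0[b] <- predict(fit, newdata = data.frame(X = (lower[b] + upper[b])/2))
  f0 <- c(f0, predict(fit, newdata = data.frame(X = x[in_grd])))
}
length(f0)   # 1001 grid predictions
```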
3.4.1 Combining results over the grid (N = 1001)
Using the computations above, we collect the local-regression results of all three methods into one table:
#### combine all result ####
tbl_result <- tibble(id = 1:length(x),
                     x = x,
                     x_upr = x +0.01) %>%
  mutate(bd = cut_interval(x,length = 2, right=FALSE)) %>%
  mutate(
    lwr = as.numeric(str_extract(bd, "(\\d{1})(?=\\,)")),
    upr = as.numeric(str_extract(bd, "(?<=\\,)(\\d{1,2})")),
    mid = 0.5*(lwr+ upr)
  ) %>%
  mutate(lft = x-h1,
         rgt = x +h1,
         bins = str_c('[',number(lft,0.01),',',
                      number(rgt,0.01), ')')) %>%
  mutate(my = rep(m0, times = c(200, 200, 200, 200, 201)),
         m0 = as.vector(f0),   # note: column m0 now holds the grid predictions f0
         m1 = as.vector(mr),
         m2 = as.vector(mg))
The resulting table is shown below:
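The `cut_interval()` call above comes from ggplot2; a quick self-contained check that `length = 2, right = FALSE` yields the intended five bins over the grid (note the last interval is closed at 10):

```r
library(ggplot2)
x <- seq(0, 10, 0.01)
bd <- cut_interval(x, length = 2, right = FALSE)   # bins of width 2 over [0, 10]
levels(bd)   # five left-closed intervals; the last one includes x = 10
table(bd)    # counts of grid points per bin
```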
3.4.2 Matching sample data to bins (N·n = 100100)
Since every sample point is "covered" by several bins, we match the sample data (n = 100) against the bins (N = 1001), producing a computation table with N·n = 100100 rows.
The code is:
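The code that builds `tbl_match` is not included in this extract. A minimal toy sketch of the matching step, assuming it cross-joins the grid table `tbl_result` with the sample `dt` and flags whether each sample point falls in the grid point's window `[lft, rgt)` (the tiny tables below are hypothetical stand-ins):

```r
library(dplyr)
library(tidyr)
library(tibble)
# toy stand-ins: 3 grid points with their windows, 2 sample points
tbl_result <- tibble(x = c(0, 0.5, 1)) %>% mutate(lft = x - 1, rgt = x + 1)
dt <- tibble(index = 1:2, X = c(0.2, 1.4), Y = c(0.1, 0.3))
tbl_match <- crossing(tbl_result, dt) %>%          # all grid x sample pairs
  mutate(isbins = as.integer(X >= lft & X < rgt))  # 1 if sample point in window
nrow(tbl_match)   # 6 rows = 3 grid points x 2 sample points
```

With the full data this cross join has 1001 × 100 = 100100 rows, matching the count stated above.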
3.4.3 Computation tables for each method
The computation tables under the three methods are:
# binned LL
tbl_m0 <- tbl_match %>%
  filter(isbins==1) %>%
  select(index, X, Y, bd,bins,lwr, upr,mid, x,my,m0, isbins) %>%
  unique() %>%
  arrange(lwr,index) %>%
  group_by(x) %>%
  mutate(sum_k = sum(isbins)) %>%
  ungroup() 
# rolling LL
tbl_m1 <- tbl_match %>%
  filter(isbins==1) %>%
  select(index, X, Y, bd,bins,lft, rgt,x,m1, isbins) %>%
  unique() %>%
  arrange(lft,index) %>%
  group_by(x) %>%
  mutate(sum_k = sum(isbins)) %>%
  ungroup() %>%
  rename("lwr"="lft", "upr"="rgt")
# kernel LL
tbl_m2 <- tbl_match %>%
  filter(isbins==1) %>%
  select(index, X, Y, bd,bins,lft, rgt,x,m2, isbins) %>%
  unique() %>%
  arrange(lft,index) %>%
  group_by(x) %>%
  mutate(sum_k = sum(isbins)) %>%
  ungroup() %>%
  rename("lwr"="lft", "upr"="rgt")
For example, here is the computation table for the rolling local linear nonparametric estimate:
3.5 Plotting the Results
3.5.1 Base plot
#### draw plot ####
# basic plot
fsize <- 16
p0 <- ggplot() +
  geom_point(aes(X, Y),data = dt, pch=21) +
  labs(x = "independent variable X", y = "dependent variable Y") +
  scale_x_continuous(breaks = seq(0,10,1), limits = c(0,10)) +
  scale_y_continuous(breaks = seq(-1,2,0.5), limits = c(-1,2)) +
  theme_bw() +
  theme(text = element_text(size = fsize))
p00 <- p0 +
  geom_vline(xintercept = x0, lty="dashed") +
  geom_point(aes(X, Y, color=as.factor(bd)),
             data = tbl_m0 %>% select(X, Y, bd) %>% unique(),
             pch=21) +
  geom_rect(aes(ymin=-1,ymax=2,
                xmin= lwr, xmax = upr, 
                fill = as.factor(bd)),
            data = tbl_m0 %>% select(bd, lwr, upr) %>% unique(),
            alpha = 0.05, inherit.aes = FALSE) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.2 Binned regression plot
### binned LL plot
p1 <- p00 +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(), 
             pch=15, color="black") +
  geom_line(aes(x =x, y = m0,
                color=as.factor(bd)),
               data = tbl_m0 %>% select(x,m0, bd) %>% unique(),
               lty="solid", lwd=0.8) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.3 Rolling binned regression plot
## rolling LL  plot
p2 <- p0 +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(), 
             pch=15, color="black")  +
  geom_line(aes(x =x, y = m1,
                color=as.factor(bd)),
            data = tbl_m1 %>% select(x,m1, bd) %>% unique(),
            lty="solid", lwd=0.8) +
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.4 Kernel local linear regression plot
## Kernel LL  plot
p3 <- p0 +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(), 
             pch=15, color="black")  +
  geom_line(aes(x =x, y = m2,
                color=as.factor(bd)),
            data = tbl_m2 %>% select(x,m2, bd) %>% unique(),
            lty="solid", lwd=0.8)+
  theme(legend.position = "none",
        text = element_text(size = fsize))
3.5.5 All three methods in one plot
## all three plot
p_all <- p0 +
  geom_vline(xintercept = x0, lty="dashed")  +
  geom_point(aes(mid, my),
             data = tbl_m0 %>% select(mid, my) %>% unique(), 
             pch=15, color="black", size =2)  +
  geom_line(aes(x =x, y = m0, 
                color="binned reg"),
            data = tbl_m0 %>% select(x,m0, bd) %>% unique(),
            lty="solid", lwd=0.8) +
  geom_line(aes(x, m1, color="rolling reg"),
            data = tbl_m1 %>% select(x,m1, bd) %>% unique(),
            lty="solid", lwd=0.8) +
  geom_line(aes(x, m2, color = "LL reg"),
            data = tbl_m2 %>% select(x,m2, bd) %>% unique(),
            lty="solid", lwd=0.8) +
  scale_color_manual(
    name="mx reg",
    breaks = c("binned reg", "rolling reg","LL reg"),
    values=c("green", "red","blue"))+
  theme(legend.position = "right",
        text = element_text(size = fsize))
3.6 Presenting the Results
3.6.1 Binned local regression results
Using the binned local linear regression estimator, we obtain the CEF estimate for each bin:
tbl_m0 %>%
  select(bd, my, bins,x, m0) %>%
  unique() %>%
  arrange(x) %>%
  DT::datatable(options = list(dom = "tip", pageLength =6))%>%
  formatRound(c(2,4:5), c(4))
Given the settings above:
- x denotes the grid points \(x_i \in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05,\cdots)\), giving \(N=1001\) binned regression estimates
- m0 denotes \(\widehat{m}(x_i)=\widehat{Y}_i|x_i\); as before, 5 bins are used
- Plotting the fitted pairs \((x_i,\widehat{m}(x_i))\) from the computation table gives the binned linear regression estimate:
 
3.6.2 Rolling binned local regression results
We obtain the CEF estimates over the rolling bins:
tbl_m1 %>%
  select(bd, bins,x, m1) %>%
  unique() %>%
  arrange(x) %>%
  DT::datatable(options = list(dom = "tip", pageLength =6))%>%
  formatRound(c(3:4), c(4))
Given the settings above:
- x denotes the grid points \(x_i \in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05,\cdots)\), giving \(N=1001\) binned regression estimates
- m1 denotes \(\widehat{m}(x_i)\)
Plotting the fitted pairs \((x_i,\widehat{m}(x_i))\) from the computation table gives the rolling binned linear regression estimate:
3.6.3 Local linear regression results
Using the local linear regression estimator, we obtain the estimates:
tbl_m2 %>%
  select(bd, bins,x, m2) %>%
  unique() %>%
  arrange(x) %>%
  DT::datatable(options = list(dom = "tip", pageLength =6))%>%
  formatRound(c(3:4), c(4))
Given the settings above:
- x denotes the grid points \(x_i \in (0.00, 0.01, 0.02, 0.03, 0.04, 0.05,\cdots)\), giving \(N=1001\) binned regression estimates
- m2 denotes \(\widehat{m}(x_i)\)
Plotting the fitted pairs \((x_i,\widehat{m}(x_i))\) from the computation table gives the local linear regression estimate: