研究目标是把各地区一般公共支出统计表,正确识别并自动化清洗。
中国统计年鉴
数据网站:搜数网人大权限
搜数标题:
可下载年限范围:2000-2019
文件格式:html-形式表格
总结起来问题大概有三个:
html表格形式异常:
标题行异常:
2004年及以前,2006年:双行标题
2005年,单行标题
2018年,共3个html页面,第一个页面双行标题,其他则为单行标题
排列缩进异常:
主要包括如下几个目标:
获得排列整齐的表格(dataframe)
获得原始文件的列标题
高度编程自动化,尽量减少手工操作介入
经过仔细分析html源文件,可以发现表格背后的一些规律和特点:
A.表格显示不整齐现象:
无法对齐(chrome直接看html 表格)。尤其是碰到缺失值(且无特定缺失值标记符,如-),进一步加剧显示的混乱性。
基本对齐(chrome查看源代码,chrome - 检查- sources -)。
B.表格显示不整齐的原因:
html编码时,table/tbody/tr/td, 采用了 空格符。其unic编码显示为\u00A0
。
数据表对齐方式。在html源码文件中(chrome - 检查- sources -) 可能存在“左对齐”和“右对齐”两种办法。
C.表格显示不整齐的解决办法:
识别左对齐还是右对齐(check)
然后定位位置(str_locate_all)
处理缺失值问题,并进行全列匹配(left_join)。
转载需要用到的各种R包。
# load pkgs
library("xml2")
require("rvest")
require("stringr")
require("tidyverse")
require("tidyselect")
require("openxlsx")
require("here")
提前导入已经设置好的R函数:
# files html path
files_dir <- here::here("data", "public-budget","html")
files_html <- list.files(str_c(files_dir,"/"))
page_url <- str_c(files_dir, files_html, sep = "/")
# specify the year
year<- 2011
indx<- str_detect(page_url,paste0(year))
page<- page_url[indx]
i_len <- length(page)
html文件位于D:/github/team-students/data/public-budget/html/2011-t00.html。
R代码如下:
# xpath selector
path_tbl<-"/html/body/table[2]/tbody/tr[2]/td/table/tbody/tr/td[2]/table"
# the html source file declares gb2312
tbl_raw <- read_html(page,encoding = "utf-8") %>%
html_nodes(xpath = path_tbl) %>%
html_table(., fill=T, trim=T) %>%
.[[1]]
初步提取结果如下:
X1 |
---|
【行业】财税 |
【地域】中国 |
【时间】20111231 |
【参考资料】精讯数据 |
【统计项目】中国2011年各地区财政支出统计(13145字) |
【指标参数】 |
<U+00A0><U+00A0><U+00A0><U+00A0>中国2011年各地区财政支出统计 <U+00A0><U+00A0><U+00A0><U+00A0>单位:亿元 | |地区<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>一般预算支出<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>一般公共服务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>外交<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>国防<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>公共安全<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>教育<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>科学技术<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>文化体育与传媒<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>社会保障和就业<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>医疗卫生<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>节能保护<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>城乡社区事务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>农林水事务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>交通运输<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>资源勘探电力信息等事务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>商业服务业等事务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>金融监管等事务支出<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>地震灾后恢复重建支出<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>国土资源气象等事务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>住房保障支出<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>粮油物资储备管理事务<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>国债还本付息支出<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>其他支出 | |地方合计<U+00A0><U+00A0><U+00A0><U+00A0>92733.68<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>10084.77<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>2.75<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>198.29<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>5267.26<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>15498.28<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>1885.88<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>1704.64<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>10606.92<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>6358.19<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>2566.79<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>7608.93<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>9520.99<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>7166.69<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>3547.26<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>1394.79<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>235.34<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>174.45<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>1289.74<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>3491.87<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>729.49<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>564.12<U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0><U+00A0>2836.25 |
# detect the start and end row
first_end <- which(str_detect(tbl_raw$X1,"地方合计|新疆"))
n_part <- length(first_end)/2
# detect the range of table part
spc_part <- n_part
range <- (first_end[(2*spc_part -1)]):first_end[2*spc_part] # 如有"全国",则起始要减去1
通过特殊行“地方合计|新疆”,可以判明表格被分割了1个部分。
同时,可以探查到表格主体的范围8-39行之间。
R代码如下:
# handle the extra rows
pro_extra <- c("地方合计","内蒙古","黑龙江")
if (year %in% c(2006,2007,2009)){
pro_standard <- c("地方合计\u00A0","内蒙古\u00A0","黑龙江\u00A0")
} else {
pro_standard <- c("地方合计\u00A0\u00A0","内蒙古\u00A0","黑龙江\u00A0")
}
# get the tidy alignment tbl
tbl_html <- tbl_raw %>%
.[range,] %>%
as_tibble() %>%
mutate(value= str_trim(value, "both")) %>%
mutate(value=paste0(value, "\u00A0")) %>%
mutate(value= mgsub::mgsub(value, pro_extra, pro_standard)) %>%
# be careful to the return value' class of function str_locate_all()
mutate(location=str_locate_all(value, "(?<=\u00A0)(\\d.+?)(?=\u00A0)"),
start=(map(location, function(x){x[,1]})),
end=(map(location, function(x){x[,2]}))) %>%
mutate(begin=map(start, function(x){x[[1]]}),
tail=map(end, function(x){x[[1]]})) %>%
mutate(begin=unlist(begin),
tail=unlist(tail),
mod_begin=unlist(getModes(begin)),
mod_tail=unlist(getModes(tail)),
diff_begin = mod_begin-begin,
diff_tail = mod_tail-tail,
check_begin = (begin==mod_begin),
check_tail = (tail==mod_tail))
识别html表格对齐模式的简要结果:
tbl_html %>%
as_tibble() %>%
select(begin:check_tail) %>%
head() %>%
add_column(index= 1:nrow(.), .before = "begin") %>%
kable(caption = "检查排列的关系")
index | begin | tail | mod_begin | mod_tail | diff_begin | diff_tail | check_begin | check_tail |
---|---|---|---|---|---|---|---|---|
1 | 11 | 18 | 11 | 17 | 0 | -1 | TRUE | FALSE |
2 | 11 | 17 | 11 | 17 | 0 | 0 | TRUE | TRUE |
3 | 11 | 17 | 11 | 17 | 0 | 0 | TRUE | TRUE |
4 | 11 | 17 | 11 | 17 | 0 | 0 | TRUE | TRUE |
5 | 11 | 17 | 11 | 17 | 0 | 0 | TRUE | TRUE |
6 | 11 | 17 | 11 | 17 | 0 | 0 | TRUE | TRUE |
手动调整异常情况后:
如果首列定位完全一样,也即all(tbl_htm$check_begin)
为True
,则判定源文件表格为左对齐
如果尾列定位完全一样,也即all(tbl_htm$check_tail)
为True
,则判定源文件表格为右对齐
R代码如下:
# get the data.frame
tbl_pair <- tbl_html %>%
mutate(num =str_extract_all(value, "(?<=\u00A0)(\\d.+?)(?=\u00A0)")) %>%
mutate(pair=pmap(list(start, end, num),
.f = function(x,y,z){value <- tibble(start=x, end=y, num=z)})) %>%
mutate(pair=map(pair,
function(dt){calc<- dt %>% mutate(n=end-start +1)})) %>%
mutate(len=map(pair, function(dt){dim(dt)[1]}))
# get the standard mod
tbl_mod <- tbl_pair %>%
filter(len==max(unlist(.$len))) %>%
.[1,] %>%
select(pair) %>%
unnest(pair) %>%
mutate(cols=paste0("V",str_pad(1:nrow(.), width = 2,pad = "0", side = "left"))) %>%
select(start,end, cols)
# match the standard rows
if (all(tbl_pair$check_begin)){
tbl_match <- tbl_pair %>%
select(value,pair)%>%
mutate(pair=map(pair, function(dt){
out<- tbl_mod %>%
select(-end) %>% left_join(., dt, by="start") %>%
select(start, everything(.))}
))
} else if (all(tbl_pair$check_tail)) {
tbl_match <- tbl_pair %>%
select(value,pair)%>%
mutate(pair=map(pair, function(dt){
out<- tbl_mod %>%
select(-start) %>% left_join(., dt, by="end") %>%
select( end, everything(.))}
))
}
# prepare for detect provinces
levels_province<-c('地方合计','北京','天津','河北','山西','内蒙古','辽宁','吉林','黑龙江','上海','江苏','浙江','安徽','福建','江西','山东','河南','湖北','湖南','广东','广西','海南','重庆','四川','贵州','云南','西藏','陕西','甘肃','青海','宁夏','新疆')
pattern_pro <- paste0(levels_province,collapse = "|")
# get the normal data table
tbl_rows <- tbl_match %>%
add_column(row=1:nrow(.)) %>%
mutate(province= str_extract(value, pattern_pro)) %>%
select(-value) %>%
unnest(pair) %>%
select(row, province, num, cols) %>%
spread(key=cols ,value = num)
获得的整齐表格如下:
row | province | V01 | V02 | V03 | V04 | V05 | V06 | V07 | V08 | V09 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 地方合计 | 92733.68 | 10084.77 | 2.75 | 198.29 | 5267.26 | 15498.28 | 1885.88 | 1704.64 | 10606.92 | 6358.19 | 2566.79 | 7608.93 | 9520.99 | 7166.69 | 3547.26 | 1394.79 | 235.34 | 174.45 | 1289.74 | 3491.87 | 729.49 | 564.12 | 2836.25 |
2 | 北京 | 3245.23 | 261.38 | 6.59 | 218.68 | 520.08 | 183.07 | 87.01 | 354.88 | 225.49 | 94.51 | 339.27 | 187.34 | 199.12 | 170.35 | 41.06 | 2.09 | 0.39 | 9.88 | 54.56 | 5.14 | 284.34 | ||
3 | 天津 | 1796.33 | 117.81 | 1.01 | 100.41 | 302.32 | 60.17 | 29.76 | 168.34 | 90.53 | 32.24 | 485.42 | 91.78 | 97.37 | 75.86 | 20.55 | 1.92 | 25.31 | 20.82 | 2.18 | 72.54 | |||
4 | 河北 | 3537.39 | 414.93 | 11.55 | 200.98 | 652.11 | 33.22 | 50.45 | 426.23 | 302.75 | 105.48 | 239.37 | 366.10 | 261.36 | 82.86 | 54.17 | 2.08 | 73.68 | 142.56 | 25.14 | 6.61 | 85.75 | ||
5 | 山西 | 2363.85 | 251.58 | 4.41 | 130.31 | 421.79 | 27.17 | 48.17 | 321.60 | 159.62 | 82.18 | 142.33 | 241.45 | 178.93 | 44.84 | 22.99 | 8.57 | 0.01 | 134.76 | 84.85 | 17.47 | 3.07 | 37.77 | |
6 | 内蒙古 | 2989.21 | 304.53 | 0.06 | 4.46 | 145.88 | 390.69 | 28.21 | 68.78 | 363.97 | 164.59 | 117.55 | 301.40 | 391.68 | 281.48 | 71.04 | 25.82 | 4.30 | 70.25 | 140.35 | 59.70 | 10.22 | 44.23 | |
7 | 辽宁 | 3905.85 | 415.23 | 11.43 | 210.29 | 544.09 | 87.20 | 68.60 | 657.36 | 182.07 | 74.20 | 442.58 | 329.20 | 220.53 | 229.09 | 56.39 | 13.32 | 0.05 | 74.04 | 152.44 | 36.43 | 25.30 | 76.00 | |
8 | 吉林 | 2201.74 | 231.40 | 4.48 | 118.81 | 319.82 | 21.18 | 44.25 | 298.99 | 143.87 | 102.42 | 154.17 | 255.57 | 149.79 | 73.82 | 22.46 | 7.12 | 0.60 | 24.97 | 124.15 | 55.45 | 34.52 | 13.90 | |
9 | 黑龙江 | 2794.08 | 256.37 | 0.09 | 5.13 | 141.22 | 373.83 | 33.23 | 44.94 | 392.05 | 170.78 | 92.27 | 179.71 | 355.98 | 249.88 | 108.67 | 28.72 | 2.56 | 0.05 | 34.41 | 183.17 | 69 | 3.14 | 68.89 |
10 | 上海 | 3914.88 | 236.11 | 7.61 | 206.11 | 549.24 | 218.50 | 68.80 | 417.50 | 190.03 | 51.62 | 579.29 | 161.54 | 137.04 | 379.53 | 51.13 | 13.81 | 11.86 | 82.46 | 16.90 | 55.30 | 480.51 | ||
11 | 江苏 | 6221.72 | 748.45 | 18.46 | 371.40 | 1093.22 | 213.40 | 116.86 | 481.65 | 349.86 | 170.37 | 812.06 | 618.13 | 391.69 | 294.39 | 128.56 | 27.45 | 1.59 | 48.16 | 118.92 | 29.34 | 4.57 | 183.18 | |
12 | 浙江 | 3842.59 | 471.55 | 7.54 | 290.87 | 751.42 | 143.90 | 85.09 | 291.82 | 278.98 | 78.11 | 338.43 | 373.32 | 273.99 | 137.81 | 96.97 | 5.78 | 2.80 | 30.43 | 58.05 | 13.11 | 3.20 | 109.42 | |
13 | 安徽 | 3302.99 | 345.34 | 5.57 | 128.58 | 564.71 | 77.03 | 62.35 | 392.98 | 277.23 | 81.96 | 280.76 | 351.87 | 219.59 | 125.40 | 65.63 | 8.06 | 40.87 | 161.22 | 31.21 | 11.90 | 70.76 | ||
14 | 福建 | 2198.18 | 247.47 | 4.76 | 144.74 | 406.73 | 40.48 | 35.86 | 184.92 | 159.30 | 37.95 | 146.24 | 207.89 | 240.05 | 95.71 | 49.63 | 1.58 | 25.25 | 56.45 | 13.78 | 2.98 | 96.41 | ||
15 | 江西 | 2534.60 | 258.00 | 5.03 | 123.90 | 474.43 | 21.32 | 39.66 | 272.75 | 196.32 | 43.76 | 125.34 | 287.99 | 218.05 | 157.36 | 38.16 | 2.06 | 28.51 | 108.28 | 47.11 | 12.51 | 74.07 | ||
16 | 山东 | 5002.07 | 618.48 | 15.13 | 274.69 | 1047.90 | 108.62 | 91.57 | 501.54 | 360.36 | 113.95 | 401.70 | 564.00 | 294.91 | 184.41 | 119.34 | 21.17 | 0.52 | 68.15 | 67.56 | 35.78 | 32.13 | 80.18 | |
17 | 河南 | 4248.82 | 559.02 | 5.67 | 204.80 | 857.14 | 56.59 | 57.54 | 547.96 | 361.48 | 95.60 | 191.30 | 480.48 | 281.21 | 126.55 | 67.60 | 16.83 | 3.42 | 63.54 | 142.64 | 39.85 | 26.41 | 63.17 | |
18 | 湖北 | 3214.74 | 394.95 | 2.51 | 188.40 | 488.16 | 44.19 | 47.09 | 449.29 | 247.30 | 101.11 | 161.33 | 376.23 | 255.18 | 112.46 | 52.52 | 7.28 | 1.21 | 45.39 | 103.06 | 35.34 | 12.68 | 89.05 | |
19 | 湖南 | 3520.76 | 466.74 | 0.01 | 9.56 | 174.42 | 540.83 | 41.96 | 44.87 | 484.44 | 256.76 | 85.26 | 276.00 | 394.26 | 301.31 | 119.88 | 53.46 | 3.87 | 49.88 | 126.76 | 29.70 | 13.47 | 47.31 | |
20 | 广东 | 6712.40 | 807.41 | 0.04 | 13.04 | 569.85 | 1227.87 | 203.92 | 170.56 | 548.65 | 433.75 | 232.62 | 518.16 | 420.34 | 533.40 | 147.12 | 91.88 | 14.61 | 18.61 | 53.77 | 146.42 | 25.72 | 152.31 | 382.36 |
21 | 广西 | 2545.28 | 322.18 | 8.21 | 139.44 | 456.89 | 28.25 | 37.48 | 250.64 | 232.88 | 53.90 | 118.73 | 314.86 | 248.98 | 88.81 | 23.69 | 8.04 | 39.49 | 106.85 | 13.66 | 12.75 | 39.56 | ||
22 | 海南 | 778.80 | 82.02 | 0.96 | 4.46 | 53.23 | 127.27 | 9.83 | 16.60 | 94.04 | 50.30 | 23.97 | 39.57 | 105.63 | 53.19 | 21.44 | 10.66 | 0.25 | 11.29 | 34.52 | 1.95 | 3.88 | 33.74 | |
23 | 重庆 | 2570.24 | 224.58 | 8.88 | 124.93 | 318.70 | 25.04 | 31.16 | 338.76 | 143.70 | 100.81 | 394.46 | 198.91 | 186.23 | 187.81 | 40.22 | 12.63 | 0.01 | 44.02 | 157.47 | 15.61 | 2.58 | 13.73 | |
24 | 四川 | 4674.92 | 485.11 | 10.72 | 246.07 | 684.66 | 45.75 | 87.35 | 645.79 | 372.96 | 115.80 | 268.56 | 545.70 | 364.89 | 157.77 | 63.25 | 11.19 | 132.49 | 57.28 | 231.26 | 38.06 | 23.86 | 86.39 | |
25 | 贵州 | 2249.40 | 307.21 | 4.15 | 117.31 | 376.86 | 21.68 | 35.31 | 194.78 | 173.26 | 55.45 | 65.52 | 278.47 | 305.16 | 55.18 | 19.81 | 0.44 | 32.72 | 121.65 | 8.31 | 13.35 | 62.77 | ||
26 | 云南 | 2929.60 | 282.05 | 0.23 | 6.32 | 165.32 | 483.00 | 28.30 | 45.34 | 386.50 | 236.98 | 95.86 | 122.96 | 409.80 | 276.19 | 50.40 | 33.72 | 7.51 | 0.01 | 41.60 | 151.71 | 8.20 | 53.37 | 44.23 |
27 | 西藏 | 758.11 | 95.94 | 1.10 | 1.14 | 51.76 | 77.81 | 3.38 | 18.91 | 57.68 | 35.30 | 16.05 | 21.66 | 126.53 | 77.88 | 49.60 | 5.52 | 0.37 | 0.08 | 6.21 | 60.79 | 1.74 | 48.68 | |
28 | 陕西 | 2930.81 | 341.32 | 3.36 | 128.06 | 529.46 | 29.01 | 61.27 | 365.43 | 197.61 | 96.13 | 147.13 | 333.79 | 313.42 | 76.27 | 41.59 | 5.07 | 2.90 | 44.67 | 150.28 | 18.97 | 10.57 | 34.51 | |
29 | 甘肃 | 1791.24 | 174.92 | 1.85 | 80.31 | 284.33 | 13.22 | 33.07 | 279.22 | 143.18 | 84.99 | 65.88 | 237.66 | 158.91 | 29.31 | 16.95 | 2.47 | 8.77 | 41.06 | 92.58 | 9.24 | 8.83 | 24.49 | |
30 | 青海 | 967.47 | 65.40 | 0.67 | 35.10 | 130.11 | 3.76 | 14.32 | 163.57 | 47.44 | 41.76 | 27.98 | 104.74 | 148.58 | 25.37 | 10.91 | 6.98 | 0.09 | 16.12 | 106.15 | 4.75 | 8.19 | 5.49 | |
31 | 宁夏 | 705.91 | 51.96 | 0.85 | 35.12 | 103.02 | 7.87 | 13.94 | 71.95 | 41.09 | 35.23 | 81.51 | 112.19 | 41.54 | 20.54 | 16.18 | 6.95 | 7.11 | 33.86 | 2.72 | 7.17 | 15.13 | ||
32 | 新疆 | 2284.49 | 245.36 | 0.26 | 3.77 | 146.28 | 399.80 | 26.43 | 47.70 | 201.64 | 132.43 | 53.67 | 140.10 | 297.59 | 206.83 | 47.60 | 25.24 | 9.00 | 0.85 | 35.06 | 170.04 | 17.91 | 9.23 | 67.70 |
R代码如下:
# get the header from raw html
### the pattern
head_index <- which(unlist(str_detect(tbl_raw$X1, "地方合计")))-1
### be careful the extra case: one line header or two lines header ?
head_index <- head_index
### get the chn names
vars_index <- tbl_raw %>%
.[head_index,] %>%
gsub("(\u00A0){1,20}", "&", .) %>%
str_split(., pattern = "&") %>%
unlist() %>%
c("序号", .)
# rename header
tbl_chn <- tbl_rows %>%
rename_at(all_of(names(.)), ~all_of(vars_index))
获得的列名称的最终表格如下:
序号 | 地区 | 一般预算支出 | 一般公共服务 | 外交 | 国防 | 公共安全 | 教育 | 科学技术 | 文化体育与传媒 | 社会保障和就业 | 医疗卫生 | 节能保护 | 城乡社区事务 | 农林水事务 | 交通运输 | 资源勘探电力信息等事务 | 商业服务业等事务 | 金融监管等事务支出 | 地震灾后恢复重建支出 | 国土资源气象等事务 | 住房保障支出 | 粮油物资储备管理事务 | 国债还本付息支出 | 其他支出 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 地方合计 | 92733.68 | 10084.77 | 2.75 | 198.29 | 5267.26 | 15498.28 | 1885.88 | 1704.64 | 10606.92 | 6358.19 | 2566.79 | 7608.93 | 9520.99 | 7166.69 | 3547.26 | 1394.79 | 235.34 | 174.45 | 1289.74 | 3491.87 | 729.49 | 564.12 | 2836.25 |
2 | 北京 | 3245.23 | 261.38 | 6.59 | 218.68 | 520.08 | 183.07 | 87.01 | 354.88 | 225.49 | 94.51 | 339.27 | 187.34 | 199.12 | 170.35 | 41.06 | 2.09 | 0.39 | 9.88 | 54.56 | 5.14 | 284.34 | ||
3 | 天津 | 1796.33 | 117.81 | 1.01 | 100.41 | 302.32 | 60.17 | 29.76 | 168.34 | 90.53 | 32.24 | 485.42 | 91.78 | 97.37 | 75.86 | 20.55 | 1.92 | 25.31 | 20.82 | 2.18 | 72.54 | |||
4 | 河北 | 3537.39 | 414.93 | 11.55 | 200.98 | 652.11 | 33.22 | 50.45 | 426.23 | 302.75 | 105.48 | 239.37 | 366.10 | 261.36 | 82.86 | 54.17 | 2.08 | 73.68 | 142.56 | 25.14 | 6.61 | 85.75 | ||
5 | 山西 | 2363.85 | 251.58 | 4.41 | 130.31 | 421.79 | 27.17 | 48.17 | 321.60 | 159.62 | 82.18 | 142.33 | 241.45 | 178.93 | 44.84 | 22.99 | 8.57 | 0.01 | 134.76 | 84.85 | 17.47 | 3.07 | 37.77 | |
6 | 内蒙古 | 2989.21 | 304.53 | 0.06 | 4.46 | 145.88 | 390.69 | 28.21 | 68.78 | 363.97 | 164.59 | 117.55 | 301.40 | 391.68 | 281.48 | 71.04 | 25.82 | 4.30 | 70.25 | 140.35 | 59.70 | 10.22 | 44.23 | |
7 | 辽宁 | 3905.85 | 415.23 | 11.43 | 210.29 | 544.09 | 87.20 | 68.60 | 657.36 | 182.07 | 74.20 | 442.58 | 329.20 | 220.53 | 229.09 | 56.39 | 13.32 | 0.05 | 74.04 | 152.44 | 36.43 | 25.30 | 76.00 | |
8 | 吉林 | 2201.74 | 231.40 | 4.48 | 118.81 | 319.82 | 21.18 | 44.25 | 298.99 | 143.87 | 102.42 | 154.17 | 255.57 | 149.79 | 73.82 | 22.46 | 7.12 | 0.60 | 24.97 | 124.15 | 55.45 | 34.52 | 13.90 | |
9 | 黑龙江 | 2794.08 | 256.37 | 0.09 | 5.13 | 141.22 | 373.83 | 33.23 | 44.94 | 392.05 | 170.78 | 92.27 | 179.71 | 355.98 | 249.88 | 108.67 | 28.72 | 2.56 | 0.05 | 34.41 | 183.17 | 69 | 3.14 | 68.89 |
10 | 上海 | 3914.88 | 236.11 | 7.61 | 206.11 | 549.24 | 218.50 | 68.80 | 417.50 | 190.03 | 51.62 | 579.29 | 161.54 | 137.04 | 379.53 | 51.13 | 13.81 | 11.86 | 82.46 | 16.90 | 55.30 | 480.51 | ||
11 | 江苏 | 6221.72 | 748.45 | 18.46 | 371.40 | 1093.22 | 213.40 | 116.86 | 481.65 | 349.86 | 170.37 | 812.06 | 618.13 | 391.69 | 294.39 | 128.56 | 27.45 | 1.59 | 48.16 | 118.92 | 29.34 | 4.57 | 183.18 | |
12 | 浙江 | 3842.59 | 471.55 | 7.54 | 290.87 | 751.42 | 143.90 | 85.09 | 291.82 | 278.98 | 78.11 | 338.43 | 373.32 | 273.99 | 137.81 | 96.97 | 5.78 | 2.80 | 30.43 | 58.05 | 13.11 | 3.20 | 109.42 | |
13 | 安徽 | 3302.99 | 345.34 | 5.57 | 128.58 | 564.71 | 77.03 | 62.35 | 392.98 | 277.23 | 81.96 | 280.76 | 351.87 | 219.59 | 125.40 | 65.63 | 8.06 | 40.87 | 161.22 | 31.21 | 11.90 | 70.76 | ||
14 | 福建 | 2198.18 | 247.47 | 4.76 | 144.74 | 406.73 | 40.48 | 35.86 | 184.92 | 159.30 | 37.95 | 146.24 | 207.89 | 240.05 | 95.71 | 49.63 | 1.58 | 25.25 | 56.45 | 13.78 | 2.98 | 96.41 | ||
15 | 江西 | 2534.60 | 258.00 | 5.03 | 123.90 | 474.43 | 21.32 | 39.66 | 272.75 | 196.32 | 43.76 | 125.34 | 287.99 | 218.05 | 157.36 | 38.16 | 2.06 | 28.51 | 108.28 | 47.11 | 12.51 | 74.07 | ||
16 | 山东 | 5002.07 | 618.48 | 15.13 | 274.69 | 1047.90 | 108.62 | 91.57 | 501.54 | 360.36 | 113.95 | 401.70 | 564.00 | 294.91 | 184.41 | 119.34 | 21.17 | 0.52 | 68.15 | 67.56 | 35.78 | 32.13 | 80.18 | |
17 | 河南 | 4248.82 | 559.02 | 5.67 | 204.80 | 857.14 | 56.59 | 57.54 | 547.96 | 361.48 | 95.60 | 191.30 | 480.48 | 281.21 | 126.55 | 67.60 | 16.83 | 3.42 | 63.54 | 142.64 | 39.85 | 26.41 | 63.17 | |
18 | 湖北 | 3214.74 | 394.95 | 2.51 | 188.40 | 488.16 | 44.19 | 47.09 | 449.29 | 247.30 | 101.11 | 161.33 | 376.23 | 255.18 | 112.46 | 52.52 | 7.28 | 1.21 | 45.39 | 103.06 | 35.34 | 12.68 | 89.05 | |
19 | 湖南 | 3520.76 | 466.74 | 0.01 | 9.56 | 174.42 | 540.83 | 41.96 | 44.87 | 484.44 | 256.76 | 85.26 | 276.00 | 394.26 | 301.31 | 119.88 | 53.46 | 3.87 | 49.88 | 126.76 | 29.70 | 13.47 | 47.31 | |
20 | 广东 | 6712.40 | 807.41 | 0.04 | 13.04 | 569.85 | 1227.87 | 203.92 | 170.56 | 548.65 | 433.75 | 232.62 | 518.16 | 420.34 | 533.40 | 147.12 | 91.88 | 14.61 | 18.61 | 53.77 | 146.42 | 25.72 | 152.31 | 382.36 |
21 | 广西 | 2545.28 | 322.18 | 8.21 | 139.44 | 456.89 | 28.25 | 37.48 | 250.64 | 232.88 | 53.90 | 118.73 | 314.86 | 248.98 | 88.81 | 23.69 | 8.04 | 39.49 | 106.85 | 13.66 | 12.75 | 39.56 | ||
22 | 海南 | 778.80 | 82.02 | 0.96 | 4.46 | 53.23 | 127.27 | 9.83 | 16.60 | 94.04 | 50.30 | 23.97 | 39.57 | 105.63 | 53.19 | 21.44 | 10.66 | 0.25 | 11.29 | 34.52 | 1.95 | 3.88 | 33.74 | |
23 | 重庆 | 2570.24 | 224.58 | 8.88 | 124.93 | 318.70 | 25.04 | 31.16 | 338.76 | 143.70 | 100.81 | 394.46 | 198.91 | 186.23 | 187.81 | 40.22 | 12.63 | 0.01 | 44.02 | 157.47 | 15.61 | 2.58 | 13.73 | |
24 | 四川 | 4674.92 | 485.11 | 10.72 | 246.07 | 684.66 | 45.75 | 87.35 | 645.79 | 372.96 | 115.80 | 268.56 | 545.70 | 364.89 | 157.77 | 63.25 | 11.19 | 132.49 | 57.28 | 231.26 | 38.06 | 23.86 | 86.39 | |
25 | 贵州 | 2249.40 | 307.21 | 4.15 | 117.31 | 376.86 | 21.68 | 35.31 | 194.78 | 173.26 | 55.45 | 65.52 | 278.47 | 305.16 | 55.18 | 19.81 | 0.44 | 32.72 | 121.65 | 8.31 | 13.35 | 62.77 | ||
26 | 云南 | 2929.60 | 282.05 | 0.23 | 6.32 | 165.32 | 483.00 | 28.30 | 45.34 | 386.50 | 236.98 | 95.86 | 122.96 | 409.80 | 276.19 | 50.40 | 33.72 | 7.51 | 0.01 | 41.60 | 151.71 | 8.20 | 53.37 | 44.23 |
27 | 西藏 | 758.11 | 95.94 | 1.10 | 1.14 | 51.76 | 77.81 | 3.38 | 18.91 | 57.68 | 35.30 | 16.05 | 21.66 | 126.53 | 77.88 | 49.60 | 5.52 | 0.37 | 0.08 | 6.21 | 60.79 | 1.74 | 48.68 | |
28 | 陕西 | 2930.81 | 341.32 | 3.36 | 128.06 | 529.46 | 29.01 | 61.27 | 365.43 | 197.61 | 96.13 | 147.13 | 333.79 | 313.42 | 76.27 | 41.59 | 5.07 | 2.90 | 44.67 | 150.28 | 18.97 | 10.57 | 34.51 | |
29 | 甘肃 | 1791.24 | 174.92 | 1.85 | 80.31 | 284.33 | 13.22 | 33.07 | 279.22 | 143.18 | 84.99 | 65.88 | 237.66 | 158.91 | 29.31 | 16.95 | 2.47 | 8.77 | 41.06 | 92.58 | 9.24 | 8.83 | 24.49 | |
30 | 青海 | 967.47 | 65.40 | 0.67 | 35.10 | 130.11 | 3.76 | 14.32 | 163.57 | 47.44 | 41.76 | 27.98 | 104.74 | 148.58 | 25.37 | 10.91 | 6.98 | 0.09 | 16.12 | 106.15 | 4.75 | 8.19 | 5.49 | |
31 | 宁夏 | 705.91 | 51.96 | 0.85 | 35.12 | 103.02 | 7.87 | 13.94 | 71.95 | 41.09 | 35.23 | 81.51 | 112.19 | 41.54 | 20.54 | 16.18 | 6.95 | 7.11 | 33.86 | 2.72 | 7.17 | 15.13 | ||
32 | 新疆 | 2284.49 | 245.36 | 0.26 | 3.77 | 146.28 | 399.80 | 26.43 | 47.70 | 201.64 | 132.43 | 53.67 | 140.10 | 297.59 | 206.83 | 47.60 | 25.24 | 9.00 | 0.85 | 35.06 | 170.04 | 17.91 | 9.23 | 67.70 |