Automated Data Scraping Techniques (III): Handling Web Page Elements

Common problem scenarios and how to handle them

Garbled Chinese text

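The garbled Chinese text problem shows up again here; see the post on various garbled-text problems. For a detailed explanation of encodings, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.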
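To make the failure mode concrete, here is a minimal base-R sketch of how garbled text typically arises: the bytes are correct UTF-8, but they get decoded as a different encoding. The sample string and all variable names are illustrative assumptions and do not come from the scraped page. The sketch also assumes a platform where `iconv()` treats `latin1` as ISO-8859-1 (typical on Linux and macOS); results may differ on Windows.

```r
# Illustrative three-character Chinese sample, written with \u escapes so the
# script does not depend on the encoding of the source file
original <- enc2utf8("\u8bfe\u7a0b\u8868")

# In UTF-8 each of these characters takes three bytes
utf8_bytes <- charToRaw(original)

# Mis-decoding: read the UTF-8 bytes as if they were Latin-1 text.
# Every byte becomes its own (wrong) character.
garbled <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# Repair: map the wrong characters back to the original bytes,
# then declare that those bytes are UTF-8
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"

print(nchar(original))                 # characters in the sample
print(length(utf8_bytes))              # bytes in the sample
print(identical(repaired, original))   # did the round trip recover the text?
```

This kind of repair only works when the wrong decoding lost no bytes. If a tool has already replaced unknown bytes with placeholders, the original text cannot be recovered this way.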

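1. The shortcut: scrape the HTML table directly. Whatever I tried, the result came out garbled, so I had to give up on this approach.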

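# Look for the table element (remDr is an already-open RSelenium remote driver)
tableElem <- remDr$findElement(using = "id", value = "courseTable")

# Grab the table's full HTML as a string
txt <- tableElem$getElementAttribute("outerHTML")[[1]]

# Parse the HTML string straight into a data frame
table <- XML::readHTMLTable(txt, header = FALSE, as.data.frame = TRUE)[[1]]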
2. The brute-force way: scrape the page elements one by one. It is rather laborious, but it always works.
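library(rvest)   # read_html(), html_elements(), %>%
library(xml2)    # xml_text()
library(tidyr)   # separate()

# Parse the table HTML once and reuse it
doc <- read_html(txt)

# Scrape the date and room (2nd and 4th cell of each row).
# html_elements() replaces the deprecated xml_nodes(); on rvest < 1.0.0 use html_nodes()
v_date <- doc %>% html_elements("tbody") %>% html_elements("td:nth-child(2)") %>% xml_text()
v_room <- doc %>% html_elements("tbody") %>% html_elements("td:nth-child(4)") %>% xml_text()

# Tidy data.frame: split the date cell into its four space-separated parts
info <- data.frame(date = v_date, room = v_room) %>%
  separate(col = "date", into = c("date", "week", "weekday", "slot"), sep = " ")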
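The second approach can be tried without a browser session. The following is a minimal sketch, assuming a made-up table whose second cell holds four space-separated parts and whose fourth cell holds the room. The sample HTML, the cell contents, and the helper name `cell_text()` are all hypothetical, and `html_elements()` assumes rvest 1.0.0 or later.

```r
library(rvest)   # read_html(), html_elements(), html_text(), %>%
library(tidyr)   # separate()

# Hypothetical stand-in for the table's outerHTML. The \u escapes spell
# Chinese text, so the sketch does not depend on the source file's encoding.
sample_html <- paste0(
  "<table id='courseTable'><tbody>",
  "<tr><td>1</td><td>2019-03-04 \u7b2c1\u5468 \u5468\u4e00 1-2\u8282</td>",
  "<td>A</td><td>\u6559\u5b66\u697c101</td></tr>",
  "<tr><td>2</td><td>2019-03-06 \u7b2c1\u5468 \u5468\u4e09 3-4\u8282</td>",
  "<td>B</td><td>\u6559\u5b66\u697c202</td></tr>",
  "</tbody></table>"
)

doc <- read_html(sample_html)

# Hypothetical helper: text of the n-th cell in every table row
cell_text <- function(doc, n) {
  doc %>%
    html_elements(sprintf("tbody td:nth-child(%d)", n)) %>%
    html_text(trim = TRUE)
}

# Build the data frame, then split the date cell into four columns
info <- data.frame(
  date = cell_text(doc, 2),
  room = cell_text(doc, 4),
  stringsAsFactors = FALSE
) %>%
  separate(col = "date", into = c("date", "week", "weekday", "slot"), sep = " ")

print(info)
```

Extracting cell text node by node keeps the strings in the form xml2 returns them, which is one plausible reason this route avoids the garbling seen with the direct table conversion.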

Selecting nodes

How do you select and extract the text of the n-th td element under each of several tr nodes on a page? See the answer found online.

txt %>%
  read_html() %>%
  html_elements("tbody") %>%             # html_elements() replaces the deprecated xml_nodes()
  html_elements("td:nth-child(2)") %>%   # the 2nd cell of every row
  xml_text()
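One caveat with the selector above: if some rows lack the n-th cell, the result is shorter than the number of rows and no longer lines up with the other columns. A minimal sketch of a row-wise variant follows, assuming rvest 1.0.0 or later. The sample HTML and variable names are hypothetical, and the XPath form is shown only as an alternative spelling of the same selection.

```r
library(rvest)   # read_html(), html_elements(), html_element(), html_text(), %>%

# Hypothetical table in which the middle row has only one cell
sample_html <- "<table><tbody>
<tr><td>a1</td><td>a2</td><td>a3</td></tr>
<tr><td>b1</td></tr>
<tr><td>c1</td><td>c2</td><td>c3</td></tr>
</tbody></table>"

# First select the rows, then ask each row for its 2nd cell
rows <- read_html(sample_html) %>% html_elements("tbody tr")

# CSS selector: html_element() yields exactly one result per row
by_css <- rows %>%
  html_element("td:nth-child(2)") %>%
  html_text()

# XPath alternative: the second td child of each row
by_xpath <- rows %>%
  html_element(xpath = "./td[2]") %>%
  html_text()

print(by_css)
print(by_xpath)
```

Because `html_element()` returns one result per input node, a row without a second cell should come back as `NA` instead of being silently dropped, so the vector keeps one entry per row.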

Viewing JSON data with the Chrome browser

For details, see the following reference:

Using Chrome: Right-click > Inspect; navigate to Network tab > type in .json > Search > Refresh Site (to catch calls made prior) 
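Once the Network tab reveals a JSON response, it can usually be read into R directly, with no HTML parsing. The sketch below assumes the jsonlite package and uses a made-up response body; the field names and values are hypothetical. In practice the request URL copied from Chrome would be passed to `fromJSON()` in place of the string.

```r
library(jsonlite)   # fromJSON()

# Hypothetical response body, standing in for what the Network tab shows
json_txt <- '[
  {"date": "2019-03-04", "room": "101"},
  {"date": "2019-03-06", "room": "202"}
]'

# A JSON array of flat objects is simplified into a data.frame
courses <- fromJSON(json_txt)

str(courses)
```

Reading the JSON endpoint avoids scraping rendered cells altogether, which is why it is worth checking the Network tab before writing selectors.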
Hu Huaping