Automated Data Scraping Techniques (III): Handling Web Page Elements

Last updated on Dec 3, 2020 1 min read R

Common Problem Scenarios and How to Handle Them

Garbled Chinese Text

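The garbled Chinese text (mojibake) problem has come up again; see "Various Garbled-Text Problems" (各种乱码问题). For a detailed explanation of character encodings, see "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text".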
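Garbled text of this kind usually appears when bytes written in one encoding are decoded as if they were in another. The following is a minimal base R sketch of that mechanism, not a fix for the page discussed here: the sample string and the choice of GBK as the source encoding are assumptions made for illustration, and it assumes the platform's `iconv` supports GBK.

```r
# Illustrative sketch: reproduce and repair mojibake with base R only.
# The sample text and the GBK encoding are assumptions for demonstration.
original <- "\u4e2d\u6587"  # the word for "Chinese", written with escapes

# Encode the text as GBK bytes, as a legacy Chinese web page might serve it.
gbk_bytes <- iconv(original, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]

# Wrong: decode the GBK bytes as if they were latin1; the result is garbled.
garbled <- iconv(list(gbk_bytes), from = "latin1", to = "UTF-8")

# Right: declare the real source encoding when decoding.
repaired <- iconv(list(gbk_bytes), from = "GBK", to = "UTF-8")

print(garbled)
print(repaired)
```

The same idea applies when parsing HTML: the parser needs to be told, or to detect correctly, which encoding the bytes are really in.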

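1. The shortcut: scrape the whole HTML table directly. Whatever I tried, the Chinese text came out garbled, so I had to give up on this approach.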

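# look for the table element (remDr is an already opened RSelenium remote driver)
tableElem <- remDr$findElement(using = "id", value = "courseTable")

# take the table's HTML source as a single string
txt <- tableElem$getElementAttribute("outerHTML")[[1]]

# parse it into a data frame; the Chinese text still comes out garbled here
table <- XML::readHTMLTable(txt, header = FALSE, as.data.frame = TRUE)[[1]]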

2. The brute-force route: scrape the page elements one by one. It takes considerably more effort, but it always works.

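library(rvest)  # read_html(), html_nodes(), html_text(), %>%
library(tidyr)  # separate()

# parse the table HTML once and reuse the parsed document
page <- read_html(txt)

# scrape the date (2nd column) and room (4th column)
v_date <- page %>% html_nodes("tbody") %>% html_nodes("td:nth-child(2)") %>% html_text()

v_room <- page %>% html_nodes("tbody") %>% html_nodes("td:nth-child(4)") %>% html_text()

# tidy data.frame: split the space-separated date string into four columns
info <- data.frame(date = v_date, room = v_room) %>%
  separate(col = "date", into = c("date", "week", "weekday", "slot"), sep = " ")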
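To make the splitting step concrete, here is a small sketch of `tidyr::separate()` run on made-up values. The sample strings and room names are hypothetical; the real page's date format may differ, in which case `sep` and `into` would need adjusting.

```r
library(tidyr)

# Hypothetical scraped values; the real page's format may differ.
v_date <- c("2020-12-03 week14 Thu 1-2", "2020-12-04 week14 Fri 3-4")
v_room <- c("A101", "B202")

info <- data.frame(date = v_date, room = v_room, stringsAsFactors = FALSE)

# Split the single space-separated string into four columns.
info <- separate(info, col = "date",
                 into = c("date", "week", "weekday", "slot"), sep = " ")
print(info)
```

The four new columns take the place of the original `date` column, and `room` is left as it was.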

Selecting Nodes

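How do you select multiple tr nodes on a page and extract the text of the n-th td element under each of them? See the answer found online.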

txt %>%
  read_html() %>%
  html_nodes("tbody") %>%
  html_nodes("td:nth-child(2)") %>%
  html_text()
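The selector can be tried on a small inline table to see what it returns. This is only a sketch: the markup below is hypothetical and stands in for the real page source. It uses `html_nodes()`, which newer rvest releases also offer under the name `html_elements()`.

```r
library(rvest)  # also provides read_html() and %>%

# Hypothetical table markup standing in for the real page source.
html <- '<table><tbody>
  <tr><td>1</td><td>Mon</td><td>Math</td><td>Room A101</td></tr>
  <tr><td>2</td><td>Tue</td><td>Stats</td><td>Room B202</td></tr>
</tbody></table>'

page <- read_html(html)

# td:nth-child(n) matches every <td> that is the n-th child of its parent <tr>,
# so a single selector returns the n-th cell of every row.
second_col <- page %>% html_nodes("tbody td:nth-child(2)") %>% html_text()
fourth_col <- page %>% html_nodes("tbody td:nth-child(4)") %>% html_text()

print(second_col)
print(fourth_col)
```

Because `nth-child` counts position within each parent row, the result has one entry per row rather than only the n-th cell of the whole table.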

Viewing JSON Data with the Chrome Browser

See the reference for details.

Using Chrome: Right-click > Inspect; navigate to Network tab > type in .json > Search > Refresh Site (to catch calls made prior)
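Once a JSON request has been spotted in the Network tab, its response can often be read straight into R, skipping the HTML entirely. The sketch below assumes the jsonlite package and uses a hypothetical payload in place of a real response; in practice the request URL copied from the Network tab could be passed to `fromJSON()` instead of the string.

```r
library(jsonlite)

# Hypothetical payload; a real response would come from the URL
# found in the Network tab.
json_txt <- '[{"date":"2020-12-03","room":"A101"},
              {"date":"2020-12-04","room":"B202"}]'

# An array of objects with the same keys is simplified to a data frame.
courses <- fromJSON(json_txt)
print(courses)
```

JSON responses are usually delivered as UTF-8, so this route may also sidestep the garbled-text problem, though that depends on the site.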

Hu Huaping

PhD in Agricultural Economics and Management

My research interests include Data Science, Statistics, Agricultural Economics and Management.