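Automated Data Scraping Techniques (III): Handling Web Page Elements
Common Problem Scenarios and How to Handle Them
Garbled Chinese Text
The garbled Chinese text problem shows up again here; see the notes on assorted encoding problems. For a detailed explanation of encodings, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.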
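1. The shortcut: read the table straight from the page. Whatever I tried, the Chinese text came out garbled, so I had to give up on this approach.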
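# look for the table element (remDr is assumed to be an open RSelenium remote driver)
tableElem <- remDr$findElement(using = "id", value = "courseTable")
# outerHTML of the table as a single string
txt <- tableElem$getElementAttribute("outerHTML")[[1]]
# parse the first table into a data frame; the Chinese text comes out garbled here
table <- XML::readHTMLTable(txt, header = FALSE, as.data.frame = TRUE)[[1]]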
2. The brute-force approach: extract the page elements one by one. It takes considerably more effort, but it always works.
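library(rvest)  # read_html(), html_nodes(), html_text(), %>%
library(tidyr)  # separate()
# parse the table HTML once, then scrape the date (2nd) and room (4th) columns
tbody <- txt %>% read_html() %>% html_nodes("tbody")
v_date <- tbody %>% html_nodes("td:nth-child(2)") %>% html_text()
v_room <- tbody %>% html_nodes("td:nth-child(4)") %>% html_text()
# tidy data.frame: split the date string into four fields on single spaces
info <- data.frame(date = v_date, room = v_room, stringsAsFactors = FALSE) %>%
  separate(col = "date", into = c("date", "week", "weekday", "slot"), sep = " ")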
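To make the splitting step concrete, here is a minimal sketch of what separate() does to a space-delimited column. The sample values are invented for illustration; the real cell format on the page may differ, in which case the sep argument would need adjusting.

```r
library(tidyr)

# Invented sample values (assumption): four space-separated fields per cell
demo <- data.frame(
  date = c("2019-03-04 Week2 Mon 1-2", "2019-03-05 Week2 Tue 3-4"),
  room = c("Room 101", "Room 202"),
  stringsAsFactors = FALSE
)

# Split the single "date" column into four columns on each space
demo_tidy <- separate(
  demo,
  col  = "date",
  into = c("date", "week", "weekday", "slot"),
  sep  = " "
)

print(demo_tidy)
```

The new columns take the place of the original date column, and room stays at the end.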
Selecting Nodes
How to select several tr nodes and extract the text of the n-th td element under each of them. See the answer found online.
txt %>%
  read_html() %>%
  html_nodes("tbody") %>%
  html_nodes("td:nth-child(2)") %>%
  html_text()
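The selector can be tried without a browser session. The sketch below runs the same kind of td:nth-child(n) query against a small inline HTML string; the table contents are hypothetical and only stand in for the outerHTML of the real table.

```r
library(rvest)

# Hypothetical stand-in for the scraped outerHTML (assumption, not the real page)
html <- '<table id="courseTable"><tbody>
  <tr><td>1</td><td>2019-03-04 Week2 Mon 1-2</td><td>Math</td><td>Room 101</td></tr>
  <tr><td>2</td><td>2019-03-05 Week2 Tue 3-4</td><td>Physics</td><td>Room 202</td></tr>
</tbody></table>'

page <- read_html(html)

# Every table row inside tbody
rows <- html_nodes(page, "tbody tr")

# For each row, the td that is the 2nd (or 4th) child of that row
second_cells <- html_text(html_node(rows, "td:nth-child(2)"))
fourth_cells <- html_text(html_node(rows, "td:nth-child(4)"))

print(second_cells)
print(fourth_cells)
```

Selecting the rows first and then calling html_node() on them yields exactly one value per row, which keeps the resulting vectors aligned if they are later combined into a data frame.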
Viewing JSON Data with the Chrome Browser
For details, see the reference; the steps it gives are:
Using Chrome: Right-click > Inspect; navigate to Network tab > type in .json > Search > Refresh Site (to catch calls made prior)
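Once a JSON response has been located in the Network tab, it can be read into R. The following is a minimal sketch that assumes the jsonlite package; the payload is a made-up example standing in for a response body copied from the browser, and its field names are hypothetical.

```r
library(jsonlite)

# Hypothetical payload (assumption): an array of objects, as many endpoints return
payload <- '[
  {"date": "2019-03-04", "room": "Room 101"},
  {"date": "2019-03-05", "room": "Room 202"}
]'

# fromJSON() simplifies an array of flat objects into a data frame
courses <- fromJSON(payload)

print(courses)
```

fromJSON() also accepts a URL or a file path in place of the string, so the request URL shown in the Network tab could be passed directly, provided the endpoint does not require cookies or other headers from the browser session.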