自动化数据抓取技术(V)：APACHA人机验证机制

Last updated on Dec 5, 2020 2 min read R

APACHA人机验证的类型
视觉识别OCR技术
- 传统的tesseract识别包
- google vision 云平台API

APACHA人机验证的类型

APACHA是一种人机验证机制，对于网络爬虫而言，大家更熟悉的是网站采用APACHA机制来设置“防爬虫”门槛，也即各类验证码、滑块验证。这种验证机制比较成熟和严谨，应用场景十分广泛。如何有效识别和破解成为现实一大难题。

视觉识别OCR技术

实际上视觉识别已经发展到多个领域，包括图片标记、面孔和地标检测、光学字符识别 (OCR)等。

传统的tesseract识别包

tesseract包专门用于从图片中提取文本github repo。

优点：独立算法，简单快速，本地即可运行，无需联网。
缺点：算法比较老旧，识别准确率不太高。

具体代码示例如下：

#install.packages("tesseract")
library("tesseract")
dir_gray <- here::here("pic", "zhuyun", "valid-img-gray.png")
eng <- tesseract("eng")
txt <- tesseract::ocr(image = dir_gray, engine = eng) %>% str_extract("\\d{4}")

google vision 云平台API

google cloud platform 提供了Vision API，可以完成各类视觉识别任务。

优点：识别技术强大，识别准确率高。
缺点：（国内）需要网络和网速支持。另外就是有使用量的限制，需要支付结算进行扩容使用量。

R用户的具体实现：

1.申请google vision API接入授权。具体：

登陆google开发者控制台（Google’s developer console）进行申请和授权。
创建project，并申请开通Vision的API服务。
设置OAuth 2.0客户端和OAuth同意屏幕。

2.下载安装RoogleVision包（github repo）。

具体代码示例如下：

#install.packages("RoogleVision", repos = c(getOption("repos"), "http://cloudyr.github.io/drat"))
if (!require("devtools")) {
    install.packages("ghit")
}
devtools::install_github("cloudyr/RoogleVision")

library("RoogleVision")

### plugin your credentials
options("googleAuthR.client_id" = keyring::key_get("id", keyring = "gg-vision2"))
options("googleAuthR.client_secret" = keyring::key_get("secret", keyring = "gg-vision2"))

## use the fantastic Google Auth R package
### define scope!
options("googleAuthR.scopes.selected" = c("https://www.googleapis.com/auth/cloud-platform"))

googleAuthR::gar_auth()


#Basic: you can provide both, local as well as online images:
txt <- getGoogleVisionResponse(imagePath="pic/zhuyun/valid-img-gray.png", feature="TEXT_DETECTION",numResults = 1)

webscrape

Hu Huaping

PhD on Agricultural Economic and Management

My research interests include Data Science, Statistics, Agricultural Economics and Management.