I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).
The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2
Here are my two attempts (replacing "username" with my username and "password" with my password):
#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))
#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))
I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.
How to use R to download a zipped file from a SSL page that requires cookies
How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?
Reading information from a password protected site
R - RCurl scrape data from a password-protected site
http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold
2 Solutions
#1
11
I don't have an account to test with, but maybe this will work:
library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com")
path <- "amember/login.php"

# Fields found in the login form.
login <- list(
  amember_login = "username",
  amember_pass = "password",
  amember_redirect_url =
    "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)
Now the response object might hold what you need. Alternatively, you may be able to query the page of interest directly after the login request; I am not sure the redirect will work, but it is a field in the web form. The handle can be reused for subsequent requests. I can't test this, but the approach works for me in many situations.
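To make the handle reuse concrete, here is a minimal sketch of logging in and then requesting the protected page with the same handle, so that the session cookie set by the login is sent along. The function name fetch_after_login is hypothetical, encode = "form" assumes the site accepts an ordinary URL-encoded form post, and the paths are taken from the URLs in the question. It is untested against the real site.

```r
# Hypothetical helper: log in through the web form, then fetch the
# protected page using the same handle (and therefore the same cookies).
fetch_after_login <- function(user, pass,
                              base = "http://subscribers.footballguys.com") {
  h <- httr::handle(base)

  # Field names come from the login form described in this answer.
  login <- list(amember_login = user, amember_pass = pass)

  # Assumption: the form is posted URL-encoded (httr defaults to multipart).
  resp <- httr::POST(handle = h, path = "amember/login.php",
                     body = login, encode = "form")
  httr::stop_for_status(resp)

  # Second request on the same handle; the login cookies are reused.
  page <- httr::GET(handle = h, path = "myfbg/myviewprojections.php",
                    query = list(projector = 2))
  httr::stop_for_status(page)

  # Return the raw HTML as a single string for later parsing.
  httr::content(page, as = "text", encoding = "UTF-8")
}
```

Calling fetch_after_login("username", "password") would return the page's HTML as text, assuming the login succeeds. Note that a 200 status alone does not prove the login worked, since the site may answer a failed login with a normal page.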
You can then extract the table with the XML package:
> readHTMLTable(content(response))[[1]][1:5,]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
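The [[1]] in the call above is needed because readHTMLTable returns a list with one entry per table it finds in the document. The sketch below shows this offline on a tiny stand-in page; the HTML string is made up for illustration and only borrows two names from the output above.

```r
# A tiny stand-in page; the real HTML would come from the login response.
html <- "<html><body><table>
<tr><th>Rank</th><th>Name</th></tr>
<tr><td>1</td><td>Peyton Manning</td></tr>
<tr><td>2</td><td>Drew Brees</td></tr>
</table></body></html>"

# Parse the string explicitly as HTML text rather than as a file name.
doc <- XML::htmlParse(html, asText = TRUE)

# readHTMLTable returns a list of data frames, one per <table> element.
tables <- XML::readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Pick the first table and show its rows.
first <- tables[[1]]
print(first)
```

If the response body arrives as text rather than as an already parsed document, parsing it with htmlParse(..., asText = TRUE) first, as here, avoids any ambiguity about how the string is interpreted.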
#2
12
You can use RSelenium. I used the development version, because it lets you run phantomjs without a Selenium server.
# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
library(XML) # provides readHTMLTable(), used below
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()
appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem <- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(tableElem$getElementAttribute("outerHTML")[[1]], header = TRUE)
> res[[1]][1:5, ]
Rank Name Tm/Bye Age Exp Cmp Att Cm% PYd Y/Att PTD Int Rsh Yd TD FantPt
1 1 Peyton Manning DEN/4 38 17 415 620 66.9 4929 7.95 43 12 24 7 0 407.15
2 2 Drew Brees NO/6 35 14 404 615 65.7 4859 7.90 37 16 22 44 1 385.35
3 3 Aaron Rodgers GB/9 31 10 364 560 65.0 4446 7.94 33 13 52 224 3 381.70
4 4 Andrew Luck IND/10 25 3 366 610 60.0 4423 7.25 27 13 62 338 2 361.95
5 5 Matthew Stafford DET/9 26 6 377 643 58.6 4668 7.26 32 19 34 102 1 358.60
Finally, when you are finished, close phantomjs:
pJS$stop()
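If the scraping code fails partway through, phantomjs would otherwise keep running. One way to guard against that is to register the cleanup with on.exit, as in the sketch below. The wrapper name with_phantom_session is hypothetical, and it assumes the phantom() helper from the RSelenium version used in this answer; later RSelenium releases may no longer provide it.

```r
# Hypothetical wrapper: start phantomjs, run a user-supplied scraping
# function, and always shut everything down afterwards, even on error.
with_phantom_session <- function(scrape) {
  pJS <- RSelenium::phantom()          # assumes phantom() is available
  on.exit(pJS$stop(), add = TRUE)      # stop phantomjs when we leave

  remDr <- RSelenium::remoteDriver(browserName = "phantomjs")
  remDr$open()
  # Close the browser session before phantomjs itself is stopped.
  on.exit(remDr$close(), add = TRUE, after = FALSE)

  scrape(remDr)
}
```

The scraping steps shown above would then go inside the function passed as scrape, for example with_phantom_session(function(remDr) { remDr$navigate(appURL); remDr$getPageSource()[[1]] }).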
If you would rather use a traditional browser such as Firefox (for example, to stick with the CRAN release of RSelenium), you would use:
RSelenium::startServer()
remDr <- remoteDriver()
........
........
remDr$closeServer()
in place of the corresponding phantomjs calls.