使用R接受cookie以下载PDF文件

I'm getting stuck on cookies when trying to download a PDF.

我在尝试下载PDF时遇到了问题。

For example, if I have a DOI for a PDF document on the Archaeology Data Service, it will resolve to this landing page with an embedded link in it to this pdf but which really redirects to this other link.

例如,如果我在考古数据服务上有PDF文档的DOI,它将解析到此着陆页,其中包含嵌入链接到此pdf,但它真正重定向到此其他链接。

library(httr) will handle resolving the DOI and we can extract the pdf URL from the landing page using library(XML) but I'm stuck at getting the PDF itself.

library(httr)将处理解析DOI,我们可以使用库(XML)从登陆页面中提取pdf URL但我仍然坚持获取PDF本身。

If I do this:

如果我这样做:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

then I receive a HTML file that is the same as http://archaeologydataservice.ac.uk/myads/

然后我收到一个与http://archaeologydataservice.ac.uk/myads/相同的HTML文件

Trying the answer at How to use R to download a zipped file from a SSL page that requires cookies leads me to this:

尝试如何使用R从需要cookie的SSL页面下载压缩文件的答案引导我:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content of the download to a binary file

writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

But after the POST and GET functions I simply get the HTML of the same cookie page that I got with download.file:

但是在POST和GET函数之后,我只是获得了与download.file相同的cookie页面的HTML:

> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
  Date: 2016-01-06 00:35
  Status: 200
  Content-Type: text/html;charset=UTF-8
  Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html; c...


            <title>Archaeology Data Service:  myADS</title>

            <link href="http://archaeologydataservice.ac.uk/css/u...
...

Looking at http://archaeologydataservice.ac.uk/about/Cookies it seems that the cookie situation at this site is complicated. Seems like this kind of cookie complexity is not unusual for UK data providers: automating the login to the uk data service website in R with RCurl or httr

看看http://archaeologydataservice.ac.uk/about/Cookies看来这个网站的cookie情况很复杂。对于英国数据提供商来说,这类cookie的复杂性似乎并不罕见:使用RCurl或httr自动登录R中的英国数据服务网站

How can I use R to get past the cookies on this website?

如何使用R来浏览本网站上的cookie?

2 个解决方案

#1

Your plea on rOpenSci has been heard!

你已经听到了对rOpenSci的请求!

There's lots of javascript between those pages that makes it somewhat annoying to try to decipher via httr + rvest. Try RSelenium. This worked on OS X 10.11.2, R 3.2.3 & Firefox loaded.

这些页面之间有很多javascript,这使得尝试通过httr + rvest解密有点烦人。试试RSelenium。这适用于OS X 10.11.2,R 3.2.3和Firefox加载。

library(RSelenium)

# check if a sever is present, if not, get a server
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to complete. The R console will not be busy while it downloads, so it is easy to close the session accidently, before the download has completed.

现在等待下载完成。 R控制台在下载时不会很忙,因此在下载完成之前很容易意外关闭会话。

# close the session
dr$close()

#2

This answer came from John Harrison by email, posted here at his request:

这个答案来自约翰哈里森的电子邮件,在他的要求下张贴在这里:

This will allow you to download the PDF:

这将允许您下载PDF:

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")

Here's a longer version showing his working

这是一个展示他工作的较长版本

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
appData <- getURL(appURL, curl = curl)

# get the necessary elements for the POST that is initiated when the ACCEPT button is pressed

doc <- htmlParse(appData)
appAttrs <- doc["//input", fun = xmlAttrs]
postData <- lapply(appAttrs, function(x){data.frame(name = x[["name"]], value = x[["value"]]
                                                    , stringsAsFactors = FALSE)})
postData <- do.call(rbind, postData)

# post your acceptance
postURL <- "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid="
# get jsessionid
jsessionid <- unlist(strsplit(getCurlInfo(curl)$cookielist[1], "\t"))[7]

searchData <- postForm(paste0(postURL, jsessionid), curl = curl,
                       "j_id10" = "j_id10",
                       from = postData[postData$name == "from", "value"],
                       "javax.faces.ViewState" = postData[postData$name == "javax.faces.ViewState", "value"],
                       "j_id10:_idcl" = "j_id10:agreeButton"
                       , binary = TRUE
)
con <- file("test.pdf", open = "wb")
writeBin(searchData, con)
close(con)


Pressing the ACCEPT button on the page you gave initiates a POST to "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid=......" via some javascript.
This post then redirects to the page with the pdf having given some additional cookies.

Checking our cookies we see:

> getCurlInfo(curl)$cookielist
[1] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e" 
[2] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[3] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"          

so it would probably be sufficient to set that last cookie to start with (indicating we accept copyright)

#1