Downloading a file over HTTPS with download.file()

Time: 2021-08-25 13:48:21

I would like to read online data to R using download.file() as shown below.

URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(URL, destfile = "./data/data.csv", method="curl")

Someone suggested to me that I add the line setInternet2(TRUE), but it still doesn't work.

The error I get is:

Warning messages:
1: running command 'curl  "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"  -o "./data/data.csv"' had status 127 
2: In download.file(URL, destfile = "./data/data.csv", method = "curl",  :
  download had nonzero exit status

Appreciate your help.

9 Solutions

#1


34  

It might be easiest to try the RCurl package. Install the package and try the following:

# install.packages("RCurl")
library(RCurl)
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
## Or 
## x <- getURL(URL, ssl.verifypeer = FALSE)
out <- read.csv(textConnection(x))
head(out[1:6])
#   RT SERIALNO DIVISION PUMA REGION ST
# 1  H      186        8  700      4 16
# 2  H      306        8  700      4 16
# 3  H      395        8  100      4 16
# 4  H      506        8  700      4 16
# 5  H      835        8  800      4 16
# 6  H      989        8  700      4 16
dim(out)
# [1] 6496  188

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",destfile="reviews.csv",method="libcurl")

#2


10  

Here's an update as of Nov 2014. I find that setting method='curl' did the trick for me (while method='auto' does not).

For example:

# does not work
download.file(url='https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
              destfile='localfile.zip')

# does not work. this appears to be the default anyway
download.file(url='https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
              destfile='localfile.zip', method='auto')

# works!
download.file(url='https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
              destfile='localfile.zip', method='curl')

#3


4  

I've succeeded with the following code:

url = "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x = read.csv(file=url)

Note that I've changed the protocol from https to http, since the former doesn't seem to be supported by R's default download method here.

#4


3  

If you get an SSL error from getURL() when using RCurl, set these options before calling getURL(). This sets the CurlSSL settings globally.

The extended code:

install.packages("RCurl")
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))   
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)

Worked for me on Windows 7 64-bit using R 3.1.0!

#5


2  

I had exactly the same problem as the original poster; I'm also using Windows 7. I tried all the proposed solutions and they didn't work.

I resolved the problem as follows:

  1. Using RStudio instead of R console.

  2. Updating R (from 3.1.0 to 3.1.1) so that the RCurl library runs properly on it. (I'm now using R 3.1.1 32-bit, although my system is 64-bit.)

  3. I typed the URL address as https (secure connection) and with forward slashes "/" instead of backslashes "\".

  4. Setting method = "auto".

It works for me now. You should see the message:

Content type 'text/csv; charset=utf-8' length 9294 bytes
opened URL
downloaded 9294 bytes

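For reference, the steps above boil down to a call like the following (a minimal sketch; the destination filename is arbitrary, and it assumes a current R whose default method handles https):

```r
# https URL with forward slashes, default method = "auto", as in the steps above
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(URL, destfile = "ss06hid.csv", method = "auto")
```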
#6


1  

Exit status 127 means "command not found".

In your case, the curl command was not found: curl is either not installed or not on your PATH.

You need to install (or reinstall) curl. That's all. Get the latest version for your OS from http://curl.haxx.se/download.html

Close RStudio before installation.

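Since exit status 127 means the shell could not find the command, you can confirm from within R whether the curl binary is visible (Sys.which() is base R):

```r
# Returns the full path to the curl binary, or "" if it is not on the PATH
curl_path <- Sys.which("curl")
nzchar(curl_path)  # TRUE if curl was found
```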
#7


1  

I'm offering the curl package as an alternative that I found to be reliable when extracting large files from an online database. In a recent project I had to download 120 files from an online database, and found it to halve the transfer times and to be much more reliable than download.file().

#install.packages("curl")
library(curl)
#install.packages("RCurl")
library(RCurl)

# Time RCurl::getURL()
ptm <- proc.time()
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
proc.time() - ptm

# Time curl::curl_download()
ptm1 <- proc.time()
curl_download(url = URL, destfile = "TEST.CSV", quiet = FALSE, mode = "wb")
proc.time() - ptm1

# Time base download.file() with method = "curl"
ptm2 <- proc.time()
y <- download.file(URL, destfile = "./data/data.csv", method = "curl")
proc.time() - ptm2

In this case, rough timing on your URL showed no consistent difference in transfer times. In my application, using curl_download in a script to select and download 120 files from a website cut my transfer times from about 2000 seconds per file to 1000 seconds, and reduced failures from roughly 50% of files to 2 out of 120. The script is posted in my answer to an earlier question.

#8


0  

You can set the global option and try:

options('download.file.method'='curl')
download.file(URL, destfile = "./data/data.csv", method="auto")

For background on this issue, see https://stat.ethz.ch/pipermail/bioconductor/2011-February/037723.html

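To confirm the option took effect before downloading (base R only):

```r
options(download.file.method = "curl")
# Should now return "curl"; download.file() falls back to this
# option when no explicit method is supplied
getOption("download.file.method")
```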
#9


0  

Try the following for large files:

library(data.table)
URL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- fread(URL)
