Reading a list of file names from the web into R

Time: 2023-02-06 10:42:08

I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:

# enter user credentials
user     <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"

# construct path to data
path <- paste("https://", credentials, web.site, sep="")

# read data for 4/10/2013
file  <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)

However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus, there will be issues when attempting to read weekends and holidays, where there are no files.
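
For reference, the brute-force version I have in mind would look roughly like this (the date range is only illustrative, and it does not yet deal with the missing weekend/holiday files):

# brute-force version: one file name per calendar day, stacked with rbind
dates <- seq(as.Date("2013-01-01"), as.Date("2013-04-10"), by = "day")
df <- NULL
for (i in seq_along(dates)) {
  file  <- paste("icecleared_power_", format(dates[i], "%Y_%m_%d"), sep = "")
  fname <- paste(path, file, ".dat", sep = "")
  # read.csv() will simply error out on weekends and holidays where no file exists
  df <- rbind(df, read.csv(fname, header = TRUE, sep = "|", as.is = TRUE))
}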

The images below show what the list of files looks like in the web browser:

[Screenshots: web browser listing of the daily settlement report files]

Is there a way to first scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e., start with "icecleared_power_", since there are also some files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?

Any guidance would be greatly appreciated.

3 Answers

#1


4  

I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.

Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.

Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).

> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
                   href                    href                    href 
             "?C=N;O=D"              "?C=M;O=A"              "?C=S;O=A" 
                   href                    href                    href 
             "?C=D;O=A" "/src/contrib/Archive/"  "Amelia_1.1-23.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-29.tar.gz"  "Amelia_1.1-30.tar.gz"  "Amelia_1.1-32.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-33.tar.gz"   "Amelia_1.2-0.tar.gz"   "Amelia_1.2-1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.2-2.tar.gz"   "Amelia_1.2-9.tar.gz"  "Amelia_1.2-12.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-13.tar.gz"  "Amelia_1.2-14.tar.gz"  "Amelia_1.2-15.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-16.tar.gz"  "Amelia_1.2-17.tar.gz"  "Amelia_1.2-18.tar.gz" 
                   href                    href                    href 
  "Amelia_1.5-4.tar.gz"   "Amelia_1.5-5.tar.gz"   "Amelia_1.6.1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.6.3.tar.gz"   "Amelia_1.6.4.tar.gz"     "Amelia_1.7.tar.gz" 

For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.

> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
                  href                   href                   href 
 "Amelia_1.2-0.tar.gz"  "Amelia_1.2-1.tar.gz"  "Amelia_1.2-2.tar.gz" 
                  href                   href                   href 
 "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz" 
                  href                   href 
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz" 

You can now use that vector as follows:

wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe), 
       function(x) download.file(GetMe[x], wanted[x], mode = "wb"))

Update (to clarify your question in comments)

The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:

lapply(seq_along(GetMe), 
       function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))

However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded, and so on. In that case, instead of having to redownload everything, you would only need to download the files that were not fully downloaded.
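
For example, once the files are sitting in your working directory, the second step could look roughly like this (reusing the sep = "|" setting from the question; the object names are just illustrative):

## read the downloaded files back in and stack them into one data.frame
downloaded <- list.files(pattern = "^icecleared_power_")
DataList   <- lapply(downloaded, read.csv, header = TRUE, sep = "|", as.is = TRUE)
Combined   <- do.call(rbind, DataList)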

#2


1  

@MikeTP, if all the reports start with "icecleared_power_" followed by a date that is a business date, the "timeDate" package offers an easy way to create a vector of business dates, like so:

require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days

and

paste0("icecleared_power_",as.character.Date(tBiz))

gives you the concatenated file name.
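
A rough sketch of how that vector could feed your existing read.csv call (path is the object from your question, the "%Y_%m_%d" format matches the underscore style of your file names, and tryCatch is just one way to skip any stray holiday that slips through):

files <- paste0("icecleared_power_", format(as.Date(tBiz), "%Y_%m_%d"))
DataList <- lapply(files, function(f) {
  tryCatch(read.csv(paste0(path, f, ".dat"), header = TRUE, sep = "|", as.is = TRUE),
           error = function(e) NULL)  # no file published for this date
})
Combined <- do.call(rbind, DataList)  # NULL elements are dropped by rbind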

If the web site follows a different logic regarding the naming of files, we need more information, as Ananda Mahto observed.

Keep in mind that when you create a date vector with timeDate, you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, etc.
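
For instance, if these are exchange-traded products, something like the following would also drop NYSE holidays (holidayNYSE() ships with timeDate; substitute whichever holiday calendar actually applies):

require(timeDate)
tSeq <- timeSequence("2012-01-01", "2012-12-31")           # vector of days
tBiz <- tSeq[isBizday(tSeq, holidays = holidayNYSE(2012))] # weekdays minus NYSE holidays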

#3


1  

You can try using the command "download.file".

### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"

### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))

The trick to this is going to be figuring out how the URL or path changes systematically between files. Oftentimes, web pages are built such that the URLs are systematic. In that case, you could potentially create a vector or data frame of URLs to iterate over inside an apply function.

All of this can be sandwiched inside an lapply. The "data" object is simply whatever we are iterating over. It could be a vector of URLs or a data frame of year and month observations, which could then be used to construct URLs within the lapply function.
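
As a rough illustration using the ICE layout from the question, the object being iterated over could be built like this (note that the snippet below assumes "data" also carries a name column, so adapt the naming step accordingly):

### Build a vector of candidate URLs to iterate over
days <- seq(as.Date("2013-04-01"), as.Date("2013-04-10"), by = "day")
data <- paste0(path, "icecleared_power_", format(days, "%Y_%m_%d"), ".dat")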

### "dl" will apply a function to every element in our vector "data"
  # It will also help keep track of files which have no download data
dl <- lapply(data, function(x) {
        path <- 'url'
        dest <- './data_intermediate/...'
        try(download.file(path, dest))
      })

### Assign element names to your list "dl"
names(dl) <- unique(data$name)
index     <- sapply(dl, is.null)

### Figure out which downloads returned nothing
no.download <- names(dl)[index]

You can then use "list.files()" to merge all the data together, assuming they belong in one data.frame:

### Create a list of files you want to merge together
files <- list.files()

### Create a list of data.frames by reading each file into memory
data  <- lapply(files, read.csv)

### Stack data together
data <- do.call(rbind, data)

Sometimes you will notice that a file has been corrupted after downloading. In that case, pay attention to the "mode" option of the download.file() command: set mode = "wb" if the file is stored in a binary format, since text mode can corrupt binary downloads on Windows.
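
For example, with "path" and "dest" as defined above:

### Force binary transfer mode to avoid corrupted downloads
try(download.file(path, dest, mode = "wb"))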
