I'd like to read only the first character from each line of a text file, ignoring the rest.
Here's an example file:
x <- c(
"Afklgjsdf;bosfu09[45y94hn9igf",
"Basfgsdbsfgn",
"Cajvw58723895yubjsdw409t809t80",
"Djakfl09w50968509",
"E3434t"
)
writeLines(x, "test.txt")
I can solve the problem by reading everything with readLines and using substring to get the first character:
lines <- readLines("test.txt")
substring(lines, 1, 1)
## [1] "A" "B" "C" "D" "E"
This seems inefficient though. Is there a way to persuade R to read only the first character of each line, rather than reading everything and discarding the rest?
I suspect that there ought to be some incantation using scan, but I can't find it. An alternative might be low-level file manipulation (maybe with seek).
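To make the seek idea concrete, here is a minimal sketch, assuming the byte offsets of the line starts were already known (the offsets below are hypothetical); in practice, finding those offsets requires scanning the file anyway:
## hypothetical sketch: jump to known line-start offsets, read one character each
con <- file("test.txt", "rb")
offsets <- c(0, 30, 43)            # hypothetical byte offsets of line starts
vapply(offsets, function(off) {
  seek(con, where = off)           # jump to the start of a line
  readChar(con, nchars = 1L)       # read only its first character
}, character(1))
close(con)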
Since performance is only relevant for larger files, here's a bigger test file for benchmarking with:
set.seed(2015)
nch <- sample(1:100, 1e4, replace = TRUE)
x2 <- vapply(
nch,
function(nch)
{
paste0(
sample(letters, nch, replace = TRUE),
collapse = ""
)
},
character(1)
)
writeLines(x2, "bigtest.txt")
Update: It seems that you can't avoid scanning the whole file. The best speed gains seem to come from using a faster alternative to readLines (Richard Scriven's stringi::stri_read_lines solution and Josh O'Brien's data.table::fread solution), or from treating the file as binary (Martin Morgan's readBin solution).
6 Answers
#1
8
01/04/2015 Edited to bring the better solution to the top.
Update 2: Changing the scan() method to run on an open connection, instead of opening and closing the file on every iteration, allows reading line by line and eliminates the looping. The timing improved quite a bit.
## scan() on open connection
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
I also discovered the stri_read_lines() function in the stringi package. Its help file says it's experimental at the moment, but it is very fast.
## stringi::stri_read_lines()
library(stringi)
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
Here are the timings for these two methods.
## timings
library(microbenchmark)
microbenchmark(
scan = {
conn <- file("bigtest.txt", "rt")
substr(scan(conn, what = "", sep = "\n", quiet = TRUE), 1, 1)
close(conn)
},
stringi = {
stri_sub(stri_read_lines("bigtest.txt"), 1, 1)
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# scan 50.00170 50.10403 50.55055 50.18245 50.56112 54.64646 100
# stringi 13.67069 13.74270 14.20861 13.77733 13.86348 18.31421 100
Original [slower] answer:
You could try read.fwf() (fixed-width file), setting the width to 1 to capture the first character on each line.
read.fwf("test.txt", 1, stringsAsFactors = FALSE)[[1L]]
# [1] "A" "B" "C" "D" "E"
Not fully tested of course, but works for the test file and is a nice function for getting substrings without having to read the entire file.
Update 1: read.fwf() is not very efficient, calling scan() and read.table() internally. We can skip the middle-men and try scan() directly.
lines <- count.fields("test.txt") ## length is num of lines in file
skip <- seq_along(lines) - 1 ## set up the 'skip' arg for scan()
read <- function(n) {
ch <- scan("test.txt", what = "", nlines = 1L, skip = n, quiet=TRUE)
substr(ch, 1, 1)
}
vapply(skip, read, character(1L))
# [1] "A" "B" "C" "D" "E"
version$platform
# [1] "x86_64-pc-linux-gnu"
#2
20
If you are allowed to use / have access to Unix command-line tools, you can use
scan(pipe("cut -c 1 test.txt"), what="", quiet=TRUE)
Obviously less portable but probably very fast.
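A small portability guard (not part of the original answer) could fall back to the readLines() approach when cut isn't on the PATH, for example on a Windows machine without Rtools:
## sketch: use the pipe only when `cut` is actually available
first_chars <- if (nzchar(Sys.which("cut"))) {
  scan(pipe("cut -c 1 test.txt"), what = "", quiet = TRUE)
} else {
  substring(readLines("test.txt"), 1, 1)
}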
Using @RichieCotton's benchmarking code with the OP's suggested "bigtest.txt" file:
expr min lq mean median uq
RC readLines 14.797830 17.083849 19.261917 18.103020 20.007341
RS read.fwf 125.113935 133.259220 148.122596 138.024203 150.528754
BB scan pipe cut 6.277267 7.027964 7.686314 7.337207 8.004137
RC readChar 1163.126377 1219.982117 1324.576432 1278.417578 1368.321464
RS scan 13.927765 14.752597 16.634288 15.274470 16.992124
#3
13
data.table::fread() seems to beat all of the solutions proposed so far, and has the great virtue of running comparably fast on both Windows and *NIX machines:
library(data.table)
substring(fread("bigtest.txt", sep="\n", header=FALSE)[[1]], 1, 1)
Here are microbenchmark timings on a Linux box (actually a dual-boot laptop, booted up as Ubuntu):
Unit: milliseconds
expr min lq mean median uq max neval
RC readLines 15.830318 16.617075 18.294723 17.116666 18.959381 27.54451 100
JOB fread 5.532777 6.013432 7.225067 6.292191 7.727054 12.79815 100
RS read.fwf 111.099578 113.803053 118.844635 116.501270 123.987873 141.14975 100
BB scan pipe cut 6.583634 8.290366 9.925221 10.115399 11.013237 15.63060 100
RC readChar 1347.017408 1407.878731 1453.580001 1450.693865 1491.764668 1583.92091 100
And here are timings from the same laptop booted up as a Windows machine (with the command-line tool cut supplied by Rtools):
Unit: milliseconds
expr min lq mean median uq max neval cld
RC readLines 26.653266 27.493167 33.13860 28.057552 33.208309 61.72567 100 b
JOB fread 4.964205 5.343063 6.71591 5.538246 6.027024 13.54647 100 a
RS read.fwf 213.951792 217.749833 229.31050 220.793649 237.400166 287.03953 100 c
BB scan pipe cut 180.963117 263.469528 278.04720 276.138088 280.227259 387.87889 100 d
RC readChar 1505.263964 1572.132785 1646.88564 1622.410703 1688.809031 2149.10773 100 e
#4
13
Figure out the file size, read it in as a single binary blob, find the offsets of the characters of interest (don't count the last '\n' at the end of the file!), and coerce to final form:
f0 <- function() {
  sz <- file.info("bigtest.txt")$size            # file size in bytes
  what <- charToRaw("\n")
  x <- readBin("bigtest.txt", raw(), sz)         # whole file as one raw vector
  idx <- which(x == what)                        # positions of the newlines
  ## first byte of the file, plus the byte after every newline except the last
  rawToChar(x[c(1L, idx[-length(idx)] + 1L)], multiple = TRUE)
}
The data.table solution (I think the fastest so far -- note that the first line needs to be included as part of the data!):
library(data.table)
f1 <- function()
substring(fread("bigtest.txt", header=FALSE)[[1]], 1, 1)
and in comparison
> identical(f0(), f1())
[1] TRUE
> library(microbenchmark)
> microbenchmark(f0(), f1())
Unit: milliseconds
expr min lq mean median uq max neval
f0() 5.144873 5.515219 5.571327 5.547899 5.623171 5.897335 100
f1() 9.153364 9.470571 9.994560 10.162012 10.350990 11.047261 100
Still wasteful, since the entire file is read into memory before mostly being discarded.
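One way to cap the memory use, at some cost in speed, would be to read the file in fixed-size blocks; the sketch below is an untested variant of f0() (the function name and block size are arbitrary choices, not from the original answer):
## sketch: chunked variant of f0() that never holds the whole file in memory
f0_chunked <- function(path = "bigtest.txt", block = 1e6L) {
  con <- file(path, "rb")
  on.exit(close(con))
  nl <- charToRaw("\n")
  firsts <- raw()
  at_line_start <- TRUE                    # the first byte of the file starts a line
  repeat {
    x <- readBin(con, raw(), block)
    if (length(x) == 0L) break
    idx <- which(x == nl)                  # newline positions within this block
    starts <- idx + 1L
    starts <- starts[starts <= length(x)]  # a newline may be the block's last byte
    if (at_line_start) starts <- c(1L, starts)
    firsts <- c(firsts, x[starts])
    at_line_start <- x[length(x)] == nl    # does the next block begin a new line?
  }
  rawToChar(firsts, multiple = TRUE)
}
For files that end with a trailing newline, as writeLines() produces, this should return the same result as f0().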
#5
5
Benchmarks for each answer, under Windows.
library(microbenchmark)
microbenchmark(
"RC readLines" = {
lines <- readLines("test.txt")
substring(lines, 1, 1)
},
"RS read.fwf" = read.fwf("test.txt", 1, stringsAsFactors = FALSE)$V1,
"BB scan pipe cut" = scan(pipe("cut -c 1 test.txt"),what=character()),
"RC readChar" = {
con <- file("test.txt", "r")
x <- readChar(con, 1)
while(length(ch <- readChar(con, 1)) > 0)
{
if(ch == "\n")
{
x <- c(x, readChar(con, 1))
}
}
close(con)
}
)
## Unit: microseconds
## expr min lq mean median uq
## RC readLines 561.598 712.876 830.6969 753.929 884.8865
## RS read.fwf 5079.010 6429.225 6772.2883 6837.697 7153.3905
## BB scan pipe cut 308195.548 309941.510 313476.6015 310304.412 310772.0005
## RC readChar 1238.963 1549.320 1929.4165 1612.952 1740.8300
## max neval
## 2156.896 100
## 8421.090 100
## 510185.114 100
## 26437.370 100
And on the bigger dataset:
## Unit: milliseconds
## expr min lq mean median uq max neval
## RC readLines 52.212563 84.496008 96.48517 103.319789 104.124623 158.086020 20
## RS read.fwf 391.371514 660.029853 703.51134 766.867222 777.795180 799.670185 20
## BB scan pipe cut 283.442150 482.062337 516.70913 562.416766 564.680194 567.089973 20
## RC readChar 2819.343753 4338.041708 4500.98579 4743.174825 4921.148501 5089.594928 20
## RS scan 2.088749 3.643816 4.16159 4.651449 4.731706 5.375819 20
#6
2
I don't find it very informative to benchmark operations on the order of microseconds or milliseconds. But I understand that in some cases it can't be avoided. In those cases, I still find it essential to test data of different (increasing) sizes to get a rough measure of how well the method scales.
Here's my run of @MartinMorgan's tests using f0() and f1() on 1e4, 1e5, and 1e6 rows.
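The larger test files themselves aren't shown; one plausible way to generate them is to reuse the OP's recipe with a varying number of lines (the helper below is made up for illustration):
make_test_file <- function(n, path = "bigtest.txt") {
  set.seed(2015)
  nch <- sample(1:100, n, replace = TRUE)
  x <- vapply(
    nch,
    function(k) paste0(sample(letters, k, replace = TRUE), collapse = ""),
    character(1)
  )
  writeLines(x, path)
}
make_test_file(1e6)   # then rerun microbenchmark(f0(), f1())
The results: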
1e4
# Unit: milliseconds
# expr min lq mean median uq max neval
# f0() 4.226333 7.738857 15.47984 8.398608 8.972871 89.87805 100
# f1() 8.854873 9.204724 10.48078 9.471424 10.143601 84.33003 100
1e5
# Unit: milliseconds
# expr min lq mean median uq max neval
# f0() 71.66205 176.57649 174.9545 184.0191 187.7107 307.0470 100
# f1() 95.60237 98.82307 104.3605 100.8267 107.9830 205.8728 100
1e6
# Unit: seconds
# expr min lq mean median uq max neval
# f0() 1.443471 1.537343 1.561025 1.553624 1.558947 1.729900 10
# f1() 1.089555 1.092633 1.101437 1.095997 1.102649 1.140505 10
identical(f0(), f1()) returned TRUE on all the tests.
Update:
1e7
I also ran on 1e7 rows.
f1() (data.table) ran in 9.7 seconds, whereas f0() ran in 7.8 seconds the first time, and in 9.4s and 6.6s on subsequent runs.
However, f1() resulted in no noticeable change in memory while reading the entire 0.479GB file, whereas f0() resulted in a spike of 2.4GB.
Another observation:
set.seed(2015)
x2 <- vapply(
1:1e5,
function(i)
{
paste0(
sample(letters, 100L, replace = TRUE),
collapse = "_"
)
},
character(1)
)
# 10 million rows, with 200 characters each
writeLines(unlist(lapply(1:100, function(x) x2)), "bigtest.txt")
## readBin() results in a 2 billion row vector
system.time(f0()) ## explodes on memory
Because the readBin() step results in a vector of 2 billion elements (~1.9GB to read the file), and the which(x == what) step takes a further ~4.5GB (~6.5GB in total), at which point I stopped the process.
fread() takes ~23 seconds in this case.
HTH