How can I read specific columns from a large fixed-width file in R?

Time: 2022-02-24 20:26:35

Is there any convenient way in R to read a specific column (or multiple columns) from a fixed-width data file? E.g. the file looks like this:

10010100100002000000
00010010000001000000
10010000001002000000

Say I am interested in column 14. At the moment I read the whole file with read.fwf, passing as widths a vector of 1s whose length is the total number of columns:

data <- read.fwf("demo.asc", widths=rep(1,20))
data[,14]
[1] 2 1 2

This works well, but it doesn't scale to data sets with hundreds of thousands of columns and rows. Is there a more efficient way to do this?

1 solution

#1


You can use a connection and process the file in blocks:

Replicate your data:

dat <-"10010100100002000000
00010010000001000000
10010000001002000000"

Process in blocks using a connection:

# Define a connection
con <- textConnection(dat)

# Do the block update
linesPerUpdate <- 2
result <- character()
repeat {
  line <- readLines(con, linesPerUpdate)
  result <- c(result, substr(line, start=14, stop=14))
  if (length(line) < linesPerUpdate) break
}

# Close the connection
close(con)

The result:

result
[1] "2" "1" "2"
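
The same block-wise idea can be applied to a file on disk and to several columns at once. The following is only a minimal sketch, not a tuned implementation: the helper name read_fixed_cols, its arguments, and the temporary demo file are assumptions made for illustration, and it assumes every column is exactly one character wide, as in the question.

```r
# Hypothetical helper (not part of base R): read selected one-character
# columns from a fixed-width text file, one block of lines at a time.
read_fixed_cols <- function(path, cols, lines_per_block = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))          # close the connection even if an error occurs
  blocks <- list()
  repeat {
    lines <- readLines(con, n = lines_per_block)
    if (length(lines) == 0L) break
    # One character vector per requested column, bound into a matrix
    # with one row per line and one column per requested position.
    blocks[[length(blocks) + 1L]] <- do.call(
      cbind,
      lapply(cols, function(k) substr(lines, k, k))
    )
    if (length(lines) < lines_per_block) break
  }
  # Stack the per-block matrices once, instead of growing a vector in the loop.
  do.call(rbind, blocks)
}

# Demo on a temporary copy of the example data from the question.
demo_path <- tempfile(fileext = ".asc")
writeLines(c("10010100100002000000",
             "00010010000001000000",
             "10010000001002000000"), demo_path)

picked <- read_fixed_cols(demo_path, cols = c(1, 14), lines_per_block = 2L)
print(picked)
```

Collecting the blocks in a list and binding them once at the end avoids repeatedly copying a growing result, which may matter for files with very many rows. The values come back as character strings; wrap them in as.integer() if numbers are needed.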
