Reading fixed-width data without line terminators

I've got a fixed-width flat file with no line terminators at all, neither carriage return nor line feed (a dump from an AS400).

How do I load this file into an R data.frame?

I've tried different combinations of textConnection and read.fwf, to no avail.

The code below crashes Rstudio, so I'm assuming I'm overloading the system.

len below is 24376400, which is tame as far as the files I usually load using read.table. Record length is 400.

Is there any RECLEN parameter I should set, similar to SAS? Is there an option to set EOL = "\n" or "\r\n" ? Thank you.

fname <- "AS400FILE.TXT"
len <- file.info(fname)$size
conn <- file(fname, 'r')
contents <- readChar(conn, len)
close(conn)

df <- read.fwf( textConnection(contents) , widths=layout$length , sep="")

> dput(layout)
structure(list(start = c(1L, 41L, 81L, 121L, 161L, 201L, 224L, 
226L, 231L, 235L, 237L, 238L, 240L, 280L, 290L, 300L, 305L, 308L, 
309L, 330L, 335L, 337L, 349L, 350L, 351L, 355L, 365L), end = c(40L, 
80L, 120L, 160L, 200L, 223L, 225L, 230L, 234L, 236L, 237L, 239L, 
279L, 289L, 299L, 304L, 307L, 308L, 329L, 334L, 336L, 348L, 349L, 
350L, 354L, 364L, 400L), length = c(40L, 40L, 40L, 40L, 40L, 
23L, 2L, 5L, 4L, 2L, 1L, 2L, 40L, 10L, 10L, 5L, 3L, 1L, 21L, 
5L, 2L, 12L, 1L, 1L, 4L, 10L, 36L), label = c("TITLE", "SUFFIX", 
"ADDRESS1", "ADDRESS2", "ADDRESS3", "CITY", "STATE", 
"ZIP", "ZIP+4", "DELIVERY", "CHECKD", "FILLER", "NAME", 
"SOURCECODE", "ID", "FILLER", "BATCH", "FILLER", "FILLER", 
"GRID", "LOT", "FILLER", "CONTROL", 
"ZIPIND", "TROUTE", "SOURCEA", "FILLER")), .Names = c("start", 
"end", "length", "label"), class = "data.frame", row.names = c(NA, 
-27L))
> dim(layout)
[1] 27  4
>

1 Solution

#1

You could use readChar for this.

First, make up some sample data. (As far as I can tell from the question, the format is a wall of text with a specified width per column and no newlines anywhere in the file.)

lengths <- c(2,3,4,2,3,4)
nFields <- length(lengths)
nRows   <- 10              # let's make 10 rows.
contents <- paste(letters[sample.int(26,size=sum(lengths)*nRows,replace=TRUE)],
                  collapse="")
#> contents
#[1] "lepajmcgcqooekmedjprkmmicm.......
cat(contents,file='test.txt')

I can think of three ways to do it, each with different trade-offs:

If you know the number of rows in advance you can do:

# If you know #rows in advance..
conn <- file('test.txt','rb')   # binary mode; readChar may warn on a text-mode connection
data <- readChar( conn, rep(lengths,nRows) )
close(conn)
# reshape data to dataframe, one row per record
df <- data.frame(matrix(data,ncol=nFields,byrow=TRUE))

Otherwise you can use a loop. (Why read the file once to work out the number of rows and then again to parse it?)

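# Otherwise use a loop
conn <- file('test.txt','rb')   # binary mode; readChar may warn on a text-mode connection
df <- data.frame(matrix(nrow=0,ncol=nFields)) # initialise 0-row data frame
# read one record (nFields fields) per iteration until nothing is left
while ( length(data <- readChar(conn, lengths)) > 0 ) {
    df[nrow(df)+1,] <- data
}
close(conn)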

Or, since you already have all of contents in a string, you can just split the string using substring:

# have already read in contents so can calculate nRows
nRows <- floor(nchar(contents)/sum(lengths)) # 10 for my example
starts <- c(0,cumsum(lengths[-nFields]))
df3 <- data.frame(t(
                    vapply( seq(1,nRows*sum(lengths),sum(lengths)),
                    function(r) 
                        substring(contents,starts+r,starts+r+lengths-1),
                    rep("",nFields) )))
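
With a layout table like the one in the question, the same idea can use the start and end columns directly and carry the labels over as column names. A minimal sketch, assuming single-byte characters; the small `layout` and `contents` below are made-up stand-ins, not the real AS400 data:

```r
# Hypothetical 3-field layout, shaped like the one in the question
layout <- data.frame(start = c(1L, 5L, 7L),
                     end   = c(4L, 6L, 9L),
                     label = c("NAME", "STATE", "ZIP"),
                     stringsAsFactors = FALSE)
reclen   <- max(layout$end)                 # record length (400 in the question)
contents <- "ANNANY123BOB CA456"            # two 9-character records, no newlines
nRows    <- nchar(contents) %/% reclen

# Cut the string into whole records first, then cut each record into fields
offsets <- (seq_len(nRows) - 1L) * reclen
records <- substring(contents, offsets + 1L, offsets + reclen)
fields  <- vapply(records,
                  function(rec) substring(rec, layout$start, layout$end),
                  character(nrow(layout)), USE.NAMES = FALSE)
df4 <- data.frame(t(fields), stringsAsFactors = FALSE)
names(df4) <- make.unique(layout$label)     # repeated labels such as FILLER get suffixes
```

Fields keep their padding (the second NAME here is "BOB " with a trailing space), so a `trimws()` pass may be wanted afterwards.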

If you want to do it in as few file reads as possible, I suggest the second or third method.

The third method "feels" most elegant to me, but requires you to read in the entire contents all at once, which, depending on file size, may not be viable.

If that's the case I'd go for the second, which only reads in one set of nFields fields at a time.
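
If one record per call turns out to be slow, a middle ground is to read a block of records per `readChar` call. A rough sketch, assuming single-byte characters and a file made of whole records; `chunkRows` is a made-up tuning value and the sample file is written to a temporary path:

```r
lengths   <- c(2, 3, 4, 2, 3, 4)
nFields   <- length(lengths)
chunkRows <- 4                      # hypothetical: records per read, tune to taste

# Sample file: 10 records of 18 characters, no newlines
tmp <- tempfile(fileext = ".txt")
cat(paste(rep("abcdefghijklmnopqr", 10), collapse = ""), file = tmp)

conn   <- file(tmp, "rb")           # binary mode for readChar
chunks <- list()
repeat {
    data <- readChar(conn, rep(lengths, chunkRows))
    data <- data[nzchar(data)]      # defensive: ignore any empty strings at end of file
    if (length(data) == 0) break
    # each chunk becomes a small matrix with one row per record
    chunks[[length(chunks) + 1]] <- matrix(data, ncol = nFields, byrow = TRUE)
}
close(conn)
df2 <- data.frame(do.call(rbind, chunks), stringsAsFactors = FALSE)
```

Collecting the chunks in a list and binding them once at the end avoids growing a data frame row by row.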

I don't recommend the first unless you already know the number of rows; it was just my first attempt. Otherwise you have to read the file once to work out the number of rows, then close it and read it again, and if you're going down that route you may as well use method 3. If you know the number of rows by some other means, though, method 1 works fine.
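
One such means may be the file size: for single-byte data made of whole records, the row count is just the size divided by the record length, so no extra pass over the file is needed. A sketch under those assumptions (the temporary sample file is for illustration only):

```r
lengths <- c(2, 3, 4, 2, 3, 4)
nFields <- length(lengths)

# Sample file: 10 records of 18 bytes, no newlines
tmp <- tempfile(fileext = ".txt")
cat(paste(rep("abcdefghijklmnopqr", 10), collapse = ""), file = tmp)

# Row count from the file size alone; assumes one byte per character
reclen <- sum(lengths)
nRows  <- file.info(tmp)$size %/% reclen

conn <- file(tmp, "rb")             # binary mode for readChar
data <- readChar(conn, rep(lengths, nRows))
close(conn)
df1 <- data.frame(matrix(data, ncol = nFields, byrow = TRUE),
                  stringsAsFactors = FALSE)
```

For the file in the question, the same arithmetic gives 24376400 / 400 = 60941 records.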
