Importing a file with multiple record types

Date: 2022-09-06 17:03:53

How would you read a file with multiple record types (e.g. Header and Details) into a dataframe in R?

For example, the data looks like

HAAABBB
D12345
D23456
HCCCDDD
D67890

...

I would like to make a dataframe like this:

v1  v2  v3 
AAA BBB 12345
AAA BBB 23456
CCC DDD 67890

It seems cumbersome to call readLines and then use row numbers to work out which header record each detail record belongs to.

I used to open these types of files with a program called Monarch, but it is pretty slow for big files.

1 Answer

You can read all the lines in your file into R and then process them according to the rules defined:

# read input file
rawtext <- readLines("FileName")

rawtext
# [1] "HAAABBB" "D12345"  "D23456"  "HCCCDDD" "D67890"

# get locations of headers and values
headers.loc <- which(startsWith(rawtext, "H"))
values.loc <- which(startsWith(rawtext, "D"))

# extract values (drop the leading "D")
values <- substring(rawtext[values.loc], 2)

# for each value line, find the index of the nearest preceding header
hv <- sapply(values.loc, FUN = function(x) max(which(x - headers.loc > 0)))

# create a dataframe
df <- data.frame(v1 = substring(rawtext[headers.loc[hv]], 2, 4), 
                 v2 = substring(rawtext[headers.loc[hv]], 5, 7), 
                 v3 = values)


df
#    v1  v2    v3
# 1 AAA BBB 12345
# 2 AAA BBB 23456
# 3 CCC DDD 67890
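
For big files, the per-line sapply lookup can probably be avoided. The following is only a sketch of one vectorized alternative, not part of the answer above: it assumes the same fixed layout (headers start with "H", details start with "D", header fields sit at characters 2-4 and 5-7), and the function name parse_records is made up for illustration. It uses cumsum over the header flags to carry the most recent header forward to each detail line.

```r
# Hypothetical helper (not from the original answer): build one row per
# detail record, carrying the most recent header forward.
parse_records <- function(lines) {
  is_header <- startsWith(lines, "H")
  is_detail <- startsWith(lines, "D")

  # cumsum() over the header flags labels every line with the index of the
  # latest header seen so far (0 means no header has appeared yet).
  group <- cumsum(is_header)
  if (any(is_detail & group == 0)) {
    stop("detail record found before the first header")
  }

  headers <- lines[is_header]
  owner <- headers[group[is_detail]]  # header line owning each detail line

  data.frame(
    v1 = substring(owner, 2, 4),
    v2 = substring(owner, 5, 7),
    v3 = substring(lines[is_detail], 2),
    stringsAsFactors = FALSE
  )
}

# Same sample data as in the question
sample_lines <- c("HAAABBB", "D12345", "D23456", "HCCCDDD", "D67890")
result <- parse_records(sample_lines)
print(result)
```

With a real file you would pass readLines("FileName") instead of sample_lines. Lines that start with neither "H" nor "D" are simply ignored in this sketch.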
