How would you read a file with multiple record types (e.g. Header and Details) into a dataframe in R?
For example, the data looks like
HAAABBB
D12345
D23456
HCCCDDD
D67890
...
I would like to make a dataframe like this:
v1 v2 v3
AAA BBB 12345
AAA BBB 23456
CCC DDD 67890
It seems cumbersome to read the file line by line and use the row number to work out which header record each detail belongs to.
I used to use a program called Monarch to open these types of files; however, it is pretty slow for big files.
1 Solution
#1
You can read all the lines in your file into R and then process them according to the rules defined:
# read input file
rawtext <- readLines("FileName")
rawtext
# [1] "HAAABBB" "D12345" "D23456" "HCCCDDD" "D67890"
# get locations of headers and values
headers.loc <- which(startsWith(rawtext, "H"))
values.loc <- which(startsWith(rawtext, "D"))
# extract values
values <- substring(rawtext[values.loc], 2)
# for each value line, find the index of the nearest header above it
hv <- sapply(values.loc, function(x) max(which(headers.loc < x)))
# create a dataframe
df <- data.frame(v1 = substring(rawtext[headers.loc[hv]], 2, 4),
                 v2 = substring(rawtext[headers.loc[hv]], 5, 7),
                 v3 = values)
df
#    v1  v2    v3
# 1 AAA BBB 12345
# 2 AAA BBB 23456
# 3 CCC DDD 67890
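As a possible alternative to the sapply() lookup above, the header for each detail line can also be found with a running count of header lines. This is only a sketch under some assumptions: the file starts with an "H" line, every line begins with either "H" or "D", and the fixed column positions are the ones from the example. The names is_header, is_detail, group, detail_header and df2 are made up for illustration, and the sample lines are hard-coded in place of readLines().

```r
# Sample lines from the question; in practice these would come from readLines()
rawtext <- c("HAAABBB", "D12345", "D23456", "HCCCDDD", "D67890")

# Flag header and detail lines
is_header <- startsWith(rawtext, "H")
is_detail <- startsWith(rawtext, "D")

# Running count of headers seen so far: for every line this is the
# index (among the header lines) of the most recent header
group <- cumsum(is_header)

# Header line belonging to each detail line
# (assumes the first line of the file is a header, so group is never 0)
headers <- rawtext[is_header]
detail_header <- headers[group[is_detail]]

# Build the data frame from the fixed-width pieces
df2 <- data.frame(
  v1 = substring(detail_header, 2, 4),
  v2 = substring(detail_header, 5, 7),
  v3 = substring(rawtext[is_detail], 2),
  stringsAsFactors = FALSE
)
df2
```

Because cumsum() is vectorized, this avoids calling a function once per detail line, which may help with the large files mentioned in the question.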