Using grep as a command line tool in R data.table fread() - incorrect results

Time: 2022-10-22 14:55:52

Hello,

This is my first post here. I use the excellent R data.table package. I need to import a file without its comment lines, but I don't see any option in fread() to drop comment lines that are spread throughout the file rather than only at the beginning. To simplify: the file test.txt consists of 4 lines, and comment lines begin with "#":

#A
A   AA
A   A#A
#A

I import the data with fread() and then remove the comment lines with grep (^#); everything works. There is also an option to pass a grep command line call to fread() instead of a single file name. (For the record, I am working in Windows, so I have grep.exe in my project folder.) Grep handles simple regular expressions as expected when I call it from R:
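The import-then-filter approach described above can be sketched as follows. This is only a minimal illustration, assuming test.txt is tab-separated (as the sep = "\t" argument suggests) and that the comment marker always sits at the start of the first column; the name `dt` is a placeholder.

```r
library(data.table)

# Recreate the four-line sample file (tab-separated is an assumption).
writeLines(c("#A", "A\tAA", "A\tA#A", "#A"), "test.txt")

# fill = TRUE pads the one-field comment lines with an empty second column.
dt <- fread("test.txt", sep = "\t", header = FALSE, fill = TRUE)

# Keep only the rows whose first column does not start with "#".
dt <- dt[!grepl("^#", V1)]
print(dt)
```

Note that the row "A   A#A" survives, because its "#" is not at the start of the line.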

> system("grep # test.txt")
#A
A   A#A
#A
> system("grep ^# test.txt")
#A
#A

But it ignores the beginning-of-line anchor "^" when called as a system command inside the fread() function:

> fread("grep # test.txt", sep = "\t", header = FALSE, fill = TRUE)
   V1  V2
1: #A    
2:  A A#A
3: #A

> fread("grep ^# test.txt", sep = "\t", header = FALSE, fill = TRUE)
   V1  V2
1: #A    
2:  A A#A
3: #A  

Thus, grep.exe and grep() in R both work as expected, but grep.exe called from fread() ignores the beginning-of-line anchor (I didn't try other regex features). What is wrong here?

1 solution

Thank you very much, Frank. I tried fread("grep '^#' test.txt", sep = "\t", header = FALSE, fill = TRUE), but this call results in an error, whereas fread('grep "^#" test.txt', sep = "\t", header = FALSE, fill = TRUE), with the pattern in double quotes as you suggested, works correctly. Strange behavior. Also, I noticed that making a system call to grep inside fread() is almost twice as slow as reading the file with fread() and then using grep() inside R. My file is ~1 Mb, 3521034 lines, 1058 of them comment lines. Maybe, if the proportion of comment lines were much larger, the system call would be quicker, because in that case you would not need to import the comment lines into the data.table. (Without a grep call before importing, I need to use fill = TRUE to add empty values for the missing columns in comment lines; otherwise fread() fails.)
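A likely explanation, offered as an assumption rather than something confirmed in this thread: on Windows, fread() appears to hand the command string to cmd.exe, where an unquoted ^ is the escape character and is consumed before grep sees it, while single quotes have no quoting role. Double quotes around the pattern protect the caret. The sketch below uses fread()'s cmd argument and grep's -v flag (invert match) to drop the comment lines before import; it assumes a grep executable is on the PATH and that test.txt is tab-separated.

```r
library(data.table)

# Recreate the four-line sample file (tab-separated is an assumption).
writeLines(c("#A", "A\tAA", "A\tA#A", "#A"), "test.txt")

# Outer single quotes delimit the R string; inner double quotes protect
# the caret from the shell. -v keeps only the lines that do NOT match.
dt <- fread(cmd = 'grep -v "^#" test.txt', sep = "\t", header = FALSE)
print(dt)
```

Because the comment lines never reach fread() in this variant, fill = TRUE is not needed.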
