I have hundreds of text files with the following information in each file:

*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24

Now I want to read all these files, collect the Sen's slope statistic from each one (e.g. .24), and compile the values into a single file along with the corresponding file names. I have to do it in R.

I have worked with CSV files, but I am not sure how to handle plain text files.

This is the code I am using now:

require(gtools)
GG <- grep("*.txt", list.files(), value = TRUE)
GG<-mixedsort(GG)
S <- sapply(seq(GG), function(i){
X <- readLines(GG[i])
grep("SEN SLOPE", X, value = TRUE)
})
spl <- unlist(strsplit(S, ".*[^.0-9]"))
SenStat <- as.numeric(spl[nzchar(spl)])
SenStat<-data.frame( SenStat,file = GG)
write.table(SenStat, "sen.csv",sep = ", ",row.names = FALSE)

The current code does not read all values correctly and gives this warning:

Warning message:
NAs introduced by coercion

Also, the file names do not appear in the other column of the output. Please help!

Diagnosis 1

The code is picking up the = sign as well. This is the output of print(spl):

 [1] ""       "5.55"   ""       "-.18"   ""       "3.08"   ""       "3.05"   ""       "1.19"   ""       "-.32"  
[13] ""       ".22"    ""       "-.22"   ""       ".65"    ""       "1.64"   ""       "2.68"   ""       ".10"   
[25] ""       ".42"    ""       "-.44"   ""       ".49"    ""       "1.44"   ""       "=-1.07" ""       ".38"   
[37] ""       ".14"    ""       "=-2.33" ""       "4.76"   ""       ".45"    ""       ".02"    ""       "-.11"  
[49] ""       "=-2.64" ""       "-.63"   ""       "=-3.44" ""       "2.77"   ""       "2.35"   ""       "6.29"  
[61] ""       "1.20"   ""       "=-1.80" ""       "-.63"   ""       "5.83"   ""       "6.33"   ""       "5.42"  
[73] ""       ".72"    ""       "-.57"   ""       "3.52"   ""       "=-2.44" ""       "3.92"   ""       "1.99"  
[85] ""       ".77"    ""       "3.01"

Diagnosis 2

I think I found the problem. The negative sign is a bit tricky. Some files contain lines like these:

SEN SLOPE =-1.07
SEN SLOPE = -.11

When there is no space after the = sign (the first line), I get NA; when there is a space (the second line), the value is read correctly. How can I modify the regex to handle both? Thanks!

4 solutions

#1

Assume "text.txt" is one of your text files. After reading it into R with readLines, you can use grep to find the line containing SEN SLOPE. With no further arguments, grep returns the index of each element where the regular expression matched. Here we find that it's the 11th line. Add the value = TRUE argument to get the text of the line itself.

x <- readLines("text.txt")
grep("SEN SLOPE", x)
## [1] 11
( gg <- grep("SEN SLOPE", x, value = TRUE) )
## [1] "SEN SLOPE =  .24"
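
Once the line is found, the number can be pulled out with a single regular expression. The following is a minimal sketch rather than part of the original answer: `extract_slope` is a hypothetical helper name, and it assumes the value is the last thing on the line. It is meant to tolerate both `=-1.07` and `= -.11`.

```r
# Hypothetical helper: return the numeric value that follows "=" on a
# "SEN SLOPE" line, whether or not spaces surround the equals sign.
extract_slope <- function(line) {
  # Drop everything up to and including "=" plus any whitespace after it
  value <- sub("^.*=\\s*", "", line)
  as.numeric(value)
}

extract_slope("SEN SLOPE =  .24")
extract_slope("SEN SLOPE =-1.07")
extract_slope("SEN SLOPE = -.11")
```

Because the pattern uses `\\s*` (zero or more whitespace characters), a missing space after `=` no longer leaves a stray `=` in front of the number.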

To find all the .txt files in the working directory we can use list.files with a regular expression.

## pattern is a regular expression, not a glob: escape the dot and anchor the end
list.files(pattern = "\\.txt$")
## [1] "text.txt"

LOOPING OVER MULTIPLE FILES

I created a second text file, text2.txt, with a different SEN SLOPE value to illustrate how this method can be applied over multiple files. We can use sapply to collect the matching line from each file, then strsplit to strip everything up to the value.

GG <- list.files(pattern = "\\.txt$")
S <- sapply(seq_along(GG), function(i){
    X <- readLines(GG[i])
    ifelse(length(X) > 0, grep("SEN SLOPE", X, value = TRUE), NA)
    ## added 04/23/14 to account for empty files (as per comment)
})
spl <- unlist(strsplit(S, split = ".*((=|(\\s=))|(=\\s|\\s=\\s))"))
## above regex changed to capture up to and including "=" and 
## surrounding space, if any - 04/23/14 (as per comment)
SenStat <- as.numeric(spl[nzchar(spl)])

Then we can put the results into a data frame:

( SenStatDf <- data.frame(SenStat, file = GG) )
##   SenStat      file
## 1    0.46 text2.txt
## 2    0.24  text.txt

We can write it to a file with

write.table(SenStatDf, "myFile.csv", sep = ", ", row.names = FALSE)

UPDATED 07/21/2014:

Since the result is being written to a file, this can be made much simpler (and faster) with

( SenStatDf <- cbind(
      SenSlope = c(lapply(GG, function(x){
          y <- readLines(x)
          z <- y[grepl("SEN SLOPE", y)]
          ## \\s* (not \\s+) so that "=-1.07", with no space after "=", also splits
          unlist(strsplit(z, split = ".*=\\s*"))[-1]
          }), recursive = TRUE),
      file = GG
 ) )
#      SenSlope file       
# [1,] ".46"   "text2.txt"
# [2,] ".24"   "text.txt"

The result can then be written to a file and read back into R with

write.table(SenStatDf, "myFile.txt", row.names = FALSE)
read.table("myFile.txt", header = TRUE)
#   SenSlope      file
# 1     0.46 text2.txt
# 2     0.24  text.txt

#2

First make a sample text file:

cat('*****Auto-Corelation Results******
1     .09    -.19     .18     non-Significant

*****STATISTICS FOR MANN-KENDELL TEST******
S=  609
VAR(S)=      162409.70
Z=           1.51
Random : No trend at 95%

*****SENs STATISTICS ******
SEN SLOPE =  .24',file='samp.txt')

Then read it in:

tf <- readLines('samp.txt')

Now extract the appropriate line:

sen_text <- grep('SEN SLOPE', tf, value = TRUE)

And then get the value past the equals sign:

sen_value <- as.numeric(unlist(strsplit(sen_text,'='))[2])

Then combine these results across all of your files (the original question does not describe how the files are organized).
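
One way the combining step might look is sketched below. This is an illustration under stated assumptions, not the answerer's code: `read_sen_slope` is a hypothetical helper, the two sample files are created in a temporary folder only so the sketch runs on its own, and every file is assumed to contain exactly one SEN SLOPE line.

```r
# Hypothetical helper following the steps above: read one file, find the
# "SEN SLOPE" line, and convert the text after "=" to a number.
read_sen_slope <- function(path) {
  tf <- readLines(path)
  sen_text <- grep("SEN SLOPE", tf, value = TRUE)
  as.numeric(unlist(strsplit(sen_text, "="))[2])
}

# Two throwaway sample files (assumed names) so the sketch is self-contained
demo_dir <- file.path(tempdir(), "sen_combine_demo")
dir.create(demo_dir, showWarnings = FALSE)
cat("SEN SLOPE =  .24\n", file = file.path(demo_dir, "a.txt"))
cat("SEN SLOPE =-1.07\n", file = file.path(demo_dir, "b.txt"))

# Apply the helper to every .txt file and pair each value with its file name
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
sen_table <- data.frame(
  file = basename(files),
  sen_slope = vapply(files, read_sen_slope, numeric(1), USE.NAMES = FALSE)
)
sen_table
```

Splitting on `=` alone works for both spacing variants because as.numeric ignores leading whitespace in the remaining text.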

#3

If your text files always follow that format (e.g. SEN SLOPE is always on line 11) and the layout is identical across all your files, you can do what you need in just two lines.

char_vector <- readLines("Path/To/Document/sample.txt")
statistic <- as.numeric(strsplit(char_vector[11]," ")[[1]][5])

That will give you 0.24.

You then iterate over all your files with an apply-family function (such as sapply) or a for loop.

For clarity:

> char_vector[11]
[1] "SEN SLOPE =  .24"

and

> strsplit(char_vector[11]," ")
[[1]]
[1] "SEN"   "SLOPE" "="     ""      ".24"

Thus you want element [[1]][5] of the result from strsplit.
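
The for-loop version could be sketched as follows. Note one deliberate change from the two-liner above, which is an assumption of this sketch: a fixed token position breaks on lines such as `SEN SLOPE =-1.07` (splitting on a single space yields only three tokens), so the loop splits on runs of spaces and `=` and keeps the last token instead. The sample files and folder name are made up for illustration.

```r
# Sample files for illustration; ten filler lines put the value on line 11
demo_dir <- file.path(tempdir(), "sen_loop_demo")
dir.create(demo_dir, showWarnings = FALSE)
padding <- rep("filler", 10)
writeLines(c(padding, "SEN SLOPE =  .24"), file.path(demo_dir, "a.txt"))
writeLines(c(padding, "SEN SLOPE =-1.07"), file.path(demo_dir, "b.txt"))

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
statistic <- numeric(length(files))

for (i in seq_along(files)) {
  char_vector <- readLines(files[i])
  # Split on runs of spaces and "=" so "=-1.07" and "=  .24" both work,
  # then keep the last token rather than a fixed position
  tokens <- strsplit(char_vector[11], "[ =]+")[[1]]
  statistic[i] <- as.numeric(tail(tokens, 1))
}

data.frame(file = basename(files), statistic = statistic)
```

This still relies on the value sitting on line 11 of every file, as the answer assumes.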

#4

Step 1: Save the full file paths in a single variable:

## dataDir is assumed to hold the path to the folder containing the text files
fileNames <- dir(dataDir,full.names=TRUE)

Step 2: Let's read and process one of the files, and make sure it gives the correct result:

data.frame(
  file=basename(fileNames[1]), 
  SEN.SLOPE= as.numeric(tail(
    strsplit(grep('SEN SLOPE',readLines(fileNames[1]),value=TRUE),"=")[[1]],1))
  )

Step 3: Do this for all the files in fileNames:

do.call(
  rbind,
  lapply(fileNames, 
         function(fileName) data.frame(
           file=basename(fileName), 
           SEN.SLOPE= as.numeric(tail(
             strsplit(grep('SEN SLOPE',
                           readLines(fileName),value=TRUE),"=")[[1]],1)
             )
           )
         )
  )
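
If some files are empty or lack the SEN SLOPE line, the `[[1]]` subscript above fails because grep returns nothing to split. A hedged sketch of one way to guard against that is shown here; `sen_row` is a hypothetical helper and the demo files are invented for the example.

```r
# Hypothetical helper: one-row data frame per file, with NA when the
# "SEN SLOPE" line is absent (for example, in an empty file)
sen_row <- function(fileName) {
  hit <- grep("SEN SLOPE", readLines(fileName), value = TRUE)
  slope <- if (length(hit) == 0) {
    NA_real_
  } else {
    as.numeric(tail(strsplit(hit[1], "=")[[1]], 1))
  }
  data.frame(file = basename(fileName), SEN.SLOPE = slope)
}

# Throwaway inputs: one normal file and one empty file
demo_dir <- file.path(tempdir(), "sen_guard_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("SEN SLOPE = -.11", file.path(demo_dir, "good.txt"))
file.create(file.path(demo_dir, "empty.txt"))

# Same do.call(rbind, lapply(...)) pattern as in Step 3
fileNames <- dir(demo_dir, full.names = TRUE)
result <- do.call(rbind, lapply(fileNames, sen_row))
result
```

Rows with NA then mark the files that need a closer look, instead of the whole run stopping with an error.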

Hope this helps!!
