Split a large file based on a column value (Linux)

I want to split a large file (185 million records) into multiple files based on the value of one column. The file is a .dat file, and the columns are delimited by ^A (\u0001).

The file content looks like this:

194^A1^A091502^APR^AKIMBERLY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A1^A091502^APR^AJOHN^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A^A091502^APR^AASHLEY^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A3^A091502^APR^APETER^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A
194^A4^A091502^APR^AJOE^APO83^A^A^A^A0183^AUSA^A^A^A^A^A^A^A^A

Now I want to split the file based on the value of the second column. In the third row, the second column is empty. All rows with an empty second column should go into one file, and all remaining rows should go into another.

Please help me with this. I searched Google, and it seems awk is the tool to use here.

Regards, Shankar

1 solution

#1

With awk:

awk -F '\x01' '$2 == "" { print > "empty.dat"; next } { print > "normal.dat" }' filename

The file names can be chosen arbitrarily, of course. print > "file" prints the current record to a file named "file".
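
To see the split in action, here is a small sketch on a three-row sample. The name sample.dat and the shortened rows are made up for illustration. The delimiter is passed as a literal ^A byte through a shell variable instead of the '\x01' escape, on the assumption that not every awk implementation handles hex escapes in -F.

```shell
# Hypothetical sample data: three shortened rows, fields separated by ^A (octal 001).
d=$(printf '\001')
printf '%s\n' \
  "194${d}1${d}091502${d}PR${d}KIMBERLY" \
  "194${d}${d}091502${d}PR${d}ASHLEY" \
  "194${d}3${d}091502${d}PR${d}PETER" > sample.dat

# Same logic as the one-liner above: rows with an empty second field go to
# empty.dat, everything else goes to normal.dat.
awk -F "$d" '$2 == "" { print > "empty.dat"; next } { print > "normal.dat" }' sample.dat

# Show the row counts per output file.
wc -l empty.dat normal.dat
```

With this sample, empty.dat should receive the ASHLEY row and normal.dat the other two.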

Addendum, in response to a comment: removing the column is a little trickier but certainly feasible. I'd use:

awk -F '\x01' 'BEGIN { OFS = FS } { fname = $2 == "" ? "empty.dat" : "normal.dat"; for(i = 2; i < NF; ++i) $i = $(i + 1); --NF; print > fname }' filename

This works as follows:

BEGIN {                                          # output field separator is
  OFS = FS                                       # the same as input field
                                                 # separator, so that the
                                                 # rebuilt lines are formatted
                                                 # just like they came in
}
{
  fname = $2 == "" ? "empty.dat" : "normal.dat"  # choose file name

  for(i = 2; i < NF; ++i) {                      # set all fields after the
    $i = $(i + 1)                                # second back one position
  }

  --NF                                           # let awk know the last field
                                                 # is not needed in the output

  print > fname                                  # then print to file.
}
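
The script above drops the last field by decrementing NF. gawk supports this, but its manual cautions that some awk versions do not rebuild the record when NF is decremented. As a hedged alternative, the output line can be assembled by hand from every field except the second. This is only a sketch: sample.dat and its shortened rows are hypothetical, and the delimiter is again passed as a literal ^A byte.

```shell
# Hypothetical sample data: two shortened rows, fields separated by ^A (octal 001).
d=$(printf '\001')
printf '%s\n' \
  "194${d}1${d}091502${d}PR${d}KIMBERLY" \
  "194${d}${d}091502${d}PR${d}ASHLEY" > sample.dat

awk -F "$d" '
{
  # Choose the output file from the second field before discarding it.
  fname = ($2 == "") ? "empty.dat" : "normal.dat"

  # Rebuild the line from field 1 and fields 3..NF, skipping field 2.
  line = $1
  for (i = 3; i <= NF; ++i)
    line = line FS $i

  print line > fname
}' sample.dat
```

Because the record itself is never modified, OFS does not need to be set here. The trade-off is one string concatenation per field.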
