What is a fast and succinct way to match lines from a text file that share the same first field?
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output, alternative:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx 0.5 GB.
There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.
5 solutions
#1
3
Here's a method where you only have to remember the previous line (it therefore requires the input file to be sorted, so that lines with the same key are adjacent):
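awk -F \| '
    # Same key as the previous line: the previous line belongs to a group.
    $1 == prev_key {print prev_line; matches ++}
    # Key changed: flush the last line of the finished group, then reset.
    $1 != prev_key {
        if (matches) print prev_line
        matches = 0
        prev_key = $1
    }
    {prev_line = $0}
    # Flush the last line of the file if it closed a group.
    END { if (matches) print prev_line }
' filename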
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Alternate output
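awk -F \| '
    # Same key as the previous line: emit the key once, then the previous value.
    $1 == prev_key {
        if (matches == 0) printf "%s", $1
        printf "%s%s", FS, prev_value
        matches ++
    }
    # Key changed: finish the open output line with the last value of the group.
    $1 != prev_key {
        if (matches) printf "%s%s\n", FS, prev_value
        matches = 0
        prev_key = $1
    }
    {prev_value = $2}
    END {if (matches) printf "%s%s\n", FS, prev_value}
' filename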
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
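Both commands above assume that lines sharing a key are adjacent. If the file is not already grouped that way, one option is to sort on the first field before filtering. The following is only a sketch: it assumes a `sort` that supports the stable `-s` flag (GNU and BSD sort do), and `match_first_field` is a hypothetical helper name, not part of the original answer.

```shell
# Hypothetical helper: print every line whose first |-separated field
# also appears on at least one other line. Reads stdin, writes stdout.
match_first_field() {
    # Sort on field 1 only; -s keeps lines with equal keys in input order.
    LC_ALL=C sort -t '|' -k1,1 -s |
    awk -F '|' '
        $1 == prev_key { print prev_line; matches++ }
        $1 != prev_key {
            if (matches) print prev_line
            matches = 0
            prev_key = $1
        }
        { prev_line = $0 }
        END { if (matches) print prev_line }
    '
}

# Example with keys out of order:
printf '%s\n' 'd|amet' 'b|ipsum' 'a|lorem' 'b|dolor' 'd|consectetur' | match_first_field
```

Note that `sort` has to read all of its input before it can emit anything, so this pipeline is not streaming in the strict sense; the awk stage itself still only remembers one line.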
#2
3
For fixed-width fields you can use uniq:
$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
If you don't have fixed-width fields, here are two awk solutions:
awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
#3
1
Using awk:
awk -F '|' '!($1 in a){a[$1]=$2; next} $1 in a{b[$1]=b[$1] FS a[$1] FS $2}
END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor
#4
1
This might work for you (GNU sed):
sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file
This reads two lines into the pattern space, then checks whether the keys in both lines are the same. If so, it removes the second key and repeats. If not, it checks whether the first line holds more than two fields; if it does, it prints that line and then deletes it, otherwise it just deletes the first line.
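The steps described above can be spelled out one command per line. This is only a commented restatement of the same GNU sed script, with `-E` in place of `-r` (GNU sed accepts both); `merge_by_key` is a hypothetical wrapper name used here for illustration.

```shell
# Hypothetical wrapper around the sed script; reads stdin, writes stdout.
merge_by_key() {
    sed -E '
        :a
        # Append the next input line to the pattern space (skipped on the last line).
        $!N
        # If the appended line starts with the same key, drop that key and join.
        s/^(([^|]*\|).*)\n\2/\1|/
        # A successful join means more lines may share this key: loop again.
        ta
        # Print the first line only if it now holds a key plus two or more values.
        /^([^\n|]*\|){2,}/P
        # Delete the first line and restart with whatever remains.
        D
    '
}

# Prints: b|ipsum|dolor
printf '%s\n' 'a|lorem' 'b|ipsum' 'b|dolor' 'c|sit' | merge_by_key
```

As with the sorted-input awk answers, this only joins lines whose matching keys are adjacent.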
#5
0
$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit