Removing lines with duplicate string keys from text using the command line

Date: 2022-06-02 07:38:39

I am trying to remove lines that contain a duplicated string key from a line-by-line text file. For example:

A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}
A {id: "x" p {id: "da" v: "i4"} on:faer"}
A {id: "y" p {id: "werw" v: "i4"} on:asee"}
A {id: "y" p {id: "werw" v: "i4"} on:asee"}

The output should contain no duplicated A id, which means it should be:

A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}

My problem is that I don't know how to sort and deduplicate on a substring only. I tried:

cat input.txt | grep 'A\s\{id:\s\"[^;]*\sp\s\{id:' | sort -u > output.txt

But this doesn't deduplicate on the substring; it only removes lines that are exactly identical to another line. So it only removed:

A {id: "y" p {id: "werw" v: "i4"} on:asee"}

which is one of the last two (identical) lines, but it didn't remove:

A {id: "y" p {id: "wse" v: "i4"} on:ue"}

which has a duplicate id but different content.

3 answers

#1


2  

An awk solution:

$ awk '!a[$3]++' file
A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}

Combining this with the matching from your grep command:

$ awk '$1=="A" && $2=="{id:" && $4=="p" && $5=="{id:" && !a[$3]++' file
A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}
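The one-liner `!a[$3]++` is terse, so here is a longer sketch of the same idea: remember every value of the third whitespace-separated field and print a line only the first time its value appears. The function name `dedupe_by_field3` and the array name `seen` are illustrative, not part of the original answer.

```shell
# Sketch: keep the first line seen for each value of field 3.
# dedupe_by_field3 is a hypothetical helper name.
dedupe_by_field3() {
  awk '
    {
      key = $3                 # e.g. "x" (the quotes are part of the field)
      if (!(key in seen)) {    # first time this key shows up
        seen[key] = 1          # remember it
        print                  # and keep the line
      }
    }
  ' "$@"
}
```

It could be used as `dedupe_by_field3 input.txt > output.txt`. Unlike a sort-based approach, this keeps the lines in their original order.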

#2


1  

The problem is that sort uses the entire string as key by default, so it would only eliminate identical lines.

Try changing

sort -u

to

sort -uk3,3

to eliminate duplicates where the key is the 3rd field. Fields are separated by whitespace. Note that the output is ordered by that key, not by the original line order.

-k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)

POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.

Reference.
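As a rough sketch of how this could be wrapped up, the function below (the name `dedupe_sorted_by_field3` is made up for illustration) sorts on field 3 only and keeps one line per key. Two assumptions are worth stating: `LC_ALL=C` is added here only to make the ordering predictable, and which of several lines sharing a key survives depends on the sort implementation (GNU sort documents that it outputs the first of a run of equal lines).

```shell
# Sketch: one line per distinct value of field 3, output ordered by that key.
# dedupe_sorted_by_field3 is a hypothetical helper name.
dedupe_sorted_by_field3() {
  # -k3,3 : the key starts and ends at field 3
  # -u    : output only one line for each distinct key
  LC_ALL=C sort -u -k3,3 "$@"
}
```

It could be used as `dedupe_sorted_by_field3 input.txt > output.txt`.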

#3


0  

A Perl solution:

perl -ne 'if (/\{id: "([^"]+)"/ and not exists $h{$1}) { $h{$1}++; print }'

It stores each matched id in a hash and prints a line only if its id was not already in the hash. Lines that do not match the pattern are not printed at all.
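The same regex-based idea can also be sketched in awk, for readers who would rather not depend on field positions. This is an illustrative equivalent, not the answerer's code: the function name `dedupe_by_first_id` is made up, and it assumes the id of interest is the first `{id: "..."` occurrence on the line.

```shell
# Sketch: deduplicate on the first {id: "..."} value found on each line.
# dedupe_by_first_id is a hypothetical helper name.
dedupe_by_first_id() {
  awk '
    match($0, /[{]id: "[^"]+"/) {
      # The match looks like {id: "x"; skip the 6 leading characters
      # ({id: ") and drop the closing quote to get the bare id.
      id = substr($0, RSTART + 6, RLENGTH - 7)
      if (!(id in seen)) {     # first occurrence of this id
        seen[id] = 1
        print
      }
    }
  ' "$@"
}
```

It could be used as `dedupe_by_first_id input.txt > output.txt`. As with the Perl one-liner, lines without a matching id are dropped.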
