Suppose I have a file similar to the following:
123
123
234
234
123
345
I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:
123 3
234 2
345 1
7 Answers
#1
573
Assuming there is one number per line:
sort <file> | uniq -c
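uniq -c prefixes each distinct line with its count, so for the sample input above this would print:
3 123
2 234
1 345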
You can use the more verbose --count flag too with the GNU version, e.g., on Linux:
sort <file> | uniq --count
#2
289
This will print duplicate lines only, with counts:
sort FILE | uniq -cd
or, with GNU long options (on Linux):
sort FILE | uniq --count --repeated
On BSD and OS X you have to use grep to filter out unique lines:
sort FILE | uniq -c | grep -v '^ *1 '
For the given example, the result would be:
3 123
2 234
If you want to print counts for all lines, including those that appear only once:
sort FILE | uniq -c
or, with GNU long options (on Linux):
sort FILE | uniq --count
For the given input, the output is:
3 123
2 234
1 345
To sort the output with the most frequent lines on top, you can do the following (to get all results):
sort FILE | uniq -c | sort -nr
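For the given input, this puts the most frequent value first:
3 123
2 234
1 345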
or, to get only duplicate lines, most frequent first:
sort FILE | uniq -cd | sort -nr
On OS X and BSD the final one becomes:
sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr
#3
59
To find and count duplicate lines in multiple files, you can try the following command:
sort <files> | uniq -c | sort -nr
or:
cat <files> | sort | uniq -c | sort -nr
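For example, with two concrete input files (the names here are just illustrative), the first form becomes:
sort file1.txt file2.txt | uniq -c | sort -nr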
#4
20
Via awk:
awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data
In the awk command 'dups[$1]++', the variable $1 holds the entire contents of column 1, and the square brackets denote array access. So, for the first column of each line in the data file, the node of the array named dups is incremented.
At the end, we loop over the dups array with num as the variable, printing each saved number first and then its duplicate count from dups[num].
Note that your input file has trailing spaces on some lines; if you clean those up, you can use $0 in place of $1 in the command above :)
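If you would rather not edit the file, a small variant of the same command (a sketch) strips the trailing whitespace inside awk, so that $0 can be used directly:
awk '{sub(/[ \t]+$/, ""); dups[$0]++} END{for (line in dups) print line, dups[line]}' data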
#5
9
On Windows, using Windows PowerShell, I used the command below to achieve this:
Get-Content .\file.txt | Group-Object | Select Name, Count
We can also use the Where-Object cmdlet to filter the result:
Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count
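To put the most frequent value first, as the sort -nr pipelines above do, a Sort-Object stage can be added (a sketch using the same file):
Get-Content .\file.txt | Group-Object | Sort-Object Count -Descending | Select Name, Count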
#6
7
Assuming you've got access to a standard Unix shell and/or Cygwin environment:
tr -s ' ' '\n' < yourfile | sort | uniq -d -c
^--space char
Basically: convert all space characters to line breaks, then sort the translated output and feed it to uniq to count the duplicate lines.
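If the file mixes tabs with spaces, a variant using the POSIX [:space:] character class (a sketch, assuming your tr supports character classes, as GNU tr does) would be:
tr -s '[:space:]' '\n' < yourfile | sort | uniq -d -c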
#7
0
If someone is looking for an online website that does a similar job:
http://www.kennistranslations.com/wordcount