I am trying to read a file and sort it by the number of occurrences of a particular field. Suppose I want to find the most frequently repeated date in a log file; I use uniq -c and sort the result in descending order, something like this:
uniq -c | sort -nr
This produces output like this:
809 23/Dec/2008:19:20
The first field, which is the count, is the problem for me. I want to get only the date from the above output, but I am not able to. I tried the cut command, like this:
uniq -c | sort -nr | cut -d' ' -f2
but this just prints blank space. Can someone please help me get only the date and chop off the count? I want only:
23/Dec/2008:19:20
Thanks
5 Solutions
#1
8
The count from uniq is preceded by spaces unless there are more than 7 digits in the count, so you need to do something like:
uniq -c | sort -nr | cut -c 9-
to get columns (character positions) 9 upwards. Or you can use sed:
uniq -c | sort -nr | sed 's/^.\{8\}//'
or:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
This second sed option is robust in the face of a repeat count of 10,000,000 or more; if you think that might be a problem, it is probably better than the cut alternative. And there are undoubtedly other options available too.
Caveat: the counts were determined by experimentation on Mac OS X 10.7.3 but using GNU uniq from coreutils 8.3. The BSD uniq -c produced 3 leading spaces before a single digit count. The POSIX spec says the output from uniq -c shall be formatted as if with:
printf("%d %s", repeat_count, line);
which would not have any leading blanks. Given this possible variance in output formats, the sed script with the [0-9] regex is the most reliable way of dealing with the variability in observed and theoretical output from uniq -c:
uniq -c | sort -nr | sed 's/^ *[0-9]* //'
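As a quick check of that claim, the following sketch feeds the sed expression three simulated uniq -c lines: one with wide padding like the question's output, one with the 3-space padding described for BSD, and one in the unpadded POSIX format. The sample lines and the strip_count function name are made up for illustration.

```shell
# Hypothetical helper: strip the leading count that uniq -c adds.
strip_count() {
    sed 's/^ *[0-9]* //'
}

# Simulated uniq -c output in three padding styles (sample data, not real logs).
gnu_style=$(printf '    809 23/Dec/2008:19:20\n' | strip_count)
bsd_style=$(printf '   9 23/Dec/2008:19:21\n' | strip_count)
posix_style=$(printf '10000000 23/Dec/2008:19:22\n' | strip_count)

printf '%s\n' "$gnu_style" "$bsd_style" "$posix_style"
```

All three lines should come out as the bare date, regardless of how much padding preceded the count.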
#2
4
Instead of cut -d' ' -f2, try
awk '{$1="";print}'
You may need to remove one more blank at the beginning:
awk '{$1="";print}' | sed 's/^.//'
or do it entirely with sed, preserving the original whitespace:
sed -r 's/^[^0-9]*[0-9]+//'
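To see why the extra sed step may be needed: assigning an empty string to $1 makes awk rebuild the record with its output field separator, so the result starts with a single space. A small sketch with an invented sample line:

```shell
sample='    809 23/Dec/2008:19:20'

# Blanking $1 rebuilds the record; the separator after the empty field remains.
blanked=$(printf '%s\n' "$sample" | awk '{$1=""; print}')

# Removing that single leading character yields the date alone.
trimmed=$(printf '%s\n' "$sample" | awk '{$1=""; print}' | sed 's/^.//')

# Brackets make the leading space visible.
printf '[%s]\n[%s]\n' "$blanked" "$trimmed"
```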
#3
2
An alternative solution is this:
uniq -c | sort -nr | awk '{print $1, $2}'
You can also easily print a single field; for example, print only $2 to get the date without the count.
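For the original question, printing only the second field drops the count. This is a minimal sketch that assumes the date contains no whitespace (awk splits on runs of blanks, so a value with embedded spaces would be cut short); the sample lines are invented.

```shell
# Simulated `uniq -c | sort -nr` output (sample data).
counted=$(printf '    809 23/Dec/2008:19:20\n     12 24/Dec/2008:07:05\n')

# awk ignores leading blanks when splitting fields, so $2 is the date.
dates=$(printf '%s\n' "$counted" | awk '{print $2}')

printf '%s\n' "$dates"
```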
#4
1
If you want to work with the count field downstream, the following command reformats it into a 'pipe friendly' tab-delimited format without the left padding:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/'
For the original task it is a bit of overkill, but after reformatting, cut can be used to remove the field, as the OP intended:
.. | sort | uniq -c | sed -r 's/^ +([0-9]+) /\1\t/' | cut -d $'\t' -f2-
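The -r flag and the \t escape in the replacement may not be supported by every sed implementation. Where that is a concern, a similar tab-delimited reformat can be sketched with awk instead. This is a swapped-in alternative, not the command from the answer above, and the count_to_tsv name and sample line are hypothetical.

```shell
# Hypothetical helper: turn "   809 date" into "809<TAB>date".
count_to_tsv() {
    awk '{
        count = $1
        sub(/^ *[0-9]+ /, "")   # drop padding, count, and one separator space
        printf "%s\t%s\n", count, $0
    }'
}

tsv=$(printf '    809 23/Dec/2008:19:20\n' | count_to_tsv)

# cut uses tab as its default delimiter, so no -d option is needed here.
date_only=$(printf '%s\n' "$tsv" | cut -f2-)

printf '%s\n' "$tsv" "$date_only"
```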
#5
1
Add tr -s to the pipe chain to "squeeze" multiple spaces into one space delimiter:
uniq -c | tr -s ' ' | cut -d ' ' -f3
tr is very useful in some obscure places. Unfortunately it doesn't get rid of the first leading space, hence the -f3.
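The field numbering can be checked against a simulated line. The sketch assumes the count is padded with at least one leading space; with the unpadded POSIX format the date would be field 2 instead.

```shell
sample='    809 23/Dec/2008:19:20'

# After squeezing, one leading space remains, so field 1 is empty,
# field 2 is the count, and field 3 is the date.
squeezed=$(printf '%s\n' "$sample" | tr -s ' ')
date_only=$(printf '%s\n' "$squeezed" | cut -d ' ' -f3)

# Brackets make the remaining leading space visible.
printf '[%s]\n[%s]\n' "$squeezed" "$date_only"
```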