unix sort -n -t"," gives unexpected results

Time: 2022-09-06 18:50:52

unix numeric sort gives strange results, even when I specify the delimiter.

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

For this example, sort gives the same result regardless of whether you specify the delimiter. I know that if I set LC_ALL=C, sort gives the expected behavior again. But I do not understand why the default environment settings, shown below, would make this happen.

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
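
For comparison, the LC_ALL=C workaround mentioned above can be sketched as follows. This is only a minimal sketch: it recreates example.csv in a temporary directory (the path and the variable names are assumptions for illustration) and overrides the locale for a single command. Under the C locale the comma is not treated as part of a number, so the numeric comparison stops at the leading integer.

```shell
# Recreate the example input in a scratch directory (illustrative path).
tmpdir=$(mktemp -d)
cat > "$tmpdir/example.csv" <<'EOF'
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF

# Override the locale for this one command only; the settings printed by
# `locale` stay as they are for the rest of the shell session.
c_sorted=$(LC_ALL=C sort -n -t, "$tmpdir/example.csv")
printf '%s\n' "$c_sorted"

rm -rf "$tmpdir"
```

With the override in place, the lines starting with 12 come out ahead of the lines starting with 58, 59, and 60.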

I've read in many other questions (e.g. here, here, and here) how to avoid this behavior in sort, but it still seems incredibly weird and unpredictable, and it has caused me a week of heartache. Can someone explain why sort with the default environment settings on Mac OS X (10.8.5) behaves this way? In other words: what is sort doing (with the locale variables set to en_US.UTF-8) to get that result?

I'm using

 sort 5.93                        November 2005

 $ type sort
 sort is /usr/bin/sort

UPDATE

I've discussed this on the GNU coreutils list and now understand why sort, with the default English Unicode locale settings, gave the output it did. In that locale, the comma character "," is considered numeric (so as to allow commas as thousands (or e.g. hundreds) separators), and sort defaults to "being greedy" when it interprets a line. As a result, it read the example numbers as approximately

581.491...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...

This was not what I had intended, and chepner is right: to get the result I actually want, I need to tell sort to key on only the first field. By default, sort interprets more of the line as the key, rather than just the first field.

This behavior of sort is discussed in the GNU coreutils FAQ and is further specified in the POSIX description of sort.

As Eric Blake on the GNU coreutils list put it, if the field separator is also a numeric character (which a comma is), then "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields."
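
One way to check this reading is to delete the commas by hand and sort what is left numerically under the C locale, where no thousands separator is recognized. The sketch below does that; the tr step and the variable names are illustrations added here, not something taken from the list discussion, and the temporary file is an assumption.

```shell
# Recreate the example input in a scratch directory (illustrative path).
tmpdir=$(mktemp -d)
cat > "$tmpdir/example.csv" <<'EOF'
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF

# Drop the commas so each line becomes the single number that sort "saw",
# e.g. 58,1.49... -> 581.49... and 12,13.08... -> 1213.08...
# LC_ALL=C ensures nothing else is treated as a thousands separator.
greedy_order=$(tr -d ',' < "$tmpdir/example.csv" | LC_ALL=C sort -n)
printf '%s\n' "$greedy_order"

rm -rf "$tmpdir"
```

The 58x, 59x, and 60x values stay ahead of the 12xx values, which is the same ordering as the surprising output in the question.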

3 Solutions

#1


7  

I'm not sure this is entirely correct, but it's close.

sort -n -t, will try to sort numerically by the given key(s). In this case, the key is a tuple consisting of an integer and a float. Such tuples cannot be sorted numerically.

If you explicitly specify which single keys to sort on with

sort -k1,1n -k2,2n -t,

it should work. Now you are explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).

I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.
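
A minimal sketch of this suggestion, run against the example data from the question (recreated here in a temporary file, which is an assumption for illustration, as are the variable names). With -k1,1n the first key ends at the end of the first field, so the comma can no longer be absorbed into the number.

```shell
# Recreate the example input in a scratch directory (illustrative path).
tmpdir=$(mktemp -d)
cat > "$tmpdir/example.csv" <<'EOF'
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
EOF

# Key 1 is field 1 only, compared numerically; ties are broken by
# key 2, which is field 2 only, also compared numerically.
keyed=$(sort -t, -k1,1n -k2,2n "$tmpdir/example.csv")
printf '%s\n' "$keyed"

rm -rf "$tmpdir"
```

The lines beginning with 12 now sort first, and within each first-field value the second column is in ascending numeric order.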

#2


1  

Use sort --debug to find out what's going on. I've used that to explain your issue in detail at: http://lists.gnu.org/archive/html/coreutils/2013-10/msg00004.html
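
A short sketch of what that can look like in practice. --debug is an extension found in newer GNU coreutils releases, and an older sort such as the 5.93 mentioned in the question may not have it, so the snippet probes for support first (the debug_supported variable is just an illustrative name). In the annotated output, the rows of underscores beneath each input line mark the characters that sort used as the key.

```shell
# --debug is not part of POSIX sort; probe for it before relying on it.
if printf '1\n' | sort --debug >/dev/null 2>&1; then
    debug_supported=yes
    # Two lines from the question; compare where the underscores stop
    # with and without an explicit -k1,1 to see which part is the key.
    printf '58,1.49270399401\n12,13.080339685\n' | sort -n -t, --debug
    printf '58,1.49270399401\n12,13.080339685\n' | sort -n -t, -k1,1 --debug
else
    debug_supported=no
    echo "this sort does not support --debug" >&2
fi
```

How far the underscores extend in the first command depends on the locale in effect, which is exactly the thing worth inspecting here.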

#3


0  

If you use

cat example.csv | sort

instead of

cat example.csv | sort -n --field-separator=,

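then it gives the correct output for this example. This is a plain lexical sort, though, so it only works because the numbers in this example happen to have matching widths (a line starting with 9, would sort after one starting with 12,). Hope this is helpful to you.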

Note: I tested with "sort (GNU coreutils) 7.4"
