我怎么能按字母顺序排序?

Here is the scenario.

这是场景。

I am given an array 'A' of integers. The size of the array is not fixed. The function that I am supposed to write may be called once with an array of just a few integers while another time, it might even contain thousands of integers. Additionally, each integer need not contain the same number of digits.

我给了一个数组'A'的整数。数组的大小不是固定的。我应该写的函数可以用一个整数数组来调用，而另一次，它甚至可能包含数千个整数。此外，每个整数不需要包含相同数量的数字。

I am supposed to 'sort' the numbers in the array such that the resulting array has the integers ordered in a lexicographic manner (i.e they are sorted based on their string representations. Here "123" is the string representation of 123). Please note that the output should contain integers only, not their string equivalents.

我应该对数组中的数字进行排序，这样得到的数组将以字典式的方式排序(I)。它们根据字符串表示形式进行排序。这里的“123”是123的字符串表示。请注意，输出应该只包含整数，而不是字符串等价物。

For example: if the input is:

例如:如果输入是:

[ 12 | 2434 | 23 | 1 | 654 | 222 | 56 | 100000 ]

[12 | 2434 | 23 | 654 | 222 | 56 | 100000]

Then the output should be:

那么输出应该是:

[ 1 | 100000 | 12 | 222 | 23 | 2434 | 56 | 654 ]

[1 | 100000 | 12 | 222 | 23 | 2434 | 56 | 654]

My initial approach: I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits (this was the messy step as it involved tracking etc making the solution very inefficient) and then did radix sort. Finally, I removed the padded zeros, converted the strings back to their integers and put them in the resulting array. This was a very inefficient solution.

我的初始方法是:我将每个整数转换为字符串格式，然后将0添加到它的右边，使所有的整数都包含相同数量的数字(这是一个混乱的步骤，因为它涉及到跟踪等问题，使得解决方案非常低效)，然后进行了基数排序。最后，我删除了padd zeros，将字符串转换回它们的整数，并将它们放在结果数组中。这是一个非常低效的解决方案。

I've been led to believe that the solution doesn't need padding etc and there is a simple solution where you just have to process the numbers in some way (some bit processing?) to get the result.

我已经被引导去相信解决方案不需要填充等等，有一个简单的解决方案，你只需要用某种方式处理这些数字(一些处理?)来得到结果。

What is the space-wise most efficient solution you can think of? Time-wise?

你能想到的空间最有效的解决方案是什么?时间吗?

If you are giving code, I'd prefer Java or pseudo-code. But if that doesn't suit you, any such language should be fine.

如果您正在提供代码，我更喜欢Java或伪代码。但如果这对你不合适的话，任何这样的语言都可以。

14 个解决方案

#1

Executable pseudo-code (aka Python): thenumbers.sort(key=str). Yeah, I know that using Python is kind of like cheating -- it's just too powerful;-). But seriously, this also means: if you can sort an array of strings lexicographically, as Python's sort intrinsically can, then just make the "key string" out of each number and sort that auxiliary array (you can then reconstruct the desired numbers array by a str->int transformation, or by doing the sort on the indices via indirection, etc etc); this is known as DSU (Decorate, Sort, Undecorate) and it's what the key= argument to Python's sort implements.

可执行伪代码(又名Python): thenumbers.sort(key=str)。是的，我知道使用Python有点像作弊——它太强大了;-)。但是说真的,这也意味着:如果你能一个字符串数组按字母排序,Python本质上是可以,那就使每个数字的字符串“关键”和那种辅助数组(您可以重构所需的数字数组的str - > int转换,或通过那种通过间接指标,等等);这就是所谓的DSU(装饰、排序、取消装饰)，它是Python的类实现的关键参数。

In more detail (pseudocode):

详细(伪代码):

allocate an array of char** aux as long as the numbers array
在数字数组中分配一个char** *数组。
for i from 0 to length of numbers-1, aux[i]=stringify(numbers[i])
i从0到numbers-1, aux[i]=stringify(数字[i])
allocate an array of int indices of the same length
分配相同长度的int索引数组。
for i from 0 to length of numbers-1, indices[i]=i
对于i从0到长度s-1，指数[i]=i。
sort indices, using as cmp(i,j) strcmp(aux[i],aux[j])
排序指标，用cmp(i,j) strcmp(aux[i]，aux[j])
allocate an array of int results of the same length
分配相同长度的int结果数组。
for i from 0 to length of numbers-1, results[i]=numbers[indices[i]]
i从0到number -1的长度，结果[i]=数字[索引[i]]
memcpy results over numbers
memcpy结果在数字
free every aux[i], and also aux, indices, results
免费的每一个aux[i]，以及aux，索引，结果。

#2

Since you mentioned Java is the actual language in question:

既然你提到了Java，那就是实际的语言:

You don't need to convert to and from strings. Instead, define your own comparator and use that in the sort.

您不需要转换为字符串或字符串。相反，定义您自己的比较器并在排序中使用它。

Specifically:

具体地说:

Comparator<Integer> lexCompare = new Comparator<Integer>(){
   int compareTo( Integer x, Integer y ) {
      return x.toString().compareTo( y.toString() );
   }
};

Then you can sort the array like this:

然后你可以对数组进行排序:

int[] array = /* whatever */;
Arrays.sort( array, lexCompare );

(Note: The int/Integer mismatch works automatically through auto-boxing)

(注:int/Integer不匹配自动通过自动装箱)

#3

I'd just turn them into strings, and then sort then sort using strcmp, which does lex comparisons.

我将它们转换成字符串，然后排序然后使用strcmp，它进行了lex比较。

Alternatively you can write a "lexcmp" function that compares two numbers using % 10 and /10 but that's basically the same thing as calling atoi many times, so not a good idea.

或者，您可以编写一个“lexcmp”函数，它使用% 10和/10来比较两个数字，但是这基本上与调用atoi相同，所以不是个好主意。

#4

The actual sorting can be done by any algorithm you like. The key to this problem is finding the comparison function that will properly identify which numbers should be "less than" others, according to this scheme:

实际的排序可以通过任何你喜欢的算法来完成。这个问题的关键在于找到一个比较函数，它能正确地识别出哪个数字应该小于其他数字，根据这个方案:

bool isLessThan(int a, int b)
{
    string aString = ToString(a);
    string bString = ToString(b);

    int charCount = min(aString.length(), bString.length())
    for (charIndex = 0; charIndex < charCount; charIndex++)
    {
        if (aString[charIndex] < bString[charIndex]) { return TRUE; }
    }

    // if the numbers are of different lengths, but identical
    // for the common digits (e.g. 123 and 12345)
    // the shorter string is considered "less"
    return (aString.length() < bString.length());
}

#5

My temptation would be to say that the int to string conversion would happen in the comparitor code rather than in bulk. Although this may be more elegant from a code-perspective I'd have to say that the execution effort would be greater as each number may be compared several times.

我的诱惑是说，int到string的转换将发生在comparitor代码中，而不是批量。虽然从代码的角度来看，这可能更加优雅，但我不得不说，由于每个数字可能会被比较几次，所以执行工作量会更大。

I'd be inclined to create a new array containing both the int and string representation (not sure that you need to pad the string versions for the string comparison to produce the order you've given), sort that on the string and then copy the int values back to the original array.

我倾向于创建一个新的数组包含整型和字符串表示(不确定你需要垫字符串比较的字符串版本生产订单你给),弦上的那种,然后复制回原始数组的int值。

I can't think of a smart mathematically way of sorting this as by your own statement you want to sort lexicographically so you need to transform the numbers to strings to do that.

我想不出一种聪明的数学方法来把它排序，就像你想要用字典来排序的那样你需要把数字转换成字符串来做。

#6

You definitely don't need to pad the result. It will not change the order of the lexicographical compare, it will be more error prone, and it will just waste CPU cycles. The most "space-wise" efficient method would be to convert the numbers to strings when they are compared. That way, you would not need to allocate an additional array, the numbers would be compared in place.

你肯定不需要填充结果。它不会改变字典的顺序，它会更容易出错，而且会浪费CPU周期。最“空间化”的有效方法是将数字转换为字符串。这样，您就不需要分配额外的数组，这些数字将会被比较。

You can get a reasonably good implementation quickly by just converting them to strings as needed. Stringifying a number isn't particularly expensive and, since you are only dealing with two strings at a time, it is quite likely that they will remain in the CPU cache at all times. So the comparisons will be much faster than the case where you convert the entire array to strings since they will not need to be loaded from main memory into the cache. People tend to forget that a CPU has a cache and that algorithms which do a lot of their work in a small local area of memory will benefit greatly from the much faster cache access. On some architectures, the cache is so much faster than the memory that you can do hundreds of operations on your data in the time it would have taken you to load it from main memory. So doing more work in the comparison function could actually be significantly faster than pre-processing the array. Especially if you have a large array.

只需根据需要将它们转换为字符串，就可以很快地得到一个相当好的实现。对一个数字进行字符串化并不是特别昂贵，因为您每次只处理两个字符串，所以它们很可能始终处于CPU缓存中。因此，比较将比将整个数组转换为字符串的情况要快得多，因为它们不需要从主内存加载到缓存中。人们倾向于忘记一个CPU有一个缓存，而在一个小的本地内存区域中做大量工作的算法将大大受益于更快的缓存访问。在某些架构上，缓存比内存快得多，您可以在数据上执行数百次操作，而这些操作将占用您从主内存加载数据的时间。因此，在比较函数中做更多的工作实际上比预先处理数组要快得多。特别是如果你有一个大的数组。

Try doing the string serialization and comparison in a comparator function and benchmark that. I think it will be a pretty good solution. Example java-ish pseudo-code:

尝试在比较器函数中执行字符串序列化和比较，并对其进行基准测试。我认为这将是一个很好的解决方案。示例java-ish伪代码:

public static int compare(Number numA, Number numB) {
    return numA.toString().compare(numB.toString());
}

I think that any fancy bit wise comparisons you could do would have to be approximately equivalent to the work involved in converting the numbers to strings. So you probably wouldn't get significant benefit. You can't just do a direct bit for bit comparison, that would give you a different order than lexicographical sort. You'll need to be able to figure out each digit for the number anyway, so it is most straightforward to just make them strings. There may be some slick trick, but every avenue I can think of off the top of my head is tricky, error-prone, and much more work than it is worth.

我认为，你所能做的任何精妙的比较，都必须相当于把数字转换成字符串的工作。所以你可能得不到显著的好处。你不能只做一个直接的比特比较，这会给你一个不同于字典排序的顺序。你需要能够计算出数字的每一个数字，所以最直接的方法就是让它们成为字符串。也许会有一些巧妙的技巧，但我能想到的每条路都是很微妙的，容易出错，而且比它的价值要多得多。

#7

Pseudocode:

伪代码:

sub sort_numbers_lexicographically (array) {
    for 0 <= i < array.length:
        array[i] = munge(array[i]);
    sort(array);  // using usual numeric comparisons
    for 0 <= i < array.length:
        array[i] = unmunge(array[i]);
}

So, what are munge and unmunge?

那么，什么是munge和unmunge?

munge is different depending on the integer size. For example:

munge的大小取决于整数的大小。例如:

sub munge (4-bit-unsigned-integer n) {
    switch (n):
        case 0:  return 0
        case 1:  return 1
        case 2:  return 8
        case 3:  return 9
        case 4:  return 10
        case 5:  return 11
        case 6:  return 12
        case 7:  return 13
        case 8:  return 14
        case 9:  return 15
        case 10:  return 2
        case 11:  return 3
        case 12:  return 4
        case 13:  return 5
        case 14:  return 6
        case 15:  return 7
}

Esentially what munge is doing is saying what order 4 bit integers come in when sorted lexigraphically. I'm sure you can see that there is a pattern here --- I didn't have to use a switch --- and that you can write a version of munge that handles 32 bit integers reasonably easily. Think about how you would write versions of munge for 5, 6, and 7 bit integers if you can't immediately see the pattern.

munge所做的是说，在按字母顺序进行排序时，会出现什么顺序4位整数。我确信你可以看到这里有一个模式——我不用转换——而且你可以编写一个版本的munge，它可以很容易地处理32位整数。如果您不能立即看到模式，请考虑如何为5、6和7位整数编写版本。

unmunge is the inverse of munge.

芒格是芒格的逆。

So you can avoid converting anything to a string --- you don't need any extra memory.

因此，您可以避免将任何内容转换为字符串——您不需要任何额外的内存。

#8

If you want to try a better preprocess-sort-postprocess, then note that an int is at most 10 decimal digits (ignoring signed-ness for the time being).

如果您想尝试更好的预处理-sort-postprocess，那么请注意，int数最多为10位小数(暂时忽略了signed-ness)。

So the binary-coded-decimal data for it fits in 64 bits. Map digit 0->1, 1->2 etc, and use 0 as a NUL terminator (to ensure that "1" comes out less than "10"). Shift each digit in turn, starting with the smallest, into the top of a long. Sort the longs, which will come out in lexicographical order for the original ints. Then convert back by shifting digits one at a time back out of the top of each long:

所以二进制编码的十进制数据适合64位。地图数字0->1,1->2等，并使用0作为NUL终结者(以确保“1”小于“10”)。依次将每个数字依次移动，从最小的开始，一直到最长的。排序的长度，它将在字典顺序为原始的int。然后，在每一段时间的顶部，通过移动的数字转换回来:

uint64_t munge(uint32_t i) {
    uint64_t acc = 0;
    while (i > 0) {
        acc = acc >> 4;
        uint64_t digit = (i % 10) + 1;
        acc += (digit << 60);
        i /= 10;
    }
    return acc;
}

uint32_t demunge(uint64_t l) {
    uint32_t acc = 0;
    while (l > 0) {
        acc *= 10;
        uint32_t digit = (l >> 60) - 1;
        acc += digit;
        l << 4;
    }
}

Or something like that. Since Java doesn't have unsigned ints, you'd have to modify it a little. It uses a lot of working memory (twice the size of the input), but that's still less than your initial approach. It might be faster than converting to strings on the fly in the comparator, but it uses more peak memory. Depending on the GC, it might churn its way through less memory total, though, and require less collection.

之类的。因为Java没有未签名的ints，所以您需要稍微修改一下。它使用了大量的工作内存(两倍于输入的大小)，但这仍然比最初的方法要小。它可能比在比较器中转换为字符串更快，但是它使用了更多的峰值内存。不过，根据GC的不同，它可能会在内存总量减少的情况下大量产生，并且需要较少的收集。

#9

If all the numbers are less than 1E+18, you could cast each number to UINT64, multiply by ten and add one, and then multiply by ten until they are at least 1E+19. Then sort those. To get back the original numbers, divide each number by ten until the last digit is non-zero (it should be one) and then divide by ten once more.

如果所有的数都小于1E+18，你可以把每个数投到UINT64，乘以10再加1，然后乘以10，直到它们至少是1E+19。然后这些。为了得到原始的数字，把每个数除以10，直到最后一个数字是非零(应该是1)，然后再除以10。

#10

The question doesn't indicate how to treat negative integers in the lexicographic collating order. The string-based methods presented earlier typically will sort negative values to the front; eg, { -123, -345, 0, 234, 78 } would be left in that order. But if the minus signs were supposed to be ignored, the output order should be { 0, -123, 234, -345, 78 }. One could adapt a string-based method to produce that order by somewhat-cumbersome additional tests.

这个问题并没有说明如何处理字典排序顺序中的负整数。前面介绍的基于字符串的方法通常会将负值排序到前面;{-123，-345,0,234,78}将按这个顺序。但是如果减号应该被忽略，输出顺序应该是{0，-123,234，-345,78}。我们可以采用基于字符串的方法，通过一些繁琐的附加测试来生成该命令。

It may be simpler, in both theory and code, to use a comparator that compares fractional parts of common logarithms of two integers. That is, it will compare the mantissas of base 10 logarithms of two numbers. A logarithm-based comparator will run faster or slower than a string-based comparator, depending on a CPU's floating-point performance specs and on quality of implementations.

在理论和代码中使用比较器，可以比较两个整数的常用对数的小数部分。也就是说，它将对两个数的10对数的曼蒂萨进行比较。一个基于对数的比较器将比基于字符串的比较器运行得更快或更慢，这取决于CPU的浮点性能规格和实现的质量。

The java code shown at the end of this answer includes two logarithm-based comparators: alogCompare and slogCompare. The former ignores signs, so would produce { 0, -123, 234, -345, 78 } from { -123, -345, 0, 234, 78 }.

在这个答案的末尾显示的java代码包括两个基于对数的比较器:alogCompare和slogCompare。前者忽略符号，因此产生{0，-123,234，-345,78}从{-123，-345,0,234,78}。

The number-groups shown next are the output produced by the java program.

下面显示的数字组是java程序生成的输出。

The “dar rand” section shows a random-data array dar as generated. It reads across and then down, 5 elements per line. Note, arrays sar, lara, and lars initially are unsorted copies of dar.

“dar rand”部分显示了生成的随机数据阵列dar。它读过，然后向下，每一行5个元素。注意，数组sar, lara和lars最初是未排序的dar副本。

The “dar sort” section is dar after sorting via Arrays.sort(dar);.

“dar排序”部分是通过Arrays.sort(dar)排序的。

The “sar lex” section shows array sar after sorting with Arrays.sort(sar,lexCompare);, where lexCompare is similar to the Comparator shown in Jason Cohen's answer.

“sar lex”部分在排序后显示了数组sar。sort(sar,lexCompare);在这里lexCompare与Jason Cohen的答案中所示的比较器相似。

The “lar s log” section shows array lars after sorting by Arrays.sort(lars,slogCompare);, illustrating a logarithm-based method that gives the same order as do lexCompare and other string-based methods.

在Arrays.sort(lars,slogCompare)排序后，“lar slog”部分显示了数组lars，说明了一个基于对数的方法，它给出了与lexCompare和其他基于字符串的方法相同的顺序。

The “lar a log” section shows array lara after sorting by Arrays.sort(lara,alogCompare);, illustrating a logarithm-based method that ignores minus signs.

“lar alog”部分显示了数组lara排序后的数组。sort(lara,alogCompare);说明了一个忽略负号的对数方法。

dar rand    -335768    115776     -9576    185484     81528
dar rand      79300         0      3128      4095    -69377
dar rand     -67584      9900    -50568   -162792     70992

dar sort    -335768   -162792    -69377    -67584    -50568
dar sort      -9576         0      3128      4095      9900
dar sort      70992     79300     81528    115776    185484

 sar lex    -162792   -335768    -50568    -67584    -69377
 sar lex      -9576         0    115776    185484      3128
 sar lex       4095     70992     79300     81528      9900

lar s log    -162792   -335768    -50568    -67584    -69377
lar s log      -9576         0    115776    185484      3128
lar s log       4095     70992     79300     81528      9900

lar a log          0    115776   -162792    185484      3128
lar a log    -335768      4095    -50568    -67584    -69377
lar a log      70992     79300     81528     -9576      9900

Java code is shown below.

Java代码如下所示。

// Code for "How can I sort numbers lexicographically?" - jw - 2 Jul 2014
import java.util.Random;
import java.util.Comparator;
import java.lang.Math;
import java.util.Arrays;
public class lex882954 {
// Comparator from Jason Cohen's answer
    public static Comparator<Integer> lexCompare = new Comparator<Integer>(){
        public int compare( Integer x, Integer y ) {
            return x.toString().compareTo( y.toString() );
        }
    };
// Comparator that uses "abs." logarithms of numbers instead of strings
    public static Comparator<Integer> alogCompare = new Comparator<Integer>(){
        public int compare( Integer x, Integer y ) {
            Double xl = (x==0)? 0 : Math.log10(Math.abs(x));
            Double yl = (y==0)? 0 : Math.log10(Math.abs(y));
            Double xf=xl-xl.intValue();
            return xf.compareTo(yl-yl.intValue());
        }
    };
// Comparator that uses "signed" logarithms of numbers instead of strings
    public static Comparator<Integer> slogCompare = new Comparator<Integer>(){
        public int compare( Integer x, Integer y ) {
            Double xl = (x==0)? 0 : Math.log10(Math.abs(x));
            Double yl = (y==0)? 0 : Math.log10(Math.abs(y));
            Double xf=xl-xl.intValue()+Integer.signum(x);
            return xf.compareTo(yl-yl.intValue()+Integer.signum(y));
        }
    };
// Print array before or after sorting
    public static void printArr(Integer[] ar, int asize, String aname) {
        int j;
        for(j=0; j < asize; ++j) {
            if (j%5==0)
                System.out.printf("%n%8s ", aname);
            System.out.printf(" %9d", ar[j]);
        }
        System.out.println();
    }
// Main Program -- to test comparators
    public static void main(String[] args) {
        int j, dasize=15, hir=99;
        Random rnd = new Random(12345);
        Integer[] dar = new Integer[dasize];
        Integer[] sar = new Integer[dasize];
        Integer[] lara = new Integer[dasize];
        Integer[] lars = new Integer[dasize];

        for(j=0; j < dasize; ++j) {
            lara[j] = lars[j] = sar[j] = dar[j] = rnd.nextInt(hir) * 
                rnd.nextInt(hir) * (rnd.nextInt(hir)-44);
        }
        printArr(dar, dasize, "dar rand");
        Arrays.sort(dar);
        printArr(dar, dasize, "dar sort");
        Arrays.sort(sar, lexCompare);
        printArr(sar, dasize, "sar lex");
        Arrays.sort(lars, slogCompare);
        printArr(lars, dasize, "lar s log");
        Arrays.sort(lara, alogCompare);
        printArr(lara, dasize, "lar a log");
    }
}

#11

If you're going for space-wise efficiency, I'd try just doing the work in the comparison function of the sort

如果你想要空间上的效率，我会尝试在排序的比较函数中做这项工作。

int compare(int a, int b) {
   // convert a to string
   // convert b to string
   // return -1 if a < b, 0 if they are equal, 1 if a > b
}

If it's too slow (it's slower than preprocessing, for sure), keep track of the conversions somewhere so that the comparison function doesn't keep having to do them.

如果速度太慢(当然要比预处理慢)，那么就在某个地方跟踪转换，这样比较函数就不会继续执行它们了。

#12

Possible optimization: Instead of this:

可能的优化:

I converted each integer to its string format, then added zeros to its right to make all the integers contain the same number of digits

我将每个整数转换为它的字符串格式，然后将0添加到它的右边，使所有整数都包含相同数量的数字。

you can multiply each number by (10^N - log10(number)), N being a number larger than log10 of any of your numbers.

你可以把每个数字乘以(10 ^ N - log10(数量),N是一个数字大于log10任何数字。

#13

#!/usr/bin/perl

use strict;
use warnings;

my @x = ( 12, 2434, 23, 1, 654, 222, 56, 100000 );

print $_, "\n" for sort @x;

__END__

Some timings ... First, with empty @x:

一些时间…首先,用空@x:

C:\Temp> timethis s-empty
TimeThis :  Elapsed Time :  00:00:00.188

Now, with 10,000 randomly generated elements:

现在，有10000个随机生成的元素:

TimeThis :  Elapsed Time :  00:00:00.219

This includes the time taken to generate the 10,000 elements but not the time to output them to the console. The output adds about a second.

这包括生成10,000个元素的时间，而不是将它们输出到控制台的时间。输出增加了大约一秒。

So, save some programmer time ;-)

所以，节省一些程序员的时间;-)

#14

One really hacky method (using C) would be:

一个真正的hacky方法(使用C)将是:

generate a new array of all the values converted to floats
生成所有转换为浮点数的所有值的新数组。
do a sort using the mantissa (significand) bits for the comparison
使用mantissa(重要的)位来进行比较吗?

In Java (from here):

在Java(离这里):

long bits = Double.doubleToLongBits(5894.349580349);

boolean negative = (bits & 0x8000000000000000L) != 0; 
long exponent = bits & 0x7ff0000000000000L >> 52;
long mantissa = bits & 0x000fffffffffffffL;

so you would sort on the long mantissa here.

所以你可以在这里的长尾数上排序。

#1