重构Java数组和原语（double [] []）到集合和泛型（List >）

I have been refactoring throwaway code which I wrote some years ago in a FORTRAN-like style. Most of the code is now much more organized and readable. However the heart of the algorithm (which is performance-critical) uses 1- and 2-dimensional Java arrays and is typified by:

我一直在重构一次性代码,这是我几年前以类似FORTRAN的方式编写的。大多数代码现在更加有条理和可读。然而,算法的核心(性能关键)使用1维和2维Java数组,其典型代表是:

    for (int j = 1; j < len[1]+1; j++) {
        int jj = (cont == BY_TYPE) ? seq[1][j-1] : j-1;
        for (int i = 1; i < len[0]+1; i++) {
            matrix[i][j] = matrix[i-1][j] + gap;
            double m = matrix[i][j-1] + gap;
            if (m > matrix[i][j]) {
                matrix[i][j] = m;
                pointers[i][j] = UP;
            }
            //...
        }
    }

For clarity, maintainability and interfacing with the rest of the code I would like to refactor it. However on reading Java Generics Syntax for arrays and Java Generics and numbers I have the following concerns:

为清楚起见,可维护性以及与其余代码的接口,我想重构它。但是,在阅读用于数组和Java泛型和数字的Java Generics语法时,我有以下问题:

Performance. The code is planned to use about 10^8 - 10^9 secs/yr and this is just about manageable. My reading suggests that changing double to Double can sometimes add a factor of 3 in performance. I'd like other experience on this. I would also expect that moving from foo[] to List would be a hit as well. I have no first-hand knowledge and again experience would be useful.

性能。该代码计划使用大约10 ^ 8 - 10 ^ 9秒/年,这几乎是可管理的。我的阅读表明,将double改为Double有时可以在性能上增加3倍。我想要其他经验。我也希望从foo []移动到List也会受到影响。我没有第一手知识,经验也很有用。
Array-bound checking. Is this treated differently in double[] and List and does it matter? I expect some problems to violate bounds as the algorithm is fairly simple and has only been applied to a few data sets.

数组绑定检查。这在double []和List中的处理方式有所不同吗?我期望一些问题违反界限,因为算法相当简单并且仅应用于少数数据集。
If I don't refactor then the code has an ugly and possibly fragile intermixture of the two approaches. I am already trying to write things such as:

如果我不重构那么代码就有两种方法的丑陋且可能是脆弱的混合。我已经在尝试写下这样的东西:

List<double[]> and List<Double>[]

List 和List []

and understand that the erasure does not make this pretty and at best gives rise to compiler warnings. It seems difficult to do this without very convoluted constructs.

并且理解擦除不会使这个漂亮,并且最多会引起编译器警告。没有非常复杂的结构,似乎很难做到这一点。

Obsolescence. One poster suggested that Java arrays should be obsoleted. I assume this isn't going to happen RSN but I would like to move away from outdated approaches.

过时。一张海报建议Java数组应该废弃。我认为这不会发生RSN,但我想摆脱过时的方法。

SUMMARY The consensus so far:

摘要迄今为止的共识:

Collections have a significant performance hit over primitive arrays, especially for constructs such as matrices. This is incurred in auto(un)boxing numerics and in accessing list items

集合在原始数组中具有显着的性能,特别是对于诸如矩阵之类的结构。这是在自动(非)拳击数字和访问列表项中引起的
For tight numerical (scientific) algorithms the array notation [][] is actually easier to read but the variables should named as helpfully as possible

对于紧密的数值(科学)算法,数组符号[] []实际上更容易阅读,但变量应尽可能有用地命名
Generics and arrays do not mix well. It may be useful to wrap the arrays in classes to transport them in/out of the tight algorithm.

泛型和数组不能很好地混合。将数组包装在类中以将它们输入/输出紧密算法可能是有用的。

There is little objective reason to make the change

做出改变的客观理由很少

QUESTION @SeanOwen has suggested that it would be useful to take constant values out of the loops. Assuming I haven't goofed this would look like:

问题@SeanOwen建议从循环中取出常量值是有用的。假设我没有这样做,这看起来像:

 int len1 = len[1];
 int len0 = len[0];
 int seq1 = seq[1];
 int[] pointersi;
 double[] matrixi;
 for (int i = 1; i < len0+1; i++) {
     matrixi = matrix[i];
     pointersi = pointers[i];
 }
 for (int j = 1; j < len1+1; j++) {
    int jj = (cont == BY_TYPE) ? seq1[j-1] : j-1;
    for (int i = 1; i < len0+1; i++) {
        matrixi[j] = matrixi[j] + gap;
        double m = matrixi[j-1] + gap;
        if (m > matrixi[j]) {
            matrixi[j] = m;
            pointersi[j] = UP;
        }
        //...
    }
}

I thought compilers were meant to be smart at doing this sort of thing. Do we need to still do this?

我认为编译器在做这类事情时应该很聪明。我们还需要这样做吗?

7 个解决方案

#1

I read an excellent book by Kent Beck on coding best-practices ( http://www.amazon.com/Implementation-Patterns/dp/B000XPRRVM ). There are also interesting performance figures. Specifically, there are comparison between arrays and various collections., and arrays are really much faster (maybe x3 compared to ArrayList).

我读了肯特贝克关于编写最佳实践的优秀书籍(http://www.amazon.com/Implementation-Patterns/dp/B000XPRRVM)。还有一些有趣的表现数字。具体来说,数组和各种集合之间存在比较,并且数组实际上要快得多(与ArrayList相比可能是x3)。

Also, if you use Double instead of double, you need to stick to it, and use no double, as auto(un)boxing will kill your performance.

此外,如果你使用Double而不是double,你需要坚持使用它,并且不使用double,因为auto(un)拳击会杀死你的表现。

Considering your performance need, I would stick to array of primitive type.

考虑到你的性能需求,我会坚持原始类型的数组。

Even more, I would calculate only once the upper bound for the condition in loops. This is typically done the line before the loop.

更重要的是,我只计算一次循环中条件的上限。这通常在循环之前完成。

However, if you don't like that the upper bound variable, used only in the loop, is accessible outside the loop, you can take advantage of the initialization phase of the for loop like this:

但是,如果你不喜欢只在循环中使用的上限变量可以在循环外部访问,你可以利用for循环的初始化阶段,如下所示:

    for (int i=0, max=list.size(); i<max; i++) {
      // do something
    }

I don't believe in obsolescence for arrays in java. For performance-critical loop, I can't see any language designer taking away the fastest option (especially if the difference is x3).

我不相信java中数组的过时。对于性能关键的循环,我看不到任何语言设计者拿走最快的选项(特别是如果差异是x3)。

I understand your concern for maintainability, and for coherence with the rest of the application. But I believe that a critical loop is entitled to some special practices.

我理解您对可维护性的关注,以及与应用程序其余部分的一致性。但我相信一个关键循环有权获得一些特殊的实践。

I would try to make the code the clearest possible without changing it:

我会尝试在不改变代码的情况下使代码最清晰:

by carefully questionning each variable name, ideally with a 10-min brainstorming session with my collegues

通过仔细询问每个变量名称,理想情况下与我的同事进行10分钟的头脑风暴会议

by writing coding comments (I'm against their use in general, as a code that is not clear should be made clear, not commented ; but a critical loop justifies it).

通过编写编码注释(我反对他们的使用一般,因为不清楚的代码应该清楚,不要注释;但是一个关键的循环证明它是正确的)。

by using private methods as needed (as Andreas_D pointed out in his answer). If made private final, chances are very good (as they would be short) that they will get inlined when running, so there would be no performance impact at runtime.

根据需要使用私有方法(正如Andreas_D在他的回答中指出的那样)。如果成为私人决赛,那么在运行时它们会被内联的机会非常好(因为它们很短),因此在运行时不会对性能产生影响。

#2

I fully agree with KLE's answer. Because the code is performance-critical, I'd keep the array based datastructures as well. And I believe, that just introducing collections, wrappers for primitive types and generics will not improve maintainability and clarity.

我完全同意KLE的回答。因为代码对性能至关重要,所以我也会保留基于数组的数据结构。我相信,只是引入集合,原始类型和泛型的包装器不会提高可维护性和清晰度。

In addition, if this algorithm is the heart of the application and has been in use for several years now, chance are fairly low, that it will need maintenance like bug fixing or improvements.

此外,如果这个算法是应用程序的核心并且已经使用了几年,那么机会相当低,它需要维护,如修复bug或改进。

For clarity, maintainability and interfacing with the rest of the code I would like to refactor it.

为清楚起见,可维护性以及与其余代码的接口,我想重构它。

Instead of changing datastructures I'd concentrate on renaming and maybe moving some part of the code to private methods. From looking at the code, I have no idea what's happening, and the problem, as I see it, are the more or less short and technical variable and field names.

我没有改变数据结构,而是专注于重命名,并可能将代码的某些部分移动到私有方法。通过查看代码,我不知道发生了什么,而且正如我所看到的那样,问题或多或少都是简短的技术变量和字段名称。

Just an example: one 2-dimensional array is just named 'matrix'. But it's obviously clear, that this is a matrix, so naming it 'matrix' is pretty redundant. It would be more helpful to rename it so that it becomes clear, what this matrix is really used for, what kind of data is inside.

举个例子:一个二维数组只是命名为“矩阵”。但很明显,这是一个矩阵,所以将它命名为“矩阵”是非常多余的。重命名它以使它变得清晰,这个矩阵真正用于什么,内部有什么样的数据会更有帮助。

Another candidate is your second line. With two refactorings, I'd rename 'jj' to something more meaningful and move the expression to a private method with a 'speaking' name.

另一位候选人是你的第二行。通过两次重构,我将'jj'重命名为更有意义的东西,并将表达式移动到具有'speak'名称的私有方法。

#3

The general guideline is to prefer generified collections over arrays in Java, but it's only a guideline. My first thought would be to NOT change this working code. If you really want to make this change, then benchmark both approaches.

一般原则是在Java中使用通用集合而不是数组,但它只是一个指导原则。我的第一个想法是不要改变这个工作代码。如果您真的想要进行此更改,那么请对两种方法进行基准测试。

As you say, performance is critical, in which case the code that meets the needed performance is better than code that doesn't.

正如您所说,性能至关重要,在这种情况下,满足所需性能的代码优于不具备所需性能的代码。

You might also run into auto-boxing issues when boxing/unboxing the doubles - a potentially more subtle problem.

在装箱/取消装箱双打时,您可能还会遇到自动装箱问题 - 这可能是一个更微妙的问题。

The Java language guys have been very strict about keeping the JVM compatible across different versions so I don't see arrays going anywhere - and I wouldn't call them obsolete, just more primitive than the other options.

Java语言人员一直非常严格地保持JVM在不同版本之间兼容,所以我看不到数组在任何地方 - 我不会把它们称为过时,只是比其他选项更原始。

#4

Well I think that arrays are the best way to store process data in algorithms. Since Java doesn't support operator overloading (one of the reasons why I think arrays won't be obsolete that soon) switching to collections would make the code quite hard to read:

我认为数组是在算法中存储过程数据的最佳方式。由于Java不支持运算符重载(我认为数组很快就不会过时的原因之一)切换到集合会使代码很难读取:

double[][] matrix = new double[10][10];
double t = matrix[0][0];

List<List<Double>> matrix = new ArrayList<List<Double>>(10);
Collections.fill(matrix, new ArrayList<Double>(10));
double t = matrix.get(0).get(0); // autoboxing => performance

As far as I know Java prestores some wrapper Object for Number instances (e.g. the first 100 integers), so that you can access them faster but I think that won't help much with that many data.

据我所知,Java预存了一些数据包实例的包装器(例如前100个整数),这样你就可以更快地访问它们,但我认为这对那么多数据都没有多大帮助。

#5

I thought compilers were meant to be smart at doing this sort of thing. Do we need to still do this?

我认为编译器在做这类事情时应该很聪明。我们还需要这样做吗?

You are probably right that the JIT takes care of it, but if this section is so performance critical, trying and benchmarking wouldn't hurt.

你可能是正确的JIT负责它,但如果这部分是如此性能关键,尝试和基准测试不会受到伤害。

#6

When you know the exact dimensions of the list you should stick with arrays. Arrays are not inherently bad, and they're not going anywhere. If you are performing a lot of (non-sequential) read and write operations you should use arrays and not lists, because the access methods of lists introduce a large overhead.

当你知道列表的确切尺寸时,你应该坚持使用数组。数组本身并不坏,它们不会去任何地方。如果要执行大量(非顺序)读取和写入操作,则应使用数组而不是列表,因为列表的访问方法会产生很大的开销。

#7

In addition to sticking with arrays, I think you can tighten up this code in some meaningful ways. For instance:

除了坚持使用数组之外,我认为你可以用一些有意义的方式收紧这些代码。例如:

Indeed, don't compute the loop bounds every time, save them off

实际上,不要每次都计算循环边界,将它们保存起来

You repeatedly reference matrix[i]. Just save off a reference to this subarray rather than dereferencing the 2D array every time

你反复引用矩阵[i]。只需保存对此子阵列的引用,而不是每次都取消引用2D数组

That trick gets even more useful if you can loop over i in the outer loop instead of inner loop

如果你可以在外部循环而不是内部循环中遍历i,那么这个技巧会变得更有用

It's getting extreme, but saving the value of j-1 in a local might even prove to be worth it rather than recomputing

它变得极端,但在本地保存j-1的价值甚至可能证明是值得的而不是重新计算

Finally if you are really really concerned about performance, run the ProGuard optimizer over the resulting byte code to have it perform some compiler optimizations like unrolling or peephole optimizations

最后,如果您真的非常关心性能,请在生成的字节代码上运行ProGuard优化器,让它执行一些编译器优化,如展开或窥孔优化

#1