Swift performance: map() and reduce() vs for loops

Time: 2021-08-09 01:28:58

I'm writing some performance-critical code in Swift. After implementing all the optimizations I could think of, and profiling the application in Instruments, I came to realize that the vast majority of CPU cycles are spent performing map() and reduce() operations on arrays of Floats. So, just to see what would happen, I replaced all instances of map and reduce with good old-fashioned for loops. And to my amazement... the for loops were much, much faster!

A bit puzzled by this, I decided to perform some rough benchmarks. In one test, I had map return an array of Floats after performing some simple arithmetic like so:

import Foundation

// Populate array with 1,000,000,000 random numbers
var array = [Float](count: 1_000_000_000, repeatedValue: 0)
for i in 0..<array.count {
    array[i] = Float(random())
}
let start = NSDate()
// Construct a new array, with each element from the original multiplied by 5
let output = array.map({ (element) -> Float in
    return element * 5
})
// Log the elapsed time
let elapsed = NSDate().timeIntervalSinceDate(start)
print(elapsed)

And the equivalent for loop implementation:

var output = [Float]()
for element in array {
    output.append(element * 5)
}

Average execution time for map: 20.1 seconds. Average execution time for the for loop: 11.2 seconds. Results were similar using Integers instead of Floats.

I created a similar benchmark to test the performance of Swift's reduce. This time, reduce and for loops achieved nearly the same performance when summing the elements of one large array. But when I loop the test 100,000 times like this:

// Populate array with 1,000,000 random numbers
var array = [Float](count: 1_000_000, repeatedValue: 0)
for i in 0..<array.count {
    array[i] = Float(random())
}
let start = NSDate()
// Perform operation 100,000 times
for _ in 0..<100_000 {
    let sum = array.reduce(0, combine: {$0 + $1})
}
// Log the elapsed time
let elapsed = NSDate().timeIntervalSinceDate(start)
print(elapsed)

vs:

for _ in 0..<100_000 {
    var sum: Float = 0
    for element in array {
        sum += element
    }
}

The reduce method takes 29 seconds while the for loop takes (apparently) 0.000003 seconds.

Naturally I'm ready to disregard that last test as the result of a compiler optimization, but I think it may give some insight into how the compiler optimizes differently for loops vs Swift's built-in array methods. Note that all tests were performed with -Os optimization on a 2.5 GHz i7 MacBook Pro. Results varied depending on array size and number of iterations, but for loops always outperformed the other methods by at least 1.5x, sometimes up to 10x.

I'm a bit perplexed about Swift's performance here. Shouldn't the built-in Array methods be faster than the naive approach for performing such operations? Maybe somebody with more low-level knowledge than I can shed some light on the situation.

2 Solutions

#1 (27 votes)

Shouldn't the built-in Array methods be faster than the naive approach for performing such operations? Maybe somebody with more low-level knowledge than I can shed some light on the situation.

I just want to attempt to address this part of the question at a conceptual level (with little understanding of the nature of Swift's optimizer on my part), and my answer is a "not necessarily". It comes more from a background in compiler design and computer architecture than from any deep-rooted knowledge of the nature of Swift's optimizer.

Calling Overhead

Because functions like map and reduce accept functions as inputs, they place, to put it one way, a greater strain on the optimizer. Short of some very aggressive optimization, the natural outcome in such a case is to constantly branch back and forth between the implementation of, say, map and the closure you provided, and likewise to transmit data across these disparate branches of code (typically through registers and the stack).

That kind of branching/calling overhead is very difficult for the optimizer to eliminate, especially given the flexibility of Swift's closures (not impossible, but conceptually quite difficult). C++ optimizers can inline function object calls, but only with far more restrictions and code generation techniques required to do it: the compiler effectively has to generate a whole new set of instructions for map for each type of function object you pass in, and with the explicit aid of the programmer, who indicates a function template to be used for the code generation.

So it shouldn't be a great surprise to find that your hand-rolled loops can perform faster -- they put far less strain on the optimizer. I have seen some people claim that these higher-order functions should be able to go faster because the vendor can do things like parallelize the loop, but effectively parallelizing the loop would first require the kind of information that would typically allow the optimizer to inline the nested function calls to the point where they become as cheap as the hand-rolled loops. Otherwise the function/closure implementation you pass in will be effectively opaque to functions like map/reduce: they can only call it and pay the overhead of doing so, and they cannot parallelize it, since they cannot assume anything about the nature of its side effects and thread-safety.

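To make that calling overhead concrete, here is a minimal sketch of a hand-written map-like function (in present-day Swift syntax; mapFloats is a hypothetical name, not a standard API). Unless the optimizer can see through the transform parameter and inline it, every iteration pays for an indirect call:

func mapFloats(_ input: [Float], _ transform: (Float) -> Float) -> [Float] {
    var result = [Float]()
    result.reserveCapacity(input.count)
    for element in input {
        // Arguments travel to the closure through registers/the stack,
        // and execution branches into its body and back on every element --
        // the overhead described above, unless the call gets inlined.
        result.append(transform(element))
    }
    return result
}

let scaled = mapFloats([1.0, 2.0, 3.0]) { $0 * 5 }
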
Of course this is all conceptual -- Swift may be able to optimize these cases in the future, or it may already be able to do so now (see -Ofast as a commonly-cited way to make Swift go faster at the cost of some safety). But at the very least, using these kinds of functions over hand-rolled loops does place a heavier strain on the optimizer, and the time differences you're seeing in the first benchmark seem to reflect the kind of differences one might expect from this additional calling overhead. The best way to find out is to look at the assembly and try various optimization flags.

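If you want to verify this yourself, something along these lines should work (a sketch: bench.swift is a placeholder file name; -O, -Onone, and -emit-assembly are standard swiftc options):

swiftc -O -emit-assembly bench.swift -o bench.s
swiftc -Onone -emit-assembly bench.swift -o bench_onone.s

Comparing the generated .s files shows whether the closure calls were inlined away.
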
Standard Functions

That's not to discourage the use of such functions. They express intent more concisely, and they can boost productivity. Relying on them could also allow your codebase to get faster in future versions of Swift without any involvement on your part. But they aren't necessarily always going to be faster -- it is a good general rule that a higher-level library function which more directly expresses what you want to do will be faster, but there are always exceptions to the rule (best discovered in hindsight, profiler in hand, since it's far better here to err on the side of trust than distrust).

Artificial Benchmarks

As for your second benchmark, it is almost certainly the result of the compiler optimizing away code that has no side effects affecting user output. Artificial benchmarks are notoriously misleading because of what optimizers do to eliminate irrelevant side effects (essentially, side effects that don't affect user output). So when you construct a benchmark whose times seem too good to be true, you have to be careful that they aren't the result of the optimizer merely skipping all the work you actually wanted to measure. At the very least, you want your tests to output some final result gathered from the computation.

#2 (13 votes)

I cannot say much about your first test (map() vs append() in a loop), but I can confirm your results. The append loop becomes even faster if you add

output.reserveCapacity(array.count)

after the array creation. It seems that Apple could improve things here, and you might file a bug report.

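For context, this is a sketch of the append loop from the question with that one line added (timings will of course vary):

var output = [Float]()
output.reserveCapacity(array.count)  // pre-allocate storage so append() never reallocates
for element in array {
    output.append(element * 5)
}
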
In

for _ in 0..<100_000 {
    var sum: Float = 0
    for element in array {
        sum += element
    }
}

the compiler (probably) removes the entire loop because the computed results are not used at all. I can only speculate why a similar optimization does not happen in

for _ in 0..<100_000 {
    let sum = array.reduce(0, combine: {$0 + $1})
}

but it would be more difficult to decide whether calling reduce() with the closure has any side effects or not.

If the test code is changed slightly to calculate and print a total sum

do {
    var total = Float(0.0)
    let start = NSDate()
    for _ in 0..<100_000 {
        total += array.reduce(0, combine: {$0 + $1})
    }
    let elapsed = NSDate().timeIntervalSinceDate(start)
    print("sum with reduce:", elapsed)
    print(total)
}

do {
    var total = Float(0.0)
    let start = NSDate()
    for _ in 0..<100_000 {
        var sum = Float(0.0)
        for element in array {
            sum += element
        }
        total += sum
    }
    let elapsed = NSDate().timeIntervalSinceDate(start)
    print("sum with loop:", elapsed)
    print(total)
}

then both variants take about 10 seconds in my test.
