I was doing some performance metrics and I ran into something that seems quite odd to me. I time the following two functions:
我正在做一些性能指标,我遇到了一些对我来说很奇怪的事情。我计时以下两个功能:
private static void DoOne()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s=0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
private static void DoTwo()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
IList<int> L = A;
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < L.Count; c++) s += L[c];
}
}
Even when compiling in release mode, the timings results were consistently showing that DoTwo takes ~100 longer then DoOne:
即使在发布模式下进行编译,时序结果仍然表明DoTwo比DoOne长约100倍:
DoOne took 0.06171706 seconds.
DoTwo took 8.841709 seconds.
Given the fact the List directly implements IList I was very surprised by the results. Can anyone clarify this behavior?
鉴于List直接实现了IList,我对结果感到非常惊讶。任何人都可以澄清这种行为吗?
The gory details
Responding to questions, here is the full code and an image of the project build preferences:
回答问题,这里是完整的代码和项目构建首选项的图像:
死图像链接
using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Collections;
namespace TimingTests
{
class Program
{
static void Main(string[] args)
{
Stopwatch SW = new Stopwatch();
SW.Start();
DoOne();
SW.Stop();
Console.WriteLine(" DoOne took {0} seconds.", ((float)SW.ElapsedTicks) / Stopwatch.Frequency);
SW.Reset();
SW.Start();
DoTwo();
SW.Stop();
Console.WriteLine(" DoTwo took {0} seconds.", ((float)SW.ElapsedTicks) / Stopwatch.Frequency);
}
private static void DoOne()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s=0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
private static void DoTwo()
{
List<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
IList<int> L = A;
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < L.Count; c++) s += L[c];
}
}
}
}
Thanks for all the good answers (especially @kentaromiura). I would have closed the question, though I feel we still miss an important part of the puzzle. Why would accessing a class via an interface it implements be so much slower? The only difference I can see is that accessing a function via an Interface implies using virtual tables while the normally the functions can be called directly. To see whether this is the case I made a couple of changes to the above code. First I introduced two almost identical classes:
感谢所有的好答案(特别是@kentaromiura)。我会关闭这个问题,虽然我觉得我们仍然错过了这个难题的一个重要部分。为什么通过它实现的接口访问类会慢得多?我能看到的唯一区别是通过接口访问函数意味着使用虚拟表,而通常可以直接调用函数。为了查看是否是这种情况,我对上面的代码进行了一些更改。首先,我介绍了两个几乎相同的类:
public class VC
{
virtual public int f() { return 2; }
virtual public int Count { get { return 200; } }
}
public class C
{
public int f() { return 2; }
public int Count { get { return 200; } }
}
As you can see VC is using virtual functions and C doesn't. Now to DoOne and DoTwo:
正如您所看到的,VC正在使用虚拟功能而C则没有。现在到DoOne和DoTwo:
private static void DoOne()
{ C a = new C();
int s=0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < a.Count; c++) s += a.f();
}
}
private static void DoTwo()
{
VC a = new VC();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < a.Count; c++) s += a.f();
}
}
And indeed:
DoOne took 0.01287789 seconds.
DoTwo took 8.982396 seconds.
This is even more scary - virtual function calls 800 times slower?? so a couple of question to the community:
这更可怕 - 虚函数调用速度慢800倍?所以社区有几个问题:
- Can you reproduce? (given the fact that all had worse performance before, but not as bad as mine)
- Can you explain?
- (this may be the most important) - can you think of a way to avoid?
你可以重现吗? (考虑到之前所有人的表现都比较差,但没有我的表现差)
你可以解释吗?
(这可能是最重要的) - 你能想到一种避免的方法吗?
Boaz
7 个解决方案
#1
A note to everyone out there who is trying to benchmark stuff like this.
给那些试图对这样的东西进行基准测试的所有人的注释。
Do not forget that the code is not jitted until the first time it runs. That means that the first time you run a method, the cost of running that method could be dominated by the time spent loading the IL, analyzing the IL, and jitting it into machine code, particularly if it is a trivial method.
不要忘记,代码在第一次运行之前不会被jitted。这意味着第一次运行方法时,运行该方法的成本可能由加载IL所花费的时间,分析IL以及将其嵌入到机器代码中所占用的时间决定,特别是如果它是一个简单的方法。
If what you're trying to do is compare the "marginal" runtime cost of two methods, it's a good idea to run both of them twice and consider only the second runs for comparison purposes.
如果你要做的是比较两种方法的“边际”运行时成本,最好同时运行它们两次并仅考虑第二次运行以进行比较。
#2
Profiling one on one:
一对一分析:
Testing with Snippet compiler.
使用Snippet编译器进行测试。
using your code results:
使用您的代码结果:
0.043s vs 0.116s
0.043s vs 0.116s
eliminating Temporary L
消除临时L
0.043s vs 0.116s - ininfluent
0.043s vs 0.116s - inInfluent
by caching A.count in cmax on both Methods
通过在两个方法的cmax中缓存A.count
0.041s vs 0.076s
0.041s vs 0.076s
IList<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0,cmax=A.Count;c< cmax; c++) s += A[c];
}
Now I will try to slow down DoOne, first try, casting to IList before add:
现在我将尝试减慢DoOne,首先尝试,在添加之前转换为IList:
for (int i = 0; i < 200; i++) ((IList<int>)A).Add(i);
0,041s 0,076s - so add is ininfluent
0,041s 0,076s - 所以add是不流利的
so it remains only one place where the slowdown can happen : s += A[c];
so I try this:
所以它仍然只有一个可以发生减速的地方:s + = A [c];所以我试试这个:
s += ((IList<int>)A)[c];
0.075s 0.075s - TADaaan!
0.075s 0.075s - TADaaan!
so seems that accessing Count or a index element is slower on the interfaced version:
所以看起来在接口版本*问Count或索引元素的速度较慢:
EDIT: Just for fun take a look at this:
编辑:只是为了好玩,看看这个:
for (int c = 0,cmax=A.Count;c< cmax; c++) s += ((List<int>)A)[c];
0.041s 0.050s
so is not a cast problem, but a reflection one!
所以不是演员问题,而是反思一个!
#3
First I want to thank all for their answers. It was really essential in the path figuring our what was going on. Special thanks goes to @kentaromiura which found the key needed to get to the bottom of things.
首先,我要感谢所有人的回答。在确定我们正在发生的事情的路径中,这是非常重要的。特别感谢@kentaromiura,它找到了解决问题所需的关键。
The source of the slow down of using List<T> via an IList<T> interface is the lack of the ability of the JIT complier to inline the Item property get function. The use of virtual tables caused by accessing the list through it's IList interface prevents that from happening.
通过IList
As a proof , I have written the following code:
作为证明,我写了以下代码:
public class VC
{
virtual public int f() { return 2; }
virtual public int Count { get { return 200; } }
}
public class C
{
//[MethodImpl( MethodImplOptions.NoInlining)]
public int f() { return 2; }
public int Count
{
// [MethodImpl(MethodImplOptions.NoInlining)]
get { return 200; }
}
}
and modified the DoOne and DoTwo classes to the following:
并将DoOne和DoTwo类修改为以下内容:
private static void DoOne()
{
C c = new C();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int i = 0; i < c.Count; i++) s += c.f();
}
}
private static void DoTwo()
{
VC c = new VC();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int i = 0; i < c.Count; i++) s += c.f();
}
}
Sure enough the function times are now very similar to before:
现在功能时间与以前非常相似:
DoOne took 0.01273598 seconds.
DoTwo took 8.524558 seconds.
Now, if you remove the comments before the MethodImpl in the C class (forcing the JIT not to inline) - the timing becomes:
现在,如果删除C类中MethodImpl之前的注释(强制JIT不要内联) - 时间变为:
DoOne took 8.734635 seconds.
DoTwo took 8.887354 seconds.
Voila - the methods take almost the same time. You can stil see that method DoOne is still slightly fast , which is consistent which the extra overhead of a virtual function.
瞧 - 这些方法几乎同时进行。你可以看到DoOne的方法仍然稍微快一点,这与虚函数的额外开销是一致的。
#4
I believe that the problem lies in your time metrics, what are you using to measure the elapsed time?
我认为问题在于你的时间指标,你用什么来衡量经过的时间?
Just for the record, here are my results:
仅供记录,以下是我的结果:
DoOne() -> 295 ms
DoTwo() -> 291 ms
And the code:
和代码:
Stopwatch sw = new Stopwatch();
sw.Start();
{
DoOne();
}
sw.Stop();
Console.WriteLine("DoOne() -> {0} ms", sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
{
DoTwo();
}
sw.Stop();
Console.WriteLine("DoTwo() -> {0} ms", sw.ElapsedMilliseconds);
#5
I am seeing some significant penalty for the interface version but nowhere near the magnitude penalty you are seeing.
我看到界面版本有一些重大的损失,但是你看到的幅度惩罚还远远不够。
Can you post a small, complete program that demonstrates the behaviour along with exactly how you are compiling it and exactly what version of the framework you are using?
您是否可以发布一个小型,完整的程序来演示行为以及您正在编译它的确切方式以及您正在使用的框架版本?
#6
My tests show the interface version to be about 3x slower when compiled in release mode. When compiled in debug mode they're almost neck-and-neck.
我的测试表明,在发布模式下编译时,接口版本的速度要慢约3倍。在调试模式下编译时,它们几乎是颈部和颈部。
--------------------------------------------------------
DoOne Release (ms) | 92 | 91 | 91 | 92 | 92 | 92
DoTwo Release (ms) | 313 | 313 | 316 | 352 | 320 | 318
--------------------------------------------------------
DoOne Debug (ms) | 535 | 534 | 548 | 536 | 534 | 537
DoTwo Debug (ms) | 566 | 570 | 569 | 565 | 568 | 571
--------------------------------------------------------
EDIT
In my tests I used a slightly modified version of the DoTwo
method so that it was directly comparable to DoOne
. (This change didn't make any discernible difference to the performance.)
在我的测试中,我使用了DoTwo方法的略微修改版本,因此它可以直接与DoOne进行比较。 (此更改未对性能产生任何明显差异。)
private static void DoTwo()
{
IList<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
The only difference between the IL generated for DoOne
and (modified) DoTwo
is that the callvirt
instructions for Add
, get_Item
and get_Count
use IList
and ICollection
rather than List
itself.
为DoOne和(已修改的)DoTwo生成的IL之间的唯一区别是Add,get_Item和get_Count的callvirt指令使用IList和ICollection而不是List本身。
I'm guessing that the runtime has to do more work to find the actual method implementation when the callvirt
is through an interface (and that the JIT compiler/optimiser can do a better job with the non-interface calls than the interface calls when compiling in release mode).
我猜测运行时必须做更多工作才能在callvirt通过接口时找到实际的方法实现(并且JIT编译器/优化器可以在编译时使用非接口调用比接口调用做得更好在发布模式)。
Can anybody confirm this?
任何人都可以证实吗?
#7
I've ran this using Jon Skeet's Benchmark Helper and I am not seeing the results you are, the execution time is approximately the same between the two methods.
我使用Jon Skeet的Benchmark Helper来运行它,我没有看到你的结果,两种方法的执行时间大致相同。
#1
A note to everyone out there who is trying to benchmark stuff like this.
给那些试图对这样的东西进行基准测试的所有人的注释。
Do not forget that the code is not jitted until the first time it runs. That means that the first time you run a method, the cost of running that method could be dominated by the time spent loading the IL, analyzing the IL, and jitting it into machine code, particularly if it is a trivial method.
不要忘记,代码在第一次运行之前不会被jitted。这意味着第一次运行方法时,运行该方法的成本可能由加载IL所花费的时间,分析IL以及将其嵌入到机器代码中所占用的时间决定,特别是如果它是一个简单的方法。
If what you're trying to do is compare the "marginal" runtime cost of two methods, it's a good idea to run both of them twice and consider only the second runs for comparison purposes.
如果你要做的是比较两种方法的“边际”运行时成本,最好同时运行它们两次并仅考虑第二次运行以进行比较。
#2
Profiling one on one:
一对一分析:
Testing with Snippet compiler.
使用Snippet编译器进行测试。
using your code results:
使用您的代码结果:
0.043s vs 0.116s
0.043s vs 0.116s
eliminating Temporary L
消除临时L
0.043s vs 0.116s - ininfluent
0.043s vs 0.116s - inInfluent
by caching A.count in cmax on both Methods
通过在两个方法的cmax中缓存A.count
0.041s vs 0.076s
0.041s vs 0.076s
IList<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0,cmax=A.Count;c< cmax; c++) s += A[c];
}
Now I will try to slow down DoOne, first try, casting to IList before add:
现在我将尝试减慢DoOne,首先尝试,在添加之前转换为IList:
for (int i = 0; i < 200; i++) ((IList<int>)A).Add(i);
0,041s 0,076s - so add is ininfluent
0,041s 0,076s - 所以add是不流利的
so it remains only one place where the slowdown can happen : s += A[c];
so I try this:
所以它仍然只有一个可以发生减速的地方:s + = A [c];所以我试试这个:
s += ((IList<int>)A)[c];
0.075s 0.075s - TADaaan!
0.075s 0.075s - TADaaan!
so seems that accessing Count or a index element is slower on the interfaced version:
所以看起来在接口版本*问Count或索引元素的速度较慢:
EDIT: Just for fun take a look at this:
编辑:只是为了好玩,看看这个:
for (int c = 0,cmax=A.Count;c< cmax; c++) s += ((List<int>)A)[c];
0.041s 0.050s
so is not a cast problem, but a reflection one!
所以不是演员问题,而是反思一个!
#3
First I want to thank all for their answers. It was really essential in the path figuring our what was going on. Special thanks goes to @kentaromiura which found the key needed to get to the bottom of things.
首先,我要感谢所有人的回答。在确定我们正在发生的事情的路径中,这是非常重要的。特别感谢@kentaromiura,它找到了解决问题所需的关键。
The source of the slow down of using List<T> via an IList<T> interface is the lack of the ability of the JIT complier to inline the Item property get function. The use of virtual tables caused by accessing the list through it's IList interface prevents that from happening.
通过IList
As a proof , I have written the following code:
作为证明,我写了以下代码:
public class VC
{
virtual public int f() { return 2; }
virtual public int Count { get { return 200; } }
}
public class C
{
//[MethodImpl( MethodImplOptions.NoInlining)]
public int f() { return 2; }
public int Count
{
// [MethodImpl(MethodImplOptions.NoInlining)]
get { return 200; }
}
}
and modified the DoOne and DoTwo classes to the following:
并将DoOne和DoTwo类修改为以下内容:
private static void DoOne()
{
C c = new C();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int i = 0; i < c.Count; i++) s += c.f();
}
}
private static void DoTwo()
{
VC c = new VC();
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int i = 0; i < c.Count; i++) s += c.f();
}
}
Sure enough the function times are now very similar to before:
现在功能时间与以前非常相似:
DoOne took 0.01273598 seconds.
DoTwo took 8.524558 seconds.
Now, if you remove the comments before the MethodImpl in the C class (forcing the JIT not to inline) - the timing becomes:
现在,如果删除C类中MethodImpl之前的注释(强制JIT不要内联) - 时间变为:
DoOne took 8.734635 seconds.
DoTwo took 8.887354 seconds.
Voila - the methods take almost the same time. You can stil see that method DoOne is still slightly fast , which is consistent which the extra overhead of a virtual function.
瞧 - 这些方法几乎同时进行。你可以看到DoOne的方法仍然稍微快一点,这与虚函数的额外开销是一致的。
#4
I believe that the problem lies in your time metrics, what are you using to measure the elapsed time?
我认为问题在于你的时间指标,你用什么来衡量经过的时间?
Just for the record, here are my results:
仅供记录,以下是我的结果:
DoOne() -> 295 ms
DoTwo() -> 291 ms
And the code:
和代码:
Stopwatch sw = new Stopwatch();
sw.Start();
{
DoOne();
}
sw.Stop();
Console.WriteLine("DoOne() -> {0} ms", sw.ElapsedMilliseconds);
sw.Reset();
sw.Start();
{
DoTwo();
}
sw.Stop();
Console.WriteLine("DoTwo() -> {0} ms", sw.ElapsedMilliseconds);
#5
I am seeing some significant penalty for the interface version but nowhere near the magnitude penalty you are seeing.
我看到界面版本有一些重大的损失,但是你看到的幅度惩罚还远远不够。
Can you post a small, complete program that demonstrates the behaviour along with exactly how you are compiling it and exactly what version of the framework you are using?
您是否可以发布一个小型,完整的程序来演示行为以及您正在编译它的确切方式以及您正在使用的框架版本?
#6
My tests show the interface version to be about 3x slower when compiled in release mode. When compiled in debug mode they're almost neck-and-neck.
我的测试表明,在发布模式下编译时,接口版本的速度要慢约3倍。在调试模式下编译时,它们几乎是颈部和颈部。
--------------------------------------------------------
DoOne Release (ms) | 92 | 91 | 91 | 92 | 92 | 92
DoTwo Release (ms) | 313 | 313 | 316 | 352 | 320 | 318
--------------------------------------------------------
DoOne Debug (ms) | 535 | 534 | 548 | 536 | 534 | 537
DoTwo Debug (ms) | 566 | 570 | 569 | 565 | 568 | 571
--------------------------------------------------------
EDIT
In my tests I used a slightly modified version of the DoTwo
method so that it was directly comparable to DoOne
. (This change didn't make any discernible difference to the performance.)
在我的测试中,我使用了DoTwo方法的略微修改版本,因此它可以直接与DoOne进行比较。 (此更改未对性能产生任何明显差异。)
private static void DoTwo()
{
IList<int> A = new List<int>();
for (int i = 0; i < 200; i++) A.Add(i);
int s = 0;
for (int j = 0; j < 100000; j++)
{
for (int c = 0; c < A.Count; c++) s += A[c];
}
}
The only difference between the IL generated for DoOne
and (modified) DoTwo
is that the callvirt
instructions for Add
, get_Item
and get_Count
use IList
and ICollection
rather than List
itself.
为DoOne和(已修改的)DoTwo生成的IL之间的唯一区别是Add,get_Item和get_Count的callvirt指令使用IList和ICollection而不是List本身。
I'm guessing that the runtime has to do more work to find the actual method implementation when the callvirt
is through an interface (and that the JIT compiler/optimiser can do a better job with the non-interface calls than the interface calls when compiling in release mode).
我猜测运行时必须做更多工作才能在callvirt通过接口时找到实际的方法实现(并且JIT编译器/优化器可以在编译时使用非接口调用比接口调用做得更好在发布模式)。
Can anybody confirm this?
任何人都可以证实吗?
#7
I've ran this using Jon Skeet's Benchmark Helper and I am not seeing the results you are, the execution time is approximately the same between the two methods.
我使用Jon Skeet的Benchmark Helper来运行它,我没有看到你的结果,两种方法的执行时间大致相同。