I have a C++ application, running on Linux, which I'm in the process of optimizing. How can I pinpoint which areas of my code are running slowly?
10 Answers
#1
1162
If your goal is to use a profiler, use one of the suggested ones.
However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.
Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
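For example, a crude way to take one such sample on Linux without an interactive debugger session (myprog is a placeholder; build with -g so the stacks are symbolized):

gdb -p $(pidof myprog) -batch -ex "thread apply all bt"

Run it a handful of times while the program feels slow, and look for frames that recur across samples.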
You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes. This magnification effect, when compounded over multiple problems, can lead to truly massive speedup factors.
Caveat: Programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack, and then let you examine a random set of samples. (The summaries are where the insight is lost.) Call graphs don't give you the same information, because
- they don't summarize at the instruction level, and
- they give confusing summaries in the presence of recursion.
They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find. They will say it sometimes finds things that aren't problems, but that is only true if you see something once. If you see a problem on more than one sample, it is real.
P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.
P.P.S. As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that that is the cause of performance problems (and the opportunity to get speedup).
Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
Another objection I often hear is: "It will stop someplace random, and it will miss the real problem". This comes from having a prior concept of what the real problem is. A key property of performance problems is that they defy expectations. Sampling tells you something is a problem, and your first reaction is disbelief. That is natural, but you can be sure if it finds a problem it is real, and vice-versa.
ADDED: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a priori.
Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:
Prior
P(f=x)   x     P(o=2/2|f=x)   P(o=2/2 && f=x)   P(o=2/2 && f >= x)   P(f >= x | o=2/2)
0.1      1     1              0.1               0.1                  0.25974026
0.1      0.9   0.81           0.081             0.181                0.47012987
0.1      0.8   0.64           0.064             0.245                0.636363636
0.1      0.7   0.49           0.049             0.294                0.763636364
0.1      0.6   0.36           0.036             0.33                 0.857142857
0.1      0.5   0.25           0.025             0.355                0.922077922
0.1      0.4   0.16           0.016             0.371                0.963636364
0.1      0.3   0.09           0.009             0.38                 0.987012987
0.1      0.2   0.04           0.004             0.384                0.997402597
0.1      0.1   0.01           0.001             0.385                1
                              P(o=2/2) = 0.385
The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.
Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:
Prior
P(f=x)   x     P(o=2/2|f=x)   P(o=2/2 && f=x)   P(o=2/2 && f >= x)   P(f >= x | o=2/2)
0.001    1     1              0.001             0.001                0.072727273
0.001    0.9   0.81           0.00081           0.00181              0.131636364
0.001    0.8   0.64           0.00064           0.00245              0.178181818
0.001    0.7   0.49           0.00049           0.00294              0.213818182
0.001    0.6   0.36           0.00036           0.0033               0.24
0.001    0.5   0.25           0.00025           0.00355              0.258181818
0.001    0.4   0.16           0.00016           0.00371              0.269818182
0.001    0.3   0.09           0.00009           0.0038               0.276363636
0.001    0.2   0.04           0.00004           0.00384              0.279272727
0.991    0.1   0.01           0.00991           0.01375              1
                              P(o=2/2) = 0.01375
Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
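To make the arithmetic concrete, here is a minimal C++ sketch (not part of the original answer) that reproduces the posterior column of the first table; swap in the second table's prior to reproduce it as well:

#include <cstdio>
#include <vector>

int main() {
    // Hypotheses f = 1.0, 0.9, ..., 0.1 with a uniform prior of 0.1 each.
    std::vector<double> f, prior;
    for (int i = 10; i >= 1; --i) {
        f.push_back(i / 10.0);
        prior.push_back(0.1);
    }
    // Likelihood of seeing instruction I on both of 2 samples: P(o=2/2|f) = f^2.
    double total = 0.0;  // accumulates P(o=2/2)
    std::vector<double> joint(f.size());
    for (std::size_t i = 0; i < f.size(); ++i) {
        joint[i] = prior[i] * f[i] * f[i];  // P(o=2/2 && f=x)
        total += joint[i];
    }
    // Posterior tail P(f >= x | o=2/2), printed from x = 1.0 down to 0.1.
    double tail = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        tail += joint[i];
        std::printf("P(f >= %.1f | o=2/2) = %.9f\n", f[i], tail / total);
    }
    std::printf("P(o=2/2) = %.3f\n", total);
    return 0;
}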
Yet another way to look at it is called the Rule Of Succession. If you flip a coin 2 times, and it comes up heads both times, what does that tell you about the probable weighting of the coin? The respected way to answer is to say that it's a Beta distribution, with average value (number of hits + 1) / (number of tries + 2) = (2+1)/(2+2) = 75%.
(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)
So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If n samples are taken, and f is the cost, then I will appear on nf +/- sqrt(nf(1-f)) samples. Example: n=10, f=0.3, that is 3 +/- 1.4 samples.)
ADDED, to give an intuitive feel for the difference between measuring and random stack sampling:
There are profilers now that sample the stack, even on wall-clock time, but what comes out is measurements (or hot path, or hot spot, from which a "bottleneck" can easily hide). What they don't show you (and they easily could) is the actual samples themselves. And if your goal is to find the bottleneck, the number of them you need to see is, on average, 2 divided by the fraction of time it takes. So if it takes 30% of time, 2/.3 = 6.7 samples, on average, will show it, and the chance that 20 samples will show it at least twice is 99.2% (1 - 0.7^20 - 20(0.3)(0.7^19) ≈ 0.992).
Here is an off-the-cuff illustration of the difference between examining measurements and examining stack samples. The bottleneck could be one big blob like this, or numerous small ones, it makes no difference.
Measurement is horizontal; it tells you what fraction of time specific routines take. Sampling is vertical. If there is any way to avoid what the whole program is doing at that moment, and if you see it on a second sample, you've found the bottleneck. That's what makes the difference - seeing the whole reason for the time being spent, not just how much.
#2
458
You can use Valgrind with the following options:
valgrind --tool=callgrind ./(Your binary)
It will generate a file called callgrind.out.x. You can then use the kcachegrind tool to read this file. It will give you a graphical analysis of things, with results like which lines cost how much.
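For example, if the run produced callgrind.out.12345 (the numeric suffix is the process ID, so yours will differ):

kcachegrind callgrind.out.12345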
#3
288
I assume you're using GCC. The standard solution would be to profile with gprof.
Be sure to add -pg to compilation before profiling:
cc -o myprog myprog.c utils.c -g -pg
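Running the instrumented binary then writes a gmon.out file in the working directory on normal exit; a typical way to turn it into a report (assuming the names above) is:

gprof ./myprog gmon.out > analysis.txt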
I haven't tried it yet but I've heard good things about google-perftools. It is definitely worth a try.
Related question here.
A few other buzzwords if gprof does not do the job for you: Valgrind, Intel VTune, Sun DTrace.
#4
209
Newer kernels (e.g. the latest Ubuntu kernels) come with the new 'perf' tools (apt-get install linux-tools), AKA perf_events.
These come with classic sampling profilers (man-page) as well as the awesome timechart!
The important thing is that these tools can be system profiling and not just process profiling - they can show the interaction between threads, processes and the kernel and let you understand the scheduling and I/O dependencies between processes.
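A minimal sampling session, assuming a binary named myprog (-g here records call graphs, and perf report opens an interactive summary):

perf record -g ./myprog
perf report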
#5
63
I would use Valgrind and Callgrind as a base for my profiling tool suite. What is important to know is that Valgrind is basically a Virtual Machine:
(wikipedia) Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor. Instead, Valgrind first translates the program into a temporary, simpler form called Intermediate Representation (IR), which is a processor-neutral, SSA-based form. After the conversion, a tool (see below) is free to do whatever transformations it would like on the IR, before Valgrind translates the IR back into machine code and lets the host processor run it.
Callgrind is a profiler built upon that. The main benefit is that you don't have to run your application for hours to get a reliable result. Even a one-second run is sufficient to get rock-solid, reliable results, because Callgrind is a non-probing profiler.
Another tool built upon Valgrind is Massif. I use it to profile heap memory usage. It works great. What it does is give you snapshots of memory usage -- detailed information about WHAT holds WHAT percentage of memory, and WHO had put it there. Such information is available at different points in time of the application run.
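A sketch of a typical Massif session (massif.out.12345 is a placeholder; the real suffix is the process ID), using the bundled ms_print tool to render the snapshots:

valgrind --tool=massif ./myprog
ms_print massif.out.12345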
#6
49
This is a response to Nazgob's Gprof answer.
I've been using Gprof the last couple of days and have already found three significant limitations, one of which I've not seen documented anywhere else (yet):
- It doesn't work properly on multi-threaded code, unless you use a workaround
- The call graph gets confused by function pointers. Example: I have a function called multithread() which enables me to multi-thread a specified function over a specified array (both passed as arguments). However, Gprof views all calls to multithread() as equivalent for the purposes of computing time spent in children. Since some functions I pass to multithread() take much longer than others, my call graphs are mostly useless. (To those wondering if threading is the issue here: no, multithread() can optionally, and did in this case, run everything sequentially on the calling thread only.)
- It says here that "... the number-of-calls figures are derived by counting, not sampling. They are completely accurate...". Yet I find my call graph giving me 5345859132 + 784984078 as call stats for my most-called function, where the first number is supposed to be direct calls, and the second recursive calls (which are all from itself). Since this implied I had a bug, I put long (64-bit) counters into the code and did the same run again. My counts: 5345859132 direct, and 78094395406 self-recursive calls. There are a lot of digits there, so I'll point out that the recursive calls I measure are 78bn, versus 784m from Gprof: a factor of 100 different. Both runs were single-threaded and unoptimised code, one compiled -g and the other -pg.
This was GNU Gprof (GNU Binutils for Debian) 2.18.0.20080103 running under 64-bit Debian Lenny, if that helps anyone.
#7
46
The answer to run valgrind --tool=callgrind is not quite complete without some options. We usually do not want to profile 10 minutes of slow startup time under Valgrind; we want to profile our program when it is doing some task.
So this is what I recommend. Run the program first:
valgrind --tool=callgrind --dump-instr=yes -v --instr-atstart=no ./binary > tmp
Now, when it is running and we want to start profiling, we should run in another window:
callgrind_control -i on
This turns profiling on. To turn it off and stop the whole task, we might use:
callgrind_control -k
Now we have some files named callgrind.out.* in the current directory. To see the profiling results, use:
kcachegrind callgrind.out.*
I recommend clicking on the "Self" column header in the next window; otherwise it shows that "main()" is the most time-consuming task. "Self" shows how much time each function itself took, not together with its dependents.
#8
8
Use Valgrind, callgrind and kcachegrind:
valgrind --tool=callgrind ./(Your binary)
generates callgrind.out.x. Read it using kcachegrind.
Use gprof (add -pg):
cc -o myprog myprog.c utils.c -g -pg
(not so good for multi-threads, function pointers)
Use google-perftools:
Uses time sampling; I/O and CPU bottlenecks are revealed.
Intel VTune is the best (free for educational purposes).
Others: AMD Codeanalyst, OProfile, 'perf' tools (apt-get install linux-tools)
#9
3
These are the two methods I use for speeding up my code:
For CPU-bound applications:
- Use a profiler in DEBUG mode to identify questionable parts of your code
- Then switch to RELEASE mode and comment out the questionable sections of your code (stub it with nothing) until you see changes in performance.
For I/O-bound applications:
- Use a profiler in RELEASE mode to identify questionable parts of your code.
N.B.
If you don't have a profiler, use the poor man's profiler. Hit pause while debugging your application. Most developer suites will break into assembly with commented line numbers. You're statistically likely to land in a region that is eating most of your CPU cycles.
For CPU, the reason for profiling in DEBUG mode is that if you tried profiling in RELEASE mode, the compiler is going to reduce math, vectorize loops, and inline functions, which tends to glob your code into an un-mappable mess when it's assembled. An un-mappable mess means your profiler will not be able to clearly identify what is taking so long, because the assembly may not correspond to the source code under optimization. If you need the performance (e.g. timing-sensitive) of RELEASE mode, disable debugger features as needed to keep a usable performance.
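A common middle ground (my suggestion, not part of the advice above) is to profile an optimized build that still carries debug info, so samples can be mapped back to source:

g++ -O2 -g -fno-omit-frame-pointer myprog.cpp -o myprog

Here -g keeps the symbol information and -fno-omit-frame-pointer keeps frame pointers so stack walks stay accurate under optimization.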
For I/O-bound, the profiler can still identify I/O operations in RELEASE mode because I/O operations are either externally linked to a shared library (most of the time) or in the worst case, will result in a sys-call interrupt vector (which is also easily identifiable by the profiler).
#10
0
For single-threaded programs you can use igprof, The Ignominious Profiler: https://igprof.org/ .
It is a sampling profiler, along the lines of the... long... answer by Mike Dunlavey, which will gift-wrap the results in a browsable call-stack tree, annotated with the time or memory spent in each function, either cumulative or per-function.