This might look similar to "ARM and NEON can work in parallel?", but it's not; I have a different issue (possibly a problem with my understanding):
In the protocol stack, the checksum is currently computed on the GPP. I'm now handing that task over to NEON as part of a function:
Here is the checksum function that I have written for NEON, posted on Stack Overflow: Checksum code implementation for Neon in Intrinsics
Now, suppose this function is called from Linux:
ip_csum(){
…
…
csum = do_csum(); //function call from arm
…
…
}
do_csum(){
…
…
//NEON optimised code
…
…
// returns the final checksum to ip_csum/linux/ARM
}
In this case, what happens to the ARM while NEON is doing the calculations? Does the ARM sit idle, or does it move on with other operations?
As you can see, do_csum is called and we wait on its result (or that is what it looks like).
NOTE:
- I'm speaking in terms of the Cortex-A8.
- do_csum, as you can see from the link, is coded with intrinsics.
- Compilation uses the GNU toolchain.
- It would be good if you also covered multi-threading or any other concept that comes into the picture when these interoperations happen.
Questions:
- Does the ARM sit idle while NEON is doing its operations (in this particular case)?
- Or does it shelve the current ip_csum-related code and take up another process/thread until NEON is done? (I have almost no idea what happens here.)
- If it is sitting idle, how can we make the ARM work on something else until NEON is done?
3 Answers
#1
14
(Image from TI Wiki Cortex A8)
The ARM (or rather the integer pipeline) does not sit idle while NEON instructions are processing. In the Cortex-A8, NEON sits at the "end" of the processor pipeline. Instructions flow through the pipeline: ARM instructions are executed near the "beginning" of the pipeline and NEON instructions are executed at the end. Every clock cycle pushes the instructions further down the pipeline.
Here are some hints on how to read the diagram above:
- Every cycle, if possible, the processor fetches an instruction pair (two instructions).
- Fetching is pipelined, so it takes 3 cycles for the instructions to propagate into the decode unit.
- It takes 5 cycles (D0-D4) for an instruction to be decoded. Again, this is all pipelined, so it affects latency but not throughput. More instructions keep flowing through the pipeline where possible.
- Now we reach the execute/load-store portion. NEON instructions flow through this stage too (while other instructions are possibly executing).
- We get to the NEON portion: if the instruction fetched 13 cycles ago was a NEON instruction, it is now decoded and executed in the NEON pipeline.
- While this is happening, integer instructions that followed that instruction can execute at the same time in the integer pipeline.
- The pipeline is a fairly complex beast: some instructions are multi-cycle, and some have dependencies and will stall if those dependencies are not met. Other events, such as mispredicted branches, will flush the pipeline.
If you are executing a sequence that is 100% NEON instructions (which is pretty rare, since there are usually some ARM registers involved, control flow, etc.), then there is some period where the integer pipeline isn't doing anything useful. Most code will have the two executing concurrently for at least some of the time, while cleverly engineered code can maximize performance with the right instruction mix.
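As a rough illustration of such a mix, here is a minimal sketch in C of a checksum-style loop. It is not the asker's do_csum: the names csum_sketch, csum_scalar and fold16 are hypothetical, the input is assumed to be an array of at most 65536 native-order 16-bit words, and the result is the folded ones' complement sum before the final inversion. On a NEON build the vector load and accumulate go to the NEON pipeline while the index update, compare and branch stay in the integer pipeline; without NEON the same function falls back to plain scalar code. How much the two actually overlap is up to the compiler's scheduling and the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Fold a wide sum into 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Scalar reference: everything runs in the integer pipeline. */
static uint16_t csum_scalar(const uint16_t *w, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += w[i];
    return fold16(sum);
}

/* Hypothetical stand-in for do_csum(): NEON adds, integer loop control. */
static uint16_t csum_sketch(const uint16_t *w, size_t n)
{
    uint64_t sum = 0;
    size_t i = 0;

    /* Assumed limit so the 32-bit NEON lanes below cannot overflow. */
    assert(n <= 65536);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint32x4_t acc = vdupq_n_u32(0);
    for (; i + 8 <= n; i += 8) {            /* integer: index, compare, branch */
        uint16x8_t v = vld1q_u16(w + i);    /* NEON: load eight 16-bit words   */
        acc = vpadalq_u16(acc, v);          /* NEON: pairwise add-accumulate   */
    }
    uint64x2_t wide = vpaddlq_u32(acc);     /* NEON: reduce four lanes to two  */
    sum = vgetq_lane_u64(wide, 0) + vgetq_lane_u64(wide, 1);
#endif

    for (; i < n; i++)                      /* scalar tail, or the whole input */
        sum += w[i];
    return fold16(sum);
}
```

The caller still waits for csum_sketch to return, exactly as ip_csum waits for do_csum in the question; the concurrency is between individual instructions inside the loop, not between the caller and the callee.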
This online tool, Cycle Counter for Cortex A8, is great for analyzing the performance of your assembly code. It shows what is executing in which unit and what is stalling.
#2
6
ARM is not "idle" while NEON operations are executed, but controls them.
To fully use the power of both units, one can carefully plan an interleaved sequence of operations:
loop:
    subs     r0, r0, r1            @ ARM operation (sets the flags for bne)
    vadd.i16 q0, q0, q1            @ NEON operation
    ldr      r3, [r1, r2, lsl #2]  @ ARM operation
    vld1.32  {d4}, [r1]!           @ NEON operation using an ARM register
    bne      loop                  @ ARM operation controlling the flow of both units
The ARM Cortex-A8 can execute up to 2 instructions in each clock cycle. If both of them are independent NEON operations, there is no use putting an ARM instruction in between. On the other hand, if one knows that the latency of a VLD (load) is large, one can place many ARM instructions between the load and the first use of the loaded value. In each case, though, the combined usage must be planned in advance and interleaved.
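To see why placing independent instructions between a load and its first use helps, here is a toy model in C. It is only a sketch under loud assumptions: a single in-order issue slot (not the Cortex-A8's dual issue), a made-up load latency of 4 cycles, and hypothetical names (struct insn, toy_cycles, back_to_back, interleaved). It does not reproduce real Cortex-A8 timings; it only shows the effect of the ordering.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_REGS = 8, NO_REG = -1 };

/* One instruction in the toy model. */
struct insn {
    int dst;      /* register written, or NO_REG            */
    int src;      /* register read, or NO_REG               */
    int latency;  /* cycles from issue until dst is usable  */
};

/* Cycle at which the last result becomes available, assuming in-order
 * issue of one instruction per cycle and a stall on unready operands. */
static int toy_cycles(const struct insn *prog, size_t n)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register is usable */
    int next_slot = 0;          /* earliest cycle the next insn may issue */
    int done = 0;

    for (size_t i = 0; i < n; i++) {
        assert(prog[i].latency >= 1);
        int issue = next_slot;
        if (prog[i].src != NO_REG && ready[prog[i].src] > issue)
            issue = ready[prog[i].src];   /* stall until the operand is ready */
        int finish = issue + prog[i].latency;
        if (prog[i].dst != NO_REG)
            ready[prog[i].dst] = finish;
        if (finish > done)
            done = finish;
        next_slot = issue + 1;
    }
    return done;
}

/* Load into r0 (assumed latency 4), immediately used, then three
 * independent one-cycle operations on r2, r3 and r4. */
static const struct insn back_to_back[5] = {
    {0, NO_REG, 4}, {1, 0, 1}, {2, 2, 1}, {3, 3, 1}, {4, 4, 1}
};

/* Same work, but the independent operations fill the load latency. */
static const struct insn interleaved[5] = {
    {0, NO_REG, 4}, {2, 2, 1}, {3, 3, 1}, {4, 4, 1}, {1, 0, 1}
};
```

In this model toy_cycles(back_to_back, 5) comes out at 8 cycles and toy_cycles(interleaved, 5) at 5, because the three independent operations run while the load is still in flight instead of queuing behind the stalled consumer.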
#3
6
In the Application Level Programmers' Model, you can't really distinguish between ARM and NEON units.
While NEON is a separate hardware unit (available as an option on Cortex-A series processors), it is the ARM core that drives it in a tightly coupled fashion. It is not a separate DSP that you can communicate with asynchronously.
You can write better code by fully utilizing the pipelines of both units, but this is not the same as having a separate core.
The NEON unit is there because it can do some operations (SIMD) much faster than the ARM unit at a low clock frequency.
This is like having a friend who is good at math: whenever you have a hard question, you can ask him. While waiting for an answer you can do some small things, such as planning "if the answer is this, I should do this; if not, do that instead". But if you depend on that answer to go on, you need to wait for him to answer before going further. You could calculate the answer yourself, but asking him is much faster, even including the communication time between the two of you. I think you can even extend this analogy: "you also need to buy that friend some lunch (energy consumption), but in many cases it is worth it".
Anyone who says the ARM core can do other things while the NEON unit is working on its stuff is talking about instruction-level parallelism, not anything like task-level parallelism.