I'm trying to optimize my code, taking advantage of multicore processors, to both copy and manipulate large dense arrays.
For copying: I have a large dense array (approximately 6000x100000) from which I need to pull 15x100000 subarrays to do several computations down the pipe. The pipe consists of a lot of linear algebra functions that are handled by BLAS, which is multicore. Whether the time to pull data will really matter compared to the linear algebra is an open question, but I'd like to err on the side of caution and make sure the data copying is optimized.
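To make the copy step concrete, here is a minimal sketch of pulling a set of rows out of the big array with plain memcpy. It assumes the array is stored row-major as doubles and that a 15x100000 subarray means 15 whole rows; the function name gather_rows and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy n_pick rows of a row-major matrix (n_cols doubles per row) into a
 * contiguous destination buffer holding n_pick * n_cols doubles.
 * rows[i] is the source row index that ends up as destination row i.
 * Assumption: row-major layout, so each row is one contiguous run. */
void gather_rows(double *dst, const double *src, size_t n_cols,
                 const size_t *rows, size_t n_pick)
{
    assert(dst != NULL && src != NULL && rows != NULL);
    for (size_t i = 0; i < n_pick; ++i) {
        /* One memcpy per row: n_cols * sizeof(double) contiguous bytes. */
        memcpy(dst + i * n_cols, src + rows[i] * n_cols,
               n_cols * sizeof *dst);
    }
}
```

With 100000 doubles per row, each call to memcpy moves about 800 KB, so the per-call overhead is negligible. If the array were column-major instead, the same 15 rows would be strided and this approach would not apply as written.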
For manipulating: I have many different functions that manipulate arrays element by element or row by row. It would be best if each of these ran on multiple cores.
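A minimal sketch of what a row-wise operation spread over cores might look like, assuming an OpenMP-capable compiler (e.g. built with -fopenmp) and a row-major array of doubles. The function scale_rows and the choice of scaling as the operation are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply every element of row i by factor[i], for a row-major
 * n_rows x n_cols matrix. Each row is independent of the others. */
void scale_rows(double *a, size_t n_rows, size_t n_cols, const double *factor)
{
    assert(a != NULL && factor != NULL);
    /* Rows are split statically across threads. Without OpenMP enabled
     * the pragma is ignored and the loop simply runs serially. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n_rows; ++i) {
        double *row = a + (size_t)i * n_cols;
        const double f = factor[i];
        for (size_t j = 0; j < n_cols; ++j)
            row[j] *= f;
    }
}
```

Note that with only 15 rows the outer loop cannot keep more than 15 threads busy, and a simple operation like this is usually limited by memory bandwidth rather than by core count, so it is worth measuring before assuming a speedup.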
My question is: is it best to use the right framework (OpenMP, OpenCL) and let all the magic happen with the compiler, or are there good functions/libraries that do this faster?
1 Solution
#1
7
Your starting point should be good old memcpy. Some tips from someone who has for a long time been obsessed by "copying performance".
- Read What Every Programmer Should Know About Memory.
- Benchmark your system's memcpy performance, e.g. with the memcpy_bench function here.
- Benchmark the scalability of memcpy when it's run on multiple cores, e.g. with multi_memcpy_bench here. (Unless you're on some multi-socket NUMA hardware, I think you won't see much benefit from multithreaded copying.)
- Dig into your system's implementation of memcpy and understand it. The days when you'd find most of the time spent in a solitary rep movsd are long gone; last time I looked at the gcc and Intel compiler CRTs, they both varied their strategy depending on the size of the copy relative to the CPU's cache size.
- On Intel, understand the advantages of the non-cache-polluting store instructions (e.g. movntps), as these can achieve significant throughput improvements over a conventional approach (you'll see these used in the memcpy implementations from the previous tip).
- Have access to, and know how to use, a sampling profiler to identify how much of your app's time is spent in copying operations. There are also more advanced tools which can look at CPU performance counters and tell you all sorts of things about what the various caches are doing.
- (Advanced topic) Be aware of the TLB and when huge pages can help.
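As a rough illustration of the benchmarking tip, here is a minimal sketch of a single-threaded memcpy throughput measurement. It is not the memcpy_bench function linked in the list; the name bench_memcpy_gbps, the best-of-N timing, and the use of C11 timespec_get for wall-clock time are my own assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Return the best observed memcpy throughput in GB/s (1e9 bytes/s) for
 * copies of `bytes` bytes over `reps` repetitions, or -1.0 if the
 * buffers cannot be allocated. */
double bench_memcpy_gbps(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    double best = 0.0;
    for (int r = 0; r < reps; ++r) {
        struct timespec t0, t1;
        /* Wall-clock time; not monotonic, which is acceptable for a sketch. */
        timespec_get(&t0, TIME_UTC);
        memcpy(dst, src, bytes);
        timespec_get(&t1, TIME_UTC);
        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double gbps = secs > 0.0 ? (double)bytes / secs / 1e9 : 0.0;
        if (gbps > best)
            best = gbps;
    }
    /* Read the destination so the copies cannot be optimised away. */
    volatile char sink = dst[bytes - 1];
    (void)sink;
    free(src);
    free(dst);
    return best;
}
```

Calling it with a range of sizes, from well inside the L1 cache up to several times the last-level cache, should show where throughput drops off on a given machine. For the 15x100000 case in the question, a size of 15 * 100000 * sizeof(double) bytes would be the relevant data point.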
But my expectation is that your copies will be pretty minor overhead compared with any linalg heavy lifting. It's good to be aware of what the numbers are though. I wouldn't expect OpenCL or whatever for CPU to magically offer any improvements here (unless your system's memcpy is poorly implemented); IMHO it's better to dig into this stuff in more detail, getting down to the basics of what's actually happening at the level of instructions, registers, cache lines and pages, than it is to move away from that by layering another level of abstraction on top.
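To make the instruction-level view a bit more concrete, here is a hedged sketch of a copy that uses non-temporal stores through the SSE intrinsic _mm_stream_ps (which compiles to movntps). The function name stream_copy_floats is hypothetical, it assumes a 16-byte-aligned destination, and it falls back to memcpy when SSE is not available. Whether it beats your library's memcpy is something to measure, not assume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Copy n floats from src to dst. dst must be 16-byte aligned because
 * streaming stores require an aligned address; src may be unaligned. */
void stream_copy_floats(float *dst, const float *src, size_t n)
{
    assert(((uintptr_t)dst & 15u) == 0);
#if defined(__SSE__)
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  /* ordinary load, 4 floats */
        _mm_stream_ps(dst + i, v);         /* store that bypasses the cache */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
    for (; i < n; ++i)  /* leftover elements, fewer than 4 */
        dst[i] = src[i];
#else
    memcpy(dst, src, n * sizeof *dst);
#endif
}
```

Streaming stores tend to pay off only when the destination is too large to stay in cache and will not be read again soon. If the copied subarray is fed straight into BLAS, keeping it in cache with an ordinary copy may well be the better choice.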
Of course if you're considering porting your code from whatever multicore BLAS library you're using currently to a GPU accelerated linear algebra version, this becomes a completely different (and much more complicated) question (see JayC's comment below). If you want substantial performance gains you should certainly be considering it though.