
Time: 2021-01-09 21:37:48

I'm trying to optimize my code, taking advantage of multicore processors, to both copy and manipulate large dense arrays.


For copying: I have a large dense array (approximately 6000x100000) from which I need to pull 15x100000 subarrays to do several computations down the pipe. The pipe consists of a lot of linear algebra functions that are handled by BLAS, which is multicore. Whether the time to pull data will really matter compared to the linear algebra is an open question, but I'd like to err on the side of caution and make sure the data copying is optimized.


For manipulating: I have many different functions that manipulate arrays by element or by row. It would be best if each of these was done multicore.


My question is: is it best to use the right framework (OpenMP, OpenCL) and let all the magic happen with the compiler, or are there good functions/libraries that do this faster?


1 Solution



Your starting point should be good old memcpy. Some tips from someone who has for a long time been obsessed by "copying performance".


  1. Read "What Every Programmer Should Know About Memory".
  2. Benchmark your system's memcpy performance, e.g. with the memcpy_bench function here.
  3. Benchmark the scalability of memcpy when it's run on multiple cores, e.g. with multi_memcpy_bench here. (Unless you're on multi-socket NUMA hardware, I think you won't see much benefit from multithreaded copying.)
  4. Dig into your system's implementation of memcpy and understand it. The days when you'd find most of the time spent in a solitary rep movsd are long gone; last time I looked at the gcc and Intel compiler CRTs, they both varied their strategy depending on the size of the copy relative to the CPU's cache size.
  5. On Intel, understand the advantages of the non-cache-polluting store instructions (e.g. movntps), as these can achieve significant throughput improvements over a conventional approach (you'll see these used in 4).
  6. Have access to, and know how to use, a sampling profiler to identify how much of your app's time is spent in copying operations. There are also more advanced tools that can look at CPU performance counters and tell you all sorts of things about what the various caches are doing.
  7. (Advanced topic) Be aware of the TLB and when huge pages can help.
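
The benchmarking tip above can be sketched roughly as follows. This is a minimal illustration, not the memcpy_bench function referred to in the list: the name `memcpy_gbps`, the use of C11 `timespec_get` for timing, and the choice to report the best of several repetitions are all my own assumptions. Treat the numbers it produces as a rough guide only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical helper: copies `bytes` bytes `reps` times with memcpy and
   returns the best observed throughput in GB/s. Returns -1.0 if `bytes`
   is zero or an allocation fails, and 0.0 if the timer never advanced. */
double memcpy_gbps(size_t bytes, int reps) {
    if (bytes == 0) return -1.0;

    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    if (!src || !dst) { free(src); free(dst); return -1.0; }

    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 0xA5, bytes);
    memset(dst, 0, bytes);

    volatile unsigned char sink = 0;
    double best = 0.0;
    for (int r = 0; r < reps; ++r) {
        double t0 = now_seconds();
        memcpy(dst, src, bytes);
        double t1 = now_seconds();

        sink = dst[bytes - 1];  /* read back so the copy has a visible use */

        double elapsed = t1 - t0;
        if (elapsed > 0.0) {
            double gbps = (double)bytes / elapsed / 1e9;
            if (gbps > best) best = gbps;
        }
    }
    (void)sink;

    /* Sanity check: the destination must match the source. */
    assert(memcmp(src, dst, bytes) == 0);

    free(src);
    free(dst);
    return best;
}
```

Calling `memcpy_gbps` with a range of sizes (a few KB, a few MB, a few hundred MB) should show the throughput stepping down as the working set falls out of each cache level, which is the behaviour the memcpy implementations mentioned above adapt to.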

But my expectation is that your copies will be pretty minor overhead compared with any linalg heavy lifting. It's good to be aware of what the numbers are though. I wouldn't expect OpenCL or whatever for CPU to magically offer any improvements here (unless your system's memcpy is poorly implemented); IMHO it's better to dig into this stuff in more detail, getting down to the basics of what's actually happening at the level of instructions, registers, cache lines and pages, than it is to move away from that by layering another level of abstraction on top.
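
To put a rough number on that: assuming the elements are 8-byte doubles (the question doesn't say), a 15x100000 subarray is 15 × 100000 × 8 = 12,000,000 bytes, i.e. about 12 MB per pull. At an assumed, purely illustrative 10 GB/s of sustained copy bandwidth that is on the order of 1.2 ms, so measure your own system before optimizing. If the array is stored row-major (again an assumption), each row is contiguous and pulling a subarray is just one memcpy per row. The function name `gather_rows` and its parameters are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies `n_rows` selected rows of a row-major matrix with `n_cols`
   columns into a contiguous destination buffer.
   `row_idx[i]` is the source row that becomes destination row i.
   `dst` must have room for n_rows * n_cols doubles and must not
   overlap `src`. */
void gather_rows(double *dst, const double *src, size_t n_cols,
                 const size_t *row_idx, size_t n_rows) {
    assert(dst != NULL && src != NULL && row_idx != NULL);
    for (size_t i = 0; i < n_rows; ++i) {
        /* Each row is one contiguous block, so a single memcpy moves it. */
        memcpy(dst + i * n_cols,
               src + row_idx[i] * n_cols,
               n_cols * sizeof *dst);
    }
}
```

If the array were column-major instead, the same 15 rows would be scattered with a stride of 6000 elements, and a strided loop (or a different storage layout) would be needed rather than per-row memcpy calls.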


Of course if you're considering porting your code from whatever multicore BLAS library you're using currently to a GPU accelerated linear algebra version, this becomes a completely different (and much more complicated) question (see JayC's comment below). If you want substantial performance gains you should certainly be considering it though.




