
Time: 2022-08-06 13:54:34

I once had the theory that, on modern operating systems, multithreaded read access to an HDD should perform better.


I thought that:
the operating system queues all read requests and rearranges them so that it can read from the HDD more sequentially. The more requests it gets, the better it can rearrange them to optimize the read sequence.
I was quite sure I had read this somewhere a few times.


But I did some benchmarking and found that multithreaded read access mostly performs much worse, and never performs better.


I saw this under both Windows and Linux. I benchmarked plain file searching using the operating system's tools, and I also wrote my own small benchmarks.


Am I missing something?
Can someone explain to me the secrets of this topic?
Thank you!


6 solutions


Well apparently you're causing the read head to skip around all over the place. Your bottleneck is the disk, not the processor.


To re-phrase, the CPU might be parallel but the disk isn't.



solution: use NCQ to boost performance. to do so, configure your SATA HDD controller to use AHCI.


additional details below:


i had made similar observations when analyzing a particular application. on my quad-core system i compared the following configurations:


  • 1 core only: pretty fast

  • 4 cores enabled: much slower! this was quite surprising and also confusing to me.

it turned out the application was doing heavy, concurrent HDD access. with multiple cores (and hence multiple threads running at once) this noticeably increased total execution time.


i did some research and learned that a feature called NCQ (native command queuing) will do the optimization of HDD access you are referring to.
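
To make the idea concrete, here is a toy model of what request reordering buys you. It is only a sketch: the block addresses are made up, the drive is modelled as a single head moving along one axis, and real NCQ firmware also takes rotational position into account, which this ignores.

```python
def total_seek_distance(start, positions):
    """Sum of head movements when servicing positions in the given order."""
    distance = 0
    head = start
    for pos in positions:
        distance += abs(pos - head)
        head = pos
    return distance


def elevator_order(start, positions):
    """Service everything at or above the head going up, then sweep back down."""
    upward = sorted(p for p in positions if p >= start)
    downward = sorted((p for p in positions if p < start), reverse=True)
    return upward + downward


# hypothetical block addresses, interleaved the way two threads reading
# two different files might issue them
requests = [10, 900, 20, 910, 30, 920]
fifo = total_seek_distance(0, requests)
reordered = total_seek_distance(0, elevator_order(0, requests))
print(fifo, reordered)  # prints "4440 920"
```

In arrival order the head travels almost five times as far as in the sorted order. A drive or controller can only reorder like this if it is allowed to hold several outstanding requests at once, which is what a command queue provides.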


in the SCSI world this has been a common standard for quite a while, and in the SATA world it was adopted some time back. to unlock this feature it's required to configure your HDD controller to operate in AHCI mode - this is a prerequisite to use NCQ!


as regular desktop systems nowadays use on-board HDD controllers, this configuration part needs to be done in BIOS setup. for SATA configuration you can usually choose between the following operational modes:


  • compatible / legacy IDE

  • AHCI
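
On Linux you can check from user space whether command queuing is in effect after changing the mode. The sketch below reads the `queue_depth` attribute that the kernel exposes under sysfs for SCSI/SATA disks; a value above 1 suggests NCQ is active, while 1 usually means it is off. The device name `sda` and the helper name `ncq_queue_depth` are assumptions for illustration, and the attribute may be missing for other device types.

```python
from pathlib import Path


def ncq_queue_depth(device, sysfs_root="/sys/block"):
    """Return the queue depth reported for a block device, or None if unavailable."""
    path = Path(sysfs_root) / device / "device" / "queue_depth"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    depth = ncq_queue_depth("sda")  # "sda" is an assumed device name
    if depth is None:
        print("queue depth not available")
    elif depth > 1:
        print(f"queue depth {depth}: command queuing looks active")
    else:
        print(f"queue depth {depth}: no command queuing")
```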

i went ahead and implemented my own custom benchmark to compare one and the same system running with the following configurations:


  • 4 cores enabled, legacy IDE: pretty slow

  • 4 cores enabled, AHCI / NCQ: much faster. particular benchmark sections performed 6 times faster!



to unleash the full power of systems with concurrent HDD access:


  1. switch over to AHCI (so you can utilize NCQ)

  2. don't use the generic AHCI drivers that come with the OS. instead, use the vendor-specific, optimized drivers. example: windows 7 comes with some generic AHCI drivers that support most of the common HDD controllers. however, when using an intel chipset, make sure to install the intel "matrix storage manager" or intel "rapid storage technology" (e.g. intel RST 11.7). optimized drivers have been shown to additionally boost HDD performance.

not doing so will make some applications run slower when using multiple threads instead of a single thread. that's the surprising part you need to consider.



note: there's a myth out there that says NCQ is only relevant for "server" environments (with hundreds of processes running in parallel). my benchmark results point in a different direction: it's also relevant for "desktop" environments, whenever heavy, concurrent HDD access is happening.


additional notes:

  1. some older chipsets / SATA HDD controllers do not support AHCI mode. but that's not covered here.

  2. some "old" OS need special actions when either installing in AHCI mode or migrating an already installed system from IDE mode to AHCI. but that's not covered here.


Whether or not you are seeing speedup will almost assuredly depend on the scenario you are looking at and the hardware. More details on your benchmarking methodology would be useful here.


At a coarse level, the opportunity for a speedup arises when you're not utilizing the maximum throughput of the I/O controller and its caches, or when I/O and CPU-intensive work could overlap but are instead blocked waiting for each other.


Are you comparing doing reads of multiple small files spread out across the system, or just reading a few large files sequentially? You'll see different performance characteristics here.
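
For reference, a minimal sketch of the kind of benchmark I have in mind, reading the same set of files with one thread and then with several. All names here (`read_file`, `timed_read`, `make_test_files`) are made up for the example. Note the big caveat: files that were just written are almost certainly served from the OS page cache, so to measure the disk itself you would need to point this at existing files much larger than RAM, or clear the cache between runs.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=1 << 20):
    """Read a file to the end in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def timed_read(paths, workers):
    """Read all paths with the given number of threads; return (bytes, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start


def make_test_files(directory, count=8, size=1 << 20):
    """Create `count` files of `size` random bytes and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"bench_{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = make_test_files(tmp)
        for workers in (1, 4):
            total, seconds = timed_read(paths, workers)
            print(f"{workers} thread(s): {total} bytes in {seconds:.3f}s")
```

Varying `count` and `size` lets you compare the "many small files" case against the "few large files" case with the same harness.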


Have you profiled with a good systems profiler like the (free) windows performance toolkit to see what is going on in your benchmarks? This is practically a must.


These kinds of benchmarks can be a lot of fun to write and profile. Don't let a few false starts get in the way of digging in and looking for speedups.




I think your assumption about the OS optimizing concurrent disk access is simply false. I imagine it does this sort of re-ordering when you use scatter/gather I/O from a single thread, but there's no practical way for it to optimize concurrent requests in this way. Any such scheme would introduce unnecessary latency in single-threaded reads. (The OS would have to wait a bit just in case a concurrent request came in.) Anyway, the short answer is that your concurrent requests are causing the read heads to jump all over the place. The OS cannot optimize this away.



I think you are talking about native command queuing, which may or may not be enabled on the system you are testing with. From the Wikipedia entry:


In fact, newer mainstream Linux kernels support AHCI natively. Windows XP requires the installation of a vendor-specific driver even if AHCI is present on the host bus adapter. Windows Vista natively supports both AHCI and NCQ. FreeBSD fully supports AHCI and NCQ since version 8.0.


Also, I haven't done any tests, but NCQ may not be that effective for a directory walk that has to access small files/inodes all over the disk. It could be that the disk controller is able to service each request fast enough that a queue is never built up to reorder, thus you don't see any benefit.



It's probably important here that you split the reading of the directory or file information away from the processing of that information. In other words, disk IO in one thread, processing and searching in another. Pass completed IO information to the processing thread with a bounded queue. By doing this you'll ensure that your IO thread is never waiting on the processing of results before getting busy on the read of the next block of data to process.
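
A minimal sketch of that split, assuming the "processing" is just counting occurrences of a byte string. The names `reader`, `count_matches` and `max_pending` are invented for the example. One thread does nothing but read; the caller's thread does the searching; a bounded `queue.Queue` sits in between, so the reader only blocks once the processing side has fallen `max_pending` chunks behind.

```python
import queue
import threading

_DONE = object()  # sentinel telling the consumer that no more data is coming


def reader(paths, out_queue, chunk_size=1 << 16):
    """I/O thread: read files chunk by chunk and hand the chunks over."""
    try:
        for path in paths:
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    out_queue.put((path, chunk))  # blocks while the queue is full
    finally:
        out_queue.put(_DONE)  # always signal the end, even after an error


def count_matches(paths, needle, max_pending=8):
    """Count occurrences of needle (bytes) in the files; searching runs in the caller's thread."""
    chunks = queue.Queue(maxsize=max_pending)
    io_thread = threading.Thread(target=reader, args=(paths, chunks), daemon=True)
    io_thread.start()
    matches = 0
    while True:
        item = chunks.get()
        if item is _DONE:
            break
        _path, chunk = item
        # simplification: a match that spans two chunks is not counted
        matches += chunk.count(needle)
    io_thread.join()
    return matches
```

Because only one thread touches the disk, the reads stay as sequential as the file layout allows, while the CPU work still overlaps with the I/O.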


