I once had the theory that, on modern operating systems, multithreaded read access to an HDD should perform better than single-threaded access.
My reasoning was this: the operating system queues all read requests and rearranges them so that it can read from the HDD more sequentially. The more requests it gets, the better it can rearrange them to optimize the read sequence.
I was quite sure I had read this somewhere a few times.
But I did some benchmarking and found that multithreaded read access mostly performs much worse, and never performs better.
I saw this under both Windows and Linux. I benchmarked plain file searching using the operating system's tools, and I also wrote my own small benchmarks.
Am I missing something?
Can someone explain to me the secrets of this topic?
Thank you!
6 solutions
#1
Well, apparently you're causing the read head to skip around all over the place. Your bottleneck is the disk, not the processor.
To rephrase: the CPU might be parallel, but the disk isn't.
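To make the "skipping around" concrete, here is a toy model, not a measurement. It assumes two hypothetical files stored contiguously at distant block addresses, ignores rotational delay and caching, and only counts how far the head would have to travel. All names and numbers are made up for illustration.

```python
def head_travel(positions, start=0):
    """Total distance the head moves when visiting block positions in order."""
    travel, current = 0, start
    for pos in positions:
        travel += abs(pos - current)
        current = pos
    return travel


# Two hypothetical files, each stored contiguously, far apart on the platter.
file_a = list(range(0, 100))
file_b = list(range(1_000_000, 1_000_100))

# One thread: read file A completely, then file B.
single_thread = file_a + file_b

# Two threads: requests for both files reach the disk interleaved.
two_threads = [block for pair in zip(file_a, file_b) for block in pair]

sequential_travel = head_travel(single_thread)
interleaved_travel = head_travel(two_threads)
print(f"one thread:  {sequential_travel:>12,} blocks of head travel")
print(f"two threads: {interleaved_travel:>12,} blocks of head travel")
```

In this model the interleaved order makes the head travel roughly 200 times as far for the same data. A real disk is more forgiving because of read-ahead and caching, but the direction of the effect is the same.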
#2
Solution: use NCQ to boost performance. To do so, configure your SATA HDD controller to use AHCI.
Additional details below:
I made similar observations when analyzing a particular application. On my quad-core system I compared the following configurations:
- 1 core only: pretty fast
- 4 cores enabled: much slower! This was quite surprising and also confusing to me.
It turned out the application was doing heavy, concurrent HDD access. With multiple cores (and hence multiple threads running at once), this noticeably slowed down total execution time.
I did some research and learned that a feature called NCQ (Native Command Queuing) performs the optimization of HDD access you are referring to.
In the SCSI world this has been a common standard for quite a while, and in the SATA world it was adopted some time back. To unlock this feature, you must configure your HDD controller to operate in AHCI mode. This is a prerequisite for using NCQ!
As regular desktop systems nowadays use on-board HDD controllers, this configuration needs to be done in the BIOS setup. For SATA configuration you can usually choose between the following operational modes:
- compatible / legacy IDE
- AHCI
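On Linux, one way to check whether command queuing is actually in effect is to read the drive's queue depth from sysfs. The sketch below assumes a libata/SCSI-style device that exposes `/sys/block/<device>/device/queue_depth`; the device name `sda` is only an example, and `read_queue_depth` is a helper name made up here. As far as I know, a value of 1 usually means NCQ is not being used, while values around 31 or 32 are typical when it is.

```python
from pathlib import Path


def read_queue_depth(device="sda", sysfs_root="/sys/block"):
    """Return the device's queue depth from sysfs, or None if it is unavailable."""
    path = Path(sysfs_root) / device / "device" / "queue_depth"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Attribute missing (different driver, non-Linux system) or unreadable.
        return None


depth = read_queue_depth("sda")
if depth is None:
    print("queue_depth not available for this device")
elif depth > 1:
    print(f"queue depth {depth}: the drive accepts several queued commands")
else:
    print("queue depth 1: requests are handled one at a time")
```

The `sysfs_root` parameter exists only so the function can be pointed at a test directory; on a real system the default is what you want.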
I went ahead and implemented my own custom benchmark to compare the same system running in the following configurations:
- 4 cores enabled, legacy IDE: pretty slow
- 4 cores enabled, AHCI / NCQ: much faster. Particular benchmark sections ran 6 times faster!
--
Conclusion:
To unleash the full power of systems with concurrent HDD access:
- Switch over to AHCI (so you can utilize NCQ).
- Don't use the generic AHCI drivers that come with the OS. Instead, use the vendor-specific, optimized drivers. Example: Windows 7 comes with generic AHCI drivers that support most of the common HDD controllers. However, when using an Intel chipset, make sure to install the Intel "Matrix Storage Manager" or Intel "Rapid Storage Technology" (e.g. Intel RST 11.7). Optimized drivers have been shown to boost HDD performance further.
Not doing so will make some applications run slower when using multiple threads instead of a single thread. That's the surprising part you need to consider.
--
Note: there's a myth out there that NCQ is only relevant for "server" environments (with hundreds of processes running in parallel). My benchmark results point in a different direction: it's also relevant for "desktop" environments, whenever heavy, concurrent HDD access is happening.
Additional notes:
- Some older chipsets / SATA HDD controllers do not support AHCI mode, but that's not covered here.
- Some "old" operating systems need special steps when either installing in AHCI mode or migrating an already installed system from IDE mode to AHCI, but that's not covered here.
#3
Whether or not you see a speedup will almost certainly depend on the scenario you are looking at and on the hardware. More details on your benchmarking methodology would be useful here.
At a coarse level, the opportunity for a speedup arises when you're not utilizing the maximum throughput of the I/O controller and its caches, or when I/O and CPU-intensive work that could overlap are instead blocked waiting for each other.
Are you comparing reads of many small files spread out across the system, or just reading a few large files sequentially? You'll see different performance characteristics here.
Have you profiled with a good systems profiler like the (free) Windows Performance Toolkit to see what is going on in your benchmarks? This is practically a must.
These kinds of benchmarks can be a lot of fun to write and profile. Don't let a few false starts get in the way of digging in and looking for speedups.
-Rick
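If it helps as a starting point, a minimal benchmark for this kind of comparison could look like the sketch below. The helper names (`read_file`, `timed_read`) are made up here, and the important caveat is the OS page cache: a second run over the same files is mostly served from memory, so either use files that have not been read since boot or clear the cache between runs (on Linux, for example, by writing 3 to `/proc/sys/vm/drop_caches` as root).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=1 << 20):
    """Read a file to the end in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total


def timed_read(paths, workers=1):
    """Read all paths using `workers` threads; return (total_bytes, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start
```

Calling `timed_read(paths, workers=1)` and `timed_read(paths, workers=4)` on comparable, uncached sets of files gives the two numbers to compare. Running it once with many small files and once with a few large ones should show the different characteristics mentioned above.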
#4
I think your assumption about the OS optimizing concurrent disk access is simply false. I imagine it does this sort of re-ordering when you use scatter/gather I/O from a single thread, but there's no practical way for it to optimize concurrent requests in this way. Any such scheme would introduce unnecessary latency in single-threaded reads. (The OS would have to wait a bit just in case a concurrent request came in.) Anyway, the short answer is that your concurrent requests are causing the read heads to jump all over the place. The OS cannot optimize this away.
#5
I think you are talking about native command queuing, which may or may not be enabled on the system you are testing with. From the Wikipedia entry:
In fact, newer mainstream Linux kernels support AHCI natively. Windows XP requires the installation of a vendor-specific driver even if AHCI is present on the host bus adapter. Windows Vista natively supports both AHCI and NCQ. FreeBSD fully supports AHCI and NCQ since version 8.0.
Also, I haven't done any tests, but NCQ may not be that effective for a directory walk that has to access small files/inodes all over the disk. It could be that the disk controller services each request fast enough that a queue never builds up to reorder, so you don't see any benefit.
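The effect of queue depth can be illustrated with a small simulation. This is a deliberately simplified model and not how any particular drive firmware works: it assumes the drive always serves the queued request closest to the current head position and ignores rotational position entirely. The function name and the request pattern are invented for the example.

```python
import random
from collections import deque


def travel_with_queue(requests, depth):
    """Head travel when the drive may pick the nearest of up to `depth` queued requests."""
    pending = deque(requests)  # requests in arrival order
    queued, head, travel = [], 0, 0
    while pending or queued:
        # Refill the drive's queue up to its depth.
        while pending and len(queued) < depth:
            queued.append(pending.popleft())
        # Serve the queued request closest to the current head position.
        nearest = min(queued, key=lambda pos: abs(pos - head))
        queued.remove(nearest)
        travel += abs(nearest - head)
        head = nearest
    return travel


rng = random.Random(42)  # fixed seed so runs are repeatable
requests = [rng.randrange(1_000_000) for _ in range(2_000)]

depth_1 = travel_with_queue(requests, depth=1)
depth_32 = travel_with_queue(requests, depth=32)
print(f"depth 1:  {depth_1:,}")
print(f"depth 32: {depth_32:,}")
```

With a depth of 1 the drive has no choice and follows arrival order, which is the situation described above where a queue never builds up. Only when several requests are outstanding at once is there anything to reorder.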
#6
It's probably important here that you split the reading of the directory or file information away from the processing of that information. In other words, disk IO in one thread, processing and searching in another. Pass completed IO information to the processing thread with a bounded queue. By doing this you'll ensure that your IO thread is never waiting on the processing of results before getting busy on the read of the next block of data to process.
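A minimal sketch of that split, using Python's standard `queue` and `threading` modules. The function names, the block size, and the "processing" step (counting occurrences of a byte string) are stand-ins chosen for illustration; note that this naive count misses matches that straddle a block boundary.

```python
import queue
import threading


def read_blocks(path, out, block_size=1 << 20):
    """I/O thread: read the file block by block and hand each block to the queue."""
    try:
        with open(path, "rb") as f:
            while block := f.read(block_size):
                out.put(block)  # waits only if the queue is already full
    finally:
        out.put(None)  # sentinel: no more data, even if reading failed


def count_matches(path, needle, max_queued=8):
    """Count occurrences of `needle` while the I/O thread keeps reading ahead."""
    blocks = queue.Queue(maxsize=max_queued)  # bounded, so memory use stays limited
    reader = threading.Thread(target=read_blocks, args=(path, blocks))
    reader.start()
    matches = 0
    while (block := blocks.get()) is not None:
        matches += block.count(needle)  # the processing step, off the I/O thread
    reader.join()
    return matches
```

The bound on the queue is a trade-off: the I/O thread does stall once `max_queued` blocks are waiting, but that only happens when processing is the slower side, and it keeps a fast disk from filling memory with unprocessed data.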