I have a method that creates a MessageDigest (a hash) from a file, and I need to do this to a lot of files (>= 100,000). How big should I make the buffer used to read from the files to maximize performance?
Most everyone is familiar with the basic code (which I'll repeat here just in case):
MessageDigest md = MessageDigest.getInstance( "SHA" );
FileInputStream ios = new FileInputStream( "myfile.bmp" );
byte[] buffer = new byte[4 * 1024]; // what should this value be?
int read = 0;
while( ( read = ios.read( buffer ) ) > 0 )
    md.update( buffer, 0, read );
ios.close();
md.digest();
What is the ideal size of the buffer to maximize throughput? I know this is system dependent, and I'm pretty sure it's OS, file system, and HDD dependent, and there may be other hardware/software in the mix.
(I should point out that I'm somewhat new to Java, so this may just be some Java API call I don't know about.)
Edit: I do not know ahead of time the kinds of systems this will be used on, so I can't assume a whole lot. (I'm using Java for that reason.)
Edit: The code above omits things like try..catch to keep the post short.
10 Answers
#1
177
Optimum buffer size is related to a number of things: file system block size, CPU cache size and cache latency.
Most file systems are configured to use block sizes of 4096 or 8192. In theory, if you configure your buffer size so you are reading a few bytes more than the disk block, the operations with the file system can be extremely inefficient (e.g. if you configured your buffer to read 4100 bytes at a time, each read would require 2 block reads by the file system). If the blocks are already in cache, then you wind up paying the price of RAM -> L3/L2 cache latency. If you are unlucky and the blocks are not in cache yet, then you pay the price of the disk -> RAM latency as well.
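To put numbers on the 4100-byte example, here is a small sketch that counts how many file system blocks each sequential read overlaps. The class and method names are made up for illustration, and it only counts logical block overlaps; it says nothing about real disk I/O, which the OS cache largely hides.

```java
final class BlockMath {
    /** Number of blocks overlapped by a read of `length` bytes starting at `offset`. */
    static long blocksTouched(long offset, int length, int blockSize) {
        if (length <= 0) {
            return 0;
        }
        long firstBlock = offset / blockSize;
        long lastBlock = (offset + length - 1) / blockSize;
        return lastBlock - firstBlock + 1;
    }

    /** Sum of block overlaps when reading `fileSize` bytes sequentially in `bufferSize` chunks. */
    static long totalBlockTouches(long fileSize, int bufferSize, int blockSize) {
        long total = 0;
        for (long offset = 0; offset < fileSize; offset += bufferSize) {
            int length = (int) Math.min(bufferSize, fileSize - offset);
            total += blocksTouched(offset, length, blockSize);
        }
        return total;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 20; // hypothetical 1 MiB file on a 4096-byte block file system
        System.out.println(totalBlockTouches(fileSize, 4096, 4096)); // prints 256
        System.out.println(totalBlockTouches(fileSize, 4100, 4096)); // prints 511
    }
}
```

With the misaligned buffer, almost every block is touched twice, which is the inefficiency described above.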
This is why you see most buffers sized as a power of 2, and generally larger than (or equal to) the disk block size. This means that one of your stream reads could result in multiple disk block reads - but those reads will always use a full block - no wasted reads.
Now, this is offset quite a bit in a typical streaming scenario because the block that is read from disk is going to still be in memory when you hit the next read (we are doing sequential reads here, after all) - so you wind up paying the RAM -> L3/L2 cache latency price on the next read, but not the disk->RAM latency. In terms of order of magnitude, disk->RAM latency is so slow that it pretty much swamps any other latency you might be dealing with.
So, I suspect that if you ran a test with different buffer sizes (haven't done this myself), you would probably find a big impact of buffer size up to the size of the file system block. Above that, I suspect that things would level out pretty quickly.
There are a ton of conditions and exceptions here - the complexities of the system are actually quite staggering (just getting a handle on L3 -> L2 cache transfers is mind-bogglingly complex, and it changes with every CPU type).
This leads to the 'real world' answer: If your app is like 99% of the apps out there, set the buffer size to 8192 and move on (even better, choose encapsulation over performance and use BufferedInputStream to hide the details). If you are in the 1% of apps that are highly dependent on disk throughput, craft your implementation so you can swap out different disk interaction strategies, and provide the knobs and dials to allow your users to test and optimize (or come up with some self-optimizing system).
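As a rough sketch of the "knobs and dials" idea, the buffer size can default to 8192 and be overridden at launch. The property name hasher.bufferSize and the class name are invented for this example, and "SHA-1" is assumed to be what the question's "SHA" resolves to.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ConfigurableHasher {
    // Hypothetical knob: run with -Dhasher.bufferSize=65536 to try another size.
    static final int BUFFER_SIZE = Integer.getInteger("hasher.bufferSize", 8192);

    static byte[] sha1(InputStream in, int bufferSize)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[bufferSize];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return md.digest();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        // prints a9993e364706816aba3e25717850c26c9cd0d89d
        System.out.println(toHex(sha1(new ByteArrayInputStream(data), BUFFER_SIZE)));
    }
}
```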
#2
13
Yes, it's probably dependent on various things - but I doubt it will make very much difference. I tend to opt for 16K or 32K as a good balance between memory usage and performance.
Note that you should have a try/finally block in the code to make sure the stream is closed even if an exception is thrown.
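On Java 7 and later, try-with-resources does the same job as try/finally with less code. A minimal sketch, assuming a 16K buffer and SHA-1 (the class name is just for illustration):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SafeFileHasher {
    static byte[] sha1(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[16 * 1024];
        // The stream is closed when the block exits, even if read() throws.
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("hash-demo", ".bin");
        Files.write(file, new byte[100_000]);
        System.out.println(sha1(file).length + " digest bytes"); // prints "20 digest bytes"
        Files.delete(file);
    }
}
```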
#3
7
In most cases, it really doesn't matter that much. Just pick a good size such as 4K or 16K and stick with it. If you're positive that this is the bottleneck in your application, then you should start profiling to find the optimal buffer size. If you pick a size that's too small, you'll waste time doing extra I/O operations and extra function calls. If you pick a size that's too big, you'll start seeing a lot of cache misses which will really slow you down. Don't use a buffer bigger than your L2 cache size.
#4
4
In the ideal case we should have enough memory to read the file in one read operation. That would be the best performer because we let the system manage the file system, allocation units and HDD at will. In practice, if you are fortunate enough to know the file sizes in advance, just use the average file size rounded up to a multiple of 4K (the default allocation unit on NTFS). And best of all: create a benchmark to test multiple options.
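A benchmark along those lines could start as the sketch below. It is deliberately crude: the sizes, the 8 MiB temp file, and the class name are arbitrary choices, the file will sit in the OS cache after the first pass, and a serious measurement would use real files and a harness such as JMH.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

final class BufferSizeBenchmark {
    /** Hashes the file once per buffer size and returns elapsed nanoseconds per size. */
    static Map<Integer, Long> run(Path file, int[] sizes)
            throws IOException, NoSuchAlgorithmException {
        Map<Integer, Long> nanosBySize = new LinkedHashMap<>();
        for (int size : sizes) {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] buffer = new byte[size];
            long start = System.nanoTime();
            try (InputStream in = Files.newInputStream(file)) {
                int read;
                while ((read = in.read(buffer)) != -1) {
                    md.update(buffer, 0, read);
                }
            }
            md.digest();
            nanosBySize.put(size, System.nanoTime() - start);
        }
        return nanosBySize;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("bench", ".bin");
        Files.write(file, new byte[8 * 1024 * 1024]);
        int[] sizes = {1024, 4096, 8192, 16384, 65536};
        run(file, sizes); // warm-up pass so JIT compilation does not skew the first size
        run(file, sizes).forEach((size, nanos) ->
                System.out.printf("%6d bytes: %d us%n", size, nanos / 1_000));
        Files.delete(file);
    }
}
```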
#5
4
Reading files using Java NIO's FileChannel and MappedByteBuffer will most likely result in a solution that will be much faster than any solution involving FileInputStream. Basically, memory-map large files, and use direct buffers for small ones.
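For reference, a memory-mapped version might look like the sketch below. The 256 MiB chunk size and the class name are arbitrary assumptions; whether this actually beats a plain stream depends on file sizes and the platform, so it is worth measuring before committing to it.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class MappedFileHasher {
    // A single mapping cannot exceed Integer.MAX_VALUE bytes, so large files are mapped in chunks.
    private static final long CHUNK = 256L * 1024 * 1024;

    static byte[] sha1(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long position = 0; position < size; position += CHUNK) {
                long length = Math.min(CHUNK, size - position);
                MappedByteBuffer chunk =
                        channel.map(FileChannel.MapMode.READ_ONLY, position, length);
                md.update(chunk); // update(ByteBuffer) consumes all remaining bytes of the chunk
            }
        }
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("mapped-demo", ".bin");
        Files.write(file, new byte[1_000_000]);
        System.out.println(sha1(file).length + " digest bytes"); // prints "20 digest bytes"
    }
}
```

Note that a mapping stays valid until the buffer is garbage collected, which can keep the file locked on some platforms (notably Windows) for a while after hashing.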
#6
3
You could use the BufferedStreams/readers and then use their buffer sizes.
I believe the BufferedXStreams are using 8192 as the buffer size, but like Ovidiu said, you should probably run a test on a whole bunch of options. It's really going to depend on the file system and disk configurations as to what the best sizes are.
#7
1
As already mentioned in other answers, use BufferedInputStreams.
After that, I guess the buffer size does not really matter. Either the program is I/O bound, and growing the buffer size beyond the BufferedInputStream (BIS) default will not make any big impact on performance.
Or the program is CPU bound inside MessageDigest.update(), and the majority of the time is not spent in the application code, so tweaking it will not help.
(Hmm... with multiple cores, threads might help.)
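If the work does turn out to be CPU bound, one way to try threads is a fixed pool where each task hashes one file with its own MessageDigest and buffer (MessageDigest instances are not thread-safe). This is only a sketch; the class name, thread count, and buffer size are assumptions, and on a single spinning disk parallel reads may hurt rather than help.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelHasher {
    static byte[] sha1(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // one instance per task
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return md.digest();
    }

    static Map<Path, byte[]> hashAll(List<Path> files, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (Path file : files) {
                Callable<byte[]> task = () -> sha1(file);
                futures.add(pool.submit(task));
            }
            Map<Path, byte[]> digests = new LinkedHashMap<>();
            for (int i = 0; i < files.size(); i++) {
                digests.put(files.get(i), futures.get(i).get()); // rethrows task failures
            }
            return digests;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Path> files = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Path file = Files.createTempFile("parallel-demo", ".bin");
            Files.write(file, new byte[100_000 * (i + 1)]);
            files.add(file);
        }
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println(hashAll(files, threads).size() + " files hashed"); // prints "4 files hashed"
    }
}
```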
#8
1
In BufferedInputStream's source you will find: private static int DEFAULT_BUFFER_SIZE = 8192;
So it's okay for you to use that default value.
But if you can figure out some more information, you will get more valuable answers.
For example, your ADSL connection may prefer a buffer of 1454 bytes because of the TCP/IP payload size. For disks, you may use a value that matches your disk's block size.
#9
0
Make the buffer big enough for most of the files to be read in one shot. Be sure to reuse the same buffer and the same MessageDigest for reading different files.
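Reuse could look something like the sketch below, where one object owns the buffer and the MessageDigest and is called once per file. The class name is made up, and an instance like this must not be shared between threads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ReusingHasher {
    private final MessageDigest md;
    private final byte[] buffer;

    ReusingHasher(String algorithm, int bufferSize) throws NoSuchAlgorithmException {
        this.md = MessageDigest.getInstance(algorithm);
        this.buffer = new byte[bufferSize];
    }

    /** Hashes one stream; the same buffer and digest serve every call. */
    byte[] hash(InputStream in) throws IOException {
        md.reset(); // drop partial state if an earlier call failed midway
        int read;
        while ((read = in.read(buffer)) != -1) {
            md.update(buffer, 0, read);
        }
        return md.digest(); // digest() also resets the instance for the next file
    }

    public static void main(String[] args) throws Exception {
        ReusingHasher hasher = new ReusingHasher("SHA-1", 64 * 1024);
        byte[] first = hasher.hash(new ByteArrayInputStream(new byte[10]));
        byte[] second = hasher.hash(new ByteArrayInputStream(new byte[10]));
        System.out.println(java.util.Arrays.equals(first, second)); // prints true
    }
}
```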
Unrelated to the question: read Sun's code conventions, especially spacing around parens and usage of redundant curly braces. Avoid the assignment operator = in a while or if statement.
#10
0
1024 is appropriate for a wide variety of circumstances, although in practice you may see better performance with a larger or smaller buffer size.
This would depend on a number of factors including file system block size and CPU hardware.
It is also common to choose a power of 2 for the buffer size, since most underlying hardware is structured with file block and cache sizes that are a power of 2. The Buffered classes allow you to specify the buffer size in the constructor. If none is provided, they use a default value, which is a power of 2 in most JVMs.
Regardless of which buffer size you choose, the biggest performance increase you will see is moving from nonbuffered to buffered file access. Adjusting the buffer size may improve performance slightly, but unless you are using an extremely small or extremely large buffer size, it is unlikely to have a significant impact.
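To see why buffering matters more than the exact size, the sketch below counts how many read calls reach the underlying stream when a caller reads one byte at a time. ReadCallCounter is a helper invented for this example, and an in-memory stream stands in for a file; with a real FileInputStream each of those calls would be a system call.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter extends FilterInputStream {
    long calls; // number of read requests that reached the wrapped stream

    ReadCallCounter(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        calls++;
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        calls++;
        return super.read(b, off, len);
    }

    /** Reads `data` byte by byte; a bufferSize of 0 means no BufferedInputStream in between. */
    static long underlyingCalls(byte[] data, int bufferSize) throws IOException {
        ReadCallCounter counter = new ReadCallCounter(new ByteArrayInputStream(data));
        InputStream in = bufferSize > 0 ? new BufferedInputStream(counter, bufferSize) : counter;
        while (in.read() != -1) {
            // consume everything one byte at a time
        }
        return counter.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        System.out.println("unbuffered:    " + underlyingCalls(data, 0));
        System.out.println("buffered 1024: " + underlyingCalls(data, 1024));
        System.out.println("buffered 8192: " + underlyingCalls(data, 8192));
    }
}
```

The unbuffered run makes one underlying call per byte, while both buffered runs make only a handful, so going from 1024 to 8192 changes far less than adding the buffer in the first place.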