I've written an application in C# that moves jpgs from one set of directories to another set of directories concurrently (one thread per fixed subdirectory). The code looks something like this:
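using System.IO;

string destination = "";                        // destination directory (elided in the question)
DirectoryInfo dir = new DirectoryInfo("");      // source root (elided in the question)
DirectoryInfo[] subDirs = dir.GetDirectories(); // GetDirectories() returns an array
foreach (DirectoryInfo d in subDirs)
{
    // List the files of the current subdirectory, not of the array itself.
    FileInfo[] files = d.GetFiles();
    foreach (FileInfo f in files)
    {
        // MoveTo expects the full target path, including the file name.
        f.MoveTo(Path.Combine(destination, f.Name));
    }
}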
However, the performance of the application is horrendous - tons of page faults/sec. The number of files in each subdirectory can get quite large, so I think a big performance penalty comes from context switching: the process can't keep all the different file arrays in RAM at the same time, so it goes to disk nearly every time.
There are two different solutions that I can think of. The first is rewriting this in C or C++, and the second is using multiple processes instead of multithreading.
Edit: The files are named based on a time stamp, and the directory each file is moved to is based on that name. So the destination directory corresponds to the hour the file was created; 3-27-2009/10, for instance.
We are creating a background worker per directory for threading.
Any suggestions?
8 Answers
#1
Reconsidered answer
I've been rethinking my original answer below. I still suspect that using fewer threads would probably be a good idea, but as you're just moving files, it shouldn't actually be that IO intensive. It's possible that just listing the files is taking a lot of disk work.
However, I doubt that you're really running out of memory for the files. How much memory have you got? How much memory is the process taking up? How many threads are you using, and how many cores do you have? (Using significantly more threads than you have cores is a bad idea, IMO.)
I suggest the following plan of attack:
- Work out where the bottlenecks actually are. Try fetching the list of files without moving them. See how hard the disk is hit, and how long it takes.
- Experiment with different numbers of threads, with a queue of directories still to process.
- Keep an eye on the memory use and garbage collections. The Windows performance counters for the CLR are good for this.
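For the first step in the plan, a rough sketch of timing the listing on its own might look like this. `ListingTimer`, `CountFiles` and `TimeListing` are names made up for illustration, and the sketch assumes the layout from the question (a root directory with one level of subdirectories); nothing is moved.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class ListingTimer
{
    // Lists the files in each immediate subdirectory of root without moving
    // anything, and returns how many files were seen.
    public static int CountFiles(string root)
    {
        int count = 0;
        foreach (string subDir in Directory.GetDirectories(root))
        {
            count += Directory.GetFiles(subDir).Length;
        }
        return count;
    }

    // Times the listing step alone so it can be compared with a full run.
    public static TimeSpan TimeListing(string root, out int fileCount)
    {
        Stopwatch watch = Stopwatch.StartNew();
        fileCount = CountFiles(root);
        watch.Stop();
        return watch.Elapsed;
    }
}
```

If the elapsed time here is already close to that of a full run, the moves themselves are probably not where the time goes.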
Original answer
Rewriting in C or C++ wouldn't help. Using multiple processes wouldn't help. What you're doing is akin to giving a single processor a hundred threads - except you're doing it with the disk instead.
It makes sense to parallelise tasks which use IO if there's also a fair amount of computation involved, but if it's already disk bound, asking the disk to work with lots of files at the same time is only going to make things worse.
You may be interested in a benchmark (description and initial results) I've recently been running, testing "encryption" of individual lines of a file. When the level of "encryption" is low (i.e. it's hardly doing any CPU work) the best results are always with a single thread.
#2
Rule of thumb: don't parallelize operations with serial dependencies. In this case your hard drive is the bottleneck, and too many threads are just going to make performance worse.
If you are going to use threads, try to limit their number to the resources you have available (cores and hard disks), not to the number of jobs you have pending (directories to copy).
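One way to cap the thread count independently of the number of directories is a shared queue drained by a fixed set of workers. This is only a sketch: `BoundedMover` and `ProcessAll` are hypothetical names, and the per-directory work is passed in as a delegate, so nothing is assumed here about how the files are actually moved.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

static class BoundedMover
{
    // Runs processDirectory once per directory, using at most workerCount threads.
    // An exception thrown by processDirectory is not handled here.
    public static void ProcessAll(IEnumerable<string> directories, int workerCount,
                                  Action<string> processDirectory)
    {
        if (workerCount < 1) throw new ArgumentOutOfRangeException("workerCount");

        // Directories still to process; each worker pulls the next one when it is free.
        var pending = new ConcurrentQueue<string>(directories);
        var workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                string dir;
                while (pending.TryDequeue(out dir))
                {
                    processDirectory(dir);
                }
            });
            workers[i].Start();
        }
        // Wait until the queue has been drained by all workers.
        foreach (Thread worker in workers)
        {
            worker.Join();
        }
    }
}
```

Starting with `workerCount` set to 1 and raising it only while measured throughput improves keeps the experiment honest; on a single spindle, 1 may well turn out to be the best value.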
#3
If you've got a block of work that is dependent on a system bottleneck, in this case disk IO, you would be better off not using multiple threads or processes. All that you will end up doing is generating a lot of extra CPU and memory activity while waiting for the disk. You would probably find the performance of your app improved if you used a single thread to do your moves.
#4
It seems you are moving a directory; surely just renaming/moving the directory would be sufficient. If the source and destination are on the same hard disk, it would be instant.
Also, capturing all the file info for every file is unnecessary; the name of the file would suffice.
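As a minimal sketch of the names-only idea: `Directory.GetFiles` returns plain path strings, and `File.Move` works on those directly, so no `FileInfo` objects are needed. `NameOnlyMover` and `MoveJpgs` are illustrative names, and the `*.jpg` pattern is an assumption based on the question.

```csharp
using System.IO;

static class NameOnlyMover
{
    // Moves every *.jpg in sourceDir into destinationDir using only path strings.
    // Returns the number of files moved.
    public static int MoveJpgs(string sourceDir, string destinationDir)
    {
        // Does nothing if the destination directory already exists.
        Directory.CreateDirectory(destinationDir);

        int moved = 0;
        foreach (string path in Directory.GetFiles(sourceDir, "*.jpg"))
        {
            string target = Path.Combine(destinationDir, Path.GetFileName(path));
            // Throws IOException if a file with the same name already exists at the target.
            File.Move(path, target);
            moved++;
        }
        return moved;
    }
}
```

This runs on a single thread on purpose. When source and destination are on the same volume, each move is a rename rather than a copy of the file contents.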
#5
The performance problem comes from the hard drive; there is no point in redoing everything in C/C++ or in using multiple processes.
#6
Are you looking at the page-fault count and inferring memory pressure from that? You might well find that the underlying Win32/OS file copy is using mapped files/page faults to do its work, and the faults are not a sign of a problem anyway. Much of Windows' own file handling is done via page faults (e.g. 'loading' executable code) - they're not a bad thing per se.
If you are suffering from memory pressure, then I would surmise that it's more likely to be caused by creating a huge number of threads (which are very expensive), rather than by the file copying.
Don't change anything without profiling, and if you profile and find the time is spent in framework methods which are merely wrappers on Win32 functions (download the framework source and have a look at how those methods work), then don't waste time on C++.
#7
If GetFiles() is indeed returning a large set of data, you could write an enumerator, as in:
IEnumerable<string> GetFiles();
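On .NET 4 or later you may not need to write the enumerator yourself: `Directory.EnumerateFiles` already yields paths lazily instead of building the whole array up front. The sketch below assumes that runtime; `LazyListing` and `TakeBatch` are made-up names. It pulls a bounded batch of paths so that only that many names are held in memory at once.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyListing
{
    // Streams file paths one at a time instead of materialising a full array.
    public static IEnumerable<string> GetFiles(string directory)
    {
        return Directory.EnumerateFiles(directory);
    }

    // Returns at most batchSize paths; enumeration stops as soon as the batch is full.
    public static List<string> TakeBatch(string directory, int batchSize)
    {
        return GetFiles(directory).Take(batchSize).ToList();
    }
}
```

A caller could move one batch, then ask for the next, so the directory is never modified while an enumeration over it is still open.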
#8
So, you're moving files, one at a time, from one subfolder to another subfolder? Wouldn't you be causing lots of disk seeks as the drive head moves back and forth? You might get better performance from reading the files into memory (at least in batches if not all at once), writing them to disk, then deleting the originals from disk.
And if you're doing multiple sets of folders in separate threads, then you're moving the disk head around even more. This is one case where multiple threads aren't doing you a favor (although you might get some benefit if you have a RAID or SAN, etc).
If you were processing the files in some way, then multithreading could help if different CPUs could calculate on multiple files at once. But you can't get four CPUs to move one disk head to four different locations at once.