A product that I am working on collects several thousand readings a day and stores them as 64k binary files on an NTFS partition (Windows XP). After a year in production there are over 300,000 files in a single directory, and the number keeps growing. This has made accessing the parent/ancestor directories from Windows Explorer very time-consuming.
I have tried turning off the indexing service but that made no difference. I have also contemplated moving the file content into a database/zip files/tarballs but it is beneficial for us to access the files individually; basically, the files are still needed for research purposes and the researchers are not willing to deal with anything else.
Is there a way to optimize NTFS or Windows so that it can work with all these small files?
14 Answers
#1
25
NTFS performance severely degrades after 10,000 files in a directory. What you do is create an additional level in the directory hierarchy, with each subdirectory having 10,000 files.
For what it's worth, this is the approach that the SVN folks took in version 1.5. They used 1,000 files as the default threshold.
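A minimal sketch of this bucketing, assuming the readings carry a sequential number (the function name, folder names, and zero-padding are illustrative, not from the answer):

```python
import os

def bucketed_path(index, filename, root="readings", per_dir=10000):
    # Put file number `index` into a subdirectory that holds at most
    # `per_dir` files: readings/0000, readings/0001, readings/0002, ...
    bucket = index // per_dir
    return os.path.join(root, f"{bucket:04d}", filename)
```

With 10,000 files per bucket, a year's worth of 300,000 files needs only 30 subdirectories.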
#2
28
NTFS actually will perform fine with many more than 10,000 files in a directory, as long as you tell it to stop creating alternative file names compatible with 16-bit Windows platforms. By default, NTFS automatically creates an '8 dot 3' file name for every file that is created. This becomes a problem when there are many files in a directory, because Windows looks at the files in the directory to make sure the name being created isn't already in use. You can disable '8 dot 3' naming by setting the NtfsDisable8dot3NameCreation registry value to 1. The value is found in the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\FileSystem registry path. It is safe to make this change, as '8 dot 3' names are only required by programs written for very old versions of Windows.
A reboot is required before this setting will take effect.
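For reference, the same change expressed as a .reg file you can import (the key path and value come straight from the answer; apply at your own risk and reboot afterwards):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"NtfsDisable8dot3NameCreation"=dword:00000001
```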
#3
8
The performance issue is being caused by the huge number of files in a single directory: once you eliminate that, you should be fine. This isn't an NTFS-specific problem: in fact, it's commonly encountered with user home/mail files on large UNIX systems.
One obvious way to resolve this issue is to move the files into folders whose names are based on the file name. Assuming all your files have file names of similar length, e.g. ABCDEFGHI.db, ABCEFGHIJ.db, etc., create a directory structure like this:
ABC\
DEF\
ABCDEFGHI.db
EFG\
ABCEFGHIJ.db
Using this structure, you can quickly locate a file based on its name. If the file names have variable lengths, pick a maximum length, and prepend zeroes (or any other character) in order to determine the directory the file belongs in.
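A sketch of that mapping in Python (the helper name, the two levels, and the three-character width are assumptions chosen to match the example tree):

```python
import os

def prefix_path(filename, root=".", levels=2, width=3):
    # e.g. ABCDEFGHI.db -> ./ABC/DEF/ABCDEFGHI.db
    stem = os.path.splitext(filename)[0]
    parts = [stem[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```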
#4
5
I have seen vast improvements in the past from splitting files up into a nested hierarchy of directories by, e.g., the first and then second letter of the filename; then each directory does not contain an excessive number of files. Manipulating the whole database is still slow, however.
#5
4
If you can calculate the names of files, you might be able to sort them into folders by date, so that each folder only has files for a particular date. You might also want to create month and year hierarchies.
Also, could you move files older than say, a year, to a different (but still accessible) location?
Finally (and again, this requires you to be able to calculate names), you'll find that directly accessing a file is much faster than trying to open it via Explorer. For example, running
notepad.exe "P:\ath\to\your\filen.ame"
from the command line should actually be pretty quick, assuming you know the path of the file you need without having to get a directory listing.
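The date-based layout suggested above could be sketched like this (the helper name and folder layout are assumptions for illustration):

```python
import datetime
import os

def dated_path(ts, filename, root="readings"):
    # year/month/day hierarchy derived from the reading's timestamp,
    # so each leaf folder holds only one day's worth of files
    return os.path.join(root, f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}", filename)
```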
#6
3
Having hundreds of thousands of files in a single directory will indeed cripple NTFS, and there is not really much you can do about that. You should reconsider storing the data in a more practical format, like one big tarball or in a database.
If you really need a separate file for each reading, you should sort them into several subdirectories instead of having all of them in the same directory. You can do this by creating a hierarchy of directories and putting the files in different ones depending on the file name. This way you can still store and load your files knowing just the file name.
The method we use is to take the last few letters of the file name, reverse them, and create one-letter directories from that. Consider the following files, for example:
1.xml
24.xml
12331.xml
2304252.xml
you can sort them into directories like so:
data/1.xml
data/24.xml
data/1/3/3/12331.xml
data/2/5/2/4/0/2304252.xml
This scheme will ensure that you will never have more than 100 files in each directory.
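A Python rendering of that scheme, derived from the example paths above (the helper name is mine): all but the last two characters of the name, reversed, become one directory level per character, so at most the 100 files sharing those leading characters land in any one leaf.

```python
import os

def reversed_path(filename, root="data"):
    # e.g. 12331.xml -> data/1/3/3/12331.xml
    # Names of one or two characters stay directly under `root`.
    stem = os.path.splitext(filename)[0]
    dirs = list(stem[::-1][:-2])  # reversed name, last two chars dropped
    return os.path.join(root, *dirs, filename)
```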
#7
2
One common trick is to simply create a handful of subdirectories and divvy up the files.
For instance, Doxygen, an automated code documentation program which can produce tons of HTML pages, has an option for creating a two-level-deep directory hierarchy. The files are then evenly distributed across the bottom directories.
#8
2
Aside from placing the files in sub-directories…
Personally, I would develop an application that keeps the interface to that folder the same, i.e. all files are displayed as individual files. In the background, the application would actually take these files and combine them into larger files (and since the sizes are always 64k, getting the data you need should be relatively easy), to get rid of the mess you have.
So you can still make it easy for the researchers to access the files they want, while also having more control over how everything is structured.
#9
2
You could try using something like Solid File System.
This gives you a virtual file system that applications can mount as if it were a physical disk. Your application sees lots of small files, but just one file sits on your hard drive.
#10
2
I have run into this problem lots of times in the past. We tried storing by date, zipping files below the date level so you don't have lots of small files, etc. All of them were band-aids for the real problem of storing the data as lots of small files on NTFS.
You can go to ZFS or some other file system that handles small files better, but still stop and ask if you NEED to store the small files.
In our case we eventually went to a system where all of the small files for a certain date were appended in a TAR type of fashion, with simple delimiters to parse them. The disk files went from 1.2 million to under a few thousand. They actually loaded faster, because NTFS can't handle the small files very well and the drive was better able to cache a 1MB file anyway. In our case, the access and parse time to find the right part of the file was minimal compared to the actual storage and maintenance of stored files.
#11
1
If there are any meaningful, categorical aspects of the data, you could nest them in a directory tree. I believe the slowdown is due to the number of files in one directory, not the sheer number of files itself.
The most obvious, general grouping is by date, and gives you a three-tiered nesting structure (year, month, day) with a relatively safe bound on the number of files in each leaf directory (1-3k).
Even if you are able to improve the filesystem/file browser performance, it sounds like this is a problem you will run into again in another two or three years... just looking at a list of 0.3-1 million files is going to incur a cost, so it may be better in the long term to find ways to only look at smaller subsets of the files.
Using tools like 'find' (under Cygwin or MinGW) can make the presence of the subdirectory tree a non-issue when browsing files.
#12
1
Rename the folder each day with a time stamp.
If the application is saving the files into c:\Readings, then set up a scheduled task to rename c:\Readings at midnight and create a new empty folder.
Then you will get one folder for each day, each containing several thousand files.
You can extend the method further to group by month. For example, c:\Readings becomes c:\Archive\September\22.
You have to be careful with your timing to ensure you are not trying to rename the folder while the product is saving to it.
#13
0
Consider pushing them to another server that uses a filesystem friendlier to massive quantities of small files (Solaris w/ZFS for example)?
#14
0
To create a folder structure that will scale to a large unknown number of files, I like the following system:
Split the filename into fixed-length pieces, and then create nested folders for each piece except the last.
The advantage of this system is that the depth of the folder structure only grows as deep as the length of the filename. So if your files are automatically generated in a numeric sequence, the structure is only as deep as it needs to be.
12.jpg -> 12.jpg
123.jpg -> 12\123.jpg
123456.jpg -> 12\34\123456.jpg
This approach does mean that folders contain files and sub-folders, but I think it's a reasonable trade off.
And here's a beautiful PowerShell one-liner to get you going!
$s = '123456'
-join (( $s -replace '(..)(?!$)', '$1\' -replace '[^\\]*$','' ), $s )
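And, for comparison, a hypothetical Python equivalent of the same splitting rule (function name is mine; it reproduces the three examples above):

```python
def nested_path(filename, piece=2):
    # Split the stem into fixed-length pieces; every piece except the
    # last becomes a directory level, e.g. 123456.jpg -> 12\34\123456.jpg
    stem = filename.rsplit(".", 1)[0]
    pieces = [stem[i:i + piece] for i in range(0, len(stem), piece)]
    return "\\".join(pieces[:-1] + [filename])
```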