In terms of performance and efficiency, is it better to use lots of small files (by lots I mean as many as a few million) or a few (ten or so) huge (several-gigabyte) files? Let's just say I'm building a database (not entirely true, but all that matters is that it's going to be accessed a LOT).
I'm mainly concerned with read performance. My filesystem is currently ext3 on Linux (Ubuntu Server Edition if it matters), although I'm in a position where I can still switch, so comparisons between different filesystems would be fabulous. For technical reasons I can't use an actual DBMS for this (hence the question), so "just use MySQL" is not a good answer.
Thanks in advance, and let me know if I need to be more specific.
EDIT: I'm going to be storing lots of relatively small pieces of data, which is why using lots of small files would be easier for me. So if I went with using a few large files, I'd only be retrieving a few KB out of them at a time. I'd also be using an index, so that's not really a problem. Also, some of the data points to other pieces of data (it would point to the file in the lots-of-small-files case, and point to the data's location within the file in the large-files case).
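To make the edit concrete, here is a minimal sketch (in Python; the file name, keys, and offsets are hypothetical) of the few-large-files approach: an in-memory index maps each key to a (byte offset, length) pair inside one big file, so a lookup seeks straight to the record and reads only the few KB it needs. A record can "point" to another record simply by storing that record's key.

# Hypothetical index: key -> (byte offset, record length) in one big file.
index = {
    "record-0001": (0, 4096),
    "record-0002": (4096, 2048),
}

def read_record(f, key):
    """Fetch one small record from an already-open large file."""
    offset, length = index[key]
    f.seek(offset)          # jump straight to the record
    return f.read(length)   # read only the few KB we need

# Usage (assumes data.bin exists and matches the index):
# with open("data.bin", "rb") as f:
#     blob = read_record(f, "record-0001")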
5 Answers
#1
There are a lot of assumptions here but, for all intents and purposes, searching through a large file will be much quicker than searching through a bunch of small files.
Let's say you are looking for a string of text contained in a text file. Searching a 1 TB file will be much faster than opening 1,000,000 files of 1 MB each and searching through those.
Each file-open operation takes time. A large file only has to be opened once.
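As a rough illustration of that overhead (a toy measurement, not a rigorous benchmark; the paths and offsets would be whatever your data happens to be), compare opening many small files with seeking repeatedly inside one already-open file:

import time

def time_many_opens(paths):
    """Open, read, and close each small file in turn."""
    t0 = time.perf_counter()
    for p in paths:
        with open(p, "rb") as f:   # one open()/close() per file
            f.read()
    return time.perf_counter() - t0

def time_one_big_file(path, offsets, length):
    """Open the big file once, then seek+read for each record."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:    # a single open() for all reads
        for off in offsets:
            f.seek(off)
            f.read(length)
    return time.perf_counter() - t0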
And, in considering disk performance, a single file is much more likely to be stored contiguously than a large series of small files.
...Again, these are generalizations without knowing more about your specific application.
Enjoy,
Robert C. Cartaino
#2
It depends, really. Different filesystems are optimized in different ways, but in general, small files are packed efficiently. The advantage of having large files is that you don't have to open and close a lot of them; open and close are operations that take time. If you have a large file, you normally open and close it only once and use seek operations instead.
If you go for the lots-of-files solution, I suggest a structure like
b/a/bar
b/a/baz
f/o/foo
because you have limits on the number of files in a directory.
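A small helper along these lines might look like the following sketch (the two-character fan-out shown above is just one possible scheme, and it assumes names are at least two characters long), mapping a name like bar to b/a/bar:

import os

def fanout_path(root, name):
    """Place 'bar' at root/b/a/bar so no single directory grows too large."""
    sub = os.path.join(root, name[0], name[1])  # assumes len(name) >= 2
    os.makedirs(sub, exist_ok=True)             # create b/ and b/a/ on demand
    return os.path.join(sub, name)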
#3
The main issue here, IMO, is indexing. If you're going to search for information in a huge file without a good index, you'll have to scan the whole file for the correct information, which can take a long time. If you think you can build strong indexing mechanisms, then fine, go with the huge file.
I'd prefer to delegate this task to ext3, which should be rather good at it.
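For instance, if the records in the huge file were newline-delimited (an assumption for this sketch), the index could be built in one sequential scan that remembers each record's starting byte offset:

def build_offset_index(path):
    """One pass over the file, recording where each record starts."""
    offsets = []                    # record number -> byte offset
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)        # next record starts after this one
    return offsets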
Edit:
One thing to consider, according to the Wikipedia article on ext3, is that fragmentation does happen over time. So if you have a huge number of small files that take up a significant percentage of the file system, you will lose performance over time.
The article also validates the claim about the 32k files-per-directory limit (assuming a Wikipedia article can validate anything).
#4
I believe Ext3 has a limit of about 32000 files/subdirectories per directory. If you're going the millions of files route, you'll need to spread them throughout many directories. I don't know what that would do to performance.
My preference would be for the several large files. In fact, why have several at all, unless they're some kind of logically separate units? If you're still splitting it up just for the sake of splitting it, I say don't do that. Ext3 can handle very large files just fine.
#5
I work with a system that stores up to about 5 million files on an XFS file system under Linux and haven't had any performance problems. We only use the files for storing the data; we never do full scans of them. We have a database for searching, and one of the fields in a table contains a GUID which we use to retrieve the file. We use exactly two levels of directories as above, with the filenames being the GUID, though more levels could be used if the number of files got even larger. We chose this approach to avoid storing a few extra terabytes in the database that only needed to be stored/returned and never searched through, and it has worked well for us. Our files range from 1 KB to about 500 KB.
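A sketch of that workflow (the function names here are made up; the path scheme mirrors the fan-out from answer #2): generate a GUID, write the blob under a GUID-derived two-level path, and hand the GUID back for the database field:

import os
import uuid

def store_blob(root, data):
    """Write data under a fresh GUID; the caller stores the GUID in the DB."""
    guid = uuid.uuid4().hex
    sub = os.path.join(root, guid[0], guid[1])   # two directory levels
    os.makedirs(sub, exist_ok=True)
    with open(os.path.join(sub, guid), "wb") as f:
        f.write(data)
    return guid

def fetch_blob(root, guid):
    """Recompute the path from the GUID and read the file back."""
    with open(os.path.join(root, guid[0], guid[1], guid), "rb") as f:
        return f.read()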
We have also run the system on ext3, and it functioned fine, though I'm not sure we ever pushed it past about a million files. We'd probably need to go to a three-level directory scheme due to the maximum-files-per-directory limitations.