How to efficiently store hundreds of thousands of files?

Date: 2022-08-15 19:19:05

I'm working on a system that will need to store a lot of documents (PDFs, Word files, etc.). I'm using Solr/Lucene to search for relevant information extracted from those documents, but I also need a place to store the original files so that users can open or download them.

I was thinking about several possibilities:

  • file system - probably not a good idea for storing 1 million documents

  • SQL database - but I won't need most of its relational features, as I only need to store the binary document and its ID, so this might not be the fastest solution

  • NoSQL database - I don't have any experience with them, so I'm not sure whether they are any good either; there are also many of them, so I don't know which one to pick

The storage I'm looking for should be:

  • fast
  • scalable
  • open-source (not crucial but nice to have)

Can you recommend the best way of storing those files, in your opinion?

4 answers

#1


5  

A filesystem -- as the name suggests -- is designed and optimised to store large numbers of files in an efficient and scalable way.

#2


1  

You could follow Facebook's example, since it stores a lot of files (15 billion photos):

  • They initially started with an NFS share served by commercial storage appliances.

  • Then they moved to their own implementation, an HTTP file server called Haystack.

Here is a Facebook note if you want to learn more: http://www.facebook.com/note.php?note_id=76191543919

Regarding the NFS share: keep in mind that NFS shares usually limit the number of files in one folder for performance reasons. (This may seem a bit counterintuitive if you assume that all recent file systems use B-trees to store their structure.) So if you are using commercial NFS shares (like NetApp), you will likely need to keep files in multiple folders.

You can do that if you have any kind of ID for your files. Just divide its ASCII representation into groups of a few characters and make a folder for each group. For example, we use integers for IDs, so the file with ID 1234567891 is stored as storage/0012/3456/7891.
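A minimal sketch of that folder scheme in Python, assuming integer IDs zero-padded to 12 digits and split into groups of 4, which matches the storage/0012/3456/7891 example. The function name `sharded_path` and its `width` and `group` parameters are made up for illustration, not taken from the answer.

```python
from pathlib import Path


def sharded_path(file_id: int, root: str = "storage",
                 width: int = 12, group: int = 4) -> Path:
    """Map an integer ID to a nested path, e.g. 1234567891 -> storage/0012/3456/7891.

    Assumes width is a multiple of group (hypothetical defaults: 12 and 4).
    """
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    digits = str(file_id).zfill(width)  # left-pad with zeros to a fixed width
    if len(digits) > width:
        raise ValueError(f"file_id has more than {width} digits")
    # Cut the padded string into fixed-size groups:
    # "001234567891" -> ["0012", "3456", "7891"]
    parts = [digits[i:i + group] for i in range(0, width, group)]
    return Path(root, *parts)


print(sharded_path(1234567891).as_posix())  # storage/0012/3456/7891
```

With 4 digits per level, no folder ever holds more than 10,000 entries, and the location of a file can be computed from its ID alone without a lookup.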

Hope that helps.

#3


0  

In my opinion...

I would store the files compressed on disk (in the file system) and use a database to keep track of them.

Possibly use SQLite if tracking the files is the database's only job.
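One way this could look, as a rough sketch rather than a finished design: files are gzip-compressed on disk and an SQLite table maps each ID to its location. The table name `documents`, its columns, and the helper names `open_index`, `store`, and `load` are all assumptions made up for this example.

```python
import gzip
import sqlite3
from pathlib import Path


def open_index(db_path: str) -> sqlite3.Connection:
    """Open (or create) the SQLite index that tracks stored files."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, path TEXT NOT NULL)"
    )
    return conn


def store(conn: sqlite3.Connection, root: Path, name: str, data: bytes) -> int:
    """Compress data to disk under root and record it in the index."""
    # Insert first to obtain an ID, then derive the on-disk location from it.
    cur = conn.execute(
        "INSERT INTO documents (name, path) VALUES (?, '')", (name,)
    )
    doc_id = cur.lastrowid
    path = root / f"{doc_id}.gz"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(gzip.compress(data))
    conn.execute("UPDATE documents SET path = ? WHERE id = ?", (str(path), doc_id))
    conn.commit()
    return doc_id


def load(conn: sqlite3.Connection, doc_id: int) -> bytes:
    """Look up a document by ID and return its decompressed bytes."""
    row = conn.execute(
        "SELECT path FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    if row is None:
        raise KeyError(doc_id)
    return gzip.decompress(Path(row[0]).read_bytes())
```

Calling `store(conn, Path("storage"), "report.pdf", data)` returns the new ID, and `load(conn, doc_id)` gives back the original bytes. The flat `root / "<id>.gz"` layout is only for brevity; with many files it could be swapped for a nested folder scheme.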

#4


0  

File system: looking at the big picture, a DBMS itself stores its data on the file system. The file system is dedicated to storing files, so you benefit directly from its optimizations (as LukeH mentioned).
