Storing large files/binary data in a MySQL database: when is it OK?

Date: 2022-02-24 16:39:56

OK, I have searched for this and read several points of view about storing binary data in a [MySQL] database. Generally I consider this a bad idea and try to avoid it, favouring traditional file transfers and storing only a reference to the file in the database.

However, I am working on a project which requires database synchronisation with a remote/cloud database, not just for files, but also for settings and other user content. For this, and other reasons, I felt this might be an appropriate situation for binary storage in a database.

I have written a general system for the database sync which works well using Reflection and XML. I have also (against my instincts) integrated the file storage into this system. Again, it works well: I chop files into 64 KB BLOBs and store them in a table, with a file_id reference (linked to a separate table which contains metadata such as file name, size and MIME type).

This enables me to send bits and pieces as and when a connection is available, and also allows me to limit each request size to keep things running smoothly.
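
A minimal sketch of the chunking scheme described above, for readers who want something concrete. It is not the asker's actual code: the table name `file_chunks`, the `seq` column and the helper names are assumptions made for illustration, and SQLite's in-memory database stands in for MySQL so the example runs without a server.

```python
import sqlite3

CHUNK_SIZE = 64 * 1024  # 64 KB per BLOB row, as in the question


def open_demo_db():
    """Create an in-memory database with a hypothetical chunk table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE file_chunks ("
        " file_id INTEGER NOT NULL,"
        " seq INTEGER NOT NULL,"  # position of the chunk within the file
        " data BLOB NOT NULL,"
        " PRIMARY KEY (file_id, seq))"
    )
    return conn


def store_chunks(conn, file_id, data):
    """Split data into CHUNK_SIZE pieces and insert one row per piece."""
    rows = [
        (file_id, seq, data[offset:offset + CHUNK_SIZE])
        for seq, offset in enumerate(range(0, len(data), CHUNK_SIZE))
    ]
    conn.executemany(
        "INSERT INTO file_chunks (file_id, seq, data) VALUES (?, ?, ?)", rows
    )
    return len(rows)


def load_file(conn, file_id):
    """Reassemble a file by reading its chunks back in sequence order."""
    cursor = conn.execute(
        "SELECT data FROM file_chunks WHERE file_id = ? ORDER BY seq",
        (file_id,),
    )
    return b"".join(row[0] for row in cursor)
```

Because every row is keyed by `(file_id, seq)`, a sync job can send any subset of rows whenever a connection is available and the receiver can still reassemble the file in order.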

So far I have not found any issues with this, and have successfully imported and transferred over 1 GB of data in both directions (across about 10-15 files / 16,000 rows). But I worry about its scalability: will it slow down once there is 20 GB+ of data in there, or can MySQL handle it provided my queries are well structured?
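
As a rough sanity check on those numbers, the row count follows directly from the chunk size. The sketch below assumes 64 KB means 65,536 bytes and 1 GB means 2^30 bytes; the helper name `rows_needed` is made up for illustration.

```python
CHUNK_SIZE = 64 * 1024  # bytes per BLOB row (assumed: 64 KB = 65,536 bytes)
GIB = 1024 ** 3         # assumed: 1 GB = 2**30 bytes


def rows_needed(total_bytes, chunk_size=CHUNK_SIZE):
    """Number of chunk rows needed, rounding a final partial chunk up."""
    return -(-total_bytes // chunk_size)


print(rows_needed(1 * GIB))   # 16384
print(rows_needed(20 * GIB))  # 327680
```

So about 1 GB lines up with the roughly 16,000 rows mentioned above, and 20 GB would be a few hundred thousand rows. Each file rounds its last chunk up separately, so real counts will be slightly higher.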

Another reason for my decision to store the data in the database was that I figured I could simply add another HDD/storage device to MySQL if space ran low, in the hope of efficient scaling/replication/etc.

I would very much appreciate any views or comments on whether this is a good or bad approach. Have I missed any obvious problems that I'm likely to see once this is used in a production environment?

Edit: I forgot to mention that the file sizes could range from 1 KB to ~1 GB.

[Rough] Conclusion

Firstly: thanks very much to those who contributed a considered answer. Choosing the accepted answer here has been quite difficult, as each has something decent to offer.

In the end (despite my hopes), I have decided that a pure MySQL storage server is at best only an OK solution (I still can't help wondering why they bother including the BLOB types, though).

As the alternative, I am torn between @Nick Coons' file system approach and @tadman's suggestion of a hybrid using a lightweight key/value database engine such as LevelDB. Provided the practicalities of using LevelDB in this project are not an issue, this is most likely the approach I will work towards.

I have accepted tadman's answer on this basis; his answer was also most applicable and useful to my situation.

That being said, and for those who are interested: I have enjoyed quite a lot of success using only MySQL so far. I have tested a table storing over 15 GB of binary data without any noticeable negative side effects when inserting data into or retrieving data from large tables (with careful queries). However, I am certain this is still very inefficient and either of the alternative methods mentioned will be significantly better.

3 answers

#1


2  

I have to wonder why you're even bothering with a database at all, when the layer you've added on top to chunk, store, retrieve and reassemble would work just as well on a well-defined filesystem structure. MySQL wants all of its data on a single volume, so it's not a case of adding another drive whenever you feel like it, and replication of large amounts of binary data is going to be cripplingly slow as the binary logs will end up duplicating the amount of data you need to store.

The simplest approach is often the best one. Storing this in the filesystem directly is probably the best way to do it. If you need to keep an index of what's stored where, maybe you'd use a database like MySQL, but there are many ways to accomplish this same task. The more low-tech, the better. For example, don't rule out SQLite, because an embedded database performs very well under light read and write load, and has the advantage of being "just a file" when it comes to backing up and restoring.

That being said, what you're doing sounds suspiciously similar to LevelDB, so before you commit to your approach, you'd have to see how it's significantly different than a key-value document store of that variety.

#2


3  

Short Answer:

I'm not sure there's a hard-and-fast way to answer this. You mentioned files ranging from 1 KB to 1 GB. I wouldn't store binary data in a DB if it's going to be anywhere near 1 KB, let alone 1 GB. I may store a few bytes of binary data in a DB if it's incidental, but any large amount of data, especially data that doesn't need to be searched, should be stored in the filesystem.

When you store data in a DB, you're storing it on a filesystem anyway, you've just added another layer (the DB) to the mix. There's a cost to this layer, so there ought to be a benefit to make up the difference. If you're storing the data so that you can search based on it or join it to other data, then this makes sense. But file data, binary or not, is typically not used in that way.

Example Implementation:

There are better ways to distribute file data than putting it into a DB, such as a distributed filesystem. Look into GlusterFS and MooseFS, both of which will scale by simply adding additional hard drives, whereas MySQL will not.

Typically, I'll store file data in the filesystem using an SHA1 hash of the data as the name of the file. If the hash is 98a75af529f07b1ef7be7400f51344b9f07b1ef7, then I'll store it in this directory structure:

./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7

That is, a top-level directory named after the first two characters, a second-level directory named after the next two characters, and finally the file itself, named with the full hash. In this way, I can literally have billions of files without having so many in a single directory that the system is too slow to function.
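
The layout above could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the helper names `shard_path` and `store_blob` are hypothetical, and the whole file is assumed to fit in memory.

```python
import hashlib
from pathlib import Path


def shard_path(root, digest):
    """Map a hex digest to root/<first 2 chars>/<next 2 chars>/<digest>."""
    return Path(root) / digest[:2] / digest[2:4] / digest


def store_blob(root, data):
    """Write data under its SHA1-derived path and return the hex digest."""
    digest = hashlib.sha1(data).hexdigest()
    path = shard_path(root, digest)
    if not path.exists():  # identical content is only stored once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return digest
```

With two hex characters per level there are at most 256 top-level directories and 256 subdirectories in each, which keeps every directory small even with a very large number of files.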

Then I create a DB table with these columns to hold the metadata:

  • file_id, an auto_increment field
  • created, a field with a default value of current_timestamp
  • prev_id, more on this below
  • hash, the SHA1 hash on the filesystem
  • name, a textual name of the file (such as the original name that the file would have had on disk)

When I need a hierarchical directory structure, I would also create a directory table and add a dir_id to the list of columns above.

If I edit the file represented by ./98/a7/98a75af529f07b1ef7be7400f51344b9f07b1ef7, I don't actually change that file on disk. I create a new one (because the new file contents would be represented by a new SHA1 hash) and create a new entry in the files table where prev_id equals the file_id of the file I edited. In other words, I now have versioning.
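
A minimal sketch of that metadata table and the prev_id version chain. The column names come from the list above, but the column types, the table name `files` and the helpers `add_version` and `history` are assumptions. SQLite is used as a stand-in so the example runs without a MySQL server, which means the syntax differs slightly (MySQL would use AUTO_INCREMENT and a TIMESTAMP column).

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE files (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    created TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    prev_id INTEGER REFERENCES files(file_id),  -- previous version, if any
    hash    TEXT NOT NULL,                      -- SHA1 of the file contents
    name    TEXT NOT NULL
)
"""


def add_version(conn, name, digest, prev_id=None):
    """Insert a metadata row; pass prev_id when this replaces an older version."""
    cur = conn.execute(
        "INSERT INTO files (prev_id, hash, name) VALUES (?, ?, ?)",
        (prev_id, digest, name),
    )
    return cur.lastrowid


def history(conn, file_id):
    """Follow prev_id links from a version back to the original, newest first."""
    digests = []
    while file_id is not None:
        row = conn.execute(
            "SELECT hash, prev_id FROM files WHERE file_id = ?", (file_id,)
        ).fetchone()
        if row is None:
            break
        digests.append(row[0])
        file_id = row[1]
    return digests


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
v1 = add_version(conn, "report.txt", hashlib.sha1(b"draft").hexdigest())
v2 = add_version(conn, "report.txt", hashlib.sha1(b"final").hexdigest(), prev_id=v1)
print(history(conn, v2))  # hash of the edited file first, then the original
```

Nothing is ever overwritten: an edit only adds a new row (and a new file on disk), so older versions stay reachable through prev_id.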

If I need this to be available in a distributed fashion, I set up MySQL replication and then use GlusterFS to replicate the filesystem across multiple servers.

#3


2  

I think you will find a fair amount of debate on this as I did when I began looking into this. I tend to lean toward storing in the file system and maintaining a reference. However, that is not to say that there is never a time to store binary data in a database.

I would say that simply keeping things in sync is not, in itself, a reason to store binary data in a database. There certainly are ways to keep file systems in sync, so that as the database is kept in sync, so is the file system.

The bottom line is that there is a fair amount of debate on this topic and you have to go with what works for you. If what you have set up works, use it. Do performance and load testing to make sure it works. If it doesn't hold up, change it.
