What is the best practice for storing large amounts of text (in a database or as files?), and how should it be compressed?

Time: 2021-11-13 16:55:38

I'm building a web-app that handles internal emails and other frequent small-to-medium sized chunks of text between users and clients. What's the best method for storing this data? In a database (MySQL) or as thousands of individual files? What about compressing it (PHP's gzcompress() or MySQL's compression features)?

This will not be a public application, so the user load will be minimal (less than 20 users at a time). However, there will be a lot of communication going back-and-forth every day within the app, so I expect the amount of data to grow quite large as time goes by (which is why I'd like to compress it).

I'd like to keep the data in a database for ease of access and portability, but some of the threads I've seen on here regarding images have suggested using file storage. What do you think?

Thank you, Seth

Edit for clarification: I do not require any sort of searching of the text, which is why I would lean toward compressing it to save on space.

4 answers

#1


For images and documents that are already in a specific format (Excel, Word documents, PDF files, etc.) I prefer file storage. But for raw text I would probably rather use a database. It is easier to replicate across machines for failover, and you can do substring searches over the text. I don't know of a specific algorithm to use to compress it, but I still think a database is the better way to go, provided that what you have is text and only text. For any other document format I would prefer file storage.

And unless I am missing something I would use a CLOB instead of a BLOB, if it is only text.

#2


One of the main reasons for keeping the text in a database is to keep it consistent with the rest of the data that you are storing. It is easier to make backups, (re)deploy with predefined datasets, etc. Furthermore, it's easier to guarantee transactional integrity.

One benefit of storing text as files could be that it is easier to serve them using a webserver. If this is the only remaining benefit of using files, you could keep the text in the database and look into caching it as files on the webserver. That gives you much of the easy backup and transactional behavior of the database while still allowing some speedup for HTTP requests.

#3


I would choose a DB. You describe a scenario where you are going to store a large quantity of messages. You do not provide much information about the system, but I would guess that you will want to sort and group the messages and attach several other properties to them. It would be much easier, and probably faster, to keep each message together with its attributes in a DB instead of using file storage.

When it comes to compression I do not know which of the methods is most effective. You should probably try both before choosing.
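One way to "try before choosing" is to measure how well representative messages compress. The following is only a rough sketch: it uses Python's `zlib` module as a stand-in for PHP's `gzcompress()` (both produce zlib-format output), and the sample messages and the `compression_report` helper are made up for illustration. Real numbers depend entirely on your own data.

```python
import zlib


def compression_report(text: str, level: int = 6) -> dict:
    """Compress one message with zlib and report the sizes in bytes."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, level)
    # Sanity check: the round trip must give back the original bytes.
    assert zlib.decompress(packed) == raw
    return {
        "raw_bytes": len(raw),
        "compressed_bytes": len(packed),
        "ratio": len(packed) / len(raw) if raw else 1.0,
    }


# Hypothetical sample messages; replace them with real data from the app.
short_note = "Thanks, got it."
long_email = "Hello team, please find the weekly status update below. " * 40

for name, body in [("short_note", short_note), ("long_email", long_email)]:
    print(name, compression_report(body))
```

Very short messages can come out slightly larger than the original, because the zlib format adds a few bytes of header and checksum. It may be worth compressing only messages above some size threshold.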

#4


I wonder how big this "medium chunk" is. If the text is just written messages (so less than 10 KB each), then compression makes them even smaller and there wouldn't be a big impact on database growth. Having everything available with a single query, without having to fetch the file contents separately, also makes development and maintenance much easier.
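As a minimal sketch of the "everything in a single query" idea, the example below keeps a compressed message body next to its attributes in one table. It is not a recommendation of a specific schema: it uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere, and the table name `messages`, the column `body_z`, and the helper functions are all hypothetical.

```python
import sqlite3
import zlib


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical messages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " sender TEXT NOT NULL,"
        " body_z BLOB NOT NULL)"  # compressed bytes, so a binary column
    )
    return conn


def save_message(conn: sqlite3.Connection, sender: str, body: str) -> int:
    """Compress the body and store it together with its attributes."""
    packed = zlib.compress(body.encode("utf-8"))
    cur = conn.execute(
        "INSERT INTO messages (sender, body_z) VALUES (?, ?)",
        (sender, packed),
    )
    conn.commit()
    return cur.lastrowid


def load_message(conn: sqlite3.Connection, message_id: int) -> tuple:
    """Fetch attributes and body with one query, then decompress the body."""
    row = conn.execute(
        "SELECT sender, body_z FROM messages WHERE id = ?",
        (message_id,),
    ).fetchone()
    if row is None:
        raise KeyError(message_id)
    sender, packed = row
    return sender, zlib.decompress(packed).decode("utf-8")


conn = open_store()
message_id = save_message(conn, "seth", "Hello, here is the report you asked for.")
print(load_message(conn, message_id))
```

Note that compressed data is binary, so it belongs in a binary column rather than a text column, and the database can no longer search inside the body. That fits the question's stated requirement of not needing text search.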
