Versions I am running (basically latest of everything):
PHP: 5.3.1
MySQL: 5.1.41
Apache: 2.2.14
OS: CentOS (latest)
Here is the situation.
I have thousands of very important documents, ranging from customer contracts to voice signatures (recordings of customer authorisation for contracts), with file types including, but not limited to, jpg, gif, png, tiff, doc, docx, xls, wav, mp3, pdf, etc.
All of these documents are currently stored on several servers, including Windows 32-bit, CentOS and Mac machines, among others. Some files are also stored on employees' desktop computers and laptops, and some are still hard copies stored in hundreds of boxes and filing cabinets.
Now, because customers or lawyers could demand evidence of contracts at any time, my company has to be able to search for and locate the correct document(s) effectively. For this reason, ALL of these files have to be digitised (if they are not already) and organised in some sort of order for searching and accessing.
As the programmer, I have created a full Customer Relations Management tool that the whole company uses. This includes customer profile management, order and job tracking tools, job/sale creation and management modules, etc. At the moment, any file that is needed at a customer-profile level (driver's licence, credit authority, etc.) or at a job/sale level (contracts, voice signatures, etc.) can be uploaded to the server, where it sits in a parent/child hierarchy structure, just like Windows Explorer or any other typical file management model.
The structure looks like this:
drivers_license
|- DL_123.jpg
voice_signatures
|- VS_123.wav
|- VS_4567.wav
contracts
So the files are uploaded using PHP and Apache, and are stored in the file system of the OS. At the time of uploading, certain information about the file(s) is stored in a MySQL database. Some of the information stored is:
TABLE: FileUploads
FileID
CustomerID (the customer id that the file belongs to, they all have this.)
JobID/SaleID (the id of the job/sale associated, if any.)
FileSize
FileType
UploadedDateTime
UploadedBy
FilePath (the directory path the file is stored in.)
FileName (current file name of uploaded file, combination of CustomerID and JobID/SaleID if applicable.)
FileDescription
OriginalFileName (original name of the source file when uploaded, including extension.)
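For concreteness, the table described above might be declared roughly as follows. This is a hypothetical sketch only; the actual column types and constraints in my schema may differ:

```sql
-- Hypothetical DDL for the FileUploads table described above;
-- exact types, lengths and constraints are guesses, not my real schema.
CREATE TABLE FileUploads (
    FileID           INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    CustomerID       INT UNSIGNED NOT NULL,
    JobID            INT UNSIGNED NULL,          -- JobID/SaleID, if any
    FileSize         BIGINT UNSIGNED NOT NULL,
    FileType         VARCHAR(50)   NOT NULL,
    UploadedDateTime DATETIME      NOT NULL,
    UploadedBy       VARCHAR(100)  NOT NULL,
    FilePath         VARCHAR(255)  NOT NULL,
    FileName         VARCHAR(255)  NOT NULL,
    FileDescription  TEXT          NULL,
    OriginalFileName VARCHAR(255)  NOT NULL,
    INDEX idx_customer (CustomerID),
    INDEX idx_job (JobID)
) ENGINE=InnoDB;
```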
So as you can see, the file is linked to the database by its file name. When I want to provide a customer's files for download to a user, all I have to do is run "SELECT * FROM FileUploads WHERE CustomerID = 123 OR JobID = 2345;". This outputs all the file details I require, and with the FilePath and FileName I can provide the link for download.
http... server / FilePath / FileName
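To make that concrete, here is a minimal PHP 5.3-compatible sketch of how such a link could be built. The buildDownloadUrl helper and the PDO usage are illustrative, not my production code:

```php
<?php
// Hypothetical helper: build the download URL for one FileUploads row.
// Encodes each path segment so spaces etc. survive in the link.
function buildDownloadUrl($baseUrl, $row)
{
    $segments = array_filter(explode('/', $row['FilePath']), 'strlen');
    $path = implode('/', array_map('rawurlencode', $segments));
    return rtrim($baseUrl, '/') . '/' . $path . '/'
         . rawurlencode($row['FileName']);
}

// Illustrative usage (connection details are placeholders):
// $pdo  = new PDO('mysql:host=localhost;dbname=crm', 'user', 'pass');
// $stmt = $pdo->prepare(
//     'SELECT * FROM FileUploads WHERE CustomerID = ? OR JobID = ?');
// $stmt->execute(array(123, 2345));
// foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
//     echo buildDownloadUrl('http://server', $row), "\n";
// }
```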
There are a number of problems with this method:
- Storing files in this "database-unconscious" environment means data integrity is not maintained. If a record is deleted, the file may not be deleted as well, or vice versa.
- Files are strewn all over the place: on different servers, computers, etc.
- The file name is the ONLY thing matching the binary to the database, the customer profile and the customer records.
Etc., etc. There are so many reasons, some of which are described here: http://www.dreamwerx.net/site/article01 . There is also an interesting article here: sietch.net/ViewNewsItem.aspx?NewsItemID=124 .
SO, after much research, I have pretty much decided I am going to store ALL of these files in the database, as BLOB or LONGBLOB columns, but there are still many considerations before I do this.
I know that storing them in the database is a viable option; however, there are a number of methods of storing them. I also know storing them is one thing; correlating and accessing them in a manageable way is another thing entirely.
The article provided at this link: dreamwerx.net/site/article01 describes a way of splitting the uploaded binary files into 64 KB chunks, storing each chunk with the FileID, and then streaming the actual binary file to the client using headers. This is a really cool idea, since it relieves pressure on the server's memory; instead of loading an entire 100 MB file into RAM and then sending it to the client, it does it 64 KB at a time. I have tried this (and updated his scripts), and it was totally successful, within a very small frame of testing.
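As a rough illustration of that technique (this is not the dreamwerx code itself; the FileChunks table and helper function below are invented for the sketch, and a real upload would fread() the file 64 KB at a time rather than load it whole):

```php
<?php
// Sketch of the 64 KB chunking approach, PHP 5.3-style.
define('CHUNK_SIZE', 65536); // 64 KB

// Pure helper: split a binary string into fixed-size chunks.
function splitIntoChunks($data, $chunkSize = CHUNK_SIZE)
{
    return ($data === '') ? array() : str_split($data, $chunkSize);
}

// Storing: one row per chunk, keyed by FileID plus an ordering column.
// $pdo  = new PDO('mysql:host=localhost;dbname=crm', 'user', 'pass');
// $stmt = $pdo->prepare(
//     'INSERT INTO FileChunks (FileID, ChunkNo, Data) VALUES (?, ?, ?)');
// foreach (splitIntoChunks(file_get_contents($tmpName)) as $i => $chunk) {
//     $stmt->execute(array($fileId, $i, $chunk));
// }

// Serving: send headers, then echo chunks in order, so only ~64 KB
// of file data sits in memory at any one time.
// header('Content-Type: ' . $mimeType);
// header('Content-Length: ' . $fileSize);
// header('Content-Disposition: attachment; filename="' . $origName . '"');
// $stmt = $pdo->prepare(
//     'SELECT Data FROM FileChunks WHERE FileID = ? ORDER BY ChunkNo');
// $stmt->execute(array($fileId));
// while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
//     echo $row['Data'];
//     flush();
// }
```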
So, if you are in agreement that this method is a viable, stable and robust long-term option for storing moderately large files (1 KB to a couple of hundred megabytes), and large quantities of these files, let me know what other considerations or ideas you have.
Also, I am considering taking an existing "File Management" PHP script that provides an interface for managing files stored in the file system, and converting it to manage files stored in the database. If there is already any software out there that does this, please let me know.
I guess there are many questions I could ask, and all the information is up there ^^ so please, discuss all aspects of this and we can pass ideas back and forth and teach each other.
Cheers,
Quantico773
1 Answer
#1
36
I work on a large software system that has used both mechanisms for storing attachments and other content. The first iteration of the system stored all data in BLOBs in the DB. I cursed it at the time. As a programmer, I could write side scripts to immediately operate on the data and change it whenever I wanted to.
Advance about 10 years: I still manage the same software, but the architecture has changed and it is now written with filesystem pointers. I curse it now and wish it were back in the DB. With the added benefit of several years, and having worked with this application in much greater capacity in many more and much larger situations, I feel my opinion now is better educated. Promotion or system migration of the application requires extensive scripting and the copying of millions of files. On one occasion we changed the OS and all the file pointers had the wrong directory separator, or the server name where a file lived changed, and we had to write and schedule simple SQL update statements with the DBA on the weekend to fix it. Another problem is that the filesystem and DB records get out of sync; why is uncertain, but after thousands of days of operation, non-transactional systems (the filesystem and the DB don't share transactional contexts) sometimes simply drift out of sync. Sometimes files mysteriously go missing.
When all this was in the DB, migration or environment promotion was a matter of dumping and importing the DB. Row changes could be properly audited, everything stayed in sync, and logs could be replayed to a point in time if necessary. Sure, the DB gets big, but it is 2011 and this stuff is simply not a challenge for databases.
For what it is worth, we had some similar issues with large data buffers when streaming some data, but (a) we could pump the data through byte buffers with the Input/OutputStreams in JDBC, and (b) when using other tools, we wrote a stored procedure that would chunk the BLOB into a temp table and iteratively serve the chunks from the temp table. It works great.
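A rough MySQL-flavoured sketch of that chunk-into-a-temp-table idea (this is not the actual procedure; the FileBlobs table and every name here are invented for illustration):

```sql
-- Hypothetical: copy one BLOB into fixed-size rows in a temp table,
-- so the client reads modest-sized chunks instead of one huge value.
DELIMITER //
CREATE PROCEDURE serve_blob_chunks(IN p_file_id INT, IN p_chunk_size INT)
BEGIN
  DECLARE v_data LONGBLOB;
  DECLARE v_len  BIGINT;
  DECLARE v_pos  BIGINT DEFAULT 1;

  SELECT Data, LENGTH(Data) INTO v_data, v_len
    FROM FileBlobs WHERE FileID = p_file_id;

  DROP TEMPORARY TABLE IF EXISTS tmp_chunks;
  CREATE TEMPORARY TABLE tmp_chunks (
    ChunkNo INT NOT NULL,
    Chunk   MEDIUMBLOB NOT NULL
  );

  WHILE v_pos <= v_len DO
    INSERT INTO tmp_chunks (ChunkNo, Chunk)
    VALUES ((v_pos - 1) DIV p_chunk_size,
            SUBSTRING(v_data, v_pos, p_chunk_size));
    SET v_pos = v_pos + p_chunk_size;
  END WHILE;

  SELECT ChunkNo, Chunk FROM tmp_chunks ORDER BY ChunkNo;
END //
DELIMITER ;
```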
I don't care what the technical reasons for not putting this stuff in the DB are; it is so much easier to manage in a consolidated location. I could double or triple the hardware, or grid the DB, for just the time wasted by consultants and customers managing the disparate files over a short period.
Update: go easy on the commenters, they're just giving their opinion on the matter.