How to store PDF, Docx, xls, and other documents in SQL Server 2008

Time: 2022-11-22 16:56:20

I am developing a web application that lets users upload files such as images and documents. The files fall into two groups:

  1. binary files
  2. document files

I want to let users search the uploaded documents, specifically using full-text search. What data types should I use for these two kinds of files?

3 solutions

#1


2  

You can store the data in a binary column and let full-text search interpret the binary data and extract the textual information from formats such as .doc, .txt, .xls, .ppt, and .htm. The extracted text is indexed and becomes available for querying (make sure you use the CONTAINS keyword). Needless to say, full-text search has to be enabled. I am not sure how adding a full-text index will affect your system, in particular its size. You will also need to look at the execution plan to make sure the index is used at query time.
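As a rough illustration of the storage side, the sketch below keeps each upload's raw bytes next to its file extension. SQLite (from Python's standard library) stands in for SQL Server here purely so the example runs anywhere; the `documents` table, its column names, and the helper functions are all hypothetical. In SQL Server the `content` column would typically be `VARBINARY(MAX)`, and a full-text index on binary data uses an extension column like this one to decide which filter to apply to each row.

```python
import os
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE documents (
            id        INTEGER PRIMARY KEY,
            file_name TEXT NOT NULL,
            extension TEXT NOT NULL,  -- e.g. '.docx'; identifies the format of each row
            content   BLOB NOT NULL   -- raw file bytes
        )
        """
    )
    return conn


def save_document(conn, file_name, data):
    """Store the raw bytes together with the lower-cased file extension."""
    extension = os.path.splitext(file_name)[1].lower()
    cur = conn.execute(
        "INSERT INTO documents (file_name, extension, content) VALUES (?, ?, ?)",
        (file_name, extension, data),
    )
    return cur.lastrowid


def load_document(conn, doc_id):
    """Return (file_name, extension, content) for one stored document."""
    return conn.execute(
        "SELECT file_name, extension, content FROM documents WHERE id = ?",
        (doc_id,),
    ).fetchone()
```

Keeping the extension in its own column, rather than parsing it out of the file name at query time, is the design choice that matters here: it is the piece of metadata the indexing side needs.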

For more information, see:

http://technet.microsoft.com/en-us/library/ms142499(SQL.90).aspx

Pros: The main advantage of storing data in the database is that it makes the data "self-contained". Since all of the data is contained within the database, backing up the data, moving it from one database server to another, replicating the database, and so on are much easier.

You can also enable versioning of files, and storing them in the database makes things easier for load-balanced web farms.

Cons: these are discussed at https://dba.stackexchange.com/questions/3924/sql-server-2005-large-binary-storage. Still, storing the files in the database is what you have to do in order to search through them efficiently.

Alternatively, you could store keywords in the database and link them to the file on the file share.
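A minimal sketch of that keyword alternative, assuming a two-table layout: one table of file paths on the share and one table of keywords pointing back at them. SQLite is used only as a runnable stand-in for SQL Server, and every table, column, function name, and share path below is hypothetical.

```python
import sqlite3


def build_keyword_index(conn):
    """Create hypothetical tables: file paths plus the keywords attached to them."""
    conn.executescript(
        """
        CREATE TABLE files (
            id   INTEGER PRIMARY KEY,
            path TEXT NOT NULL UNIQUE
        );
        CREATE TABLE file_keywords (
            file_id INTEGER NOT NULL REFERENCES files(id),
            keyword TEXT NOT NULL,
            PRIMARY KEY (file_id, keyword)
        );
        CREATE INDEX idx_file_keywords_keyword ON file_keywords(keyword);
        """
    )


def register_file(conn, path, keywords):
    """Record where the file lives and which keywords describe it."""
    file_id = conn.execute("INSERT INTO files (path) VALUES (?)", (path,)).lastrowid
    conn.executemany(
        "INSERT OR IGNORE INTO file_keywords (file_id, keyword) VALUES (?, ?)",
        [(file_id, keyword.lower()) for keyword in keywords],
    )
    return file_id


def find_paths(conn, keyword):
    """Return the paths of all files tagged with the keyword (case-insensitive)."""
    rows = conn.execute(
        """
        SELECT f.path
        FROM files AS f
        JOIN file_keywords AS k ON k.file_id = f.id
        WHERE k.keyword = ?
        ORDER BY f.path
        """,
        (keyword.lower(),),
    )
    return [row[0] for row in rows]
```

The trade-off is that only the keywords someone chose to record are searchable; the file contents themselves never reach the database.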

Here is an article about using FILESTREAM with a database: http://blogs.msdn.com/b/manisblog/archive/2007/10/21/filestream-data-type-sql-server-2008.aspx

#2


0  

You first need to convert the PDF to text. There are tools for this sort of thing (e.g. PowerGREP). Then I'd recommend storing the text of the PDF files in a database. If you need full-text searching with logic such as "on the same line", you will need to store one record per line of text. If you just want to search for text anywhere in a file, you can change the structure of your SQL schema to match your needs.
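The "one record per line" idea can be sketched as follows, assuming the PDF has already been converted to plain text by some other tool. SQLite again stands in for SQL Server so the example is runnable, and the `doc_lines` table and both function names are hypothetical.

```python
import sqlite3


def store_lines(conn, doc_name, text):
    """Insert one row per line of extracted text, numbered from 1."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS doc_lines (
            doc_name  TEXT    NOT NULL,
            line_no   INTEGER NOT NULL,
            line_text TEXT    NOT NULL,
            PRIMARY KEY (doc_name, line_no)
        )
        """
    )
    conn.executemany(
        "INSERT INTO doc_lines (doc_name, line_no, line_text) VALUES (?, ?, ?)",
        [(doc_name, n, line) for n, line in enumerate(text.splitlines(), start=1)],
    )


def same_line_matches(conn, first, second):
    """Find lines containing both terms; returns (doc_name, line_no) pairs.

    LIKE wildcards inside the terms are not escaped in this sketch.
    """
    rows = conn.execute(
        """
        SELECT doc_name, line_no
        FROM doc_lines
        WHERE line_text LIKE ? AND line_text LIKE ?
        ORDER BY doc_name, line_no
        """,
        (f"%{first}%", f"%{second}%"),
    )
    return [tuple(row) for row in rows]
```

Because each row is a single line, "both terms on the same line" becomes an ordinary row filter, which is the reason for paying the cost of many small rows.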

For docx files, I would convert them to RTF and search them that way while stored in SQL.
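As an alternative to the RTF route described in this answer (this is a different technique, not the answer's own), plain text can be pulled straight out of a .docx file, because a .docx is a ZIP archive whose main body is stored as XML in `word/document.xml`. The sketch below uses only Python's standard library; `docx_to_text` and `make_minimal_docx` are hypothetical names, and the second helper builds a stripped-down stand-in archive for demonstration rather than a file Word could open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(data):
    """Return the text of a .docx given as bytes, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across <w:t> runs; join them back together.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_minimal_docx(paragraphs):
    """Build a tiny archive containing only word/document.xml (demo input only)."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()
```

The extracted text could then be stored in an ordinary text column and searched like any other string; headers, footers, and footnotes live in other parts of the archive and are ignored by this sketch.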

For images, Microsoft has a program called Microsoft OneNote that does OCR (optical character recognition), so you can search for text within images. It doesn't matter which tool you use, as long as it supports OCR.

Essentially, if you don't have a way to read the binary file directly, you need to convert it to text with some library first, and only then worry about searching.

#3


0  

A full-text index can be created on columns that use any of the following data types: CHAR, NCHAR, VARCHAR, NVARCHAR, TEXT, NTEXT, VARBINARY, VARBINARY(MAX), IMAGE, and XML.

In addition, to use full-text search you must create a full-text index on the table you want to run full-text queries against. A given SQL Server table or indexed view can have at most one full-text index.
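To make those rules concrete, here is a small, hedged Python helper that checks a column type against the list above and assembles a `CREATE FULLTEXT INDEX` statement as text. The function name and every table, column, index, and catalog name in the usage are hypothetical, and the helper only builds a string; it does not talk to a server. Insisting on a type column for binary columns is this sketch's own rule, reflecting that SQL Server needs the document type to choose a filter for binary data.

```python
# Base type names eligible for full-text indexing, per the list above.
FULLTEXT_TYPES = {
    "CHAR", "NCHAR", "VARCHAR", "NVARCHAR", "TEXT",
    "NTEXT", "VARBINARY", "IMAGE", "XML",
}
# Binary types whose rows need a companion column naming the document type.
BINARY_TYPES = {"VARBINARY", "IMAGE"}


def fulltext_index_sql(table, column, column_type, key_index, catalog, type_column=None):
    """Build a CREATE FULLTEXT INDEX statement for one column.

    Identifiers are interpolated as-is, so only pass trusted names.
    """
    # "VARBINARY (MAX)" and "nvarchar(200)" both reduce to their base type name.
    base_type = column_type.upper().replace(" ", "").split("(")[0]
    if base_type not in FULLTEXT_TYPES:
        raise ValueError(f"{column_type} cannot be full-text indexed")
    if base_type in BINARY_TYPES and not type_column:
        raise ValueError("binary columns need a type column such as a file extension")

    column_spec = f"{column} TYPE COLUMN {type_column}" if type_column else column
    return (
        f"CREATE FULLTEXT INDEX ON {table} ({column_spec}) "
        f"KEY INDEX {key_index} ON {catalog};"
    )
```

For example, `fulltext_index_sql("dbo.Documents", "Content", "VARBINARY (MAX)", "PK_Documents", "DocumentCatalog", type_column="FileExtension")` yields a statement that indexes the hypothetical `Content` column and reads each row's format from `FileExtension`. Since a table can hold only one full-text index, additional columns would have to be listed in that same statement rather than in a second one.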

These are two articles about it:

SQL SERVER - 2008 - Creating Full Text Catalog and Full Text Search

Using Full Text Search in SQL Server 2008
