I want to search a sentence (word combination of) in some table or view of DB. I dont want to use Fultext search property of DB. Is there any alternative efficient way?
我想在DB的某些表或视图中搜索一个句子(单词组合)。我不想使用DB的Fultext搜索属性。有没有其他有效的方式?
4 个解决方案
#1
Without the use of an index, a database has to perform a "full table scan". This is rather like you looking through a book one page at a time to find what you need.
如果不使用索引,数据库必须执行“全表扫描”。这就像你一次翻阅一本书,找到你需要的东西。
That being said, computers are a lot faster than humans. It really depends on how much load your system has. Using MySQL we successfully implemented a search system on a table of lead information. The nature of the problem was one that could not be solved by normal indexes (including full text). So we designed it to be powered using a full table scan.
话虽如此,计算机比人类快得多。这实际上取决于您的系统有多少负载。使用MySQL,我们成功地在领导信息表上实现了一个搜索系统。问题的本质是普通索引(包括全文)无法解决的问题。因此我们将其设计为使用全表扫描进行供电。
That involved creating tables as narrow as possible with the search data, and joining them to a larger table with related, but non-search data.
这涉及使用搜索数据创建尽可能窄的表,并将它们连接到具有相关但非搜索数据的较大表。
At the time (4 years ago), 100,000 records could be scanned in .06 seconds. 1,000,000 records took about .6 seconds. The system is still in heavy production use with millions of records.
当时(4年前),可以在.06秒内扫描100,000条记录。 1,000,000条记录大约需要0.6秒。该系统仍在大量生产中使用,拥有数百万条记录。
If your data needs exceed 6 digits of records, you may want to re-evaluate using a full text index, or do some research on inverted indexes.
如果您的数据需要超过6位数的记录,您可能需要使用全文索引重新评估,或对倒排索引进行一些研究。
Please comment if you would like any more info.
如果您想了解更多信息,请发表评论。
Edit: The search tables were kept as narrow as possible. Ideally 50-100 bytes per record. ENUMS and TINYINT are great space savers if you can use them to "map" to string values another way.
编辑:搜索表保持尽可能窄。理想情况下,每条记录50-100个字节。 ENUMS和TINYINT是很好的节省空间,如果你可以用它们以另一种方式“映射”到字符串值。
The search queries were generated using a PHP class. They were simply:
搜索查询是使用PHP类生成的。他们只是:
-- DataTable is the big table that holds all of the data
-- SearchTable is the narrow table that holds the bits of searchable data
SELECT
MainTable.ID,
MainTable.Name,
MainTable.Whatever
FROM
MainTable, SearchTable
WHERE
MainTable.ID = SearchTable.ID
AND SearchTable.State IN ('PA', 'DE')
AND SearchTable.Age < 40
AND SearchTable.Status = 3
Essentially, the two tables were joined on a primary key (fast), and the filtering was done by full table scan on the SearchTable (pretty fast). We were using MySQL.
本质上,这两个表是在主键(快速)上连接的,并且过滤是通过SearchTable上的全表扫描完成的(非常快)。我们使用MySQL。
We found that by having the record format == "FIXED" in the MyISAM tables, we could increase performace by 3x. This meant no blobs, no varchars, etc...
我们发现通过在MyISAM表中记录格式==“FIXED”,我们可以将性能提高3倍。这意味着没有blob,没有varchars等...
Let me know if this helps.
如果这有帮助,请告诉我。
#2
None as efficient as Fulltext search.
没有像全文搜索一样有效。
Basically it boils down to where with like derivatives and since indexes are tossed away in most of the scenarios , it becomes a very expensive query.
基本上它归结为具有相似衍生物的地方,并且由于索引在大多数场景中被抛弃,因此它变成了非常昂贵的查询。
#3
If you are using JAVA have at look at Lucene
如果你正在使用JAVA看看Lucene
If you are using .net, you can have a look at Lucene.net, it will minimize the calls to the database for the search queries.
如果您使用的是.net,您可以查看Lucene.net,它将最小化对搜索查询的数据库调用。
Following from http://incubator.apache.org/lucene.net/
来自http://incubator.apache.org/lucene.net/
Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.
Lucene.Net是使用Microsoft .NET Framework的C#和.NET平台的Java Lucene搜索引擎的源代码,每类,API-per-API和算法端口。
Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.
Lucene.Net坚持使用Lucene的原始Java实现中使用的API和类。保留API名称和类名称的目的是为Lucene.Net提供C#语言和.NET Framework的外观。例如,Java实现中的方法Hits.length()现在读取C#端口中的Hits.Length()。
In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.
除了端口到C#的API和类之外,Java Lucene的算法还移植到C#Lucene。这意味着使用Java Lucene创建的索引与C#Lucene来回兼容;在阅读,写作和更新方面。实际上,可以使用Java Lucene和C#Lucene进程同时搜索和更新Lucene索引。
#4
You could break up the text into individual words, stick them in a separate table, and use that to find PK IDs that have all the words in your search sentence [i.e. but not necessarily in the right order], and then search just those rows for the sentence. Should avoid having to do a table scan every time.
您可以将文本分解为单个单词,将它们粘贴在单独的表中,并使用它来查找包含搜索语句中所有单词的PK ID [即但不一定按照正确的顺序],然后搜索句子的那些行。应该避免每次都要进行表扫描。
Please ask if you need me to explain further
请问您是否需要我进一步解释
#1
Without the use of an index, a database has to perform a "full table scan". This is rather like you looking through a book one page at a time to find what you need.
如果不使用索引,数据库必须执行“全表扫描”。这就像你一次翻阅一本书,找到你需要的东西。
That being said, computers are a lot faster than humans. It really depends on how much load your system has. Using MySQL we successfully implemented a search system on a table of lead information. The nature of the problem was one that could not be solved by normal indexes (including full text). So we designed it to be powered using a full table scan.
话虽如此,计算机比人类快得多。这实际上取决于您的系统有多少负载。使用MySQL,我们成功地在领导信息表上实现了一个搜索系统。问题的本质是普通索引(包括全文)无法解决的问题。因此我们将其设计为使用全表扫描进行供电。
That involved creating tables as narrow as possible with the search data, and joining them to a larger table with related, but non-search data.
这涉及使用搜索数据创建尽可能窄的表,并将它们连接到具有相关但非搜索数据的较大表。
At the time (4 years ago), 100,000 records could be scanned in .06 seconds. 1,000,000 records took about .6 seconds. The system is still in heavy production use with millions of records.
当时(4年前),可以在.06秒内扫描100,000条记录。 1,000,000条记录大约需要0.6秒。该系统仍在大量生产中使用,拥有数百万条记录。
If your data needs exceed 6 digits of records, you may want to re-evaluate using a full text index, or do some research on inverted indexes.
如果您的数据需要超过6位数的记录,您可能需要使用全文索引重新评估,或对倒排索引进行一些研究。
Please comment if you would like any more info.
如果您想了解更多信息,请发表评论。
Edit: The search tables were kept as narrow as possible. Ideally 50-100 bytes per record. ENUMS and TINYINT are great space savers if you can use them to "map" to string values another way.
编辑:搜索表保持尽可能窄。理想情况下,每条记录50-100个字节。 ENUMS和TINYINT是很好的节省空间,如果你可以用它们以另一种方式“映射”到字符串值。
The search queries were generated using a PHP class. They were simply:
搜索查询是使用PHP类生成的。他们只是:
-- DataTable is the big table that holds all of the data
-- SearchTable is the narrow table that holds the bits of searchable data
SELECT
MainTable.ID,
MainTable.Name,
MainTable.Whatever
FROM
MainTable, SearchTable
WHERE
MainTable.ID = SearchTable.ID
AND SearchTable.State IN ('PA', 'DE')
AND SearchTable.Age < 40
AND SearchTable.Status = 3
Essentially, the two tables were joined on a primary key (fast), and the filtering was done by full table scan on the SearchTable (pretty fast). We were using MySQL.
本质上,这两个表是在主键(快速)上连接的,并且过滤是通过SearchTable上的全表扫描完成的(非常快)。我们使用MySQL。
We found that by having the record format == "FIXED" in the MyISAM tables, we could increase performace by 3x. This meant no blobs, no varchars, etc...
我们发现通过在MyISAM表中记录格式==“FIXED”,我们可以将性能提高3倍。这意味着没有blob,没有varchars等...
Let me know if this helps.
如果这有帮助,请告诉我。
#2
None as efficient as Fulltext search.
没有像全文搜索一样有效。
Basically it boils down to where with like derivatives and since indexes are tossed away in most of the scenarios , it becomes a very expensive query.
基本上它归结为具有相似衍生物的地方,并且由于索引在大多数场景中被抛弃,因此它变成了非常昂贵的查询。
#3
If you are using JAVA have at look at Lucene
如果你正在使用JAVA看看Lucene
If you are using .net, you can have a look at Lucene.net, it will minimize the calls to the database for the search queries.
如果您使用的是.net,您可以查看Lucene.net,它将最小化对搜索查询的数据库调用。
Following from http://incubator.apache.org/lucene.net/
来自http://incubator.apache.org/lucene.net/
Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.
Lucene.Net是使用Microsoft .NET Framework的C#和.NET平台的Java Lucene搜索引擎的源代码,每类,API-per-API和算法端口。
Lucene.Net sticks to the APIs and classes used in the original Java implementation of Lucene. The API names as well as class names are preserved with the intention of giving Lucene.Net the look and feel of the C# language and the .NET Framework. For example, the method Hits.length() in the Java implementation now reads Hits.Length() in the C# port.
Lucene.Net坚持使用Lucene的原始Java实现中使用的API和类。保留API名称和类名称的目的是为Lucene.Net提供C#语言和.NET Framework的外观。例如,Java实现中的方法Hits.length()现在读取C#端口中的Hits.Length()。
In addition to the APIs and classes port to C#, the algorithm of Java Lucene is ported to C# Lucene. This means an index created with Java Lucene is back-and-forth compatible with the C# Lucene; both at reading, writing and updating. In fact a Lucene index can be concurrently searched and updated using Java Lucene and C# Lucene processes.
除了端口到C#的API和类之外,Java Lucene的算法还移植到C#Lucene。这意味着使用Java Lucene创建的索引与C#Lucene来回兼容;在阅读,写作和更新方面。实际上,可以使用Java Lucene和C#Lucene进程同时搜索和更新Lucene索引。
#4
You could break up the text into individual words, stick them in a separate table, and use that to find PK IDs that have all the words in your search sentence [i.e. but not necessarily in the right order], and then search just those rows for the sentence. Should avoid having to do a table scan every time.
您可以将文本分解为单个单词,将它们粘贴在单独的表中,并使用它来查找包含搜索语句中所有单词的PK ID [即但不一定按照正确的顺序],然后搜索句子的那些行。应该避免每次都要进行表扫描。
Please ask if you need me to explain further
请问您是否需要我进一步解释