MySQL full-text search with word boundaries

Date: 2022-09-19 19:10:58

I've read some articles and issues, but couldn't find a satisfying solution. I want to select related records from the database while a user fills in a form, the same way this site suggests related questions when you ask one.

Consider a database table with the following three records in the column subject:

+---+---------------------------------------------------+
| 1 | Pagina aanmaken en beter doorzoekbaar maken       |
+---+---------------------------------------------------+
| 2 | Sorteerfunctie uitbreiden in zoek-en-boek functie |
+---+---------------------------------------------------+
| 3 | Zoek de verschillen tussen de pagina's            |
+---+---------------------------------------------------+

I start my search query with the word zoek, so I want to retrieve the most relevant results from the database for the term zoek. I came up with the following query:

SELECT 
    id, 
    subject, 
    MATCH(
        subject
    ) 
    AGAINST(
        'zoek*'
        IN BOOLEAN MODE
    ) 
    AS 
        score
FROM 
    Issues 
WHERE 
    MATCH(
        subject
    ) 
    AGAINST(
        'zoek*'
        IN BOOLEAN MODE
    )

When I ran this query, I expected all the records to show up and (probably; I don't know how relevance scoring works in MySQL) ID 3 to appear on top, because it is an exact word match.

Instead, the query returned only rows 2 and 3, with exactly the same score (0.031008131802082062).

What do I need to change in my query to match the appropriate records? Also consider that users can type in keywords or whole sentences.

5 answers

#1


0  

There is a workaround for your case:

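SELECT
    id,
    subject,
    IF(subject LIKE 'zoek %' OR subject LIKE '% zoek %' OR subject LIKE '% zoek',
        1,      -- "zoek" as a whole, space-delimited word
        IF(subject LIKE '% zoek%',
            0.5,    -- a word after a space that starts with "zoek"
            IF(subject LIKE '%zoek%',
                0.2,    -- "zoek" anywhere else, e.g. inside a word
                0))) AS score
FROM
    Issues
WHERE subject LIKE '%zoek%'
ORDER BY score DESC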

Expected result:

+---+---------------------------------------------------+------+
|id |   subject                                         |score |    
+---+---------------------------------------------------+------+
|3  | Zoek de verschillen tussen de pagina's            | 1    |
+---+---------------------------------------------------+------+
|2  | Sorteerfunctie uitbreiden in zoek-en-boek functie | 0.5  |
+---+---------------------------------------------------+------+
|1  | Pagina aanmaken en beter doorzoekbaar maken       | 0.2  |
+---+---------------------------------------------------+------+
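
To sanity-check this ordering, the same tiers can be run against an in-memory SQLite database from Python. This is only a sketch under assumptions: SQLite stands in for MySQL, CASE replaces IF() for portability, and LIKE is assumed to be case-insensitive for these ASCII subjects (in MySQL that depends on the column's collation).

```python
import sqlite3

# The three subjects from the question.
rows = [
    (1, "Pagina aanmaken en beter doorzoekbaar maken"),
    (2, "Sorteerfunctie uitbreiden in zoek-en-boek functie"),
    (3, "Zoek de verschillen tussen de pagina's"),
]

conn = sqlite3.connect(":memory:")  # SQLite is a stand-in for MySQL here
conn.execute("CREATE TABLE Issues (id INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany("INSERT INTO Issues VALUES (?, ?)", rows)

# Same three tiers as the workaround, written with CASE instead of IF().
query = """
SELECT id,
       CASE
           WHEN subject LIKE 'zoek %'
             OR subject LIKE '% zoek %'
             OR subject LIKE '% zoek' THEN 1
           WHEN subject LIKE '% zoek%' THEN 0.5
           WHEN subject LIKE '%zoek%' THEN 0.2
           ELSE 0
       END AS score
FROM Issues
WHERE subject LIKE '%zoek%'
ORDER BY score DESC
"""
ranked = conn.execute(query).fetchall()
print(ranked)
```

Note that the tiers only recognise spaces as word boundaries, which is why zoek-en-boek lands in the 0.5 tier rather than the top one.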

#2


2  

MySQL full-text search doesn't support suffix matching: the * wildcard can only be appended to a term, not prepended.

To get the first row you would have to match against '*zoek*', which is currently not allowed.

The alternative is to use LIKE:

SELECT id, subject
FROM Issues 
WHERE subject LIKE '%zoek%' 
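
Since the search term comes from a form, the pattern should be built from escaped input and passed as a bound parameter rather than concatenated into the SQL. A minimal Python sketch, assuming MySQL's default LIKE escape character (backslash); the helper name like_contains_pattern is hypothetical:

```python
def like_contains_pattern(term: str) -> str:
    """Build a '%term%' pattern with LIKE wildcards in the user input escaped.

    Assumes backslash is the LIKE escape character (the MySQL default).
    """
    escaped = (
        term.replace("\\", "\\\\")  # escape the escape character first
            .replace("%", "\\%")    # literal percent sign
            .replace("_", "\\_")    # literal underscore
    )
    return f"%{escaped}%"


pattern = like_contains_pattern("zoek")
print(pattern)

# With a DB-API driver that uses the %s parameter style, the pattern would be
# bound as a parameter (illustrative only, not executed here):
#   cursor.execute(
#       "SELECT id, subject FROM Issues WHERE subject LIKE %s", (pattern,)
#   )
```

Keep in mind that a pattern with a leading % cannot use a regular index, so this approach scans the whole table.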

#3


1  

As others have noted, MySQL's FULLTEXT indexes do not support leading wildcards, and therefore cannot help in searching for suffixes.

However, the new ngram full-text parser might help:

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser (...).

An ngram is a contiguous sequence of n characters from a given sequence of text. The ngram parser tokenizes a sequence of text into a contiguous sequence of n characters.

As I have never used this feature, I cannot help further on this topic. Note, however:

Because an ngram FULLTEXT index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results.
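
To illustrate why ngrams could reach the middle of a word, here is a small Python sketch of the tokenization idea only. It assumes a token size of 2 and does not reproduce how MySQL actually matches or ranks ngram searches; the function name ngrams is hypothetical.

```python
def ngrams(text: str, n: int = 2) -> list[str]:
    """Return every contiguous run of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


term_grams = ngrams("zoek")
word_grams = ngrams("doorzoekbaar")
print(term_grams)

# Every ngram of the search term also occurs in the compound word, which a
# whole-word index cannot see.
print(set(term_grams) <= set(word_grams))
```

The same property explains the caveat quoted above: because only ngrams are stored, the index has no notion of where a term begins.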

#4


1  

Try these queries for different results:

  1. Select all subjects that start with the letter "z":
    SELECT ID, Subject FROM table_name WHERE Subject LIKE 'z%';

  2. Select all subjects that end with the letter "z":
    SELECT ID, Subject FROM table_name WHERE Subject LIKE '%z';

  3. Select all subjects containing the pattern "zoek":
    SELECT ID, Subject FROM table_name WHERE Subject LIKE '%zoek%';

#5


0  

Sorry...

By definition, MySQL's FULLTEXT will not find a match in the middle of a word (doorzoekbaar). FULLTEXT has no concept of "compound nouns", so it won't attempt to pick the word apart.

The definition of a "word" in FULLTEXT gives 'dash' and 'space' the same meaning, namely a word boundary. So zoek de... and zoek-... are given equal weight.
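
A rough Python approximation of that behaviour shows why zoek* reaches rows 2 and 3 but not row 1. This is a sketch, not MySQL's actual parser: it assumes words are runs of letters, digits, underscores, and apostrophes, and the names tokenize and prefix_match are hypothetical.

```python
import re

subjects = {
    1: "Pagina aanmaken en beter doorzoekbaar maken",
    2: "Sorteerfunctie uitbreiden in zoek-en-boek functie",
    3: "Zoek de verschillen tussen de pagina's",
}


def tokenize(text: str) -> list[str]:
    """Split into lowercase words; dashes and spaces both act as boundaries."""
    return [t.lower() for t in re.findall(r"[\w']+", text)]


def prefix_match(text: str, prefix: str) -> bool:
    """True if any word in text starts with prefix (what 'zoek*' asks for)."""
    return any(token.startswith(prefix) for token in tokenize(text))


matches = [sid for sid, subject in subjects.items() if prefix_match(subject, "zoek")]
print(tokenize("zoek-en-boek"))
print(matches)
```

In rows 2 and 3 zoek is a word of its own, while in row 1 it only occurs inside doorzoekbaar, so no word there starts with the prefix.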

Look at Solr, Lucene, and other 3rd party "fulltext solutions". They may (or may not) provide what you want.

zoek* and +zoek*, when run IN BOOLEAN MODE, will find zoekbaar.
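
Because users may type several keywords or a whole sentence, the boolean search string has to be built from their input. A minimal Python sketch follows; the function name build_boolean_query and its parameters are hypothetical, and the default min_len=3 is an assumption that should match the server's minimum token size.

```python
import re


def build_boolean_query(user_input: str, min_len: int = 3,
                        require_all: bool = False) -> str:
    """Turn free text into a BOOLEAN MODE string of prefix terms.

    Keeping only word characters drops boolean operators such as + - * " ( )
    that the user may have typed. Words shorter than min_len are skipped.
    """
    words = re.findall(r"\w+", user_input.lower())
    kept = [w for w in words if len(w) >= min_len]
    prefix = "+" if require_all else ""
    return " ".join(f"{prefix}{w}*" for w in kept)


print(build_boolean_query("Zoek de pagina's"))
print(build_boolean_query("Zoek de pagina's", require_all=True))
```

The resulting string would then be passed as a bound parameter to AGAINST(... IN BOOLEAN MODE). With require_all=True every kept word must match; without it, any of them may.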
