What is the best way to implement a substring search in SQL?

Time: 2022-09-13 09:35:42

We have a simple SQL problem here. In a varchar column, we want to search for a string anywhere in the field. What is the best way to implement this for performance? Obviously an index is not going to help here; are there any other tricks?

We are using MySQL and have about 3 million records. We need to execute many of these queries per second, so we are really trying to implement them with the best possible performance.

The simplest way to do this so far is:

Select * from table where column like '%search%'

I should further specify that the column is actually a long string like "sadfasdfwerwe" and I have to search for "asdf" in this column. So these are not sentences, and I am not trying to match a word in them. Would full text search still help here?

4 solutions

#1


14  

Check out my presentation Practical Fulltext Search in MySQL.

I compared:

Today what I would use is Apache Solr, which puts Lucene into a service with a bunch of extra features and tools.

Re your comment: Aha, okay, no. None of the fulltext search capabilities I mentioned are going to help, since they all assume some kind of word boundaries.

The other way to efficiently find arbitrary substrings is the N-gram approach. Basically, create an index of all possible sequences of N letters and point to the strings where each respective sequence occurs. Typically this is done with N=3, or a trigram, because it's a point of compromise between matching longer substrings and keeping the index to a manageable size.

I don't know of any SQL database that supports N-gram indexing transparently, but you could set it up yourself using an inverted index:

create table trigrams (
  trigram char(3) primary key
);

create table trigram_matches (
  trigram char(3),
  document_id int,
  primary key (trigram, document_id),
  foreign key (trigram) references trigrams(trigram),
  foreign key (document_id) references mytable(document_id)
);

Now populate it the hard way:

insert into trigram_matches
  select t.trigram, d.document_id
  from trigrams t join mytable d
    on d.textcolumn like concat('%', t.trigram, '%');
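The join above assumes the trigrams table already holds every trigram of interest, and it compares every trigram against every row. A possible alternative, sketched here under the assumption of MySQL 8.0+ (for the recursive CTE), is to slice each string at every position and fill both tables directly from the data. The `positions` helper table and the 255-character bound are assumptions for illustration, not part of the original answer:

```sql
-- Hypothetical helper table: one row per character position.
-- 255 is an assumed upper bound on the column length; adjust to your schema.
create table positions (n int primary key);

insert into positions (n)
with recursive seq (n) as (
  select 1
  union all
  select n + 1 from seq where n < 255
)
select n from seq;

-- Every distinct 3-character slice found in the data becomes a trigram.
insert ignore into trigrams (trigram)
select distinct substring(d.textcolumn, p.n, 3)
from mytable d
join positions p on p.n <= char_length(d.textcolumn) - 2;

-- One (trigram, document) pair per distinct slice per document.
insert ignore into trigram_matches (trigram, document_id)
select distinct substring(d.textcolumn, p.n, 3), d.document_id
from mytable d
join positions p on p.n <= char_length(d.textcolumn) - 2;
```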

Of course this will take quite a while! But once it's done, you can search much more quickly:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc';

Of course you could be searching for patterns longer than three characters, but the inverted index still helps to narrow your search a lot:

select d.*
from mytable d join trigram_matches t
  on t.document_id = d.document_id
where t.trigram = 'abc'
  and d.textcolumn like '%abcdef%';
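
When the pattern is longer than three characters, more than one of its trigrams can be used to narrow the candidates before the final LIKE check. A sketch reusing the tables above, with two non-overlapping trigrams of 'abcdef' picked purely as an illustration:

```sql
-- Candidates must contain both 'abc' and 'def'; the LIKE then confirms
-- that the two trigrams are adjacent and in the right order.
select d.*
from mytable d
join trigram_matches t1
  on t1.document_id = d.document_id and t1.trigram = 'abc'
join trigram_matches t2
  on t2.document_id = d.document_id and t2.trigram = 'def'
where d.textcolumn like '%abcdef%';
```

Whether the extra join pays off depends on how selective each trigram is; a rare trigram alone may already narrow the search enough.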

#2


0  

If you want to match whole words, look at a FULLTEXT index and MATCH() AGAINST(). And of course, take some load off your database server: cache results for an amount of time appropriate to your specific needs.
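
As a rough sketch of the whole-word case, assuming the `mytable`/`textcolumn` names from the first answer and a hypothetical index name `ft_textcolumn`:

```sql
-- Assumes textcolumn holds natural-language text with word boundaries.
-- FULLTEXT on InnoDB requires MySQL 5.6+; older versions need MyISAM.
alter table mytable add fulltext index ft_textcolumn (textcolumn);

-- Matches rows containing the whole word 'search', not arbitrary substrings.
select d.*
from mytable d
where match(d.textcolumn) against ('search' in natural language mode);
```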

#3


0  

First, maybe this is an issue with a badly designed table that stores a delimited string in one field instead of using a properly designed related table. If this is the case, you should fix your design.

If you have a field with long descriptive text (say, a notes field) and the search is always by whole word, you can do a full-text search.

Consider if you can require your users to at least give you the first character of what they are searching for if it is an ordinary field like Last_name.
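
The reason a known first character helps is that a pattern without a leading wildcard can use an ordinary index. A minimal sketch, where the `people` table and the index name are hypothetical:

```sql
-- Hypothetical table people(last_name); idx_last_name is an assumed name.
create index idx_last_name on people (last_name);

-- No leading wildcard, so the B-tree index can serve this as a range scan.
select * from people where last_name like 'Smi%';
```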

Consider doing an exact match search first and only performing the wildcard match if no results are returned. This will work if you have users who can provide exact matches. We did this once with airport name searches: it came back really fast if they put in the exact name and slower if they did not.
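
One way to express that fallback in a single statement is sketched below. The `airports` table and the search value are hypothetical, and running two separate queries from the application works just as well:

```sql
-- Exact match first; served by an index on name if one exists.
select a.* from airports a where a.name = 'Heathrow'
union all
-- The wildcard branch contributes rows only when no exact match exists.
select a.* from airports a
where a.name like '%Heathrow%'
  and not exists (select 1 from airports x where x.name = 'Heathrow');
```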

If you want to search for arbitrary strings (not whole words) that may appear anywhere in the text, you are pretty much stuck with bad performance.

#4


0  

  1. MySQL full-text search quality (for this purpose) is poor if your language is not English.

  2. Trigram search gives very good results for this task.

  3. PostgreSQL has a trigram index, and it's easy to use :)

  4. But if you need to do it in MySQL, try this improved version of Bill Karwin's answer:

    - each trigram is stored only once

    - a simple PHP class uses the data

    <?php
    
      /*
    
        # mysql table structure
        CREATE TABLE `trigram2content` (
    `trigram_id` int NOT NULL REFERENCES trigrams(id),
    `content_type_id` int(11) NOT NULL,
    `record_id` int(11) NOT NULL,
    PRIMARY KEY (`content_type_id`,`trigram_id`,`record_id`)
    );
    
    #each trigram is stored only once
    CREATE TABLE `trigrams` (
    `id` int not null auto_increment,
    `token` varchar(3) NOT NULL,
    PRIMARY KEY (id),
    UNIQUE token(token)
    ) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
    
    
    SELECT count(*), record_id FROM trigrams t
    inner join trigram2content c ON t.id=c.trigram_id
    WHERE (
    t.token IN ('loc','ock','ck ','blo',' bl', ' bu', 'bur', 'urn')
    AND c.content_type_id = 0
    )
    GROUP by record_id
    ORDER BY count(*) DESC
    limit 20;
    
    
    */
    class trigram
    {
    
        private $dbLink;
    
        var $types = array(
            array(0, 'name'),
            array(1, 'city'));
    
    
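        // Note: the mysql_* functions used throughout this class were removed in PHP 7.0;
        // on current PHP versions they need to be ported to mysqli or PDO.
        function __construct()
        {
          //connect to db
          $this->dbLink = mysql_connect("localhost", "username", "password");
          if ($this->dbLink) mysql_select_db("dbname", $this->dbLink);
          else die(mysql_error()); // stop instead of silently discarding the error
    
          mysql_query("SET NAMES utf8;", $this->dbLink);
        }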
    
        function get_type_value($type_name){
          for($i=0; $i<count($this->types); $i++){
              if($this->types[$i][1] == $type_name)
                  return $this->types[$i][0];
          }
          return "";
        }
    
        function getNgrams($word, $n = 3) {
            $ngrams = array();
            $len = mb_strlen($word, 'utf-8');
            for($i = 0; $i < $len-($n-1); $i++) {
                $ngrams[] = mysql_real_escape_string(mb_substr($word, $i, $n, 'utf-8'), $this->dbLink);
            }
            return $ngrams;
        }
    
        /**
        input: array('hel', 'ell', 'llo', 'lo ', 'o B', ' Be', 'Bel', 'ell', 'llo', 'lo ', 'o  ')
        output: array(1,     2,     3,      4,      5,      6,      7,     2,   3,  4,      8)
        */
        private function getTrigramIds(&$t){
            $u = array_unique($t);
            $q = "SELECT * FROM trigrams WHERE token IN ('" . implode("', '", $u) . "')";
    
            $query = mysql_query($q, $this->dbLink);
            $n = mysql_num_rows($query);
    
            $ids = array(); //these trigrams are already in db, they have id
            $ok = array();
    
            for ($i=0; $i<$n; $i++)
            {
              $row = mysql_fetch_array($query, MYSQL_ASSOC);
              $ok []= $row['token'];
              $ids[ $row['token'] ] = $row['id'];
            }
            $diff = array_diff($u, $ok); //these trigrams are not yet in the db
            foreach($diff as $n){
                mysql_query("INSERT INTO trigrams (token) VALUES('$n')", $this->dbLink);
                $ids[$n]= mysql_insert_id();
            }
    
            //as many ids as items (if a trigram occurs several times in the input, it occurs several times in the output as well)
            $result = array();
            foreach($t as $n){
                $result[]= $ids[$n];
            }
            return $result;
        }
    
        function insertData($id, $data, $type){
            $t = $this->getNgrams($data);
    
            $id = intval($id);
            $type = $this->get_type_value($type);
            $tIds = $this->getTrigramIds($t);
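            // INSERT IGNORE: a trigram that repeats within $data would otherwise violate the primary key
            $q = "INSERT IGNORE INTO trigram2content (trigram_id, content_type_id, record_id) VALUES ";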
            $rows = array();
            foreach($tIds as $n => $tid){
                $rows[]= "($tid, $type, $id)";
            }
            $q .= implode(", ", $rows);
            mysql_query($q, $this->dbLink);
        }
    
        function updateData($id, $data, $type){
            mysql_query("DELETE FROM trigram2content WHERE record_id=".intval($id)." AND content_type_id=".$this->get_type_value($type), $this->dbLink);
            $this->insertData($id, $data, $type);
        }
    
        function search($str, $type){
    
            $tri = $this->getNgrams($str);
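            $max = count(array_unique($tri)); // each trigram is stored once per record, so score against distinct trigrams
            if ($max == 0) return array();    // fewer than 3 characters: nothing to look up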
            $q = "SELECT count(*), count(*)/$max as score, record_id FROM trigrams t inner join trigram2content c ON t.id=c.trigram_id
    WHERE (
    t.token IN ('" . implode("', '", $tri) . "')
    AND c.content_type_id = ".$this->get_type_value($type)."
    )
    GROUP by record_id
    HAVING score >= 0.6
    ORDER BY count(*) DESC
    limit 20;";
            $query = mysql_query($q, $this->dbLink);
            $n = mysql_num_rows($query);
    
            $result = array();
            for ($i=0; $i<$n; $i++)
            {
              $row = mysql_fetch_array($query, MYSQL_ASSOC);
              $result[] = $row;
            }
            return $result;
        }
    
    
    };
    

and usage:

 $t = new trigram();

 $t->insertData(1, "hello bello", "name");
 $t->insertData(2, "hellllo Mammmma mia", "name");

  print_r($t->search("helo", "name"));
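
For point 3 above, the PostgreSQL route can be sketched as follows. This is PostgreSQL syntax rather than MySQL, it reuses the `mytable`/`textcolumn` names from the first answer, and the index name is an assumption:

```sql
-- PostgreSQL only: pg_trgm ships as a contrib extension.
create extension if not exists pg_trgm;

-- GIN index over the trigrams of textcolumn (index name is hypothetical).
create index mytable_textcolumn_trgm
  on mytable using gin (textcolumn gin_trgm_ops);

-- A LIKE with a leading wildcard can now be answered from the trigram index.
select * from mytable where textcolumn like '%asdf%';
```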
