How to find the varchar word that has the most similar beginning to a specified word in a MySQL database?
For example:
+-------------------+
| word_column       |
+-------------------+
| StackOferflow     |
| StackExchange     |
| MetaStackExchange |
| ....              |
+-------------------+
query: call get_with_similar_begin('StackExch_bla_bla_bla');
output: 'StackExchange'
query: call get_with_similar_begin('StackO_bla_bla_bla');
output: 'StackOferflow'
UPDATE:
Select * from words where word_column like 'StackExch_bla_bla_bla' will not give the correct result, because 'StackExchange' does not match this filter.
Additional info: I have a BTREE index on word_column and I would like to use it whenever possible.
4 Solutions
#1
2
In SQL Server we can use a CTE like the query below to achieve what you want:
declare @search nvarchar(255) = 'StackExch_bla_bla_bla';
-- A cte that contains `StackExch_bla_bla_bla` sub-strings: {`StackExch_bla_bla_bla`, `StackExch_bla_bla_bl`, ..., `S`}
with cte(part, lvl) as (
select @search, 1
union all
select substring(@search, 1, len(@search) - lvl), lvl + 1
from cte
where lvl < len(@search)
), t as ( -- Now below cte will find match level of each word_column
select t.word_column, min(cte.lvl) matchLvl
from yourTable t
left join cte
on t.word_column like cte.part+'%'
group by t.word_column
)
select top(1) word_column
from t
where matchLvl is not null -- remove non-matched rows
order by matchLvl;
SQL Server Fiddle Demo
I need more time to find a way to do this in MySQL; I hope some MySQL experts answer faster ;).
My best try in MySQL is this:
select tt.word_column
from (
select t.word_column, min(lvl) matchLvl
from yourTable t
join (
select 'StackExch_bla_bla_bla' part, 1 lvl
union all select 'StackExch_bla_bla_bl', 2
union all select 'StackExch_bla_bla_b', 3
union all select 'StackExch_bla_bla_', 4
union all select 'StackExch_bla_bla', 5
union all select 'StackExch_bla_bl', 6
union all select 'StackExch_bla_b', 7
union all select 'StackExch_bla_', 8
union all select 'StackExch_bla', 9
union all select 'StackExch_bl', 10
union all select 'StackExch_b', 11
union all select 'StackExch_', 12
union all select 'StackExch', 13
union all select 'StackExc', 14
union all select 'StackEx', 15
union all select 'StackE', 16
union all select 'Stack', 17
union all select 'Stac', 18
union all select 'Sta', 19
union all select 'St', 20
union all select 'S', 21
) p on t.word_column like concat(p.part, '%')
group by t.word_column
) tt
order by matchLvl
limit 1;
I think that by creating a stored procedure and using a temp table to store the values of the p sub-select, you can achieve what you want. HTH ;).
MySQL Fiddle Demo
#2
2
This is a slight variation on @shA.t's answer. The aggregation is not necessary:
select t.*, p.lvl
from yourTable t join
(select 'StackExch_bla_bla_bla' as part, 1 as lvl union all
select 'StackExch_bla_bla_bl', 2 union all
select 'StackExch_bla_bla_b', 3 union all
select 'StackExch_bla_bla_', 4 union all
select 'StackExch_bla_bla', 5 union all
select 'StackExch_bla_bl', 6 union all
select 'StackExch_bla_b', 7 union all
select 'StackExch_bla_', 8 union all
select 'StackExch_bla', 9 union all
select 'StackExch_bl', 10 union all
select 'StackExch_b', 11 union all
select 'StackExch_', 12 union all
select 'StackExch', 13 union all
select 'StackExc', 14 union all
select 'StackEx', 15 union all
select 'StackE', 16 union all
select 'Stack', 17 union all
select 'Stac', 18 union all
select 'Sta', 19 union all
select 'St', 20 union all
select 'S', 21
) p
on t.word_column like concat(p.part, '%')
order by p.lvl
limit 1;
A faster way is to use case:
select t.*,
(case when t.word_column like concat('StackExch_bla_bla_bla', '%') then 'StackExch_bla_bla_bla'
when t.word_column like concat('StackExch_bla_bla_bl', '%') then 'StackExch_bla_bla_bl'
when t.word_column like concat('StackExch_bla_bla_b', '%') then 'StackExch_bla_bla_b'
. . .
when t.word_column like concat('S', '%') then 'S'
else ''
end) as longest_match
from t
order by length(longest_match) desc
limit 1;
Neither of these will make effective use of the index.
If you want a version that uses the index, then do the looping at the application layer, and repeatedly run the query as:
select t.*
from t
where t.word_column like 'StackExch_bla_bla_bla%'
limit 1;
Shorten the pattern by one character on each iteration, and stop when you hit the first match. MySQL should be using the index for the like comparison.
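The application-layer loop described here could look roughly like the following Python sketch. It is an illustration under assumptions, not the answerer's code: it uses Python's built-in sqlite3 module only so that it runs without a server (with MySQL you would use a MySQL driver, and the ESCAPE clause would be optional because backslash is MySQL's default LIKE escape character), the table is assumed to be named words as in the question, and escape_like and longest_prefix_match are hypothetical helper names.

```python
import sqlite3


def escape_like(text):
    """Escape LIKE wildcards so that the text is matched literally."""
    return text.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def longest_prefix_match(conn, search):
    """Try ever-shorter prefixes of `search`; return the first matching word."""
    for length in range(len(search), 0, -1):
        pattern = escape_like(search[:length]) + "%"
        row = conn.execute(
            "SELECT word_column FROM words "
            "WHERE word_column LIKE ? ESCAPE '\\' LIMIT 1",
            (pattern,),
        ).fetchone()
        if row is not None:
            return row[0]  # stop at the first (longest) prefix that matches
    return None


# Assumption: a tiny in-memory copy of the question's sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word_column TEXT)")
conn.execute("CREATE INDEX idx_word ON words (word_column)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("StackOferflow",), ("StackExchange",), ("MetaStackExchange",)],
)

print(longest_prefix_match(conn, "StackExch_bla_bla_bla"))
print(longest_prefix_match(conn, "StackO_bla_bla_bla"))
```

Each iteration issues one prefix query, so in the worst case the loop runs as many queries as the search string has characters.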
You can come pretty close to this using a union all:
(select t.*, 'StackExch_bla_bla_bla' as matching
from t
where t.word_column like 'StackExch_bla_bla_bla%'
limit 1
) union all
(select t.*, 'StackExch_bla_bla_bl'
from t
where t.word_column like 'StackExch_bla_bla_bl%'
limit 1
) union all
(select t.*, 'StackExch_bla_bla_b'
from t
where t.word_column like 'StackExch_bla_bla_b%'
limit 1
) union all
. . .
(select t.*, 'S'
from t
where t.word_column like 'S%'
limit 1
)
order by length(matching) desc
limit 1;
#3
2
Create table/insert data.
CREATE DATABASE IF NOT EXISTS stackoverflow;
USE stackoverflow;
DROP TABLE IF EXISTS word;
CREATE TABLE IF NOT EXISTS word(
word_column VARCHAR(255)
, KEY(word_column)
)
;
INSERT INTO word
(`word_column`)
VALUES
('StackOverflow'),
('StackExchange'),
('MetaStackExchange')
;
This solution depends on generating a list of numbers. The following query generates the numbers from 1 to 1000, so the approach supports search strings of up to 1000 characters.
Query
-- Three 10-row digit tables cross-joined give 10 x 10 x 10 = 1000 rows;
-- the user variable @row numbers them 1..1000.
SELECT
  @row := @row + 1 AS n
FROM (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row1
CROSS JOIN (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row2
CROSS JOIN (
  SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
) row3
CROSS JOIN (
  SELECT @row := 0
) AS init_user_param
Result
n
--------
1
2
3
4
5
6
7
8
9
10
...
...
990
991
992
993
994
995
996
997
998
999
1000
Now we use the last query as a derived table in combination with DISTINCT SUBSTRING('StackExch_bla_bla_bla', 1, [number]) to build the list of unique prefixes of the search string.
Query
-- DISTINCT collapses the rows where n exceeds the length of the search string.
SELECT DISTINCT
  SUBSTRING('StackExch_bla_bla_bla', 1, nums.n) AS word
FROM (
  SELECT
    @row := @row + 1 AS n
  FROM (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row1
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row2
  CROSS JOIN (
    SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
  ) row3
  CROSS JOIN (
    SELECT @row := 0
  ) AS init_user_param
) nums
Result
word
-----------------------
S
St
Sta
Stac
Stack
StackE
StackEx
StackExc
StackExch
StackExch_
StackExch_b
StackExch_bl
StackExch_bla
StackExch_bla_
StackExch_bla_b
StackExch_bla_bl
StackExch_bla_bla
StackExch_bla_bla_
StackExch_bla_bla_b
StackExch_bla_bla_bl
StackExch_bla_bla_bla
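As a cross-check of the prefix list above, the same construction can be sketched in a few lines of Python. This is only an illustration of what the SUBSTRING/DISTINCT step computes; the variable names are hypothetical. Like SUBSTRING, a Python slice that runs past the end of the string simply returns the whole string, which is why the duplicates have to be removed.

```python
search = "StackExch_bla_bla_bla"

# SUBSTRING(search, 1, n) for n = 1..1000: slices longer than the
# string all yield the full string, so most entries are duplicates.
raw = [search[:n] for n in range(1, 1001)]

# DISTINCT: drop duplicates while keeping first-seen order.
prefixes = list(dict.fromkeys(raw))

for prefix in prefixes:
    print(prefix)
```

The result has one entry per character of the search string, from 'S' up to the full string.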
Now we can join this list with the word table and use REPLACE(word_column, word, '') and CHAR_LENGTH(REPLACE(word_column, word, '')) to see how much of each word is left after the matched prefix is removed.
Query
SELECT
  *
  , REPLACE(word_column, word, '') AS replaced
  , CHAR_LENGTH(REPLACE(word_column, word, '')) AS chars_afterreplace
FROM (
  SELECT DISTINCT
    SUBSTRING('StackExch_bla_bla_bla', 1, nums.n) AS word
  FROM (
    SELECT
      @row := @row + 1 AS n
    FROM (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row1
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row2
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row3
    CROSS JOIN (
      SELECT @row := 0
    ) AS init_user_param
  ) nums
) words
INNER JOIN word
  ON word.word_column LIKE CONCAT(words.word, '%')
Result
word        word_column    replaced       chars_afterreplace
----------  -------------  -------------  ------------------
S           StackExchange  tackExchange   12
S           StackOverflow  tackOverflow   12
St          StackExchange  ackExchange    11
St          StackOverflow  ackOverflow    11
Sta         StackExchange  ckExchange     10
Sta         StackOverflow  ckOverflow     10
Stac        StackExchange  kExchange      9
Stac        StackOverflow  kOverflow      9
Stack       StackExchange  Exchange       8
Stack       StackOverflow  Overflow       8
StackE      StackExchange  xchange        7
StackEx     StackExchange  change         6
StackExc    StackExchange  hange          5
StackExch   StackExchange  ange           4
StackExch_  StackExchange  StackExchange  13
Now we can clearly see that we want the word with the lowest chars_afterreplace, so we add ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC LIMIT 1. (The last row of the result appears because _ is a single-character wildcard in LIKE: 'StackExch_%' matches 'StackExchange', but REPLACE finds no literal 'StackExch_' to remove.)
Query
SELECT
  word.word_column
FROM (
  SELECT DISTINCT
    SUBSTRING('StackExch_bla_bla_bla', 1, nums.n) AS word
  FROM (
    SELECT
      @row := @row + 1 AS n
    FROM (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row1
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row2
    CROSS JOIN (
      SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9
    ) row3
    CROSS JOIN (
      SELECT @row := 0
    ) AS init_user_param
  ) nums
) words
INNER JOIN word
  ON word.word_column LIKE CONCAT(words.word, '%')
-- The smallest remainder belongs to the longest matching prefix.
ORDER BY CHAR_LENGTH(REPLACE(word_column, word, '')) ASC
LIMIT 1
Results
word_column
---------------
StackExchange
#4
0
The following solutions need a table containing sequence numbers from 1 to (at least) the length of your word_column. Assuming word_column is VARCHAR(190), you need a table with the numbers from 1 to 190. If you use MariaDB with the sequence plugin, you can use the table seq_1_to_190. If you don't have it, there are many ways to create it. One simple way is to use the information_schema.columns table:
create table if not exists seq_1_to_190 (seq tinyint unsigned auto_increment primary key)
select null as seq from information_schema.columns limit 190;
You can also create it on-the-fly in a subquery, but that would complicate your queries.
I will use the session variable @word to store the search string.
set @word = 'StackExch_bla_bla_bla';
But you can replace all its occurrences with the constant search string.
Now we can use the sequence table to create all prefix substrings with
select seq as l, left(@word, seq) as substr
from seq_1_to_190 s
where s.seq <= char_length(@word)
http://rextester.com/BWU18001
and use it for the LIKE condition when you join it with your words table:
select w.word_column
from (
select seq as l, left(@word, seq) as substr
from seq_1_to_190 s
where s.seq <= char_length(@word)
) s
join words w on w.word_column like concat(replace(s.substr, '_', '\_'), '%')
order by s.l desc
limit 1
http://rextester.com/STQP82942
Note that _ is a wildcard in LIKE patterns (it matches any single character), so you need to escape it in your search string as \_. You also need to do that for % if your string can contain it, but I will skip this part in my answer.
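To make the wildcard pitfall concrete, here is a small Python model of LIKE matching. It is a simplified sketch, not MySQL's implementation: like_to_regex and like are hypothetical helpers, backslash is assumed as the escape character (MySQL's default), and the comparison is case-sensitive, whereas MySQL's result depends on the column's collation.

```python
import re


def like_to_regex(pattern, escape="\\"):
    """Translate a SQL LIKE pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape and i + 1 < len(pattern):
            parts.append(re.escape(pattern[i + 1]))  # escaped char is literal
            i += 2
            continue
        if ch == "%":
            parts.append(".*")  # any run of characters, including none
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Unescaped: the underscore matches the 'a' in StackExchange.
print(like("StackExchange", "StackExch_%"))
# Escaped: only a literal underscore is accepted at that position.
print(like("StackExchange", "StackExch\\_%"))
print(like("StackExch_bla", "StackExch\\_%"))
```

Without the escape, a prefix that is one character too long can still match, so the query may report a longer common beginning than the words really share.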
The query can also be written without the subquery:
select w.word_column
from seq_1_to_190 s
join words w on w.word_column like concat(replace(left(@word, seq), '_', '\_'), '%')
where s.seq <= char_length(@word)
order by s.seq desc
limit 1
http://rextester.com/QVZI59071
These queries do the job, and in theory they should also be fast. But MySQL (in my case MariaDB 10.0.19) creates a bad execution plan and doesn't use the index for the ORDER BY clause. Both queries run in about 1.8 seconds on a 100K-row data set.
The best I could do to improve the performance with a single query is:
select (
select word_column
from words w
where w.word_column like concat(replace(left(@word, s.seq), '_', '\_'), '%')
limit 1
) as word_column
from seq_1_to_190 s
where s.seq <= char_length(@word)
having word_column is not null
order by s.seq desc
limit 1
http://rextester.com/APZHA8471
This one is faster, but still needs about 670 msec. Note that Gordon's CASE query runs in 125 msec, though it needs a full table/index scan and a filesort.
However, I managed to force the engine to use the index for the ORDER BY clause with an indexed temporary table:
drop temporary table if exists tmp;
create temporary table tmp(
id tinyint unsigned auto_increment primary key,
pattern varchar(190)
) engine=memory
select null as id, left(@word, seq) as pattern
from seq_1_to_190 s
where s.seq <= char_length(@word)
order by s.seq desc;
select w.word_column
from tmp force index for order by (primary)
join words w
on w.word_column >= tmp.pattern
and w.word_column < concat(tmp.pattern, char(127))
order by tmp.id asc
limit 1
http://rextester.com/OOE82089
This query is "instant" (less than 1 msec) on my 100K-row test table. If I remove FORCE INDEX or use a LIKE condition, it will be slow again.
Note that char(127) seems to work for ASCII strings. You might need to find another character according to your character set.
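The range condition (pattern <= word_column < pattern followed by char(127)) can be mimicked in Python with a sorted list and binary search, which is roughly what a B-tree range lookup does. This is a sketch under assumptions: Python compares strings by code point, which resembles a binary collation rather than MySQL's default collations, and first_with_prefix and longest_prefix_by_range are hypothetical names.

```python
from bisect import bisect_left


def first_with_prefix(sorted_words, prefix, sentinel="\x7f"):
    """Return the first word w with prefix <= w < prefix + sentinel, or None."""
    lo = bisect_left(sorted_words, prefix)
    hi = bisect_left(sorted_words, prefix + sentinel)
    return sorted_words[lo] if lo < hi else None


def longest_prefix_by_range(sorted_words, search):
    """Probe prefixes from longest to shortest, one range lookup each."""
    for length in range(len(search), 0, -1):
        hit = first_with_prefix(sorted_words, search[:length])
        if hit is not None:
            return hit
    return None


# The sorted list plays the role of the index on word_column.
words = sorted(["StackOferflow", "StackExchange", "MetaStackExchange"])

print(longest_prefix_by_range(words, "StackExch_bla_bla_bla"))
print(longest_prefix_by_range(words, "StackO_bla_bla_bla"))
```

Because the comparison is a plain range, the underscore needs no escaping here; the sentinel only has to sort after every character that can follow the prefix, which is why char(127) is limited to ASCII data.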
After all that, I must say that my first thought was to use a UNION ALL query, which was also suggested by Gordon Linoff. However, here is a SQL-only solution:
set @subquery = '(
select word_column
from words
where word_column like {pattern}
limit 1
)';
set session group_concat_max_len = 1000000;
set @sql = (
select group_concat(
replace(
@subquery,
'{pattern}',
replace(quote(concat(left(@word, seq), '%')), '_', '\_')
)
order by s.seq desc
separator ' union all '
)
from seq_1_to_190 s
where s.seq <= char_length(@word)
);
set @sql = concat(@sql, ' limit 1');
prepare stmt from @sql;
execute stmt;
http://rextester.com/OPTJ37873
It is also "instant".
If you like stored procedures/functions, here's a function:
create function get_with_similar_begin(search_str text) returns text
begin
declare l integer;
declare res text;
declare pattern text;
set l = char_length(search_str);
while l > 0 and res is null do
set pattern = left(search_str, l);
set pattern = replace(pattern, '_', '\_');
set pattern = replace(pattern, '%', '\%');
set pattern = concat(pattern, '%');
set res = (select word_column from words where word_column like pattern limit 1);
set l = l - 1;
end while;
return res;
end
Use it as
select get_with_similar_begin('StackExch_bla_bla_bla');
select get_with_similar_begin('StackO_bla_bla_bla');
http://rextester.com/CJTU4629
It is probably the fastest way. For long strings, though, a kind of divide-and-conquer algorithm might decrease the average number of lookups, but that might also be overkill.
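The divide-and-conquer idea can be sketched as a binary search over the prefix length. It works because matching is monotonic: if some word starts with a prefix of length k, the same word also starts with every shorter prefix. In this sketch has_match is a hypothetical stand-in for an indexed query along the lines of "select 1 from words where word_column like <escaped prefix>% limit 1"; here it only scans a Python list so that the example runs on its own.

```python
def longest_matching_length(search, has_match):
    """Binary-search the largest k for which has_match(search[:k]) is true.

    Returns 0 when not even the first character matches any word.
    """
    lo, hi = 0, len(search)       # invariant: a prefix of length lo matches
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if has_match(search[:mid]):
            lo = mid              # a match: the answer is at least mid
        else:
            hi = mid - 1          # no match: the answer is below mid
    return lo


words = ["StackOferflow", "StackExchange", "MetaStackExchange"]
calls = []                        # records every lookup that is issued


def has_match(prefix):
    calls.append(prefix)
    return any(w.startswith(prefix) for w in words)


search = "StackExch_bla_bla_bla"
k = longest_matching_length(search, has_match)
print(search[:k], len(calls))
```

For this 21-character search string the sketch needs 5 lookups, where a scan from the longest prefix downwards needs 13 before it reaches 'StackExch'. Once k is known, one more query with that prefix fetches the word itself.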
If you want to test your queries on a big table, I used the following code to create my test table (for MariaDB with the sequence plugin):
drop table if exists words;
create table words(
id mediumint auto_increment primary key,
word_column varchar(190),
index(word_column)
);
insert into words(word_column)
select concat('Stack', rand(1)) as word_column
from seq_1_to_100000;
insert into words(word_column)values('StackOferflow'),('StackExchange'),('MetaStackExchange');