How do I find the most popular words in MySQL?

Time: 2022-01-02 19:14:47

I have a table called results with 5 columns.

I'd like to use the title column to find rows matching, say, WHERE title LIKE '%for sale%', and then list the most popular words in that column. One would be "for" and another would be "sale", but I want to see which other words correlate with these.

Sample data:

title
cheap cars for sale
house for sale
cats and dogs for sale
iphones and androids for sale
cheap phones for sale
house furniture for sale

Results (single words):

for        6
sale       6
cheap      2
and        2
house      2
furniture  1
cars       1
etc...

6 answers

#1


7  

You can extract words with some string manipulation. Assuming you have a numbers table and that words are separated by single spaces:

select substring_index(substring_index(r.title, ' ', n.n), ' ', -1) as word,
       count(*)
from results r join
     numbers n
     on n.n <= length(title) - length(replace(title, ' ', '')) + 1
group by word;
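
To make the nested SUBSTRING_INDEX calls easier to follow, here is a minimal Python sketch of the same idea. The helper substring_index is only a rough stand-in for the MySQL function written for this illustration, and nth_word and the loop over n are hypothetical names that play the role of the join against the numbers table. It assumes, as the answer does, that words are separated by single spaces.

```python
from collections import Counter

def substring_index(s, delim, count):
    """Rough stand-in for MySQL's SUBSTRING_INDEX (assumes a non-empty delimiter)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of the count-th delimiter from the end
    return ""

def nth_word(title, n):
    # The inner call keeps the first n words; the outer call keeps the last of those.
    return substring_index(substring_index(title, " ", n), " ", -1)

titles = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
]

counts = Counter()
for title in titles:
    # Number of spaces + 1 = number of words, as in the ON clause of the query.
    word_count = len(title) - len(title.replace(" ", "")) + 1
    for n in range(1, word_count + 1):     # plays the role of the numbers table
        counts[nth_word(title, n)] += 1

for word, cnt in counts.most_common():
    print(word, cnt)
```

Each title is paired with the values 1 to its word count, so every (title, n) pair yields exactly one word, which is what makes the GROUP BY count come out right.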

If you don't have a numbers table, you can construct one manually using a subquery:

from results r join
     (select 1 as n union all select 2 union all select 3 union all . . .
     ) n
     . . .

The SQL Fiddle (courtesy of @GrzegorzAdamKowalski) is here.

#2


3  

You can use ExtractValue in an interesting way. See the SQL fiddle here: http://sqlfiddle.com/#!9/0b0a0/45

We need only one table:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

Now we construct a series of selects which extract whole words from the text converted to XML. Each select extracts the N-th word from the text, so the five selects below cover titles of up to five words.

select words.word, count(*) as `count` from
(select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[1]') as word from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[2]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[3]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[4]') from `text`
union all
select ExtractValue(CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'), '//w[5]') from `text`) as words
where length(words.word) > 0
group by words.word
order by `count` desc, words.word asc
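
The XML trick can be sketched outside the database as well. The following Python snippet is only an illustration under a few assumptions: the function name words_via_xml and the wrapper element r are made up for this example, and the escaping step is an addition that the query above does not have. Without escaping, a title containing & or < would presumably produce malformed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def words_via_xml(title):
    # Mirrors CONCAT('<w>', REPLACE(title, ' ', '</w><w>'), '</w>'),
    # with escaping of &, < and > added (an assumption, not in the original query).
    fragment = "<w>" + escape(title).replace(" ", "</w><w>") + "</w>"
    # A single root element is needed by the Python parser; 'r' is an arbitrary name.
    root = ET.fromstring("<r>" + fragment + "</r>")
    # Empty <w></w> elements (from repeated spaces) have no text and are skipped,
    # like the length(words.word) > 0 filter in the query.
    return [w.text for w in root.findall("w") if w.text]

print(words_via_xml("cats and dogs for sale"))
```

Unlike the query, which needs one select per word position, this sketch returns every word regardless of how long the title is.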

#3


2  

This would give you single words (if I understand correctly what you mean by single words):

select concat(val,' ',cnt) as result from(
    select (substring_index(substring_index(t.title, ' ', n.n), ' ', -1)) val,count(*) as cnt
        from results t cross join(
         select a.n + b.n * 10 + 1 n
         from 
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) a,
                (select 0 as n union all select 1 union all select 2 union all select 3 
                        union all select 4 union all select 5 union all select 6 
                        union all select 7 union all select 8 union all select 9) b
                order by n 
        ) n
    where n.n <= 1 + (length(t.title) - length(replace(t.title, ' ', '')))
    group by val
    order by cnt desc
) as x

The result should look like this:

Result
--------
for 6
sale 6
house 2
and 2
cheap 2
phones 1
iphones 1
dogs 1
furniture 1
cars 1
androids 1
cats 1

But if you need the result as a single line like this:

result
-----------
for 6 sale 6 house 2 and 2 cheap 2 phones 1 iphones 1 dogs 1 furniture 1 cars 1 androids 1 cats 1

Just modify the query above to:

select group_concat(concat(val,' ',cnt) separator ' ') as result from( ...
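
The derived table n in the query above builds the numbers 1 to 100 from two copies of the digits 0 to 9, which means titles of up to 100 words are handled. A small Python sketch of that arithmetic, and of the single-line output, may help; the names digits, numbers and format_single_line are hypothetical and exist only for this illustration.

```python
from itertools import product

digits = range(10)  # each of the derived tables a and b holds 0..9

# Same expression as "a.n + b.n * 10 + 1" over the cross join of a and b.
numbers = sorted(a + b * 10 + 1 for a, b in product(digits, digits))
print(numbers[0], numbers[-1], len(numbers))

def format_single_line(pairs):
    # pairs: (word, count) tuples, already sorted by count descending.
    # Mirrors group_concat(concat(val, ' ', cnt) separator ' ').
    return " ".join(f"{word} {cnt}" for word, cnt in pairs)

print(format_single_line([("for", 6), ("sale", 6), ("house", 2)]))
```

One thing to keep in mind with the GROUP_CONCAT variant is that MySQL truncates its result at group_concat_max_len, which defaults to 1024, so a long word list may need that setting raised.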

#4


0  

Update

Idea taken from https://*.com/a/17942691/98491

This query works on my machine (MySQL 5.7); however, Sqlfiddle reports an error. The basic idea is that you should either create a table with numbers from 1 to the maximum number of words in your field (4 here) or, as I did, use a UNION of 1 .. 4 for simplicity.

CREATE TABLE products (
  `id` int,
  `name` varchar(45)
);

INSERT INTO products
    (`id`, `name`)
VALUES
    (1, 'for sale'),
    (2, 'for me'),
    (3, 'for you'),
    (4, 'you and me')
;

SELECT name, COUNT(*) as count FROM
(
SELECT
  product.id,
  SUBSTRING_INDEX(SUBSTRING_INDEX(product.name, ' ', numbers.n), ' ', -1) name
FROM
  (
    SELECT 1 AS n
    UNION SELECT 2
    UNION SELECT 3
    UNION SELECT 4
  ) AS numbers
  INNER JOIN products product
  ON CHAR_LENGTH(product.name)
     -CHAR_LENGTH(REPLACE(product.name, ' ', ''))>=numbers.n-1
ORDER BY
  id, n
)
AS result
GROUP BY name
ORDER BY count DESC

The result will be:

for | 3
you | 2
me  | 2
and | 1
sale| 1
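
The size of the numbers list matters: any word beyond position 4 never gets a matching n and is silently left out of the counts. The Python sketch below imitates the join condition to show this; count_words and its max_n parameter are hypothetical names introduced for the illustration, and single-space separation is assumed.

```python
from collections import Counter

def count_words(names, max_n):
    counts = Counter()
    for name in names:
        spaces = len(name) - len(name.replace(" ", ""))
        for n in range(1, max_n + 1):       # SELECT 1 UNION SELECT 2 ... up to max_n
            if spaces >= n - 1:             # the ON condition of the join
                counts[name.split(" ")[n - 1]] += 1
    return counts

products = ["for sale", "for me", "for you", "you and me"]
print(count_words(products, 4))

# A five-word title loses its last word when only 1..4 are available.
print(count_words(["cats and dogs for sale"], 4))
print(count_words(["cats and dogs for sale"], 5))
```

With the sample data from the question, whose longest titles have five words, the UNION would therefore need to go up to at least 5.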

#5


0  

SQL is not well suited for this task. While it is possible, there are limitations (the number of words, for example).

A quick PHP script to do the same task may be easier to maintain long term (and is likely quicker too):

<?php
$rows = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
];

// $rows here should be replaced by the rows returned from the SQL query
$wordTotals = [];
foreach ($rows as $row) {
    $words = explode(" ", $row);
    foreach ($words as $word) {
        if (isset($wordTotals[$word])) {
            $wordTotals[$word]++;
            continue;
        }

        $wordTotals[$word] = 1;
    }
}

arsort($wordTotals);

foreach($wordTotals as $word => $count) {
    echo $word . " " . $count . PHP_EOL;
}

Output

for 6
sale 6
and 2
cheap 2
house 2
phones 1
androids 1
furniture 1
cats 1
cars 1
dogs 1
iphones 1
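
For comparison, the same application-side tally can be sketched in Python. This is not part of the answer above, just an illustration of the approach; word_totals and its optional phrase argument are hypothetical, and the phrase check is a case-sensitive approximation of WHERE title LIKE '%for sale%' (MySQL's LIKE is typically case-insensitive with the default collations).

```python
from collections import Counter

def word_totals(rows, phrase=None):
    totals = Counter()
    for row in rows:
        if phrase is not None and phrase not in row:
            continue                    # rough stand-in for the LIKE filter
        # split() without arguments also collapses runs of whitespace
        totals.update(row.split())
    return totals

rows = [
    "cheap cars for sale",
    "house for sale",
    "cats and dogs for sale",
    "iphones and androids for sale",
    "cheap phones for sale",
    "house furniture for sale",
]

for word, count in word_totals(rows, phrase="for sale").most_common():
    print(word, count)
```

Doing the filtering in SQL and only the splitting in application code would keep the amount of transferred data down; the phrase argument here is only for a self-contained demonstration.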

#6


0  

Here is a working SQL Fiddle: http://sqlfiddle.com/#!9/0b0a0/32

Let's start with two tables - one for texts and one for numbers:

CREATE TABLE text (`title` varchar(29));

INSERT INTO text
    (`title`)
VALUES
    ('cheap cars for sale'),
    ('house for sale'),
    ('cats and dogs for sale'),
    ('iphones and androids for sale'),
    ('cheap phones for sale'),
    ('house furniture for sale')
;

CREATE TABLE iterator (`index` int);

INSERT INTO iterator
    (`index`)
VALUES
    (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),
    (16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30)
;

The second table, iterator, must contain the numbers from 1 to N, where N is at least one more than the length of the longest string in text (the query has to reach the space appended after the end of the title, which is why 30 is used for a varchar(29) column).

Then, run this query:

select
  words.word, count(*) as `count`
from 
(select
  substring(concat(' ', t.title, ' '), i.index+1, j.index-i.index) as word
from
  text as t, iterator as i, iterator as j
where
    substring(concat(' ', t.title), i.index, 1) = ' '
and substring(concat(t.title, ' '), j.index, 1) = ' '
and i.index < j.index
) AS words
where
    length(words.word) > 0
and words.word not like '% %'
group by words.word
order by `count` desc, words.word asc

There are two selects. The outer one simply groups and counts single words (strings of length greater than 0 that contain no spaces). The inner one extracts every string that starts after any space character and ends at any later space character, so these strings are not necessarily words (despite the subquery being named words): they can contain spaces other than the starting and ending ones.

Results:

word       count
for        6
sale       6
and        2
cheap      2
house      2
androids   1
cars       1
cats       1
dogs       1
furniture  1
iphones    1
phones     1
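
The pairing of a starting space i with an ending space j in this answer can be traced with a short Python sketch. The function name words_by_space_pairs and the parameter n_max (the largest value in iterator) are hypothetical; positions are 1-based as in SQL, and the slices are written so that out-of-range positions yield an empty string.

```python
def words_by_space_pairs(title, n_max):
    padded = " " + title + " "
    found = []
    for i in range(1, n_max + 1):                 # iterator as i
        for j in range(i + 1, n_max + 1):         # iterator as j, with i.index < j.index
            starts_after_space = (" " + title)[i - 1:i] == " "
            ends_at_space = (title + " ")[j - 1:j] == " "
            if starts_after_space and ends_at_space:
                # substring(padded, i+1, j-i) in 1-based SQL terms
                candidate = padded[i:j]
                # outer query filters: non-empty and no inner spaces
                if candidate and " " not in candidate:
                    found.append(candidate)
    return found

title = "iphones and androids for sale"           # 29 characters
print(words_by_space_pairs(title, 30))
print(words_by_space_pairs(title, 29))            # the last word is missed
```

Since every title is combined with every pair of iterator values, the inner select examines on the order of N squared candidates per row, which is worth keeping in mind for long titles or large tables.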
