Searching a text for predefined words

Time: 2022-09-13 09:36:06

Hi, I have a database table that looks like this:

word_id int(10)
word varchar(30)

I also have a text, and I want to find out which of the words in this text are defined in that table. What's the most elegant way of doing this?

Currently I query the database for all the words and then use PHP to search for each of them in the whole text. It takes a long time for PHP to download all the words from the database and then check every one of them against my text.

2 answers

#1


3  

You can try to extract the words in the text and put them in a SELECT query like this:

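// get_words_in_text(...) is a placeholder for your own helper that returns the words in the text
$words = array_unique(get_words_in_text(...));
// one "?" per word; bind $words when executing the prepared statement
$sql = "SELECT * FROM words WHERE word IN (".implode(", ", array_fill(0, count($words), "?")).")";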

Your SQL engine may be able to optimize this statement. In any case, far less data travels over the database connection than in your current approach, because only the matching rows are returned.
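
As a rough illustration of the IN approach, here is a minimal sketch. get_words_in_text is the placeholder name from the snippet above, but its body here is only one possible way to split a text into words. build_in_query, the sample sentence and the $pdo connection mentioned in the comments are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical implementation of the placeholder helper:
// split the text into unique, lowercased words.
function get_words_in_text(string $text): array
{
    // Split on anything that is not a letter, a digit, or an apostrophe.
    // strtolower() only lowercases ASCII letters reliably.
    $parts = preg_split("/[^\p{L}\p{N}']+/u", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($parts));
}

// Hypothetical helper: build a parameterized IN query with one "?" per word.
// Returns [sql, params]; sql is null when there are no words to look up.
function build_in_query(array $words): array
{
    if (count($words) === 0) {
        return [null, []];
    }
    $placeholders = implode(", ", array_fill(0, count($words), "?"));
    $sql = "SELECT word_id, word FROM words WHERE word IN ($placeholders)";
    return [$sql, array_values($words)];
}

[$sql, $params] = build_in_query(get_words_in_text("The cat saw the other cat."));
// $params is ['the', 'cat', 'saw', 'other']
// $sql is "SELECT word_id, word FROM words WHERE word IN (?, ?, ?, ?)"

// With an open PDO connection in $pdo (assumed), the lookup would be:
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $defined = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Binding the words as parameters instead of concatenating them into the string avoids quoting problems when a word contains an apostrophe.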

You can also try to temporarily create a separate word table and add all words in the text to that table. Then you can perform a JOIN with the main word table. If both tables are indexed properly, this might be quite fast.
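
The temporary-table variant could look roughly like the sketch below. It assumes an open PDO connection and the words table (word_id, word) from the question. The function name find_defined_words and the temporary table name text_words are made up for this example, and whether this beats the IN query has to be measured on your own data.

```php
<?php
// Sketch only: $pdo is assumed to be an open PDO connection to the database
// that holds the `words` table. `text_words` is a hypothetical table name.
function find_defined_words(PDO $pdo, array $words): array
{
    $words = array_values(array_unique($words));
    if (count($words) === 0) {
        return [];
    }

    // Temporary table that lives only for this connection.
    $pdo->exec("CREATE TEMPORARY TABLE text_words (word VARCHAR(30) NOT NULL)");

    // Insert all words from the text with a single multi-row statement.
    $rows = implode(", ", array_fill(0, count($words), "(?)"));
    $pdo->prepare("INSERT INTO text_words (word) VALUES $rows")->execute($words);

    // Only words present in both tables survive the JOIN.
    // An index on words.word is what makes this lookup cheap.
    $stmt = $pdo->query(
        "SELECT DISTINCT w.word_id, w.word
         FROM words w
         JOIN text_words t ON t.word = w.word"
    );
    $found = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $stmt->closeCursor();

    $pdo->exec("DROP TABLE text_words");
    return $found;
}

// Usage, assuming $pdo and a word list taken from your text:
// $defined = find_defined_words($pdo, ['cat', 'bird', 'cat']);
```

For very long texts the single INSERT may need to be split into batches, since databases limit the number of bound parameters per statement.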

EDIT: The question "mysql select .. where .. in -> optimizing" and its answers suggest that creating a temporary table is indeed faster (see the comments there). However, it depends on the database you're using, the size of your word table, the size of the texts and the configuration of your index(es). Thus, I recommend evaluating both approaches for your specific scenario. Please report your results. :-)

#2


0  

An idea:

// get words in file into array (split on any whitespace, drop empty strings)
$file = file_get_contents('file.txt');
$file_words = preg_split('/\s+/', $file, -1, PREG_SPLIT_NO_EMPTY);

// remove duplicate words
$file_words = array_unique($file_words);

// create empty array in which to store hits
$words_with_definition = array();

// intentionally leaving out db connection, this is just a concept;
// $pdo is assumed to be an open PDO connection (the mysql_* functions were removed in PHP 7)
$stmt = $pdo->prepare("SELECT word FROM your_table WHERE word = ?");

// check to see if each word exists in database
// (foreach, because array_unique() leaves gaps in the numeric keys)
foreach ($file_words as $word)
{
    // word should be at least three characters, change as needed
    if (strlen($word) >= 3)
    {
        $stmt->execute(array($word));

        if ($stmt->fetchColumn() !== false)
        {
            // this is a hit, add it to $words_with_definition
            $words_with_definition[] = $word;
        }
    }
}

Whatever ends up in the $words_with_definition array is the set of words that were found in the database.
