How do I query for records ordered by similarity?
E.g. searching for "Stock Overflow" would return:
- Stack Overflow
- SharePoint Overflow
- Math Overflow
- Politic Overflow
- VFX Overflow
E.g. searching for "LO" would return:
- pabLO picasso
- michelangeLO
- jackson polLOck
What I need help with:
- Using a search engine to index & search a MySQL table, for better results
- Using full-text indexing, to find similar/containing strings
What does not work well:
- Levenshtein distance is very erratic (UDF, Query). Searching for "dog" gives me:
  - dog
  - bog
  - ago
  - big
  - echo
- LIKE returns better results, but returns nothing for long queries although similar strings do exist:
  - dog
  - dogid
  - dogaral
  - dogma
3 Answers
#1
77
I have found that the Levenshtein distance may be good when you are matching a full string against another full string, but when you are looking for keywords within a string, it sometimes does not return the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than English, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:
You may use Lucene as the search base for your projects. It's implemented in most major programming languages and it's quite fast and versatile. This method is probably the best, as it searches not only for substrings, but also for letter transpositions, prefixes and suffixes (all combined). However, you need to keep a separate index (using cron to update it from an independent script once in a while works, though).
Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM (InnoDB only gained FULLTEXT indexes in MySQL 5.6), you can create a temporary table, then perform your fulltext search:
CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(2000) CHARACTER SET latin1 NOT NULL,
`description` text CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
Use a data generator to generate some random data if you don't want to bother creating it yourself...
** NOTE ** : the column collation should be latin1_bin to perform a case-sensitive search, instead of the case-insensitive default for latin1. For Unicode strings, I would recommend utf8_bin for case-sensitive and utf8_general_ci for case-insensitive searches.
DROP TABLE IF EXISTS `tests`.`data_table_temp`;
CREATE TEMPORARY TABLE `tests`.`data_table_temp`
SELECT * FROM `tests`.`data_table`;
ALTER TABLE `tests`.`data_table_temp` ENGINE = MYISAM;
ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` (
`title` ,
`description`
);
SELECT *,
MATCH (`title`,`description`)
AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score`
FROM `tests`.`data_table_temp`
WHERE MATCH (`title`,`description`)
AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
ORDER BY `score` DESC;
DROP TABLE `tests`.`data_table_temp`;
Read more about it from the MySQL API reference page
The downside to this is that it will not look for letter transposition or "similar, sounds like" words.
** UPDATE **
Using Lucene for your search, you simply need to create a cron job (most web hosts offer this "feature") that executes a PHP script (e.g. "cd /path/to/script; php searchindexer.php") to update the indexes. The reason is that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, and doing it ahead of time ensures that all searches are performed as fast as possible. Therefore, you may want to create a delayed job to be run by the server. It may run overnight or within the next hour; this is up to you. The PHP script should look something like this:
$indexer = Zend_Search_Lucene::create('/path/to/lucene/data');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
// change this option for your need
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);
$rowSet = getDataRowSet(); // perform your SQL query to fetch whatever you need to index
foreach ($rowSet as $row) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
;
$indexer->addDocument($doc);
}
// ... you can get as many $rowSet as you want and create as many documents
// as you wish... each document doesn't necessarily need the same fields...
// Lucene is pretty flexible on this
$indexer->optimize(); // do this every time you add more data to your indexer...
$indexer->commit(); // finalize the process
Then, this is basically how you search (basic search):
$index = Zend_Search_Lucene::open('/path/to/lucene/data');
// same search options
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
$query = 'php +field1:foo'; // search for the word 'php' in any field,
// +search for 'foo' in field 'field1'
$hits = $index->find($query);
$numHits = count($hits);
foreach ($hits as $hit) {
$score = $hit->score; // the hit weight
$field1 = $hit->field1;
// etc.
}
Here are great sites about Lucene in Java, PHP, and .Net.
In conclusion, each search method has its own pros and cons:
- You mentioned Sphinx search and it looks very good, as long as you can make the daemon run on your web host.
- Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database, and therefore won't show up right away in user searches.
- MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.
Please feel free to comment if I have forgotten/missed anything.
#2
18
1. Similarity
For Levenshtein in MySQL I found this, from www.codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function
SELECT
column_name,
LEVENSHTEIN(column_name, 'search_string') AS distance
FROM table
WHERE
LEVENSHTEIN(column_name, 'search_string') < distance_limit
ORDER BY distance ASC
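To make clear what such a LEVENSHTEIN function computes, here is a minimal Python sketch of the distance and of the "filter by limit, then sort" step. It is not the stored function from the linked page; the names levenshtein, closest and distance_limit are my own, and the titles are just the ones from the question. Note that the most similar rows have the smallest distance, so the sketch sorts ascending.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and
    substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))
        previous = current
    return previous[-1]


def closest(search: str, candidates: list[str], distance_limit: int) -> list[tuple[str, int]]:
    """Keep candidates below the limit, most similar (smallest distance) first."""
    scored = [(c, levenshtein(c.lower(), search.lower())) for c in candidates]
    kept = [pair for pair in scored if pair[1] < distance_limit]
    return sorted(kept, key=lambda pair: pair[1])


titles = ["Math Overflow", "Stack Overflow", "SharePoint Overflow", "VFX Overflow"]
print(closest("Stock Overflow", titles, distance_limit=6))
```

For very short search strings almost every short word is only a few edits away ("dog" to "bog" is 1, "dog" to "big" is 2), which is probably why the results in the question look erratic.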
2. Containing, case insensitive
Use the LIKE operator of MySQL, which is case insensitive by default. The % is a wildcard, so there may be any string before and after search_string.
SELECT
*
FROM
table
WHERE
column_name LIKE "%search_string%"
3. Containing, case sensitive
The MySQL Manual helps:
The default character set and collation are latin1 and latin1_swedish_ci, so nonbinary string comparisons are case insensitive by default. This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a. To make this search case sensitive, make sure that one of the operands has a case sensitive or binary collation. For example, if you are comparing a column and a string that both have the latin1 character set, you can use the COLLATE operator to cause either operand to have the latin1_general_cs or latin1_bin collation...
My MySQL setup does not support latin1_general_cs or latin1_bin, but it worked fine for me to use the collation utf8_bin, as binary utf8 is case sensitive:
SELECT
*
FROM
table
WHERE
column_name LIKE "%search_string%" COLLATE utf8_bin
2. / 3. sorted by Levenshtein Distance
SELECT
column_name,
LEVENSHTEIN(column_name, 'search_string') AS distance -- for sorting
FROM table
WHERE
column_name LIKE "%search_string%"
COLLATE utf8_bin -- for case sensitivity, just leave out for CI
ORDER BY
distance
ASC
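The combination of a containment filter and a distance sort can be tried outside the database as well. The following is only a sketch of the idea in Python, not of MySQL's behaviour: search, normalise and the case_sensitive flag are names I made up, and the artist list comes from the question plus one extra row that should not match.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def search(rows: list[str], needle: str, case_sensitive: bool = False) -> list[str]:
    """Keep rows containing needle (like LIKE '%needle%'), closest first."""
    def normalise(text: str) -> str:
        # Roughly what a case-insensitive collation does for plain ASCII text.
        return text if case_sensitive else text.lower()

    matches = [row for row in rows if normalise(needle) in normalise(row)]
    return sorted(matches, key=lambda row: levenshtein(normalise(row), normalise(needle)))


artists = ["pablo picasso", "michelangelo", "jackson pollock", "claude monet"]
print(search(artists, "LO"))                       # case-insensitive containment
print(search(artists, "LO", case_sensitive=True))  # nothing contains upper-case "LO"
```

One thing this makes visible: once a row is known to contain the search string, its distance is simply the number of extra characters, so after the LIKE filter the sort effectively puts the shortest rows first.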
#3
3
It seems that your definition of similarity is semantic similarity, so to build such a similarity function you should use semantic similarity measures. Note that the scope of work on this issue can vary from a few hours to years, so decide on the scope before you start. I couldn't figure out which data you have for building the similarity relation; I assume you have access to a dataset of documents and a dataset of queries. You can start with co-occurrence of words (e.g., conditional probability). You will quickly discover that stop words come out as related to most words, simply because they are so frequent. Using the lift of the conditional probability takes care of the stop words, but makes the relation prone to error when the counts are small (which is most of your cases). You might try Jaccard, but since it is symmetric there will be many relations it won't find. You might then consider only relations that appear within a short distance of the base word. You can (and should) consider relations based on general corpora (e.g., Wikipedia) as well as user-specific ones (e.g., the user's emails).
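To illustrate the three co-occurrence measures mentioned above, here is a small Python sketch on a made-up corpus. The documents and the helper names (doc_count, conditional_probability, lift, jaccard) are assumptions for the example only, and it assumes every word looked up occurs at least once.

```python
# Toy corpus: each document is reduced to its set of words (hypothetical data).
documents = [
    {"the", "stack", "overflow"},
    {"the", "stack", "overflow", "question"},
    {"the", "math", "overflow"},
    {"the", "painter", "picasso"},
    {"the", "painter", "monet"},
]


def doc_count(*words: str) -> int:
    """Number of documents that contain all of the given words."""
    return sum(1 for doc in documents if all(word in doc for word in words))


def conditional_probability(given: str, other: str) -> float:
    """P(other | given): share of documents with `given` that also have `other`."""
    return doc_count(given, other) / doc_count(given)


def lift(a: str, b: str) -> float:
    """P(a, b) / (P(a) * P(b)); a value of 1.0 means a and b look independent."""
    total = len(documents)
    joint = doc_count(a, b) / total
    return joint / ((doc_count(a) / total) * (doc_count(b) / total))


def jaccard(a: str, b: str) -> float:
    """Documents with both words divided by documents with either word."""
    both = doc_count(a, b)
    return both / (doc_count(a) + doc_count(b) - both)


for other in ("the", "picasso"):
    print(other,
          conditional_probability("painter", other),
          lift("painter", other),
          jaccard("painter", other))
```

On this corpus the stop word "the" gets the highest conditional probability for "painter", while lift pushes it back to 1.0 and favours "picasso", although that score rests on a single co-occurrence. Jaccard gives the same value in both directions, so it cannot express that "picasso" implies "painter" more strongly than the reverse.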
Very soon you will have plenty of similarity measures, each of them good and each with some advantage over the others.
In order to combine such measures, I like to reduce the problem to a classification problem.
You should build a dataset of pairs of words and label each pair as "is related" (or not). In order to build a large labeled dataset you can:
- Use sources of known related words (e.g., good old Wikipedia categories) for positives
- Treat pairs of words not known to be related as negatives, since most of them really are not related.
Then use all the measures you have as features of the pairs. Now you are in the domain of supervised classification. Build a classifier on the dataset, evaluate it according to your needs, and you get a similarity measure that fits your needs.
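As a rough sketch of that last step, the Python below trains a tiny logistic-regression classifier on labeled pairs. Everything in it is an assumption for illustration: the two feature columns stand for any two of your similarity measures, the numbers are invented, and in practice you would more likely use an existing machine-learning library than this hand-rolled training loop.

```python
import math

# Each row: ((measure_1, measure_2), label). The feature values are invented;
# label 1 means "is related", label 0 means "not known to be related".
training = [
    ((0.9, 0.8), 1),
    ((0.8, 0.6), 1),
    ((0.7, 0.9), 1),
    ((0.2, 0.1), 0),
    ((0.1, 0.3), 0),
    ((0.3, 0.2), 0),
]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def relatedness(weights: list[float], bias: float, features: tuple[float, ...]) -> float:
    """Combined similarity in (0, 1) computed from the individual measures."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(rows, epochs: int = 2000, learning_rate: float = 0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            error = label - relatedness(weights, bias, features)
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


weights, bias = train(training)
print(relatedness(weights, bias, (0.85, 0.7)))  # pair with high measures
print(relatedness(weights, bias, (0.15, 0.2)))  # pair with low measures
```

The output of relatedness is then the combined similarity score; how well it works depends entirely on the quality of the labeled pairs and of the measures fed in as features.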