Elasticsearch has two similar features to get "similar" documents:
There is the "More Like This API". It gives me documents similar to a given one. I can't use it in more complex expressions though.
There is also the "more_like_this" query for use in the Search API. I can use it in bool or boosting expressions, but I can't give it the id of a document. I have to provide the "like_text" parameter.
I have documents with tags and content. Some documents will have good tags and some won't have any. I want a "Similar documents" feature that will work every time but will rank documents with matching tags higher than documents with matching text. My idea was:
{
  "boosting" : {
    "positive" : {
      "more_like_this" : {
        "fields" : ["tag"],
        "id" : "23452",
        "min_term_freq" : 1
      }
    },
    "negative" : {
      "more_like_this" : {
        "fields" : ["tag"],
        "id" : "23452"
      }
    },
    "negative_boost" : 0.2
  }
}
Obviously this doesn't work because there is no "id" in "more_like_this". What are the alternatives?
2 Answers
#1
41
First, a short introduction to the more like this functionality and how it works. The idea is that you have a specific document and want to find others that are similar to it.
To achieve this, we need to extract some content from the current document and use it to build a query that finds similar ones. We can take the content from the Lucene stored fields (or the Elasticsearch _source field, which is effectively a stored field in Lucene) and reanalyze it, or, if term vectors were enabled at index time, read the list of terms from them without having to reanalyze the text. I'm not sure whether Elasticsearch tries the latter approach when term vectors are available, though.
The more like this query lets you provide a text, regardless of where you got it from. That text is used to query the fields you select and return similar documents. The text is not used in full: it is reanalyzed, and at most max_query_terms (default 25) terms are kept, chosen from the terms that occur at least min_term_freq times (minimum term frequency, default 2) and whose document frequency lies between min_doc_freq and max_doc_freq. Other parameters can influence the generated query as well.
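The term selection described above can be illustrated with a rough sketch. This is not Lucene's or Elasticsearch's actual implementation: the function name, the regex tokenizer standing in for real analysis, the tf-idf style scoring, and the `doc_freq` and `num_docs` inputs are all assumptions made for illustration. Only the parameter names come from the text.

```python
import math
import re
from collections import Counter


def select_query_terms(text, doc_freq, num_docs,
                       max_query_terms=25, min_term_freq=2,
                       min_doc_freq=5, max_doc_freq=None):
    """Return the terms a more-like-this style query might keep (simplified)."""
    # Stand-in for real analysis: lowercase and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    term_freq = Counter(tokens)

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)  # hypothetical per-term document frequency
        if tf < min_term_freq:
            continue  # term is too rare in the input text
        if df < min_doc_freq:
            continue  # term is too rare in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # term is too common in the index
        # Assumed tf-idf style weight; smoothed to avoid division by zero.
        idf = math.log((num_docs + 1) / (df + 1)) + 1.0
        scored.append((tf * idf, term))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]
```

With a low min_term_freq, short fields such as tags (where each term usually appears once) still contribute terms, which is why the question sets min_term_freq to 1.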
The more like this API goes one step further: it lets you provide the id of a document and, again, a list of fields. The content of those fields is extracted from that document and used to run a more like this query on the same fields. In other words, the generated more like this query has its like_text property set to the text previously extracted. As you can see, the more like this API executes a more like this query under the hood.
The more like this query gives you more flexibility, since you can combine it with other queries and take the text from whatever source you like. The more like this API, on the other hand, exposes the common functionality and does more of the work for you, but with some restrictions.
In your case I would combine a couple of different more like this queries, so that you can make use of the powerful Elasticsearch query DSL, boost the queries differently and so on. The downside is that you have to provide the text yourself, since you can't provide the id of the document to extract it from.
There are different ways to achieve what you want. I would use a bool query to combine the two more like this queries as should clauses and give them different weights. I would also use the more like this field query instead, since you want to query a single field at a time.
{
  "bool" : {
    "must" : {
      "match_all" : { }
    },
    "should" : [
      {
        "more_like_this_field" : {
          "tags" : {
            "like_text" : "here go the tags extracted from the current document!",
            "boost" : 2.0
          }
        }
      },
      {
        "more_like_this_field" : {
          "content" : {
            "like_text" : "here goes the content extracted from the current document!"
          }
        }
      }
    ],
    "minimum_number_should_match" : 1
  }
}
This way at least one of the should clauses must match, and a match on tags is more important than a match on content.
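Since the text has to be supplied by the caller, here is a minimal sketch of how the query body might be assembled from a document that has already been fetched. The function name and the shape of the `source` dict (a `tags` list and a `content` string) are assumptions for illustration. Setting min_term_freq to 1 on the tags clause follows the question rather than this answer, on the assumption that each tag appears only once.

```python
import json


def build_similar_query(source, tag_boost=2.0):
    """Build the bool query body from a fetched document's fields."""
    tags = source.get("tags") or []
    if isinstance(tags, str):
        tags = [tags]  # tolerate a single tag stored as a plain string
    content = source.get("content") or ""

    should = []
    if tags:
        should.append({
            "more_like_this_field": {
                "tags": {
                    "like_text": " ".join(tags),
                    "min_term_freq": 1,  # assumption: each tag occurs once
                    "boost": tag_boost,
                }
            }
        })
    if content:
        should.append({
            "more_like_this_field": {"content": {"like_text": content}}
        })
    if not should:
        raise ValueError("document has neither tags nor content to match on")

    return {
        "bool": {
            "must": {"match_all": {}},
            "should": should,
            "minimum_number_should_match": 1,
        }
    }


# Hypothetical document; in practice this would be the _source of the
# document the user is currently viewing.
doc = {"tags": ["search", "lucene"], "content": "notes on similarity queries"}
print(json.dumps(build_similar_query(doc), indent=2))
```

A document without tags simply produces a query with the content clause alone, so the feature still returns results for untagged documents.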
#2
9
This is possible now with the new like syntax:
{
  "more_like_this" : {
    "fields" : ["title", "description"],
    "like" : [
      {
        "_index" : "imdb",
        "_type" : "movies",
        "_id" : "1"
      },
      {
        "_index" : "imdb",
        "_type" : "movies",
        "_id" : "2"
      }
    ],
    "min_term_freq" : 1,
    "max_query_terms" : 12
  }
}
See here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html
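Applied to the original question, the like syntax might be combined with a bool query so that tag matches rank above content matches. The following is only a sketch: the helper names, the index name, and the field names `tags` and `content` are assumptions, and the exact parameters accepted depend on the Elasticsearch version. The `_type` entry shown in the answer is left out here and can be added back on versions that still use mapping types.

```python
import json


def mlt_clause(index, doc_id, field, boost=1.0, min_term_freq=1):
    """One more_like_this query that references a document by id."""
    return {
        "more_like_this": {
            "fields": [field],
            "like": [{"_index": index, "_id": doc_id}],
            "min_term_freq": min_term_freq,
            "boost": boost,
        }
    }


def similar_by_id(index, doc_id, tag_boost=2.0):
    """Rank tag similarity above content similarity for a given document."""
    return {
        "bool": {
            "should": [
                mlt_clause(index, doc_id, "tags", boost=tag_boost),
                mlt_clause(index, doc_id, "content"),
            ],
            "minimum_should_match": 1,
        }
    }


# Hypothetical index name; the id is the one used in the question.
print(json.dumps({"query": similar_by_id("documents", "23452")}, indent=2))
```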