How do I keep HTML content out of my Elasticsearch index?

Time: 2021-10-30 07:31:54

I'm using Elasticsearch, and writing my own wrapper using WebRequest since NEST (the usual choice) bafflingly seems to lack the ability to insert an item and have the generated ID returned.

Anyway - no problems with the general method. But, any HTML content is indexed as-is, i.e. if I have <strong>test</strong> in a field, then a search for the query "strong" returns the item.

I've put this in elasticsearch.yml, based on a random message board post I found:

index:
    analysis:
        analyzer:
            htmlContentAnalyzer:
                type: custom
                tokenizer: standard
                filter: standard
                char_filter: html_strip

Then I create a mapping for my index 'content', item type 'news', like so:

PUT http://localhost:9200/content/news/_mapping

{
    "news" : {
        "properties" : {
            "TextContent" : {
                "type" : "string",
                "index" : "analyzed",
                "analyzer" : "htmlContentAnalyzer",
                "store" : "yes"
            }
        }
    }
}

The "store": "yes" is just for "fun"; it makes no difference. The above gives me a 200 OK.

However, the search returns the same results.

What doesn't help is that the Elasticsearch documentation seems appalling. Check out this page:

http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping.html

It gives you a brief rundown of what a mapping is, and says more details are in the mapping section, i.e. this page:

http://www.elasticsearch.org/guide/reference/mapping/

...which seems to be truly terrible. There's nothing referring to the format/object graph I found - no mention of "properties", "type", "analyzer", "index" etc. There are some sections on the menu on the right, e.g. "_index", but they seem to refer to the item as a whole? And where is that pointed out?

So my question is on two fronts:

  • How do I stop HTML tags (and entities, attribute values I guess) being indexed? - I still want the HTML stored, mind you
  • Is there a better source for elasticsearch info/documentation? Or am I looking at it without the super-secret decoder glasses?

1 solution

#1


3  

With all credit to chrismale on #elasticsearch (freenode IRC) -

Searching against _all is no good: that is indexed with its own analyzer. Querying on my TextContent field specifically worked as expected.
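To make the difference concrete, here is a minimal Python sketch of the two request bodies. It assumes a query_string query, which targets a specific field when `default_field` is set and otherwise falls back to the index's default field (`_all` in the setup described here). The helper name `build_search` and the `ES_BASE` constant are my own, purely for illustration; the host, index, type and field names come from the question. Treat it as a sketch, not as what my wrapper actually sends.

```python
import json
from urllib.parse import quote

ES_BASE = "http://localhost:9200"  # host and port taken from the question


def build_search(index, doc_type, text, field=None):
    """Return (url, body) for a query_string search (hypothetical helper).

    With field=None no default_field is set, so the search runs against the
    index's default field, which does not use htmlContentAnalyzer.
    With field="TextContent" the search runs against the mapped field,
    so the html_strip char filter applies.
    """
    query = {"query": text}
    if field is not None:
        query["default_field"] = field
    url = "{}/{}/{}/_search".format(ES_BASE, quote(index), quote(doc_type))
    body = json.dumps({"query": {"query_string": query}})
    return url, body


# Search against the default field: still matches on the tag name "strong".
all_url, all_body = build_search("content", "news", "strong")

# Search against TextContent only: this is the variant that behaved as expected.
field_url, field_body = build_search("content", "news", "strong", field="TextContent")

print(field_url)
print(field_body)
```

Either body can be sent to the printed URL with whatever HTTP client you use (WebRequest in my case). The stored document is untouched either way, so the original HTML still comes back in the hit.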
