Logstash + Kibana terms panel without breaking words

I have a Java application that writes to a log file in JSON format. The fields that appear in the logs vary. Logstash reads this log file and sends it to Kibana.

I've configured the logstash with the following file:

input {
        file {
                path => ["[log_path]"]
                codec => "json"
        }
}

filter{
        json {
                source => "message"
        }

        date {
                match => [ "data", "dd-MM-yyyy HH:mm:ss.SSS" ]
                timezone => "America/Sao_Paulo"
        }
}

output {
        elasticsearch_http {
                flush_size => 1
                host => "[host]"
                index => "application-%{+YYYY.MM.dd}"
        }
}

I've managed to display everything correctly in Kibana without any mapping. But when I try to create a terms panel to show a count of the servers that sent those messages, I have a problem. My JSON has a field called server that holds the server name (for example, a1-name-server1), but the terms panel splits the server name at each "-". I would also like to count the number of times an error message appears, but the same problem occurs: the terms panel splits the error message at the spaces.

I'm using Kibana 3 and Logstash 1.4. I've searched a lot on the web and couldn't find any solution. I also tried using the .raw from logstash, but it didn't work.

How can I manage this?

Thanks for the help.

2 solutions

#1

Your problem here is that your data is being tokenized. Tokenizing is what makes it possible to search your data: by default, ES splits the contents of a field such as message into separate tokens so that each one can be searched. For example, you may want to search for the word ERROR in your logs, and you would probably like the results to include messages like "There was an error in your cluster" or "Error processing whatever". If the data in that field is not analyzed with tokenizers, you won't be able to search like this.

This analyzed behaviour is helpful when you want to search, but it doesn't let you group different messages that have the same content, which is your use case. The solution is to update your mapping and set not_analyzed on each specific field that you don't want split into tokens. This will probably work for your server field, but it will probably break search on that field.
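To make the difference concrete, here is a small Python sketch of what a terms count sees in each case. It is only an approximation: rough_tokens is a hypothetical stand-in that lowercases and splits on non-alphanumeric characters, not the real Elasticsearch analyzer, and the sample server values are made up.

```python
import re
from collections import Counter


def rough_tokens(text):
    """Hypothetical stand-in for an analyzed string field:
    lowercase the value and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Made-up values of the "server" field, one per log event.
servers = ["a1-name-server1", "a1-name-server1", "a1-name-server2"]

# Analyzed field: each event contributes its individual tokens as terms.
analyzed_counts = Counter(
    token for value in servers for token in set(rough_tokens(value))
)

# not_analyzed field: each event contributes the whole value as one term.
raw_counts = Counter(servers)

print(analyzed_counts.most_common())
print(raw_counts.most_common())
```

With the analyzed version the panel can only count fragments such as a1, name and server1; with the not_analyzed version it counts complete server names.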

What I usually do in this kind of situation is use index templates and multifields. An index template lets me set a mapping for every index whose name matches a pattern, and multifields let me have both the analyzed and the not_analyzed behaviour on the same field.

The following request would do the job for your problem:

curl -XPUT https://example.org/_template/name_of_index_template -d '
{
    "template": "indexname*",
    "mappings": {
        "type": {
            "properties": {
                "field_name": {
                    "type": "multi_field",
                    "fields": {
                        "field_name": {
                            "type": "string",
                            "index": "analyzed"
                        },
                        "untouched": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}'

Then, in your terms panel, you can use field_name.untouched so that the entire content of the field is considered when the counts of the different values are calculated.
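If several fields need this treatment (for example the server name and the error message), the template body can be generated rather than written by hand. The following Python sketch builds the same structure as the request above. The helper names, the type name "logs" and the field name "error_message" are assumptions for illustration; "application-*" is derived from the index setting in the question's Logstash output.

```python
import json


def multi_field_mapping(field_name):
    """Mapping for one field: an analyzed default subfield plus a
    not_analyzed "untouched" subfield, as in the template above."""
    return {
        "type": "multi_field",
        "fields": {
            field_name: {"type": "string", "index": "analyzed"},
            "untouched": {"type": "string", "index": "not_analyzed"},
        },
    }


def build_template(index_pattern, type_name, field_names):
    """Build an index template body covering every listed field."""
    return {
        "template": index_pattern,
        "mappings": {
            type_name: {
                "properties": {
                    name: multi_field_mapping(name) for name in field_names
                }
            }
        },
    }


# "logs" and "error_message" are assumed names; replace them with your own.
body = build_template("application-*", "logs", ["server", "error_message"])
print(json.dumps(body, indent=4))
```

The printed JSON can be used as the -d payload of the curl command, and the terms panel would then point at server.untouched or error_message.untouched.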

If you don't want to use index templates (maybe your data is in a single index), setting the mapping with the Put Mapping API would do the job too. If you use multifields, there is no need to reindex the data: from the moment you set the new mapping on the index, new data is stored in both subfields (field_name and field_name.untouched). If you instead just change the mapping from analyzed to not_analyzed, you won't see any change until you reindex all your data.

#2

Since you didn't define a mapping in Elasticsearch, the default settings apply to every field of your type in your index. The default for string fields (like your server field) is to analyze the field, meaning that Elasticsearch tokenizes the field contents. That is why it's splitting your server names into parts.

You can overcome this issue by defining a mapping. You don't have to define all your fields, only the ones that you don't want Elasticsearch to analyze. In your particular case, sending the following PUT request will do the trick:

http://[host]:9200/[index_name]/_mapping/[type]

{
    "type" : {
        "properties" : {
            "server" : {"type" : "string", "index" : "not_analyzed"}
        }
    }
}

You can't do this on an already existing index because switching from analyzed to not_analyzed is a major change in the mapping.
