Kubernetes Fluentd+Elasticsearch+Kibana Unified Log Management Platform: A Pitfall-by-Pitfall Setup Guide

Date: 2024-05-25 20:03:38

After getting the Kubernetes cluster itself up, and after deploying a few monitoring components, we can already:

  • monitor the status and resource usage of every node and pod graphically
  • scale ReplicaSets out and in with kubectl scale
  • view each Pod's logs with kubectl logs or the dashboard

However, in a distributed architecture the number of nodes is usually large; a typical production environment may have dozens or even hundreds of minion nodes, so a centralized log collection and management system becomes necessary. My initial idea was to mount a volume backed by shared storage and have WebLogic write its logs there, but that approach has several problems:

  • We scale WebLogic in Docker using a single-domain model, so every replica writes to exactly the same log path and file name inside its container (/u01/oracle/user_projects/domains/base_domain/servers/AdminServer/logs/AdminServer.log). Mapping that path onto shared storage with a volumeMount therefore leads to file conflicts.
  • Pod and container metadata is not captured.
  • Logs from the other nodes in the cluster are not captured.

So a platform-level solution is still needed. In the official documentation, https://kubernetes.io/docs/concepts/cluster-administration/logging/,

Kubernetes describes several logging options and gives a reference architecture for cluster-level logging:

[Figure: the cluster-level logging reference architecture from the Kubernetes documentation — a node-level logging-agent pod ships container logs from each node to a logging backend]

In other words, the container processes inside the Pods we run stream their logs out to the minion host; a logging-agent pod running on that same host picks the logs up and forwards them to a backend. The backend can be implemented in different ways, for example Elasticsearch (elasticsearch-logging) for storage plus Kibana for presentation.

Kubernetes recommends this node-level logging-agent approach and ships two variants: Stackdriver Logging for Google Cloud Platform, and Elasticsearch. Both use fluentd as the agent running on the node:

Using a node-level logging agent is the most common and encouraged approach for a Kubernetes cluster, because it creates only one agent per node, and it doesn’t require any changes to the applications running on the node. However, node-level logging only works for applications’ standard output and standard error.

Kubernetes doesn’t specify a logging agent, but two optional logging agents are packaged with the Kubernetes release: Stackdriver Logging for use with Google Cloud Platform, and Elasticsearch. You can find more information and instructions in the dedicated documents. Both use fluentd with custom configuration as an agent on the node.
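To make the streaming path concrete: on each node the kubelet exposes every container's stdout/stderr under /var/log/containers as symlinks into the json log files Docker writes. A quick way to see this on a minion (just a sketch; it assumes Docker's json-file log driver, which turns out to matter later in this post):

# Each running container gets a symlink whose name encodes pod name, namespace and container id
ls -l /var/log/containers/

# The symlinks point at the per-container json logs written by Docker
ls /var/lib/docker/containers/*/*-json.log | head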

With that settled, on to the pit-filling guide.

1. Preparation

  • First, the environment

OS: CentOS 7.3

Kubernetes version: 1.5

[root@k8s-master fluentd-elasticsearch]# kubectl version
Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"a55267932d501b9fbd6d73e5ded47d79b5763ce5", GitTreeState:"clean", BuildDate:"2017-04-14T13:36:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.2", GitCommit:"a55267932d501b9fbd6d73e5ded47d79b5763ce5", GitTreeState:"clean", BuildDate:"2017-04-14T13:36:25Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"linux/amd64"}

  • Clone the Kubernetes source from GitHub onto the master node:
git clone https://github.com/kubernetes/kubernetes
  • Configure a ServiceAccount. The fluentd image we pull later connects to the API Server over SSL, so unless you plan to modify and rebuild the image, this needs to be in place first. Setup guide:
http://www.cnblogs.com/ericnie/p/6894688.html
  • Configure DNS. The Kibana component finds the elasticsearch-logging Service through DNS; without DNS you would have to change the address in kibana-controller.yaml to the Service's fixed cluster IP. Setup guide:
http://www.cnblogs.com/ericnie/p/6897142.html
  • Pull the images

Go to /root/kubernetes/cluster/addons/fluentd-elasticsearch, which contains all of the yaml files:

[Screenshot: the yaml files under cluster/addons/fluentd-elasticsearch]

fluentd-es-ds.yaml builds the fluentd DaemonSet that runs on every node and plays the logging-agent role; es-controller.yaml and es-service.yaml build elasticsearch-logging, the backend that aggregates the logs; kibana-controller.yaml and kibana-service.yaml provide the presentation layer.

Pull the images referenced in the controller yaml files onto each minion node:

docker pull gcr.io/google_containers/elasticsearch:v2.4.1-2
docker pull gcr.io/google_containers/fluentd-elasticsearch:1.22
docker pull gcr.io/google_containers/kibana:v4.6.1-1
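If you have several minions, a small loop saves some typing; this is only a sketch, assuming passwordless ssh and minion hostnames k8s-node-1 and k8s-node-2:

for n in k8s-node-1 k8s-node-2; do      # adjust to your minion hostnames
  for img in gcr.io/google_containers/elasticsearch:v2.4.1-2 \
             gcr.io/google_containers/fluentd-elasticsearch:1.22 \
             gcr.io/google_containers/kibana:v4.6.1-1; do
    ssh "$n" docker pull "$img"
  done
done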

2. Start the fluentd DaemonSet

  • Label the nodes (pit #1)

Fluentd is supposed to run on every minion node. Create the DaemonSet with:

# kubectl create -f fluentd-es-ds.yaml
daemonset "fluentd-es-v1.22" created

Then I went to a minion node to check with tail -f /var/log/fluentd.log, only to find there was no fluentd.log on the node at all!

Running

kubectl get pods -n kube-system

showed that there was no fluentd-related Pod running or even pending! :(

Checking the DaemonSet itself:

kubectl get -f fluentd-es-ds.yaml
NAME               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE-SELECTOR                              AGE
fluentd-es-v1.22   0         0         0       0            0           beta.kubernetes.io/fluentd-ds-ready=true   2m

shows that the DaemonSet has a NODE-SELECTOR of beta.kubernetes.io/fluentd-ds-ready=true (what a healthy one looks like is shown below).

[Screenshot: the same DaemonSet on a healthy cluster, where DESIRED/CURRENT match the number of labeled nodes]

kubectl describe nodes k8s-node-1

revealed that my minion node does not carry this label at all, so I added it by hand:

kubectl label node k8s-node-1 beta.kubernetes.io/fluentd-ds-ready=true

After recreating the DaemonSet, /var/log/fluentd.log finally showed up on k8s-node-1.
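If you have more than one minion, labeling nodes one by one is tedious. A loop like the following labels every node and then verifies the result; just a sketch (--overwrite makes it safe to re-run):

# Label every node so the DaemonSet's nodeSelector matches it
for n in $(kubectl get nodes -o name); do
  kubectl label "$n" beta.kubernetes.io/fluentd-ds-ready=true --overwrite
done

# Confirm the label is present
kubectl get nodes --show-labels | grep fluentd-ds-ready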

  • Create the ConfigMap (pit #2)
# tail -f /var/log/fluentd.log

-- :: + [info]: reading config file path="/etc/td-agent/td-agent.conf"
-- :: + [info]: starting fluentd-0.12.
-- :: + [info]: gem 'fluent-mixin-config-placeholders' version '0.4.0'
-- :: + [info]: gem 'fluent-mixin-plaintextformatter' version '0.2.6'
-- :: + [info]: gem 'fluent-plugin-docker_metadata_filter' version '0.1.3'
-- :: + [info]: gem 'fluent-plugin-elasticsearch' version '1.5.0'
-- :: + [info]: gem 'fluent-plugin-kafka' version '0.4.1'
-- :: + [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '0.24.0'
-- :: + [info]: gem 'fluent-plugin-mongo' version '0.7.16'
-- :: + [info]: gem 'fluent-plugin-rewrite-tag-filter' version '1.5.5'
-- :: + [info]: gem 'fluent-plugin-s3' version '0.8.0'
-- :: + [info]: gem 'fluent-plugin-scribe' version '0.10.14'
-- :: + [info]: gem 'fluent-plugin-td' version '0.10.29'
-- :: + [info]: gem 'fluent-plugin-td-monitoring' version '0.2.2'
-- :: + [info]: gem 'fluent-plugin-webhdfs' version '0.4.2'
-- :: + [info]: gem 'fluentd' version '0.12.31'
-- :: + [info]: adding match pattern="fluent.**" type="null"
-- :: + [info]: adding filter pattern="kubernetes.**" type="kubernetes_metadata"
-- :: + [error]: config error file="/etc/td-agent/td-agent.conf" error="Invalid Kubernetes API v1 endpoint https://192.168.0.105:443/api: 401 Unauthorized"
-- :: + [info]: process finished code=
-- :: + [warn]: process died within second. exit.

So the fluentd image talks to my API Server on port 443, and the API Server has authentication enabled, which means ca_file, client_cert and client_key have to be configured. If you don't want to rebuild the image, Kubernetes offers a powerful weapon for exactly this: ConfigMap. We can package the updated td-agent.conf as a ConfigMap resource and mount it into the fluentd pod at the right path, replacing the default td-agent.conf baked into the image.

td-agent.conf lives in

/root/kubernetes/cluster/addons/fluentd-elasticsearch/fluentd-es-image

After adding the ca and client settings it looks like this:

// td-agent.conf
... ...
<filter kubernetes.**>
  type kubernetes_metadata
  ca_file /srv/kubernetes/ca.crt
  client_cert /srv/kubernetes/kubecfg.crt
  client_key /srv/kubernetes/kubecfg.key
</filter>
... ...

Two things to watch out for:

  • Before creating the ConfigMap from td-agent.conf, delete all comment lines from the file; keep a backup of the original first, it comes in handy later (a small sed sketch for this step follows the ConfigMap output below).
  • The fluentd pods live in kube-system, so the ConfigMap must also be created in the kube-system namespace; otherwise the pods will not find it when they are created.
# kubectl create configmap td-agent-config --from-file=./td-agent.conf -n kube-system
configmap "td-agent-config" created

# kubectl get configmaps td-agent-config -o yaml
apiVersion: v1
data:
  td-agent.conf: |
    <match fluent.**>
      type null
    </match>

    <source>
      type tail
      path /var/log/containers/*.log
      pos_file /var/log/es-containers.log.pos
      time_format %Y-%m-%dT%H:%M:%S.%NZ
      tag kubernetes.*
      format json
      read_from_head true
    </source>
... ...
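For the comment-stripping step mentioned above, something along these lines works (only a sketch; run it in the fluentd-es-image directory and keep the backup):

cd /root/kubernetes/cluster/addons/fluentd-elasticsearch/fluentd-es-image
cp td-agent.conf td-agent.conf.bak          # keep the commented original around
sed -i '/^[[:space:]]*#/d' td-agent.conf    # drop comment-only lines before building the ConfigMap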

fluentd-es-ds.yaml needs matching changes, mainly two extra mounts:

one mounting the td-agent-config ConfigMap created above,

and one mounting the hostPath /srv/kubernetes so the pod can read the client certificates:

[root@k8s-master fluentd-elasticsearch]# cat fluentd-es-ds.yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: fluentd-es-v1.22
  namespace: kube-system
  labels:
    k8s-app: fluentd-es
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v1.22
spec:
  template:
    metadata:
      labels:
        k8s-app: fluentd-es
        kubernetes.io/cluster-service: "true"
        version: v1.22
      # This annotation ensures that fluentd does not get evicted if the node
      # supports critical pod annotation based priority scheme.
      # Note that this does not guarantee admission on the nodes (#).
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      containers:
      - name: fluentd-es
        image: gcr.io/google_containers/fluentd-elasticsearch:1.22
        command:
        - '/bin/sh'
        - '-c'
        - '/usr/sbin/td-agent 2>&1 >> /var/log/fluentd.log'
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: td-agent-config
          mountPath: /etc/td-agent
        - name: tls-files
          mountPath: /srv/kubernetes
      nodeSelector:
        beta.kubernetes.io/fluentd-ds-ready: "true"
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: td-agent-config
        configMap:
          name: td-agent-config
      - name: tls-files
        hostPath:
          path: /srv/kubernetes
[root@k8s-master fluentd-elasticsearch]#
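With the ConfigMap and the updated manifest in place, the DaemonSet has to be recreated for the new mounts to take effect. A sketch of the round trip:

kubectl delete -f fluentd-es-ds.yaml        # remove the old DaemonSet first
kubectl create -f fluentd-es-ds.yaml
kubectl get ds -n kube-system               # DESIRED/CURRENT should match the labeled nodes
kubectl get pods -n kube-system -o wide | grep fluentd-es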

After recreating it, check /var/log/fluentd.log on the minion again:

......
  client_cert /srv/kubernetes/kubecfg.crt
  client_key /srv/kubernetes/kubecfg.key
</filter>
<match **>
  type elasticsearch
  log_level info
  include_tag_key true
  host elasticsearch-logging
  port 9200
  logstash_format true
  buffer_chunk_limit 2M
  buffer_queue_limit
  flush_interval 5s
  max_retry_wait
  disable_retry_limit
  num_threads
</match>
</ROOT>

Seeing this output basically counts as success. It looks fine, but there is actually another pit hiding in here; for now, let's keep going and configure elasticsearch-logging.

3. Configure Elasticsearch

Create the Elasticsearch controller and service:

# kubectl create -f es-controller.yaml
replicationcontroller "elasticsearch-logging-v1" created

# kubectl create -f es-service.yaml
service "elasticsearch-logging" created

get pods:
kube-system   elasticsearch-logging-v1-3bzt6   /   Running   7s   172.16.57.8    10.46.181.146
kube-system   elasticsearch-logging-v1-nvbe1   /   Running   7s   172.16.99.10   10.47.136.60

Check the pod logs:

# kubectl logs -f elasticsearch-logging-v1-3bzt6 -n kube-system
F0302 ::41.036697 elasticsearch_logging_discovery.go:] kube-system namespace doesn't exist: the server has asked for the client to provide credentials (get namespaces kube-system)
goroutine [running]:
k8s.io/kubernetes/vendor/github.com/golang/glog.stacks(0x19a8100, 0xc400000000, 0xc2, 0x186)
... ...
main.main()
elasticsearch_logging_discovery.go: +0xb53 [-- ::,][INFO ][node ] [elasticsearch-logging-v1-3bzt6] version[2.4.], pid[], build[c67dc32/--27T18::55Z]
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-3bzt6] initializing ...
[-- ::,][INFO ][plugins ] [elasticsearch-logging-v1-3bzt6] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
... ...
[-- ::,][INFO ][env ] [elasticsearch-logging-v1-3bzt6] heap size [.3mb], compressed ordinary object pointers [true]
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-3bzt6] initialized
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-3bzt6] starting ...
[-- ::,][INFO ][transport ] [elasticsearch-logging-v1-3bzt6] publish_address {172.16.57.8:}, bound_addresses {[::]:}
[-- ::,][INFO ][discovery ] [elasticsearch-logging-v1-3bzt6] kubernetes-logging/7_f_M2TKRZWOw4NhBc4EqA
[-- ::,][WARN ][discovery ] [elasticsearch-logging-v1-3bzt6] waited for 30s and no initial state was set by the discovery
[-- ::,][INFO ][http ] [elasticsearch-logging-v1-3bzt6] publish_address {172.16.57.8:}, bound_addresses {[::]:}
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-3bzt6] started

There is an error: the pod cannot supply valid credentials. Following Tony Bai's technical write-up found online, this turns out to be an issue with the default ServiceAccount; the underlying mechanism still deserves further study.

Let's get it running first. The fix:

Create a new ServiceAccount in the kube-system namespace:

// serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: k8s-efk

# kubectl create -f serviceaccount.yaml -n kube-system
serviceaccount "k8s-efk" created

# kubectl get serviceaccount -n kube-system
NAME SECRETS AGE
default 139d
k8s-efk 17s
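The new ServiceAccount automatically gets a token secret, which is what the Elasticsearch discovery code uses to authenticate against the API Server. A quick check that it is there (a sketch):

kubectl describe serviceaccount k8s-efk -n kube-system    # lists the mountable token secret
kubectl get secrets -n kube-system | grep k8s-efk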

Modify es-controller.yaml to use the service account "k8s-efk":

[root@k8s-master fluentd-elasticsearch]# cat es-controller.yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: elasticsearch-logging-v1
  namespace: kube-system
  labels:
    k8s-app: elasticsearch-logging
    version: v1
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 2
  selector:
    k8s-app: elasticsearch-logging
    version: v1
  template:
    metadata:
      labels:
        k8s-app: elasticsearch-logging
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      serviceAccount: k8s-efk
      containers:
      - image: gcr.io/google_containers/elasticsearch:v2.4.1-2
        name: elasticsearch-logging
        resources:
          # need more cpu upon initialization, therefore burstable class
          limits:
            cpu: 1000m
          requests:
            cpu: 100m
        ports:
        - containerPort: 9200
          name: db
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        volumeMounts:
        - name: es-persistent-storage
          mountPath: /data
        env:
        - name: "NAMESPACE"
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
      volumes:
      - name: es-persistent-storage
        emptyDir: {}

After recreating the elasticsearch-logging pods, look at their logs again. They seem OK, but there is yet another pit here which I'll come back to in a moment:

[-- ::,][INFO ][node                     ] [elasticsearch-logging-v1-9jjf1] version[2.4.], pid[], build[c67dc32/--27T18::55Z]
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-9jjf1] initializing ...
[-- ::,][INFO ][plugins ] [elasticsearch-logging-v1-9jjf1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[-- ::,][INFO ][env ] [elasticsearch-logging-v1-9jjf1] using [] data paths, mounts [[/data (/dev/mapper/cl-root)]], net usable_space [25gb], net total_space [.2gb], spins? [possibly], types [xfs]
[-- ::,][INFO ][env ] [elasticsearch-logging-v1-9jjf1] heap size [.6mb], compressed ordinary object pointers [true]
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-9jjf1] initialized
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-9jjf1] starting ...
[-- ::,][INFO ][transport ] [elasticsearch-logging-v1-9jjf1] publish_address {192.168.10.6:}, bound_addresses {[::]:}
[-- ::,][INFO ][discovery ] [elasticsearch-logging-v1-9jjf1] kubernetes-logging/RY_IOcwSSSeuJNtC2E0W7A
[-- ::,][INFO ][cluster.service ] [elasticsearch-logging-v1-9jjf1] detected_master {elasticsearch-logging-v1-sbcgt}{--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:}{master=true}, added {{elasticsearch-logging-v1-sbcgt}{--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:}{master=true},}, reason: zen-disco-receive(from master [{elasticsearch-logging-v1-sbcgt}{--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:}{master=true}])
[-- ::,][INFO ][http ] [elasticsearch-logging-v1-9jjf1] publish_address {192.168.10.6:}, bound_addresses {[::]:}
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-9jjf1] started
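Besides reading the pod logs, Elasticsearch's own health endpoint is a quick sanity check. A sketch that goes through the API server proxy, assuming the insecure local 8080 port that kubectl cluster-info prints further down:

curl "http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/elasticsearch-logging/_cluster/health?pretty"
# "status" should be green or yellow, and "number_of_nodes" should equal the replica count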

Good, moving on...

4. Configure Kibana

Following the experience of those who came before, explicitly assign the newly created serviceaccount k8s-efk in kibana-controller.yaml as well:

[root@k8s-master fluentd-elasticsearch]# cat kibana-controller.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: kibana-logging
  namespace: kube-system
  labels:
    k8s-app: kibana-logging
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: kibana-logging
  template:
    metadata:
      labels:
        k8s-app: kibana-logging
    spec:
      serviceAccount: k8s-efk
      containers:
      - name: kibana-logging
        image: gcr.io/google_containers/kibana:v4.6.1-1
        resources:
          # keep request = limit to keep this container in guaranteed class
          limits:
            cpu: 100m
          requests:
            cpu: 100m
        env:
          - name: "ELASTICSEARCH_URL"
            value: "http://elasticsearch-logging:9200"
          - name: "KIBANA_BASE_URL"
            value: "/api/v1/proxy/namespaces/kube-system/services/kibana-logging"
        ports:
        - containerPort: 5601
          name: ui
          protocol: TCP
[root@k8s-master fluentd-elasticsearch]#
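Bringing Kibana up is analogous to Elasticsearch; a sketch (kibana-service.yaml sits in the same addon directory):

kubectl create -f kibana-controller.yaml
kubectl create -f kibana-service.yaml
kubectl get pods -n kube-system -l k8s-app=kibana-logging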

Once Kibana is up, watch the pod log:

# kubectl logs -f kibana-logging--jby53 -n kube-system
ELASTICSEARCH_URL=http://elasticsearch-logging:9200
server.basePath: /api/v1/proxy/namespaces/kube-system/services/kibana-logging
{"type":"log","@timestamp":"2017-03-02T08:30:15Z","tags":["info","optimize"],"pid":,"message":"Optimizing and caching bundles for kibana and statusPage. This may take a few minutes"}

Kibana takes a good ten-plus minutes to start (apologies, I am running all of this in VirtualBox VMs on an 8 GB laptop). After that you will see logs like the following:

# kubectl logs -f kibana-logging--jby53 -n kube-system
ELASTICSEARCH_URL=http://elasticsearch-logging:9200
server.basePath: /api/v1/proxy/namespaces/kube-system/services/kibana-logging
{"type":"log","@timestamp":"2017-03-02T08:30:15Z","tags":["info","optimize"],"pid":,"message":"Optimizing and caching bundles for kibana and statusPage. This may take a few minutes"}
{"type":"log","@timestamp":"2017-03-02T08:40:04Z","tags":["info","optimize"],"pid":,"message":"Optimization of bundles for kibana and statusPage complete in 588.60 seconds"}
{"type":"log","@timestamp":"2017-03-02T08:40:04Z","tags":["status","plugin:kibana@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:elasticsearch@1.0.0","info"],"pid":,"state":"yellow","message":"Status changed from uninitialized to yellow - Waiting for Elasticsearch","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:kbn_vislib_vis_types@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:markdown_vis@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:05Z","tags":["status","plugin:metric_vis@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["status","plugin:spyModes@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["status","plugin:statusPage@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["status","plugin:table_vis@1.0.0","info"],"pid":,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2017-03-02T08:40:06Z","tags":["listening","info"],"pid":,"message":"Server running at http://0.0.0.0:5601"}
{"type":"log","@timestamp":"2017-03-02T08:40:11Z","tags":["status","plugin:elasticsearch@1.0.0","info"],"pid":,"state":"yellow","message":"Status changed from yellow to yellow - No existing Kibana index found","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","@timestamp":"2017-03-02T08:40:14Z","tags":["status","plugin:elasticsearch@1.0.0","info"],"pid":,"state":"green","message":"Status changed from yellow to green - Kibana index ready","prevState":"yellow","prevMsg":"No existing Kibana index found"}

Things to note (these are pits too):

  • DNS must be configured, otherwise Kibana will sit there unable to connect to http://elasticsearch-logging:9200.
  • If you don't configure DNS, the only alternative is to edit the controller file and replace elasticsearch-logging with the concrete cluster IP of the elasticsearch-logging Service (see the sketch below for how to look it up).
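To find that cluster IP, and to sanity-check DNS resolution if you did configure kube-dns, something like this works; the busybox test pod is only an assumed helper for the lookup:

# The Service's cluster IP, usable as a hard-coded address when DNS is absent
kubectl get svc elasticsearch-logging -n kube-system

# With kube-dns in place, the name should resolve from inside any pod
kubectl run -i --rm dns-test --image=busybox --restart=Never -- nslookup elasticsearch-logging.kube-system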

Running

kubectl cluster-info

gives you the address of the Kibana service, which is essentially:

https://{API Server external IP}:{API Server secure port}/api/v1/proxy/namespaces/kube-system/services/kibana-logging/app/kibana#/settings/indices/

On the page below, however, no matter what I did the Create button never appeared and no index could be added. The only exception was typing * directly, which could be created, but inside there was no pod information at all. Big problem!!!!

[Screenshot: Kibana's index pattern settings page — the Create button stays disabled because no logstash-* index exists yet]

5. Tracking down the problem

I carefully cross-checked Tony Bai's setup guide and read countless posts from countless predecessors, to the point of wanting to start over on CentOS 6.5; but since I couldn't get a Kubernetes cluster onto CentOS 6.5 either, that idea was dropped.

Comparing the logs, the problem most likely lay in one of these places:

  • No logs being produced at all
  • A problem with the fluentd service itself
  • A problem with elasticsearch-logging's log collection
  • The elasticsearch-logging image needing to be swapped out (what a pit)

A close look at the fluentd log /var/log/fluentd.log showed no output at all, which rules out the case where fluentd simply cannot reach elasticsearch-logging:9200.

So it felt like a problem on the elasticsearch-logging side. Comparing my Elasticsearch log with Tony Bai's, mine ends with just:

[-- ::,][INFO ][cluster.service          ] [elasticsearch-logging-v1-9jjf1] detected_master {elasticsearch-logging-v1-sbcgt}{--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:}{master=true}, added {{elasticsearch-logging-v1-sbcgt}{--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:}{master=true},}, reason: zen-disco-receive(from master [{elasticsearch-logging-v1-sbcgt}{--uDYJOTqegj5ctbbCx_A}{192.168.10.8}{192.168.10.8:}{master=true}])
[-- ::,][INFO ][http ] [elasticsearch-logging-v1-9jjf1] publish_address {192.168.10.6:}, bound_addresses {[::]:}
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-9jjf1] started

and stops there, whereas Tony Bai's continues with:

[-- ::,][INFO ][http                     ] [elasticsearch-logging-v1-dklui] publish_address {172.16.57.8:}, bound_addresses {[::]:}
[-- ::,][INFO ][node ] [elasticsearch-logging-v1-dklui] started
[-- ::,][INFO ][gateway ] [elasticsearch-logging-v1-dklui] recovered [] indices into cluster_state
[-- ::,][INFO ][cluster.metadata ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.] creating index, cause [auto(bulk api)], templates [], shards []/[], mappings []
[-- ::,][INFO ][cluster.metadata ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.] creating index, cause [auto(bulk api)], templates [], shards []/[], mappings []
[-- ::,][INFO ][cluster.routing.allocation] [elasticsearch-logging-v1-dklui] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[logstash-2017.03.][], [logstash-2017.03.][]] ...]).
[-- ::,][INFO ][cluster.metadata ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.] create_mapping [fluentd]
[-- ::,][INFO ][cluster.metadata ] [elasticsearch-logging-v1-dklui] [logstash-2017.03.] create_mapping [fluentd]
[-- ::,][INFO ][cluster.routing.allocation] [elasticsearch-logging-v1-dklui] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[logstash-2017.03.][]] ...]).

The differences:

  • the recovered-indices-into-cluster_state message
  • the logstash index creation messages

This strongly suggested an image problem, so I changed the image to match Tony Bai's, replacing the official v2.4.1-2 in es-controller.yaml with

bigwhite/elasticsearch:v2.4.1-

After restarting, the recovered-indices message did indeed show up, but there was still nothing about logstash.

  • Locating the log problem (another big pit)

Which brings us back to the first suspect. kubectl logs clearly shows screenfuls of output:

[root@k8s-master fluentd-elasticsearch]# kubectl logs helloworld-service-4d72j
.
.
JAVA Memory arguments: -Djava.security.egd=file:/dev/./urandom
.
CLASSPATH=/u01/oracle/wlserver/../oracle_common/modules/javax.persistence_2..jar:/u01/oracle/wlserver/../wlserver/modules/com.oracle.weblogic.jpa21support_1.0.0.0_2-.jar:/usr/java/jdk1..0_101/lib/tools.jar:/u01/oracle/wlserver/server/lib/weblogic_sp.jar:/u01/oracle/wlserver/server/lib/weblogic.jar:/u01/oracle/wlserver/../oracle_common/modules/net.sf.antcontrib_1.1.0.0_1-0b3/lib/ant-contrib.jar:/u01/oracle/wlserver/modules/features/oracle.wls.common.nodemanager_2.0.0..jar:/u01/oracle/wlserver/../oracle_common/modules/com.oracle.cie.config-wls-online_8.1.0..jar:/u01/oracle/wlserver/common/derby/lib/derbyclient.jar:/u01/oracle/wlserver/common/derby/lib/derby.jar:/u01/oracle/wlserver/server/lib/xqrl.jar
.
PATH=/u01/oracle/wlserver/server/bin:/u01/oracle/wlserver/../oracle_common/modules/org.apache.ant_1.9.2/bin:/usr/java/jdk1..0_101/jre/bin:/usr/java/jdk1..0_101/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/java/default/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/wlserver/common/bin:/u01/oracle/user_projects/domains/base_domain/bin:/u01/oracle
.
***************************************************
* To start WebLogic Server, use a username and *
* password assigned to an admin-level user. For *
* server administration, use the WebLogic Server *
* console at http://hostname:port/console *
***************************************************
starting weblogic with Java version:
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) -Bit Server VM (build 25.101-b13, mixed mode)
Starting WLS with line:
/usr/java/jdk1..0_101/bin/java -server -Djava.security.egd=file:/dev/./urandom -Dweblogic.Name=AdminServer -Djava.security.policy=/u01/oracle/wlserver/server/lib/weblogic.policy -Dweblogic.ProductionModeEnabled=true -Djava.endorsed.dirs=/usr/java/jdk1..0_101/jre/lib/endorsed:/u01/oracle/wlserver/../oracle_common/modules/endorsed -da -Dwls.home=/u01/oracle/wlserver/server -Dweblogic.home=/u01/oracle/wlserver/server -Dweblogic.utils.cmm.lowertier.ServiceDisabled=true weblogic.Server
<May , :: AM GMT> <Info> <Security> <BEA-> <Disabling the CryptoJ JCE Provider self-integrity check for better startup performance. To enable this check, specify -Dweblogic.security.allowCryptoJDefaultJCEVerification=true.>
<May , :: AM GMT> <Info> <Security> <BEA-> <Changing the default Random Number Generator in RSA CryptoJ from ECDRBG128 to FIPS186PRNG. To disable this change, specify -Dweblogic.security.allowCryptoJDefaultPRNG=true.>
<May , :: AM GMT> <Info> <WebLogicServer> <BEA-> <Starting WebLogic Server with Java HotSpot(TM) -Bit Server VM Version 25.101-b13 from Oracle Corporation.>
<May , :: AM GMT> <Info> <Management> <BEA-> <Version: WebLogic Server 12.1.3.0. Wed May :: PDT >
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Server state changed to STARTING.>
<May , :: AM GMT> <Info> <WorkManager> <BEA-> <Initializing self-tuning thread pool.>
<May , :: AM GMT> <Info> <WorkManager> <BEA-> <CMM memory level becomes . Setting standby thread pool size to .>
<May , :: AM GMT> <Notice> <Log Management> <BEA-> <The server log file /u01/oracle/user_projects/domains/base_domain/servers/AdminServer/logs/AdminServer.log is opened. All server side log events will be written to this file.>
<May , :: AM GMT> <Notice> <Security> <BEA-> <Security initializing using security realm myrealm.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Server state changed to STANDBY.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Server state changed to STARTING.>
May , :: AM weblogic.wsee.WseeCoreMessages logWseeServiceStarting
INFO: The Wsee Service is starting
<May , :: AM GMT> <Warning> <Munger> <BEA-> <A version attribute was not found in element "web-app" in the deployment descriptor /u01/oracle/user_projects/domains/base_domain/servers/AdminServer/upload/HelloWorld.war/WEB-INF/web.xml. A version attribute is required, but this version of the WebLogic Server will assume that the latest version is used. Future versions of WebLogic Server will reject descriptors that do not specify the Java EE version. To eliminate this warning, add an appropriate "version=" to element "web-app" in the deployment descriptor.>
<May , :: AM GMT> <Notice> <Log Management> <BEA-> <The server has successfully established a connection with the Domain level Diagnostic Service.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Server state changed to ADMIN.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Server state changed to RESUMING.>
<May , :: AM GMT> <Notice> <Server> <BEA-> <Channel "Default[3]" is now listening on 127.0.0.1: for protocols iiop, t3, ldap, snmp, http.>
<May , :: AM GMT> <Notice> <Server> <BEA-> <Channel "Default" is now listening on 192.168.53.3: for protocols iiop, t3, ldap, snmp, http.>
<May , :: AM GMT> <Notice> <Server> <BEA-> <Channel "Default[2]" is now listening on :::::::: for protocols iiop, t3, ldap, snmp, http.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Started the WebLogic Server Administration Server "AdminServer" for domain "base_domain" running in production mode.>
<May , :: AM GMT> <Error> <Server> <BEA-> <The server is unable to create a server socket for listening on channel "Default[1]". The address fe80:::::c0ff:fea8: might be incorrect or another process is using port : java.net.BindException: Cannot assign requested address>
<May , :: AM GMT> <Warning> <Server> <BEA-> <The hostname "localhost", maps to multiple IP addresses: 127.0.0.1, :::::::.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <The server started in RUNNING mode.>
<May , :: AM GMT> <Notice> <WebLogicServer> <BEA-> <Server state changed to RUNNING.>
[root@k8s-master fluentd-elasticsearch]#

and docker logs <container id> on the minion is also full of output!

[root@k8s-node- ~]# docker logs bec3e02b2490
.
.
JAVA Memory arguments: -Djava.security.egd=file:/dev/./urandom
.
CLASSPATH=/u01/oracle/wlserver/../oracle_common/modules/javax.persistence_2..jar:/u01/oracle/wlserver/../wlserver/modules/com.oracle.weblogic.jpa21support_1.0.0.0_2-.jar:/usr/java/jdk1..0_101/lib/tools.jar:/u01/oracle/wlserver/server/lib/weblogic_sp.jar:/u01/oracle/wlserver/server/lib/weblogic.jar:/u01/oracle/wlserver/../oracle_common/modules/net.sf.antcontrib_1.1.0.0_1-0b3/lib/ant-contrib.jar:/u01/oracle/wlserver/modules/features/oracle.wls.common.nodemanager_2.0.0..jar:/u01/oracle/wlserver/../oracle_common/modules/com.oracle.cie.config-wls-online_8.1.0..jar:/u01/oracle/wlserver/common/derby/lib/derbyclient.jar:/u01/oracle/wlserver/common/derby/lib/derby.jar:/u01/oracle/wlserver/server/lib/xqrl.jar
.
PATH=/u01/oracle/wlserver/server/bin:/u01/oracle/wlserver/../oracle_common/modules/org.apache.ant_1.9.2/bin:/usr/java/jdk1..0_101/jre/bin:/usr/java/jdk1..0_101/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/java/default/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/oracle_common/common/bin:/u01/oracle/wlserver/common/bin:/u01/oracle/user_projects/domains/base_domain/bin:/u01/oracle
.
***************************************************
* To start WebLogic Server, use a username and *
* password assigned to an admin-level user. For *
* server administration, use the WebLogic Server *
* console at http://hostname:port/console *
***************************************************
starting weblogic with Java version:
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) -Bit Server VM (build 25.101-b13, mixed mode)
Starting WLS with line:
/usr/java/jdk1..0_101/bin/java -server -Djava.security.egd=file:/dev/./urandom -Dweblogic.Name=AdminServer -Djava.security.policy=/u01/oracle/wlserver/server/lib/weblogic.policy -Dweblogic.ProductionModeEnabled=true -Djava.endorsed.dirs=/usr/java/jdk1..0_101/jre/lib/endorsed:/u01/oracle/wlserver/../oracle_common/modules/endorsed -da -Dwls.home=/u01/oracle/wlserver/server -Dweblogic.home=/u01/oracle/wlserver/server -Dweblogic.utils.cmm.lowertier.ServiceDisabled=true weblogic.Server
...........

Because /var/log/fluentd.log kept sitting at the configuration dump and never scrolled, I suspected the ConfigMap configuration, so I dug out the backed-up td-agent.conf and had a look:

# The Kubernetes fluentd plugin is used to write the Kubernetes metadata to the log
# record & add labels to the log record if properly configured. This enables users
# to filter & search logs on any metadata.
# For example a Docker container's logs might be in the directory:
#
# /var/lib/docker/containers/997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b
#
# and in the file:
#
# 997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b-json.log
#
# where 997599971ee6... is the Docker ID of the running container.
# The Kubernetes kubelet makes a symbolic link to this file on the host machine
# in the /var/log/containers directory which includes the pod name and the Kubernetes
# container name:
#
# synthetic-logger-.25lps-pod_default_synth-lgr-997599971ee6366d4a5920d25b79286ad45ff37a74494f262e3bc98d909d0a7b.log

Finally, the problem: fluentd looks for logs under the /var/lib/docker/containers/ directory, yet my containers had no log files there at all.

A closer look at Docker showed that all container logs were being journaled into the system log under /var/log/messages. Why? Because people often complain that container logs grow so fast that the containers balloon, so everything had been routed through the system journal for unified handling.

Edit the /etc/sysconfig/docker configuration file and switch the log driver back from journald to json-file:

#OPTIONS='--selinux-enabled --log-driver=journald --signature-verification=false'
OPTIONS='--selinux-enabled --log-driver=json-file --signature-verification=false'
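After editing the file, Docker needs a restart for the new log driver to take effect (note that restarting the Docker daemon also restarts the containers on that node). A minimal sketch, assuming systemd-managed Docker on CentOS 7:

systemctl restart docker
docker info | grep -i 'logging driver'               # should now report json-file
ls /var/lib/docker/containers/*/*-json.log | head    # per-container json logs reappear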

After the change, the containers directory was suddenly full of log files.

Back in /var/log/fluentd.log, the log finally started scrolling and the output looked normal:

-- :: + [info]: following tail of /var/log/containers/fluentd-es-v1.-351lz_kube-system_POD-aca728523bc307598917d78b2526e718e6c7fdbb38b70c05900d2439399efa10.log
-- :: + [info]: following tail of /var/log/containers/helloworld-service-n5f0s_default_POD-ca013e9ab31b825cd4b85ab4700fad2fcaafd5f39c572778d10d438012ea4435.log
-- :: + [info]: following tail of /var/log/containers/fluentd-es-v1.-351lz_kube-system_POD-2eb78ece8c2b5c222313ab4cfb53ea6ec32f54e1b7616f729daf48b01d393b65.log
-- :: + [info]: following tail of /var/log/containers/helloworld-service-4d72j_default_POD-1dcbbc2ef71f7f542018069a1043a122117a97378c19f03ddb95b8a71dab4637.log
-- :: + [info]: following tail of /var/log/containers/helloworld-service-n5f0s_default_weblogichelloworld-d7229e5c23c6bf7582ed6559417ba24d99e33e44a68a6079159b4792fe05a673.log
-- :: + [info]: following tail of /var/log/containers/helloworld-service-4d72j_default_weblogichelloworld-71d1d7252dd7504fd45351d714d21c3c615facc5e2650553c68c0bf359e8434a.log
-- :: + [info]: following tail of /var/log/containers/kube-dns-v11-x0vr3_kube-system_kube2sky-c77121c354459f22712b0a99623eff1590f4fdb1a5d3ad2db09db000755f9c2c.log
-- :: + [info]: following tail of /var/log/containers/kube-dns-v11-x0vr3_kube-system_skydns-f3c0fbf4ea5cd840c968a807a40569042c90de08f7722e7344282845d5782a20.log
-- :: + [info]: following tail of /var/log/containers/fluentd-es-v1.-351lz_kube-system_fluentd-es-93795904ff4870758441dd7288972d4967ffac18f2f25272b12e99ea6b692d44.log
-- :: + [warn]: temporarily failed to flush the buffer. next_retry=-- :: + error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"elasticsearch-logging\", :port=>9200, :scheme=>\"http\"})!" plugin_id="object:3f986d0a5150"
-- :: + [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1./gems/fluent-plugin-elasticsearch-1.5./lib/fluent/plugin/out_elasticsearch.rb::in `client'

Start all the components:

[root@k8s-master fluentd-elasticsearch]# kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
elasticsearch-logging-v1-1xwnq / Running 29s
elasticsearch-logging-v1-gx6lc / Running 29s
fluentd-es-v1.-351lz / Running 3h
kibana-logging--gcwrn / Running 15s
kube-dns-v11-x0vr3 / Running 1d

Access Kibana:

[root@k8s-master fluentd-elasticsearch]# kubectl cluster-info
Kubernetes master is running at http://localhost:8080
Elasticsearch is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/elasticsearch-logging
Kibana is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/kibana-logging
KubeDNS is running at http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/kube-dns

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
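Before going back to the Kibana UI, you can confirm on the command line that fluentd's documents are actually being indexed; a sketch against the proxy address printed above (the exact index names vary by date):

curl "http://localhost:8080/api/v1/proxy/namespaces/kube-system/services/elasticsearch-logging/_cat/indices?v"
# one logstash-YYYY.MM.DD index per day should appear once fluentd starts shipping logs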

The green Create button finally appears:

[Screenshot: Kibana index pattern page with the Create button finally enabled]

After clicking Create, go in and you see:

[Screenshot: Kibana Discover view showing the collected log entries]

Expanding the kubernetes.pod_name field shows logs from every pod in the cluster:

[Screenshot: Kibana Discover view with the kubernetes.pod_name field expanded, listing pods from the whole cluster]

If we only care about helloworld-service, click the + next to it to filter, and every WebLogic log line becomes visible:

[Screenshot: Kibana Discover view filtered to helloworld-service, showing individual WebLogic log lines]

And with that, the setup is complete.

6. Summary of the pits

  • To get EFK working I ended up configuring DNS and ServiceAccounts along the way, which is a gain in itself.
  • I now understand the internal mechanisms of a Kubernetes cluster more deeply, which should gradually help us solve problems faster.

Finally, sincere thanks to the masters online whom I have yet to meet in person:

http://tonybai.com/2017/03/03/implement-kubernetes-cluster-level-logging-with-fluentd-and-elasticsearch-stack/

http://rootsongjc.github.io/blogs/kubernetes-fluentd-elasticsearch-installation/

http://www.tothenew.com/blog/how-to-install-kubernetes-on-centos/

http://blog.****.net/wenwst/article/details/53908144