Deploying Alertmanager
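Prometheus needs the Alertmanager listen address and port in its configuration file, so I run Alertmanager and Prometheus as two containers in the same pod, where Prometheus can reach it on localhost. You can also deploy Alertmanager in its own pod and point Prometheus at it through a Service name and port, but for reasons I have not tracked down, I could not get that setup to work in testing. Add the corresponding configuration to prometheus.yml: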
prometheus.yml: |-
  global:
    scrape_interval: 90s
    evaluation_interval: 90s
  alerting:
    alertmanagers:
    - static_configs:
      - targets: ["localhost:9093"]
        #- alertmanager:9093
  rule_files:
    - /etc/prometheus/rules.yml
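For reference, the separate-pod variant mentioned earlier would usually look something like the sketch below. This is not a tested configuration (the author could not get this variant working). The Service name `alertmanager` is taken from the commented-out target above, while the `app: alertmanager` selector label and the fully qualified DNS name are assumptions.

```yaml
# Hypothetical Service in front of a standalone Alertmanager pod.
# The selector label is an assumption; it must match the labels on that pod.
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: kube-system
spec:
  selector:
    app: alertmanager
  ports:
  - name: http
    port: 9093
    targetPort: 9093
---
# Matching fragment of prometheus.yml (not a Kubernetes object).
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["alertmanager.kube-system.svc:9093"]
```

If a setup like this fails, a first thing to check is whether the Service selector actually matches the Alertmanager pod labels, since a Service with no endpoints resolves in DNS but accepts no connections.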
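Add the alerting rules that feed Alertmanager to the same Prometheus ConfigMap as a separate rules.yml key, so that the file is mounted at the /etc/prometheus/rules.yml path listed under rule_files: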
rules.yml: |-
  groups:
  - name: test-rule
    rules:
    - alert: NodeFilesystemUsage
      expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
      for: 2m
      labels:
        team: node
      annotations:
        summary: "{{$labels.instance}}: High Filesystem usage detected"
        description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }})"
    - alert: NodeMemoryUsage
      expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
      for: 2m
      labels:
        team: node
      annotations:
        summary: "{{$labels.instance}}: High Memory usage detected"
        description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
    - alert: NodeCPUUsage
      expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
      for: 2m
      labels:
        team: node
      annotations:
        summary: "{{$labels.instance}}: High CPU usage detected"
        description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"
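The metric names in these rules match older node_exporter releases. Starting with node_exporter 0.16, many metrics gained unit suffixes, so on a newer exporter the expressions would return no data. A minimal sketch of the filesystem rule under that assumption (your exporter version and the `rootfs` device label may differ):

```yaml
# Sketch only: assumes node_exporter >= 0.16, where these metrics carry a _bytes suffix.
groups:
- name: test-rule
  rules:
  - alert: NodeFilesystemUsage
    expr: (node_filesystem_size_bytes{device="rootfs"} - node_filesystem_free_bytes{device="rootfs"}) / node_filesystem_size_bytes{device="rootfs"} * 100 > 80
    for: 2m
    labels:
      team: node
    annotations:
      summary: "{{$labels.instance}}: High Filesystem usage detected"
```

The same renaming applies to the other two rules: the memory metrics become `node_memory_MemTotal_bytes`, `node_memory_MemFree_bytes`, `node_memory_Buffers_bytes` and `node_memory_Cached_bytes`, and `node_cpu` becomes `node_cpu_seconds_total`.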
Modify prometheus-deployment.yaml to add the Alertmanager container and its config volume:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-deployment
  namespace: kube-system
  #annotations:
  #  used to scrape metrics from apps deployed in the pod
  #  prometheus.io/scrape: 'true'
  #  prometheus scrape path, default /metrics
  #  prometheus.io/path: '/metrics'
  #  prometheus.io/port relevant port
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        runAsUser: 0
      containers:
      - name: prometheus
        image: prom/prometheus:v2.2.0
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - name: gluster-volume
          mountPath: /prometheus
        - name: config-volume
          mountPath: /etc/prometheus
      - name: alertmanager
        image: x.x.x.x/library/prom/alertmanager:latest
        args:
        - '--config.file=/etc/alertmanager/config.yml'
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: alert-volume
          mountPath: /etc/alertmanager
      imagePullSecrets:
      - name: my-secret
      volumes:
      - name: gluster-volume
        persistentVolumeClaim:
          claimName: gluster-prometheus
      - name: config-volume
        configMap:
          name: prometheus-server-conf
      - name: alert-volume
        configMap:
          name: alertmanager
Prepare the email settings that Alertmanager needs for sending alerts:
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: kube-system
data:
  config.yml: |-
    global:
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'xxxx@163.com'
      smtp_auth_username: 'xxxx@163.com'
      smtp_auth_password: 'xxxx'
    templates:
    - '/root/alertmanager/template/*.tmpl'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 10m
      receiver: default-receiver
    receivers:
    - name: 'default-receiver'
      email_configs:
      - to: 'xxxx@xxx.com'
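The alert rules attach a `team: node` label, which the configuration here does not use: every alert goes to `default-receiver`. If you later want to route on that label, a child route could look roughly like the sketch below. The receiver name `node-team` and its address are hypothetical, and newer Alertmanager releases prefer the `matchers` syntax over `match`.

```yaml
# Fragment of config.yml. The "node-team" receiver and its address are
# hypothetical placeholders, not part of the original setup.
route:
  receiver: default-receiver
  group_by: ['alertname', 'cluster', 'service']
  routes:
  - match:
      team: node            # label set by the alert rules
    receiver: node-team
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'xxxx@xxx.com'
- name: 'node-team'
  email_configs:
  - to: 'node-team@example.com'
```

Alerts that match no child route still fall through to the top-level `receiver`, so existing behavior for other alerts stays unchanged.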
Note: SMTP must be enabled in the 163 mailbox settings. Otherwise Alertmanager fails to authenticate and logs errors like these:
level=error ts=2018-04-03T03:39:32.793284112Z caller=notify.go:303 component=dispatcher msg="Error on notify" err="*notify.loginAuth failed: 550 User has no permission"
level=error ts=2018-04-03T03:39:32.793463167Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="*notify.loginAuth failed: 550 User has no permission"
Then create the ConfigMaps and the Deployment to roll everything out.