prometheus + confd + etcd auto-discovery
-
Architecture
- All Prometheus configuration files are generated by confd from data read out of etcd
- Collection uses node-exporter, kafka-exporter, mysql-exporter, etc.; on startup each exporter calls the cmdb API to write its own data into etcd
- codoon-alert interacts with etcd to configure rules, alert silencing, and so on
-
Main configuration file
global:
  scrape_interval: 10s     # scrape interval
  scrape_timeout: 10s      # scrape timeout
  evaluation_interval: 15s # rule evaluation interval
alerting:
  alertmanagers:
  - scheme: http
    timeout: 10s
    api_version: v1
    static_configs:
    - targets:
      - 127.0.0.1:9093
rule_files:
- /codoon/prometheus/etc/rules/rule_*.yml
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - 127.0.0.1:9090
- job_name: codoon_ops
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - files:
    - /codoon/prometheus/etc/targets/target_*.json
    refresh_interval: 20s # interval for re-reading the target files
-
Prometheus start command
/codoon/prometheus/prometheus --web.enable-lifecycle \
  --config.file=/codoon/prometheus/etc/prometheus.yml \
  --storage.tsdb.path=/codoon/prometheus

nohup ./prometheus --web.enable-lifecycle --config.file=./etc/prometheus.yml \
  --storage.tsdb.path=/codoon/prometheus --web.external-url=xxx.com/ \
  > prometheus.log 2>&1 &
-
confd configuration files
# Service discovery
# conf.d/discovery_host.toml
[template]
src = "discovery_host.tmpl"
dest = "/codoon/prometheus/etc/targets/target_host.json"
mode = "0777"
keys = [
  "/prometheus/discovery/host",
]
reload_cmd = "curl -XPOST 'http://127.0.0.1:9090/-/reload'"

# templates/discovery_host.tmpl
[
{{- range $index, $info := getvs "/prometheus/discovery/host/*" -}}
{{- $data := json $info -}}
{{- if ne $index 0 }},{{- end }}
  {
    "targets": [ "{{$data.address}}" ],
    "labels": {
      "instance": "{{$data.name}}"
      {{- if $data.labels -}}
      {{- range $data.labels -}}
      ,"{{.key}}": "{{.val}}"
      {{- end}}
      {{- end}}
    }
  }
{{- end }}
]

# Rule distribution
# conf.d/rule_host.toml
[template]
src = "rule_host.tmpl"
dest = "/codoon/prometheus/etc/rules/rule_host.yml"
mode = "0777"
keys = [
  "/prometheus/rule/host",
]
reload_cmd = "curl -XPOST 'http://127.0.0.1:9090/-/reload'"

# templates/rule_host.tmpl
groups:
- name: host
  rules:
  {{- range $info := getvs "/prometheus/rule/host/*"}}
  {{- $data := json $info}}
  {{- if $data.status}}
  - alert: {{$data.alert}}
    expr: {{$data.expr}}
    for: {{$data.for}}
    {{- if $data.labels}}
    labels:
    {{- range $data.labels}}
      {{.key}}: {{.val}}
    {{- end}}
    {{- end}}
    annotations:
    {{- if $data.summary}}
      summary: "{{$data.summary}}"
    {{- end}}
    {{- if $data.description}}
      description: "{{$data.description}}"
    {{- end}}
  {{- end }}
  {{- end }}
-
confd start command
/codoon/prometheus/confd-0.16.0-linux-amd64 -confdir /codoon/prometheus/confd/ \
  -backend etcdv3 -watch -node http://127.0.0.1:2379

nohup ./confd-0.16.0-linux-amd64 -confdir ./confd/ -backend etcdv3 -watch \
  -node http://127.0.0.1:2379 > confd.log 2>&1 &
-
Simulating service discovery
# The labels always include instance: <name>
etcdctl put /prometheus/discovery/host/test1 '{"name":"test1","address":"10.12.10.1:9091"}'

# Custom labels
etcdctl put /prometheus/discovery/host/test2 '{"name":"test2","address":"10.12.10.1:9092","labels":[{"key":"label1","val":"test1"},{"key":"label2","val":"test2"}]}'
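To make the rendering explicit, here is a minimal Python stand-in for the `discovery_host.tmpl` logic above (it is not the actual confd code): each etcd value becomes one file_sd entry, with `instance` set from `name` plus any custom labels.

```python
import json

def render_targets(etcd_values):
    """Mimic discovery_host.tmpl: turn etcd JSON values into file_sd targets."""
    out = []
    for raw in etcd_values:
        data = json.loads(raw)
        labels = {"instance": data["name"]}
        for lbl in data.get("labels", []):
            labels[lbl["key"]] = lbl["val"]
        out.append({"targets": [data["address"]], "labels": labels})
    return out

# The two values written by the etcdctl commands above
values = [
    '{"name":"test1","address":"10.12.10.1:9091"}',
    '{"name":"test2","address":"10.12.10.1:9092","labels":[{"key":"label1","val":"test1"},{"key":"label2","val":"test2"}]}',
]
print(json.dumps(render_targets(values), indent=2))
```

The printed structure is what confd writes to `/codoon/prometheus/etc/targets/target_host.json` before POSTing `/-/reload`.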
-
Simulating rule distribution
Note: rule_host.tmpl only renders entries whose status field is truthy, so include "status":true for a rule to actually appear in rule_host.yml.

etcdctl put /prometheus/rule/host/test1 '{"alert":"test1 is down","expr":"up == 0","for":"30s","summary":"s1","description":"d1"}'

# Custom labels
etcdctl put /prometheus/rule/host/test2 '{"alert":"test2 is down","expr":"up == 0","for":"1m","summary":"s1","description":"d1","labels":[{"key":"label1","val":"test1"},{"key":"label2","val":"test2"}]}'
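A Python stand-in for how `rule_host.tmpl` expands one etcd value into a rule block (again a sketch, not the Go template itself; it reproduces the template's guard that skips entries without a truthy `status`):

```python
import json

def render_rule(raw):
    """Mimic rule_host.tmpl for a single etcd value."""
    d = json.loads(raw)
    if not d.get("status"):
        return ""  # the template's {{if $data.status}} guard
    lines = [f"- alert: {d['alert']}", f"  expr: {d['expr']}", f"  for: {d['for']}"]
    if d.get("labels"):
        lines.append("  labels:")
        lines += [f"    {l['key']}: {l['val']}" for l in d["labels"]]
    lines.append("  annotations:")
    if d.get("summary"):
        lines.append(f'    summary: "{d["summary"]}"')
    if d.get("description"):
        lines.append(f'    description: "{d["description"]}"')
    return "\n".join(lines)

print(render_rule('{"status":true,"alert":"test1 is down","expr":"up == 0","for":"30s","summary":"s1"}'))
```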
-
alertmanager
nohup ./alertmanager-0.21.0.linux-amd64/alertmanager --config.file=alertmanager-0.21.0.linux-amd64/alertmanager.yml > alertmanager.log 2>&1 &
-
Common PromQL
# Rule entries stored in etcd

/prometheus/rule/host/nodata  # no data
{"status":true,"alert":"no data","expr":"up == 0","for":"5m","summary":"no data","description":"{{$labels.instance}} no data for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

/prometheus/rule/host/availcpult20  # CPU availability below 20%
{"status":true,"alert":"avail cpu lt 20%","expr":"avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (type,instance,env,ip) < 0.2","for":"5m","summary":"avail cpu lt 20%","description":"avail cpu lt 20% for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

/prometheus/rule/host/availmemlt20  # memory availability below 20%
{"status":true,"alert":"avail mem lt 20%","expr":"1-(node_memory_MemTotal_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_MemFree_bytes) /node_memory_MemTotal_bytes < 0.2","for":"5m","summary":"avail mem lt 20%","description":"avail mem lt 20% for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

/prometheus/rule/host/availdisklt20  # disk availability below 20%
{"status":true,"alert":"avail disk lt 20%","expr":"node_filesystem_avail_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container|.*kubelet\"} /node_filesystem_size_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container|.*kubelet\"} < 0.2","for":"5m","summary":"avail disk lt 20%","description":"mount: {{ $labels.mountpoint }} avail lt 20G for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

/prometheus/rule/host/load1toohigh  # 1-minute load
{"status":true,"alert":"load1 is too high","expr":"node_load1/2 > on(type,instance,env,ip) count(node_cpu_seconds_total{mode=\"system\"}) by (type,instance,env,ip)","for":"5m","summary":"load1 is too high","description":"load1 is too high for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

/prometheus/rule/host/useiopsgt80  # I/O utilization above 80%
{"status": true,"alert":"iops too high","expr":"rate(node_disk_io_time_seconds_total[5m]) > 0.8","for":"5m","summary":"iops too high","description":"iops too high for 5m, curr: {{ $value }}","labels":[{"key":"diyk","val":"diyv"}]}

# Memory usage % (Grafana variants, with $origin_prometheus/$job variables)
(1 - ((node_memory_MemFree_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} + node_memory_Buffers_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} + node_memory_Cached_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"}) / node_memory_MemTotal_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"})) * 100

((node_memory_MemTotal_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} - node_memory_MemFree_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} - node_memory_Buffers_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"} - node_memory_Cached_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"}) / (node_memory_MemTotal_bytes{origin_prometheus=~"$origin_prometheus",job=~"$job"})) * 100

# Alert rule summary

1-minute load greater than CPU core count, sustained 5m:
node_load1 > on(instance,ip) count(node_cpu_seconds_total{mode="system"}) by (instance,ip)

CPU availability below 20%, sustained 5m (per-mode CPU %):
avg(rate(node_cpu_seconds_total{mode="system"}[5m])) by (instance) * 100
avg(rate(node_cpu_seconds_total{mode="user"}[5m])) by (instance) * 100
avg(rate(node_cpu_seconds_total{mode="iowait"}[5m])) by (instance) * 100
avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100

Disk availability below 20% and free space below 20G, sustained 5m:
(node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*|.*docker-lib.*"} / node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*|.*docker-lib.*"} < 0.2) and node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint!~".*pod.*|.*docker-lib.*"} < 20*1024^3

Memory usage above 80%, sustained 5m:
(node_memory_MemTotal_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes

IOPS: reads above 1000 or writes above 200, sustained 5m:
rate(node_disk_reads_completed_total[5m]) > 1000 or rate(node_disk_writes_completed_total[5m]) > 200

NIC: total traffic over 1 hour (MB) and 5-minute rate (bits/s):
increase(node_network_receive_bytes_total[60m]) / 1024 / 1024
increase(node_network_transmit_bytes_total[60m]) / 1024 / 1024
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8
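The memory expressions above are a single used/available ratio and its complement; a quick numeric sanity check of that arithmetic (sample byte counts are made up):

```python
# Sample node_memory_* values (made up): 16 GiB total, 2 free, 1 buffers, 5 cached
total = 16 * 1024**3
free, buffers, cached = 2 * 1024**3, 1 * 1024**3, 5 * 1024**3

# (MemTotal - Cached - Buffers - MemFree) / MemTotal -> fraction used
used_ratio = (total - cached - buffers - free) / total
# 1 - used_ratio -> fraction available (what the "< 0.2" rules compare)
avail_ratio = 1 - used_ratio

print(round(used_ratio * 100))  # 50
```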
-
temp
{"status": true,"alert":"rw iops too high","expr":"rate(node_disk_io_time_seconds_total[5m]) > 0.8","for":"5m","summary":"iops too high","description":"iops too high for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxxx,xxxx,xxx"}]}

etcdctl put /prometheus/discovery/host/codoon-istio-master01 '{"name":"codoon-istio-master01","address":"10.10.16.73:9100","labels": [{"key":"type","val":"host"},{"key":"ip","val":"10.10.16.73"}]}'

etcdctl put /prometheus/rule/host/cpuavail20 '{"alert":"cpu avail less 20","expr":"avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) by (instance) < 0.2","for":"5m","summary":"avail less 20","description":"cpu avail less 20 for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxx"}]}'

etcdctl put /prometheus/rule/host/memuse80 '{"alert":"mem use gt 80","expr":"(node_memory_MemTotal_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes - node_memory_MemFree_bytes) /node_memory_MemTotal_bytes > 0.8","for":"5m","summary":"use gt 80","description":"mem use gt 80 for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxx"}]}'

etcdctl put /prometheus/rule/host/iopsth '{"alert":"rw iops too high","expr":"rate(node_disk_reads_completed_total[5m]) > 1000 or rate(node_disk_writes_completed_total[5m]) > 200","for":"5m","summary":"iops too high","description":"iops too high for 5m, curr: {{ $value }}","labels":[{"key":"receiver","val":"xxxx"}]}'

{
  "status": true,
  "alert": "avail disk lt 20%",
  "expr": "node_filesystem_avail_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container|.*kubelet\"} /node_filesystem_size_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container|.*kubelet\"} < 0.2 and node_filesystem_avail_bytes{fstype=~\"ext.*|xfs\",mountpoint!~\".*docker.*|.*pod.*|.*container|.*kubelet\"} < 50*1024^3",
  "for": "2m",
  "summary": "avail disk lt 20%",
  "description": "mount: {{ $labels.mountpoint }} avail lt 20% for 2m, curr: {{ $value }}",
  "labels": [{ "key": "severity", "val": "warnning" }]
}

etcdctl put /prometheus/rule/host/load1too2high '{"status":true,"alert":"load1 is too2 high","expr":"node_load1 > on(type,instance,env,ip) count(node_cpu_seconds_total{mode=\"system\"}) by (type,instance,env,ip) /1.5","for":"2m","summary":"load1 is too2 high","description":"load1 is too2 high for 2m, curr: {{ $value }}","labels":[{"key":"severity","val":"critical"}]}'
-
Startup script (systemd)
vim /usr/lib/systemd/system/prometheus.service

[Unit]
Description=prometheus
Documentation=codoon_ops
After=network.target

[Service]
EnvironmentFile=-/etc/sysconfig/prometheus
User=prometheus
ExecStart=/usr/local/prometheus/prometheus \
  --web.enable-lifecycle \
  --storage.tsdb.path=/codoon/prometheus/data \
  --config.file=/codoon/prometheus/etc/prometheus.yml \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url= $PROM_EXTRA_ARGS \
  --log.level=debug
Restart=on-failure
StartLimitInterval=1
RestartSec=3

[Install]
WantedBy=multi-user.target

systemctl daemon-reload
systemctl enable prometheus
-
docker
docker run --name promconfd -d \
  -v /codoon/prometheus/etc:/opt/prometheus/etc \
  -v /codoon/prometheus/data:/opt/prometheus/data \
  -v /codoon/prometheus/confd/etc:/opt/confd/etc \
  -p 9090:9090 \
  dockerhub.xxxx.com/prom/prometheus:v2.24.1
-
Deployment
prometheus + confd run in docker on prom-monitor
  tsdb data path:          /codoon/prometheus/data
  prometheus config path:  /codoon/prometheus/etc
  confd config path:       /codoon/prometheus/confd/etc
ops-etcd0|1|2 etcd service
  service auto-discovery:  /prometheus/discovery/host/*  /prometheus/discovery/db/*  ...
  rule distribution:       /prometheus/rule/host/*  ...
-
Notification policy
1. A warnning-level alert (the severity label is stored as "warnning") first waits 1 minute before being sent; if a critical-level alert of the same type appears in that window, the critical alert is sent immediately and the warnning alert is not sent.
2. warnning-level alerts are re-sent every 20 minutes.
3. critical-level alerts are re-sent every 10 minutes.
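The three rules above can be sketched as a small decision function (assumed semantics only, not the codoon-alert source; severity strings use the "warnning" spelling stored in etcd):

```python
from datetime import datetime, timedelta

# Rule 1: initial-send delay per severity; rules 2 and 3: repeat intervals
FIRST_DELAY = {"warnning": timedelta(minutes=1), "critical": timedelta(0)}
REPEAT = {"warnning": timedelta(minutes=20), "critical": timedelta(minutes=10)}

def should_send(severity, first_seen, last_sent, now, critical_active=False):
    """Decide whether an alert should be (re)sent at `now`."""
    if severity == "warnning" and critical_active:
        return False  # rule 1: a same-type critical alert suppresses the warnning one
    if last_sent is None:
        return now - first_seen >= FIRST_DELAY[severity]  # rule 1: wait before first send
    return now - last_sent >= REPEAT[severity]  # rules 2/3: per-severity repeat interval

t0 = datetime(2024, 1, 1)
print(should_send("warnning", t0, None, t0))  # False: still inside the 1-minute wait
```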
-
Silence configuration
Configured through opscenter; it works by filtering on labels and picking the best match.
Inhibition logic: within the same alert, a higher-severity firing automatically inhibits the lower-severity one; when the higher-severity alert resolves, the inhibition is lifted.
Silences are stored in ops-etcd under /prometheus/silencev2 and support regex matching on alert name, instance, IP, severity, etc., in the form alertname:instance:labels...

Add a silence (POST):
curl -X POST -H 'Content-Type: application/json' -d '{"sc_key":"tidb","sc_val":"instance:severity:alertname:tidb-(node|ssd-[0-9]+)warnning(load1.*|avail cpu.*)"}' codoon-alert.in.xxx.com:8875/backend/codoon_alert/api/v1/silence

Delete a silence (DELETE):
curl -X DELETE codoon-alert.in.xxx.com:8875/backend/codoon_alert/api/v1/silence/tidb

List silences (GET):
curl codoon-alert.in.xx.com:8875/backend/codoon_alert/api/v1/silence

View the alertconfig settings (GET):
curl codoon-alert.in.xxx.com:8875/backend/codoon_alert/api/v1/alertconfig?cfg_key=notice|wait|clear|reslove
{
  "data": {
    "apitmporcheckall": "instance:alertname:(nginx-api-tmp|apicheck(-[0-9])?)(.*)",
    "intwarnall": "instance:severity:alertname:integrationwarnning(.*)",
    "istio": "instance:severity:alertname:(codoon[0-9]+istio)warnning(load1.*)",
    "monitor_roy": "instance:severity:alertname:monitor_roywarnning(load1.*)",
    "testall": "instance:alertname:testall(.*)",
    "tidb": "instance:severity:alertname:tidb-(node|ssd-[0-9]+)warnning(load1.*|avail cpu.*)"
  },
  "description": "ok",
  "status": "OK"
}
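Judging from the examples above, a silence value looks like a colon-separated prefix of field names followed by one regex matched against the concatenated field values. That interpretation is an assumption (the real matching lives in codoon-alert); a sketch under it:

```python
import re

def silenced(sc_val, alert):
    """Assumed format: leading field names, then one regex over their
    concatenated values. `alert` is a dict of label values."""
    parts = sc_val.split(":")
    known = {"alertname", "instance", "severity"}  # assumed field vocabulary
    fields, i = [], 0
    while i < len(parts) and parts[i] in known:
        fields.append(parts[i])
        i += 1
    pattern = ":".join(parts[i:])  # rest of the value is the regex
    subject = "".join(alert.get(f, "") for f in fields)
    return re.fullmatch(pattern, subject) is not None

sc_val = "instance:severity:alertname:tidb-(node|ssd-[0-9]+)warnning(load1.*|avail cpu.*)"
print(silenced(sc_val, {"instance": "tidb-ssd-01",
                        "severity": "warnning",
                        "alertname": "load1 is too high"}))  # True
```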
-
Alert receiver configuration
Works the same way as silence configuration: labels are filtered and the best match wins. Label matching logic: the = and != matchers are checked first, then the =~ and !~ (regex) matchers. Receiver configuration is stored in ops-etcd under /prometheus/receiver.
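The matcher precedence described above can be sketched like this (assumed semantics, not the codoon-alert source): equality matchers are evaluated before regex matchers, and every matcher must hold.

```python
import re

def match(matchers, labels):
    """`matchers` is a list of (name, op, value); op is one of =, !=, =~, !~."""
    order = {"=": 0, "!=": 0, "=~": 1, "!~": 1}  # cheap equality checks first
    for name, op, value in sorted(matchers, key=lambda m: order[m[1]]):
        actual = labels.get(name, "")
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:
            return False
        if op == "!~" and re.fullmatch(value, actual) is not None:
            return False
    return True

matchers = [("alertname", "=~", "load1.*"), ("severity", "=", "warnning")]
print(match(matchers, {"severity": "warnning", "alertname": "load1 is too high"}))  # True
```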
-
Alert templates
Customized through opscenter. When more than 3 alerts fire at once the message is automatically collapsed, and a separate email containing the full alert details is sent as well. Template configuration is stored in ops-etcd under /prometheus/template.
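A sketch of the collapse behavior described above (the 3-alert threshold is from the note; the summary wording is hypothetical, not the actual template):

```python
def collapse(alerts, limit=3):
    """Collapse the message body when more than `limit` alerts fire at once;
    the full list goes out in a separate mail."""
    if len(alerts) <= limit:
        return "\n".join(alerts)
    hidden = len(alerts) - limit
    return "\n".join(alerts[:limit]) + f"\n... and {hidden} more (full list sent by mail)"

print(collapse(["a is down", "b is down", "c is down", "d is down"]))
```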
-
Other notes
- For targets with the label type=service, the alert receiver is looked up from cmdb by service name (service=xxx)
- To stop receiving resolved notifications, add resolved=no to the labels
- Pod cpu/mem alerts (pprof_type=memory/cpu) send pprof data along with the alert
- Service error/panic alerts (log_type: ERRO/PANIC) fetch the details from loki and include them in the message
- servicemap maps log names to services; it watches err_check/service_map