1. Introduction and Table of Contents
Monitoring is a key part of observability in the cloud-native era. Compared with the past, a great deal has changed: technologies such as microservices and containerization keep emerging, and the pace of evolution is extremely fast. As a result, the volume of monitoring data has grown enormously, and so have the real-time requirements. Prometheus emerged to meet these changes; its capabilities, its excellent fit with cloud-native environments, and the ease with which it integrates third-party open-source components make it one of the brightest stars in this space.
This article focuses on building a monitoring system with Prometheus, covering exporters (probes), metric configuration, visualization, alerting, container monitoring, and more. It is an introductory tutorial and does not yet cover topics such as Pushgateway or Kubernetes clusters. For Prometheus fundamentals and concepts, please search on your own; this article concentrates on the hands-on process.
Table of contents:
- Deploy the Prometheus Server
- Deploy the monitoring exporters
- Deploy Grafana
- Deploy AlertManager
- Deploy PrometheusAlert
- Configure alerting rules

2. Deploying the Prometheus Server
This section describes how to deploy the Prometheus Server with Docker, mapping the relevant configuration items into the container.
2.1 Preparing the environment
Create the directories and grant permissions:

```bash
sudo mkdir -pv /data/docker/prometheus/{data,alert_rules,job}
sudo chown -R myusername:myusername /data/docker/prometheus/
```
Here:
- the data directory stores the data produced by Prometheus
- the alert_rules directory stores the Prometheus alerting rule files
- the job directory stores the JSON files that describe the scrape targets
- replace myusername with your actual username

Run the following command to avoid permission denied errors (the Prometheus container runs as UID/GID 65534):
```bash
sudo chown 65534:65534 -R /data/docker/prometheus/data
```
Copy the following configuration file into the target directory as /data/docker/prometheus/prometheus.yml. Pay attention to the parts of the file that involve "$ip": after later steps, such as adding AlertManager, remember to come back and update them.
```yaml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        # - targets: ["$ip:9093"]
        # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /etc/prometheus/alert_rules/*.rules

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/prometheus.json
        refresh_interval: 1m # how often the target file is re-read

  # node exporter host group
  - job_name: 'host'
    #basic_auth:
    #  username: prometheus
    #  password: prometheus
    file_sd_configs:
      - files:
          - /etc/prometheus/job/host.json
        refresh_interval: 1m

  # cadvisor container group
  - job_name: 'cadvisor'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/cadvisor.json
        refresh_interval: 1m

  # mysql exporter group
  - job_name: 'mysqld-exporter'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/mysqld-exporter.json
        refresh_interval: 1m

  # blackbox ping group
  - job_name: 'blackbox_ping'
    scrape_interval: 5s
    scrape_timeout: 2s
    metrics_path: /probe
    params:
      module: [ping]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/ping/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115

  # blackbox http get 2xx group
  - job_name: 'blackbox_http_2xx'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/http_2xx/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115

  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/tcp/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115

  - job_name: 'blackbox_ssh_banner'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/ssh_banner/*.json
        refresh_interval: 1m
    relabel_configs:
      # Ensure port is 22, pass as URL parameter
      - source_labels: [__address__]
        regex: (.*?)(:.*)?
        replacement: ${1}:22
        target_label: __param_target
      # Make instance label the target
      - source_labels: [__param_target]
        target_label: instance
      # Actually talk to the blackbox exporter though
      - target_label: __address__
        replacement: $ip:9115

  - job_name: "blackbox_dns"
    metrics_path: /probe
    params:
      module: [dns_udp]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/dns/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115
```
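Before starting the server, the configuration can be validated with promtool, which is bundled in the prom/prometheus image. A minimal sketch, assuming the paths created above; note that promtool also inspects the rule and target files referenced by the configuration, so it is most useful once those files exist:

```bash
# Validate prometheus.yml (and any rule files it references) without starting the server.
docker run --rm \
  -v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules:ro \
  -v /data/docker/prometheus/job:/etc/prometheus/job:ro \
  --entrypoint promtool \
  prom/prometheus:v2.28.1 \
  check config /etc/prometheus/prometheus.yml
```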
2.2 Starting the server

```bash
docker run -itd \
  -p 9090:9090 \
  -v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules \
  -v /data/docker/prometheus/job:/etc/prometheus/job \
  -v /data/docker/prometheus/data:/data/prometheus/ \
  -v /etc/timezone:/etc/timezone:ro \
  -v /etc/localtime:/etc/localtime:ro \
  --name prometheus \
  --restart=always \
  prom/prometheus:v2.28.1 \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle
```
Once it is up, open http://$ip:9090 in a browser to see the Prometheus web UI.
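Because --web.enable-lifecycle is passed above, later changes to prometheus.yml or the rule files can be applied without recreating the container. A quick sketch, run on the Prometheus host:

```bash
# Ask Prometheus to re-read its configuration and rule files.
curl -X POST http://localhost:9090/-/reload

# Sending SIGHUP to the process achieves the same thing.
docker kill --signal=HUP prometheus
```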
If the firewall is enabled on the host, you may need to open the following ports. Taking CentOS 7 as an example:
```bash
sudo firewall-cmd --zone=public --add-port=9090/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --zone=public --add-port=3000/tcp --permanent
sudo firewall-cmd --reload
```
2.3 Reference documentation for deploying the Prometheus Server
https://prometheus.io/docs/prometheus/latest/configuration/configuration/
3. Deploying the monitoring exporters
Unlike Zabbix, Prometheus mainly works in a pull model: the server reads monitoring data from the HTTP endpoints exposed by exporters. An exporter is responsible for collecting the data — think of it as a probe — and serves it over HTTP for the server to scrape. For the meaning of the individual fields returned by each exporter not described here, please look them up yourself.
3.1 Deploying node_exporter
node_exporter monitors host-level information such as CPU, memory, disk, and I/O. Its focus is on collecting data about the host system itself.
Download and unpack node_exporter
Log in to the host to be monitored and download node_exporter from the GitHub releases page, or run:

```bash
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz
```
After the download completes, unpack the tarball:
```bash
tar xvfz node_exporter-1.2.0.linux-amd64.tar.gz
sudo mkdir -p /data/node_exporter/
sudo mv node_exporter-1.2.0.linux-amd64/* /data/node_exporter/
```
Create a prometheus user
```bash
sudo groupadd prometheus
sudo useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
sudo chown -R prometheus:prometheus /data/node_exporter/
```
Create a systemd service
Create and edit the unit file:
```bash
sudo nano /etc/systemd/system/node_exporter.service
```
with the following content:
```ini
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/data/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
```
Start node_exporter with systemctl
Reload systemd, start the service, and check that it is running:

```bash
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl status node_exporter
```
You should see output similar to the following:
```
● node_exporter.service - node_exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
 Main PID: 11050 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─11050 /usr/local/prometheus/node_exporter/node_exporter
```
Enable the service at boot: `sudo systemctl enable node_exporter`
Open the firewall
Run `curl localhost:9100` on the monitored host; if a page comes back, node_exporter has started successfully. From another machine on the same network segment, `curl http://$ip:9100/` should return the same page. If it does not, check whether the firewall port is still closed:
```bash
sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --reload
```
Configure Prometheus
Log in to the Prometheus server host and edit `nano /data/docker/prometheus/job/host.json`. The content looks like the following; replace the IP addresses with your actual ones:
```json
[
  {
    "targets": ["192.168.1.100:9100"],
    "labels": {
      "subject": "node_exporter",
      "hostname": "server1"
    }
  },
  {
    "targets": ["192.168.1.101:9100"],
    "labels": {
      "subject": "node_exporter",
      "hostname": "server2"
    }
  }
]
```
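Because the host job uses file_sd_configs with a 1m refresh_interval, the new targets are picked up automatically within about a minute. One way to confirm from the command line is the query API; a small sketch (replace $ip as usual):

```bash
# Query the `up` series for the host job; a value of 1 means the target is scraped successfully.
curl -s -G "http://$ip:9090/api/v1/query" \
  --data-urlencode 'query=up{job="host"}'
```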
Reference documentation for deploying node_exporter:
https://github.com/prometheus/node_exporter
https://prometheus.io/docs/guides/node-exporter/
https://www.jianshu.com/p/7bec152d1a1f
3.2 Deploying mysqld-exporter
mysqld-exporter monitors MySQL performance and related metrics.
Log in to the host running MySQL and start the exporter with Docker:

```bash
docker run -d \
  -p 9104:9104 \
  --link mysql \
  --name mysqld-exporter \
  --restart on-failure:5 \
  -e DATA_SOURCE_NAME="root:pwdpwdpwdpwdpwd@(mysql:3306)/" \
  prom/mysqld-exporter:v0.13.0
```
After it starts, visit http://127.0.0.1:9104/metrics to see the metrics; the endpoint should also be reachable from the Prometheus server. Remember to create /data/docker/prometheus/job/mysqld-exporter.json on the Prometheus server as well, in the same format as host.json, pointing at the exporter's address on port 9104.
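Using the MySQL root account, as in the example above, works but is heavy-handed. The mysqld_exporter README recommends a dedicated, least-privileged account; a sketch (the user name and password here are placeholders):

```bash
# Create a restricted MySQL account for the exporter; run this with any client that can reach the server.
mysql -uroot -p -e "
  CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter_password' WITH MAX_USER_CONNECTIONS 3;
  GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
  FLUSH PRIVILEGES;
"
```

The container can then be started with `-e DATA_SOURCE_NAME="exporter:exporter_password@(mysql:3306)/"` instead of the root credentials.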
Reference documentation for deploying mysqld-exporter:
https://github.com/prometheus/mysqld_exporter
https://registry.hub.docker.com/r/prom/mysqld-exporter/

3.3 Deploying cAdvisor
cAdvisor monitors the state of containers.
Log in to the Docker host and start cAdvisor with the following command:

```bash
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=9101:8080 \
  --detach=true \
  --name=cadvisor \
  --restart on-failure:5 \
  --privileged \
  --device=/dev/kmsg \
  gcr.io/cadvisor/cadvisor:v0.38.6
```
You may come across two cAdvisor images: gcr.io/cadvisor/cadvisor and google/cadvisor. Using gcr.io/cadvisor/cadvisor is recommended.
Configure the Prometheus server
Log in to the Prometheus server host and edit `nano /data/docker/prometheus/job/cadvisor.json` with content like the following:
```json
[
  {
    "targets": ["192.168.1.100:9101"],
    "labels": {
      "subject": "cadvisor",
      "hostname": "server1"
    }
  },
  {
    "targets": ["192.168.1.101:9101"],
    "labels": {
      "subject": "cadvisor",
      "hostname": "server2"
    }
  }
]
```
If the Docker host has a firewall enabled, remember to open the port:
```bash
sudo firewall-cmd --zone=public --add-port=9101/tcp --permanent
sudo firewall-cmd --reload
```
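Once the cAdvisor targets are being scraped, container metrics can be queried from Prometheus directly. Two illustrative queries (the metric names are standard cAdvisor ones; the name!="" filter drops cgroup series that do not correspond to a named container):

```bash
# Per-container CPU usage (in cores) averaged over the last 5 minutes.
curl -s -G "http://$ip:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name)'

# Per-container memory working set, in bytes.
curl -s -G "http://$ip:9090/api/v1/query" \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{name!=""}) by (name)'
```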
Reference documentation for deploying cAdvisor:
https://github.com/google/cadvisor
3.4 Deploying blackbox_exporter
blackbox_exporter probes targets from the outside (black-box monitoring), for example over ICMP, TCP, HTTP, and DNS.
Create the configuration file
Log in to the Prometheus server host and run:

```bash
sudo mkdir -p /data/docker/blackbox/conf
sudo chown -R myusername:myusername /data/docker/blackbox
```
Then create and edit the file:
```bash
nano /data/docker/blackbox/conf/blackbox.yml
```
A sample blackbox.yml:
```yaml
modules:
  ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip4"
      ip_protocol_fallback: false  # no fallback to "ip6"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      preferred_ip_protocol: "ip4"
  http_post_2xx_json:
    prober: http
    timeout: 30s
    http:
      preferred_ip_protocol: "ip4"
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key1":"value1","params":{"param2":"value2"}}'
  http_basic_auth:
    prober: http
    timeout: 60s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect:
    prober: tcp
    timeout: 5s
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: SSH-2.0-blackbox-ssh-check
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  dns_udp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: "www.example.cn"
      query_type: "A"
```
Configure Prometheus
Still on the Prometheus server host, run:

```bash
sudo mkdir -p /data/docker/prometheus/job/blackbox/
sudo mkdir -pv /data/docker/prometheus/job/blackbox/{dns,http_2xx,ping,ssh_banner,tcp}
sudo chown -R myusername:myusername /data/docker/prometheus/job/blackbox/
```
Next, create a JSON file in each corresponding folder under /data/docker/prometheus/job/blackbox/ and fill it in following the samples below.
In the dns folder, create dns.json, for example:
```json
[
  {
    "targets": ["192.168.1.1"],
    "labels": {
      "subject": "blackbox_dns",
      "app": "my_dns"
    }
  }
]
```
In the http_2xx folder, create search-site.json, for example:
```json
[
  {
    "targets": ["https://www.google.cn/?HealthCheck"],
    "labels": {
      "app": "google",
      "subject": "blackbox_http_2xx",
      "hostname": "server-01"
    }
  },
  {
    "targets": ["https://cn.bing.com/?HealthCheck"],
    "labels": {
      "app": "bing",
      "subject": "blackbox_http_2xx",
      "hostname": "server-02"
    }
  }
]
```
In the ping folder, create search-site.json, for example:
```json
[
  {
    "targets": ["www.google.cn"],
    "labels": {
      "app": "google",
      "subject": "blackbox_ping",
      "hostname": "server-01"
    }
  },
  {
    "targets": ["cn.bing.com"],
    "labels": {
      "app": "bing",
      "subject": "blackbox_ping",
      "hostname": "server-02"
    }
  }
]
```
In the ssh_banner folder, create ssh-banner.json, for example:
```json
[
  {
    "targets": ["192.168.1.100:22"],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-01"
    }
  },
  {
    "targets": ["192.168.1.101:22"],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-02"
    }
  }
]
```
In the tcp folder, create tcp.json, for example:
```json
[
  {
    "targets": ["$ip:3306"],
    "labels": {
      "app": "mysql.example.cn",
      "subject": "blackbox_tcp",
      "hostname": "mysql"
    }
  }
]
```
Run blackbox_exporter
On the Prometheus server host, start blackbox_exporter in a container:
```bash
docker run -d \
  --restart on-failure:5 \
  -p 9115:9115 \
  -v /data/docker/blackbox/conf/blackbox.yml:/config/blackbox.yml:ro \
  --name blackbox_exporter \
  prom/blackbox-exporter:v0.19.0 \
  --config.file=/config/blackbox.yml
```
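Before relying on the Prometheus targets page, the exporter can be exercised directly through its /probe endpoint; a quick sketch, reusing the example targets from the JSON files above:

```bash
# HTTP probe with the http_2xx module; probe_success 1 means the check passed.
curl -s -G 'http://localhost:9115/probe' \
  --data-urlencode 'module=http_2xx' \
  --data-urlencode 'target=https://cn.bing.com/?HealthCheck' | grep probe_success

# ICMP probe with the ping module.
curl -s -G 'http://localhost:9115/probe' \
  --data-urlencode 'module=ping' \
  --data-urlencode 'target=www.google.cn' | grep probe_success
```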
Once it is running, open http://$ip:9090/targets to see all the targets configured so far; their State should be UP.
Reference documentation for deploying blackbox_exporter:
https://github.com/prometheus/blackbox_exporter
https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/exporter/commonly-eporter-usage/install_blackbox_exporter

4. Deploying Grafana
Next, deploy the visualization tool Grafana. Grafana integrates with Prometheus out of the box and, with a little configuration or even ready-made templates, quickly turns the collected data into graphical dashboards.
4.1 Starting Grafana
Run the following commands to prepare:
```bash
sudo mkdir -p /data/docker/grafana
sudo chown 472:472 /data/docker/grafana -R
```
Run Grafana with Docker:
```bash
docker run -d \
  -p 3000:3000 \
  -v /data/docker/grafana:/var/lib/grafana \
  -v /etc/localtime:/etc/localtime:ro \
  --restart=always \
  --name grafana \
  grafana/grafana:8.0.6
```
Once it is up, visit http://$ip:3000. The default username and password are admin / admin.
4.2 Configuration
Configure the data source
Click "Configuration -> Data sources" (http://$ip:3000/datasources), add a Prometheus data source, and fill in its settings, pointing it at http://$ip:9090.
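If you prefer configuration as code, the data source can also be provisioned from a file instead of through the UI. A minimal sketch, assuming you create an extra host directory (the path below is illustrative) and mount it into the container:

```bash
# Write a Grafana provisioning file that registers Prometheus as the default data source.
sudo mkdir -p /data/docker/grafana-provisioning/datasources
cat <<'EOF' | sudo tee /data/docker/grafana-provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://$ip:9090   # replace $ip with the real Prometheus address
    isDefault: true
EOF
# Then add this volume to the grafana container and restart it:
#   -v /data/docker/grafana-provisioning/datasources:/etc/grafana/provisioning/datasources
```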
Configure dashboards
Click "Dashboards -> Manage -> Import" (http://$ip:3000/dashboard/import) to import a Grafana dashboard template. In the "Import via grafana.com" field, enter the ID of the template you want. Commonly used IDs:
- node_exporter: 8919
- cAdvisor: 14282
- mysqld-exporter: 7362

You can also search for dashboard templates at https://grafana.com/grafana/dashboards, or build your own dashboards.
4.3 Reference documentation for deploying Grafana
https://grafana.com/docs/grafana/latest/installation/docker/
5. Deploying AlertManager
So far we have deployed the Prometheus Server, the exporters, and Grafana for visualization. We still need an alerting component, so that when a failure occurs the monitoring system can notify the right people through various channels in time. Prometheus itself does not ship with a notification tool: based on preconfigured rules it sends alerts to AlertManager, which handles them centrally and notifies receivers via email, SMS, WeChat, DingTalk, and so on. Like Grafana, AlertManager is not limited to Prometheus; it can also process alerts from other programs.
5.1 Preparation
Run the following commands:
```bash
sudo mkdir -pv /data/docker/alertmanager
sudo chown -R myusername:myusername /data/docker/alertmanager/
cd /data/docker/alertmanager
```
In the /data/docker/alertmanager folder, create the files alertmanager.yml and email.tmpl.
A sample alertmanager.yml follows; be sure to fill in the SMTP settings and the DingTalk webhook URL (ddurl) in the webhook config:
```yaml
global:
  resolve_timeout: 5m
  # SMTP settings for email
  smtp_smarthost: 'smtp.gmail.com:465'
  smtp_from: 'example@gmail.com'
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: false

# Custom notification templates
templates:
  - '/etc/alertmanager/email.tmpl'

# The route block controls how alerts are dispatched
route:
  # Label used to group alerts
  group_by: ['alertname']
  # How long to wait after a new group appears, so that alerts of the same group are sent together
  group_wait: 10s
  # Interval between two notification batches for the same group
  group_interval: 10s
  # Interval before a still-firing alert is repeated, to reduce duplicate mails
  repeat_interval: 1h
  # Default receiver
  receiver: 'myreceiver'
  routes:
    # Sub-routes can direct specific groups to specific receivers
    - receiver: 'myreceiver'
      continue: true
      group_wait: 10s

receivers:
  - name: 'myreceiver'
    #send_resolved: true
    email_configs:
      # - to: 'example@gmail.com, example2@gmail.com'
      - to: 'example@gmail.com'
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "Prometheus [Warning] alert mail" }
    # DingTalk via the PrometheusAlert webhook
    webhook_configs:
      - url: 'http://$ip:18080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx'
```
A sample email.tmpl follows. Note the literal "2006-01-02 15:04:05" in the sample: this is Go's reference time layout and must not be changed, otherwise the displayed alert times may be wrong:
```
{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.level }} <br>
Alert name: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.level }} <br>
Alert name: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}
{{- end }}
```
Both files can be adjusted with reference to https://prometheus.io/docs/alerting/latest/configuration/.
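Before starting the container, the configuration can be validated with amtool, which ships inside the prom/alertmanager image; a sketch:

```bash
# Check alertmanager.yml (and the template it references) for errors without starting AlertManager.
docker run --rm \
  -v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
  -v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
  --entrypoint amtool \
  prom/alertmanager:v0.22.2 \
  check-config /etc/alertmanager/alertmanager.yml
```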
5.2 Starting AlertManager
Run the following command:
```bash
docker run -d -p 9093:9093 \
  -v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
  -v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
  --name alertmanager \
  --restart=always \
  prom/alertmanager:v0.22.2
```
5.3 Access
Once it is up, the AlertManager UI is available at http://$ip:9093. Now go back to prometheus.yml, uncomment the targets line under the alerting section so that it reads `- targets: ["$ip:9093"]` (with the real address), and reload Prometheus.
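To check that mail and the DingTalk webhook are wired up correctly before any real alert fires, a hand-made alert can be posted to AlertManager's v2 API; a sketch (all label values are arbitrary test data):

```bash
# Inject a fake alert; it should appear in the AlertManager UI and trigger the configured receivers.
curl -s -X POST "http://$ip:9093/api/v2/alerts" \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "TestAlert",
          "level": "warning",
          "instance": "test-host",
          "app": "test-app"
        },
        "annotations": {
          "summary": "Test alert",
          "description": "Manually injected alert to verify the notification pipeline"
        }
      }]'
```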
6. Deploying PrometheusAlert
The AlertManager configuration in the previous section points its webhook at port 18080. That endpoint is PrometheusAlert, an open-source notification gateway that receives alerts from AlertManager (and other sources) and forwards them to channels such as DingTalk. This section deploys and configures it.
6.1 Preparation
Run the following commands:
```bash
sudo mkdir -p /data/docker/prometheus-alert/conf
sudo chown -R myusername:myusername /data/docker/prometheus-alert/
nano /data/docker/prometheus-alert/conf/app.conf
```
Download https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/conf/app-example.conf and save it as /data/docker/prometheus-alert/conf/app.conf.
6.2 Starting
Run the following command to start PrometheusAlert:
```bash
docker run -d --publish=18080:8080 \
  -v /data/docker/prometheus-alert/conf/:/app/conf:ro \
  -v /data/docker/prometheus-alert/db/:/app/db \
  -v /data/docker/prometheus-alert/log/:/app/logs \
  --name prometheusalert-center \
  feiyu563/prometheus-alert:v-4.5.0
```
Once it is running, visit http://$ip:18080 to open the PrometheusAlert UI. The username and password are the ones set in app.conf.
If the firewall is enabled, remember to open the port:
```bash
sudo firewall-cmd --zone=public --add-port=18080/tcp --permanent
sudo firewall-cmd --reload
```
6.3 Configuration
Configure the alert templates
Click AlertTemplate (http://$ip:18080/template); here you will find templates for the various third-party systems that can be integrated. Taking the DingTalk template as an example, change its content to the following; the main changes are fixing the 8-hour offset in the displayed times and adding a few more fields:
```
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus resolved]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Severity: {{$v.labels.level}}
###### Started at: {{GetCSTtime $v.startsAt}}
###### Ended at: {{GetCSTtime $v.endsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{else}}
## [Prometheus alert]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Severity: {{$v.labels.level}}
###### Started at: {{GetCSTtime $v.startsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{end}}
{{ end }}
```
Set up the DingTalk robot
In DingTalk, create a new group and click "Group settings -> Group assistant -> Add robot -> Custom -> Security settings", then add the IP address of the server that will send the messages. After that you will get a Webhook URL. See https://blog.csdn.net/knight_zhou/article/details/105583741 for reference.

6.4 Reference documentation for deploying PrometheusAlert
https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/install.md
7. Configuring alerting rules
Finally, we need to configure alerting rules on the Prometheus Server. The rule files are referenced in the rule_files section of prometheus.yml. Rules are written in YAML, but note that the file name must match the *.rules glob configured in section 2.1, so create a file such as node.rules (not node.yml) in /data/docker/prometheus/alert_rules/ with the following content:
```yaml
groups:
  - name: Node_exporter Down
    rules:
      - alert: Instance down
        expr: up{job="host"} == 0
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.job }}"
          address: "{{ $labels.instance }}"
          description: "The instance has been unreachable for 1 minute."

      - alert: High CPU usage (> 80%)
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} high CPU usage"
          description: "{{ $labels.instance }}: CPU usage is above 80%, current value {{ $value }}"

      - alert: High memory usage (> 80%)
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m # how long the condition must hold before the alert is sent to AlertManager
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} high memory usage"
          description: "{{ $labels.instance }}: memory usage is above 80%, current value {{ $value }}"

      - alert: High memory pressure (> 1000 major page faults/s)
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high memory pressure"
          description: "{{ $labels.instance }}: the host is under heavy memory pressure, current value {{ $value }}"

      - alert: Unusual network receive throughput (> 2 MB/s)
        expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 2
        for: 5m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual inbound traffic"
          description: "{{ $labels.instance }}: network interfaces are receiving too much data (> 2 MB/s), current inbound rate {{ $value }} MB/s"

      - alert: Unusual network transmit throughput (> 2 MB/s)
        expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 2
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual outbound traffic"
          description: "{{ $labels.instance }}: network interfaces are sending too much data (> 2 MB/s), current outbound rate {{ $value }} MB/s"

      - alert: Unusual disk read rate (> 50 MB/s)
        expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual disk read rate"
          description: "{{ $labels.instance }}: disks are reading a lot of data, current value {{ $value }} MB/s"

      - alert: Unusual disk write rate (> 50 MB/s)
        expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual disk write rate"
          description: "{{ $labels.instance }}: disks are writing a lot of data, current value {{ $value }} MB/s"

      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
      - alert: Low disk space (< 10% left)
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) low disk space"
          description: "{{ $labels.instance }}: about 10% of disk space is left, currently {{ $value }}% free"

      - alert: High disk read latency (> 100 ms)
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high disk read latency"
          description: "{{ $labels.instance }}: disk read latency is high (> 100 ms), current value {{ $value }}"

      - alert: High disk write latency (> 100 ms)
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high disk write latency"
          description: "{{ $labels.instance }}: disk write latency is high (> 100 ms), current value {{ $value }}"

      # 1000 context switches is an arbitrary number.
      # Alert threshold depends on nature of application.
      # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
      #- alert: Too many context switches (> 1500/s)
      #  expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1500
      #  for: 3m
      #  labels:
      #    level: warning
      #  annotations:
      #    summary: "(instance {{ $labels.instance }}) too many context switches"
      #    description: "{{ $labels.instance }}: context switch rate is high (> 1500/s), current value {{ $value }}"

      - alert: High swap usage (> 80%)
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high swap usage"
          description: "{{ $labels.instance }}: swap usage is above 80%, current value {{ $value }}"

      - alert: Systemd-managed service failed
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 0m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) a systemd service is down"
          description: "{{ $labels.instance }}: a service managed by systemd has entered the failed state"

      - alert: Physical host temperature high (> 75°C)
        expr: node_hwmon_temp_celsius > 75
        for: 5m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) physical host temperature alert"
          description: "{{ $labels.instance }}: host temperature is abnormal (> 75°C), current value {{ $value }}"

      - alert: Physical node temperature alarm
        expr: node_hwmon_temp_crit_alarm_celsius == 1
        for: 0m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) motherboard temperature alarm"
          description: "{{ $labels.instance }}: the motherboard temperature is too high, current value {{ $value }}"

      - alert: Network receive errors
        expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) network interface received error packets"
          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last five minutes"

      - alert: Network transmit errors
        expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) network interface sent error packets"
          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last five minutes"

      - alert: TCP connection taking too long
        expr: probe_duration_seconds{job="blackbox_tcp"} > 5
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) TCP connection takes more than 5 seconds"
          description: "TCP connection time is above 5 seconds, current value {{ $value }}"

      - alert: Too many established TCP connections (> 800)
        expr: node_netstat_Tcp_CurrEstab > 800
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) too many TCP connections"
          description: "{{ $labels.instance }}: more than 800 established TCP connections detected, current value {{ $value }}"

      - alert: Too many TCP sockets in TIME_WAIT (> 4000)
        expr: node_sockstat_TCP_tw > 4000
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) too many TCP connections waiting to close (> 4000)"
          description: "{{ $labels.instance }}: too many TCP sockets in TIME_WAIT, current value {{ $value }}"

      - alert: Clock skew detected
        expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) clock skew detected"
          description: "{{ $labels.instance }}: clock skew detected, the clock is out of sync, current value {{ $value }}"

      - alert: Container stopped
        expr: time() - container_last_seen > 300
        for: 0m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) a container may have stopped"
          description: "Container {{ $labels.name }} may have stopped running ({{ $value }})"

      # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
      # If you want to exclude it from this alert, exclude the serie having an empty name: container_cpu_usage_seconds_total{name!=""}
      - alert: High container CPU usage (> 300%)
        expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 300
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high container CPU usage"
          description: "{{ $labels.instance }}: container CPU usage is above 300%, current value {{ $value }}%"

      # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
      #- alert: High container memory usage (> 85%)
      #  expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 85
      #  for: 2m
      #  labels:
      #    level: warning
      #  annotations:
      #    summary: "(instance {{ $labels.instance }}) high container memory usage"
      #    description: "{{ $labels.instance }}: container {{ $labels.name }} memory usage is above 85%, current value {{ $value }}%"

      - alert: High container volume usage (> 80%)
        expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high container volume usage"
          description: "{{ $labels.instance }}: container {{ $labels.name }} volume usage is above 80%, current value {{ $value }}%"

      - alert: High container volume IO usage (> 80%)
        expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high container volume IO usage"
          description: "{{ $labels.instance }}: container {{ $labels.name }} volume IO usage is above 80%, current value {{ $value }}%"

      - alert: Container CPU throttled
        expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container CPU is being throttled"
          description: "{{ $labels.instance }}: container {{ $labels.name }} CPU is being throttled by its CFS quota"

      - alert: Blackbox probe failed
        expr: probe_success == 0
        for: 5m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox probe failed"
          description: "The probe in job {{ $labels.job }} is failing"

      - alert: Blackbox slow probe
        expr: avg_over_time(probe_duration_seconds[1m]) > 15
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox probe is slow"
          description: "The blackbox probe took {{ $value }} seconds to complete, {{ $labels }}"

      - alert: Blackbox slow ping
        expr: probe_duration_seconds{job="blackbox_ping"} > 5
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox ping is slow"
          description: "Ping time is above 5 seconds, {{ $value }}, {{ $labels }}"

      - alert: Blackbox HTTP probe failed
        expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
        for: 2m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox HTTP probe failed"
          description: "HTTP status code is not in 200-399, {{ $value }}, {{ $labels }}"

      - alert: SSL certificate expires in 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 60m
        labels:
          level: warning
        annotations:
          summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "The SSL certificate expires in 30 days, {{ $value }}, {{ $labels }}"

      - alert: SSL certificate expires in 3 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
        for: 60m
        labels:
          level: critical
        annotations:
          summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "The SSL certificate expires in 3 days, {{ $value }}, {{ $labels }}"

      - alert: SSL certificate expired
        expr: probe_ssl_earliest_cert_expiry - time() <= 0
        for: 60m
        labels:
          level: critical
        annotations:
          summary: "Blackbox SSL certificate has expired (instance {{ $labels.instance }})"
          description: "The SSL certificate has expired, {{ $value }}, {{ $labels }}"

      - alert: Blackbox slow HTTP probe
        expr: avg_over_time(probe_http_duration_seconds[1m]) > 3
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "Blackbox detected a slow HTTP probe (instance {{ $labels.instance }})"
          description: "The HTTP request took more than 3 seconds, current value {{ $value }}, target {{ $labels.instance }}"

      - alert: Blackbox slow ICMP probe
        expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 3
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "Blackbox detected a slow ICMP probe (instance {{ $labels.instance }})"
          description: "The ICMP request took more than 3 seconds, current value {{ $value }}, target {{ $labels.instance }}"

      - alert: DNS server down
        expr: probe_dns_answer_rrs == 0
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "DNS server down"
          description: "The DNS server has not answered for 1 minute and may be down."
```
After the rule file is in place, restart (or reload) the Prometheus Server; the rules can then be viewed on the http://$ip:9090/rules page. You may also want to read up on the alert states (inactive, pending, firing) on your own.
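Because --web.enable-lifecycle was enabled in section 2.2, a full restart is not strictly required: the rules can be checked and then reloaded in place. A sketch using the promtool bundled in the Prometheus image (the file name matches the node.rules example above):

```bash
# Validate the rule file before loading it.
docker run --rm \
  -v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules:ro \
  --entrypoint promtool \
  prom/prometheus:v2.28.1 \
  check rules /etc/prometheus/alert_rules/node.rules

# Tell the running server to pick up the new rules.
curl -X POST http://localhost:9090/-/reload
```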
Reference documentation:
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/