A Minimal Hands-On Guide to Prometheus Monitoring

1. Preface and Contents

Monitoring is a key part of observability in today's cloud-native era. Compared with what came before, a great deal has changed: microservices, containerization, and similar technologies keep appearing, systems evolve and ship at a much faster pace, the volume of monitoring data has grown dramatically, and so have the demands on real-time visibility. Prometheus emerged in response to these changes; its capabilities, its excellent fit with cloud-native workloads, and the ease with which it integrates third-party open-source components make it one of the brightest stars in this space.

This article focuses on how to build a monitoring system with Prometheus, covering exporters (probes), metric configuration, visualization, alerting, and container monitoring. It is an introductory tutorial and does not yet cover the Pushgateway, Kubernetes clusters, and related topics. For Prometheus fundamentals and concepts, please search on your own; this article concentrates on the hands-on process.

Contents:

  1. Deploy the Prometheus Server
  2. Deploy the monitoring exporters
  3. Deploy Grafana
  4. Deploy AlertManager
  5. Deploy PrometheusAlert
  6. Configure alerting rules

2. Deploying the Prometheus Server

This section covers deploying the Prometheus Server with Docker, with the relevant configuration files mapped into the container.

2.1 Preparing the environment

  1. Create the directories and grant permissions
    sudo mkdir -pv /data/docker/prometheus/{data,alert_rules,job}
    sudo chown -R myusername:myusername /data/docker/prometheus/
    Where:
  • the data directory stores the data Prometheus produces
  • the alert_rules directory stores the Prometheus alerting rule files
  • the job directory stores the JSON files that describe scrape targets
  • replace myusername with your actual username
  2. Run the following command to avoid "permission denied" errors

    sudo chown 65534:65534 -R /data/docker/prometheus/data
  3. Copy the configuration below to /data/docker/prometheus/prometheus.yml. Note the parts of the file that reference "$ip"; after later steps, such as adding AlertManager, remember to come back and update them.

    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            # - targets: ["$ip:9093"]
            # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - /etc/prometheus/alert_rules/*.rules

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        file_sd_configs:
          - files:
              - /etc/prometheus/job/prometheus.json
            refresh_interval: 1m # re-read the target file every minute

      # Node host group
      - job_name: 'host'
        #basic_auth:
        #  username: prometheus
        #  password: prometheus
        file_sd_configs:
          - files:
              - /etc/prometheus/job/host.json
            refresh_interval: 1m

      # cAdvisor container group
      - job_name: 'cadvisor'
        file_sd_configs:
          - files:
              - /etc/prometheus/job/cadvisor.json
            refresh_interval: 1m

      # MySQL exporter group
      - job_name: 'mysqld-exporter'
        file_sd_configs:
          - files:
              - /etc/prometheus/job/mysqld-exporter.json
            refresh_interval: 1m

      # blackbox ping group
      - job_name: 'blackbox_ping'
        scrape_interval: 5s
        scrape_timeout: 2s
        metrics_path: /probe
        params:
          module: [ping]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/ping/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

      # blackbox HTTP GET 2xx group
      - job_name: 'blackbox_http_2xx'
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [http_2xx]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/http_2xx/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

      - job_name: "blackbox_tcp"
        metrics_path: /probe
        params:
          module: [tcp_connect]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/tcp/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

      - job_name: 'blackbox_ssh_banner'
        metrics_path: /probe
        params:
          module: [ssh_banner]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/ssh_banner/*.json
            refresh_interval: 1m
        relabel_configs:
          # Ensure port is 22, pass as URL parameter
          - source_labels: [__address__]
            regex: (.*?)(:.*)?
            replacement: ${1}:22
            target_label: __param_target
          # Make instance label the target
          - source_labels: [__param_target]
            target_label: instance
          # Actually talk to the blackbox exporter though
          - target_label: __address__
            replacement: $ip:9115

      - job_name: "blackbox_dns"
        metrics_path: /probe
        params:
          module: [dns_udp]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/dns/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

2.2 Starting the server

docker run -itd  \
-p 9090:9090 \
-v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules \
-v /data/docker/prometheus/job:/etc/prometheus/job \
-v /data/docker/prometheus/data:/data/prometheus/ \
-v /etc/timezone:/etc/timezone:ro \
-v /etc/localtime:/etc/localtime:ro \
--name prometheus \
--restart=always \
prom/prometheus:v2.28.1 \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus/ \
--storage.tsdb.retention.time=30d \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle

Once the container is up, open http://$ip:9090 in a browser to reach the Prometheus web UI.
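
Because the server is started with --web.enable-lifecycle, later configuration changes can be applied without recreating the container. The sketch below is one way to do that with the tooling already in the image; it reuses the paths from this section and assumes you replace $ip as usual:

# Validate the configuration with promtool (shipped inside the prom/prometheus image)
docker run --rm --entrypoint promtool \
-v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules:ro \
prom/prometheus:v2.28.1 check config /etc/prometheus/prometheus.yml

# Ask the running server to reload its configuration (enabled by --web.enable-lifecycle)
curl -X POST http://$ip:9090/-/reload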

If the system firewall is enabled, you may need to whitelist the following ports. Using CentOS 7 as an example:

sudo firewall-cmd --zone=public --add-port=9090/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --zone=public --add-port=3000/tcp --permanent
sudo firewall-cmd --reload

2.3 Reference documentation for deploying the Prometheus Server

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

3. Deploying the Monitoring Exporters

Unlike Zabbix, Prometheus mainly works in a pull model: the server reads monitoring data through the interfaces exposed by exporters. An exporter is responsible for collecting the data; think of it as a probe that serves its measurements over HTTP for the server to scrape. For the meaning of fields in exporter output that this article does not describe, please search on your own.

3.1 Deploying node_exporter

node_exporter monitors host metrics such as CPU, memory, disk, and I/O; its focus is data about the host system itself.

  1. Download node_exporter and extract it

Log in to the host to be monitored. node_exporter can be downloaded from the project's releases page (see the references at the end of this subsection),

or by running curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz (the -L flag follows GitHub's release redirect).

Once the download completes, run the following commands to extract the binary package:

tar xvfz node_exporter-1.2.0.linux-amd64.tar.gz
sudo mkdir -p /data/node_exporter/
sudo mv node_exporter-1.2.0.linux-amd64/* /data/node_exporter/
  2. Create a prometheus user

    sudo groupadd prometheus
    sudo useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
    sudo chown prometheus.prometheus -R /data/node_exporter/
  3. Create a systemd service

Create and edit the unit file

sudo nano /etc/systemd/system/node_exporter.service

and write the following content:

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
  4. Start node_exporter with systemctl
    Start the service and check that it is running properly
    sudo systemctl start node_exporter
    sudo systemctl status node_exporter

The status command should return output similar to the following:

● node_exporter.service - node_exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
 Main PID: 11050 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─11050 /usr/local/prometheus/node_exporter/node_exporter

Enable it at boot: sudo systemctl enable node_exporter

  5. Open the firewall whitelist
    Run curl localhost:9100; if a page is returned, node_exporter has started successfully.
    Running curl http://$ip:9100/ from another machine on the same network segment should return the same page.

If no page comes back, check whether the firewall port is still closed:

sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --reload
  6. Configure Prometheus
    Log in to the Prometheus server host and edit the following file:
    nano /data/docker/prometheus/job/host.json. Reference content is shown below; replace the IP addresses with your actual ones (a command-line check of the resulting targets follows after this list).

    [
      {
        "targets": [ "192.168.1.100:9100" ],
        "labels": {
          "subject": "node_exporter",
          "hostname": "server1"
        }
      },
      {
        "targets": [ "192.168.1.101:9100" ],
        "labels": {
          "subject": "node_exporter",
          "hostname": "server2"
        }
      }
    ]
  7. Reference documentation for deploying node_exporter
    https://github.com/prometheus/node_exporter
    https://prometheus.io/docs/guides/node-exporter/
    https://www.jianshu.com/p/7bec152d1a1f
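
Once file_sd re-reads host.json (within the 1m refresh_interval), the new hosts should appear on http://$ip:9090/targets. The command-line check mentioned above queries the same information through Prometheus's HTTP API; jq is only used for readability and is an extra assumption here:

curl -s http://$ip:9090/api/v1/targets | \
jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'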

3.2 Deploying mysqld-exporter

mysqld-exporter monitors MySQL performance and related metrics.

  1. Log in to the host running MySQL and start the exporter with Docker
    docker run -d \
    -p 9104:9104 \
    --link mysql \
    --name mysqld-exporter \
    --restart on-failure:5 \
    -e DATA_SOURCE_NAME="root:pwdpwdpwdpwdpwd@(mysql:3306)/" \
    prom/mysqld-exporter:v0.13.0

After it starts, visit http://127.0.0.1:9104/metrics to see the metrics; the endpoint should also be reachable from the Prometheus server.
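
The run command above connects as root for brevity. In practice the exporter usually gets its own MySQL account with just the privileges it needs; the sketch below follows the grants suggested in the mysqld_exporter README, and the account name and password are placeholders of my own choosing:

# Create a least-privilege account inside the linked "mysql" container
docker exec -i mysql mysql -uroot -p'pwdpwdpwdpwdpwd' <<'SQL'
CREATE USER 'exporter'@'%' IDENTIFIED BY 'choose-a-strong-password' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
SQL

Afterwards, point DATA_SOURCE_NAME at this account instead of root, e.g. "exporter:choose-a-strong-password@(mysql:3306)/".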

  2. Reference documentation for deploying mysqld-exporter
    https://github.com/prometheus/mysqld_exporter
    https://registry.hub.docker.com/r/prom/mysqld-exporter/

3.3 Deploying cAdvisor

cAdvisor monitors the state of containers.

  1. Log in to the Docker host and start cAdvisor with the following script (a quick endpoint check is shown after this list)
    docker run \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=9101:8080 \
    --detach=true \
    --name=cadvisor \
    --restart on-failure:5 \
    --privileged \
    --device=/dev/kmsg \
    gcr.io/cadvisor/cadvisor:v0.38.6

You may come across two cAdvisor images, gcr.io/cadvisor/cadvisor and google/cadvisor; gcr.io/cadvisor/cadvisor is the recommended one.

  2. Configure the Prometheus server
    Log in to the Prometheus server host and edit the file with nano /data/docker/prometheus/job/cadvisor.json; reference content:

    [
      {
        "targets": [ "192.168.1.100:9101" ],
        "labels": {
          "subject": "cadvisor",
          "hostname": "server1"
        }
      },
      {
        "targets": [ "192.168.1.101:9101" ],
        "labels": {
          "subject": "cadvisor",
          "hostname": "server2"
        }
      }
    ]
  3. If the Docker host runs a firewall, remember to whitelist the port

    sudo firewall-cmd --zone=public --add-port=9101/tcp --permanent
    sudo firewall-cmd --reload
  4. Reference documentation for deploying cAdvisor
    https://github.com/google/cadvisor
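
As mentioned in step 1, it is worth confirming that cAdvisor is actually serving container metrics before pointing Prometheus at it. A quick check from the Docker host (port 9101 comes from the --publish mapping above):

curl -s http://localhost:9101/metrics | grep -m 5 '^container_'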

3.4 Deploying blackbox_exporter

blackbox_exporter probes targets from the outside (black-box monitoring), for example over ICMP, TCP, HTTP, and DNS.

  1. Create the configuration file
    Log in to the Prometheus server host and run the following commands
    sudo mkdir -p /data/docker/blackbox/conf
    sudo chown -R myusername:myusername /data/docker/blackbox

Then create and edit this file:

nano /data/docker/blackbox/conf/blackbox.yml

A sample blackbox.yml follows:

modules:
  ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip4"
      ip_protocol_fallback: false  # no fallback to "ip6"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      preferred_ip_protocol: "ip4"
  http_post_2xx_json:
    prober: http
    timeout: 30s
    http:
      preferred_ip_protocol: "ip4"
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key1":"value1","params":{"param2":"value2"}}'
  http_basic_auth:
    prober: http
    timeout: 60s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"

  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect:
    prober: tcp
    timeout: 5s

  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true

  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: SSH-2.0-blackbox-ssh-check

  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"

  dns_udp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: "www.example.cn"
      query_type: "A"
  2. Configure Prometheus
    Still on the Prometheus server host, run the following commands
    sudo mkdir -p /data/docker/prometheus/job/blackbox/

    sudo mkdir -pv /data/docker/prometheus/job/blackbox/{dns,http_2xx,ping,ssh_banner,tcp}
    sudo chown -R myusername:myusername /data/docker/prometheus/job/blackbox/

Next, create the JSON files below in the corresponding subdirectories of /data/docker/prometheus/job/blackbox/, using the samples as a reference.

In the dns directory, create dns.json; sample:

[
  {
    "targets": [ "192.168.1.1" ],
    "labels": {
      "subject": "blackbox_dns",
      "app": "my_dns"
    }
  }
]

In the http_2xx directory, create search-site.json; sample:

[
  {
    "targets": [ "https://www.google.cn/?HealthCheck" ],
    "labels": {
      "app": "google",
      "subject": "blackbox_http_2xx",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "https://cn.bing.com/?HealthCheck" ],
    "labels": {
      "app": "bing",
      "subject": "blackbox_http_2xx",
      "hostname": "server-02"
    }
  }
]

In the ping directory, create search-site.json; sample:

[
  {
    "targets": [ "www.google.cn" ],
    "labels": {
      "app": "google",
      "subject": "blackbox_ping",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "cn.bing.com" ],
    "labels": {
      "app": "bing",
      "subject": "blackbox_ping",
      "hostname": "server-02"
    }
  }
]

In the ssh_banner directory, create ssh-banner.json; sample:

[
  {
    "targets": [ "192.168.1.100:22" ],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "192.168.1.101:22" ],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-02"
    }
  }
]

In the tcp directory, create tcp.json; sample:

[
  {
    "targets": [ "$ip:3306" ],
    "labels": {
      "app": "mysql.example.cn",
      "subject": "blackbox_tcp",
      "hostname": "mysql"
    }
  }
]
  3. Run blackbox_exporter

On the Prometheus server host, run the following command to start blackbox_exporter in a container

docker run -d \
--restart on-failure:5 \
-p 9115:9115 \
-v /data/docker/blackbox/conf/blackbox.yml:/config/blackbox.yml:ro \
--name blackbox_exporter \
prom/blackbox-exporter:v0.19.0 \
--config.file=/config/blackbox.yml

After it starts, visit http://$ip:9090/targets to see the data reported by every probe configured so far; their State should be UP.
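
You can also exercise a module directly against the exporter, independently of Prometheus. The sketch below runs the http_2xx module against one of the hosts used in the samples above and filters for the success flag:

curl -s "http://localhost:9115/probe?module=http_2xx&target=https://cn.bing.com" | grep probe_success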

  4. Reference documentation for deploying blackbox_exporter
    https://github.com/prometheus/blackbox_exporter
    https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/exporter/commonly-eporter-usage/install_blackbox_exporter

4. Deploying Grafana

Next we deploy the visualization tool Grafana. Grafana integrates with Prometheus quickly, and by configuring dashboards, or simply importing ready-made templates, it turns the collected data into graphical pages.

4.1 Startup

Run the following commands to prepare for startup

sudo mkdir -p /data/docker/grafana
sudo chown 472:472 /data/docker/grafana -R

Run Grafana with Docker

docker run -d \
-p 3000:3000 \
-v /data/docker/grafana:/var/lib/grafana \
-v /etc/localtime:/etc/localtime:ro \
--restart=always \
--name grafana \
grafana/grafana:8.0.6

Once it is up, visit http://$ip:3000; the default username and password are admin / admin.

4.2 Configuration

  1. Configure the data source
    Click "Configuration -> Data sources" to reach http://$ip:3000/datasources, add a Prometheus data source, and fill in its settings (an API-based alternative is sketched after this list).

  2. Configure dashboards

Click "Dashboards -> Manage -> Import" to reach http://$ip:3000/dashboard/import and import Grafana dashboard templates. In the "Import via grafana.com" field, enter the ID of the template you want to import. Commonly used template IDs:

  • node exporter ID: 8919
  • Cadvisor ID: 14282
  • mysqld-exporter ID: 7362

You can also search for dashboard templates at https://grafana.com/grafana/dashboards, or build your own dashboards.
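
If you would rather script step 1 than click through the UI, Grafana's HTTP API can create the data source. A minimal sketch, assuming the default admin / admin credentials are still in place and that $ip is replaced by hand (it will not expand inside the single quotes):

curl -s -u admin:admin -H 'Content-Type: application/json' \
-X POST "http://$ip:3000/api/datasources" \
-d '{"name":"Prometheus","type":"prometheus","url":"http://$ip:9090","access":"proxy","isDefault":true}'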

4.3 Reference documentation for deploying Grafana

https://grafana.com/docs/grafana/latest/installation/docker/

5. Deploying AlertManager

So far we have deployed the Prometheus Server, the exporters, and Grafana for visualization. We still need an alerting component, so that when a failure occurs the monitoring system can notify the people on call through multiple channels and they can respond in time. Prometheus itself does not ship with a notification tool: based on preconfigured rules, it sends alerts to AlertManager, which processes them centrally and notifies recipients by email, SMS, WeChat, DingTalk, and so on. Like Grafana, AlertManager is not tied to Prometheus; it can also process alerts from other programs.

5.1 Preparation

Run the following commands

sudo mkdir -pv /data/docker/alertmanager
sudo chown -R myusername:myusername /data/docker/alertmanager/
cd /data/docker/alertmanager

In the /data/docker/alertmanager directory, create the files alertmanager.yml and email.tmpl.

A sample alertmanager.yml follows; be sure to fill in the SMTP settings and the ddurl in the webhook:

global:
  resolve_timeout: 5m
  # SMTP settings for email
  smtp_smarthost: 'smtp.gmail.com:465'
  smtp_from: 'example@gmail.com'
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: false

# Custom notification templates
templates:
  - '/etc/alertmanager/email.tmpl'

# route defines how alerts are dispatched
route:
  # label used to group alerts
  group_by: ['alertname']
  # wait 10s after a new alert arrives so alerts in the same group can be sent together
  group_wait: 10s
  # interval between two batches of alerts for the same group
  group_interval: 10s
  # interval for repeating an alert that is still firing, to reduce duplicate mail
  repeat_interval: 1h
  # default receiver
  receiver: 'myreceiver'
  routes: # optionally route specific groups to specific receivers
    - receiver: 'myreceiver'
      continue: true
      group_wait: 10s

receivers:
  - name: 'myreceiver'
    #send_resolved: true
    email_configs:
      # - to: 'example@gmail.com, example2@gmail.com'
      - to: 'example@gmail.com'
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "Prometheus [Warning] alert mail" }
    # DingTalk, via the PrometheusAlert webhook
    webhook_configs:
      - url: 'http://$ip:18080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx'

A sample email.tmpl follows. Note the "2006-01-02 15:04:05" in the sample: this is Go's reference time layout and must not be changed, or the alert timestamps may be rendered incorrectly:

{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.level }} <br>
Alert name: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.level }} <br>
Alert name: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}

{{- end }}

Both configuration files can be adjusted by following https://prometheus.io/docs/alerting/latest/configuration/. Note that the email template reads the "level" label, matching the rules defined in section 7.
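
Before starting the container, the configuration can be validated with amtool, which ships in the prom/alertmanager image; overriding the entrypoint as below is an assumption about the image layout that holds for this tag:

docker run --rm --entrypoint amtool \
-v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
-v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
prom/alertmanager:v0.22.2 check-config /etc/alertmanager/alertmanager.yml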

5.2 Starting AlertManager

Run the following command

docker run -d -p 9093:9093 \
-v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
-v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
--name alertmanager \
--restart=always \
prom/alertmanager:v0.22.2

5.3 Access

Once it is up, AlertManager can be reached at http://$ip:9093.
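
To verify the notification chain end to end before any real rule fires, you can push a hand-crafted alert into AlertManager's v2 API; the alert name and label values below are made up for the test:

curl -s -X POST "http://$ip:9093/api/v2/alerts" \
-H 'Content-Type: application/json' \
-d '[{"labels":{"alertname":"ManualTest","level":"warning","instance":"test-host"},"annotations":{"summary":"manual test alert","description":"sent by hand to verify email/DingTalk delivery"}}]'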

6. Deploying PrometheusAlert

As mentioned in the previous section, alerting with Prometheus involves several parts. AlertManager, which processes and delivers the notifications, is already in place; in this section we deploy PrometheusAlert, the webhook target that the alertmanager.yml above points at for DingTalk delivery, and in the next section we define the rules so that Prometheus can generate alerts and send them to AlertManager.

6.1 Preparation

Run the following commands

sudo mkdir -p /data/docker/prometheus-alert/conf
sudo chown -R myusername:myusername /data/docker/prometheus-alert/
nano /data/docker/prometheus-alert/conf/app.conf

Download https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/conf/app-example.conf and save it as /data/docker/prometheus-alert/conf/app.conf.

6.2 Startup

Run the following command to start PrometheusAlert

docker run -d --publish=18080:8080 \
-v /data/docker/prometheus-alert/conf/:/app/conf:ro \
-v /data/docker/prometheus-alert/db/:/app/db \
-v /data/docker/prometheus-alert/log/:/app/logs \
--name prometheusalert-center \
feiyu563/prometheus-alert:v-4.5.0

Once it is up, visit http://$ip:18080 to reach the PrometheusAlert UI; the username and password are set in app.conf.

If the system firewall is enabled, remember to open the port

sudo firewall-cmd --zone=public --add-port=18080/tcp --permanent
sudo firewall-cmd --reload

6.3 Configuration

  1. Configure the alert template

Click AlertTemplate to reach http://$ip:18080/template, which holds templates for the various third-party systems that can be integrated.
Taking the DingTalk alert template as an example, change its content to the following; the main goals are to fix the timestamps showing 8 hours behind and to add a little more information

{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus recovery notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Severity: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### End time: {{GetCSTtime $v.endsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{else}}
## [Prometheus alert]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Severity: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{end}}
{{ end }}
  2. Set up the DingTalk robot
    In DingTalk, create a new group, then click "Group settings -> Group assistant -> Add robot -> Custom -> Security settings" and add the IP address of the server that will send the messages; you will then receive a Webhook URL. See https://blog.csdn.net/knight_zhou/article/details/105583741 for reference.
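
    To check that the robot and its IP whitelist are set up correctly, you can post a plain text message straight to the Webhook URL; the access_token below is a placeholder:

    curl -s "https://oapi.dingtalk.com/robot/send?access_token=xxxxxx" \
    -H 'Content-Type: application/json' \
    -d '{"msgtype": "text", "text": {"content": "monitoring: DingTalk webhook test"}}'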

6.4 Reference documentation for deploying PrometheusAlert

https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/install.md

7. Configuring Alerting Rules

We still need to configure alerting rules on the Prometheus Server; the rule files are referenced in the rule_files section of prometheus.yml. The rule files are written in YAML. Following the setup from section 2.1, create a rule file in /data/docker/prometheus/alert_rules/ (give it a .rules suffix, for example node.rules, so that it matches the /etc/prometheus/alert_rules/*.rules glob) with the following content:

groups:
- name: Node_exporter Down
  rules:
  - alert: Instance down
    expr: up{job="host"} == 0
    for: 1m
    labels:
      level: Warning
    annotations:
      summary: "{{ $labels.job }}"
      address: "{{ $labels.instance }}"
      description: "The instance has been unreachable for 1 minute."

  - alert: CPU usage high (> 80%)
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 1m
    labels:
      level: Warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage is high"
      description: "{{ $labels.instance }}: CPU usage is above 80%, current value {{ $value }}"

  - alert: Memory usage high (> 80%)
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m # how long the condition must hold before the alert is sent to Alertmanager
    labels:
      level: Warning
    annotations:
      summary: "{{ $labels.instance }} memory usage is high"
      description: "{{ $labels.instance }}: memory usage is above 80%. Current value {{ $value }}"

  - alert: Memory pressure high (> 1000)
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) memory pressure is high"
      description: "{{ $labels.instance }}: the host is under heavy memory pressure (major page faults). Current value {{ $value }}"

  - alert: Host network interface receiving too much data (> 2MB/s)
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 2
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual inbound traffic"
      description: "{{ $labels.instance }}: network interfaces have been receiving too much data (> 2MB/s) for 5 minutes. Current inbound rate {{ $value }}MB per second."

  - alert: Host network interface sending too much data (> 2MB/s)
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 2
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual outbound traffic"
      description: "{{ $labels.instance }}: network interfaces have been sending too much data (> 2MB/s) for 3 minutes. Current outbound rate {{ $value }}MB per second."

  - alert: Disk read throughput high (> 50 MB/s)
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual disk read I/O"
      description: "{{ $labels.instance }}: disk reads look abnormal. Current value {{ $value }}MB per second"

  - alert: Disk write throughput high (> 50 MB/s)
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual disk write I/O"
      description: "{{ $labels.instance }}: disk writes look abnormal. Current value {{ $value }}MB per second"

  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: Disk space low (< 10% left)
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host is running out of disk"
      description: "{{ $labels.instance }}: less than about 10% of disk space is left. Currently {{ $value }}% available"

  - alert: Disk read latency high (> 100ms)
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) high disk read latency"
      description: "{{ $labels.instance }}: disk read latency is high (> 100ms). Current value {{ $value }}"

  - alert: Disk write latency high (> 100ms)
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) high disk write latency"
      description: "{{ $labels.instance }}: disk write latency is high (> 100ms). Current value {{ $value }}"

  # 1000 context switches is an arbitrary number.
  # Alert threshold depends on nature of application.
  # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
  #- alert: Context switching rate high (> 1500/s)
  #  expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1500
  #  for: 3m
  #  labels:
  #    level: warning
  #  annotations:
  #    summary: "(instance {{ $labels.instance }}) high context switching"
  #    description: "{{ $labels.instance }}: context switching rate is high (> 1500/s). Current value {{ $value }}"

  - alert: Host swap usage high (> 80%)
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host swap warning"
      description: "{{ $labels.instance }}: swap usage has reached > 80%. Current value {{ $value }}"

  - alert: Systemd-managed service down
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 0m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) a systemd service is down"
      description: "{{ $labels.instance }}: systemd unit {{ $labels.name }} is in the failed state"

  - alert: Physical host temperature high (> 75°C)
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host temperature warning"
      description: "{{ $labels.instance }}: physical host temperature is abnormal (> 75°C), current value {{ $value }}"

  - alert: Physical node temperature critical alarm
    expr: node_hwmon_temp_crit_alarm_celsius == 1
    for: 0m
    labels:
      level: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) mainboard temperature alarm"
      description: "{{ $labels.instance }}: the mainboard temperature is too high, current value {{ $value }}"

  - alert: Host receiving network errors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host is receiving errored packets"
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has hit {{ printf \"%.0f\" $value }} receive errors in the last five minutes"

  - alert: Host transmitting network errors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host is transmitting errored packets"
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has hit {{ printf \"%.0f\" $value }} transmit errors in the last five minutes"

  - alert: TCP connect time too long
    expr: probe_duration_seconds{job="blackbox_tcp"} > 5
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) TCP connect time over 5 seconds"
      description: "TCP connect time is over 5 seconds, current value {{ $value }}"

  - alert: Host TCP connection count high
    expr: node_netstat_Tcp_CurrEstab > 800
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) too many TCP connections"
      description: "{{ $labels.instance }}: too many established TCP connections detected (> 800), current value {{ $value }}"

  - alert: TCP connections waiting to close > 4000
    expr: node_sockstat_TCP_tw > 4000
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) TCP connections waiting to close > 4000"
      description: "{{ $labels.instance }}: too many TCP connections waiting to close, current value {{ $value }}"

  - alert: Clock skew detected
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) clock skew detected"
      description: "{{ $labels.instance }}: clock skew detected, the clock is out of sync, current value {{ $value }}"

  - alert: Container stopped
    expr: time() - container_last_seen > 300
    for: 0m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) a container may have stopped"
      description: "Container {{ $labels.name }} has not been seen for {{ $value }} seconds and may have stopped running"

  # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
  # If you want to exclude it from this alert, exclude the series having an empty name: container_cpu_usage_seconds_total{name!=""}
  - alert: Container CPU usage high
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container CPU usage high"
      description: "{{ $labels.instance }}: container CPU usage > 300%, current value {{ $value }}%"

  # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
  #- alert: Container memory usage high
  #  expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 85
  #  for: 2m
  #  labels:
  #    level: warning
  #  annotations:
  #    summary: "(instance {{ $labels.instance }}) container memory usage high"
  #    description: "{{ $labels.instance }}: container {{ $labels.name }} memory usage > 85%, current value {{ $value }}%"

  - alert: Container volume usage high
    expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container volume usage high"
      description: "{{ $labels.instance }}: container {{ $labels.name }} volume usage > 80%, current value {{ $value }}%"

  - alert: Container volume I/O usage high
    expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container volume I/O high"
      description: "{{ $labels.instance }}: container {{ $labels.name }} volume I/O usage > 80%, current value {{ $value }}%"

  - alert: Container CPU throttling high
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container is being throttled"
      description: "{{ $labels.instance }}: container {{ $labels.name }} is being CPU-throttled heavily"

  - alert: Blackbox probe failed
    expr: probe_success == 0
    for: 5m
    labels:
      level: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) black-box probe found a problem"
      description: "Job {{ $labels.job }} reported a failed probe"

  - alert: Blackbox slow probe
    expr: avg_over_time(probe_duration_seconds[1m]) > 15
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) black-box probe is slow"
      description: "Blackbox took more than {{ $value }} seconds to complete, {{ $labels }}"

  - alert: Blackbox ping too slow
    expr: probe_duration_seconds{job="blackbox_ping"} > 5
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) Blackbox ping is taking too long"
      description: "Ping took more than 5 seconds, {{ $value }}, {{ $labels }}"

  - alert: Blackbox HTTP probe failed
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 2m
    labels:
      level: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) Blackbox HTTP probe failed"
      description: "HTTP status code is not 200-399, {{ $value }}, {{ $labels }}"

  - alert: SSL certificate expires in 30 days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 60m
    labels:
      level: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "The SSL certificate expires in 30 days, {{ $value }}, {{ $labels }}"

  - alert: SSL certificate expires in 3 days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 60m
    labels:
      level: critical
    annotations:
      summary: Blackbox SSL certificate will expire very soon (instance {{ $labels.instance }})
      description: "The SSL certificate expires in 3 days, {{ $value }}, {{ $labels }}"

  - alert: SSL certificate expired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 60m
    labels:
      level: critical
    annotations:
      summary: Blackbox SSL certificate has expired (instance {{ $labels.instance }})
      description: "The SSL certificate has expired, {{ $value }}, {{ $labels }}"

  - alert: Blackbox slow HTTP probe
    expr: avg_over_time(probe_http_duration_seconds[1m]) > 3
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Blackbox detected slow HTTP (instance {{ $labels.instance }})"
      description: "The HTTP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"

  - alert: Blackbox slow ICMP probe
    expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 3
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Blackbox detected slow ICMP (instance {{ $labels.instance }})"
      description: "The ICMP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"

  - alert: DNS server down
    expr: probe_dns_answer_rrs == 0
    for: 1m
    labels:
      level: Warning
    annotations:
      summary: "DNS server down"
      description: "The DNS server has not answered for 1 minute and may be down."

After the rule file is in place, restart the Prometheus Server (or POST to its /-/reload endpoint, since --web.enable-lifecycle is enabled); the rules can then be viewed at http://$ip:9090/rules. It is worth reading up on the alert states (inactive, pending, firing) on your own.
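
Before reloading, the rule file can be checked with promtool; the file name node.rules is just the example name suggested above. A sketch:

# Check rule syntax with the promtool binary in the Prometheus image
docker run --rm --entrypoint promtool \
-v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules:ro \
prom/prometheus:v2.28.1 check rules /etc/prometheus/alert_rules/node.rules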

Reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/