A Minimal Hands-On Guide to Prometheus Monitoring

1. Preface and Contents

Monitoring is a key part of observability in today's cloud-native era. Compared with what came before, a great deal has changed: microservices, containerization, and similar technologies keep appearing, systems evolve and ship at a much faster pace, the volume of monitoring data has grown dramatically, and so have the demands on real-time visibility. Prometheus emerged in response to these changes; its capabilities, its excellent fit with cloud-native workloads, and the ease with which it integrates third-party open-source components make it one of the brightest stars in this space.

This article focuses on how to build a monitoring system with Prometheus, covering exporters (probes), metric configuration, visualization, alerting, and container monitoring. It is an introductory tutorial and does not yet cover the Pushgateway, Kubernetes clusters, and related topics. For Prometheus fundamentals and concepts, please search on your own; this article concentrates on the hands-on process.

Contents:

  1. Deploy the Prometheus Server
  2. Deploy the monitoring exporters
  3. Deploy Grafana
  4. Deploy AlertManager
  5. Deploy PrometheusAlert
  6. Configure alerting rules

2. Deploying the Prometheus Server

This section covers deploying the Prometheus Server with Docker, with the relevant configuration files mapped into the container.

2.1 Preparing the environment

  1. Create the directories and grant permissions
    sudo mkdir -pv /data/docker/prometheus/{data,alert_rules,job}
    sudo chown -R myusername:myusername /data/docker/prometheus/
    Where:
  • the data directory stores the data Prometheus produces
  • the alert_rules directory stores the Prometheus alerting rule files
  • the job directory stores the JSON files that describe scrape targets
  • replace myusername with your actual username
  2. Run the following command to avoid "permission denied" errors

    sudo chown 65534:65534 -R /data/docker/prometheus/data
  3. Copy the configuration below to /data/docker/prometheus/prometheus.yml. Note the parts of the file that reference "$ip"; after later steps, such as adding AlertManager, remember to come back and update them.

    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            # - targets: ["$ip:9093"]
            # - alertmanager:9093

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - /etc/prometheus/alert_rules/*.rules

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus'
        file_sd_configs:
          - files:
              - /etc/prometheus/job/prometheus.json
            refresh_interval: 1m # re-read the target file every minute

      # Node host group
      - job_name: 'host'
        #basic_auth:
        #  username: prometheus
        #  password: prometheus
        file_sd_configs:
          - files:
              - /etc/prometheus/job/host.json
            refresh_interval: 1m

      # cAdvisor container group
      - job_name: 'cadvisor'
        file_sd_configs:
          - files:
              - /etc/prometheus/job/cadvisor.json
            refresh_interval: 1m

      # MySQL exporter group
      - job_name: 'mysqld-exporter'
        file_sd_configs:
          - files:
              - /etc/prometheus/job/mysqld-exporter.json
            refresh_interval: 1m

      # blackbox ping group
      - job_name: 'blackbox_ping'
        scrape_interval: 5s
        scrape_timeout: 2s
        metrics_path: /probe
        params:
          module: [ping]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/ping/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

      # blackbox HTTP GET 2xx group
      - job_name: 'blackbox_http_2xx'
        scrape_interval: 5s
        metrics_path: /probe
        params:
          module: [http_2xx]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/http_2xx/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

      - job_name: "blackbox_tcp"
        metrics_path: /probe
        params:
          module: [tcp_connect]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/tcp/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

      - job_name: 'blackbox_ssh_banner'
        metrics_path: /probe
        params:
          module: [ssh_banner]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/ssh_banner/*.json
            refresh_interval: 1m
        relabel_configs:
          # Ensure port is 22, pass as URL parameter
          - source_labels: [__address__]
            regex: (.*?)(:.*)?
            replacement: ${1}:22
            target_label: __param_target
          # Make instance label the target
          - source_labels: [__param_target]
            target_label: instance
          # Actually talk to the blackbox exporter though
          - target_label: __address__
            replacement: $ip:9115

      - job_name: "blackbox_dns"
        metrics_path: /probe
        params:
          module: [dns_udp]
        file_sd_configs:
          - files:
              - /etc/prometheus/job/blackbox/dns/*.json
            refresh_interval: 1m
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: $ip:9115

2.2 Starting the server

docker run -itd  \
-p 9090:9090 \
-v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules \
-v /data/docker/prometheus/job:/etc/prometheus/job \
-v /data/docker/prometheus/data:/data/prometheus/ \
-v /etc/timezone:/etc/timezone:ro \
-v /etc/localtime:/etc/localtime:ro \
--name prometheus \
--restart=always \
prom/prometheus:v2.28.1 \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus/ \
--storage.tsdb.retention.time=30d \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle

Once the container is up, open http://$ip:9090 in a browser to reach the Prometheus web UI.
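
Because the server is started with --web.enable-lifecycle, later configuration changes can be applied without recreating the container. The sketch below is one way to do that with the tooling already in the image; it reuses the paths from this section and assumes you replace $ip as usual:

# Validate the configuration with promtool (shipped inside the prom/prometheus image)
docker run --rm --entrypoint promtool \
-v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules:ro \
prom/prometheus:v2.28.1 check config /etc/prometheus/prometheus.yml

# Ask the running server to reload its configuration (enabled by --web.enable-lifecycle)
curl -X POST http://$ip:9090/-/reload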

If the system firewall is enabled, you may need to whitelist the following ports. Using CentOS 7 as an example:

sudo firewall-cmd --zone=public --add-port=9090/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --zone=public --add-port=3000/tcp --permanent
sudo firewall-cmd --reload

2.3 Reference documentation for deploying the Prometheus Server

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

3. Deploying the Monitoring Exporters

Unlike Zabbix, Prometheus mainly works in a pull model: the server reads monitoring data through the interfaces exposed by exporters. An exporter is responsible for collecting the data; think of it as a probe that serves its measurements over HTTP for the server to scrape. For the meaning of fields in exporter output that this article does not describe, please search on your own.

3.1 Deploying node_exporter

node_exporter monitors host metrics such as CPU, memory, disk, and I/O; its focus is data about the host system itself.

  1. Download node_exporter and extract it

Log in to the host to be monitored. node_exporter can be downloaded from the project's releases page (see the references at the end of this subsection),

or by running curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz (the -L flag follows GitHub's release redirect).

Once the download completes, run the following commands to extract the binary package:

tar xvfz node_exporter-1.2.0.linux-amd64.tar.gz
sudo mkdir -p /data/node_exporter/
sudo mv node_exporter-1.2.0.linux-amd64/* /data/node_exporter/
  2. Create a prometheus user

    sudo groupadd prometheus
    sudo useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
    sudo chown prometheus.prometheus -R /data/node_exporter/
  3. Create a systemd service

Create and edit the unit file

sudo nano /etc/systemd/system/node_exporter.service

and write the following content:

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
  4. Start node_exporter with systemctl
    Start the service and check that it is running properly
    sudo systemctl start node_exporter
    sudo systemctl status node_exporter

The status command should return output similar to the following:

● node_exporter.service - node_exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
 Main PID: 11050 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─11050 /usr/local/prometheus/node_exporter/node_exporter

Enable it at boot: sudo systemctl enable node_exporter

  5. Open the firewall whitelist
    Run curl localhost:9100; if a page is returned, node_exporter has started successfully.
    Running curl http://$ip:9100/ from another machine on the same network segment should return the same page.

If no page comes back, check whether the firewall port is still closed:

sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --reload
  6. Configure Prometheus
    Log in to the Prometheus server host and edit the following file:
    nano /data/docker/prometheus/job/host.json. Reference content is shown below; replace the IP addresses with your actual ones (a command-line check of the resulting targets follows after this list).

    [
      {
        "targets": [ "192.168.1.100:9100" ],
        "labels": {
          "subject": "node_exporter",
          "hostname": "server1"
        }
      },
      {
        "targets": [ "192.168.1.101:9100" ],
        "labels": {
          "subject": "node_exporter",
          "hostname": "server2"
        }
      }
    ]
  7. Reference documentation for deploying node_exporter
    https://github.com/prometheus/node_exporter
    https://prometheus.io/docs/guides/node-exporter/
    https://www.jianshu.com/p/7bec152d1a1f
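
Once file_sd re-reads host.json (within the 1m refresh_interval), the new hosts should appear on http://$ip:9090/targets. The command-line check mentioned above queries the same information through Prometheus's HTTP API; jq is only used for readability and is an extra assumption here:

curl -s http://$ip:9090/api/v1/targets | \
jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'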

3.2 Deploying mysqld-exporter

mysqld-exporter monitors MySQL performance and related metrics.

  1. Log in to the host running MySQL and start the exporter with Docker
    docker run -d \
    -p 9104:9104 \
    --link mysql \
    --name mysqld-exporter \
    --restart on-failure:5 \
    -e DATA_SOURCE_NAME="root:pwdpwdpwdpwdpwd@(mysql:3306)/" \
    prom/mysqld-exporter:v0.13.0

After it starts, visit http://127.0.0.1:9104/metrics to see the metrics; the endpoint should also be reachable from the Prometheus server.
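
The run command above connects as root for brevity. In practice the exporter usually gets its own MySQL account with just the privileges it needs; the sketch below follows the grants suggested in the mysqld_exporter README, and the account name and password are placeholders of my own choosing:

# Create a least-privilege account inside the linked "mysql" container
docker exec -i mysql mysql -uroot -p'pwdpwdpwdpwdpwd' <<'SQL'
CREATE USER 'exporter'@'%' IDENTIFIED BY 'choose-a-strong-password' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
FLUSH PRIVILEGES;
SQL

Afterwards, point DATA_SOURCE_NAME at this account instead of root, e.g. "exporter:choose-a-strong-password@(mysql:3306)/".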

  2. Reference documentation for deploying mysqld-exporter
    https://github.com/prometheus/mysqld_exporter
    https://registry.hub.docker.com/r/prom/mysqld-exporter/

3.3 Deploying cAdvisor

cAdvisor monitors the state of containers.

  1. Log in to the Docker host and start cAdvisor with the following script (a quick endpoint check is shown after this list)
    docker run \
    --volume=/:/rootfs:ro \
    --volume=/var/run:/var/run:ro \
    --volume=/sys:/sys:ro \
    --volume=/var/lib/docker/:/var/lib/docker:ro \
    --volume=/dev/disk/:/dev/disk:ro \
    --publish=9101:8080 \
    --detach=true \
    --name=cadvisor \
    --restart on-failure:5 \
    --privileged \
    --device=/dev/kmsg \
    gcr.io/cadvisor/cadvisor:v0.38.6

You may come across two cAdvisor images, gcr.io/cadvisor/cadvisor and google/cadvisor; gcr.io/cadvisor/cadvisor is the recommended one.

  2. Configure the Prometheus server
    Log in to the Prometheus server host and edit the file with nano /data/docker/prometheus/job/cadvisor.json; reference content:

    [
      {
        "targets": [ "192.168.1.100:9101" ],
        "labels": {
          "subject": "cadvisor",
          "hostname": "server1"
        }
      },
      {
        "targets": [ "192.168.1.101:9101" ],
        "labels": {
          "subject": "cadvisor",
          "hostname": "server2"
        }
      }
    ]
  3. If the Docker host runs a firewall, remember to whitelist the port

    sudo firewall-cmd --zone=public --add-port=9101/tcp --permanent
    sudo firewall-cmd --reload
  4. Reference documentation for deploying cAdvisor
    https://github.com/google/cadvisor
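
As mentioned in step 1, it is worth confirming that cAdvisor is actually serving container metrics before pointing Prometheus at it. A quick check from the Docker host (port 9101 comes from the --publish mapping above):

curl -s http://localhost:9101/metrics | grep -m 5 '^container_'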

3.4 Deploying blackbox_exporter

blackbox_exporter probes targets from the outside (black-box monitoring), for example over ICMP, TCP, HTTP, and DNS.

  1. Create the configuration file
    Log in to the Prometheus server host and run the following commands
    sudo mkdir -p /data/docker/blackbox/conf
    sudo chown -R myusername:myusername /data/docker/blackbox

Then create and edit this file:

nano /data/docker/blackbox/conf/blackbox.yml

A sample blackbox.yml follows:

modules:
  ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip4"
      ip_protocol_fallback: false  # no fallback to "ip6"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      preferred_ip_protocol: "ip4"
  http_post_2xx_json:
    prober: http
    timeout: 30s
    http:
      preferred_ip_protocol: "ip4"
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key1":"value1","params":{"param2":"value2"}}'
  http_basic_auth:
    prober: http
    timeout: 60s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"

  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect:
    prober: tcp
    timeout: 5s

  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true

  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: SSH-2.0-blackbox-ssh-check

  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"

  dns_udp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: "www.example.cn"
      query_type: "A"
  2. Configure Prometheus
    Still on the Prometheus server host, run the following commands
    sudo mkdir -p /data/docker/prometheus/job/blackbox/

    sudo mkdir -pv /data/docker/prometheus/job/blackbox/{dns,http_2xx,ping,ssh_banner,tcp}
    sudo chown -R myusername:myusername /data/docker/prometheus/job/blackbox/

Next, create the JSON files below in the corresponding subdirectories of /data/docker/prometheus/job/blackbox/, using the samples as a reference.

In the dns directory, create dns.json; sample:

[
  {
    "targets": [ "192.168.1.1" ],
    "labels": {
      "subject": "blackbox_dns",
      "app": "my_dns"
    }
  }
]

In the http_2xx directory, create search-site.json; sample:

[
  {
    "targets": [ "https://www.google.cn/?HealthCheck" ],
    "labels": {
      "app": "google",
      "subject": "blackbox_http_2xx",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "https://cn.bing.com/?HealthCheck" ],
    "labels": {
      "app": "bing",
      "subject": "blackbox_http_2xx",
      "hostname": "server-02"
    }
  }
]

In the ping directory, create search-site.json; sample:

[
  {
    "targets": [ "www.google.cn" ],
    "labels": {
      "app": "google",
      "subject": "blackbox_ping",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "cn.bing.com" ],
    "labels": {
      "app": "bing",
      "subject": "blackbox_ping",
      "hostname": "server-02"
    }
  }
]

In the ssh_banner directory, create ssh-banner.json; sample:

[
  {
    "targets": [ "192.168.1.100:22" ],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "192.168.1.101:22" ],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-02"
    }
  }
]

In the tcp directory, create tcp.json; sample:

[
  {
    "targets": [ "$ip:3306" ],
    "labels": {
      "app": "mysql.example.cn",
      "subject": "blackbox_tcp",
      "hostname": "mysql"
    }
  }
]
  3. Run blackbox_exporter

On the Prometheus server host, run the following command to start blackbox_exporter in a container

docker run -d \
--restart on-failure:5 \
-p 9115:9115 \
-v /data/docker/blackbox/conf/blackbox.yml:/config/blackbox.yml:ro \
--name blackbox_exporter \
prom/blackbox-exporter:v0.19.0 \
--config.file=/config/blackbox.yml

After it starts, visit http://$ip:9090/targets to see the data reported by every probe configured so far; their State should be UP.
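
You can also exercise a module directly against the exporter, independently of Prometheus. The sketch below runs the http_2xx module against one of the hosts used in the samples above and filters for the success flag:

curl -s "http://localhost:9115/probe?module=http_2xx&target=https://cn.bing.com" | grep probe_success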

  4. Reference documentation for deploying blackbox_exporter
    https://github.com/prometheus/blackbox_exporter
    https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/exporter/commonly-eporter-usage/install_blackbox_exporter

4. Deploying Grafana

Next we deploy the visualization tool Grafana. Grafana integrates with Prometheus quickly, and by configuring dashboards, or simply importing ready-made templates, it turns the collected data into graphical pages.

4.1 Startup

Run the following commands to prepare for startup

sudo mkdir -p /data/docker/grafana
sudo chown 472:472 /data/docker/grafana -R

Run Grafana with Docker

docker run -d \
-p 3000:3000 \
-v /data/docker/grafana:/var/lib/grafana \
-v /etc/localtime:/etc/localtime:ro \
--restart=always \
--name grafana \
grafana/grafana:8.0.6

Once it is up, visit http://$ip:3000; the default username and password are admin / admin.

4.2 Configuration

  1. Configure the data source
    Click "Configuration -> Data sources" to reach http://$ip:3000/datasources, add a Prometheus data source, and fill in its settings (an API-based alternative is sketched after this list).

  2. Configure dashboards

Click "Dashboards -> Manage -> Import" to reach http://$ip:3000/dashboard/import and import Grafana dashboard templates. In the "Import via grafana.com" field, enter the ID of the template you want to import. Commonly used template IDs:

  • node exporter ID: 8919
  • Cadvisor ID: 14282
  • mysqld-exporter ID: 7362

You can also search for dashboard templates at https://grafana.com/grafana/dashboards, or build your own dashboards.
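
If you would rather script step 1 than click through the UI, Grafana's HTTP API can create the data source. A minimal sketch, assuming the default admin / admin credentials are still in place and that $ip is replaced by hand (it will not expand inside the single quotes):

curl -s -u admin:admin -H 'Content-Type: application/json' \
-X POST "http://$ip:3000/api/datasources" \
-d '{"name":"Prometheus","type":"prometheus","url":"http://$ip:9090","access":"proxy","isDefault":true}'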

4.3 Reference documentation for deploying Grafana

https://grafana.com/docs/grafana/latest/installation/docker/

5. Deploying AlertManager

So far we have deployed the Prometheus Server, the exporters, and Grafana for visualization. We still need an alerting component, so that when a failure occurs the monitoring system can notify the people on call through multiple channels and they can respond in time. Prometheus itself does not ship with a notification tool: based on preconfigured rules, it sends alerts to AlertManager, which processes them centrally and notifies recipients by email, SMS, WeChat, DingTalk, and so on. Like Grafana, AlertManager is not tied to Prometheus; it can also process alerts from other programs.

5.1 Preparation

Run the following commands

sudo mkdir -pv /data/docker/alertmanager
sudo chown -R myusername:myusername /data/docker/alertmanager/
cd /data/docker/alertmanager

In the /data/docker/alertmanager directory, create the files alertmanager.yml and email.tmpl.

A sample alertmanager.yml follows; be sure to fill in the SMTP settings and the ddurl in the webhook:

global:
  resolve_timeout: 5m
  # SMTP settings for email
  smtp_smarthost: 'smtp.gmail.com:465'
  smtp_from: 'example@gmail.com'
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: false

# Custom notification templates
templates:
  - '/etc/alertmanager/email.tmpl'

# route defines how alerts are dispatched
route:
  # label used to group alerts
  group_by: ['alertname']
  # wait 10s after a new alert arrives so alerts in the same group can be sent together
  group_wait: 10s
  # interval between two batches of alerts for the same group
  group_interval: 10s
  # interval for repeating an alert that is still firing, to reduce duplicate mail
  repeat_interval: 1h
  # default receiver
  receiver: 'myreceiver'
  routes: # optionally route specific groups to specific receivers
    - receiver: 'myreceiver'
      continue: true
      group_wait: 10s

receivers:
  - name: 'myreceiver'
    #send_resolved: true
    email_configs:
      # - to: 'example@gmail.com, example2@gmail.com'
      - to: 'example@gmail.com'
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "Prometheus [Warning] alert mail" }
    # DingTalk, via the PrometheusAlert webhook
    webhook_configs:
      - url: 'http://$ip:18080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx'

A sample email.tmpl follows. Note the "2006-01-02 15:04:05" in the sample: this is Go's reference time layout and must not be changed, or the alert timestamps may be rendered incorrectly:

{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.level }} <br>
Alert name: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}

{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts }}
=========start==========<br>
Alerting program: prometheus_alert <br>
Severity: {{ .Labels.level }} <br>
Alert name: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Description: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}

{{- end }}

Both configuration files can be adjusted by following https://prometheus.io/docs/alerting/latest/configuration/. Note that the email template reads the "level" label, matching the rules defined in section 7.
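
Before starting the container, the configuration can be validated with amtool, which ships in the prom/alertmanager image; overriding the entrypoint as below is an assumption about the image layout that holds for this tag:

docker run --rm --entrypoint amtool \
-v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
-v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
prom/alertmanager:v0.22.2 check-config /etc/alertmanager/alertmanager.yml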

5.2 Starting AlertManager

Run the following command

docker run -d -p 9093:9093 \
-v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
-v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
--name alertmanager \
--restart=always \
prom/alertmanager:v0.22.2

5.3 Access

Once it is up, AlertManager can be reached at http://$ip:9093.
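
To verify the notification chain end to end before any real rule fires, you can push a hand-crafted alert into AlertManager's v2 API; the alert name and label values below are made up for the test:

curl -s -X POST "http://$ip:9093/api/v2/alerts" \
-H 'Content-Type: application/json' \
-d '[{"labels":{"alertname":"ManualTest","level":"warning","instance":"test-host"},"annotations":{"summary":"manual test alert","description":"sent by hand to verify email/DingTalk delivery"}}]'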

6. Deploying PrometheusAlert

As mentioned in the previous section, alerting with Prometheus involves several parts. AlertManager, which processes and delivers the notifications, is already in place; in this section we deploy PrometheusAlert, the webhook target that the alertmanager.yml above points at for DingTalk delivery, and in the next section we define the rules so that Prometheus can generate alerts and send them to AlertManager.

6.1 Preparation

Run the following commands

sudo mkdir -p /data/docker/prometheus-alert/conf
sudo chown -R myusername:myusername /data/docker/prometheus-alert/
nano /data/docker/prometheus-alert/conf/app.conf

Download https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/conf/app-example.conf and save it as /data/docker/prometheus-alert/conf/app.conf.

6.2 Startup

Run the following command to start PrometheusAlert

docker run -d --publish=18080:8080 \
-v /data/docker/prometheus-alert/conf/:/app/conf:ro \
-v /data/docker/prometheus-alert/db/:/app/db \
-v /data/docker/prometheus-alert/log/:/app/logs \
--name prometheusalert-center \
feiyu563/prometheus-alert:v-4.5.0

Once it is up, visit http://$ip:18080 to reach the PrometheusAlert UI; the username and password are set in app.conf.

If the system firewall is enabled, remember to open the port

sudo firewall-cmd --zone=public --add-port=18080/tcp --permanent
sudo firewall-cmd --reload

6.3 Configuration

  1. Configure the alert template

Click AlertTemplate to reach http://$ip:18080/template, which holds templates for the various third-party systems that can be integrated.
Taking the DingTalk alert template as an example, change its content to the following; the main goals are to fix the timestamps showing 8 hours behind and to add a little more information

{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus recovery notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Severity: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### End time: {{GetCSTtime $v.endsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{else}}
## [Prometheus alert]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Severity: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{end}}
{{ end }}
  2. Set up the DingTalk robot
    In DingTalk, create a new group, then click "Group settings -> Group assistant -> Add robot -> Custom -> Security settings" and add the IP address of the server that will send the messages; you will then receive a Webhook URL. See https://blog.csdn.net/knight_zhou/article/details/105583741 for reference.
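
    To check that the robot and its IP whitelist are set up correctly, you can post a plain text message straight to the Webhook URL; the access_token below is a placeholder:

    curl -s "https://oapi.dingtalk.com/robot/send?access_token=xxxxxx" \
    -H 'Content-Type: application/json' \
    -d '{"msgtype": "text", "text": {"content": "monitoring: DingTalk webhook test"}}'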

6.4 Reference documentation for deploying PrometheusAlert

https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/install.md

7. Configuring Alerting Rules

We still need to configure alerting rules on the Prometheus Server; the rule files are referenced in the rule_files section of prometheus.yml. The rule files are written in YAML. Following the setup from section 2.1, create a rule file in /data/docker/prometheus/alert_rules/ (give it a .rules suffix, for example node.rules, so that it matches the /etc/prometheus/alert_rules/*.rules glob) with the following content:

groups:
- name: Node_exporter Down
  rules:
  - alert: Instance down
    expr: up{job="host"} == 0
    for: 1m
    labels:
      level: Warning
    annotations:
      summary: "{{ $labels.job }}"
      address: "{{ $labels.instance }}"
      description: "The instance has been unreachable for 1 minute."

  - alert: CPU usage high (> 80%)
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 1m
    labels:
      level: Warning
    annotations:
      summary: "{{ $labels.instance }} CPU usage is high"
      description: "{{ $labels.instance }}: CPU usage is above 80%, current value {{ $value }}"

  - alert: Memory usage high (> 80%)
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m # how long the condition must hold before the alert is sent to Alertmanager
    labels:
      level: Warning
    annotations:
      summary: "{{ $labels.instance }} memory usage is high"
      description: "{{ $labels.instance }}: memory usage is above 80%. Current value {{ $value }}"

  - alert: Memory pressure high (> 1000)
    expr: rate(node_vmstat_pgmajfault[1m]) > 1000
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) memory pressure is high"
      description: "{{ $labels.instance }}: the host is under heavy memory pressure (major page faults). Current value {{ $value }}"

  - alert: Host network interface receiving too much data (> 2MB/s)
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 2
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual inbound traffic"
      description: "{{ $labels.instance }}: network interfaces have been receiving too much data (> 2MB/s) for 5 minutes. Current inbound rate {{ $value }}MB per second."

  - alert: Host network interface sending too much data (> 2MB/s)
    expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 2
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual outbound traffic"
      description: "{{ $labels.instance }}: network interfaces have been sending too much data (> 2MB/s) for 3 minutes. Current outbound rate {{ $value }}MB per second."

  - alert: Disk read throughput high (> 50 MB/s)
    expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
    for: 3m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual disk read I/O"
      description: "{{ $labels.instance }}: disk reads look abnormal. Current value {{ $value }}MB per second"

  - alert: Disk write throughput high (> 50 MB/s)
    expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) unusual disk write I/O"
      description: "{{ $labels.instance }}: disk writes look abnormal. Current value {{ $value }}MB per second"

  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: Disk space low (< 10% left)
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host is running out of disk"
      description: "{{ $labels.instance }}: less than about 10% of disk space is left. Currently {{ $value }}% available"

  - alert: Disk read latency high (> 100ms)
    expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) high disk read latency"
      description: "{{ $labels.instance }}: disk read latency is high (> 100ms). Current value {{ $value }}"

  - alert: Disk write latency high (> 100ms)
    expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) high disk write latency"
      description: "{{ $labels.instance }}: disk write latency is high (> 100ms). Current value {{ $value }}"

  # 1000 context switches is an arbitrary number.
  # Alert threshold depends on nature of application.
  # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
  #- alert: Context switching rate high (> 1500/s)
  #  expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1500
  #  for: 3m
  #  labels:
  #    level: warning
  #  annotations:
  #    summary: "(instance {{ $labels.instance }}) high context switching"
  #    description: "{{ $labels.instance }}: context switching rate is high (> 1500/s). Current value {{ $value }}"

  - alert: Host swap usage high (> 80%)
    expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host swap warning"
      description: "{{ $labels.instance }}: swap usage has reached > 80%. Current value {{ $value }}"

  - alert: Systemd-managed service down
    expr: node_systemd_unit_state{state="failed"} == 1
    for: 0m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) a systemd service is down"
      description: "{{ $labels.instance }}: systemd unit {{ $labels.name }} is in the failed state"

  - alert: Physical host temperature high (> 75°C)
    expr: node_hwmon_temp_celsius > 75
    for: 5m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host temperature warning"
      description: "{{ $labels.instance }}: physical host temperature is abnormal (> 75°C), current value {{ $value }}"

  - alert: Physical node temperature critical alarm
    expr: node_hwmon_temp_crit_alarm_celsius == 1
    for: 0m
    labels:
      level: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) mainboard temperature alarm"
      description: "{{ $labels.instance }}: the mainboard temperature is too high, current value {{ $value }}"

  - alert: Host receiving network errors
    expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host is receiving errored packets"
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has hit {{ printf \"%.0f\" $value }} receive errors in the last five minutes"

  - alert: Host transmitting network errors
    expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) host is transmitting errored packets"
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has hit {{ printf \"%.0f\" $value }} transmit errors in the last five minutes"

  - alert: TCP connect time too long
    expr: probe_duration_seconds{job="blackbox_tcp"} > 5
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) TCP connect time over 5 seconds"
      description: "TCP connect time is over 5 seconds, current value {{ $value }}"

  - alert: Host TCP connection count high
    expr: node_netstat_Tcp_CurrEstab > 800
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) too many TCP connections"
      description: "{{ $labels.instance }}: too many established TCP connections detected (> 800), current value {{ $value }}"

  - alert: TCP connections waiting to close > 4000
    expr: node_sockstat_TCP_tw > 4000
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) TCP connections waiting to close > 4000"
      description: "{{ $labels.instance }}: too many TCP connections waiting to close, current value {{ $value }}"

  - alert: Clock skew detected
    expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) clock skew detected"
      description: "{{ $labels.instance }}: clock skew detected, the clock is out of sync, current value {{ $value }}"

  - alert: Container stopped
    expr: time() - container_last_seen > 300
    for: 0m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) a container may have stopped"
      description: "Container {{ $labels.name }} has not been seen for {{ $value }} seconds and may have stopped running"

  # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
  # If you want to exclude it from this alert, exclude the series having an empty name: container_cpu_usage_seconds_total{name!=""}
  - alert: Container CPU usage high
    expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 300
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container CPU usage high"
      description: "{{ $labels.instance }}: container CPU usage > 300%, current value {{ $value }}%"

  # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
  #- alert: Container memory usage high
  #  expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 85
  #  for: 2m
  #  labels:
  #    level: warning
  #  annotations:
  #    summary: "(instance {{ $labels.instance }}) container memory usage high"
  #    description: "{{ $labels.instance }}: container {{ $labels.name }} memory usage > 85%, current value {{ $value }}%"

  - alert: Container volume usage high
    expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container volume usage high"
      description: "{{ $labels.instance }}: container {{ $labels.name }} volume usage > 80%, current value {{ $value }}%"

  - alert: Container volume I/O usage high
    expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container volume I/O high"
      description: "{{ $labels.instance }}: container {{ $labels.name }} volume I/O usage > 80%, current value {{ $value }}%"

  - alert: Container CPU throttling high
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) container is being throttled"
      description: "{{ $labels.instance }}: container {{ $labels.name }} is being CPU-throttled heavily"

  - alert: Blackbox probe failed
    expr: probe_success == 0
    for: 5m
    labels:
      level: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) black-box probe found a problem"
      description: "Job {{ $labels.job }} reported a failed probe"

  - alert: Blackbox slow probe
    expr: avg_over_time(probe_duration_seconds[1m]) > 15
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) black-box probe is slow"
      description: "Blackbox took more than {{ $value }} seconds to complete, {{ $labels }}"

  - alert: Blackbox ping too slow
    expr: probe_duration_seconds{job="blackbox_ping"} > 5
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "(instance {{ $labels.instance }}) Blackbox ping is taking too long"
      description: "Ping took more than 5 seconds, {{ $value }}, {{ $labels }}"

  - alert: Blackbox HTTP probe failed
    expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
    for: 2m
    labels:
      level: critical
    annotations:
      summary: "(instance {{ $labels.instance }}) Blackbox HTTP probe failed"
      description: "HTTP status code is not 200-399, {{ $value }}, {{ $labels }}"

  - alert: SSL certificate expires in 30 days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 60m
    labels:
      level: warning
    annotations:
      summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
      description: "The SSL certificate expires in 30 days, {{ $value }}, {{ $labels }}"

  - alert: SSL certificate expires in 3 days
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
    for: 60m
    labels:
      level: critical
    annotations:
      summary: Blackbox SSL certificate will expire very soon (instance {{ $labels.instance }})
      description: "The SSL certificate expires in 3 days, {{ $value }}, {{ $labels }}"

  - alert: SSL certificate expired
    expr: probe_ssl_earliest_cert_expiry - time() <= 0
    for: 60m
    labels:
      level: critical
    annotations:
      summary: Blackbox SSL certificate has expired (instance {{ $labels.instance }})
      description: "The SSL certificate has expired, {{ $value }}, {{ $labels }}"

  - alert: Blackbox slow HTTP probe
    expr: avg_over_time(probe_http_duration_seconds[1m]) > 3
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Blackbox detected slow HTTP (instance {{ $labels.instance }})"
      description: "The HTTP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"

  - alert: Blackbox slow ICMP probe
    expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 3
    for: 2m
    labels:
      level: warning
    annotations:
      summary: "Blackbox detected slow ICMP (instance {{ $labels.instance }})"
      description: "The ICMP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"

  - alert: DNS server down
    expr: probe_dns_answer_rrs == 0
    for: 1m
    labels:
      level: Warning
    annotations:
      summary: "DNS server down"
      description: "The DNS server has not answered for 1 minute and may be down."

After the rule file is in place, restart the Prometheus Server (or POST to its /-/reload endpoint, since --web.enable-lifecycle is enabled); the rules can then be viewed at http://$ip:9090/rules. It is worth reading up on the alert states (inactive, pending, firing) on your own.
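
Before reloading, the rule file can be checked with promtool; the file name node.rules is just the example name suggested above. A sketch:

# Check rule syntax with the promtool binary in the Prometheus image
docker run --rm --entrypoint promtool \
-v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules:ro \
prom/prometheus:v2.28.1 check rules /etc/prometheus/alert_rules/node.rules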

Reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/