1. Create the project directory, upload the installation packages to it, extract them, and create the prometheus account
mkdir /opt/gnp
cd /opt/gnp
tar zxf grafana-9.5.15.linux-amd64.tar.gz
mv grafana-v9.5.15/ grafana
tar zxf node_exporter-1.7.0.linux-amd64.tar.gz
mv node_exporter-1.7.0.linux-amd64 node_exporter
tar zxf prometheus-2.45.2.linux-amd64.tar.gz
mv prometheus-2.45.2.linux-amd64 prometheus
useradd -s /sbin/nologin -M prometheus
chown -R prometheus:prometheus prometheus
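An optional sanity check of the resulting layout and the service account (paths as created above):
ls -ld /opt/gnp/grafana /opt/gnp/node_exporter /opt/gnp/prometheus
id prometheus  # confirms the account exists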
2. Configure a basic-auth account for node_exporter and its startup configuration
Username: prometheus
Password: nhlrs@202419
Install the htpasswd tool:
Debian/Ubuntu: apt-get install apache2-utils
CentOS: yum -y install httpd-tools
Generate a bcrypt hash of the password:
htpasswd -nBC 12 '' | tr -d ':\n'
New password:
Re-type new password:
$2y$12$rNSJIMLIIAN24YJc3LNEvuERW8JkqPHWyRlaNrWE8XAr3alItKgeG
cat >/opt/gnp/node_exporter/config.yml
basic_auth_users:
  prometheus: $2y$12$rNSJIMLIIAN24YJc3LNEvuERW8JkqPHWyRlaNrWE8XAr3alItKgeG
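Once node_exporter is running (step 4), the basic-auth setup can be verified with curl; a minimal sketch assuming the exporter listens on the default port 9100 on this host:
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9100/metrics  # expect 401 without credentials
curl -s -u 'prometheus:nhlrs@202419' http://127.0.0.1:9100/metrics | head  # expect metric output with credentials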
3. Prometheus configuration; adjust the credentials and target addresses to your environment
cat >/opt/gnp/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419
    static_configs:
      - targets: ["192.168.126.128:9090","192.168.126.128:9100"]
4. Start the services and enable them at boot
cat >/usr/lib/systemd/system/grafana.service
[Unit]
Description=grafana
After=network.target
[Service]
WorkingDirectory=/opt/gnp/grafana
ExecStart=/opt/gnp/grafana/bin/grafana-server
[Install]
WantedBy=multi-user.target
cat >/usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
WorkingDirectory=/opt/gnp/prometheus
# --web.enable-lifecycle allows hot reload via: curl -X POST localhost:9090/-/reload
ExecStart=/opt/gnp/prometheus/prometheus --web.enable-lifecycle
[Install]
WantedBy=multi-user.target
cat >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/gnp/node_exporter/node_exporter --web.config.file=/opt/gnp/node_exporter/config.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
for i in grafana prometheus node_exporter;do systemctl enable --now $i;done
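An optional check that all three services are active and listening on the expected ports (3000 Grafana, 9090 Prometheus, 9100 node_exporter):
for i in grafana prometheus node_exporter;do systemctl is-active $i;done
ss -lntp | egrep ':3000|:9090|:9100'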
5. Web configuration
grafana:
http://192.168.126.128:3000  default username/password: admin/admin
Add a Prometheus data source pointing at http://localhost:9090
Import an offline dashboard template via Dashboard -> Import
prometheus:
http://192.168.126.128:9090/targets?search=
node_exporter:
http://192.168.126.128:9100/metrics
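Target health can also be read from the Prometheus HTTP API instead of the web UI; a sketch assuming python3 is available for pretty-printing:
curl -s http://192.168.126.128:9090/api/v1/targets | python3 -m json.tool | egrep '"scrapeUrl"|"health"'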
6. Alerting
prometheus.yml configuration:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.46.143.50:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419
    static_configs:
      - targets: ["10.46.143.50:9090", "10.46.143.49:9100", "10.46.143.50:9100", "10.46.143.51:9100", "10.46.143.52:9100", "10.46.143.53:9100", "10.46.143.65:9100", "10.46.143.66:9100", "10.46.143.164:9100", "10.46.143.165:9100", "10.46.143.166:9100", "10.46.143.171:9100", "10.46.143.172:9100"]
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['10.46.143.50:9093']
rules/node_alerts.yml configuration:
groups:
- name: node-status-alerts
  rules:
  - alert: HostDown
    expr: up{job="prometheus"} == 0
    for: 15s
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: server is down"
      description: "{{ $labels.instance }}: target has been unreachable for more than 15s"
  - alert: HighCpuUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{ $labels.instance }}: High CPU Usage Detected"
      description: "{{ $labels.instance }}: CPU usage is {{ $value }}, above 60%"
  - alert: NodeFilesystemUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }}: partition {{ $labels.mountpoint }} usage is too high"
      description: "{{ $labels.instance }}: partition {{ $labels.mountpoint }} usage is above 80% (current value: {{ $value }})"
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: memory usage is too high!"
      description: "{{ $labels.instance }}: memory usage is above 80% (current: {{ $value }}%)"
  - alert: HighDiskIO
    expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) > 60
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: disk I/O utilization is too high!"
      description: "{{ $labels.instance }}: disk I/O utilization is above 60% (current: {{ $value }})"
  - alert: HighNetworkReceive
    expr: ((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth is too high!"
      description: "{{ $labels.instance }}: inbound bandwidth has stayed above 100M. RX bandwidth usage: {{ $value }}"
  - alert: TooManyTcpSessions
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: too many TCP_ESTABLISHED connections!"
      description: "{{ $labels.instance }}: TCP_ESTABLISHED count is above 1000 (current: {{ $value }})"
curl -X POST localhost:9090/-/reload  # apply the new configuration without restarting
alertmanager.yml configuration:
global:
  resolve_timeout: 1m # check every 1 minute whether an alert has been resolved
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: '' # enterprise ID from the WeChat Work console
  wechat_api_secret: ''
templates:
  - '/opt/gnp/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s
  group_interval: 5s
  repeat_interval: 1h
receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
    to_party: '1'
    agent_id: '' # agentid from the WeChat Work console
    to_user: '@all'
    api_secret: '' # secret from the WeChat Work console
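The file can be validated with amtool, which ships in the alertmanager tarball (a sketch assuming the /opt/gnp/alertmanager layout used below):
cd /opt/gnp/alertmanager
./amtool check-config alertmanager.yml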
template/wechat.tmpl configuration:
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========= Cloud environment alert =========
Alert status: {{ .Status }}
Alert level: {{ .Labels.severity }}
Alert type: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }} {{ $alert.Labels.pod }}
Alert summary: {{ $alert.Annotations.summary }}
Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Trigger value: {{ .Annotations.value }}
Start time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========= Cloud environment recovery =========
Alert type: {{ .Labels.alertname }}
Alert status: {{ .Status }}
Alert summary: {{ $alert.Annotations.summary }}
Alert details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Start time: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved time: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
Register alertmanager as a system service
cat >/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network.target
[Service]
User=prometheus
Group=prometheus
WorkingDirectory=/opt/gnp/alertmanager
ExecStart=/opt/gnp/alertmanager/alertmanager
[Install]
WantedBy=multi-user.target
Start the service and enable it at boot: systemctl daemon-reload && systemctl enable --now alertmanager
Reload the configuration without restarting: curl -X POST localhost:9093/-/reload
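To exercise the whole notification path once alertmanager is up, a synthetic alert can be injected with amtool; the label and annotation values below are placeholders:
/opt/gnp/alertmanager/amtool alert add alertname=TestAlert severity=warning instance=manual-test --annotation=summary='manual test alert' --alertmanager.url=http://localhost:9093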
7. Install cAdvisor to monitor Docker containers
mkdir -p /opt/gnp
if [ -d "/opt/gnp/cadvisor" ];then
/bin/rm -rf /opt/gnp/cadvisor
fi
mkdir -p /opt/gnp/cadvisor
mv /root/cadvisor-v0.47.2-linux-amd64 /opt/gnp/cadvisor/cadvisor
docker_root=$(docker info |egrep "Docker Root Dir: " |awk -F': ' '{print $2}')
cat >/usr/lib/systemd/system/cadvisor.service <<EOF
[Unit]
Description=cadvisor
After=network.target
[Service]
WorkingDirectory=/opt/gnp/cadvisor
ExecStart=/opt/gnp/cadvisor/cadvisor -port 28080 -docker_root ${docker_root:-/var/lib/docker}
[Install]
WantedBy=multi-user.target
EOF
chmod a+x /opt/gnp/cadvisor/cadvisor
systemctl daemon-reload
systemctl restart cadvisor
systemctl enable cadvisor
iptables -I INPUT -p tcp --dport 28080 -j ACCEPT
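A quick check that cAdvisor is active and exporting container metrics (assuming it runs on this host on port 28080):
systemctl is-active cadvisor
curl -s http://127.0.0.1:28080/metrics | grep -m 5 '^container_'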
prometheus.yml:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.46.143.50:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419
    static_configs:
      - targets: ["10.46.143.50:9090", "10.46.143.49:9100", "10.46.143.50:9100", "10.46.143.51:9100", "10.46.143.52:9100", "10.46.143.53:9100", "10.46.143.65:9100", "10.46.143.66:9100", "10.46.143.164:9100", "10.46.143.165:9100", "10.46.143.166:9100", "10.46.143.171:9100", "10.46.143.172:9100", "10.46.143.49:28080", "10.46.143.50:28080", "10.46.143.51:28080", "10.46.143.52:28080", "10.46.143.53:28080", "10.46.143.65:28080", "10.46.143.66:28080", "10.46.143.164:28080", "10.46.143.165:28080", "10.46.143.166:28080", "10.46.143.171:28080", "10.46.143.172:28080"]
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['10.46.143.50:9093']
rules/docker_alerts.yml:
groups:
- name: container.rules
  rules:
  - alert: Container_cpu_usage
    expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m])) * 100) > 75
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} CPU utilization is above 75% (current value is {{ $value }})
      summary: Dev CPU load alert
  - alert: Container_memory_usage
    expr: avg by(name)(container_memory_usage_bytes{name!=""}) > 4*1024*1024*1024
    for: 10m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} memory usage is above 4G (current value is {{ $value }})
      summary: Dev memory load alert
  - alert: Container_network_receive_usage
    expr: sum by (name)(irate(container_network_receive_bytes_total{image!=""}[1m])) > 1024*1024*50
    for: 10m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} network receive rate is above 50M (current value is {{ $value }})
      summary: network_receive load alert
Import the dashboard template into Grafana:
11600_rev1.json
./promtool check config ./prometheus.yml
Checking ./prometheus.yml
SUCCESS: 2 rule files found
SUCCESS: ./prometheus.yml is valid prometheus config file syntax
Checking rules/docker_alerts.yml
SUCCESS: 3 rules found
Checking rules/node_alerts.yml
SUCCESS: 7 rules found
curl -X POST localhost:9090/-/reload
8. Install blackbox_exporter to monitor websites
Download: https://github.com/prometheus/blackbox_exporter
Register the service:
/lib/systemd/system/blackbox.service:
[Unit]
Description=blackbox_exporter
After=network.target
[Service]
User=root
Type=simple
ExecStart=/opt/gnp/blackbox_exporter/blackbox_exporter --config.file=/opt/gnp/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
Enable at boot and start the service:
systemctl enable --now blackbox
Prometheus configuration file:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419
    static_configs:
      - targets: ["127.0.0.1:9092","127.0.0.1:9102","127.0.0.1:28080"]
  - job_name: 'http-status-code-probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://example1.com
        labels:
          group: 'web'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
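The probe can also be exercised directly against blackbox_exporter before Prometheus scrapes it; a sketch assuming the default listen port 9115 and the http_2xx module from the stock blackbox.yml:
curl -s 'http://127.0.0.1:9115/probe?module=http_2xx&target=https://example.com' | grep probe_success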
Restart the Prometheus service and check the targets page.
Import dashboard template 9965 into Grafana.