grafana + node_exporter + prometheus + alertmanager + cadvisor deployment notes

1. Create the project directory, upload and unpack the install packages, and create the prometheus account

mkdir /opt/gnp
cd /opt/gnp    # the install packages are assumed to have been uploaded here
tar zxf grafana-9.5.15.linux-amd64.tar.gz
mv grafana-v9.5.15/ grafana


tar zxf node_exporter-1.7.0.linux-amd64.tar.gz
mv node_exporter-1.7.0.linux-amd64 node_exporter


tar zxf prometheus-2.45.2.linux-amd64.tar.gz
mv prometheus-2.45.2.linux-amd64 prometheus


useradd -s /sbin/nologin -M prometheus
chown -R prometheus:prometheus prometheus
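A quick sanity check that the binaries run on this host (a sketch; paths assume the /opt/gnp layout created above):

/opt/gnp/prometheus/prometheus --version
/opt/gnp/node_exporter/node_exporter --version
/opt/gnp/grafana/bin/grafana-server -v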

2. Configure basic-auth credentials for node_exporter and prepare its startup config

Username: prometheus
Password: nhlrs@202419
Install the htpasswd tool:
Debian: apt-get install apache2-utils
CentOS: yum -y install httpd-tools
Generate the bcrypt hash:
htpasswd -nBC 12 '' | tr -d ':\n'
New password:
Re-type new password:
$2y$12$rNSJIMLIIAN24YJc3LNEvuERW8JkqPHWyRlaNrWE8XAr3alItKgeG
cat >/opt/gnp/node_exporter/config.yml
basic_auth_users:
  prometheus: $2y$12$rNSJIMLIIAN24YJc3LNEvuERW8JkqPHWyRlaNrWE8XAr3alItKgeG
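Once node_exporter is running (see section 4), basic auth can be verified quickly: without credentials the endpoint should return 401, with the plaintext password hashed above it should return 200 (a sketch, run on the node itself):

curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9100/metrics
curl -s -o /dev/null -w '%{http_code}\n' -u prometheus:nhlrs@202419 http://localhost:9100/metrics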

3. Prometheus configuration (the parts highlighted in red/blue in the original, i.e. credentials and target addresses, must be adapted to your environment)

cat >/opt/gnp/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).




# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093




# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"




# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"




    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419




    static_configs:
      - targets: ["192.168.126.128:9090","192.168.126.128:9100"]

4. Start the services and enable them at boot

cat >/usr/lib/systemd/system/grafana.service
[Unit]
Description=grafana
After=network.target
[Service]
WorkingDirectory=/opt/gnp/grafana
ExecStart=/opt/gnp/grafana/bin/grafana-server
[Install]
WantedBy=multi-user.target


cat >/usr/lib/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target
[Service]
User=prometheus
Group=prometheus
WorkingDirectory=/opt/gnp/prometheus
# --web.enable-lifecycle allows hot reload via: curl -X POST localhost:9090/-/reload
ExecStart=/opt/gnp/prometheus/prometheus --web.enable-lifecycle
[Install]
WantedBy=multi-user.target


cat >/usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/gnp/node_exporter/node_exporter --web.config.file=/opt/gnp/node_exporter/config.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target


systemctl daemon-reload


for i in grafana prometheus node_exporter;do systemctl enable --now $i;done
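Then confirm that all three units actually came up (same service names as above; a quick check):

for i in grafana prometheus node_exporter;do systemctl is-active $i;done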

5. Web configuration

grafana:
http://192.168.126.128:3000  Default username/password: admin/admin
Add a data source pointing at http://localhost:9090
Import offline dashboard templates via Dashboards > Import


prometheus:
http://192.168.126.128:9090/targets?search=


node_exporter:
http://192.168.126.128:9100/metrics
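The web endpoints can also be checked from the CLI (a sketch; the IP is the same lab host used above, and /-/healthy is Prometheus' built-in health endpoint):

curl -s -o /dev/null -w '%{http_code}\n' http://192.168.126.128:3000/login
curl -s http://192.168.126.128:9090/-/healthy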

6. Alerting

prometheus.yml configuration:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).




# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 10.46.143.50:9093




# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"




# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"




    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419
    static_configs:
      - targets: ["10.46.143.50:9090", "10.46.143.49:9100", "10.46.143.50:9100", "10.46.143.51:9100", "10.46.143.52:9100", "10.46.143.53:9100", "10.46.143.65:9100", "10.46.143.66:9100", "10.46.143.164:9100", "10.46.143.165:9100", "10.46.143.166:9100", "10.46.143.171:9100", "10.46.143.172:9100"]
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['10.46.143.50:9093']
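The rule_files entries above are relative to the Prometheus working directory, so create the rules directory before adding any rule files (path assumes the /opt/gnp layout used throughout):

mkdir -p /opt/gnp/prometheus/rules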
Configuration of rules/node_alerts.yml:
groups:
- name: node-status-alerts
  rules:
  - alert: HostDown
    expr: up == 0
    for: 15s
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: server is down"
      description: "{{ $labels.instance }}: target has been unreachable for more than 15s"




  - alert: CPUUsage
    expr: 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) > 60
    for: 1m
    labels:
      status: warning
    annotations:
      summary: "{{$labels.instance}}: High CPU Usage Detected"
      description: "{{$labels.instance}}: CPU usage is {{$value}}, above 60%"




  - alert: NodeFilesystemUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} : {{ $labels.mountpoint }} 分区使用率过高"
      description: "{{ $labels.instance }}: {{ $labels.mountpoint }} 分区使用大于80% (当前值: {{ $value }})"




  - alert: MemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: memory usage is too high!"
      description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }}%)"








  - alert: DiskIO
    expr: (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) > 60
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: disk I/O utilization is too high!"
      description: "{{ $labels.instance }}: disk I/O utilization is above 60% (current value: {{ $value }})"








  - alert: NetworkReceive
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo.*'}[5m])) by (instance)) / 100) > 102400
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: inbound network bandwidth is too high!"
      description: "{{ $labels.instance }}: inbound bandwidth has stayed above ~100 Mbit/s for 1 minute. RX usage: {{ $value }}"
  - alert: TCPSessions
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 1m
    labels:
      status: critical
    annotations:
      summary: "{{ $labels.instance }}: too many ESTABLISHED TCP connections!"
      description: "{{ $labels.instance }}: ESTABLISHED TCP connection count is above 1000 (current value: {{ $value }})"
curl -X POST localhost:9090/-/reload   # apply the new config without restarting
Configuration of alertmanager.yml:
global:
  resolve_timeout: 1m   # check every minute whether an alert has resolved
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: ''      # corp ID from the WeChat Work admin console
  wechat_api_secret: ''




templates:
  - '/opt/gnp/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s
  group_interval: 5s
  repeat_interval: 1h




receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
    to_party: '1'
    agent_id: ''   # agent ID looked up in the WeChat Work console
    to_user: "@all"
    api_secret: ''  # application secret looked up in the console
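amtool, shipped in the alertmanager tarball, can validate this file before it is loaded (the /opt/gnp/alertmanager path is an assumption based on the layout used here):

/opt/gnp/alertmanager/amtool check-config /opt/gnp/alertmanager/alertmanager.yml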


Configuration of template/wechat.tmpl:


{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========= Cloud environment alert =========
Alert status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }} {{ $alert.Labels.pod }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Trigger value: {{ .Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
========= Cloud environment recovery =========
Alert name: {{ .Labels.alertname }}
Alert status: {{ .Status }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
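To exercise the WeChat pipeline end to end without waiting for a real alert, a hand-crafted alert can be pushed straight to alertmanager's v2 API (a sketch; the label and annotation values are arbitrary test data):

curl -X POST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"TestAlert","instance":"test-host","severity":"warning"},"annotations":{"summary":"test alert","description":"manual test of the WeChat template"}}]'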


Register alertmanager as a systemd service
cat /lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network.target
[Service]
User=prometheus
Group=prometheus
WorkingDirectory=/opt/gnp/alertmanager
ExecStart=/opt/gnp/alertmanager/alertmanager
[Install]
WantedBy=multi-user.target


Start and enable at boot: systemctl daemon-reload && systemctl enable --now alertmanager


Hot-reload the config: curl -X POST localhost:9093/-/reload

7. Install cadvisor to monitor Docker containers

mkdir -p /opt/gnp
if [ -d "/opt/gnp/cadvisor" ];then
/bin/rm -rf /opt/gnp/cadvisor
fi
mkdir -p /opt/gnp/cadvisor
mv /root/cadvisor-v0.47.2-linux-amd64 /opt/gnp/cadvisor/cadvisor
docker_root=$(docker info |egrep "Docker Root Dir: " |awk -F': ' '{print $2}')
cat >/usr/lib/systemd/system/cadvisor.service <<EOF
[Unit]
Description=cadvisor
After=network.target
[Service]
WorkingDirectory=/opt/gnp/cadvisor
ExecStart=/opt/gnp/cadvisor/cadvisor -port 28080 -docker_root ${docker_root:-/var/lib/docker}
[Install]
WantedBy=multi-user.target
EOF
chmod a+x /opt/gnp/cadvisor/cadvisor
systemctl daemon-reload
systemctl restart cadvisor
systemctl enable cadvisor
iptables -I INPUT -p tcp --dport 28080 -j ACCEPT
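A quick check that cadvisor came up and is serving metrics on the chosen port:

systemctl is-active cadvisor
curl -s localhost:28080/metrics | head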


prometheus.yml:


# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).




# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - 10.46.143.50:9093




# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"




# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"




    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419
    static_configs:
      - targets: ["10.46.143.50:9090", "10.46.143.49:9100", "10.46.143.50:9100", "10.46.143.51:9100", "10.46.143.52:9100", "10.46.143.53:9100", "10.46.143.65:9100", "10.46.143.66:9100", "10.46.143.164:9100", "10.46.143.165:9100", "10.46.143.166:9100", "10.46.143.171:9100", "10.46.143.172:9100", "10.46.143.49:28080", "10.46.143.50:28080", "10.46.143.51:28080", "10.46.143.52:28080", "10.46.143.53:28080", "10.46.143.65:28080", "10.46.143.66:28080", "10.46.143.164:28080", "10.46.143.165:28080", "10.46.143.166:28080", "10.46.143.171:28080", "10.46.143.172:28080"]
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['10.46.143.50:9093']


rules/docker_alerts.yml:
groups:
- name: container.rules
  rules:




  - alert: Container_cpu_usage
    expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 75
    for: 5m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} CPU usage is above 75% (current value is {{ $value }})
      summary: Dev CPU load alert




  - alert: Container_memory_usage
    expr: sort_desc(avg by(name)(container_memory_usage_bytes{name!=""})) > 4*1024*1024*1024
    for: 10m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} memory usage is above 4 GiB (current value is {{ $value }})
      summary: Dev memory load alert




  - alert: Container_network_receive_usage
    expr: sum by (name)(irate(container_network_receive_bytes_total{image!=""}[1m])) > 1024*1024*50
    for: 10m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} network receive rate is above 50 MB/s (current value is {{ $value }})
      summary: Network receive load alert


Grafana dashboard template to import:
11600_rev1.json


./promtool check config ./prometheus.yml
Checking ./prometheus.yml
  SUCCESS: 2 rule files found
SUCCESS: ./prometheus.yml is valid prometheus config file syntax




Checking rules/docker_alerts.yml
  SUCCESS: 3 rules found




Checking rules/node_alerts.yml
  SUCCESS: 7 rules found


curl -X POST localhost:9090/-/reload

8. Install blackbox_exporter to monitor websites

Download: https://github.com/prometheus/blackbox_exporter

Register the service:

/lib/systemd/system/blackbox.service:

[Unit]
Description=blackbox_exporter
After=network.target
[Service]
User=root
Type=simple
ExecStart=/opt/gnp/blackbox_exporter/blackbox_exporter --config.file=/opt/gnp/blackbox_exporter/blackbox.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target
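The unit above points at /opt/gnp/blackbox_exporter/blackbox.yml. The blackbox.yml shipped in the tarball already defines an http_2xx module; a minimal version matching the module name used in the scrape config below would look roughly like this (a sketch, not the full stock config):

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"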

Enable at boot and start the service:

systemctl enable --now blackbox

Prometheus configuration file:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).








# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093








# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"








# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"








    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    basic_auth:
      username: prometheus
      password: nhlrs@202419








    static_configs:
      - targets: ["127.0.0.1:9092","127.0.0.1:9102","127.0.0.1:28080"]
  - job_name: 'http-status-code-probe'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://example.com
        - https://example1.com
        labels:
          group: 'web'
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
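The probe can also be exercised directly against blackbox_exporter before wiring it into Prometheus (the target URL is just the example placeholder used above):

curl 'http://127.0.0.1:9115/probe?module=http_2xx&target=https://example.com'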

Restart the Prometheus service and check the Targets page.

Import Grafana dashboard 9965.
