目 录CONTENT

文章目录

prometheus+grafana+AlertManager部署

简中仙
2022-09-09 / 0 评论 / 0 点赞 / 87 阅读 / 0 字 / 正在检测是否收录...
温馨提示:
本文最后更新于2024-01-08,若内容或图片失效,请留言反馈。 本文如有错误或者侵权的地方,欢迎您批评指正!

这里介绍和搭建一套监控体系,所使用的是prometheus监控采集监控指标数据,将数据保存到时序数据库中,使用grafana做监控指标数据可视化,告警使用的是alertmanager。

一、组件介绍

1、Prometheus介绍

Prometheus 是一套开源的系统监控报警框架。它启发于 Google 的 borgmon 监控系统,由工作在 SoundCloud 的 google 前员工在 2012 年创建,作为社区开源项目进行开发,并于 2015 年正式发布。2016 年,Prometheus 正式加入 Cloud Native Computing Foundation,成为受欢迎度仅次于 Kubernetes 的项目。

为新一代的监控框架,Prometheus 具有以下特点:

  • 强大的多维度数据模型: 时间序列数据通过 metric 名和键值对来区分。 所有的 metrics 都可以设置任意的多维标签。
  • 数据模型更随意,不需要刻意设置为以点分隔的字符串。 可以对数据模型进行聚合,切割和切片操作。
  • 支持双精度浮点类型,标签可以设为全unicode。 灵活而强大的查询语句(PromQL):在同一个查询语句,可以对多个 metrics进行乘法、加法、连接、取分数位等操作。
  • 易于管理: Prometheus server是一个单独的二进制文件,可直接在本地工作,不依赖于分布式存储。 高效:平均每个采样点仅占 3.5 bytes,且一个 Prometheus server 可以处理数百万的 metrics。 使用 pull模式采集时间序列数据,这样不仅有利于本机测试而且可以避免有问题的服务器推送坏的 metrics。 可以采用 push gateway 的方式把时间序列数据推送至 Prometheus server 端。 可以通过服务发现或者静态配置去获取监控的 targets。
  • 有多种可视化图形界面。 易于伸缩。 需要指出的是,由于数据采集可能会有丢失,所以 Prometheus 不适用对采集数据要 100%
  • 准确的情形。但如果用于记录时间序列数据,Prometheus 具有很大的查询优势,此外,Prometheus 适用于微服务的体系架构。

2、Grafana介绍

Grafana是一个跨平台的开源的度量分析和可视化工具,可以通过将采集的数据查询然后可视化的展示。Grafana提供了对prometheus的友好支持,各种工具帮助你构建更加炫酷的数据可视化。

Grafana特点:

  • 可视化:快速和灵活的客户端图形具有多种选项。面板插件为许多不同的方式可视化指标和日志。
  • 报警:可视化地为最重要的指标定义警报规则。Grafana将持续评估它们,并发送通知。
  • 通知:警报更改状态时,它会发出通知。接收电子邮件通知。
  • 动态仪表盘:使用模板变量创建动态和可重用的仪表板,这些模板变量作为下拉菜单出现在仪表板顶部。
  • 混合数据源:在同一个图中混合不同的数据源!可以根据每个查询指定数据源。这甚至适用于自定义数据源。
  • 注释:注释来自不同数据源图表。将鼠标悬停在事件上可以显示完整的事件元数据和标记。
  • 过滤器:过滤器允许您动态创建新的键/值过滤器,这些过滤器将自动应用于使用该数据源的所有查询。

3、AlertManager介绍

Prometheus的报警功能主要是利用Alertmanager这个组件。当Alertmanager接收到 Prometheus 端发送过来的 Alerts 时,Alertmanager 会对 Alerts 进行去重复,分组,按标签内容发送不同报警组,包括:邮件,微信,webhook。

使用prometheus进行告警分为两部分:Prometheus Server中的告警规则会向Alertmanager发送。然后,Alertmanager管理这些告警,包括进行重复数据删除,分组和路由,以及告警的静默和抑制。

二、二进制部署

1、部署prometheus

1、下载源码包

wget https://github.do/https://github.com/prometheus/prometheus/releases/download/v2.33.4/prometheus-2.33.4.linux-amd64.tar.gz
tar zxf prometheus-2.33.4.linux-amd64.tar.gz 
mv prometheus-2.33.4.linux-amd64 /usr/local/prometheus

2、创建prometheus用户及数据存放目录

useradd -M -s /sbin/nologin prometheus
mkdir -p /data/prometheus
chown -R prometheus:prometheus /usr/local/prometheus /data/prometheus

3、使用systemd来管理prometheus服务

cat <<EOF >/etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/prometheus/prometheus \
  --config.file=/usr/local/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus \
  --storage.tsdb.retention=30d \
  --web.console.libraries=/usr/local/prometheus/console_libraries \
  --web.console.templates=/usr/local/prometheus/consoles \
  --web.listen-address=0.0.0.0:9090 \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle \
  --enable-feature=remote-write-receiver
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full

SyslogIdentifier=prometheus
Restart=on-failure
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus

[Install]
WantedBy=multi-user.target
EOF

4、启动prometheus

systemctl daemon-reload
systemctl enable prometheus && systemctl start prometheus
netstat -lntp | grep prometheus
tcp6       0      0 :::9090                 :::*                    LISTEN      40632/prometheus

5、修改配置文件

# vim /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
alerting:
  alertmanagers:
    - static_configs:
        - targets: [ 'localhost:9093' ]
rule_files:
   - "/usr/local/prometheus/rules/*.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: cadvisor
    static_configs:
      - targets:
        - 192.168.0.28:8080
  - job_name: 'nacos_exproter'
    metrics_path: '/nacos/actuator/prometheus'
    static_configs:
      - targets:
        - 192.168.0.86:8888
  - job_name: 'rabbitmq'
    static_configs:
    - targets:
        - 192.168.0.20:15692
        - 192.168.0.20:15693
        - 192.168.0.20:15694
  - job_name: 'fastdfs'
    static_configs:
    - targets: ['192.168.0.176:9018']
  - job_name: 'nginx_VTS'
    static_configs:
    - targets: ['192.168.0.19:9913']
  - job_name: 'redis_exporter'
    static_configs:
      - targets:
        - redis://192.168.0.155:6379
        - redis://192.168.0.155:6380
    metrics_path: /scrape
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.0.155:9121
  - job_name: 'blackbox_tcp'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
        - 192.168.0.256:8888
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: 'blackbox_http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://192.168.1.10:110
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
  - job_name: 'mysql_exproter'
    static_configs:
    - targets:
      - 192.168.0.118:9104

7、检查配置文件是否正确

./promtool check config prometheus.yml

8、不重启重新加载配置文件

curl -X POST http://127.0.0.1:9090/-/reload

2、部署grafana

1、下载grafana二进制包

wget https://github.do/https://github.com/grafana/grafana/archive/refs/tags/v8.4.4.tar.gz
tar -zxf v8.4.4.tar.gz
mv grafana-8.4.4  /usr/local/
ln -s /usr/local/grafana-8.4.4/ /usr/local/grafana

2、创建grafana用户及数据存放目录

useradd -s /sbin/nologin -M grafana
mkdir /data/grafana
chown -R grafana:grafana /usr/local/grafana/ 
chown -R grafana:grafana  /data/grafana/

3、修改配置文件

# vim /usr/local/grafana/conf/defaults.ini
data = /data/grafana/data
logs = /data/grafana/log
plugins = /data/grafana/plugins
provisioning = /data/grafana/conf/provisioning

4、使用systemd来管理grafana服务

# vim /etc/systemd/system/grafana-server.service
[Unit]
Description=Grafana
After=network.target

[Service]
User=grafana
Group=grafana
Type=notify
ExecStart=/usr/local/grafana/bin/grafana-server -homepath /usr/local/grafana
Restart=on-failure

[Install]
WantedBy=multi-user.target

5、启动并设置开机自启

systemctl daemon-reload && systemctl start  grafana-server
systemctl status  grafana-server
systemctl enable  grafana-server

3、部署AlertManager

1、下载AlertManager二进制包

wget https://github.do/https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
tar zxf alertmanager-0.23.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
mv alertmanager-0.23.0.linux-amd64/ alertmanager

2、使用systemd来管理AlertManager服务

# vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --config.file /usr/local/alertmanager/alertmanager.yml --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target

3、修改配置文件

# vim /usr/local/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m   #解析的超时时间
  http_config: {}
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'dingding.webhook1'
receivers:
- name: 'dingding.webhook1'
  webhook_configs:
  - send_resolved: true
    url: 'http://127.0.0.1:8060/dingtalk/webhook1/send'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
templates:
  - '/alertmanager/template/*.tmpl'

4、启动AlertManager

systemctl daemon-reload && systemctl enable alertmanager
systemctl start alertmanager && systemctl status alertmanager

4、部署钉钉告警

1、下载二进制包

wget https://mirror.ghproxy.com/https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz
tar xf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz -C /usr/local/
ln -sv /usr/local/prometheus-webhook-dingtalk-2.0.0.linux-amd64/ /usr/local/prometheus-webhook-dingtalk

2、修改配置文件

# vim /usr/local/prometheus-webhook-dingtalk/config.yml
timeout: 5s
templates:
  - contrib/templates/legacy/template.tmpl

default_message:
  title: '{{ template "legacy.title" . }}'
  text: '{{ template "legacy.content" . }}'

targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    # secret for signature
    secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx			# 加签
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    mention:
      all: true
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    mention:
      mobiles: ['139xxxxxxxx']
      
curl -XPOST http://localhost:8060/-/reload

3、修改报警模板

# cd /usr/local/prometheus-webhook-dingtalk
# cd contrib/templates/legacy/
# vim template.tmpl 
{{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
{{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}

{{ define "__text_alert_list" }}{{ range . }}
**Labels**
{{ range .Labels.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Annotations**
{{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}
**Source:** [{{ .GeneratorURL }}]({{ .GeneratorURL }})
{{ end }}{{ end }}

{{ define "default.__text_alert_list" }}{{ range . }}
---
**告警级别:** {{ .Labels.severity | upper }}

**运营团队:** {{ .Labels.team | upper }}

**触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**事件信息:** 
{{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}

**事件标签:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}
{{ define "default.__text_alertresovle_list" }}{{ range . }}
---
**告警级别:** {{ .Labels.severity | upper }}

**运营团队:** {{ .Labels.team | upper }}

**触发时间:** {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}

**结束时间:** {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}

**事件信息:**
{{ range .Annotations.SortedPairs }} - {{ .Name }}: {{ .Value | markdown | html }}

{{ end }}

**事件标签:**
{{ range .Labels.SortedPairs }}{{ if and (ne (.Name) "severity") (ne (.Name) "summary") (ne (.Name) "team") }} - {{ .Name }}: {{ .Value | markdown | html }}
{{ end }}{{ end }}
{{ end }}
{{ end }}

{{/* Default */}}
{{ define "default.title" }}{{ template "__subject" . }}{{ end }}
{{ define "default.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ if gt (len .Alerts.Firing) 0 -}}

{{ template "default.__text_alert_list" .Alerts.Firing }}

{{- end }}

{{ if gt (len .Alerts.Resolved) 0 -}}
{{ template "default.__text_alertresovle_list" .Alerts.Resolved }}

{{- end }}
{{- end }}

{{/* Legacy */}}
{{ define "legacy.title" }}{{ template "__subject" . }}{{ end }}
{{ define "legacy.content" }}#### \[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}\] **[{{ index .GroupLabels "alertname" }}]({{ template "__alertmanagerURL" . }})**
{{ template "__text_alert_list" .Alerts.Firing }}
{{- end }}

{{/* Following names for compatibility */}}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}

4、使用systemd来管理服务

# vim /lib/systemd/system/dingtalk.service
[Unit]
Descripton=dingtalk
Documentation=https://github.com/timonwong/prometheus-webhook-dingtalk/
After=network.target

[Service]
Restart=on-failure
WorkingDirectory=/usr/local/prometheus-webhook-dingtalk
ExecStart=/usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml

[Install]
WantedBy=multi-user.target

5、启动钉钉告警

systemctl daemon-reload && systemctl enable dingtalk
systemctl start dingtalk && systemctl status dingtalk

三、docker-compose部署

version: "3.8"

volumes:
    grafana_storage: {}
 
services:
    prometheus:
        image: prom/prometheus
        container_name: prometheus
        hostname: prometheus
        restart: always
        user: root
        volumes:
            - ./prometheus/conf:/etc/prometheus
            - ./prometheus/data:/prometheus
        ports:
            - "9090:9090"
        command:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
            - "--storage.tsdb.retention.time=30d"
            - "--web.enable-lifecycle"

    grafana:
        image: grafana/grafana
        container_name: grafana
        hostname: grafana
        user: root
        restart: always
        ports:
            - "3000:3000"
        volumes:
            - grafana_storage:/var/lib/grafana
 
  alertmanager:
    image: prom/alertmanager
    restart: always
    container_name: alert
    volumes:
      - /etc/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml    
    ports:
      - "9093:9093"
    depends_on:
      - prometheus

  cadvisor:
    image: google/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /dev/disk/:/dev/disk:ro
      - /data/docker/:/var/lib/docker:ro
    command:
      - "--disable_metrics=udp,tcp,percpu,sched"
      - "--storage_duration=15s"
      - "-docker_only=true"
      - "-housekeeping_interval=30s"
      - "-disable_metrics=disk"
    ports:
      - 8080:8080
 
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    environment:
      TZ: Asia/Shanghai
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
    network_mode: host
0

评论区