一、什么是prometheus
Prometheus是一个开源监控系统,它前身是SoundCloud的警告工具包。从2012年开始,许多公司和组织开始使用Prometheus。该项目的开发人员和用户社区非常活跃,越来越多的开发人员和用户参与到该项目中。目前它是一个独立的开源项目,且不依赖与任何公司。 为了强调这点和明确该项目治理结构,Prometheus在2016年继Kurberntes之后,加入了Cloud Native Computing Foundation。
特征
Prometheus的主要特征有:
- 多维度数据模型
- 灵活的查询语言
- 不依赖分布式存储,单个服务器节点是自主的
- 以HTTP方式,通过pull模型拉去时间序列数据
- 也通过中间网关支持push模型
- 通过服务发现或者静态配置,来发现目标服务对象
- 支持多种多样的图表和界面展示,grafana也支持它
组件
Prometheus生态包括了很多组件,它们中的一些是可选的:
- 主服务Prometheus Server负责抓取和存储时间序列数据
- 客户库负责检测应用程序代码
- 支持短生命周期的PUSH网关
- 基于Rails/SQL仪表盘构建器的GUI
- 多种导出工具,可以支持Prometheus存储数据转化为HAProxy、StatsD、Graphite等工具所需要的数据存储格式
- 警告管理器
- 命令行查询工具
- 其他各种支撑工具
多数Prometheus组件是Go语言写的,这使得这些组件很容易编译和部署。
架构
下面这张图说明了Prometheus的整体架构,以及生态中的一些组件作用:
Prometheus服务,可以直接通过目标拉取数据,或者间接地通过中间网关拉取数据。它在本地存储抓取的所有数据,并通过一定规则进行清理和整理数据,并把得到的结果存储到新的时间序列中,PromQL和其他API可视化地展示收集的数据
适用场景
Prometheus在记录纯数字时间序列方面表现非常好。它既适用于面向服务器等硬件指标的监控,也适用于高动态的面向服务架构的监控。对于现在流行的微服务,Prometheus的多维度数据收集和数据筛选查询语言也是非常的强大。
Prometheus是为服务的可靠性而设计的,当服务出现故障时,它可以使你快速定位和诊断问题。它的搭建过程对硬件和服务没有很强的依赖关系。
不适用场景
Prometheus,它的价值在于可靠性,甚至在很恶劣的环境下,你都可以随时访问它和查看系统服务各种指标的统计信息。 如果你对统计数据需要100%的精确,它并不适用,例如:它不适用于实时计费系统
二、安装Prometheus Server
1、使用预编译二进制文件
我们为Prometheus大多数的官方组件,提供了预编译二进制文件。可用版本下载列表
2、源码安装
如果要从源码安装Prometheus的官方组件,可以查看各个项目源码目录下的Makefile
注意点:在web上的文档指向最新的稳定版(不包括预发布版)。下一个版本指向master分支还没有发布的版本
1、下载源码包
wget https://github.do/https://github.com/prometheus/prometheus/releases/download/v2.48.1/prometheus-2.48.1.linux-amd64.tar.gz
tar zxf prometheus-2.48.1.linux-amd64.tar.gz
mv prometheus-2.48.1.linux-amd64 /usr/local/prometheus
2、创建prometheus用户及数据存放目录
useradd -M -s /sbin/nologin prometheus
mkdir -p /data/prometheus
chown -R prometheus:prometheus /usr/local/prometheus /data/prometheus
3、使用systemd来管理prometheus服务
cat <<EOF >/usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention=30d \
--web.console.libraries=/usr/local/prometheus/console_libraries \
--web.console.templates=/usr/local/prometheus/consoles \
--web.listen-address=0.0.0.0:9090 \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle \
--enable-feature=remote-write-receiver
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=on-failure
SuccessExitStatus=0
LimitNOFILE=65536
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=prometheus
[Install]
WantedBy=multi-user.target
EOF
4、启动prometheus
systemctl daemon-reload
systemctl enable prometheus && systemctl start prometheus
netstat -lntp | grep prometheus
tcp6 0 0 :::9090 :::* LISTEN 40632/prometheus
3、Docker-compose安装
所有Prometheus服务的Docker镜像在官方组织prom下,都是可用的
在Docker上运行Prometheus服务,只需要简单地执行docker run -p 9090:9090 prom/prometheus
命令行即可。这条命令会启动Prometheus服务,使用的是默认配置文件,并对外界暴露9090端口
Prometheus镜像使用docker中的volume卷存储实际度量指标。在生产环境上使用容器卷模式, 可以在Prometheus更新和升级时轻松管理Prometheus数据, 这种使用docker volume卷方式存储数据,是被docker官方强烈推荐的.
当然,也可以使用docker-compose启动
version: '3'
services:
prometheus:
container_name: prometheus
restart: always
ports:
- 9090:9090
image: prom/prometheus
volumes:
- /etc/timezone:/etc/timezone
- /etc/localtime:/etc/localtime
- ./conf/prometheus/:/etc/prometheus/
- ./data/prometheus/:/prometheus
command:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --web.console.libraries=/usr/share/prometheus/console_libraries
- --web.enable-lifecycle
- --web.console.templates=/usr/share/prometheus/consoles
3、使用basic_auth加密
Prometheus于2.24版本(包括2.24)之后提供Basic Auth功能进行加密访问,在浏览器登录UI的时候需要输入用户密码,访问Prometheus api的时候也需要加上用户密码。
1、安装httpd-tools
yum -y install httpd-tools
2、生成密码
htpasswd -nBC 12 '' | tr -d ':\n'
3、写入配置文件
# vim config.yml
basic_auth_users:
admin: $2y$12$82JUhBA8IvYnueD.qaeeB.99VXcQReU/eq.QkRMXOYxlgjqDz.otK
4、启用配置文件
二进制
# vim /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
Environment="GOMAXPROCS=4"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--web.config.file=/usr/local/prometheus/config.yml \
--storage.tsdb.path=/data/prometheus \
--storage.tsdb.retention=60d \
--web.console.libraries=/usr/local/prometheus/console_libraries \
--web.console.templates=/usr/local/prometheus/consoles \
--web.listen-address=0.0.0.0:9090 \
--web.read-timeout=5m \
--web.max-connections=10 \
--query.max-concurrency=20 \
--query.timeout=2m \
--web.enable-lifecycle
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
NoNewPrivileges=true
LimitNOFILE=infinity
ReadWriteDirectories=/data/prometheus
ProtectSystem=full
SyslogIdentifier=prometheus
Restart=always
[Install]
WantedBy=multi-user.target
docker-compose
services:
prometheus:
image: prom/prometheus:v2.48.1
container_name: prometheus
hostname: prometheus
restart: always
user: root
environment:
- TZ=Asia/Shanghai
- LANG=zh_CN.UTF-8
volumes:
- ./prometheus/conf:/etc/prometheus
- ./prometheus/data:/prometheus
ports:
- "9090:9090"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.config.file=/etc/prometheus/config.yml' # 添加配置文件,注意config.yml的挂载路径
- '--storage.tsdb.path=/prometheus'
- "--storage.tsdb.retention.time=15d"
- "--web.enable-lifecycle"
- "--web.console.libraries=/usr/share/prometheus/console_libraries"
- "--web.console.templates=/usr/share/prometheus/consoles"
- "--enable-feature=remote-write-receiver"
- "--query.lookback-delta=2m"
5、重启prometheus
6、不重启重新加载配置文件
curl --basic -u admin:密码 -X POST http://127.0.0.1:9090/-/reload
三、prometheus 配置文件
Prometheus配置方式有两种:
(1)命令行,用来配置不可变命令参数,主要是Prometheus运行参数,比如数据存储位置
(2)配置文件,用来配置Prometheus应用参数,比如数据采集,报警对接
不重启进程配置生效方式也有两种:
(1)对进程发送信号SIGHUP
(2)HTTP POST请求,需要开启–web.enable-lifecycle选项
curl -X POST http://192.168.66.112:9091/-/reload
配置文件格式是yaml格式,说明: .yml或者.yaml 都是 yaml格式的文件, yaml格式的好处: 和json交互比较容易 python/go/java/php 有yaml格式库,方便语言之间解析,并且这种格式存储的信息量很大。
1、命令行
命令行可用配置可通过prometheus -h来查看。
# ./prometheus -h
usage: prometheus [<flags>]
The Prometheus monitoring server
Flags:
-h, --help Show context-sensitive help (also try --help-long and --help-man).
--version Show application version.
--config.file="prometheus.yml"
Prometheus configuration file path.
--web.listen-address="0.0.0.0:9090"
Address to listen on for UI, API, and telemetry.
--web.config.file="" [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
--web.read-timeout=5m Maximum duration before timing out read of the request, and closing idle connections.
--web.max-connections=512 Maximum number of simultaneous connections.
--web.external-url=<URL> The URL under which Prometheus is externally reachable (for example, if Prometheus is served via a reverse proxy). Used for generating relative and absolute links back to Prometheus itself. If the URL
has a path portion, it will be used to prefix all HTTP endpoints served by Prometheus. If omitted, relevant URL components will be derived automatically.
--web.route-prefix=<path> Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
--web.user-assets=<path> Path to static asset directory, available at /user.
--web.enable-lifecycle Enable shutdown and reload via HTTP request.
--web.enable-admin-api Enable API endpoints for admin control actions.
--web.enable-remote-write-receiver
Enable API endpoint accepting remote write requests.
--web.console.templates="consoles"
Path to the console template directory, available at /consoles.
--web.console.libraries="console_libraries"
Path to the console library directory.
--web.page-title="Prometheus Time Series Collection and Processing Server"
Document title of Prometheus instance.
--web.cors.origin=".*" Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
--storage.tsdb.path="data/"
Base path for metrics storage. Use with server mode only.
--storage.tsdb.retention=STORAGE.TSDB.RETENTION
[DEPRECATED] How long to retain samples in storage. This flag has been deprecated, use "storage.tsdb.retention.time" instead. Use with server mode only.
--storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME
How long to retain samples in storage. When this flag is set it overrides "storage.tsdb.retention". If neither this flag nor "storage.tsdb.retention" nor "storage.tsdb.retention.size" is set, the
retention time defaults to 15d. Units Supported: y, w, d, h, m, s, ms. Use with server mode only.
--storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE
Maximum number of bytes that can be stored for blocks. A unit is required, supported units: B, KB, MB, GB, TB, PB, EB. Ex: "512MB". Use with server mode only.
--storage.tsdb.no-lockfile
Do not create lockfile in data directory. Use with server mode only.
--storage.tsdb.allow-overlapping-blocks
Allow overlapping blocks, which in turn enables vertical compaction and vertical query merge. Use with server mode only.
--storage.agent.path="data-agent/"
Base path for metrics storage. Use with agent mode only.
--storage.agent.wal-compression
Compress the agent WAL. Use with agent mode only.
--storage.agent.retention.min-time=STORAGE.AGENT.RETENTION.MIN-TIME
Minimum age samples may be before being considered for deletion when the WAL is truncated Use with agent mode only.
--storage.agent.retention.max-time=STORAGE.AGENT.RETENTION.MAX-TIME
Maximum age samples may be before being forcibly deleted when the WAL is truncated Use with agent mode only.
--storage.agent.no-lockfile
Do not create lockfile in data directory. Use with agent mode only.
--storage.remote.flush-deadline=<duration>
How long to wait flushing sample on shutdown or config reload.
--storage.remote.read-sample-limit=5e7
Maximum overall number of samples to return via the remote read interface, in a single query. 0 means no limit. This limit is ignored for streamed response types. Use with server mode only.
--storage.remote.read-concurrent-limit=10
Maximum number of concurrent remote read calls. 0 means no limit. Use with server mode only.
--storage.remote.read-max-bytes-in-frame=1048576
Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note that client might have limit on frame size as well. 1MB as recommended by protobuf by
default. Use with server mode only.
--rules.alert.for-outage-tolerance=1h
Max time to tolerate prometheus outage for restoring "for" state of alert. Use with server mode only.
--rules.alert.for-grace-period=10m
Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured "for" time greater than grace period. Use with server mode only.
--rules.alert.resend-delay=1m
Minimum amount of time to wait before resending an alert to Alertmanager. Use with server mode only.
--alertmanager.notification-queue-capacity=10000
The capacity of the queue for pending Alertmanager notifications. Use with server mode only.
--query.lookback-delta=5m The maximum lookback duration for retrieving metrics during expression evaluations and federation. Use with server mode only.
--query.timeout=2m Maximum time a query may take before being aborted. Use with server mode only.
--query.max-concurrency=20
Maximum number of queries executed concurrently. Use with server mode only.
--query.max-samples=50000000
Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load more samples than this into memory, so this also limits the number of samples a query can
return. Use with server mode only.
--enable-feature= ... Comma separated feature names to enable. Valid options: agent, exemplar-storage, expand-external-labels, memory-snapshot-on-shutdown, promql-at-modifier, promql-negative-offset, remote-write-receiver
(DEPRECATED), extra-scrape-metrics, new-service-discovery-manager. See https://prometheus.io/docs/prometheus/latest/feature_flags/ for more details.
--log.level=info Only log messages with the given severity or above. One of: [debug, info, warn, error]
--log.format=logfmt Output format of log messages. One of: [logfmt, json]
2、配置文件
配置文件使用yml格式,配置文件中一级配置项如下,说明参考#备注内容。
#全局配置 (如果有内部单独设定,会覆盖这个参数)
global:
#告警插件定义。这里会设定alertmanager这个报警插件。
alerting:
#告警规则。 按照设定参数进行扫描加载,用于自定义报警规则,其报警媒介和route路由由alertmanager插件实现。
rule_files:
#采集配置。配置数据源,包含分组job_name以及具体target。又分为静态配置和服务发现
scrape_configs:
#用于远程存储写配置
remote_write:
#用于远程读配置
remote_read:
配置文件中通用字段值格式 : 布尔类型值为true和false : 协议方式包含http和https
原始配置文件内容:
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
# - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=` to any timeseries scraped from this config.
- job_name: 'prometheus'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ['localhost:9090']
1、global字段
scrape_interval
全局默认的数据拉取间隔
[ scrape_interval: <duration> | default = 1m ]
scrape_timeout
全局默认的单次数据拉取超时,当报context deadline exceeded错误时需要在特定的job下配置该字段。
[ scrape_timeout: <duration> | default = 10s ]
evaluation_interval
全局默认的规则(主要是报警规则)拉取间隔
[ evaluation_interval: <duration> | default = 1m ]
external_labels
该服务端在与其他系统对接所携带的标签
[ <labelname>: <labelvalue> ... ]
2、alerting 字段
该字段配置与Alertmanager进行对接的配置 样例:
alerting:
alert_relabel_configs: # 动态修改 alert 属性的规则配置。
- source_labels: [dc]
regex: (.+)\d+
target_label: dc1
alertmanagers:
- static_configs:
- targets: ['127.0.0.1:9093'] # 单实例配置
#- targets: ['172.31.10.167:19093','172.31.10.167:29093','172.31.10.167:39093'] # 集群配置
- job_name: 'Alertmanager'
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
- static_configs:
- targets: ['localhost:19093']
上面的配置中的 alert_relabel_configs
是指警报重新标记在发送到Alertmanager之前应用于警报。 它具有与目标重新标记相同的配置格式和操作,外部标签标记后应用警报重新标记,主要是针对集群配置。
这个设置的用途是确保具有不同外部label的HA对Prometheus服务端发送相同的警报信息。
Alertmanager 可以通过 static_configs
参数静态配置,也可以使用其中一种支持的服务发现机制动态发现,我们上面的配置是静态的单实例。
此外,relabel_configs
允许从发现的实体中选择 Alertmanager,并对使用的API路径提供高级修改,该路径通过 __alerts_path__
标签公开。
完成以上配置后,重启Prometheus服务,用以加载生效,也可以使用热加载功能,使其配置生效。然后通过浏览器,访问http://192.168.1.220:19090/alerts就可以看 inactive
pending
firing
三个状态,没有警报信息是因为我们还没有配置警报规则 rules
。
这里定义和prometheus集成的alertmanager插件,用于监控报警。后续会单独进行alertmanger插件的配置、配置说明、报警媒介以及route路由规则记录。
1、alert_relabel_configs
此项配置和scrape_configs
字段中relabel_configs
配置一样,用于对需要报警的数据进行过滤后发向Alertmanager
说明 relabel-configs的配置允许你选择你想抓取的目标和这些目标的标签是什么。所以说如果你想要抓取这种类型的服务器而不是那种,可以使用relabel_configs
相比之下,metric_relabel_configs是发生在抓取之后,但在数据被插入存储系统之前使用。因此如果有些你想过滤的指标,或者来自抓取本身的指标(比如来自/metrics页面)你就可以使用metric_relabel_configs来处理。
2、alertmanagers
该项目主要用来配置不同的alertmanagers
服务,以及Prometheus服务和他们的链接参数。alertmanagers
服务可以静态配置也可以使用服务发现配置。Prometheus以pushing 的方式向alertmanager传递数据。
alertmanager 服务配置和target配置一样,可用字段如下
[ timeout: <duration> | default = 10s ]
[ path_prefix: <path> | default = / ]
[ scheme: <scheme> | default = http ]
basic_auth:
[ username: <string> ]
[ password: <string> ]
[ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
[ <tls_config> ]
[ proxy_url: <string> ]
azure_sd_configs:
[ - <azure_sd_config> ... ]
consul_sd_configs:
[ - <consul_sd_config> ... ]
dns_sd_configs:
[ - <dns_sd_config> ... ]
ec2_sd_configs:
[ - <ec2_sd_config> ... ]
file_sd_configs:
[ - <file_sd_config> ... ]
gce_sd_configs:
[ - <gce_sd_config> ... ]
kubernetes_sd_configs:
[ - <kubernetes_sd_config> ... ]
marathon_sd_configs:
[ - <marathon_sd_config> ... ]
nerve_sd_configs:
[ - <nerve_sd_config> ... ]
serverset_sd_configs:
[ - <serverset_sd_config> ... ]
triton_sd_configs:
[ - <triton_sd_config> ... ]
static_configs:
[ - <static_config> ... ]
relabel_configs:
[ - <relabel_config> ... ]
3、rule_files
这个主要是用来设置告警规则,基于设定什么指标进行报警(类似触发器trigger)。这里设定好规则以后,prometheus会根据全局global设定的evaluation_interval参数进行扫描加载,规则改动后会自动加载。其报警媒介和route路由由alertmanager插件实现。 样例:
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
# - "first_rules.yml"
# - "second_rules.yml"
"first_rules.yml"样例:
groups:
- name: test-rules
rules:
- alert: InstanceDown # 告警名称
expr: up == 0 # 告警的判定条件,参考Prometheus高级查询来设定
for: 10s # 满足告警条件持续时间多久后,才会发送告警
labels: #标签项
severity: error
annotations: # 解析项,详细解释告警信息
summary: "{{$labels.instance}}: has been down"
description: "{{$labels.instance}}: job {{$labels.job}} has been down "
Prometheus 支持两种类型的 Rules ,可以对其进行配置,然后定期进行运算:recording rules 记录规则 与 alerting rules 警报规则,规则文件的计算频率与警报规则计算频率一致,都是通过全局配置中的 evaluation_interval 定义。
规则分组rule_group
不论是recording rules还是alerting rules都要在组里面。
groups:
- name: example
#该组下的规则
rules:
[ - <rule> ... ]
alerting rules
要在Prometheus中使用Rules规则,就必须创建一个包含必要规则语句的文件,并让Prometheus通过Prometheus配置中的rule_files字段加载该文件,前面我们已经讲过了。 其实语法都一样,除了 recording rules 中的收集的指标名称 record: 字段配置方式略有不同,其他都是一样的。
配置范例:
- alert: ServiceDown
expr: avg_over_time(up[5m]) * 100 < 50
annotations:
description: The service {{ $labels.job }} instance {{ $labels.instance }} is
not responding for more than 50% of the time for 5 minutes.
summary: The service {{ $labels.job }} is not responding
- alert: RedisDown
expr: avg_over_time(redis_up[5m]) * 100 < 50
annotations:
description: The Redis service {{ $labels.job }} instance {{ $labels.instance
}} is not responding for more than 50% of the time for 5 minutes.
summary: The Redis service {{ $labels.job }} is not responding
- alert: PostgresDown
expr: avg_over_time(pg_up[5m]) * 100 < 50
annotations:
description: The Postgres service {{ $labels.job }} instance {{ $labels.instance
}} is not responding for more than 50% of the time for 5 minutes.
summary: The Postgres service {{ $labels.job }} is not responding
定义Recording rules
recording rules 是提前设置好一个比较花费大量时间运算或经常运算的表达式,其结果保存成一组新的时间序列数据。当需要查询的时候直接会返回已经计算好的结果,这样会比直接查询快,同时也减轻了PromQl的计算压力,同时对可视化查询的时候也很有用,可视化展示每次只需要刷新重复查询相同的表达式即可。
在配置的时候,除却 record: 需要注意,其他的基本上是一样的,一个 groups 下可以包含多条规则 rules ,Recording 和 Rules 保存在 group 内,Group 中的规则以规则的配置时间间隔顺序运算,也就是全局中的 evaluation_interval 设置。
配置范例:
groups:
- name: http_requests_total
rules:
- record: job:http_requests_total:rate10m
expr: sum by (job)(rate(http_requests_total[10m]))
lables:
team: operations
- record: job:http_requests_total:rate30m
expr: sum by (job)(rate(http_requests_total[30m]))
lables:
team: operations
上面的规则其实就是根据 record 规则中的定义,Prometheus 会在后台完成 expr 中定义的 PromQL 表达式周期性运算,以 job 为维度使用 sum 聚合运算符 计算 函数rate 对http_requests_total 指标区间 10m 内的增长率,并且将计算结果保存到新的时间序列 job:http_requests_total:rate10m 中, 同时还可以通过 labels 为样本数据添加额外的自定义标签,但是要注意的是这个 lables 一定存在当前表达式 Metrics 中。
使用模板
模板是在警报中使用时间序列标签和值展示的一种方法,可以用于警报规则中的注释(annotation)与标签(lable)。模板其实使用的go语言的标准模板语法,并公开一些包含时间序列标签和值的变量。这样查询的时候,更具有可读性,也可以执行其他PromQL查询 来向警报添加额外内容,ALertmanager Web UI中会根据标签值显示器警报信息。
{{ $lable.}} 可以获取当前警报实例中的指定标签值
{{ $value }} 变量可以获取当前PromQL表达式的计算样本值。
groups:
- name: operations
rules:
# monitor node memory usage
- alert: node-memory-usage
expr: (1 - (node_memory_MemAvailable_bytes{env="operations",job!='atlassian'} / (node_memory_MemTotal_bytes{env="operations"})))* 100 > 90
for: 1m
labels:
status: Warning
team: operations
annotations:
description: "Environment: {{ $labels.env }} Instance: {{ $labels.instance }} memory usage above {{ $value }} ! ! !"
summary: "node os memory usage status"
调整好rules以后,我们可以使用 curl -XPOST http://localhost:9090/-/reload 或者 对Prometheus服务重启,让警报规则生效。
这个时候,我们可以把阈值调整为 50 来进行故障模拟操作,这时在去访问UI的时候,当持续1分钟满足警报条件,实际警报状态已转换为 Firing,可以在 Annotations中看到模板信息 summary 与 description 已经成功显示。
规则检查
#打镜像后使用
FROM golang:1.18
RUN GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go get -u github.com/prometheus/prometheus/cmd/promtool
FROM alpine:latest
COPY --from=0 /go/bin/promtool /bin
ENTRYPOINT ["/bin/promtool"]
# 编译
docker build -t promtool:0.1 .
#使用
docker run --rm -v /root/test/prom:/opt promtool:0.1 check rules /opt/rule.yml
#返回
Checking /opt/rule.yml
SUCCESS: 1 rules found
4、scrape_configs字段
拉取数据配置,在配置字段内可以配置拉取数据的对象(Targets),job以及实例
job_name
定义job名称,是一个拉取单元。每个job_name都会自动引入默认配置如
- scrape_interval 依赖全局配置
- scrape_timeout 依赖全局配置
- metrics_path 默认为’/metrics’
- scheme 默认为’http’
这些也可以在单独的job中自定义
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
[ metrics_path: <path> | default = /metrics ]
honor_labels
服务端拉取过来的数据也会存在标签,配置文件中也会有标签,这样就可能发生冲突。
true就是以抓取数据中的标签为准 false就会重新命名抓取数据中的标签为“exported”形式,然后添加配置文件中的标签
[ honor_labels: <boolean> | default = false ]
scheme
切换抓取数据所用的协议
[ scheme: <scheme> | default = http ]
params
定义可选的url参数
[ <string>: [<string>, ...] ]
5、抓取认证类
每次抓取数据请求的认证信息
basic_auth
password和password_file互斥只可以选择其一
basic_auth:
[ username: <string> ]
[ password: <secret> ]
[ password_file: <string> ]
bearer_token
bearer_token和bearer_token_file互斥只可以选择其一
[ bearer_token: <secret> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config
抓取ssl请求时证书配置
tls_config:
[ ca_file: <filename> ]
[ cert_file: <filename> ]
[ key_file: <filename> ]
[ server_name: <string> ]
#禁用证书验证
[ insecure_skip_verify: <boolean> ]
proxy_url
通过代理去主去数据
[ proxy_url: <string> ]
6、服务发现类
Prometheus支持多种服务现工具,详细配置这里不再展开
#sd就是service discovery的缩写
azure_sd_configs:
consul_sd_configs:
dns_sd_configs:
ec2_sd_configs:
openstack_sd_configs:
file_sd_configs:
gce_sd_configs:
kubernetes_sd_configs:
marathon_sd_configs:
nerve_sd_configs:
serverset_sd_configs:
triton_sd_configs:
更多参考官网:https://prometheus.io/docs/prometheus/latest/configuration/configuration/
static_configs
服务发现来获取抓取目标为动态配置,这个配置项目为静态配置,静态配置为典型的targets配置,在改配置字段可以直接添加标签
- targets:
[ - '<host>' ]
labels:
[ <labelname>: <labelvalue> ... ]
采集器所采集的数据都会带有label,当使用服务发现时,比如consul所携带的label如下:
__meta_consul_address: consul地址
__meta_consul_dc: consul中服务所在的数据中心
__meta_consul_metadata_: 服务的metadata
__meta_consul_node: 服务所在consul节点的信息
__meta_consul_service_address: 服务访问地址
__meta_consul_service_id: 服务ID
__meta_consul_service_port: 服务端口
__meta_consul_service: 服务名称
__meta_consul_tags: 服务包含的标签信息
这些lable是数据筛选与聚合计算的基础。
7、数据过滤类
抓取数据很繁杂,尤其是通过服务发现添加的target。所以过滤就显得尤为重要,我们知道抓取数据就是抓取target的一些列metrics,Prometheus过滤是通过对标签操作操现的,在字段relabel_configs和metric_relabel_configs里面配置,两者的配置都需要relabel_config字段。该字段需要配置项如下
[ source_labels: '[' <labelname> [, ...] ']' ]
[ separator: <string> | default = ; ]
[ target_label: <labelname> ]
[ regex: <regex> | default = (.*) ]
[ modulus: <uint64> ]
[ replacement: <string> | default = $1 ]
#action除了默认动作还有keep、drop、hashmod、labelmap、labeldrop、labelkeep
[ action: <relabel_action> | default = replace ]
target配置示例
relabel_configs:
- source_labels: [job]
regex: (.*)some-[regex]
action: drop
- source_labels: [__address__]
modulus: 8
target_label: __tmp_hash
action: hashmod
target中metric示例
- job_name: cadvisor
...
metric_relabel_configs:
- source_labels: [id]
regex: '/system.slice/var-lib-docker-containers.*-shm.mount'
action: drop
- source_labels: [container_label_JenkinsId]
regex: '.+'
action: drop
target中metric示例
- job_name: cadvisor
...
metric_relabel_configs:
- source_labels: [id]
regex: '/system.slice/var-lib-docker-containers.*-shm.mount'
action: drop
- source_labels: [container_label_JenkinsId]
regex: '.+'
action: drop
使用示例 由以上可知当使用服务发现consul会带入标签__meta_consul_dc,现在为了表示方便需要将该标签变为dc
需要做如下配置,这里面action使用的replacement
scrape_configs:
- job_name: consul_sd
relabel_configs:
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
replacement: $1
action: replace
target_label: "dc"
#或者
- source_labels: ["__meta_consul_dc"]
target_label: "dc"
过滤采集target
relabel_configs:
- source_labels: ["__meta_consul_tags"]
regex: ".*,development,.*"
action: keep
sample_limit
为了防止Prometheus服务过载,使用该字段限制经过relabel之后的数据采集数量,超过该数字拉取的数据就会被忽略
[ sample_limit: <int> | default = 0 ]
8、远程读写
Prometheus可以进行远程读/写数据。字段remote_read和remote_write
remote_read
#远程读取的url
url: <string>
#通过标签来过滤读取的数据
required_matchers:
[ <labelname>: <labelvalue> ... ]
[ remote_timeout: <duration> | default = 1m ]
#当远端不是存储的时候激活该项
[ read_recent: <boolean> | default = false ]
basic_auth:
[ username: <string> ]
[ password: <string> ]
[ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
[ <tls_config> ]
[ proxy_url: <string> ]
remote_write
url: <string>
[ remote_timeout: <duration> | default = 30s ]
#写入数据时候进行标签过滤
write_relabel_configs:
[ - <relabel_config> ... ]
basic_auth:
[ username: <string> ]
[ password: <string> ]
[ password_file: <string> ]
[ bearer_token: <string> ]
[ bearer_token_file: /path/to/bearer/token/file ]
tls_config:
[ <tls_config> ]
[ proxy_url: <string> ]
#远端写细粒度配置,这里暂时仅仅列出官方注释
queue_config:
# Number of samples to buffer per shard before we start dropping them.
[ capacity: <int> | default = 10000 ]
# Maximum number of shards, i.e. amount of concurrency.
[ max_shards: <int> | default = 1000 ]
# Maximum number of samples per send.
[ max_samples_per_send: <int> | default = 100]
# Maximum time a sample will wait in buffer.
[ batch_send_deadline: <duration> | default = 5s ]
# Maximum number of times to retry a batch on recoverable errors.
[ max_retries: <int> | default = 3 ]
# Initial retry delay. Gets doubled for every retry.
[ min_backoff: <duration> | default = 30ms ]
# Maximum retry delay.
[ max_backoff: <duration> | default = 100ms ]
评论区