Grafana Review

Fri, 10 Apr 2026 00:00:00 +0000

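Getting Started

- Getting Started
- Observability and Grafana overview
- Understanding observability
- What is observability
- The ability to infer a system's internal state from the data it emits
- How it differs from monitoring: monitoring defines in advance what to watch, while observability lets you answer questions you did not anticipate
- The three pillars
- Metrics:
- Timestamped numeric data, such as QPS, P99 response time, or memory usage
- Metrics are compact and aggregate well, which makes them suited to answering "how is the system doing overall?"
- Logs:
- Discrete event records, such as an error stack trace, a user login, or a slow database query
- Logs are rich in detail but large in volume, which makes them suited to answering "what exactly happened?"
- Metrics tell you that something is wrong; logs help you pin down the specific cause
- Traces:
- Record every service a request passes through in a distributed system, and how long each step takes
- For example, an API request passes through gateway -> user service -> order service -> database
- The trace records the latency of each hop
- Suited to answering "which step made the request slow?"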
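- Getting to know the Grafana ecosystem
- What is Grafana
- An open-source visualization and monitoring platform developed by Grafana Labs
- Grafana does not store data itself; it is a unified query and visualization front end
- The LGTM stack
- LGTM stands for four core components: Loki, Grafana, Tempo, and Mimir, each covering one key aspect of observability
- L - Loki: log aggregation system, playing a role similar to Elasticsearch in the ELK stack but more lightweight
- G - Grafana: the visualization hub where all the data comes together
- T - Tempo: distributed tracing backend that stores and queries trace data
- M - Mimir: long-term metrics storage, which can be thought of as an enhanced Prometheus
- Prometheus's role in the ecosystem
- What is Prometheus
- An open-source metrics monitoring and alerting system
- In the Grafana ecosystem it handles metric collection and storage
- Key characteristics
- Pull model: Prometheus actively scrapes metrics from target services instead of having them push data to it; a service only needs to expose a /metrics HTTP endpoint
- Time series database: every sample is stored with a timestamp, forming time series
- PromQL: Prometheus's built-in query language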
- Environment setup and a first look at the Grafana UI
- Setting up the environment with Docker Compose
- Create the project directory
-
```
grafana-learning/
├── docker-compose.yml
└── prometheus/
    └── prometheus.yml
```
- Start the services
- Open a terminal in the grafana-learning directory and run
```bash
docker-compose up -d
```
- Verify Prometheus
- Open localhost:9090 in a browser
- Type up into the query box and click Execute
- The result up{job="prometheus"} has the value 1
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163113390.png,440,150)
- Click Status -> Targets; the prometheus target is listed with a green UP state
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163148074.png,400,150)
- Verify Grafana
- Open localhost:3000 and log in with admin/admin123
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163255350.png,309,338)
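- Getting familiar with the Grafana UI
- Home: landing page with recently viewed dashboards and shortcuts
- Dashboards: manage and browse all dashboards; the core working area
- Explore: a workbench for ad hoc queries and debugging data
- Alerting: configure alert rules, notification channels, and silences
- Connections: manage data source connections; this is where Prometheus is added
- Administration: system administration (users, organizations, plugins, server settings, and so on)
- Connecting Prometheus and creating the first panel
- Add the Prometheus data source
- In Grafana, go to Connections -> Data sources -> Add data source and choose Prometheus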
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163355890.png,400,300)
- Set the Connection URL to http://prometheus:9090 (the host name is the container name)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163429596.png,300,80)
- The message "Successfully queried the Prometheus API" means the connection works
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163506913.png,410,60)
- Create the first dashboard
- Dashboards -> New -> New dashboard -> Add visualization
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163625329.png,430,290)
- In Metric, enter up and click Run queries; a flat line at 1 appears
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163741421.png,260,160)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163813561.png,260,190)
- Click Save dashboard to save it
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410163933122.png,260,210)
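- Add a second panel
- Add -> Visualization
- Enter prometheus_target_interval_length_seconds
- It records the actual intervals between Prometheus scrapes of its targets; several lines appear because each quantile label value is its own series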
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410164235654.png,260,200)
- Panel settings: set Title to "Scrape interval distribution", switch the Legend mode to Custom, and enter {{quantile}}
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410164840562.png,260,170)
- Adjust the time range
- Set the time picker to Last 6 hours
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410165639451.png,340,320)
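- Understanding Prometheus basics
- How Prometheus works
- Pull model
- The traditional approach is for applications to push data to the monitoring server
- Prometheus instead scrapes the metrics each target exposes at a fixed time interval
- Every 15s (the scrape_interval configured here), Prometheus sends an HTTP GET request to the target's /metrics endpoint
- It parses the returned text and stores it in its local time series database
- Advantage: a monitored service does not need to know Prometheus's address; it only exposes an HTTP endpoint
- Prometheus manages the configuration of all scrape targets centrally, and if a target goes down it notices at the next scrape (up becomes 0)
- Metric data structure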
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410175651692.png,260,150)
- Format: metric_name{label1="value1", label2="value2"} value timestamp (the timestamp is optional)
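To make the sample format concrete, here is a minimal parsing sketch. It is not the parser Prometheus uses: the function name `parse_sample` and the regular expressions are illustrative assumptions, and escaped characters inside label values are left as-is.

```python
import re

# One sample line: metric_name{label1="v1",label2="v2"} value [timestamp_ms]
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional label set
    r'\s+(?P<value>\S+)'                     # sample value
    r'(?:\s+(?P<ts>-?\d+))?\s*$'             # optional timestamp in milliseconds
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (name, labels, value, timestamp)."""
    m = SAMPLE_RE.match(line)
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    ts = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, float(m.group("value")), ts
```

For example, `parse_sample('up 1')` yields the name `up`, an empty label set, the value `1.0`, and no timestamp.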
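- The four metric types
- Counter
- Only goes up; each event increases the value, usually by 1
- Typical uses: total requests, total errors, bytes sent
- You almost never look at a Counter's raw value; instead, use rate() to compute how fast it is growing
- For example, rate(http_requests_total[5m]) gives the average number of requests handled per second over the last 5 minutes
- A Counter resets to zero when the service restarts, but rate() handles these resets automatically
- Gauge
- A Gauge can go up or down and represents a snapshot of some instantaneous state
- Typical uses: current memory usage, CPU temperature, number of tasks in a queue
- Unlike a Counter, a Gauge's raw value is meaningful on its own and does not need rate()
- Histogram
- A Histogram counts observations into predefined buckets
- Each request's response time falls into one of the latency buckets
- In Prometheus a Histogram actually produces several time series: _bucket (cumulative count per bucket), _count (total number of observations), and _sum (sum of all observed values)
- Its core value is that quantiles can be estimated flexibly on the server side at query time
- For example, histogram_quantile(0.95, ...) computes P95 at query time
- Summary
- A Summary solves a similar problem but computes quantiles directly on the client
- Its drawback is that the quantiles are fixed once computed on the client and cannot be aggregated across instances
- In real projects, Histogram is used more often than Summary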
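The Counter notes say that rate() copes with resets. The sketch below shows the idea on a list of (timestamp, value) samples. It is a simplified model, not Prometheus's implementation: the real rate() also extrapolates to the window boundaries, and the names `counter_increase` and `simple_rate` are made up for illustration.

```python
def counter_increase(samples):
    """Total increase over (timestamp, value) samples, treating any drop as a reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a restart the counter starts again from 0, so the new value
        # itself is the increase since the reset.
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed
```

With samples `[(0, 100), (15, 130), (30, 10), (45, 40)]` the drop from 130 to 10 is treated as a restart, so the increase is 30 + 10 + 30 = 70 rather than a negative number.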
- Understanding labels
- A metric name plus a set of labels uniquely identifies one time series
- What is a label
- Query prometheus_http_requests_total:
```
prometheus_http_requests_total{code="200", handler="/api/v1/query", instance="prometheus:9090", job="prometheus"} → 42
prometheus_http_requests_total{code="200", handler="/metrics", instance="prometheus:9090", job="prometheus"} → 1580
prometheus_http_requests_total{code="302", handler="/", instance="prometheus:9090", job="prometheus"} → 3
```
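- code, handler, instance, and job are all labels; although the metric name is always prometheus_http_requests_total, each distinct label combination forms an independent time series
- Labels enable dimensional analysis: filter by HTTP status code, group by handler path, aggregate by instance
- Watch label cardinality: using something like a user ID as a label creates one series per user and makes storage explode
- Querying
- prometheus_http_requests_total{code="200"} shows only successful requests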
- Connecting a data source and creating panels in Grafana
- Explore: the query workbench
- Query a Counter metric
- prometheus_http_requests_total
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410191834327.png,400,400)
- Show only requests to the /metrics endpoint
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410191950185.png,400,400)
- Use rate() to turn the cumulative value into a rate
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410192131676.png,400,400)
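- [5m] is the range selector that sets the time window for the calculation; rate(...[5m]) is the average number of requests per second over the last 5 minutes
- Query a Gauge metric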
- go_goroutines
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410192303151.png,400,400)
- go_memstats_alloc_bytes/1024
- The number of bytes currently allocated by the Prometheus process; dividing by 1024 converts it to KiB
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410192453732.png,400,400)
- Inspect the structure of a Histogram metric
- prometheus_http_request_duration_seconds_bucket
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410192606038.png,400,375)
- Every series carries an le label, short for "less than or equal"
- Compute P90
- histogram_quantile(0.9, rate(prometheus_http_request_duration_seconds_bucket[5m]))
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410193139016.png,400,375)
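The P90 query above relies on histogram_quantile() estimating a quantile from cumulative bucket counts. The following is a rough sketch of that estimation, assuming linear interpolation inside a bucket and a lower bound of 0 for the first bucket. The function is a stand-in for illustration, not Prometheus code, and it takes plain counts where the real query passes per-second bucket rates (the ratios are the same).

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from (le, cumulative_count) pairs sorted by le.

    The last bucket is expected to have le = math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(le):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_le
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le
```

Because the result is interpolated within a bucket, its accuracy depends on how well the bucket boundaries fit the real latency distribution.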
- Building a complete dashboard
- Create the dashboard
- Dashboards -> New -> New dashboard, then Save dashboard and name it "Day 4 - Prometheus self-monitoring"
- Panel 1: Prometheus status (Stat panel)
- Add -> Visualization, then switch the panel type from Time series to Stat
- Query: up{job="prometheus"}
- Title: Status
- In the Value mappings section, click Add value mappings and add 1 -> Running (green) and 0 -> Down (red)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410194252635.png,300,300)
- Panel 2: requests per second (Time series panel)
- Query: rate(prometheus_http_requests_total[5m])
- Title: "HTTP request rate (by handler)"
- Legend: {{handler}}
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/10/20260410194559215.png,300,300)
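- Panel 3: memory usage (Time series panel)
- go_memstats_alloc_bytes / 1024 / 1024
- Title: Memory usage (MB). In the right-hand settings, find Standard options -> Unit, search for and select Megabytes (SI), and set Decimals to 1 so values show one decimal place (strictly speaking, dividing by 1024 twice yields MiB)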
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412093035837.png,440,340)
- Panel 4: goroutine count (Gauge panel)
- Gauge here is a Grafana panel type, not the Prometheus metric type; same name, different concept
- go_goroutines
- Title: Goroutines. In Standard options, set Min to 0 and Max to 100
- In the Thresholds section, set green as the base color, yellow above 50, and red above 80
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412093435497.png,420,340)
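- Panel 5: request latency P90 (Stat panel)
- Create a new Stat panel
- histogram_quantile(0.9, rate(prometheus_http_request_duration_seconds_bucket[5m]))
- Title: Request latency P90. In Standard options -> Unit, select seconds
- Layout suggestions
- Row 1: small panels. Single-value panels such as status and request latency P90 work well at narrow sizes
- Row 2: time series charts that need horizontal room, such as the HTTP request rate
- Row 3: memory usage and Goroutines side by side
- Exploring panel settings
- Standard options holds the most common display settings: Unit, Min/Max, Decimals, and Color scheme
- Thresholds color-code value ranges so you can judge good versus bad at a glance
- Overrides let you override the global settings for specific series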

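Core Skills

- Core Skills
- PromQL
- Two vector types: the foundation of PromQL
- Instant vector
- An instant vector returns one sample per time series at a single point in time; most of the earlier queries were instant vectors
- up/go_goroutines/prometheus_http_requests_total{handler="/metrics"}
- Each query returns the latest value of every matching time series at the current instant
- Switching to the Table view makes this clearer
- Range vector
- Returns all samples of each time series within a time window; the syntax appends [window] to the selector
- prometheus_http_requests_total{handler="/metrics"}[5m]
- Returns all samples from the last 5 minutes
- A range vector cannot be graphed directly; it has to be fed into a function that turns it into an instant vector, most commonly rate()
- Writing time windows
- Supported units: ms (milliseconds), s (seconds), m (minutes), h (hours), d (days), w (weeks), y (years)
- Units can be combined: [1h30m] means 1 hour 30 minutes
- The shorter the window, the spikier the curve and the finer the fluctuations it captures; the longer the window, the smoother the curve and the more it reflects long-term trends
- 5m is the most commonly used window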
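As a small illustration of how a combined window such as 1h30m adds up, here is a sketch that converts a duration string to seconds. The helper `duration_seconds` is hypothetical; the unit sizes follow Prometheus's convention that a day is 24h, a week is 7d, and a year is 365d.

```python
import re

UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600,
                "d": 86400, "w": 604800, "y": 31536000}
# "ms" is listed before "m" so that "5ms" is not read as 5 minutes.
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_seconds(text):
    """Convert a duration such as '5m' or '1h30m' to seconds."""
    if not text:
        raise ValueError("empty duration")
    pos, total = 0, 0.0
    while pos < len(text):
        m = _PART.match(text, pos)
        if m is None:
            raise ValueError(f"bad duration: {text!r}")
        total += int(m.group(1)) * UNIT_SECONDS[m.group(2)]
        pos = m.end()
    return total
```

So `[1h30m]` covers 5400 seconds, which at a 15s scrape interval is roughly 360 samples per series.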
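- Core functions
- rate()
- Computes the average per-second growth rate of a Counter over the given window and handles Counter resets automatically
- rate(prometheus_http_requests_total[5m])
- irate()
- irate() uses only the last two samples in the window to compute an instantaneous rate, so it is more sensitive to spikes
- irate(prometheus_http_requests_total[5m])
- Because it looks at just two points the result is unstable, so it is generally not used in alert rules
- increase()
- Returns the total growth of a Counter within the window
- increase(prometheus_http_requests_total{handler="/metrics"}[1h])
- Tells you how many times the /metrics endpoint was requested in the last hour
- sum()
- sum() adds up the values of multiple time series
- sum(rate(prometheus_http_requests_total[5m]))
- Adds up the request rates across all handlers and all codes to give one total QPS
- sum by (code) (rate(prometheus_http_requests_total[5m]))
- The by clause groups the result by a label dimension
- avg()
- With multiple instances, avg() gives the average
- It also supports by grouping
- count()
- count(prometheus_http_requests_total)
- Returns how many time series match the query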
- Arithmetic and common patterns
- Arithmetic
- Standard arithmetic operators are supported
- go_memstats_alloc_bytes / 1024 / 1024
- Two metrics can also be combined, for example to get the HTTP error ratio
- sum(rate(prometheus_http_requests_total{code=~"5.."}[5m]))/sum(rate(prometheus_http_requests_total[5m]))
- Comparison
- go_goroutines > 30
- Aggregation operators
- min(prometheus_http_request_duration_seconds_sum)
- max(prometheus_http_request_duration_seconds_sum)
- topk(3, prometheus_http_requests_total)
- bottomk(2, prometheus_http_requests_total)
- The four kinds of label matching
- Exact match =
- prometheus_http_requests_total{handler="/metrics"}
- Exact exclusion !=
- prometheus_http_requests_total{handler!="/metrics"}
- Regex match =~
- prometheus_http_requests_total{handler=~"/api/v1/.*"}
- Regex exclusion !~
- prometheus_http_requests_total{handler!~"/api/v1/.*"}
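The four matchers can be modelled in a few lines. One detail worth seeing in code is that Prometheus regex matchers are fully anchored, so the pattern has to match the whole label value. This is a sketch with a hypothetical `matches` helper; it uses Python's `re` module, whereas Prometheus uses RE2 syntax, so only simple patterns carry over exactly.

```python
import re


def matches(labels, name, op, pattern):
    """Evaluate one label matcher against a series' label set."""
    value = labels.get(name, "")  # a missing label behaves like an empty string
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    # Regex matchers are fully anchored: the pattern must cover the whole value.
    hit = re.fullmatch(pattern, value) is not None
    if op == "=~":
        return hit
    if op == "!~":
        return not hit
    raise ValueError(f"unknown operator: {op}")
```

This is why `handler=~"/api"` selects nothing for a handler of `/api/v1/query`, while `handler=~"/api/v1/.*"` does.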
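- Aggregation: by and without
- by keeps the listed labels
- by tells the aggregation to "group by these labels and drop all the others"
- sum by (code) (rate(prometheus_http_requests_total[5m]))
- You can group by several labels
- sum by (code, handler) (rate(prometheus_http_requests_total[5m]))
- without removes the listed labels
- Groups by every label except the listed ones
- sum without (instance) (rate(prometheus_http_requests_total[5m]))
- Removes the instance dimension and keeps all other labels. The effect is to merge data from different instances of the same job.
- When to use by or without
- Use by when you care about few dimensions and want to drop many; use without when you want to keep most dimensions and drop only a few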
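The difference between by and without comes down to how the grouping key is built. Below is a minimal sketch over (labels, value) pairs; `sum_by` and `sum_without` are illustrative names, not a real API.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Group by the labels in `keep`; every other label is dropped."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)


def sum_without(series, drop):
    """Group by every label except those in `drop`."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)
```

Both functions produce the same groups when `keep` and `drop` partition the label names; which one reads better depends on which list is shorter.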
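- Time offsets and comparing data
- offset: looking at historical data
- go_goroutines offset 1h
- Returns the goroutine count from 1 hour ago. On its own this value is not very useful, but it can be used for period-over-period comparison.
- Computing period-over-period change
- Subtracting the historical value from the current value gives the change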
-
```
rate(prometheus_http_requests_total{handler="/metrics"}[5m])
-
rate(prometheus_http_requests_total{handler="/metrics"}[5m] offset 1h)
```
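- A deeper look at Grafana panel types
- How to think about choosing a panel
- A single key value
- Stat or Gauge
- Stat suits showing the current value; Gauge suits bounded values such as percentages
- A trend over time
- Time series
- The most common panel type, suited to QPS, latency, memory usage, and other metrics where the trend over time matters
- Comparing the size of multiple items
- Bar chart
- Suited to comparing sizes across a dimension, such as request volume per service or the share of each status code
- Distribution density
- Heatmap
- Uses color intensity to show how densely values cluster; well suited to showing how the request latency distribution changes over time, and the best companion for Histogram data
- Structured detail data
- Table
- Shows multi-column detail data, suited to cases that need exact values and sorting
- Advanced Time series
- Panel 1: comparing multiple lines
- Query: rate(prometheus_http_requests_total[5m])
- Legend: {{handler}} {{code}}
- In the right-hand settings under Graph styles, set Line width to 2 and Fill opacity to 10 (adds a light fill under the lines)
- Under Tooltip, set the mode to All and Sort order to Descending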
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412104830826.png,600,220)
- Panel 2: stacked area chart
- Query: sum by (code) (rate(prometheus_http_requests_total[5m]))
- Title: Request volume stacked (by status code)
- Legend: {{code}}. Under Graph styles, find Stack series and set it to Normal
- Set Fill opacity to 50
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412105336324.png,540,300)
- Rich configuration for the Stat panel
- Panel 3: Stat with a sparkline
- sum(rate(prometheus_http_requests_total[5m]))
- Title: Total QPS. Under Stat styles, set Graph mode to Area (shows a mini trend line, the sparkline, below the big number)
- Standard options -> Unit: requests/sec (rps), Decimals: 2
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412105937729.png,551,342)
- Panel 4: multi-value Stat
- Query A: sum(rate(prometheus_http_requests_total{code="200"}[5m]))
- Query B: sum(rate(prometheus_http_requests_total{code!="200"}[5m]))
- Legend: A -> Successful requests, B -> Non-200 requests
- Orientation -> Horizontal, so the two values appear side by side
- Thresholds for B: green from 0, red above 0.1
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412110948641.png,541,320)
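- Bar chart: comparing dimensions
- Panel 5: bar chart ranked by handler
- sort_desc(sum by (handler) (increase(prometheus_http_requests_total[1h])))
- This query computes the total number of requests per handler over the last hour, sorted in descending order
- By default the Bar chart result looks poor: PromQL returns time series data, whereas a Bar chart is better at showing a snapshot at one instant
- In the Options area of the query editor, change Format to Table and Type to Instant
- In the right-hand panel settings, change Orientation to Horizontal (horizontal bars)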
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412122419997.png,600,170)
- Table: structured detail
- Panel 6: metric detail table
- sort_desc(sum by (handler, code) (increase(prometheus_http_requests_total[1h])))
- In Options, set Format to Table and Type to Instant
- Overrides -> Add field override -> Fields with name -> select Value
- Click "Add override property", type "cell" in the search box, select "Cell type", and set it to "Colored background"
- The coloring depends on the Thresholds configuration
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412151854303.png,422,282)
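- Heatmap: distribution density
- Heatmap is the best companion for Histogram data
- Color intensity represents data density; the X axis is time and the Y axis is the value range
- Panel 7: request latency distribution heatmap
- sum(increase(prometheus_http_request_duration_seconds_bucket[5m])) by (le)
- Aggregates the latency bucket data across all handlers; the le label is the upper bound of each bucket
- Format -> Heatmap
- Under Y Axis, set Unit to seconds; under Colors -> Scheme, try different color schemes
- Title: Request latency distribution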
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412152205566.png,530,310)
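- Dashboard layout optimization
- Row 1: overview metrics (Stat panels)
- Row 2: trend charts, that is, the Time series panels (multi-line comparison, stacked area chart)
- Row 3: analysis charts, with Bar chart and Heatmap side by side
- Row 4: the detail table, with the Table panel at the very bottom
- Dashboard variables and interaction design
- What variables are for
- Variables let users switch the data dimension they are viewing through a dropdown, instead of creating a separate panel per dimension
- Creating a variable
- Open the variable settings
- Dashboard page -> Settings -> Variables -> Add variable
- Create the handler variable
- Name: handler. This is the variable's identifier; reference it in queries as $handler
- Label: Endpoint. This is the human-readable label shown next to the dropdown
- Type: Query, meaning the candidate values come from a data source query
- Data source: prometheus
- Query type: Label values
- Label: handler
- Metric: prometheus_http_requests_total
- This tells Grafana to extract all distinct handler label values from prometheus_http_requests_total as the dropdown options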
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412153052768.png,280,300)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412153132293.png,400,150)
- Using the variable in a panel
- Add a Time series panel
- Query: rate(prometheus_http_requests_total{handler="$handler"}[5m])
- Legend: {{code}}. Title: $handler request rate (variables work in titles too)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412153254711.png,530,340)
- Multi-select and select all
- Dashboard Settings -> Variables -> edit the handler variable
- Under Selection options, check Multi-value to let users select several handlers at once
- Check Include All option to add an All entry to the dropdown
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412153554336.png,350,190)
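- Change the panel query to rate(prometheus_http_requests_total{handler=~"$handler"}[5m])
- Grafana turns a multi-value selection into a regular expression, which is why the matcher has to be =~ rather than =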
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412153751767.png,430,350)
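To picture what happens to $handler when several values are selected, here is a rough model of the interpolation: escape each value, join with `|`, and wrap the result in parentheses. This is an assumption-laden sketch, not Grafana's code; its exact escaping rules may differ, and `regex_value` and `build_query` are invented names.

```python
import re


def regex_value(selected):
    """Combine selected variable values into one regex alternative."""
    escaped = [re.escape(v) for v in selected]
    if len(escaped) == 1:
        return escaped[0]
    return "(" + "|".join(escaped) + ")"


def build_query(selected):
    """Substitute the combined value into the panel query from the notes."""
    return f'rate(prometheus_http_requests_total{{handler=~"{regex_value(selected)}"}}[5m])'
```

Selecting `/metrics` and `/api/v1/query` therefore produces a matcher like `handler=~"(/metrics|/api/v1/query)"`, which an exact `=` matcher could never satisfy.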
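- Built-in variables
- Grafana provides some built-in variables that need no manual setup and can be used directly in queries
- $__interval and $__rate_interval
- The most important built-in variables
- They automatically compute a sensible time step from the dashboard's time range and the panel width
- What makes $__rate_interval smart
- When viewing the last hour of data, it might be 1 minute
- When viewing the last 7 days, it might be 15 minutes
- However the time range changes, queries return data points at a sensible density
- The difference between $__rate_interval and $__interval
- $__rate_interval always covers at least 4 scrape intervals, so it is the better fit for rate()
- $__interval derives the step purely from the time range and the panel's pixel width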
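The "at least 4 scrape intervals" rule can be written as a one-line formula. The sketch below follows the definition given in the Grafana documentation, max($__interval + scrape interval, 4 * scrape interval); the function name and the 15s default are assumptions that match this tutorial's setup.

```python
def rate_interval(interval_s, scrape_interval_s=15):
    """Approximate $__rate_interval in seconds for a given $__interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)
```

When zoomed in so far that $__interval is only a few seconds, the window stays at 60s, which leaves rate() enough samples to work with; when zoomed out, the window grows with $__interval.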
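- $__range
- The full time range currently selected on the dashboard; useful with increase
- increase(prometheus_http_requests_total{handler=~"$handler"}[$__range])
- $__dashboard and $__name
- $__dashboard is the name of the current dashboard; $__name is replaced with the series name or alias and is only available in the legacy Singlestat panel
- They are typically used in titles and other text rather than in queries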
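- Building an observable HTTP service in Go
- Writing a Go service with Prometheus metrics
- Build a simulated order API service that exposes a request count (Counter), request latency (Histogram), and the number of in-flight requests (Gauge).
- Metric design
- myapp_http_requests_total (Counter)
- Records total requests along three dimensions: method, endpoint, and status
- It is a Counter because the request count only ever increases
- The three label dimensions let you analyze from different angles: by endpoint, by status code, by HTTP method
- myapp_http_request_duration_seconds (Histogram)
- Records the request latency distribution
- Histogram rather than Summary, because different quantiles need to be computed at query time
- myapp_http_requests_in_flight (Gauge)
- The number of requests currently being processed
- It is a Gauge because this value can go up and down
- myapp_orders_created_total (Counter)
- Business metric: order count by product type
- myapp_order_queue_size (Gauge)
- Simulated business metric: queue depth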
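The Go source is not reproduced in these notes, so the following is only a language-neutral sketch, in Python, of what a labeled Counter such as myapp_http_requests_total boils down to: one running total per label combination, rendered as exposition text. `LabeledCounter` is a hypothetical class, not the Prometheus client library.

```python
from collections import defaultdict


class LabeledCounter:
    """A counter with one running total per combination of label values."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = label_names
        self.values = defaultdict(float)

    def inc(self, *label_values, amount=1.0):
        if len(label_values) != len(self.label_names):
            raise ValueError("wrong number of label values")
        if amount < 0:
            raise ValueError("counters only go up")
        self.values[label_values] += amount

    def render(self):
        """Render the current totals in the text exposition format."""
        lines = [f"# TYPE {self.name} counter"]
        for label_values, value in sorted(self.values.items()):
            pairs = ",".join(f'{k}="{v}"' for k, v in zip(self.label_names, label_values))
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines)
```

Each distinct (method, endpoint, status) tuple becomes its own line, and therefore its own time series, which is why label values should come from a small, bounded set.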
- Start the service and verify
- Start the service
-
```bash
docker-compose down
docker-compose up -d --build
```
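- The --build flag rebuilds the myapp image. The first build needs to download the Go dependencies
- Verify myapp
- Visit http://localhost:8080/health and check that it returns {"status": "healthy"}
- Visit http://localhost:8080/metrics and check that the custom metrics and the Go runtime metrics are listed
- Verify Prometheus scraping
- Visit http://localhost:9090/targets; there should be two targets, prometheus and myapp, both in the UP state
- Query myapp_http_requests_total and confirm that data comes back
- Building the application monitoring dashboard
- Create a new dashboard named "My GoApp - Monitoring"
- Create the variable
- Settings -> Variables
- Add the endpoint variable:
- Type -> Query, Data source -> Prometheus
- Query type -> Label values, Label -> endpoint, Metric -> myapp_http_requests_total
- Check Multi-value and Include All option
- Panel 1: application status
- Query: up{job="myapp"}
- Type: Stat. Value mappings: 1 -> Online (green), 0 -> Offline (red). Title: Application status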
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412170743853.png,250,140)
- Panel 2: total QPS (Stat + sparkline)
- Query: sum(rate(myapp_http_requests_total{endpoint=~"$endpoint"}[$__rate_interval]))
- Type: Stat. Set Graph mode to Area and Unit to reqps. Title: QPS
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412170957566.png,250,140)
- Panel 3: error rate (Stat)
- Query:
```
sum(rate(myapp_http_requests_total{endpoint=~"$endpoint", status=~"4..|5.."}[$__rate_interval]))
/
sum(rate(myapp_http_requests_total{endpoint=~"$endpoint"}[$__rate_interval])) * 100
```
- Set Unit to "Percent (0-100)"
- Thresholds: green as the base, yellow above 5, red above 10. Title: "Error rate".
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412171243965.png,250,150)
- Panel 4: request rate by status code (stacked Time series)
- Query: sum by (status) (rate(myapp_http_requests_total{endpoint=~"$endpoint"}[$__rate_interval]))
- Legend: {{status}}. Under Graph styles, set Stack series to "Normal" and Fill opacity to 40.
- Override: add "Fields with name" for "200", choose the "Color" override property, and set it to green. Repeat to make "500" red, "400" yellow, and "201" light green.
- Title: "Request rate (by status code)"
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412171817077.png,250,150)
- Panel 5: request latency quantiles (Time series)
- Show three lines together: P50, P90, and P99
- Query A:
- Legend: P50
- histogram_quantile(0.5, sum by (le) (rate(myapp_http_request_duration_seconds_bucket{endpoint=~"$endpoint"}[$__rate_interval])))
- Query B:
- Legend: P90
- histogram_quantile(0.9, sum by (le) (rate(myapp_http_request_duration_seconds_bucket{endpoint=~"$endpoint"}[$__rate_interval])))
- Query C:
- Legend: P99
- histogram_quantile(0.99, sum by (le) (rate(myapp_http_request_duration_seconds_bucket{endpoint=~"$endpoint"}[$__rate_interval])))
- Unit: seconds. Title: Request latency quantiles
- Use Overrides to give the three lines distinct colors: P50 a green solid line, P90 a yellow solid line, P99 a red solid line
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/12/20260412225759292.png,310,150)

Alerting and Logs

- Alerting and Logs
- Grafana's alerting system
- Understanding the Grafana alerting architecture
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418104323716.png,618,330)
- Configure a contact point
- Left navigation: Alerting -> Contact points -> Create contact point
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418104545088.png,277,414)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418104623733.png,400,150)
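- Name: Learning Webhook. Integration: Webhook. URL: http://myapp:8080/health
- This uses the Go application's health check endpoint as the webhook receiver; it does not actually process alerts, but it verifies that notifications are sent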
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418104854183.png,400,250)
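- Grafana ships with a default grafana-default-email contact point; production setups would configure Email, Slack, DingTalk, PagerDuty, and so on
- Creating the first alert rule (monitoring the Go application's error rate)
- Create the alert rule
- Alerting -> Alert rules -> New alert rule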
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418105203977.png,400,350)
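- Name the rule
- Rule name: Application error rate too high
- Define the query and condition
- The Define query and alert condition section contains a query editor
- Query A: select the Prometheus data source, switch to Code mode, and enter
- sum(rate(myapp_http_requests_total{status=~"4..|5.."}[5m])) / sum(rate(myapp_http_requests_total[5m])) * 100
- This computes the error rate as a percentage over the last 5 minutes
- Below, under Set alert condition, set the condition to IS ABOVE 5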
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418105855803.png,400,125)
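- Preview alert rule condition shows the alert state; if the condition is already met it shows Firing
- Configure evaluation behavior
- Under Add folder and labels, select or create a folder (for example, "Learning alerts")
- Under Evaluation, create a new evaluation group named "Default evaluation group" with an interval of 1m, so the rule is evaluated once a minute
- Set Pending period to 2m. This parameter matters: the condition must hold continuously for 2 minutes before the alert actually fires
- This avoids false alarms from brief spikes: the alert first enters the Pending state and only becomes Firing after 2 minutes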
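The Normal -> Pending -> Firing progression can be modelled as a small state machine over successive evaluations. This sketch assumes a 1m evaluation interval and expresses the 2m pending period as 2 evaluations; `alert_states` is an illustrative helper, not part of Grafana.

```python
def alert_states(breaches, pending_evals):
    """Return the alert state after each evaluation.

    breaches: one bool per evaluation, True when the condition is met.
    pending_evals: how many evaluation intervals the condition must hold
    before the alert fires (pending period / evaluation interval).
    """
    states, streak = [], 0
    for breached in breaches:
        streak = streak + 1 if breached else 0
        if streak == 0:
            states.append("Normal")
        elif streak > pending_evals:
            states.append("Firing")
        else:
            states.append("Pending")
    return states
```

A single bad evaluation followed by a good one goes Pending and then straight back to Normal, so no notification is sent for a short spike.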
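- Add labels and annotations
- In the Add folder and labels section
- Add a label with key severity and value warning. Labels are used for alert routing
- The notification policy uses them to decide who receives the alert
- In Configure notification message, set Summary to "Application error rate above 5%"
- Set Description to "Current error rate is {{ $values.B.Value }}, above the 5% alert threshold"
- {{ $values.B.Value }} is a template expression that is replaced with the actual error rate when the alert fires; B is the RefID of the expression whose value you want
- Understanding alert states and viewing alerts
- Alerting -> Alert rules shows the state of every rule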
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418174616492.png,300,100)
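- Adding an Alert list panel to a dashboard shows all currently active alerts
- Loki and log monitoring
- Loki's design philosophy
- Traditional log systems (such as Elasticsearch) build a full-text index over log content; every word is indexed, so searches are fast but storage and compute costs are high
- Loki indexes only the metadata labels of logs, not the log content itself; a query first uses labels to locate the relevant log streams and then does text search within those streams
- Deploying Loki and Promtail
- Loki is the log storage and query engine; Promtail is the log collection agent. Just as Node Exporter collects metrics, Promtail collects logs
- Update docker-compose.yml
- Add loki_data: to the volumes section at the bottom
- Create the Promtail configuration file
- Place it in the grafana-learning/promtail directory
- Understanding the Promtail configuration
- clients specifies where logs are pushed: Loki's API endpoint
- Unlike Prometheus's pull model, Loki uses a push model: Promtail pushes the logs it collects to Loki
- scrape_configs defines the log sources
- This setup uses docker_sd_configs, which discovers all running containers through the Docker socket and collects their standard output logs
- relabel_configs transforms labels
- __meta_docker_container_name is metadata provided by Docker service discovery; mapping it to the container label lets you filter logs by container name
- Verification
- Visit http://localhost:3100/ready; a response of ready means Loki has started
- docker-compose logs promtail --tail 20
- The output should include a Successfully connected to Loki message
- Adding the Loki data source in Grafana
- In Grafana, click Connections -> Data sources -> Add data source and search for Loki
- Set the Connection URL to http://loki:3100
- Viewing logs in Explore
- Click Explore in the left navigation and switch the data source dropdown at the top to Loki
- First query
- Loki's query language is LogQL, whose structure closely resembles PromQL
- The most basic query uses curly braces to filter by label: {container="myapp"}
- Enter it and click Run query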
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418210037672.png,430,360)
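- Filter by container
- {container="prometheus"}
- View logs from all containers
- {container=~".+"}
- =~".+" is a regex match that selects every stream with a non-empty container label
- LogQL pipeline operations
- On top of the label selector, you can chain multiple processing stages with the | symbol
- Text filtering
- Show only logs that contain error
- {container="myapp"} |= "error"
- |= means "contains", != means "does not contain"
- {container="myapp"} != "health"
- Regex filtering
- {container="myapp"} |~ "status.*500"
- |~ is a regex match, !~ is a negated regex filter
- Chaining the pipeline
- Multiple filter conditions can be chained to form a pipeline
- {container="myapp"} |= "error" != "health"
- Each stage filters the output of the previous stage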
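The way chained line filters narrow the result step by step can be simulated on a plain list of strings. This is a toy model, not Loki: `line_filter` and `run_pipeline` are hypothetical helpers and the sample log lines are invented. Unlike label matchers, regex line filters match anywhere in the line, so the sketch uses `re.search`.

```python
import re


def line_filter(op, pattern):
    """Build a predicate for one LogQL-style line filter stage."""
    if op == "|=":
        return lambda line: pattern in line
    if op == "!=":
        return lambda line: pattern not in line
    if op == "|~":
        return lambda line: re.search(pattern, line) is not None
    if op == "!~":
        return lambda line: re.search(pattern, line) is None
    raise ValueError(f"unknown filter operator: {op}")


def run_pipeline(lines, *stages):
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        lines = [line for line in lines if stage(line)]
    return lines
```

Putting the most selective filter first keeps later stages cheap, since each stage only sees what the previous one let through.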
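- JSON parsing
- If the Go application emits JSON-formatted logs, | json parses them automatically
- {container="myapp"} | json
- After parsing, every field in the JSON becomes a filterable label
- For example, given a log line {"level":"error"}
- {container="myapp"} | json | level="error"
- Line formatting
- | line_format changes how each log line is displayed
- {container="myapp"} | line_format "{{.container}}: {{ __line__ }}"
- Extracting metrics from logs
- Metrics from logs let you generate monitoring metrics from log lines without instrumenting the code
- Compute the log volume rate
- rate({container="myapp"}[5m])
- Group by container
- sum by (container) (rate({container=~".+"}[5m]))
- count_over_time
- Counts the total number of log lines within the time window
- count_over_time({container="myapp"} |= "error" [1h])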
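Both log metric functions reduce to counting lines in a window. The sketch below works on a list of line timestamps in seconds; the function names mirror the LogQL ones for readability but are local stand-ins, and the window is taken as (now - window, now].

```python
def count_over_time(timestamps, now, window_s):
    """Number of log lines whose timestamp falls within the window ending at `now`."""
    return sum(1 for t in timestamps if now - window_s < t <= now)


def log_rate(timestamps, now, window_s):
    """Log lines per second over the window, in the spirit of rate() on a log stream."""
    return count_over_time(timestamps, now, window_s) / window_s
```

Filtering for "error" first and then counting, as in the query above, is the same operation applied to the filtered list of lines.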
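- Integrating log panels into a dashboard
- Live log stream
- In the panel type picker on the right choose Logs, and select Loki as the data source
- Query: {container="myapp"} != "metrics", which excludes request logs for the /metrics endpoint
- In the right-hand panel settings, find the Log details area and make sure Show time and Wrap lines are both enabled
- Title: Application logs. This panel scrolls the Go application's latest logs in real time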
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/04/18/20260418220304056.png,260,150)
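- Error log rate (Time series panel)
- Create a new Time series panel and select Loki as the data source
- rate({container="myapp"} |= "error" [5m])
- Title: Error log rate. Unit: logs/sec
- Advanced LogQL and linking metrics with logs
- More log parsers
- logfmt parser
- logfmt is a log format common in the Go ecosystem, for example level=info msg="request handled" method="GET"
- If logs use this key-value format, you can write: {container="myapp"} | logfmt
- After parsing, every key-value pair becomes a filterable label
- The current Go application logs with the standard log.Println, which is not logfmt
- pattern parser
- pattern is a flexible, lightweight parser that uses a template to describe the structure of a log line and extract fields from it
- The syntax uses <field_name> as a placeholder
- {container="myapp"} | pattern `<_> "<method> <path> <_>" <status> <_>`
- This query assumes the log contains something like "GET /api/orders HTTP/1.1" 200 ...
- It extracts the HTTP method as method, the path as path, and the status code as status
- <_> is a discard placeholder: it matches but does not keep the value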
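One way to build intuition for the pattern parser is to translate a pattern into an ordinary regular expression: literal text is matched as-is, named placeholders become capture groups, and `<_>` matches without capturing. This is only an approximation of Loki's behaviour under that assumption; `pattern_to_regex` and `extract` are hypothetical helpers, and the access-log line in the usage note is invented.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern-parser expression into a compiled regex."""
    # Splitting on <name> with a capture group alternates literal text
    # (even indexes) and placeholder names (odd indexes).
    parts = re.split(r"<(\w+)>", pattern)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            out.append(re.escape(part))
        elif part == "_":
            out.append(".*?")                # match but discard
        else:
            out.append(f"(?P<{part}>.*?)")   # capture under the placeholder name
    return re.compile("".join(out))


def extract(pattern, line):
    """Return the captured fields, or an empty dict when the line does not fit."""
    m = pattern_to_regex(pattern).fullmatch(line)
    return m.groupdict() if m else {}
```

Applied to a line such as `172.18.0.1 - - [12/Apr/2026:10:00:00 +0000] "GET /api/orders HTTP/1.1" 200 512`, the pattern from the notes yields method GET, path /api/orders, and status 200.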
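- regexp parser
- {container="myapp"} | regexp `(?P<method>GET|POST|PUT|DELETE) (?P<path>/\S+)`
- (?P<name>...) is a named capture group whose captured value becomes a label of the same name; the regexp parser is the most powerful but generally the slowest
- Label extraction and label_format
- line_format: reformatting the log display
- line_format uses Go template syntax to redefine what each log line displays
- {container=~"myapp|prometheus"} | line_format "[{{.container}}] {{ __line__ }}"
- This prefixes each log line with the container name in square brackets
- label_format: transforming labels
- label_format can create new labels from existing ones or rename labels
- {container="myapp"} | label_format short_id="{{.container_id | trunc 12}}"
- This query truncates the container ID to its first 12 characters and stores it as the short_id label; trunc 12 is a string truncation function available in the templates
- drop and keep labels
- When a parser extracts so many labels that the results become cluttered, use drop to discard unwanted labels or keep to retain only the ones you need
- {container="myapp"} | json | keep container, level, msg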

Appendix

Environment configuration

docker-compose.yml

Grafana on 安橙的博客