Day 03 · GPU Hardware and Architecture

Fri, 15 May 2026 00:00:00 +0000

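A tour of GPU microarchitecture: from SMs and warp scheduling to Tensor Cores, the HBM-L2-SMEM memory hierarchy and arithmetic intensity, and the generational changes across A100/H100/H20.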

Day 01 · The AI Infra Landscape and Learning Environment

Thu, 14 May 2026 00:00:00 +0000

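The starting point of a 60-day AI Infra study plan: trace the full path from a user prompt to a GPU kernel, set up the development environment, and get the big picture of inference and training infrastructure.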

AIInfra Study Notes

Tue, 05 May 2026 00:00:00 +0000

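GPU architecture + CUDA basics + profiling

AI Infra landscape + what to learn

- AI Infra landscape & what to learn
- Theory
- What AI Infra is
- AI Infra is not a single thing. It is the whole technology stack that lets models train and serve users on suitable hardware, at a suitable cost, and reliably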
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510163128065.png,726,300)
- The full life cycle of a prompt
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/Clipboard_Screenshot_1778404995.png,500,400)
- Key differences between training and inference
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510180216416.png,655,286)
- Hands-on
- Environment checklist
-
```bash
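# 1. Operating system
uname -a # expect Linux x86_64; Ubuntu 22.04 recommended
cat /etc/os-release
# 2. GPU & driver
nvidia-smi # shows GPU model, driver version, and the highest CUDA version the driver supports
nvidia-smi topo -m # shows the GPU interconnect topology (NVLink/PCIe)
# 3. CUDA Toolkit
nvcc --version # install it first if missing; 12.x recommended
# 4. Python environment (uv or conda recommended)
python --version # 3.10 or 3.11 both work
pip --version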
```
- Install PyTorch + transformers
-
```bash
# Create a clean environment with conda
conda create -n aiinfra python=3.10 -y
conda activate aiinfra
# PyTorch (pick the build that matches your CUDA version; CUDA 12.1 is the example here)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
# Common inference libraries
pip install transformers accelerate sentencepiece
```
- Verify
```python
# check.py
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
# get_device_name(0) raises when no GPU is visible, so guard the call
if torch.cuda.is_available():
    print("device name:", torch.cuda.get_device_name(0))
```
- Run your first LLM inference
- Pick a small model (do not pull a 70B model on your first try), for example Qwen2.5-0.5B-Instruct or TinyLlama-1.1B:
-
```python
# first_infer.py
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # switch to a mirror if the hub is slow to reach (e.g. from mainland China)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

prompt = "Explain what a KV cache is in one sentence."
inputs = tok(prompt, return_tensors="pt").to("cuda")

# CUDA calls are asynchronous: synchronize before and after so the timer covers the GPU work
torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
t1 = time.perf_counter()

print(tok.decode(out[0], skip_special_tokens=True))
new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"\nelapsed {t1 - t0:.2f}s, generated {new_tokens} tokens")
```
- Monitoring
```bash
watch -n 0.5 nvidia-smi # watch memory usage and GPU utilization (the GPU-Util column)
```
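- As a rough check on what `nvidia-smi` should report, and on why a 70B model is a poor first choice, weight memory can be estimated as parameter count × bytes per parameter. The sketch below is a back-of-the-envelope helper rather than part of the scripts above: `weight_gib` is a hypothetical name, the parameter counts are nominal, and KV cache, activations, and the CUDA context are ignored, so real usage will be higher.

```python
# Hypothetical helper: estimate the GPU memory taken by model weights alone.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_gib(num_params: float, dtype: str = "fp16") -> float:
    """Weights only: parameter count times bytes per parameter, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # Nominal parameter counts; the real checkpoints differ slightly.
    for name, params in [("0.5B", 0.5e9), ("1.1B", 1.1e9), ("70B", 70e9)]:
        print(f"{name:>5} fp16 weights ~ {weight_gib(params):.1f} GiB")
```

- At fp16 the 0.5B model needs a little under 1 GiB for weights, while a 70B model needs about 130 GiB, which is more than a single 80 GB card can hold.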

Linux and Container Review

- Linux/container basics review
- Theory
- The three building blocks of Linux process isolation
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510201727118.png,716,170)
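- Namespaces worth remembering: pid, net, mnt, uts, ipc, user
- GPUs are not covered by namespace isolation, which is why nvidia-container-toolkit is needed
- NUMA, CPU affinity, and how they relate to GPUs
- On an 8-GPU server, the distance between the CPUs and the GPUs is not uniform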
-
```
[ NUMA Node 0 ]        [ NUMA Node 1 ]
CPU0–47                CPU48–95
├─ GPU0 (PCIe)         ├─ GPU4
├─ GPU1                ├─ GPU5
├─ GPU2                ├─ GPU6
└─ GPU3                └─ GPU7
```
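- If your data-loading process runs on NUMA node 0 but feeds GPU 4, every batch has to cross the inter-socket link (UPI on Intel platforms), and throughput drops noticeably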
-
```bash
numactl --hardware # show the NUMA topology
numactl --cpunodebind=0 --membind=0 python train.py # bind CPU and memory to NUMA node 0
taskset -c 0-15 python ... # pin to specific CPU cores
nvidia-smi topo -m # show the topology distance between GPUs and CPUs (PIX/PHB/NODE/SYS)
```
- A training job's dataloader workers should run on the same NUMA node as the GPU they feed
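- The same CPU binding can be applied from inside a Python process. The sketch below assumes Linux, where each NUMA node lists its CPUs in `/sys/devices/system/node/node<N>/cpulist` and `os.sched_setaffinity` is available. `parse_cpulist` and `pin_to_numa_node` are hypothetical helper names. It sets CPU affinity only and does not bind memory the way `numactl --membind` does.

```python
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node: int) -> set[int]:
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus

# Example (not executed here): call pin_to_numa_node(0) before starting worker
# processes; child processes inherit the parent's CPU affinity.
```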
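- Containers and GPUs: why nvidia-container-toolkit is needed
- A plain container isolates CPU, memory, and network, but the GPU devices are not visible inside it
- The driver lives on the host; the CUDA Toolkit lives in the container
- The host driver version must be ≥ the minimum driver version required by the CUDA version used in the container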
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510203843759.png,585,612)
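- The driver-version rule can be checked mechanically by comparing version tuples. The sketch below is illustrative only: `MIN_DRIVER` and `driver_supports` are hypothetical names, and the minimum Linux driver versions in the table are assumptions that should be confirmed against NVIDIA's CUDA compatibility documentation before use.

```python
# Assumed minimum Linux driver versions per CUDA major version.
# Treat these values as placeholders and verify them in NVIDIA's documentation.
MIN_DRIVER = {11: (450, 80, 2), 12: (525, 60, 13)}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '535.104.05' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(driver_version: str, cuda_major: int) -> bool:
    """True if the host driver meets the assumed minimum for this CUDA major version."""
    return parse_version(driver_version) >= MIN_DRIVER[cuda_major]


if __name__ == "__main__":
    # The driver version string is the one printed in the nvidia-smi header.
    for drv in ("470.82.01", "535.104.05"):
        print(drv, "-> CUDA 12 image OK:", driver_supports(drv, 12))
```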
- Hands-on
- Working with namespaces and cgroups
- Get a feel for the idea that "a container is just a group of processes wrapped in namespaces + cgroups"
-
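```bash
# 1. Start a shell in its own PID namespace with unshare
sudo unshare --pid --fork --mount-proc bash
ps aux # only a handful of processes are visible: that is the PID namespace
exit
# 2. List the namespaces of the current process
ls -l /proc/self/ns/
# 3. cgroup v2: inspect the current shell's resource limits
cat /proc/self/cgroup
# memory.max lives in the shell's own cgroup directory; the root cgroup has no such file
cat "/sys/fs/cgroup$(sed -n 's/^0:://p' /proc/self/cgroup)/memory.max" 2>/dev/null || echo "memory.max not found"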
```
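- The symlinks under `/proc/<pid>/ns/` point at targets of the form `pid:[4026531836]`, and two processes share a namespace exactly when the numbers match. The sketch below reads those links from Python. `parse_ns_link` and `namespaces` are hypothetical helper names, and reading another process's entries may require root.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace symlink target like 'pid:[4026531836]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: parse_ns_link(os.readlink(os.path.join(ns_dir, name)))[1]
        for name in sorted(os.listdir(ns_dir))
    }

# Example (not executed here): compare namespaces()["pid"] inside and outside
# the unshare shell; the two inode numbers differ.
```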
- Install nvidia-container-toolkit
-
```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
- Verify
```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```
- Build your own AI Infra development image
-
```dockerfile
# Base image: CUDA 12.4 + cuDNN + Ubuntu 22.04
FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive \
    LANG=C.UTF-8 \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1
# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.10 python3-pip python3.10-venv \
        git curl wget vim numactl htop \
        build-essential ninja-build \
    && rm -rf /var/lib/apt/lists/*
RUN ln -sf /usr/bin/python3.10 /usr/bin/python && \
    ln -sf /usr/bin/pip3 /usr/bin/pip
# Python dependencies
RUN pip install --no-cache-dir \
        torch==2.4.0 torchvision \
        --index-url https://download.pytorch.org/whl/cu124
RUN pip install --no-cache-dir \
        transformers accelerate sentencepiece \
        jupyterlab ipython \
        nvitop py-spy
WORKDIR /workspace
CMD ["bash"]
```
- Build + run
```bash
docker build -t aiinfra-dev:0.1 .
# Run it: mount the working directory and publish the Jupyter port
docker run --rm -it \
  --gpus all \
  --shm-size=8g \
  -v "$PWD":/workspace \
  -p 8888:8888 \
  aiinfra-dev:0.1
```
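- NUMA binding
- On a multi-GPU machine
```bash
# Check the topology first
nvidia-smi topo -m
numactl --hardware
# Bind the inference script to the CPUs on the same NUMA node as GPU0
# (node 0 is an example; use the node that the topology output reports for GPU0)
CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 \
  python first_infer.py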
```

GPU Hardware and Architecture

- GPU hardware and architecture
- Theory
- The GPU in one picture
- Using the NVIDIA H100 (Hopper) as the example
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511002451179.png,756,565)
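- The H100 (SXM) has 132 SMs (the consumer RTX 4090 has 128, the A100 has 108)
- The SM (Streaming Multiprocessor) is the GPU's "core": every CUDA kernel executes on SMs
- Compute: FP16/BF16 Tensor Core ≈ 989 TFLOPS, FP8 ≈ 1979 TFLOPS (dense)
- Memory bandwidth: HBM3 ≈ 3 TB/s (3.35 TB/s on the SXM part)
- L2 cache: 50 MB (40 MB on the A100)
- NVLink: 900 GB/s per GPU (600 GB/s on the A100; the 4090 has no NVLink)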
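- Peak compute and memory bandwidth together define the arithmetic intensity a kernel needs before it stops being memory-bound. The sketch below applies the roofline model to the H100 figures; the function names are hypothetical, and 3.35 TB/s is the SXM bandwidth figure, so the result is an order-of-magnitude guide rather than a measured number.

```python
def ridge_point(peak_flops: float, mem_bw_bytes: float) -> float:
    """Arithmetic intensity (FLOP/byte) at which compute time equals memory time."""
    return peak_flops / mem_bw_bytes


def attainable_flops(intensity: float, peak_flops: float, mem_bw_bytes: float) -> float:
    """Roofline model: performance is capped by min(peak, bandwidth * intensity)."""
    return min(peak_flops, mem_bw_bytes * intensity)


H100_PEAK = 989e12  # FP16/BF16 Tensor Core FLOP/s, dense
H100_BW = 3.35e12   # HBM3 bytes/s, SXM part

if __name__ == "__main__":
    print(f"ridge point ~ {ridge_point(H100_PEAK, H100_BW):.0f} FLOP/byte")
    # An elementwise FP16 add reads two values and writes one: 6 bytes per FLOP.
    add_intensity = 1 / 6
    tflops = attainable_flops(add_intensity, H100_PEAK, H100_BW) / 1e12
    print(f"elementwise add is capped at ~ {tflops:.2f} TFLOP/s")
```

- A kernel needs roughly 295 FLOPs per byte moved from HBM to reach peak on these numbers, which is why elementwise operators are memory-bound and large matrix multiplications are not.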
- Inside an SM
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511003227058.png,561,620)
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511003301355.png,686,280)
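- Key insights
- Warp divergence: when the 32 threads of a warp take different branches (if/else), the GPU executes the two paths one after the other with the non-participating threads masked off, so throughput in that region can drop by half or more. This is why GPUs handle complex control flow poorly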
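- The cost of divergence can be illustrated with a toy SIMT model. The sketch below is a simplification under stated assumptions: `warp_cost` is a hypothetical function, each branch body is given a fixed cycle cost, and real hardware details such as reconvergence and independent thread scheduling are ignored.

```python
WARP_SIZE = 32


def warp_cost(takes_if: list[bool], if_cost: int, else_cost: int) -> int:
    """Issue cycles for one warp under a simple SIMT model.

    A branch path is executed if at least one thread takes it. Threads on the
    other path are masked off, but the warp still spends the cycles.
    """
    assert len(takes_if) == WARP_SIZE
    cost = 0
    if any(takes_if):
        cost += if_cost
    if not all(takes_if):
        cost += else_cost
    return cost


if __name__ == "__main__":
    uniform = [True] * WARP_SIZE
    divergent = [i % 2 == 0 for i in range(WARP_SIZE)]
    print("uniform warp:", warp_cost(uniform, 10, 10), "cycles")
    print("divergent warp:", warp_cost(divergent, 10, 10), "cycles")
```

- In this model a single thread taking the other branch is enough to make the whole warp pay for both paths.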
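- Tensor Cores are the lifeline of deep learning
- A CUDA Core computes 1 FMA (fused multiply-add) per instruction
- A Tensor Core computes hundreds of FMAs per instruction
- So nearly all of the H100's rated 989 TFLOPS comes from Tensor Cores; the CUDA Cores deliver only ~67 TFLOPS (FP32)
- If your kernel does not use Tensor Cores, it can reach only about 7% of the GPU's peak compute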
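- The gap can be made concrete by counting FMAs per warp-level instruction. The sketch below assumes the m16n8k16 matrix shape, one of the shapes exposed by the PTX `mma` instruction for FP16/BF16; other shapes give different counts, and the exact figure depends on whether you count per warp or per thread.

```python
WARP_SIZE = 32


def fmas_per_mma(m: int, n: int, k: int) -> int:
    """An m x n x k matrix multiply-accumulate performs m * n * k fused multiply-adds."""
    return m * n * k


# One scalar FMA instruction issued for a full warp: one FMA per thread.
cuda_core_fmas = WARP_SIZE
# One warp-level Tensor Core instruction with the assumed m16n8k16 shape.
tensor_core_fmas = fmas_per_mma(16, 8, 16)

if __name__ == "__main__":
    print("FMAs per warp instruction, CUDA Core:", cuda_core_fmas)
    print("FMAs per warp instruction, Tensor Core:", tensor_core_fmas)
    print(f"CUDA Core share of rated peak: {67 / 989:.1%}")
```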
- Memory hierarchy: from registers to HBM
- On a GPU, bandwidth is king
- ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511003449699.png,713,234)
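- Key takeaways
- Accessing HBM is roughly 500 times slower than accessing a register
- So the core pattern of a high-performance kernel is: move data from HBM into shared memory, then reuse it as many times as possible
- This is exactly the core idea behind FlashAttention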
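- The payoff of staging data in shared memory can be estimated by counting HBM element loads for a square matrix multiply C = A @ B. The sketch below is a counting model only: `naive_loads` and `tiled_loads` are hypothetical names, hardware caches and the writes to C are ignored, and the tile size stands in for what fits in shared memory.

```python
def naive_loads(n: int) -> int:
    """Each of the n*n outputs reads a full row of A and a full column of B from HBM."""
    return n * n * 2 * n


def tiled_loads(n: int, tile: int) -> int:
    """HBM loads when tile x tile blocks of A and B are staged in shared memory.

    Each output block walks n/tile steps along k and loads one tile of A and one
    tile of B per step. Every loaded element is then reused `tile` times on chip.
    """
    assert n % tile == 0
    output_blocks = (n // tile) ** 2
    loads_per_block = (n // tile) * 2 * tile * tile
    return output_blocks * loads_per_block


if __name__ == "__main__":
    n = 1024
    for tile in (1, 16, 32):
        ratio = naive_loads(n) / tiled_loads(n, tile)
        print(f"tile={tile:>2}: HBM traffic reduced {ratio:.0f}x")
```

- In this model HBM traffic falls in proportion to the tile size, which is the reuse effect that tiling-based kernels rely on.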
-