CUDA

GPU Architecture + CUDA Basics + Profiling

AIInfra Overview + Learning Content

- Theory
  - What AIInfra is
    - AI Infra is not a single thing. It is the whole technology stack that lets a model train and serve users reliably, on suitable hardware, at a suitable cost.
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510163128065.png,726,300)
  - The full life cycle of a prompt
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/Clipboard_Screenshot_1778404995.png,500,400)
  - Key differences between training and inference
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510180216416.png,655,286)
- Hands-on
  - Environment checklist
    ```bash
    # 1. Operating system
    uname -a                 # expect Linux x86_64; Ubuntu 22.04 recommended
    cat /etc/os-release

    # 2. GPU and driver
    nvidia-smi               # shows GPU model, driver version, and the highest CUDA version the driver supports
    nvidia-smi topo -m       # shows the GPU interconnect topology (NVLink/PCIe)

    # 3. CUDA Toolkit
    nvcc --version           # install it first if missing; 12.x recommended

    # 4. Python environment (uv or conda recommended)
    python --version         # 3.10 or 3.11 both work
    pip --version
    ```
  - Install PyTorch + transformers
    ```bash
    # Create a clean environment with conda
    conda create -n aiinfra python=3.10 -y
    conda activate aiinfra

    # PyTorch (pick the build that matches your CUDA version; CUDA 12.1 shown here)
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

    # Libraries commonly used for inference
    pip install transformers accelerate sentencepiece
    ```
    - Verify
      ```python
      # check.py
      import torch

      print("torch:", torch.__version__)
      print("cuda available:", torch.cuda.is_available())
      print("device count:", torch.cuda.device_count())
      # get_device_name(0) raises if no CUDA device is visible, so guard it
      if torch.cuda.is_available():
          print("device name:", torch.cuda.get_device_name(0))
      ```
  - Run your first LLM inference
    - Pick a small model (do not pull a 70B model on your first try), for example Qwen2.5-0.5B-Instruct or TinyLlama-1.1B:
      ```python
      # first_infer.py
      import time

      import torch
      from transformers import AutoTokenizer, AutoModelForCausalLM

      model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # switch to a mirror if Hugging Face is hard to reach
      tok = AutoTokenizer.from_pretrained(model_id)
      model = AutoModelForCausalLM.from_pretrained(
          model_id, torch_dtype=torch.float16, device_map="cuda"
      )

      prompt = "Explain what a KV Cache is in one sentence."
      inputs = tok(prompt, return_tensors="pt").to("cuda")

      # Synchronize before and after so the timer covers all queued GPU work
      torch.cuda.synchronize()
      t0 = time.time()
      out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
      torch.cuda.synchronize()
      t1 = time.time()

      print(tok.decode(out[0], skip_special_tokens=True))
      print(f"\nElapsed {t1-t0:.2f}s, generated {out.shape[1]-inputs.input_ids.shape[1]} tokens")
      ```
    - Monitor
      ```bash
      watch -n 0.5 nvidia-smi   # watch GPU memory usage and GPU utilization
      ```

Linux and Container Fundamentals Review

- Theory
  - The three building blocks of Linux process isolation
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510201727118.png,716,170)
    - Namespaces worth remembering: pid, net, mnt, uts, ipc, user
    - GPUs are outside the scope of namespace isolation, which is why nvidia-container-toolkit is needed
  - NUMA, CPU affinity, and their relationship to the GPU
    - On an 8-GPU server, the CPUs and GPUs are not all equally close to each other
      ```
      [ NUMA Node 0 ]          [ NUMA Node 1 ]
        CPU0–47                  CPU48–95
        ├─ GPU0 (PCIe)           ├─ GPU4
        ├─ GPU1                  ├─ GPU5
        ├─ GPU2                  ├─ GPU6
        └─ GPU3                  └─ GPU7
      ```
    - If your data-loading process runs on NUMA node 0 but feeds GPU 4, every batch has to cross the inter-socket link (UPI on Intel platforms), and throughput drops noticeably
      ```bash
      numactl --hardware                                     # show the NUMA topology
      numactl --cpunodebind=0 --membind=0 python train.py   # bind to NUMA node 0
      taskset -c 0-15 python ...                             # bind to specific CPU cores
      nvidia-smi topo -m                                     # show GPU-to-CPU topology distance (PIX/PHB/NODE/SYS)
      ```
    - A training job's dataloader workers should run on the same NUMA node as the GPU they feed
  - Containers and GPUs: why nvidia-container-toolkit is needed
    - A plain container isolates CPU, memory, and network, but it cannot see GPU devices
    - The driver lives on the host; the CUDA Toolkit lives in the container
    - The host driver version must be ≥ the minimum driver version required by the CUDA version the container was built against
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/10/20260510203843759.png,585,612)
- Hands-on
  - Work with namespaces and cgroups
    - Get a direct feel for the idea that "a container is just a process inside a set of namespaces + cgroups"
      ```bash
      # 1. Start a shell in its own PID namespace with unshare
      sudo unshare --pid --fork --mount-proc bash
      ps aux          # only a handful of processes are visible: that is the PID namespace
      exit

      # 2. Inspect the namespaces of the current process
      ls -l /proc/self/ns/

      # 3. cgroup v2: inspect the resource limits of the current shell
      cat /proc/self/cgroup
      cat /sys/fs/cgroup/memory.max 2>/dev/null || echo "no limit"
      ```
  - Install nvidia-container-toolkit
    ```bash
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
      | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
      | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
      | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    ```
    - Verify
      ```bash
      docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
      ```
  - Write your own AI Infra development image
    ```dockerfile
    # Base image: CUDA 12.4 + cuDNN + Ubuntu 22.04
    FROM nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04

    ENV DEBIAN_FRONTEND=noninteractive \
        LANG=C.UTF-8 \
        PYTHONDONTWRITEBYTECODE=1 \
        PYTHONUNBUFFERED=1

    # System dependencies
    RUN apt-get update && apt-get install -y --no-install-recommends \
            python3.10 python3-pip python3.10-venv \
            git curl wget vim numactl htop \
            build-essential ninja-build \
        && rm -rf /var/lib/apt/lists/*

    RUN ln -sf /usr/bin/python3.10 /usr/bin/python && \
        ln -sf /usr/bin/pip3 /usr/bin/pip

    # Python dependencies
    RUN pip install --no-cache-dir \
            torch==2.4.0 torchvision \
            --index-url https://download.pytorch.org/whl/cu124

    RUN pip install --no-cache-dir \
            transformers accelerate sentencepiece \
            jupyterlab ipython \
            nvitop py-spy

    WORKDIR /workspace
    CMD ["bash"]
    ```
    - Build + run
      ```bash
      docker build -t aiinfra-dev:0.1 .

      # Run it: mount the working directory and expose the Jupyter port
      docker run --rm -it \
        --gpus all \
        --shm-size=8g \
        -v "$PWD":/workspace \
        -p 8888:8888 \
        aiinfra-dev:0.1
      ```
  - NUMA binding
    - On a multi-GPU machine
      ```bash
      # Check the topology first
      nvidia-smi topo -m
      numactl --hardware

      # Bind the inference script to CPUs on the same NUMA node as GPU0
      CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 \
        python first_infer.py
      ```

GPU Hardware and Architecture

- Theory
  - The GPU in one picture
    - Using the NVIDIA H100 (Hopper) as the example
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511002451179.png,756,565)
    - The H100 has 132 SMs (the consumer RTX 4090 has 128 SMs, the A100 has 108 SMs)
    - The SM is the GPU's "core": every CUDA kernel executes on SMs
    - Compute: FP16/BF16 Tensor Core ≈ 989 TFLOPS, FP8 ≈ 1979 TFLOPS
    - Memory bandwidth: HBM3 ≈ 3 TB/s
    - L2 cache: 50 MB (40 MB on the A100)
    - NVLink: 900 GB/s per GPU (600 GB/s on the A100; the 4090 has no NVLink)
  - Inside an SM
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511003227058.png,561,620)
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511003301355.png,686,280)
    - Key insights
      - Warp divergence: when the 32 threads of a warp take different branches (if/else), the GPU executes the two paths one after the other with the inactive threads masked off, which can cut throughput in half. This is why GPUs are poor at complex control flow.
      - Tensor Cores are the lifeline of deep learning
        - A CUDA Core computes 1 FMA (fused multiply-add) per instruction
        - A Tensor Core computes hundreds of FMAs per instruction
        - So almost all of the H100's rated 989 TFLOPS comes from Tensor Cores; the CUDA Cores deliver only ~67 TFLOPS
        - If your kernel does not use Tensor Cores, it can reach only about 7% of the GPU's peak compute
  - Memory hierarchy: from registers to HBM
    - The GPU is a device where bandwidth is king
    - ![](https://an-img.oss-cn-hangzhou.aliyuncs.com/2026/05/11/20260511003449699.png,713,234)
    - Key takeaways
      - Accessing HBM is roughly 500 times slower than accessing a register
      - So the core pattern of a high-performance kernel is: move data from HBM into shared memory, then reuse it as many times as possible
      - This is exactly the core idea behind FlashAttention
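The warp divergence note above can be made concrete with a toy cost model. This is only a sketch in plain Python, not how the hardware scheduler actually works: the function name `warp_cycles` and the cycle costs are hypothetical, and the model assumes that a warp pays for a branch path whenever at least one of its 32 threads takes it.

```python
WARP_SIZE = 32  # threads per warp, as stated in the notes above


def warp_cycles(takes_if, if_cost, else_cost):
    """Cycles one warp spends on an if/else under a simple serialization model.

    takes_if: list of 32 booleans, one per thread (True = takes the if-path).
    if_cost / else_cost: hypothetical cycle counts for each path.
    """
    assert len(takes_if) == WARP_SIZE
    cycles = 0
    if any(takes_if):        # at least one thread needs the if-path
        cycles += if_cost
    if not all(takes_if):    # at least one thread needs the else-path
        cycles += else_cost
    return cycles


if __name__ == "__main__":
    uniform = [True] * WARP_SIZE                               # whole warp agrees
    diverged = [lane % 2 == 0 for lane in range(WARP_SIZE)]    # even/odd split
    print("uniform :", warp_cycles(uniform, 10, 10))
    print("diverged:", warp_cycles(diverged, 10, 10))
```

With equal path costs, the diverged warp takes twice as long as the uniform one, which matches the "cut in half" rule of thumb. Note that a single stray thread is enough to trigger the extra pass, so branching on a per-warp condition is much cheaper than branching on a per-thread one.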
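The "move data from HBM into shared memory, then reuse it" pattern can be illustrated by counting slow-memory loads in a matrix multiply. The sketch below is plain Python rather than a CUDA kernel, and it makes simplifying assumptions: every operand read in the naive version counts as one HBM load, a staged tile stands in for shared memory, and the function names `matmul_naive` and `matmul_tiled` are made up for illustration.

```python
def matmul_naive(a, b):
    """C = A @ B where every operand read counts as one slow-memory load."""
    n = len(a)
    loads = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                loads += 2  # one element of A and one element of B
            c[i][j] = acc
    return c, loads


def matmul_tiled(a, b, tile):
    """Same product, but operands are staged tile by tile into a fast buffer."""
    n = len(a)
    assert n % tile == 0, "this sketch assumes tile divides n"
    loads = 0
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of A and one tile of B (the "shared memory" copy).
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                # Each staged element is now reused `tile` times without new loads.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads


if __name__ == "__main__":
    n, tile = 8, 4
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i - j for j in range(n)] for i in range(n)]
    c_naive, loads_naive = matmul_naive(a, b)
    c_tiled, loads_tiled = matmul_tiled(a, b, tile)
    print("same result:", c_naive == c_tiled)
    print("naive loads:", loads_naive, "tiled loads:", loads_tiled)
```

For an 8×8 problem with 4×4 tiles, the naive version counts 1024 loads and the tiled version 256, so slow-memory traffic shrinks by a factor equal to the tile size. Real kernels are limited by how much shared memory an SM has, which is what bounds the tile size in practice.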

Day 01 · AI Infra Overview and Learning Environment

AIInfra Study