<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Training on 安橙的博客</title><link>https://blog.ans20xx.com/tags/training/</link><description>Recent content in Training on 安橙的博客</description><generator>Hugo -- 0.163.3</generator><language>zh</language><lastBuildDate>Sat, 20 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.ans20xx.com/tags/training/index.xml" rel="self" type="application/rss+xml"/><item><title>Day 17 · Data Parallelism: DP/DDP</title><link>https://blog.ans20xx.com/posts/ai/day17/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day17/</guid><description>The first main thread of distributed training: moving from DataParallel to DistributedDataParallel and unpacking gradient synchronization timing, the Reducer, buckets, overlap, and no_sync; reading the key code paths in torch/nn/parallel/distributed.py; and running an observable DDP experiment with torchrun.</description></item><item><title>Day 18 · The ZeRO Family (DeepSpeed)</title><link>https://blog.ans20xx.com/posts/ai/day18/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day18/</guid><description>Understand how ZeRO-1, ZeRO-2, and ZeRO-3 partition optimizer state, gradients, and parameters respectively; read through the main line of the ZeRO paper; and use DeepSpeed configs to remove DDP's replicated GPU memory step by step.</description></item><item><title>Day 20 · Pipeline Parallel</title><link>https://blog.ans20xx.com/posts/ai/day20/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day20/</guid><description>Unpacking Pipeline Parallel: how a model is split into stages by layer and how micro-batches fill the pipeline; a comparison of GPipe, 1F1B, and Megatron Interleaved 1F1B; and how to compute bubble time and tune a pipeline.</description></item><item><title>Day 22 · 3D / 4D Parallelism in Practice</title><link>https://blog.ans20xx.com/posts/ai/day22/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day22/</guid><description>Combine the TP, PP, DP, and SP/CP from Day19-21 to run a small GPT with Megatron-LM on a single multi-GPU machine, and adjust tensor-model-parallel-size and pipeline-model-parallel-size to understand the trade-offs between parallelism dimensions.</description></item><item><title>Day 23 · DeepSpeed in Practice</title><link>https://blog.ans20xx.com/posts/ai/day23/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day23/</guid><description>Hands-on DeepSpeed ZeRO-3 + Offload: how parameters, gradients, and optimizer state are sharded and swapped in and out; a breakdown of the zero_optimization, offload_param, offload_optimizer, bucket, overlap, and NVMe settings in ds_config.json; and a runnable training config template.</description></item><item><title>Day 24 · Checkpointing and Fault Tolerance</title><link>https://blog.ans20xx.com/posts/ai/day24/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day24/</guid><description>Checkpointing and fault tolerance in distributed training: DCP sharded saving, asynchronous saving, recovery from interrupted training, and torchrun elastic restart, plus a checklist of the state needed for resumable training and a recovery drill procedure.</description></item><item><title>Day 25 · Data Pipeline</title><link>https://blog.ans20xx.com/posts/ai/day25/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day25/</guid><description>The training data path: the design trade-offs among WebDataset, Mosaic Streaming, and a custom IterableDataset; tuning DataLoader's num_workers, prefetch_factor, pin_memory, and persistent_workers along with shared memory; and diagnosing GPU starvation.</description></item><item><title>Day 28 · Weekly Review + Mini Project</title><link>https://blog.ans20xx.com/posts/ai/day28/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day28/</guid><description>Wrapping up Phase 2: a review of distributed training infra covering NCCL, DDP, ZeRO, TP, PP, SP/CP, DeepSpeed, checkpointing, the data pipeline, operator acceleration, and profiling; training a roughly 125M-parameter GPT on 2 GPUs or on 8 cloud GPUs, recording MFU, and writing up notes on the hardware trade-offs of ZeRO-3 vs TP+PP.</description></item></channel></rss>