<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>NCCL on 安橙的博客</title><link>https://blog.ans20xx.com/tags/nccl/</link><description>Recent content in NCCL on 安橙的博客</description><generator>Hugo -- 0.163.3</generator><language>zh</language><lastBuildDate>Sat, 20 Jun 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://blog.ans20xx.com/tags/nccl/index.xml" rel="self" type="application/rss+xml"/><item><title>Day 15 · Distributed Basics</title><link>https://blog.ans20xx.com/posts/ai/day15/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day15/</guid><description>Entering the distributed training phase of AI Infra: understand process groups, rank/world_size, and the torchrun launch model; master the four collective communication operations AllReduce, AllGather, ReduceScatter, and Broadcast; and get a DDP MNIST run working end to end.</description></item><item><title>Day 16 · NCCL Deep Dive</title><link>https://blog.ans20xx.com/posts/ai/day16/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day16/</guid><description>Entering the communication layer of distributed training: understand NCCL's ring, tree, and double binary tree AllReduce algorithms; read the initialization, topology, channel, and algorithm-selection logs produced by NCCL_DEBUG=INFO; and use a small script to run the full AllReduce evidence-gathering workflow end to end.</description></item><item><title>Day 17 · Data Parallelism: DP/DDP</title><link>https://blog.ans20xx.com/posts/ai/day17/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day17/</guid><description>Entering the first main thread of distributed training: from DataParallel to DistributedDataParallel, break down gradient synchronization timing, the Reducer, buckets, overlap, and no_sync; read the key paths in torch/nn/parallel/distributed.py; and run an observable DDP experiment with torchrun.</description></item><item><title>Day 07 · Weekly Review + Networking/Storage Basics</title><link>https://blog.ans20xx.com/posts/ai/day07/</link><pubDate>Mon, 18 May 2026 00:00:00 +0000</pubDate><guid>https://blog.ans20xx.com/posts/ai/day07/</guid><description>Week 1 wrap-up: connect the GPU programming knowledge from Day 01–06 into a map, then add the networking and storage basics of AI clusters: NVLink and PCIe, InfiniBand and RoCE, RDMA principles, the NCCL communication model, and storage tiering. This paves the way for Phase 1,
which moves into framework internals.</description></item></channel></rss>