Inference on 安橙的博客

https://blog.ans20xx.com/tags/inference/
Recent content in Inference on 安橙的博客
Generator: Hugo 0.163.3 · Source language: zh · Last updated: Sat, 20 Jun 2026 00:00:00 +0000

Day 29 · LLM Inference Basics
https://blog.ans20xx.com/posts/ai/day29/
Sat, 20 Jun 2026 00:00:00 +0000
An entry point into LLM inference infrastructure: understand how the prefill and decode phases differ, why the KV cache takes up such a large share of GPU memory, and how throughput and latency metrics break down, then write a minimal generation loop.

Day 30 · Decoding Algorithms
https://blog.ans20xx.com/posts/ai/day30/
Sat, 20 Jun 2026 00:00:00 +0000
An entry point into LLM decoding strategies: understand the sampling semantics of greedy, beam search, temperature, top-k, and top-p, along with their impact on serving; learn the draft/verify idea behind speculative decoding, and run small experiments with Transformers and vLLM parameters.

Day 31 · PagedAttention & vLLM
https://blog.ans20xx.com/posts/ai/day31/
Sat, 20 Jun 2026 00:00:00 +0000
Learn the core mechanisms of PagedAttention and vLLM: why the KV cache wastes GPU memory, how a block table manages the mapping from logical blocks to physical blocks, how copy-on-write supports parallel sampling and beam search, and how these mechanisms enable high-throughput LLM serving.

Day 33 · Continuous Batching
https://blog.ans20xx.com/posts/ai/day33/
Sat, 20 Jun 2026 00:00:00 +0000
Learn continuous batching in LLM inference serving: understand the difference between static batching and in-flight batching, how prefill and decode are interleaved, and the trade-offs that the TGI, vLLM, and SGLang schedulers make among throughput, TTFT, TPOT, and fairness.

Day 32 · vLLM in Practice
https://blog.ans20xx.com/posts/ai/day32/
Sat, 20 Jun 2026 00:00:00 +0800
Deploy a 7B model on vLLM hands-on, enable the OpenAI-compatible API, learn how to tune --max-num-seqs and --gpu-memory-utilization, and establish a load-testing and troubleshooting workflow for the inference service.

Day 34 · SGLang & RadixAttention
https://blog.ans20xx.com/posts/ai/day34/
Sat, 20 Jun 2026 00:00:00 +0800
Learn the SGLang inference framework and RadixAttention: understand prefix caching, the scheduling of requests that share a prefix, and how a radix tree reuses the KV cache, then send multiple requests with the same system prompt to observe cache hits.

Day 35 · Quantization (1): Weight Quantization
https://blog.ans20xx.com/posts/ai/day35/
Sat, 20 Jun 2026 00:00:00 +0800
Learn weight quantization for LLM inference: understand INT8 / INT4, per-channel / group-wise scales, and the core ideas of GPTQ and AWQ, then quantize and evaluate a model with AutoGPTQ or llama.cpp.