Inference (推理) on Wenzhuo Huang (黄文卓) | DevOps Engineer

Recent content in Inference (推理) on Wenzhuo Huang (黄文卓) | DevOps Engineer
https://socake.github.io/tags/%E6%8E%A8%E7%90%86/
Hugo -- gohugo.io, zh-CN
17691281867@163.com (Wenzhuo Huang)
© 2026 Wenzhuo Huang
Tue, 13 Jan 2026 13:36:00 +0800

LLM Production Serving: vLLM Deployment and GPU Inference Optimization in Practice
https://socake.github.io/posts/llm-production-serving-vllm/
Tue, 13 Jan 2026 13:36:00 +0800, 17691281867@163.com (Wenzhuo Huang)

After our team moved Ollama into production, requests queued for more than 30 seconds at peak times, and users reported that the AI features were unusable. This article documents our full migration to vLLM, covering the principles behind PagedAttention and Continuous Batching as well as the complete configuration for a Kubernetes GPU deployment.