Shizhu Huang

Senior Expert and R&D Architect at Tencent Cloud Intelligence

Shizhu Huang is a Senior Expert and R&D Architect at Tencent Cloud Intelligence, with extensive hands-on experience spanning AI applications down to the underlying AI infrastructure. He has incubated and developed multiple medium- and large-scale projects at Tencent, leading the R&D of TencentOS, voice assistants, and conversational AI technologies and supporting the deployment of AI across multiple Tencent product lines. He has also optimized and hardened the architecture of Qidian Customer Service, Qidian Marketing Cloud, and the TI platform, and built the Tencent Cloud Intelligent Agent Platform from the ground up to support industry partners. In 2024, he delivered a dedicated session on AI technologies at the Shanghai Society of Mechanical Engineering Annual Conference. He has also contributed to the design and optimization of multiple Tencent Cloud large-model technical solutions, achieving industry-leading results in DeepSeek inference acceleration in 2025.

Topic

High-Performance and Cost-Effective DeepSeek Inference: Core Methods and Optimization Practices

This talk approaches DeepSeek inference from a full-stack perspective. Combining an in-depth analysis of the core architecture of the DeepSeek R1 model with the guiding principles of "seeing clearly, avoiding waste, improving utilization, and saving resources," it explains how to significantly improve inference performance under strict SLA constraints (first-token latency < 2 s, per-token latency < 50 ms) through fine-grained performance analysis, targeted optimization of the inference architecture and framework, communication optimization, operator optimization, and efficient resource-management strategies. This methodology has already been applied successfully across multiple projects, demonstrating strong reusability and generality, and can help other large models quickly achieve cost-effective, high-performance inference on different hardware platforms.

Outline

(I) See Clearly
- Model architecture and inference workflow
- Hardware resources and bottleneck analysis

(II) Avoid Waste
- CPU + GPU overlap
- Operator fusion

(III) Improve Resource Utilization
- PD (prefill-decode) separation
- PD separation load balancing
- PD separation communication optimization
- Data Parallelism (DP)
- Expert Parallelism (EP)
- MTP optimization

(IV) Save Resources
- Quantization
- Operator optimization: sparse attention
- KV cache store

(V) Looking Ahead
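As a rough illustration of what the quoted SLA constraints demand of a serving stack, the per-token budget directly implies a minimum sustained decode rate per request stream. The sketch below is back-of-envelope arithmetic only; the helper name and the example measurements are hypothetical, and only the 2 s / 50 ms targets come from the abstract.

```python
# SLA targets quoted in the talk abstract (assumed as hard budgets here).
TTFT_BUDGET_S = 2.0    # first-token latency must stay under 2 seconds
TPOT_BUDGET_S = 0.050  # each subsequent token must arrive within 50 ms

# Minimum sustained decode throughput implied by the per-token budget:
# 1 token every 50 ms = 20 tokens/s per stream.
min_decode_tokens_per_s = 1.0 / TPOT_BUDGET_S

def meets_sla(first_token_s: float, per_token_s: float) -> bool:
    """Check a measured request trace against both latency budgets."""
    return first_token_s < TTFT_BUDGET_S and per_token_s < TPOT_BUDGET_S

print(min_decode_tokens_per_s)  # 20.0 tokens/s minimum per stream
print(meets_sla(1.4, 0.035))    # True: both budgets satisfied
print(meets_sla(2.5, 0.035))    # False: prefill exceeds the 2 s budget
```

Framed this way, each optimization in the outline attacks one side of the budget: prefill-side work (PD separation, communication optimization) protects the first-token bound, while decode-side work (operator fusion, quantization, sparse attention, KV cache management) protects the 20 tokens/s floor.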

© boolan.com 博览. All rights reserved.
