Shizhu Huang

Senior Expert and R&D Architect at Tencent Cloud Intelligence

Shizhu Huang is a Senior Expert and R&D Architect at Tencent Cloud Intelligence, with extensive hands-on experience spanning AI applications down to the underlying AI infrastructure. He has incubated and developed multiple medium- and large-scale projects at Tencent, leading the R&D of TencentOS, voice assistants, and conversational AI technologies and supporting the deployment of AI across multiple Tencent product lines. He has also optimized and hardened the architecture of Qidian Customer Service, Qidian Marketing Cloud, and the TI platform, and built the Tencent Cloud Intelligent Agent Platform from the ground up to support industry partners. In 2024, he delivered a dedicated session on AI technologies at the Shanghai Society of Mechanical Engineering Annual Conference. He has also contributed to the design and optimization of multiple Tencent Cloud large-model technical solutions, achieving industry-leading results in DeepSeek inference acceleration in 2025.

Topic

High-Performance and Cost-Effective DeepSeek Inference: Core Methods and Optimization Practices

This talk approaches DeepSeek inference from a full-stack perspective. Combining an in-depth analysis of the core architecture of the DeepSeek R1 model with the guiding principles of "seeing clearly, avoiding waste, improving utilization, and saving resources," it explains how to significantly improve inference performance under strict SLA constraints (first-token latency < 2 s, per-token latency < 50 ms) through fine-grained performance analysis, targeted optimization of the inference architecture and framework, communication optimization, operator optimization, and efficient resource-management strategies. This methodology has already been applied successfully across multiple projects, demonstrating strong reusability and generality, and can help other large models quickly achieve cost-effective, high-performance inference on different hardware platforms.

Outline

(I) See Clearly
- Model architecture and inference workflow
- Hardware resources and bottleneck analysis

(II) Avoid Waste
- CPU + GPU overlap
- Operator fusion

(III) Improve Resource Utilization
- PD (prefill-decode) separation
- PD separation load balancing
- PD separation communication optimization
- Data Parallelism (DP)
- Expert Parallelism (EP)
- MTP optimization

(IV) Save Resources
- Quantization
- Operator optimization: sparse attention
- KV cache store

(V) Looking Ahead
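As a rough illustration of what the quoted SLA constraints demand of a serving stack, the per-token budget directly implies a minimum sustained decode rate per request stream. The sketch below is back-of-envelope arithmetic only; the helper name and the example measurements are hypothetical, and only the 2 s / 50 ms targets come from the abstract.

```python
# SLA targets quoted in the talk abstract (assumed as hard budgets here).
TTFT_BUDGET_S = 2.0    # first-token latency must stay under 2 seconds
TPOT_BUDGET_S = 0.050  # each subsequent token must arrive within 50 ms

# Minimum sustained decode throughput implied by the per-token budget:
# 1 token every 50 ms = 20 tokens/s per stream.
min_decode_tokens_per_s = 1.0 / TPOT_BUDGET_S

def meets_sla(first_token_s: float, per_token_s: float) -> bool:
    """Check a measured request trace against both latency budgets."""
    return first_token_s < TTFT_BUDGET_S and per_token_s < TPOT_BUDGET_S

print(min_decode_tokens_per_s)  # 20.0 tokens/s minimum per stream
print(meets_sla(1.4, 0.035))    # True: both budgets satisfied
print(meets_sla(2.5, 0.035))    # False: prefill exceeds the 2 s budget
```

Framed this way, each optimization in the outline attacks one side of the budget: prefill-side work (PD separation, communication optimization) protects the first-token bound, while decode-side work (operator fusion, quantization, sparse attention, KV cache management) protects the 20 tokens/s floor.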

© boolan.com 博览. All rights reserved.
