Yang Ke

Core Contributor to Mooncake, Technical Expert at Approaching AI

Yang Ke is a Technical Expert at Approaching AI and a core contributor to the open-source project Mooncake. He earned his Ph.D. from the Institute of High-Performance Computing, Department of Computer Science, Tsinghua University, and his bachelor's degree from Beijing University of Posts and Telecommunications. He was a finalist in the 2013 ACM-ICPC World Finals and has published first-author papers in top systems conferences such as SOSP and ASPLOS. His research interests include distributed systems, parallel computing, and AI infrastructure.

Topic

Mooncake: Decoupled Architecture and Memory-for-Compute Optimization for Large Model Inference

Mooncake is a distributed large-model inference architecture built around PD (Prefill–Decode) separation and centered on the KVCache. It accelerates inference along three dimensions: store more, transmit faster, and integrate more easily. In the era of long-context models, inference costs have grown dramatically. To address this, Mooncake introduces a decoupled architecture that enables efficient cross-node transfer and sharing of the KVCache through techniques such as zero-copy data transfer, multi-NIC pooling with network-path optimization, and elastic scaling with efficient memory utilization. In production deployments, Mooncake has delivered significant improvements in large-model inference performance. This talk explores why the KVCache has become the central challenge of large-model inference in the long-context era, and how Mooncake breaks through this bottleneck to enable efficient, scalable deployment.

Outline:
1. Background: challenges of large-model inference in the long-context era, the PD-separated architecture, and the KVCache
2. Deep dive into Mooncake's core technologies and system optimizations
3. Integration of Mooncake with open-source large-model inference systems
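To see why the KVCache dominates long-context inference cost, a minimal back-of-envelope sketch helps. The configuration below (80 layers, 8 KV heads, head dimension 128, FP16) is a hypothetical 70B-class model, not Mooncake's actual deployment; the point is only that KVCache size grows linearly with context length and quickly reaches tens of gigabytes per request.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Approximate KVCache size: one K and one V vector per layer, per token.

    Defaults are an assumed 70B-class configuration (GQA with 8 KV heads,
    FP16 weights), used purely for illustration.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Per-token cost: 2 * 80 * 8 * 128 * 2 = 327,680 bytes (320 KiB per token)
print(kv_cache_bytes(1))

# A single 128K-token context: exactly 40 GiB of KVCache
print(kv_cache_bytes(128 * 1024) / 2**30)
```

At this scale the cache for one long request rivals a whole accelerator's memory, which is why pooling spare DRAM/SSD across nodes ("store more") and moving cache entries quickly between prefill and decode nodes ("transmit faster") become the central levers.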

© boolan.com 博览. All rights reserved.
