Ruidong Zhang
Research and Development Engineer at Microsoft Research Asia
He is a research and development engineer in the Systems Group at Microsoft Research Asia (Shanghai) and holds a master's degree from New York University. He focuses on artificial intelligence systems; his current research directions include sparse computation and long-context inference for large language models. His work aims to optimize the training, prefilling, and decoding stages of these models through the joint design of systems and algorithms. He contributed to the development of Microsoft's Phi-3 series of models, and his recent projects include MInference, an accelerator for long-context inference; LongRoPE, an algorithm for extending context windows; ParrotServe, a distributed LLM serving system; and PIT, a dynamic sparse operator compiler.
Topic
How to Leverage the Sparsity of Large Models for Optimization on Highly Parallelized Devices
The lengthy training and inference processes of large language models (LLMs), and the high computational costs they incur, are among the primary obstacles to their widespread adoption. Extensive research shows that the computational load of LLMs is highly sparse, which offers an opportunity to reduce computational costs. However, parallel computing hardware, typified by GPUs, suffers notable efficiency losses when executing sparse computations. Achieving efficient sparse computation on parallel devices is therefore a common challenge, and the demand for sparse computation also poses new challenges to traditional operator programming paradigms and model compilation systems. This talk explores theoretical models for efficient sparse computation on highly parallelized devices and presents successful cases of exploiting the sparsity of LLMs through the joint design of algorithms and systems.
Outline:
a) From Sparse Computational Loads in Artificial Neural Networks to Sparse Computation Needs in the Era of Large Language Models
b) Adding Sparse Attributes to Tensor Units to Build an End-to-End Static Sparse Computation Framework: SparTA
c) Leveraging Equivalent Reordering to Implement Efficient Dynamic Sparse Computation on Dense Computing Units: PIT
d) Efficient Sparse Computation Requires Joint Design of Algorithms and Systems: A Case Study of the Prefilling Stage in Long-Context Inference with MInference
e) Outlook on Sparse Computation for Large Language Models: Training, Decoding, and Distributed Processing
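To make the opportunity concrete, a back-of-the-envelope sketch (not from the talk itself) of the FLOP savings that block-sparse attention offers over dense attention in long-context prefilling. The sequence length, block size, and 5% keep ratio below are illustrative assumptions, not figures from MInference:

```python
# Illustrative sketch: FLOP count of dense vs. block-sparse attention.
# All sizes and the sparsity ratio are hypothetical assumptions.

def dense_attention_flops(seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs for one dense attention head:
    QK^T (n*n*d multiply-adds) plus the attention-weighted V product
    (another n*n*d multiply-adds), with 2 FLOPs per multiply-add."""
    return 2 * 2 * seq_len * seq_len * head_dim

def block_sparse_attention_flops(head_dim: int, block: int,
                                 kept_blocks: int) -> int:
    """FLOPs when only `kept_blocks` of the (n/block)^2 score blocks
    are actually computed; each block costs 2*2*block*block*head_dim."""
    return 2 * 2 * block * block * head_dim * kept_blocks

n, d, b = 128_000, 128, 64            # long-context sizes (assumed)
total_blocks = (n // b) ** 2          # number of score blocks
kept = total_blocks // 20             # keep 5% of blocks (assumed sparsity)

dense = dense_attention_flops(n, d)
sparse = block_sparse_attention_flops(d, b, kept)
print(f"dense: {dense:.2e} FLOPs, block-sparse: {sparse:.2e} FLOPs, "
      f"~{dense // sparse}x fewer")
```

Because the cost of attention is quadratic in sequence length, keeping only 5% of the score blocks cuts the arithmetic by 20x; the harder problem, which the talk addresses, is mapping that irregular 5% onto dense GPU compute units without losing the savings to scheduling and memory-access overhead.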