Fan Ye
Head of Heterogeneous AI R&D, Tencent Cloud
Dr Fan Ye specialises in AI infrastructure and is deeply involved in heterogeneous computing. After obtaining his PhD at the French Atomic Energy Agency, he joined NVIDIA in Silicon Valley, where he led CUDA research and development and was a founding member of TensorRT. He later designed and built PAI-Blade from scratch, applying it widely across industries spanning e-commerce, CV, NLP, ASR, and other fields. Dr Ye now leads the heterogeneous computing R&D team at Tencent Cloud, building TACO, the AI acceleration engine in Tencent Smart Computing, which comprises TACO-Train, TACO-Infer, and TACO-LLM. Another of the team's flagship products, qGPU, has helped many customers inside and outside the group expand their GPU computing capacity and realise substantial gains through its industry-leading GPU virtualisation technology.
Topic
LLM Key Performance Design and Business Practice
TACO-LLM is Tencent Cloud's self-developed large language model inference engine. Refined through multiple business scenarios inside and outside the group, including WeChat, a code assistant, intelligent customer service, bullet-screen comment moderation, and document summarisation, and backed by the R&D team's highly innovative acceleration technology, TACO-LLM covers essentially the full range of LLM application scenarios through work in several directions: parallel decoding, prefill optimisation, quantisation, and long sequences. Compared with community SOTA performance, it generally delivers a 1.5x-3x speedup and is highly recognised by the business. In this talk, we will unveil the secrets behind TACO-LLM's performance, focusing on TACO's self-developed technology from the perspective of high-performance operator design, and introduce the previously undisclosed Turbo Attention as well as low-precision compute practices in quantisation scenarios.

Outline:
1. Challenges and opportunities of LLM training
2. Technical principles and performance bottlenecks of LLM inference
3. Evolution of LLM inference technology
4. TACO-LLM technology demystified
5. TACO performance journey: cases from leading industries