Fan Ye
Head of Heterogeneous AI R&D, Tencent Cloud
Dr Fan Ye specialises in AI infrastructure and is deeply involved in heterogeneous computing. After obtaining his PhD at the French Atomic Energy Agency, he joined NVIDIA in Silicon Valley, where he led CUDA research and development and was a founding member of TensorRT. He later designed and built PAI-Blade from scratch, applying it widely across industries spanning e-commerce, CV, NLP, ASR, and other fields. Dr Ye now leads the heterogeneous computing R&D team at Tencent Cloud, building TACO, the AI acceleration engine in Tencent Smart Computing, which comprises TACO-Train, TACO-Infer, and TACO-LLM. Another of the team's flagship products, qGPU, has helped many customers inside and outside the group expand their GPU computing capacity and realise substantial gains through its industry-leading GPU virtualisation technology.
Topic
LLM Key Performance Design and Business Practice
TACO-LLM is Tencent Cloud's self-developed large language model inference engine. Refined through multiple business scenarios inside and outside the group, including WeChat, a code assistant, intelligent customer service, bullet-screen comment moderation, and document summarisation, and backed by the R&D team's highly innovative acceleration technology, TACO-LLM covers essentially the full range of LLM application scenarios through work in several directions: parallel decoding, prefill optimisation, quantisation, and long sequences. Compared with community SOTA performance, it generally delivers a 1.5x-3x speedup and is highly recognised by the business. In this talk, we will unveil the secrets behind TACO-LLM's performance, focusing on TACO's self-developed technology from the perspective of high-performance operator design, and introduce the previously undisclosed Turbo Attention as well as low-precision compute practices in quantisation scenarios.

Outline:
1. Challenges and opportunities of LLM training
2. Technical principles and performance bottlenecks of LLM inference
3. Evolution of LLM inference technology
4. TACO-LLM technology demystified
5. TACO performance journey: cases from leading industries