Wenan Mao
Senior Technical Expert of Ali Cloud OS
Wenan Mao has many years of experience in Linux OS development and performance tuning. He was a net subsystem committer in the Huawei kernel lab, where he was responsible for kernel version maintenance and network feature development, and he has contributed 50+ patches to the Linux kernel community. He is now a senior technical expert on the Alibaba OS team, mainly responsible for applying new network features and exploring eBPF tracing technology: mining full-path packet delay information from the container application through the kernel and driver to help solve difficult network jitter problems. He has led the development of pingtrace, a packet loss and delay visualization tool; netinfo, a monitoring system for jitter problems; and LCC, a compilation platform for BPF development.
Topic
Coolbpf-based AI infrastructure observation
Continuous profiling, as one of the four pillars of observability, is very helpful for analysing CPU performance bottlenecks and similar issues. For AI infrastructure observation, if CPU and GPU hotspots can be merged and analysed together, latency jitter during model training and inference can be isolated and located on a flame graph. The Coolbpf project combines open-source multi-language profiling technology and explores delivering open-source observability capabilities as libraries, meeting the demand for profiling in different scenarios and facilitating integration and secondary development.
Outline:
1. Background analysis of CPU and GPU problems
2. Introduction to Coolbpf's multi-language profiling capabilities
3. How to make AI infrastructure observable