Keynote: LLM-Aware Load Balancing in Kubernetes: A New Era of Efficiency - C. Coleman & J. Shan (ISL)

Don't miss out! Join us at our next Flagship Conference: KubeCon + CloudNativeCon events in Hong Kong, China (June 10-11); Tokyo, Japan (June 16-17); Hyderabad, India (August 6-7); Atlanta, US (November 10-13). Connect with our current graduated, incubating, and sandbox projects as the community gathers to further the education and advancement of cloud native computing. Learn more at https://kubecon.io

Keynote: LLM-Aware Load Balancing in Kubernetes: A New Era of Efficiency - Clayton Coleman, Distinguished Engineer, Google & Jiaxin Shan, Software Engineer, Bytedance

Traditional load balancing approaches, such as round robin or policies driven by metrics like QPS, are often ineffective when applied to LLM serving. The computational cost of an LLM request varies significantly with prompt length, the model being served, and the autoregressive nature of generation, so request running times are hard to predict. Moreover, the emergence of model multiplexing techniques (e.g., LoRA) introduces new complexities that call for LLM-aware load balancing strategies.

In this talk, we introduce a new set of Kubernetes APIs for routing to LLM workloads that allow serving objectives and priorities to be configured for each use case. These APIs integrate with Gateway API, and an included extension lets many Gateway API implementations plug in support for them, enabling turnkey LLM routing. The talk shows the project in action and demonstrates the improvements it can enable across a variety of real-world examples.
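To make the abstract's first point concrete, here is a minimal Python sketch of why counting requests (as round robin does) can misbalance LLM traffic, compared with a simple cost-aware policy. It is an illustration only and not the algorithm or API presented in the talk: the names `Replica`, `Request`, `route_round_robin`, `route_llm_aware`, and the constant `ADAPTER_LOAD_PENALTY` are hypothetical, and the cost model (prompt tokens plus the requested output budget) is an assumed stand-in, since a real server does not know the output length in advance.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import List, Optional, Set

# Hypothetical cost, in token-equivalents, of loading a LoRA adapter
# onto a replica that does not have it resident yet.
ADAPTER_LOAD_PENALTY = 500


@dataclass
class Replica:
    name: str
    pending_tokens: int = 0  # estimated work already queued on this replica
    loaded_adapters: Set[str] = field(default_factory=set)


@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int
    adapter: Optional[str] = None  # LoRA adapter name, if any

    @property
    def cost(self) -> int:
        # Assumed cost model: prompt size plus the output token budget.
        return self.prompt_tokens + self.max_new_tokens


def route_round_robin(replicas: List[Replica], requests: List[Request]) -> None:
    """Assign requests in strict rotation, ignoring their size."""
    rotation = cycle(replicas)
    for req in requests:
        next(rotation).pending_tokens += req.cost


def score(replica: Replica, req: Request) -> int:
    """Lower is better: queued work plus a penalty if the adapter is missing."""
    needs_load = req.adapter is not None and req.adapter not in replica.loaded_adapters
    return replica.pending_tokens + (ADAPTER_LOAD_PENALTY if needs_load else 0)


def route_llm_aware(replicas: List[Replica], requests: List[Request]) -> None:
    """Send each request to the replica with the lowest score."""
    for req in requests:
        target = min(replicas, key=lambda r: score(r, req))
        target.pending_tokens += req.cost
        if req.adapter is not None:
            target.loaded_adapters.add(req.adapter)


def imbalance(replicas: List[Replica]) -> int:
    """Gap between the most and least loaded replica, in tokens."""
    loads = [r.pending_tokens for r in replicas]
    return max(loads) - min(loads)


# Alternating long and short requests: equal request counts, unequal work.
workload = [Request(4000, 500), Request(50, 50)] * 4

rr_replicas = [Replica("a"), Replica("b")]
route_round_robin(rr_replicas, workload)

aware_replicas = [Replica("a"), Replica("b")]
route_llm_aware(aware_replicas, workload)

print("round robin imbalance:", imbalance(rr_replicas))
print("cost-aware imbalance:", imbalance(aware_replicas))
```

With this workload, round robin gives each replica four requests but sends every long request to the same replica, while the cost-aware policy evens out the queued tokens. The adapter penalty sketches the LoRA point in the same spirit: a replica that already holds the requested adapter is preferred until its queue grows past the assumed loading cost.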
