Large Scale Distributed LLM Inference with LLM D and Kubernetes by Abdel Sghiouar
Running Large Language Models (LLMs) locally for experimentation is easy, but running them in large-scale architectures is not. Businesses looking to integrate LLMs into their critical paths must contend with the high cost and scarcity of GPU/TPU accelerators, which present a significant challenge. Striking a balance between performance, availability, scalability, and cost-efficiency is a must. While Kubernetes is a ubiquitous runtime for modern workloads, deploying LLM inference effectively demands a specialized approach.

Enter LLM-D, a cloud-native, Kubernetes-based, high-performance distributed LLM inference framework. Its architecture centers around a well-lit path for anyone looking to serve at scale, with the fastest time-to-value and competitive performance per dollar for most models across a diverse and comprehensive set of hardware accelerators.

In this deep dive we will start with a gentle introduction to the topic of inference on Kubernetes and slowly work our way to why LLM-D exists and what kind of challenges it solves. LLM-D is a set of components and an opinionated architecture, building on top of existing projects like vLLM, Prometheus, and the Kubernetes Gateway API. Its optimized KV-cache-aware routing and disaggregated serving are designed to operationalize GenAI deployments. The project was designed by the creators of vLLM (Red Hat, Google, ByteDance) and is licensed under the Apache 2 license.
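To make the starting point concrete, here is a minimal sketch of the "plain Kubernetes" baseline that LLM-D builds on: a single-replica vLLM Deployment serving an OpenAI-compatible endpoint, created with the official Kubernetes Python client. This is not LLM-D itself; the image tag, model name, namespace, and GPU count are illustrative assumptions rather than values taken from the talk.

```python
# Minimal sketch: one vLLM OpenAI-compatible server as a Kubernetes Deployment.
# LLM-D layers KV-cache-aware routing (via the Gateway API) and disaggregated
# prefill/decode serving on top of pods like this one.
# Model name, image tag, namespace, and resource sizes are assumptions.
from kubernetes import client, config


def build_vllm_deployment() -> client.V1Deployment:
    container = client.V1Container(
        name="vllm",
        image="vllm/vllm-openai:latest",  # vLLM serving image (tag assumed)
        args=[
            "--model", "meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
            "--port", "8000",
        ],
        ports=[client.V1ContainerPort(container_port=8000)],
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},  # one accelerator per replica (assumption)
        ),
    )
    pod_template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "vllm"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "vllm"}),
        template=pod_template,
    )
    return client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="vllm-baseline"),
        spec=spec,
    )


if __name__ == "__main__":
    config.load_kube_config()  # uses your local kubeconfig
    apps = client.AppsV1Api()
    apps.create_namespaced_deployment(namespace="default", body=build_vllm_deployment())
    print("Created Deployment vllm-baseline in namespace default")
```

Scaling beyond this single-pod baseline, where replicas hold different KV caches and prefill and decode compete for the same accelerator, is exactly the gap the talk describes LLM-D filling with its Gateway API-based routing and disaggregated serving.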