AI Engineer Paris 2025 → https://www.ai.engineer/paris

Traffic is spiking to your ML application. Your autoscaler kicks in. But instead of serving more requests, your new replicas are stuck downloading massive model weights, loading them onto GPUs, and warming up inference engines like vLLM. Minutes pass and response latency spikes, making your application unusable. You haggle with DevOps to overprovision capacity so your application remains reliable. Cold starts become hot pain, hurting latency, driving up costs, and making "just scale up" a lot more complicated than it sounds.

In this talk, we'll introduce a pattern for optimizing model loading for high-performance inference. Our case study, Run:ai Model Streamer, is an open-source tool built to reduce cold start times by streaming model weights directly to GPU memory in parallel. It is natively integrated with vLLM and SGLang, supports MoE-style multi-file loading, and saturates object storage bandwidth across different cloud storage backends, all without requiring changes to your model format.

We'll walk through how Model Streamer works, which bottlenecks it solves, and what we've learned from running it in production. Expect benchmarks, practical tips, and best practices for making large-model inference on Kubernetes faster and more efficient. If you've ever waited for a model to load and thought "surely this could be faster", this talk is for you!

How the Model Streamer works animation → https://drive.google.com/file/d/1Nbme...
Run:ai Model Streamer → https://github.com/run-ai/runai-model...
GKE Inference Quickstart → https://cloud.google.com/kubernetes-e...
KAI Scheduler → https://github.com/NVIDIA/KAI-Scheduler

Speakers:
Peter Schuurman, Software Engineer, Google
Ekin Karabulut, AI/ML Developer Advocate, NVIDIA
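
To give a rough feel for the loading path the talk describes, below is a minimal Python sketch of streaming safetensors weights with the Run:ai Model Streamer package. The class and method names (SafetensorsStreamer, stream_file, get_tensors) follow the project's README at the time of writing, and the file path is purely illustrative; treat the exact API as an assumption and check the linked repository.

# Minimal sketch: stream safetensors weights toward GPU memory with the
# Run:ai Model Streamer Python package. API names are taken from the
# project's README and may change; verify against the repo linked above.
from runai_model_streamer import SafetensorsStreamer

# Hypothetical path; the streamer can also read from object storage
# backends such as S3-compatible buckets.
file_path = "/models/llama/model-00001-of-00002.safetensors"

state_dict = {}
with SafetensorsStreamer() as streamer:
    # The streamer reads the file with many concurrent requests,
    # overlapping storage I/O with tensor construction.
    streamer.stream_file(file_path)
    for name, tensor in streamer.get_tensors():
        # Tensors are yielded as they become available, so each one can be
        # copied to the GPU without waiting for the whole file to download.
        state_dict[name] = tensor.to("cuda:0")

With vLLM, the same mechanism is typically enabled without code changes by selecting the streamer as the weight loader (for example, recent releases accept a runai_streamer load format on the serve command); the exact flag and tuning options vary between versions, so consult the vLLM documentation.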