LMCache Office Hour #3, featuring Martin Hickey (IBM), presenting on Event-driven KV-Cache-Aware Routing for Distributed LLM Inference.

Chat Transcript:

00:09:45 Mo McElaney: Thanks to everyone who joined so far! Going to wait until 5 after the hour to get started.
00:20:31 Mo McElaney: "What is distributed inference?" https://www.redhat.com/en/topics/ai/w...
00:25:06 Ugur Kaynar: Is KV-cache-based routing becoming the de facto method for large-scale disaggregated inference?
00:29:59 Mo McElaney: KV Cache Events in the LMCache docs: https://docs.lmcache.ai/production/kv...
00:37:36 Himanshu Sekhar Nayak: So here "medium" means KV blocks are sitting in CPU DRAM?
00:39:04 Himanshu Sekhar Nayak: Is it there for NVMe too?
00:40:28 Himanshu Sekhar Nayak: I mean storage.
00:40:54 Himanshu Sekhar Nayak: Thanks.
00:47:04 kosseila Hd: Which event do you think will benefit latency and performance most when KV-cache-aware routing is enabled for users?
00:48:13 kosseila Hd: 👍🏻
00:48:18 Himanshu Sekhar Nayak: I've been testing LMCache across versions 0.3.10 to 0.3.13 and I can clearly see overall performance improvements. However, I noticed a behavioral difference in KV offloading: in v0.3.10, when I send a small prompt (~20 tokens), KV blocks are offloaded to NVMe; in v0.3.13, KV blocks are not offloaded for the same prompt. Offloading only seems to happen when (input_tokens + output_tokens) approaches max_model_len.
00:48:53 Himanshu Sekhar Nayak: Was there any intentional change in the offloading/store logic between 0.3.10 and 0.3.13?
00:50:31 Samuel Shen: save_unfull_chunk was turned off by default.
00:51:07 Himanshu Sekhar Nayak: Is it due to bandwidth saturation for small chunks?
00:51:25 Samuel Shen: It helps us not have to store metadata for chunks for remote backends,
00:51:28 Samuel Shen: since all chunks become uniform.
00:52:18 Ugur Kaynar: Thank you.
00:52:39 Himanshu Sekhar Nayak: Thanks for answering.
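To unpack Samuel Shen's answer about the offloading change: LMCache stores KV cache in fixed-size token chunks, and with save_unfull_chunk disabled the trailing partial chunk is simply not stored, so every chunk that reaches a storage backend has the same size and remote backends need no per-chunk size metadata. The toy sketch below is not LMCache's actual code; the 256-token chunk size is LMCache's documented default, and the rest is illustrative. It shows why a ~20-token prompt then produces nothing to offload.

```python
# Toy illustration (not LMCache's actual implementation) of chunked
# KV storage with and without saving the trailing "unfull" chunk.

CHUNK_SIZE = 256  # LMCache's default tokens-per-chunk

def chunk_tokens(tokens: list[int], save_unfull_chunk: bool) -> list[list[int]]:
    """Split a token sequence into CHUNK_SIZE-token chunks for storage."""
    chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]
    # With save_unfull_chunk off, a partial trailing chunk is dropped, so
    # every stored chunk is exactly CHUNK_SIZE tokens: uniform chunks mean
    # remote backends don't have to track per-chunk size metadata.
    if not save_unfull_chunk and chunks and len(chunks[-1]) < CHUNK_SIZE:
        chunks.pop()
    return chunks

prompt = list(range(20))  # the ~20-token prompt from the question above
print(len(chunk_tokens(prompt, save_unfull_chunk=True)))   # 1 chunk stored
print(len(chunk_tokens(prompt, save_unfull_chunk=False)))  # 0 -> nothing offloaded
```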
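If the pre-0.3.13 behavior is needed, the flag can presumably be switched back on in the LMCache configuration. The snippet below is a guess at the spelling, assuming LMCache's usual convention of exposing config keys as LMCACHE_* environment variables; check the LMCache docs linked above for the authoritative key names and defaults in your version.

```python
import os

# Assumption, not verified: LMCache generally maps config keys to
# LMCACHE_* environment variables, so save_unfull_chunk would become:
os.environ["LMCACHE_SAVE_UNFULL_CHUNK"] = "True"  # re-enable partial-chunk offload
os.environ["LMCACHE_CHUNK_SIZE"] = "256"          # default token chunk size
# Set these in the environment before starting the vLLM/LMCache server.
```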
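On the talk's main topic: in event-driven KV-cache-aware routing, the serving engines publish events as KV blocks are stored and evicted, and the router maintains a global index it consults to send each request to the instance already holding the longest cached prefix. The sketch below is a minimal illustration of that idea, not the actual LMCache/vLLM event schema or router; the block size, hashing scheme, and event handler names are all assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative; engines pick their own)

class KVAwareRouter:
    """Toy event-driven router: consumes block-stored/removed events and
    routes each request to a worker caching its longest prefix."""

    def __init__(self, workers: list[str]):
        self.workers = workers
        # prefix-block hash -> workers currently holding that block
        self.index: dict[str, set[str]] = {}

    # -- event handlers, fed by the engines' KV-cache event stream --
    def on_block_stored(self, worker: str, block_hash: str) -> None:
        self.index.setdefault(block_hash, set()).add(worker)

    def on_block_removed(self, worker: str, block_hash: str) -> None:
        self.index.get(block_hash, set()).discard(worker)

    # -- routing --
    def route(self, tokens: list[int]) -> str:
        candidates = set(self.workers)   # workers whose cached prefix is intact
        best = self.workers[0]           # fallback when nothing is cached
        running = hashlib.sha256()       # chained hash identifies each prefix block
        for i in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            running.update(repr(tokens[i:i + BLOCK_SIZE]).encode())
            holders = self.index.get(running.hexdigest(), set()) & candidates
            if not holders:
                break                    # prefix match ends at first uncached block
            candidates = holders
            best = sorted(candidates)[0]
        return best

# Usage: replaying one stored-block event steers a matching request to "b".
router = KVAwareRouter(["a", "b"])
h = hashlib.sha256(repr(list(range(BLOCK_SIZE))).encode()).hexdigest()
router.on_block_stored("b", h)
print(router.route(list(range(40))))  # -> "b" (holds the first 16-token block)
```

The event-driven part is the point: because the index is updated by push events rather than by polling each engine, the router's view of cache contents stays fresh enough to make prefix-hit routing decisions on the request path.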