Transformers compute dense attention across every token pair, which leads to quadratic compute cost. But do LLMs actually use all that attention?

In this video we review the paper “Spike, Sparse & Sink”, which analyzes attention patterns in modern language models and discovers a surprising structure hidden inside transformer attention maps. Instead of uniform interactions, attention consistently concentrates in three patterns:

• Spikes — tokens that receive strong attention
• Sparse connections — structured local interactions
• Sink tokens — attention attractors across the sequence

This discovery suggests that dense attention wastes massive compute, and that future long-context models may rely on structured sparse attention instead.

We walk through the paper step by step and explain:

• Why attention maps look the way they do
• How spike tokens emerge in transformers
• What sink tokens actually do
• Why this pattern appears across models
• How this could make long-context LLMs dramatically cheaper

If you’re interested in LLM efficiency, transformer architectures, or long-context models, this paper is worth understanding.

Comment “PAPER” and I’ll share my annotated reading notes.
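To get an intuition for the savings, here is a minimal back-of-the-envelope sketch comparing dense attention with a structured sparse pattern (a local window plus a few sink tokens). The window size and sink count are illustrative assumptions, not values from the paper:

```python
# Toy cost comparison: dense attention vs. structured sparse attention
# (local window + a few "sink" tokens). Counts score computations only;
# window=256 and sinks=4 are illustrative assumptions.

def dense_attention_pairs(n: int) -> int:
    """Dense attention scores every token pair: O(n^2)."""
    return n * n

def sparse_attention_pairs(n: int, window: int = 256, sinks: int = 4) -> int:
    """Each query attends to a local window plus sink tokens: O(n)."""
    return n * min(n, window + sinks)

n = 100_000  # a long-context sequence
dense = dense_attention_pairs(n)
sparse = sparse_attention_pairs(n)
print(f"dense:  {dense:,} score computations")
print(f"sparse: {sparse:,} score computations")
print(f"reduction: ~{dense / sparse:.0f}x")
```

Even this crude count shows why quadratic attention dominates long-context cost: the sparse pattern scales linearly in sequence length, so the gap widens as contexts grow.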