How can modern AI models read entire codebases, books, or hour-long videos in seconds? The secret isn't just "more compute": it's a radical redesign of the Transformer's core engine, attention. In this deep-dive podcast, we journey from the foundational "Attention Is All You Need" architecture to the cutting-edge linear and sparse mechanisms that enable today's million-token context windows. We break down the complex math and GPU physics for both technical experts and curious beginners.

In this episode, we explore:

- The Foundation: Why the scaled dot-product attention mechanism (Queries, Keys, and Values) changed everything, and why it eventually became its own worst enemy.
- The Quadratic Bottleneck, O(N²): A slow-paced explanation of why doubling an AI's input length quadruples the work. We explain the "memory wall": the physical limitation where GPUs spend more time moving data from slow global memory (HBM) than performing math.
- Hardware-Efficient Fixes: A technical look at FlashAttention. Discover how tiling and online softmax let the GPU compute attention in small blocks entirely within fast on-chip shared memory (SRAM).
- Shrinking the KV Cache: How models like DeepSeek, with Multi-head Latent Attention (MLA), and Llama, with Grouped-Query Attention (GQA), compress the "suitcase" of information they carry, allowing massive contexts to fit into limited VRAM.
- The Sparse & Linear Revolution: How sparse attention methods like StreamingLLM exploit "attention sinks" to discard noise, and how linear attention and state space models like Mamba reorder the math to reach linear complexity, O(N).
- The Future, Test-Time Training (TTT): What happens when an LLM's internal state acts like a "fast learner," updating itself as it reads your prompt?

Whether you're an AI researcher or just an enthusiast wanting to understand the "why" behind the AI boom, this episode provides a full map of the efficiency landscape.

References & Sources:
- Vaswani et al. (2017). Attention Is All You Need.
- Zhang et al. (2025). A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear.
- Sun et al. (2025). Speed Always Wins: A Survey on Efficient Architectures for Large Language Models.
- Sun et al. (2025). Efficient Attention Mechanisms for Large Language Models: A Survey.
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
- Sun et al. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States.

Credit: This podcast was created using NotebookLM.
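The "online softmax" trick behind FlashAttention, mentioned in the episode, can be sketched in a few lines of plain Python. This is a toy, single-query illustration of the running-max rescaling idea, not the real GPU kernel; the function name, block size, and variables are ours, chosen for clarity:

```python
import math

def online_softmax_attention(scores, values, block_size=2):
    """Compute softmax(scores) . values for one query row, block by block,
    without ever materializing the full softmax vector.

    This mirrors the FlashAttention recurrence: keep a running max (m),
    a running normalizer (l), and a running weighted sum (acc), and
    rescale the old partial results whenever a new block raises the max.
    """
    m = float("-inf")   # running max of scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        # Rescale previous partial sums to the new max (0.0 on first block).
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        l = l * scale + sum(math.exp(s - m_new) for s in s_blk)
        acc = acc * scale + sum(math.exp(s - m_new) * v
                                for s, v in zip(s_blk, v_blk))
        m = m_new
    return acc / l
```

Because each block is processed once and only three scalars of state survive between blocks, a GPU kernel built on this recurrence can keep everything in fast on-chip memory instead of writing the full N×N score matrix to HBM.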