Scalable Inference Algorithms for LLMs: REFORM & STAND

In this presentation, Woomin Song introduces two training-free frameworks for efficient LLM inference: REFORM for long-context processing and STAND for accelerating test-time scaling.

Part 1: REFORM (NeurIPS 2025)
Learn how REFORM overcomes the quadratic computational cost of Transformer attention and the KV cache memory bottleneck. By combining recurrent chunking with on-demand cache recomputation, REFORM achieves 75% accuracy on the 1M-token Needle-In-A-Haystack benchmark while significantly reducing latency and memory usage.

Part 2: STAND (EMNLP 2025)
Discover how STAND accelerates test-time scaling (chain-of-thought reasoning, majority voting, tree search) through model-free speculative decoding. By leveraging cross-trajectory n-gram overlaps and stochastic drafting, STAND matches the accuracy of standard decoding in under 40% of the decoding time. (Minimal illustrative sketches of both ideas follow the hashtags at the end of this description.)

Both works were conducted during the speaker's internship at Amazon.

Speaker: Woomin Song | Integrated M.S.-Ph.D. student at KAIST
Affiliation: KAIST (Korea Advanced Institute of Science and Technology)
[Resume & Profile] https://woominsong.github.io/

---

Timestamps:

[Part 1: REFORM - Long Context Processing]
[00:00] Introduction: Scalable Inference Algorithms for LLMs
[00:42] The Problem: Quadratic computational costs and KV cache bottlenecks
[01:52] The Challenge: Pre-trained context length limits
[02:18] Existing Solutions: Recurrent Compression (StreamingLLM, H2O)
[03:36] Existing Solutions: Random Access approaches and their limitations
[04:28] Introducing REFORM: Best of both worlds
[05:08] Key Observation: Attention heads as token selectors using cosine similarity
[05:52] Methodology Overview: Compress, Gather, and Recompute stages
[06:28] Step 1: Compress - Recurrent chunking with early exit strategy
[08:12] Handling KV Cache: Token eviction using attention scores
[08:52] Step 2: Gather - Cosine similarity search for relevant tokens
[09:16] Step 3: Recompute - Forwarding gathered inputs for generation
[09:32] Evaluation: Needle-In-A-Haystack (NIAH) benchmark results
[10:24] Synthetic Benchmarks: Comparison with InfLLM (23% vs 75% at 1M tokens)
[10:52] Realistic Benchmarks: InfiniteBench, RepoEval, and MM-NIAH results
[11:28] Efficiency Analysis: Inference time and peak GPU memory savings
[12:16] Comparison with RAG: Architecture-level advantages
[13:24] Ablation Studies: Compression strategies and head selection

[Part 2: STAND - Test-Time Scaling Acceleration]
[14:08] Introduction: Test-time scaling and the latency problem
[15:12] Background: Chain-of-thought, majority voting, and tree search
[16:32] The Research Problem: Speeding up without compromising accuracy
[17:04] Speculative Decoding: Draft-then-verify framework
[18:16] Key Observation: High n-gram overlap across reasoning trajectories
[19:08] Model-Free Drafters: Leveraging cross-trajectory information
[20:04] Stochastic vs Deterministic Drafting: Why sampling matters
[21:16] STAND Components: N-gram drafter with probability awareness
[22:08] Optimization Techniques: Gumbel top-k trick for faster sampling
[22:32] Tree Drafting: Optimizing tree structure for higher acceptance
[23:16] Evaluation: AIME 2024, GPQA Diamond, and LiveCodeBench results
[24:28] Results: Same accuracy in under 40% of the decoding time
[25:04] Batch Decoding Scenarios: STAND remains effective in parallel inference
[25:32] Ablation Studies: Contribution of stochastic drafting and tree optimization
[26:24] Key Finding: Deeper and narrower tree structures perform better
[26:52] Summary: N-gram based speculative decoding for test-time scaling

[Q&A Session]
[27:28] Q&A: How speculative decoding ensures output correctness
[31:04] Q&A: Greedy decoding vs sampling scenarios
[33:28] Q&A: Tree drafting explanation and benefits
[38:24] Q&A: Batch decoding and high-throughput inference scenarios

---

Hosted by AER Labs

#REFORM #STAND #KAIST #LLM #LongContext #SpeculativeDecoding #TestTimeScaling #DeepLearning #Transformer #Inference #AIResearch #NLP #MachineLearning #NeurIPS2025 #EMNLP2025
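For a concrete picture of the Gather step mentioned in the Part 1 timestamps ([08:52]), below is a minimal Python sketch of the cosine-similarity selection idea: cached per-token representations from the recurrent chunked pass are scored against a query representation, and only the best-matching positions are kept for recomputation. This is an illustration of the concept described in the talk, not the REFORM implementation; the function name, array shapes, and the `top_k` parameter are assumptions made for the example.

```python
import numpy as np

def gather_relevant_tokens(token_vectors: np.ndarray,
                           query_vector: np.ndarray,
                           top_k: int = 1024) -> np.ndarray:
    """Select the token positions whose cached representations are most
    similar (by cosine similarity) to the query representation.

    token_vectors: (num_tokens, dim) vectors kept during the chunked pass.
    query_vector:  (dim,) representation of the current query.
    Returns the selected positions in their original document order.
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    tokens = token_vectors / (np.linalg.norm(token_vectors, axis=1, keepdims=True) + 1e-8)
    query = query_vector / (np.linalg.norm(query_vector) + 1e-8)
    scores = tokens @ query

    # Keep the top-k scoring positions, then restore document order so the
    # gathered tokens can be re-forwarded as a shorter, coherent context.
    selected = np.argsort(-scores)[:top_k]
    return np.sort(selected)
```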
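The next sketch illustrates the model-free, stochastic n-gram drafting idea behind STAND ([18:16]-[21:16]): n-gram statistics collected across reasoning trajectories are reused to propose draft tokens by sampling from the observed continuations rather than always taking the most frequent one. This is a simplified illustration, not the STAND code; the class name and parameters are invented for the example, and the probability-aware table, Gumbel top-k sampling, and tree drafting from the talk are omitted.

```python
import random
from collections import defaultdict

class NGramDrafter:
    """Minimal model-free n-gram drafter (illustrative only).

    Counts which token follows each (n-1)-token context across decoded
    trajectories, then drafts by sampling from those counts (stochastic
    drafting) instead of deterministically picking the top continuation.
    """

    def __init__(self, n: int = 3):
        self.n = n
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, tokens: list[int]) -> None:
        # Record n-gram statistics from a finished or partial trajectory.
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i : i + self.n - 1])
            nxt = tokens[i + self.n - 1]
            self.counts[ctx][nxt] += 1

    def draft(self, prefix: list[int], length: int = 5) -> list[int]:
        # Propose up to `length` tokens; stop if the context is unseen.
        out = []
        ctx = tuple(prefix[-(self.n - 1):])
        for _ in range(length):
            dist = self.counts.get(ctx)
            if not dist:
                break
            candidates, weights = zip(*dist.items())
            nxt = random.choices(candidates, weights=weights, k=1)[0]  # stochastic draft
            out.append(nxt)
            ctx = ctx[1:] + (nxt,)
        return out
```

In a full speculative-decoding loop, these drafted tokens would then be verified in parallel by the target LLM and only the accepted prefix kept, which is what keeps the speed-up accuracy-preserving.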
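Finally, since the [27:28] Q&A asks how speculative decoding ensures output correctness, here is a generic (not STAND-specific) verification sketch for the greedy-decoding case: the target model scores all drafted positions in one parallel forward pass, and only the longest prefix that matches the target model's own greedy choices is accepted, so the final output is identical to ordinary greedy decoding. Under sampling, the standard speculative-sampling acceptance rule is used instead, which the [31:04] Q&A touches on.

```python
def verify_greedy(drafted: list[int], target_greedy: list[int]) -> list[int]:
    """Accept the longest prefix of the draft that the target model would
    also have produced under greedy decoding.

    target_greedy[i] is the target model's greedy next-token choice at
    draft position i, obtained from a single parallel forward pass.
    """
    accepted = []
    for draft_tok, target_tok in zip(drafted, target_greedy):
        if draft_tok != target_tok:
            break  # first mismatch: discard the rest of the draft
        accepted.append(draft_tok)
    return accepted
```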