In this video, we break down NVIDIA's latest research paper, "TiDAR: Think in Diffusion, Talk in Autoregression." This hybrid architecture tackles a major bottleneck in Large Language Models: the trade-off between inference speed and generation quality. TiDAR introduces a novel method that drafts tokens in parallel using diffusion ("thinking") and verifies them sequentially using autoregression ("talking"), all in a single forward pass.

We cover:
- The Core Problem: why autoregressive (AR) decoding is memory-bound and why diffusion models often struggle with quality.
- The TiDAR Solution: how the hybrid architecture exploits free GPU compute density to draft and sample simultaneously.
- Architecture Deep Dive: the specific attention masks and the "free token slots" concept.
- The Benchmarks: how TiDAR achieves 4.71x to 5.91x higher throughput than standard AR models while maintaining comparable quality.

If you are an AI engineer or researcher looking to reduce LLM inference latency without sacrificing output quality, this paper is a must-read.

Paper Reference: "TiDAR: Think in Diffusion, Talk in Autoregression" (Liu et al., 2025), NVIDIA. https://arxiv.org/pdf/2511.08923
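To make the draft-and-verify idea concrete, here is a minimal, self-contained Python sketch of the decoding loop. It is a conceptual toy, not the paper's implementation: the names draft_parallel, ar_next_token, and decode_step are hypothetical, the random drafter and the ord-sum "model" are stand-ins, and in TiDAR both roles actually run inside a single forward pass of one model rather than as separate functions.

```python
import random

random.seed(0)
VOCAB = list("abcdefgh")  # toy vocabulary of single-character tokens

def draft_parallel(context, k):
    # Stand-in for the diffusion drafter: propose k tokens at once.
    # (Assumption for illustration: in TiDAR this is a parallel pass,
    # not random sampling.)
    return [random.choice(VOCAB) for _ in range(k)]

def ar_next_token(context):
    # Deterministic toy stand-in for the autoregressive head:
    # the "correct" next token given the context.
    return VOCAB[sum(ord(c) for c in context) % len(VOCAB)]

def decode_step(context, k=4):
    # One draft-and-verify step: draft k tokens in parallel, then check
    # them left to right against the AR prediction. Matching draft tokens
    # are accepted; on the first mismatch we keep the AR token and stop.
    # Because position i is verified given the draft tokens before it,
    # all k checks can share one forward pass in the real model.
    draft = draft_parallel(context, k)
    accepted = []
    for tok in draft:
        target = ar_next_token(context + accepted)
        if tok == target:
            accepted.append(tok)       # draft agrees with AR: keep it
        else:
            accepted.append(target)    # mismatch: take the AR token, stop
            break
    return accepted

context = list("ab")
for step_num in range(5):
    new_tokens = decode_step(context)
    print(f"step {step_num}: accepted {len(new_tokens)} token(s): {''.join(new_tokens)}")
    context += new_tokens
```

The expected number of accepted tokens per step is what drives the throughput gain; TiDAR's contribution is making the drafting essentially free by filling the "free token slots" of the same forward pass, rather than running a separate draft model as classic speculative decoding does.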