DeepSeek-V3.2 Attention Mechanisms Explained: DSA, MLA, and Vanilla Attention
DeepSeek V3.2 introduces a major shift in how Large Language Models handle information processing. This video explores the specific attention mechanisms detailed in the latest technical report, comparing industry baselines with DeepSeek's new architecture.

We start by analyzing the limitations of standard Vanilla Attention. While this dense mechanism lets every token attend to every other token, the report highlights that its quadratic complexity creates severe efficiency bottlenecks for long sequences.

Then we unpack the primary breakthrough, DeepSeek Sparse Attention (DSA). You will learn how this mechanism uses a lightning indexer and fine-grained token selection to drastically reduce computational complexity. Instead of processing the entire sequence, the model uses the indexer to select only the top-k most relevant key-value entries for each query, reducing the attention cost from quadratic to near-linear in sequence length.

We also explain the underlying architecture, Multi-Head Latent Attention (MLA), and break down how DSA is instantiated under the Multi-Query Attention (MQA) mode of MLA, where latent vectors are shared across query heads to optimize decoding efficiency at the kernel level.

Finally, we look at the training process, which involves a dense warm-up stage: the model briefly uses dense attention to initialize the lightning indexer before switching to the sparse training stage. This combination allows DeepSeek V3.2 to match the reasoning capabilities of top-tier proprietary models like GPT-5 and Gemini 3.0 Pro while significantly lowering inference costs.

Timestamps:
0:00 Introduction to DeepSeek V3.2
1:30 Vanilla Attention and the Efficiency Bottleneck
3:15 DeepSeek Sparse Attention (DSA) Explained
5:00 The Lightning Indexer and Token Selection
6:45 Multi-Head Latent Attention (MLA) Architecture
8:20 The MQA Mode vs. MHA Mode
10:10 The Dense Warm-up Training Stage
12:00 Performance Results and Inference Costs
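To make the contrast between dense attention and DSA concrete, here is a minimal NumPy sketch for a single query vector. This is not the DeepSeek implementation: the shapes, the tiny dot-product "lightning indexer" (the `idx_q` / `idx_K` projections), and the `top_k` value are all illustrative assumptions. The point it demonstrates is the structural one from the video: a cheap scoring pass picks the top-k key-value entries, and full attention then runs only over that subset instead of over all L tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, K, V):
    # Vanilla attention: the query scores every key, so one query
    # costs O(L); a full sequence of L queries costs O(L^2).
    scores = (K @ q) / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def dsa_attention(q, K, V, idx_q, idx_K, top_k):
    # Hypothetical DSA sketch: a lightweight indexer (small
    # low-dimensional projections) scores all keys cheaply, then
    # full attention runs only on the top-k selected entries.
    index_scores = idx_K @ idx_q                # cheap scoring pass
    sel = np.argsort(index_scores)[-top_k:]     # fine-grained top-k token selection
    scores = (K[sel] @ q) / np.sqrt(q.shape[-1])
    return softmax(scores) @ V[sel]

rng = np.random.default_rng(0)
L, d, d_idx, k = 1024, 64, 16, 128              # assumed sizes, for illustration
q = rng.normal(size=(d,))
K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d))
idx_q, idx_K = rng.normal(size=(d_idx,)), rng.normal(size=(L, d_idx))

out_dense = dense_attention(q, K, V)
out_sparse = dsa_attention(q, K, V, idx_q, idx_K, top_k=k)
print(out_dense.shape, out_sparse.shape)        # both outputs have shape (64,)
```

Note that the indexer itself still scans all L keys, but in a much smaller dimension (`d_idx` vs `d`), which is why the overall cost is dominated by the top-k attention rather than the full quadratic pass.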