In this video I introduce all the innovations in the Mistral 7B and Mixtral 8x7B models: Sliding Window Attention, the KV-Cache with Rolling Buffer, Pre-Fill and Chunking, and the Sparse Mixture of Experts (SMoE). I also guide you through the most difficult parts of the code: model sharding and the use of the xformers library to compute attention for multiple prompts packed into a single sequence. In particular, I show the attention computed with BlockDiagonalCausalMask, BlockDiagonalMask and BlockDiagonalCausalWithOffsetPaddedKeysMask. I also show why Sliding Window Attention allows a token to "attend" to tokens outside the attention window, by linking it to the concept of the receptive field, typical of Convolutional Neural Networks (CNNs), and I prove it mathematically. When introducing model sharding, I also talk about pipeline parallelism, because the official Mistral repository refers to micro-batching. A few short, illustrative code sketches of these ideas follow the chapter list below.

I have released a copy of the Mistral code, commented and annotated by me (especially the most difficult parts): https://github.com/hkproj/mistral-src...

Slides (PDF) and Python notebooks: https://github.com/hkproj/mistral-llm...

Prerequisite for watching this video:
• Attention is all you need (Transformer) - ...

Other material for a better understanding of Mistral:
Grouped Query Attention, Rotary Positional Encodings, RMS Normalization: • LLaMA explained: KV-Cache, Rotary Position...
Gradient Accumulation: • Distributed Training with PyTorch: complet...

Chapters
00:00:00 - Introduction
00:02:09 - Transformer vs Mistral
00:05:35 - Mistral 7B vs Mixtral 8x7B
00:08:25 - Sliding Window Attention
00:33:44 - KV-Cache with Rolling Buffer
00:49:27 - Pre-Fill and Chunking
00:57:00 - Sparse Mixture of Experts (SMoE)
01:04:22 - Model Sharding
01:06:14 - Pipeline Parallelism
01:11:11 - xformers (block attention)
01:24:07 - Conclusion
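
Sliding Window Attention and the receptive field: a quick numeric illustration (my own, not taken from the video). Each layer lets information move back at most W-1 positions, so stacking layers multiplies the effective attention span, which is how a token can be influenced by tokens outside its own window.

W = 4096               # Mistral 7B sliding window size
num_layers = 32        # number of transformer layers in Mistral 7B
span = num_layers * (W - 1)
print(span)            # 131040 -> a theoretical attention span of ~131K tokens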
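
Rolling Buffer KV-Cache: a minimal sketch (a simplification, not the official implementation). Each new key/value is written at position i % W, so the cache never grows beyond the last W tokens.

import torch

W, head_dim = 4, 2                     # tiny sizes just for illustration
k_cache = torch.zeros(W, head_dim)
for i in range(10):                    # pretend we decode 10 tokens
    new_k = torch.full((head_dim,), float(i))
    k_cache[i % W] = new_k             # overwrite the oldest entry
print(k_cache)                         # rows hold keys of tokens 8, 9, 6, 7 (rotated order)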
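
Sparse Mixture of Experts: a minimal sketch of top-2 routing (my own simplification, not Mixtral's exact code). A gate scores all experts, only the two best run, and their outputs are combined with softmax weights computed over those two scores.

import torch
import torch.nn.functional as F

n_experts, top_k, d = 8, 2, 16
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]   # stand-ins for the FFN experts
gate = torch.nn.Linear(d, n_experts)

x = torch.randn(d)                               # one token's hidden state
scores, idx = torch.topk(gate(x), top_k)         # keep only the top-2 experts
weights = F.softmax(scores, dim=-1)              # normalize over the selected experts only
y = sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))
print(y.shape)                                   # torch.Size([16])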
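
xformers block attention: a hedged sketch (assuming a CUDA device and a recent xformers version) of two prompts of lengths 3 and 5 packed into a single sequence and attended with a block-diagonal causal mask, so neither prompt attends to the other's tokens.

import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

seqlens = [3, 5]                        # two prompts packed back to back
total, n_heads, head_dim = sum(seqlens), 8, 64
q = torch.randn(1, total, n_heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

mask = BlockDiagonalCausalMask.from_seqlens(seqlens)   # causal within each prompt, zero across prompts
out = memory_efficient_attention(q, k, v, attn_bias=mask)
print(out.shape)                        # torch.Size([1, 8, 8, 64]) -- same shape as q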