In this video I introduce all the innovations in the Mistral 7B and Mixtral 8x7B models: Sliding Window Attention, the KV-Cache with Rolling Buffer, Pre-Fill and Chunking, and the Sparse Mixture of Experts (SMoE). I also guide you through the most difficult parts of the code: model sharding and the use of the xformers library to compute attention for multiple prompts packed into a single sequence. In particular, I show the attention computed with BlockDiagonalCausalMask, BlockDiagonalMask and BlockDiagonalCausalWithOffsetPaddedKeysMask. I also show why Sliding Window Attention allows a token to "attend" to tokens outside the attention window, by linking it to the concept of the receptive field, typical of Convolutional Neural Networks (CNNs), and I prove it mathematically. When introducing model sharding, I also talk about pipeline parallelism, because the official Mistral repository refers to micro-batching. A few short, illustrative code sketches of these ideas follow the chapter list below.

I have released a copy of the Mistral code, commented and annotated by me (especially the most difficult parts): https://github.com/hkproj/mistral-src...

Slides (PDF) and Python notebooks: https://github.com/hkproj/mistral-llm...

Prerequisite for watching this video:
• Attention is all you need (Transformer) - ...

Other material for a better understanding of Mistral:
Grouped Query Attention, Rotary Positional Encodings, RMS Normalization: • LLaMA explained: KV-Cache, Rotary Position...
Gradient Accumulation: • Distributed Training with PyTorch: complet...

Chapters
00:00:00 - Introduction
00:02:09 - Transformer vs Mistral
00:05:35 - Mistral 7B vs Mixtral 8x7B
00:08:25 - Sliding Window Attention
00:33:44 - KV-Cache with Rolling Buffer
00:49:27 - Pre-Fill and Chunking
00:57:00 - Sparse Mixture of Experts (SMoE)
01:04:22 - Model Sharding
01:06:14 - Pipeline Parallelism
01:11:11 - xformers (block attention)
01:24:07 - Conclusion
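
Sliding Window Attention and the receptive field: a quick numeric illustration (my own, not taken from the video). Each layer lets information move back at most W-1 positions, so stacking layers multiplies the effective attention span, which is how a token can be influenced by tokens outside its own window.

W = 4096               # Mistral 7B sliding window size
num_layers = 32        # number of transformer layers in Mistral 7B
span = num_layers * (W - 1)
print(span)            # 131040 -> a theoretical attention span of ~131K tokens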
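
Rolling Buffer KV-Cache: a minimal sketch (a simplification, not the official implementation). Each new key/value is written at position i % W, so the cache never grows beyond the last W tokens.

import torch

W, head_dim = 4, 2                     # tiny sizes just for illustration
k_cache = torch.zeros(W, head_dim)
for i in range(10):                    # pretend we decode 10 tokens
    new_k = torch.full((head_dim,), float(i))
    k_cache[i % W] = new_k             # overwrite the oldest entry
print(k_cache)                         # rows hold keys of tokens 8, 9, 6, 7 (rotated order)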
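
Sparse Mixture of Experts: a minimal sketch of top-2 routing (my own simplification, not Mixtral's exact code). A gate scores all experts, only the two best run, and their outputs are combined with softmax weights computed over those two scores.

import torch
import torch.nn.functional as F

n_experts, top_k, d = 8, 2, 16
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]   # stand-ins for the FFN experts
gate = torch.nn.Linear(d, n_experts)

x = torch.randn(d)                               # one token's hidden state
scores, idx = torch.topk(gate(x), top_k)         # keep only the top-2 experts
weights = F.softmax(scores, dim=-1)              # normalize over the selected experts only
y = sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))
print(y.shape)                                   # torch.Size([16])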
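
xformers block attention: a hedged sketch (assuming a CUDA device and a recent xformers version) of two prompts of lengths 3 and 5 packed into a single sequence and attended with a block-diagonal causal mask, so neither prompt attends to the other's tokens.

import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask

seqlens = [3, 5]                        # two prompts packed back to back
total, n_heads, head_dim = sum(seqlens), 8, 64
q = torch.randn(1, total, n_heads, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

mask = BlockDiagonalCausalMask.from_seqlens(seqlens)   # causal within each prompt, zero across prompts
out = memory_efficient_attention(q, k, v, attn_bias=mask)
print(out.shape)                        # torch.Size([1, 8, 8, 64]) -- same shape as q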