TokenFormer Explained in 3 Minutes!
What if we treated model parameters like tokens? In this video, we dive into TokenFormer, a radical new architecture that replaces traditional linear projections with Token-Parameter Attention (TPA). Standard Transformers are hard to scale because their linear layers are "baked" into the architecture: if you change the width, you have to retrain from scratch. TokenFormer solves this by using attention for everything, allowing you to scale the model simply by adding more parameter tokens.

What we cover in 3 minutes:
✅ The Bottleneck: Why fixed linear projections make scaling expensive and rigid.
✅ Token-Parameter Attention (TPA): Replacing the Q, K, V, and MLP projections with attention.
✅ Parameter Tokens: Thinking of weights as "trainable memory slots" that inputs can query.
✅ Seamless Scaling: How to increase model capacity without changing hidden dimensions or breaking the architecture (a code sketch follows below).

Chapters:
[00:00] The Core Components of a Transformer
[00:43] The Problem: Fixed Linear Projections & Scaling
[01:29] The TokenFormer Breakthrough: Attention for Everything
[02:09] How Token-Parameter Attention Works
[02:45] Scaling Along a New Axis: Parameter Tokens
[03:04] Impact on Long-Context Modeling

#TokenFormer #attention #transformers #deeplearning #machinelearning #LLMs #AIResearch #neuralnetworks
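To make the TPA idea above concrete, here is a minimal PyTorch sketch of a layer where input tokens attend to learnable "parameter tokens" instead of passing through a fixed linear projection. The class name, the plain softmax normalization, and the `grow` helper are illustrative assumptions, not the paper's or the video's exact implementation.

```python
# Minimal sketch of Token-Parameter Attention (TPA).
# Assumptions (not from the source): names, plain softmax normalization,
# zero-initialized new parameter tokens in grow().
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenParameterAttention(nn.Module):
    """Stand-in for a fixed d_in -> d_out linear projection:
    input tokens query a set of learnable parameter tokens
    ("trainable memory slots") acting as keys and values."""

    def __init__(self, d_in: int, d_out: int, num_param_tokens: int):
        super().__init__()
        self.param_keys = nn.Parameter(torch.randn(num_param_tokens, d_in) * 0.02)
        self.param_values = nn.Parameter(torch.randn(num_param_tokens, d_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in); each input token is a query.
        scores = x @ self.param_keys.t() / (x.size(-1) ** 0.5)  # (B, S, P)
        weights = F.softmax(scores, dim=-1)   # attend over parameter tokens
        return weights @ self.param_values    # (B, S, d_out)

    @torch.no_grad()
    def grow(self, extra_tokens: int) -> None:
        """Add capacity by appending parameter tokens; the hidden
        dimensions d_in and d_out are left unchanged. New tokens start
        at zero here and would be refined by further training."""
        device = self.param_keys.device
        new_k = torch.zeros(extra_tokens, self.param_keys.size(1), device=device)
        new_v = torch.zeros(extra_tokens, self.param_values.size(1), device=device)
        self.param_keys = nn.Parameter(torch.cat([self.param_keys, new_k], dim=0))
        self.param_values = nn.Parameter(torch.cat([self.param_values, new_v], dim=0))


# Usage: same input/output shapes before and after scaling.
layer = TokenParameterAttention(d_in=512, d_out=512, num_param_tokens=1024)
x = torch.randn(2, 16, 512)
y = layer(x)    # (2, 16, 512)
layer.grow(512)  # more parameter tokens, no change to d_in/d_out
y2 = layer(x)   # still (2, 16, 512)
```

The design point this illustrates is the new scaling axis: capacity grows with the number of parameter tokens, so the surrounding architecture and tensor shapes stay fixed when the model is enlarged.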