How I finetuned a Small LM to THINK and solve puzzles on its own (GRPO & RL!)
In this hands-on tutorial, I explain reasoning LLMs and SLMs and write the Group Relative Policy Optimization (GRPO) algorithm from scratch in PyTorch. The tutorial is aimed specifically at Small Language Models (SLMs), but the same principles apply to Large Language Models (LLMs) too. We also go through the policy gradient equation, explain RLVR (reinforcement learning with verifiable rewards), and visualize exactly how reasoning models work!

All materials for this video (and for all other videos on the channel) are shared on my Patreon page. / neuralbreakdownwithavb

Get 25% off on Ninjachat. Access multiple frontier LLMs and image, video, and audio generation models, all in one place. Use this link: https://ninjachat.ai/?ref=avishek and the code AI25 to get 25% off!

#ai #languagemodels #machinelearning

More RL videos:
Curiosity and Sparse Reward Environments: • How to solve Reinforcement Learning when t...
RL Primer: • Reinforcement Learning AI through 4 famous...

More Language Modelling videos:
Attention to Transformers playlist: • Attention to Transformers from zero to her...
Guide to fine-tuning open source LLMs: • Finetune LLMs to teach them ANYTHING with ...
Generative Language Modeling from scratch: • From Attention to Generative Language Mode...

Papers:
DeepSeekMath: https://arxiv.org/pdf/2402.03300
DeepSeek R1: https://arxiv.org/abs/2501.12948
DAPO: https://arxiv.org/abs/2503.14476
Critical Perspectives on R1: https://arxiv.org/abs/2503.20783

Timestamps:
0:00 - Thinking LLMs are taking over!
3:47 - Setting up Reinforcement Learning Environment
4:50 - Reasoning Gym library - Rewards
8:00 - GRPO Visually explained
10:41 - Policy Optimization and PPO loss Explained
15:45 - Coding response generation
20:55 - Coding Reward Generation & Advantages
26:25 - Calculating log probabilities
30:58 - RL Training loop
33:49 - Visualizing log probabilities post training
36:01 - The GRPO and PPO Loss function
38:19 - Surrogate clipping
41:21 - Supervised Finetuning and LoRA training
43:26 - Reasoning SLM results!
45:36 - 10 Practical Tips for finetuning Reasoning SLMs
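
For readers who want a feel for the "Advantages" and "Surrogate clipping" chapters before watching, here is a minimal sketch of the two ideas. It is not the code from the video. It uses plain Python rather than PyTorch so it runs without dependencies, and it makes several simplifying assumptions: one log-probability per whole response instead of per token, population standard deviation for normalization, and no KL penalty term. The function names and the sample numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population std with a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss, averaged over the responses in a group.

    Assumption: each log-prob is a single sequence-level value, not per token.
    """
    terms = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)  # negate: we minimize the loss to maximize the objective


# Hypothetical group of 4 sampled responses with verifiable 0/1 rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical log-probs; before any update the new and old policies coincide.
old_lp = [-12.0, -15.0, -14.0, -11.0]
loss = clipped_surrogate_loss(old_lp, old_lp, advantages)
```

Correct responses get a positive advantage and incorrect ones a negative advantage, so no separate value network is needed. The clipping limits how far a single update can move the probability ratio away from 1.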