How AI Learned to Think: The Complete Technical Breakdown

DeepSeek R1 and OpenAI o1 have changed the game, but how do these models actually "think"? This video is a technical deep dive into the breakthrough that let AI move from simple imitation (Supervised Fine-Tuning) to genuine reasoning via Reinforcement Learning (RL).

Earlier LLMs were sophisticated models trained on human data, yet they often failed at complex, multi-step logic. The unlock wasn't just bigger models; it was giving them "time to think" during inference. Shifting from imitating human reasoning traces to optimizing against verifiable reward functions produced emergent reasoning behaviors, such as self-correction and backtracking, even though no human ever explicitly taught the models these strategies.

This video explains exactly how reinforcement learning taught AI to reason, from the 2022 chain-of-thought discovery to DeepSeek R1's emergent self-correction. No hype, just the actual algorithms, training pipelines, and mathematical foundations that power today's thinking models. Minimal code sketches of the key algorithms appear at the end of this description.

------------------ CHAPTERS ------------------

0:00 - The Impossible Leap (o1's breakthrough moment)
0:34 - The Discovery Hidden in Plain Sight (chain-of-thought prompting)
1:12 - Beyond Linear Thinking (tree-of-thought search)
1:47 - The Imitation Trap (why copying humans fails)
2:21 - The RLHF Foundation (InstructGPT's 100x improvement)
3:00 - Teaching AI to Learn from Success (sparse rewards)
3:32 - Step-by-Step Feedback (process reward models)
4:05 - The Credit Assignment Challenge (500 tokens, 1 signal)
4:34 - The Policy: AI's Decision Engine (probability distributions)
5:05 - Predicting Future Success (value functions)
5:35 - Learning from Outcomes (policy gradients)
6:06 - Gaming the System (reward hacking & verifiable rewards)
6:37 - The REINFORCE Algorithm (variance problem)
7:08 - Better Than Average (advantage functions)
7:46 - Stable Learning at Scale (PPO clipping)
8:18 - The Emergence of Machine Thinking (DeepSeek R1 & GRPO)
8:56 - The Current Reality Check (current limitations)
9:29 - Multiple Paths Forward (future directions)

--------------------------------------------- KEY PAPERS & RESOURCES ---------------------------------------------

Chain-of-Thought Prompting (Wei et al., 2022) - https://arxiv.org/abs/2201.11903
InstructGPT / RLHF (Ouyang et al., 2022) - https://arxiv.org/abs/2203.02155
Let's Verify Step by Step - Process Reward Models (Lightman et al., 2023) - https://arxiv.org/abs/2305.20050
OpenAI o1 System Card (2024) - https://arxiv.org/abs/2412.16720v1
DeepSeek-R1 Technical Report (2025) - https://arxiv.org/abs/2501.12948
DeepSeekMath - GRPO Algorithm (Shao et al., 2024) - https://arxiv.org/abs/2402.03300
Proximal Policy Optimization (Schulman et al., 2017) - https://arxiv.org/abs/1707.06347
GSM8K Math Benchmark - https://github.com/openai/grade-schoo...
AIME (American Invitational Mathematics Examination) - https://artofproblemsolving.com/wiki/...

#ai #machinelearning #reinforcementlearning #llm #openai #deepseek #o1 #o3 #chainofthought #ppo #rlhf #neuralnetworks #airesearch #deeplearning #reasoning #aiexplained #techeducation #aireasoning
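--------------------------------------------- CODE SKETCHES ---------------------------------------------

First, verifiable rewards (6:06). Instead of a learned reward model that can be gamed, the trainer scores each sampled chain of thought against a ground-truth answer. A minimal Python sketch; the "Answer: <number>" output format and the function name verifiable_reward are illustrative assumptions, not the actual R1 pipeline:

import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the
    reference, else 0.0. Assumes the model ends its chain of thought
    with a line like 'Answer: 42' (hypothetical format)."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    if match is None:
        return 0.0  # no parseable answer -> no reward (sparse signal)
    return 1.0 if match.group(1) == gold_answer else 0.0

print(verifiable_reward("... so 6 * 7 = 42.\nAnswer: 42", "42"))  # 1.0
print(verifiable_reward("... I think it's 41.\nAnswer: 41", "42"))  # 0.0

Note this is an outcome reward: one signal for an entire completion, which is exactly the credit assignment problem the video raises at 4:05.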
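Next, the REINFORCE policy gradient (6:37) and the baseline that tames its variance by turning raw rewards into advantages (7:08). A toy sketch where a 3-way softmax stands in for next-token choices; the reward probabilities and learning rates are made-up illustration values:

import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(3)                       # toy "policy" over 3 actions
true_rewards = np.array([0.2, 0.5, 0.8])   # expected reward per action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

baseline = 0.0  # running mean reward, subtracted to cut gradient variance
for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)                 # sample an action
    r = float(rng.random() < true_rewards[a])  # stochastic binary reward

    advantage = r - baseline                   # "was this better than average?"
    baseline += 0.01 * (r - baseline)          # update the running baseline

    grad_logp = -probs                         # grad of log softmax prob
    grad_logp[a] += 1.0                        # equals one_hot(a) - probs
    logits += 0.1 * advantage * grad_logp      # ascend the policy gradient

print(softmax(logits))  # probability mass concentrates on the best action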
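Then PPO's clipped surrogate objective (7:46), from the Schulman et al. (2017) paper linked above: the probability ratio between the current policy and the one that collected the data is clipped so a single noisy batch can't push the policy too far. A minimal PyTorch sketch with made-up numbers:

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss. Inputs are per-sample log-probs under the
    current and the data-collecting policy, plus advantage estimates."""
    ratio = torch.exp(logp_new - logp_old)            # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # take the pessimistic bound; negated because optimizers minimize
    return -torch.minimum(unclipped, clipped).mean()

logp_old = torch.log(torch.tensor([0.5, 0.1, 0.3]))
logp_new = torch.log(torch.tensor([0.6, 0.05, 0.3]))
adv = torch.tensor([1.0, -0.5, 0.2])
print(ppo_clip_loss(logp_new, logp_old, adv))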
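Finally, GRPO (8:18), introduced in DeepSeekMath and used to train DeepSeek R1: it drops PPO's learned value function and instead baselines each completion against the other samples for the same prompt. A sketch of the group-relative advantage computation described in Shao et al. (2024):

import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages: sample G completions per prompt, score
    each with the verifiable reward, then z-score within the group."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# 8 completions for one math prompt; 3 reached the correct answer
print(grpo_advantages([1, 0, 0, 1, 0, 0, 0, 1]))

Correct completions get a positive advantage and wrong ones a negative advantage, so the policy is pushed toward whatever reasoning produced the right answers, with no human-written trace required.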