Why Chain-of-Thought Isn't Enough & Google's SCoRe Method Explained
The new SCoRe (Self-Correction via Reinforcement Learning) method by Google introduces a novel approach to improving the self-correction ability of large language models (LLMs), addressing the limitations of existing methods such as supervised fine-tuning (SFT). Traditional approaches suffer from a distribution mismatch between training data and test-time behavior, causing models to fail at effectively correcting their own errors. SCoRe resolves this with multi-turn reinforcement learning (RL), in which the model generates multiple attempts to correct its own output based on feedback. It uniquely leverages self-generated correction traces and optimizes the correction process through RL without external supervision.

The method consists of two stages. In Stage I, the model is trained to optimize its second-attempt responses while keeping the first attempt close to the base model, ensuring that the model does not deviate too much initially and thus avoiding collapse into trivial or minimal edits. This stage uses a KL-divergence constraint to maintain stability. (A minimal sketch of such a Stage I objective is included at the end of this description.)

In Stage II, full multi-turn RL is applied with a reward-shaping mechanism that biases the model towards making substantial improvements between the first and second attempts. The reward bonus incentivizes meaningful corrections while discouraging regressive behavior (breaking an answer that was already correct), allowing the model to explore correction strategies efficiently. (A sketch of this reward-shaping idea also appears below.)

SCoRe achieves significant improvements on tasks such as mathematical reasoning and code generation, outperforming previous methods by a large margin, particularly in self-correction efficiency. It demonstrates that multi-turn RL and reward shaping are essential for overcoming the failure modes of SFT, including distribution mismatch and the model's tendency to stick to minor corrections. This two-stage framework offers a scalable way to teach LLMs self-correction, enabling them to refine their outputs autonomously without external feedback.

To learn more about REINFORCE: the REINFORCE algorithm, also known as the Monte Carlo policy gradient, is an approach to solving reinforcement learning problems. It uses gradient ascent to optimize a policy by directly maximizing the expected cumulative reward. One of the best resources is Lilian Weng's blog: https://lilianweng.github.io/posts/20... (A short REINFORCE sketch is included at the end of this description.)

All rights w/ authors:

On the Diagram of Thought
https://arxiv.org/pdf/2409.10038

To CoT or Not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning
https://arxiv.org/pdf/2409.12183

Training Language Models to Self-Correct via Reinforcement Learning
https://arxiv.org/pdf/2409.12917

#chatgpt #ai #google #reinforcementlearning
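Stage I sketch (referenced above). This is a hedged illustration of the kind of objective described for Stage I, not the paper's code: a policy-gradient term that rewards good second attempts, plus a KL penalty that anchors the first-attempt distribution to the frozen base model so training does not collapse into trivial edits. All names, shapes, and the weight `beta` are illustrative assumptions.

```python
import torch.nn.functional as F

def stage_one_loss(logp_attempt2,        # (B,) summed log-probs of 2nd-attempt responses
                   reward2,              # (B,) reward of each 2nd attempt (e.g. 0/1 correctness)
                   logits_attempt1,      # (B, T, V) policy logits on 1st-attempt tokens
                   ref_logits_attempt1,  # (B, T, V) base-model logits on the same tokens
                   beta=0.1):            # assumed KL weight, not from the paper
    # REINFORCE-style surrogate: push up the log-probability of rewarded second attempts.
    pg_loss = -(reward2 * logp_attempt2).mean()

    # KL(pi || pi_ref) on the first attempt, keeping it close to the base model.
    kl = F.kl_div(
        F.log_softmax(ref_logits_attempt1, dim=-1),  # input  = log pi_ref
        F.log_softmax(logits_attempt1, dim=-1),      # target = log pi
        log_target=True,
        reduction="batchmean",
    )
    return pg_loss + beta * kl
```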
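Stage II reward-shaping sketch (referenced above). The exact functional form and the weight `alpha` are assumptions; the point is that the second attempt earns its own correctness reward plus a bonus proportional to the improvement over the first attempt.

```python
def shaped_second_attempt_reward(r1: float, r2: float, alpha: float = 1.0) -> float:
    # r1, r2: correctness rewards (e.g. 0 or 1) of the first and second attempts.
    # The bonus is positive when a wrong answer gets fixed (r2 > r1) and negative
    # when a correct answer gets broken (r2 < r1), discouraging regressive edits.
    return r2 + alpha * (r2 - r1)
```

With alpha = 1, fixing a wrong first attempt yields a shaped reward of 2, leaving it wrong yields 0, and breaking a correct first attempt yields -1, which is the bias towards substantial, non-regressive corrections described above.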
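REINFORCE sketch (referenced above). A minimal Monte Carlo policy-gradient loss in PyTorch: weight the log-probabilities of the chosen actions by the discounted returns of the episode, so that minimizing the loss performs gradient ascent on the expected cumulative reward. The `policy` network and episode tensors are assumptions for illustration.

```python
import torch

def reinforce_loss(policy, states, actions, rewards, gamma=0.99):
    # Monte Carlo returns G_t = sum_{k >= t} gamma**(k - t) * r_k, computed backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = returns - returns.mean()  # simple baseline to reduce variance

    log_probs = torch.log_softmax(policy(states), dim=-1)          # (T, num_actions)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)

    # Minimizing this is gradient ascent on E[G_t * log pi(a_t | s_t)].
    return -(chosen * returns).mean()
```

Calling this once per sampled episode and stepping an optimizer on the result is the classic Monte Carlo policy-gradient update described in Lilian Weng's post linked above.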