Reinforcement Learning (RL) for large language models (LLMs) often suffers from severe training instability that can end in catastrophic model collapse, particularly with existing algorithms such as GRPO. This instability stems largely from the misapplication of token-level importance weights, which inject high variance and noise into the training gradients. To address these limitations, this paper proposes Group Sequence Policy Optimization (GSPO), a novel RL algorithm for training large language models. GSPO's key innovation is a theoretically grounded definition of the importance ratio based on sequence likelihood, together with sequence-level clipping, rewarding, and optimization. Empirical evaluations demonstrate GSPO's superior training stability, efficiency, and overall performance compared to GRPO. Critically, GSPO inherently resolves the core stability challenges of RL training for LLMs, eliminating the need for complex stabilization strategies. These merits have contributed to notable performance improvements in state-of-the-art LLMs, such as the latest Qwen3 models, and simplify the design of future RL infrastructure.

#ReinforcementLearning #LargeLanguageModels #LLM #PolicyOptimization #MachineLearning #DeepLearning #AI #TrainingStability #Algorithm #GSPO

paper - https://arxiv.org/abs/2507.18071
subscribe - https://t.me/arxivpaper

donations:
USDT: 0xAA7B976c6A9A7ccC97A3B55B7fb353b6Cc8D1ef7
BTC: bc1q8972egrt38f5ye5klv3yye0996k2jjsz2zthpr
ETH: 0xAA7B976c6A9A7ccC97A3B55B7fb353b6Cc8D1ef7
SOL: DXnz1nd6oVm7evDJk25Z2wFSstEH8mcA1dzWDCVjUj9e

created with NotebookLM
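For readers who want a concrete sense of the sequence-level importance ratio described above, here is a minimal Python sketch. This is not the authors' code: the function name, the use of group-normalized advantages, and the clipping threshold eps=0.2 are assumptions following common PPO/GRPO conventions; consult the paper for the exact objective.

```python
import numpy as np

def gspo_style_loss(logp_new, logp_old, rewards, eps=0.2):
    """Sketch of a sequence-level clipped policy loss in the spirit of GSPO.

    logp_new, logp_old: per-token log-probs of each sampled response under
                        the current and the old policy, respectively
    rewards:            one scalar reward per response in the group
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # Group-normalized advantage: one scalar per sampled response
    # (GRPO-style group baseline; an assumption in this sketch).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    per_seq_losses = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        T = len(lp_new)
        # Sequence-level importance ratio with length normalization:
        #   s = (pi_theta(y|x) / pi_old(y|x)) ** (1 / |y|)
        # computed in log space for numerical stability.
        log_ratio = (sum(lp_new) - sum(lp_old)) / T
        s = np.exp(log_ratio)
        # PPO-style clipping applied to the whole sequence, not per token.
        per_seq_losses.append(-min(s * a, np.clip(s, 1.0 - eps, 1.0 + eps) * a))
    return float(np.mean(per_seq_losses))

# Toy usage: a group of three sampled responses with per-token log-probs.
logp_new = [[-0.5, -0.7], [-1.2, -0.9, -1.0], [-0.3]]
logp_old = [[-0.6, -0.8], [-1.0, -1.1, -0.9], [-0.4]]
print(gspo_style_loss(logp_new, logp_old, rewards=[1.0, 0.0, 0.5]))
```

The point of the sketch is the single line computing `log_ratio`: the ratio, clipping, and advantage all attach to the whole sequence, whereas token-level methods apply a separate (noisier) ratio at every generation step.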