Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

Posted 4 weeks ago

Paper: https://arxiv.org/abs/2509.08721v1

Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright

Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in silo if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.

Welcome to Mayuresh Shilotri's YouTube channel.
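To make the rollout "sharing" idea in the abstract a little more concrete, here is a minimal Python sketch of how a single node might assemble a training batch from its own rollouts plus rollouts sampled from other nodes. This is only an illustration under stated assumptions, not the authors' implementation: the names `Rollout`, `build_training_batch`, `n_local`, and `n_external` are hypothetical, the shared pool is modeled as a plain list rather than a real network, and reward computation and the policy update are left out entirely.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    node_id: str      # which node generated this rollout (hypothetical field)
    prompt: str
    completion: str


def build_training_batch(node_id, local_rollouts, shared_pool,
                         n_local, n_external, rng):
    """Mix a node's own rollouts with rollouts sampled from other nodes.

    Assumption: the split between local and external samples is a free
    parameter chosen by each node; the abstract does not specify it.
    """
    # Only rollouts produced by *other* nodes count as external experience.
    external = [r for r in shared_pool if r.node_id != node_id]
    # A node can run in isolation: with no peers, or n_external == 0,
    # the batch simply contains local rollouts only.
    k_ext = min(n_external, len(external))
    k_loc = min(n_local, len(local_rollouts))
    return rng.sample(local_rollouts, k_loc) + rng.sample(external, k_ext)


rng = random.Random(0)
local = [Rollout("node-a", "3+3?", "6"), Rollout("node-a", "1+1?", "2")]
pool = [
    Rollout("node-b", "2+2?", "4"),
    Rollout("node-c", "2+2?", "5"),
    Rollout("node-a", "2+2?", "4"),  # own rollout, filtered out as external
]

batch = build_training_batch("node-a", local, pool,
                             n_local=2, n_external=1, rng=rng)
solo_batch = build_training_batch("node-a", local, [],
                                  n_local=2, n_external=1, rng=rng)
```

In this sketch, `batch` holds two local rollouts and one borrowed from a peer, while `solo_batch` shows the isolated case where no peers are available and the node falls back to its own experience.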
Maintained by Mayuresh Shilotri. You can follow me at:
Blog - https://shilotri.com/
LinkedIn - / mayureshshilotri
Twitter - / mshilotri

Note: I only claim to have read the research paper and created a video using an AI tool. I am not the author. All intellectual heavy lifting was performed by the respective authors. 🙏
