USENIX ATC '23 - Accelerating Distributed MoE Training and Inference with Lina

Jiamin Li, City University of Hong Kong; Yimin Jiang, ByteDance Inc.; Yibo Zhu, Unaffiliated; Cong Wang, City University of Hong Kong; Hong Xu, The Chinese University of Hong Kong

Scaling model parameters improves model quality at the price of high computation overhead. Sparsely activated models, usually in the form of a Mixture of Experts (MoE) architecture, have sub-linear scaling of computation cost with model size, thus providing opportunities to train and serve a larger model at lower cost. However, distributed MoE training and inference are inefficient, mainly due to the interleaved all-to-all communication during model computation.

This paper makes two main contributions. First, we systematically analyze the all-to-all overhead in distributed MoE and present the main causes for it to be the bottleneck in training and inference, respectively. Second, we design and build Lina to address the all-to-all bottleneck head-on. Lina opportunistically prioritizes all-to-all over the concurrent allreduce whenever feasible using tensor partitioning, so that all-to-all and training step time are improved. Lina further exploits the inherent pattern of expert selection to dynamically schedule resources during inference, so that the transfer size and bandwidth of all-to-all across devices are balanced amid the highly skewed expert popularity in practice. Experiments on an A100 GPU testbed show that Lina reduces the training step time by up to 1.73x and reduces the 95th-percentile inference time by an average of 1.63x over state-of-the-art systems.

View the full USENIX ATC '23 program at https://www.usenix.org/conference/atc...
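
The training-side idea in the abstract, prioritizing all-to-all over the concurrent allreduce via tensor partitioning, can be sketched roughly as follows. This is a minimal illustration, not Lina's implementation: it assumes an already-initialized torch.distributed process group (e.g. launched with torchrun on an NCCL backend), and the function name, chunk count, and the way the pending all-to-all is passed in are invented for the example.

```python
# Minimal sketch (illustrative only) of the tensor-partitioning idea from the
# abstract: split a large gradient allreduce into chunks so that a pending MoE
# all-to-all can be issued ahead of the remaining chunks instead of waiting
# behind one monolithic allreduce. Assumes torch.distributed is already
# initialized (e.g. via torchrun with the NCCL backend).
import torch
import torch.distributed as dist

def allreduce_chunks_prioritizing_a2a(grad, pending_a2a_input=None, num_chunks=8):
    """Allreduce `grad` chunk by chunk; if an MoE all-to-all is pending,
    run it before the next chunk so it never waits for the whole gradient."""
    a2a_output = None
    flat = grad.view(-1)  # chunk over a contiguous 1-D view of the gradient
    for chunk in torch.chunk(flat, num_chunks):
        if pending_a2a_input is not None and a2a_output is None:
            # Opportunistically prioritize the all-to-all between chunks.
            a2a_output = torch.empty_like(pending_a2a_input)
            dist.all_to_all_single(a2a_output, pending_a2a_input)
        # Each chunk is small, so the all-to-all waits for at most one chunk.
        dist.all_reduce(chunk, op=dist.ReduceOp.SUM)  # in-place, updates grad
    return a2a_output
```

The point of the partitioning is that the worst-case time an all-to-all can be blocked drops from the allreduce of the full gradient to roughly the allreduce of a single chunk, which is why the step time improves when the two collectives contend for the same links.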