ECCV 2022 CVinW Workshop Invited Talk: Open-Vocabulary Visual Perception upon Frozen Vision and Language Models (Yin Cui, Google)

Abstract: Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs has become a promising paradigm for open-vocabulary visual perception. In our recent explorations, we developed open-vocabulary models for detection by distilling VLMs on existing detection data (ViLD), and for segmentation by aligning image regions with image captions (OpenSeg). In this talk, I will focus on how to greatly simplify this paradigm by building directly upon frozen VLMs such as CLIP with minimal modifications.

In the first part, I will present our open-vocabulary detection model F-VLM, which achieves state-of-the-art performance on the LVIS benchmark by training only a lightweight detector head. In the second part, I will show how we leverage motion and audio to help video models generalize better to novel classes. Our model MOV encodes video, audio, and optical flow with the same pre-trained CLIP vision encoder (kept frozen for the video branch). We design an asymmetric cross-attention module to aggregate multimodal information. MOV achieves state-of-the-art performance on UCF and HMDB, outperforming both traditional zero-shot methods and recent CLIP-based adaptation methods.
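To make the F-VLM recipe concrete, here is a minimal sketch of the "frozen VLM backbone + lightweight trainable detector head" pattern the abstract describes. It is not the authors' implementation: the backbone is treated as a generic stand-in for a CLIP vision encoder, the pooling and head shapes are illustrative assumptions, and open-vocabulary classification is approximated by cosine similarity between projected region features and per-class text embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenBackboneDetector(nn.Module):
    """Sketch of the F-VLM idea: the VLM image encoder is frozen and only a
    lightweight detector head is trained. `backbone` is any module mapping
    images to a spatial feature map (e.g. a CLIP vision encoder)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze the VLM vision encoder
            p.requires_grad = False
        # Trainable lightweight head: box regression plus a projection of
        # region features into the VLM text-embedding space.
        self.box_head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                      nn.Linear(256, 4))
        self.cls_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        # images: (B, 3, H, W); text_embeds: (C, embed_dim), one embedding per
        # class name, computed offline with the frozen VLM text encoder.
        with torch.no_grad():
            feats = self.backbone(images)          # (B, feat_dim, h, w)
        pooled = feats.mean(dim=(2, 3))            # crude stand-in for region pooling
        boxes = self.box_head(pooled)              # (B, 4) box predictions
        region = F.normalize(self.cls_proj(pooled), dim=-1)
        class_sims = region @ F.normalize(text_embeds, dim=-1).T  # (B, C)
        return boxes, class_sims
```

Because only `box_head` and `cls_proj` carry gradients, an optimizer built over `model.parameters()` with `requires_grad=True` filtering trains just the detector head, which is the point of the simplified paradigm: novel classes are handled at inference time by swapping in new text embeddings rather than retraining the backbone.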
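The second part mentions an asymmetric cross-attention module that aggregates video, audio, and flow features. The sketch below is an assumed reading of that idea, not the MOV code: it supposes that video tokens act as queries and an auxiliary modality (audio or flow tokens from the same encoder) supplies keys and values, so only the video stream is updated; the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn


class AsymmetricCrossAttention(nn.Module):
    """Toy cross-attention block: video tokens attend to tokens from an
    auxiliary modality (audio or optical flow). Asymmetric in the sense that
    only the video stream is updated; the auxiliary stream is read-only."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, video_tokens: torch.Tensor, aux_tokens: torch.Tensor):
        # video_tokens: (B, Tv, dim) from the frozen CLIP vision encoder.
        # aux_tokens:   (B, Ta, dim) audio/flow tokens from the same encoder.
        q = self.norm_q(video_tokens)
        kv = self.norm_kv(aux_tokens)
        fused, _ = self.attn(q, kv, kv)   # video queries attend to audio/flow
        x = video_tokens + fused          # residual update of the video stream only
        return x + self.ffn(x)
```

The aggregated video representation can then be matched against text embeddings of class names, as in the detection sketch above, to obtain open-vocabulary video classification scores.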