#computervision #vision #language #benchmarks #foundationmodels

Title: 【EP1】A Vision-and-Language Approach to Computer Vision in the Wild: Modeling & Benchmark
Speaker: Chunyuan Li (https://chunyuan.li/)

Abstract: The future of AI lies in creating systems, such as foundation models, that are pre-trained once and can then handle countless downstream tasks directly (zero-shot) or adapt to new tasks quickly (few-shot). In this talk, I will focus on our recent research explorations in building such a transferable system in computer vision (CV) that can effortlessly generalize to a wide range of visual recognition tasks in the wild. (1) As research background, I will briefly cover our efforts on modeling. We take a vision-and-language (VL) approach, in which every visual recognition task is reformulated as an image-and-text matching problem. This is exemplified by UniCL [1] / Florence [2] for image classification, GLIP [3] for object detection, and K-LITE [4], which demonstrates a key advantage of reformulating CV as VL: it allows leveraging external knowledge. (2) I will also present the ELEVATER benchmark [5], which evaluates the task-level transfer ability of pre-trained visual models to measure research progress in this direction. It consists of 20 image classification datasets and 35 object detection datasets. Building on this benchmark, we are also organizing an ECCV workshop [6] that aims to bring the community together to collaboratively tackle the challenge of computer vision in the wild.
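To make the VL reformulation concrete, the sketch below shows the matching step shared by models such as UniCL/Florence: an image embedding is compared against one text embedding per class prompt (e.g. "a photo of a dog"), and the highest cosine similarity gives the zero-shot prediction. The embeddings here are toy 4-dimensional vectors, and the function name is a placeholder, not an API from any of the cited models; real systems use learned encoders producing hundreds of dimensions.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Match one image embedding against one text embedding per class.

    This is the image-and-text matching step: after L2-normalization,
    the dot product is cosine similarity, and the best-matching class
    prompt is the zero-shot prediction.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = txt @ img                  # one similarity score per class
    return int(np.argmax(scores)), scores

# Toy example: 3 class prompts ("a photo of a cat/dog/bird") embedded
# into a hypothetical 4-d space.
text_embs = np.array([[1.0, 0.0, 0.0, 0.0],   # "a photo of a cat"
                      [0.0, 1.0, 0.0, 0.0],   # "a photo of a dog"
                      [0.0, 0.0, 1.0, 0.0]])  # "a photo of a bird"
image_emb = np.array([0.1, 0.9, 0.05, 0.0])   # closest to "dog"
pred, scores = zero_shot_classify(image_emb, text_embs)
```

Because classification, detection, and other recognition tasks can all be phrased this way, a single pre-trained model transfers across them without task-specific heads, which is what ELEVATER [5] is designed to measure.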
References:
[1] Unified Contrastive Learning in Image-Text-Label Space. https://arxiv.org/abs/2204.03610
[2] Florence: A New Foundation Model for Computer Vision. https://arxiv.org/abs/2111.11432
[3] Grounded Language-Image Pre-training. https://arxiv.org/abs/2112.03857
[4] K-LITE: Learning Transferable Visual Models with External Knowledge. https://arxiv.org/abs/2204.09222
[5] ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. https://arxiv.org/abs/2204.08790
[6] ECCV Workshop. https://computer-vision-in-the-wild.g...

OUTLINE:
00:00:00 - Intro
00:01:10 - Talk
00:46:10 - Q&A