TimeSformer from scratch: How to use Vision Transformer (ViT) for videos?
Join the pro version to get access to code files, hand-written notes, PDF booklets, Vizuara's certificate and more: https://vizuara.ai/courses/transforme...

In this lecture, we take a deep and careful step into transformers for video understanding, marking the first time in this series that we move beyond text and images and start reasoning about videos as a distinct modality, where time plays a role as important as space. Starting from the familiar Vision Transformer (ViT) framework, this session explains how and why standard image-based attention mechanisms break down when applied naively to videos, and how the TimeSformer (Time-Space Transformer) architecture modifies attention to explicitly model both spatial and temporal information in a principled and computationally efficient way. The lecture is based on the paper "Is Space-Time Attention All You Need for Video Understanding?" and walks you through its core ideas without assuming anything beyond a solid understanding of transformers.

We discuss why a single video frame is often insufficient to classify actions correctly, using intuitive real-world examples like gym exercises, where temporal motion matters more than static appearance, and contrast this with cases where spatial cues alone are enough. From there, we build the theory step by step, explaining spatial attention, temporal attention, joint space-time attention, and divided space-time attention, and why dividing attention across space and time turns out to be both more scalable and more accurate in practice.

A major focus of the lecture is attention complexity: we carefully derive how the number of query-key interactions explodes for joint space-time attention, where each of the N×F space-time tokens attends to every other token, and how divided space-time attention cuts the per-token cost from N×F to a much more manageable N+F. These derivations are done slowly and intuitively, so that the math feels like a natural extension of ideas you already know from standard transformers rather than something intimidating. We also look at empirical evidence from the paper, including FLOPs scaling plots, accuracy comparisons on the Kinetics-400 dataset, and t-SNE visualizations that show why divided space-time attention produces cleaner and more separable video embeddings.

In the second half of the lecture, the focus shifts from theory to practice. We discuss the Kinetics-400 dataset, explain how real-world action recognition datasets are structured, and then move towards a hands-on implementation. You will see how to preprocess videos into frames, how to think about frames as sequences of patch tokens, and finally how to implement a TimeSformer block from scratch in PyTorch, including temporal attention, spatial attention, residual connections, layer normalization, and MLPs. We also discuss practical trade-offs such as frame sampling and model size, and why even small design choices matter a lot when working with video data.

This session is ideal if you already understand Vision Transformers and want to extend that understanding to videos in a clean, theory-backed way, while also seeing how everything maps to real code. By the end of the lecture, you should have a clear mental model of how TimeSformer works, why divided space-time attention is so powerful, and how to implement and experiment with video transformers on your own.
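As a rough illustration of the complexity comparison described above, here is a small back-of-the-envelope calculation. The numbers N = 196 patches per frame (a 14×14 grid) and F = 8 frames are illustrative assumptions, not values taken from the lecture:

```python
# Back-of-the-envelope attention cost, assuming N = 196 patches per frame
# and F = 8 frames (illustrative values, not the lecture's exact settings).
N, F = 196, 8
tokens = N * F                      # total space-time tokens

joint_per_token = N * F             # joint attention: each token attends to every space-time token
divided_per_token = N + F           # divided attention: spatial pass (N) + temporal pass (F)

print(f"space-time tokens:     {tokens}")
print(f"joint, per token:      {joint_per_token}")                 # 1568
print(f"divided, per token:    {divided_per_token}")               # 204
print(f"joint, total pairs:    {tokens * joint_per_token:,}")      # ~2.5 million
print(f"divided, total pairs:  {tokens * divided_per_token:,}")    # ~320 thousand
```

Even at this modest clip length, divided space-time attention needs roughly an order of magnitude fewer query-key comparisons, and the gap widens as more frames are sampled.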
If you are following along as part of the larger Computer Vision from Scratch or Transformers for Vision series, this lecture forms a crucial bridge between image transformers and modern video understanding models.
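For readers who want a concrete reference point before the implementation half of the lecture, below is a minimal sketch of a divided space-time attention block in PyTorch. It is an assumption-laden illustration, not the lecture's actual code: the class name TimeSformerBlock, the (batch, frames, patches, dim) token layout, and the use of torch.nn.MultiheadAttention are all choices made here for brevity.

```python
import torch
import torch.nn as nn

class TimeSformerBlock(nn.Module):
    """Minimal divided space-time attention block (illustrative sketch).

    Input shape: (batch, frames, patches, dim). Temporal attention first mixes
    information across frames at each patch location, then spatial attention
    mixes patches within each frame, followed by an MLP; every sub-layer uses
    pre-layer-norm and a residual connection.
    """
    def __init__(self, dim=768, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        B, F, N, D = x.shape

        # Temporal attention: each patch location attends across the F frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, F, D)
        h = self.norm_t(xt)
        xt = xt + self.attn_t(h, h, h, need_weights=False)[0]
        x = xt.reshape(B, N, F, D).permute(0, 2, 1, 3)

        # Spatial attention: each frame attends across its N patches.
        xs = x.reshape(B * F, N, D)
        h = self.norm_s(xs)
        xs = xs + self.attn_s(h, h, h, need_weights=False)[0]
        x = xs.reshape(B, F, N, D)

        # MLP with residual connection.
        x = x + self.mlp(self.norm_mlp(x))
        return x

# Quick shape check: a batch of 2 clips, 8 frames, 14x14 = 196 patches, dim 768.
block = TimeSformerBlock(dim=768, num_heads=8)
out = block(torch.randn(2, 8, 196, 768))   # output shape: (2, 8, 196, 768)
```

The full TimeSformer additionally prepends a classification token and adds learned positional embeddings over space and time, which this sketch omits to keep the divided-attention structure easy to see.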