In this lecture, we move beyond theory and get into the real mechanics of how a Detection Transformer actually works by coding it almost from scratch, step by step, in a way that is meant to build intuition rather than just copy-paste code. We begin with a quick but important recap of the DETR architecture, especially for those who may not clearly remember how object detection differs from pure classification models like Vision Transformer, DeiT, or Swin Transformer, and then slowly transition into implementation details.

You will see how a CNN backbone like ResNet-50 is used purely for feature extraction, why we deliberately remove the final classification layer, and how high-dimensional convolutional features are projected into a fixed embedding space suitable for transformers. From there, we carefully build the transformer encoder and decoder pipeline, explaining why DETR uses object queries, how self-attention and cross-attention operate inside the decoder, and why the number of decoder outputs depends on the number of object queries rather than the number of image tokens.

A large part of this lecture is devoted to position embeddings, where we simplify the original DETR formulation and implement a practical row-column based positional encoding that preserves spatial structure while remaining easy to understand and code. You will clearly see where positional information is added, why certain tensor reshaping, flattening, permuting, and transposing steps are necessary, and how PyTorch's transformer API expects data to be structured.

We also discuss Hungarian matching in depth, explaining why DETR does not need Non-Maximum Suppression and how an optimal matching between predicted boxes and ground-truth boxes is found during training. The loss function is broken down into a classification loss and a localization loss, including the L1 loss and the Generalized IoU (GIoU) loss, with intuitive geometric explanations of why GIoU is needed.

While we do not train the model from scratch due to the heavy computational cost, we load pretrained DETR weights and focus on clean, correct inference, so that you can see real bounding box predictions on images without waiting hours for training. By the end of this lecture, you will have a complete end-to-end understanding of how to define a Detection Transformer class, load pretrained weights, preprocess images, run inference, scale normalized bounding box outputs back to image coordinates, and finally visualize predicted boxes with class labels and confidence scores. This session is ideal if you want to truly understand DETR at an architectural and implementation level, and not just treat it as a black box.
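The short sketches below are not the lecture's actual code; they are minimal illustrations of the main steps described above, written against standard PyTorch and torchvision, and all dimensions, weights, and file names in them are assumptions. First, the backbone: ResNet-50 with its average-pool and fully connected classification head removed, followed by a 1x1 convolution that projects the 2048-channel feature map down to the transformer embedding size.

```python
# Minimal sketch (not the official DETR code), assuming a recent torchvision:
# ResNet-50 as a pure feature extractor plus a 1x1 projection to the embedding dim.
import torch
import torch.nn as nn
import torchvision

# Keep everything up to, but excluding, the average-pool and fc classification head.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = nn.Sequential(*list(resnet.children())[:-2])

embed_dim = 256                        # hidden size fed to the transformer (assumed)
project = nn.Conv2d(2048, embed_dim, kernel_size=1)

images = torch.randn(2, 3, 640, 480)   # dummy batch of images
features = backbone(images)            # (2, 2048, 20, 15): stride-32 feature map
tokens = project(features)             # (2, 256, 20, 15): transformer-ready channels
print(tokens.shape)
```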
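Next, a simplified row/column positional encoding in the spirit of what the lecture describes. Using learned row and column embeddings concatenated per cell is an assumption here rather than the exact formulation, and the snippet also shows the flatten-and-permute step that puts the feature map into the (sequence, batch, dim) layout that PyTorch's nn.Transformer expects by default.

```python
# Sketch of a simplified row/column positional encoding (an assumed variant):
# learned row and column embeddings are concatenated and added to flattened tokens.
import torch
import torch.nn as nn

embed_dim = 256
row_embed = nn.Embedding(50, embed_dim // 2)    # up to 50 feature-map rows
col_embed = nn.Embedding(50, embed_dim // 2)    # up to 50 feature-map columns

def positional_encoding(h, w):
    rows = row_embed(torch.arange(h))                     # (h, 128)
    cols = col_embed(torch.arange(w))                     # (w, 128)
    # Broadcast so every (row, col) cell gets a concatenated 256-d vector.
    pos = torch.cat([
        rows.unsqueeze(1).expand(h, w, -1),
        cols.unsqueeze(0).expand(h, w, -1),
    ], dim=-1)                                            # (h, w, 256)
    return pos.flatten(0, 1)                              # (h*w, 256)

tokens = torch.randn(2, embed_dim, 20, 15)                # (batch, C, H, W) from the backbone
src = tokens.flatten(2).permute(2, 0, 1)                  # (H*W, batch, C) = (300, 2, 256)
src = src + positional_encoding(20, 15).unsqueeze(1)      # add one position per token
print(src.shape)
```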
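The encoder-decoder step with learned object queries can be sketched with PyTorch's nn.Transformer; the query count of 100 and the COCO-style class count are assumptions, and the real DETR transformer re-injects positional information at every layer rather than once at the input. The key point is visible in the shapes: the decoder output has one row per object query, not per image token, which is why DETR always predicts a fixed number of boxes.

```python
# Sketch of the encoder-decoder step, assuming PyTorch's nn.Transformer with its
# default (sequence, batch, dim) layout and 100 learned object queries.
import torch
import torch.nn as nn

embed_dim, num_queries, num_classes = 256, 100, 91

transformer = nn.Transformer(d_model=embed_dim, nhead=8,
                             num_encoder_layers=6, num_decoder_layers=6)
query_embed = nn.Parameter(torch.randn(num_queries, embed_dim))

class_head = nn.Linear(embed_dim, num_classes + 1)   # +1 for the "no object" class
bbox_head = nn.Linear(embed_dim, 4)                  # (cx, cy, w, h), normalized

src = torch.randn(300, 2, embed_dim)                 # flattened image tokens + positions
tgt = query_embed.unsqueeze(1).expand(-1, 2, -1)     # (100, batch, 256) object queries

hs = transformer(src, tgt)                           # (100, 2, 256): one output per query
logits = class_head(hs)                              # (100, 2, 92) class scores
boxes = bbox_head(hs).sigmoid()                      # (100, 2, 4), normalized to [0, 1]
print(logits.shape, boxes.shape)
```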
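Hungarian matching can be sketched with scipy's linear_sum_assignment. The cost matrix below mixes a classification term and an L1 box term with a made-up weight; the real DETR matching cost also includes a GIoU term. Because every ground-truth box is assigned to exactly one query, duplicate predictions are penalized during training, which is why no Non-Maximum Suppression is needed at inference.

```python
# Sketch of Hungarian matching between predictions and ground truth, assuming a
# simplified cost of classification probability plus weighted L1 box distance.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (num_queries, num_classes + 1), pred_boxes: (num_queries, 4)
    # gt_labels: (num_gt,), gt_boxes: (num_gt, 4)
    prob = pred_logits.softmax(-1)                       # class probabilities
    cost_class = -prob[:, gt_labels]                     # (num_queries, num_gt)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)   # pairwise L1 distance
    cost = cost_class + 5.0 * cost_bbox                  # assumed weighting
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return row, col                                      # matched (query idx, gt idx) pairs

pred_logits = torch.randn(100, 92)
pred_boxes = torch.rand(100, 4)
gt_labels = torch.tensor([3, 17])
gt_boxes = torch.rand(2, 4)
print(hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes))
```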
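For the localization side, here is a sketch of Generalized IoU for boxes in (x1, y1, x2, y2) format. Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU subtracts the fraction of the smallest enclosing box not covered by the union, so it still provides a useful training signal when boxes are far apart; on matched pairs the localization loss is typically an L1 term plus 1 - GIoU.

```python
# Sketch of Generalized IoU between matched box pairs in xyxy format.
import torch

def generalized_iou(boxes1, boxes2):
    # boxes1, boxes2: (N, 4) matched pairs
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])

    # Intersection and union
    lt = torch.max(boxes1[:, :2], boxes2[:, :2])
    rb = torch.min(boxes1[:, 2:], boxes2[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    union = area1 + area2 - inter
    iou = inter / union

    # Smallest enclosing box
    lt_c = torch.min(boxes1[:, :2], boxes2[:, :2])
    rb_c = torch.max(boxes1[:, 2:], boxes2[:, 2:])
    enclose = (rb_c - lt_c).clamp(min=0).prod(dim=1)

    return iou - (enclose - union) / enclose

a = torch.tensor([[0.0, 0.0, 2.0, 2.0]])
b = torch.tensor([[1.0, 1.0, 3.0, 3.0]])
print(generalized_iou(a, b))   # IoU = 1/7 ≈ 0.143, GIoU ≈ -0.079 due to the empty enclosing area
```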
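Finally, inference with pretrained weights. The sketch below loads the detr_resnet50 entry published on torch.hub by facebookresearch/detr (so it needs network access); the input file name, resize value, and confidence threshold are assumptions. Predicted boxes come out as normalized (cx, cy, w, h) and are rescaled to pixel (x1, y1, x2, y2) coordinates before printing or drawing.

```python
# Sketch of inference with pretrained DETR weights from torch.hub; 'example.jpg'
# is a hypothetical input image.
import torch
from PIL import Image
import torchvision.transforms as T

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open('example.jpg').convert('RGB')
w, h = image.size

with torch.no_grad():
    out = model(transform(image).unsqueeze(0))

probs = out['pred_logits'].softmax(-1)[0, :, :-1]     # drop the "no object" column
keep = probs.max(-1).values > 0.9                     # assumed confidence threshold

# Convert normalized (cx, cy, w, h) to pixel (x1, y1, x2, y2).
cx, cy, bw, bh = out['pred_boxes'][0, keep].unbind(-1)
boxes = torch.stack([(cx - bw / 2) * w, (cy - bh / 2) * h,
                     (cx + bw / 2) * w, (cy + bh / 2) * h], dim=-1)

for p, box in zip(probs[keep], boxes):
    print(int(p.argmax()), float(p.max()), box.tolist())  # class id, score, box
```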