Build From Scratch Series - Multi-modal Models, Simply Explained
What happens when you stitch a text brain and a vision brain together into one Frankenstein-like AI? You get a multimodal model — and in this episode, we build one from scratch. In Article 4 of our "Building From Scratch" series, we break down exactly how AI learns to see AND read at the same time. We're talking shared embedding spaces, contrastive learning (a.k.a. Tinder for data), vision transformers that chop images into patches like lasagna, projection layers that act as universal adapter plugs, and cross-attention mechanisms where two AI brains literally phone each other for help.

🔥 Key topics covered in this episode:
• What "multimodal" actually means — and why separate AI senses aren't enough
• How vectors become the universal language connecting text, images, and beyond
• Vision Transformers (ViT) — how AI reads a photo of a dog like a weird sentence
• The 7-step pipeline for building a multimodal model from the ground up
• Zero-shot classification — how a model identifies a platypus it's never seen
• Cross-attention fusion — the secret sauce behind visual question answering
• Why this tech is both thrilling and a little terrifying

Whether you're an AI enthusiast, a machine learning student, or just curious about how ChatGPT-style models understand images, this episode makes the complex feel approachable with wild analogies, live role-play, and zero jargon left unexplained. (Code-curious? Minimal sketches of the contrastive recipe and the zero-shot trick are at the bottom of this description.)

👉 New here? Start with Article 1 (Text Transformers) and Article 3 (Vision Models) to get the full picture — or jump right in, we've got you covered.

💬 Drop a comment: What modality should AI learn next — audio, video, or touch?

🔔 Subscribe and hit the bell so you never miss an episode!

#MultimodalAI #VisionTransformer #BuildFromScratch #MachineLearning #CLIP #DeepLearning #AIExplained #TransformerModels

📑 Chapters:
0:00 Welcome to The Bearded AI Guy
0:44 Giving the Machine Eyes — Why This Episode Matters
1:55 Brain Surgery: Stitching Two AI Brains Together
2:33 What Does Multimodal Actually Mean?
3:37 The Library vs. The Camera — Why Silos Fail
4:56 The Platonic Ideal of a Car 🚗
5:38 The Universal Translator Room
6:26 It's Always Vectors — The Lifeblood of AI
7:11 The Shared Embedding Space Explained
8:19 CLIP — The Proof It Actually Works
9:11 Contrastive Learning = Tinder for Data 🔥
10:42 The 7-Step Build Guide Begins
12:43 Vision Encoder — Chopping Images Like Lasagna
15:56 The Projection Layer — The Adapter Plug
18:27 Training: Freeze the Big Brains
19:21 Zero-Shot Magic — The Platypus Test
21:05 Fusion & Cross-Attention Role Play 🎭
25:47 Fine-Tuning Without Breaking Everything
28:05 Full Pipeline Recap — We Built It!
29:30 Beyond Vision: Adding ALL the Senses
31:27 Wrapping Up — Will It Scale?

Tags: multimodal AI, vision transformer, building AI from scratch, CLIP model explained, contrastive learning, shared embedding space, cross attention mechanism, how AI sees images, multimodal machine learning, vision language model, ViT explained, zero shot classification, AI for beginners, transformer architecture, projection layer AI, the bearded AI guy, building from scratch series, how multimodal models work, visual question answering, AI tutorial
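For the code-curious, here is a minimal sketch of the contrastive recipe described above: two frozen encoders feed small trainable projection layers (the "adapter plugs"), and a CLIP-style contrastive loss pulls matching image/text pairs together in a shared embedding space. The ProjectionHead class, the encoder output dimensions (768 and 384), and the shared dimension of 512 are illustrative assumptions, not the exact setup from the episode.

```python
# A minimal sketch of the CLIP-style recipe: frozen encoders, small trainable
# projection layers, and a contrastive loss over a shared embedding space.
# Dimensions and class names here are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """The 'adapter plug': maps an encoder's output into the shared space."""
    def __init__(self, in_dim: int, shared_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, shared_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize so similarity is a simple dot product (cosine similarity).
        return F.normalize(self.proj(x), dim=-1)

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """'Tinder for data': matching pairs score high, mismatched pairs low."""
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(logits))            # diagonal entries are matches
    # Symmetric cross-entropy: match images to texts AND texts to images.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: pretend these came from a frozen ViT and a frozen text encoder.
img_features = torch.randn(8, 768)   # assumed vision encoder output dim
txt_features = torch.randn(8, 384)   # assumed text encoder output dim

img_head = ProjectionHead(768)
txt_head = ProjectionHead(384)
loss = contrastive_loss(img_head(img_features), txt_head(txt_features))
print(f"contrastive loss: {loss.item():.3f}")
```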
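And a rough sketch of the zero-shot "platypus test": embed each candidate label as text, embed the image, and pick the label whose embedding sits closest in the shared space. The zero_shot_classify helper and the random stand-in embeddings are hypothetical; with a real pretrained model such as CLIP, the embeddings would come from its text and vision encoders.

```python
# Zero-shot classification sketch: no platypus training examples needed,
# just a text embedding for each candidate label. Embeddings are random
# stand-ins here; in practice they come from the trained encoders above.
import torch
import torch.nn.functional as F

def zero_shot_classify(img_emb: torch.Tensor,
                       label_embs: torch.Tensor,
                       labels: list[str]) -> str:
    """img_emb: (dim,), label_embs: (num_labels, dim); both L2-normalized."""
    sims = label_embs @ img_emb          # one cosine similarity per label
    return labels[int(sims.argmax())]

labels = ["a photo of a dog", "a photo of a cat", "a photo of a platypus"]
label_embs = F.normalize(torch.randn(len(labels), 512), dim=-1)
img_emb = F.normalize(torch.randn(512), dim=-1)
print(zero_shot_classify(img_emb, label_embs, labels))
```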