Seeing Is Believing: Training an Open Source Grounded OCR VLM – GutenOCR
Traditional OCR engines do output locations, but they're often brittle across domains, layouts, and languages, requiring fragile post-processing to stay accurate. Vision-language models (VLMs) promise far better flexibility and transfer, yet many current OCR VLMs still falter on provenance: they can "read," but can't consistently show where a value came from without generating long, costly page-wide outputs.

In this session, Hunter Heidenreich, Ben Elliott, and Yosheb Getachew show how a compact, open VLM can deliver reliable line- and word-level grounding, answering "what does it say here?" and "where is X?" with precise boxes and reproducible behavior. They walk through the end-to-end recipe behind GutenOCR (a fine-tune of Qwen2.5-VL): the data and synthetic grounding signals, the prompting/system-prompt design that enforces strict output formats, the training stack and hardware profile, and how they evaluate reading, detection, and grounding (not just text accuracy). Expect candid lessons on multi-column layouts and complex tables, plus open code (including their vLLM eval harness) so you can reproduce results or adapt the approach.
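The talk doesn't spell out GutenOCR's exact output schema, but a "strict output format" for grounded OCR typically means machine-checkable records of text plus a box. As a minimal sketch, assuming a hypothetical JSON-lines format with `text` and `bbox` fields (not the project's actual spec), a validating parser might look like:

```python
import json

def parse_grounded_output(raw: str) -> list[dict]:
    """Parse one JSON record per line, e.g.
    {"text": "...", "bbox": [x1, y1, x2, y2]},
    rejecting anything that deviates from the schema."""
    records = []
    for line in raw.strip().splitlines():
        rec = json.loads(line)  # raises on malformed JSON
        if set(rec) != {"text", "bbox"}:
            raise ValueError(f"unexpected keys: {sorted(rec)}")
        x1, y1, x2, y2 = rec["bbox"]  # raises if not exactly 4 values
        if not (x1 <= x2 and y1 <= y2):
            raise ValueError(f"degenerate box: {rec['bbox']}")
        records.append(rec)
    return records

raw = (
    '{"text": "Invoice #1234", "bbox": [40, 32, 310, 58]}\n'
    '{"text": "Total: $98.00", "bbox": [40, 540, 250, 566]}'
)
print(parse_grounded_output(raw))
```

Failing fast on any deviation, rather than repairing free-form text, is what makes a constrained format worth enforcing at the system-prompt level.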
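The abstract also doesn't specify the grounding metric; a common sketch for "did the model point at the right place?" is intersection-over-union (IoU) matching between predicted and gold boxes, shown here with a simple greedy one-to-one matcher (an illustrative baseline, not the authors' eval harness):

```python
def iou(a: list[float], b: list[float]) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def grounding_accuracy(preds, golds, thresh=0.5):
    """Fraction of gold boxes matched by some prediction with
    IoU >= thresh. Greedy matching keeps the sketch short; a real
    harness would match optimally (e.g. Hungarian assignment)."""
    matched, used = 0, set()
    for g in golds:
        best, best_i = 0.0, None
        for i, p in enumerate(preds):
            if i in used:
                continue
            v = iou(p, g)
            if v > best:
                best, best_i = v, i
        if best_i is not None and best >= thresh:
            used.add(best_i)
            matched += 1
    return matched / len(golds) if golds else 1.0
```

Reporting box-level accuracy like this, alongside text accuracy, is what separates evaluating *grounding* from merely evaluating *reading*.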