Goodfire AI’s Bet: Interpretability as the Next Frontier of Model Design — Myra Deng & Mark Bissell

3 weeks ago

From Palantir and Two Sigma to building Goodfire into the poster child for actionable mechanistic interpretability, Mark Bissell (Member of Technical Staff) and Myra Deng (Head of Product) are trying to turn “peeking inside the model” into a repeatable production workflow by shipping APIs, landing real enterprise deployments, and now scaling the bet with a recent $150M Series B funding round at a $1.25B valuation. (https://www.goodfire.ai/blog/our-seri...)

In this episode, we go far beyond the usual “SAEs are cool” take. We talk about Goodfire’s core bet: that the AI lifecycle is still fundamentally broken because the only reliable control we have is data, and we post-train, RLHF, and fine-tune by “slurping supervision through a straw,” hoping the model picks up the right behaviors while quietly absorbing the wrong ones. Goodfire’s answer is to build a bi-directional interface between humans and models: read what’s happening inside, edit it surgically, and eventually use interpretability during training so customization isn’t just brute-force guesswork. (https://www.goodfire.ai/blog/on-optim...)
We discuss:
• Myra + Mark’s path: Palantir (health systems, forward-deployed engineering) → Goodfire early team; Two Sigma → Head of Product, translating frontier interpretability research into a platform and real-world deployments
• What “interpretability” actually means in practice: not just post-hoc poking, but a broader “science of deep learning” approach across the full AI lifecycle (data curation → post-training → internal representations → model design)
• Why post-training is the first big wedge: “surgical edits” for unintended behaviors like reward hacking, sycophancy, and noise learned during customization, plus the dream of targeted unlearning and bias removal without wrecking capabilities
• SAEs vs probes in the real world: why SAE feature spaces sometimes underperform classifiers trained on raw activations for downstream detection tasks (hallucination, harmful intent, PII), and what that implies about “clean concept spaces”
• Rakuten in production (https://www.goodfire.ai/research/raku...): deploying interpretability-based token-level PII detection at inference time to prevent routing private data to downstream providers, plus the gnarly constraints: no training on real customer PII, synthetic→real transfer, English + Japanese, and tokenization quirks
• Real-time steering at frontier scale: a demo of steering Kimi K2 (~1T params) live and finding features via SAE pipelines, auto-labeling via LLMs, and toggling a “Gen-Z slang” feature across multiple layers without breaking tool use
• Hallucinations as an internal signal: the case that models have latent uncertainty / “user-pleasing” circuitry you can detect and potentially mitigate more directly than black-box methods
• Steering vs prompting (https://www.goodfire.ai/blog/feature-...): the emerging view that activation steering and in-context learning are more closely connected than people think, including work mapping between the two (even for jailbreak-style behaviors)
• Interpretability for science: using the same tooling across domains (genomics, medical imaging, materials) to debug spurious correlations and extract new knowledge, up to and including early biomarker discovery work with major partners

Goodfire AI
• Website: https://goodfire.ai
• LinkedIn: / goodfire-ai
• X: https://x.com/GoodfireAI

Myra Deng
• Website: https://myradeng.com/
• LinkedIn: / myra-deng
• X: https://x.com/myra_deng

Mark Bissell
• LinkedIn: / mark-bissell
• X: https://x.com/MarkMBissell

00:00 Introduction
00:45 Welcome + episode setup + intro to Goodfire
02:16 Fundraise news + what’s changed recently
02:44 Guest backgrounds + what they do day-to-day
05:52 “What is interpretability?” (SAEs, probing, steering and quick map of the space)
08:29 Post-training failures (sycophancy/reward hacking) + using interp to guide learning
10:26 Surgical edits: bias vectors + grokking/double descent + subliminal learning
14:04 How Goodfire decides what to work on (customers → research agenda)
16:58 SAEs vs probes: what works better for real-world detection tasks
19:04 Rakuten case study: production PII monitoring + multilingual + token-level scrubbing
22:06 Live steering demo on a 1T-parameter model (and scaling challenges)
25:29 Feature labeling + auto-interpretation + can we “turn down” hallucinations?
31:03 Steering vs prompting equivalence + jailbreak math + customization implications
38:36 Open problems + how to get started in mech interp
46:29 Applications: healthcare + scientific discovery (biomarkers, Mayo Clinic, etc.)
57:10 Induction + sci-fi intuition (Ted Chiang)
