Creating Models Worth Interpreting
This is a talk I gave to my MATS 9.0 training scholars about promising research areas in mech interp. If this kind of research sounds interesting to you, apply to do research with me in MATS! Due 23 Dec: tinyurl.com/neel-mats-app

We discuss research that's enabled by making model organisms: models designed to have interesting, safety-relevant properties that we can practice on. In particular, we cover making models with hidden goals, so we can practice eliciting secrets, and making models that behave differently when being tested (eval awareness), so we can practice suppressing that behavior.

0:00:00 Auditing Hidden Goals
0:02:24 Creating Model Organisms
0:07:50 Beliefs vs Role-Playing
0:10:55 The Auditing Game Results
0:14:46 Critiquing the Setup
0:17:58 The Power of Black Box Methods
0:20:50 The Value of Model Organisms
0:23:37 Case Study: Secret Knowledge
0:28:03 Overfitting to Organisms
0:32:26 Traces of Narrow Fine-Tuning
0:37:16 Why Fine-Tuning Leaves Traces
0:47:32 Agents and a Warning
0:50:05 Suppressing Eval Awareness
0:58:21 Real-World Eval Awareness
1:02:36 Amplifying Subtle Biases
1:06:02 Black Box vs Steering Interventions
1:09:28 Q&A: Debugging & Agents
1:13:41 Q&A: Editing Chain-of-Thought
1:20:30 Q&A: Probing for Role-Playing
1:25:26 Q&A: Reward Hacking
1:37:49 Q&A: Research Strategy & Advice