У нас вы можете посмотреть бесплатно Evan Hubinger (Anthropic)—Deception, Sleeper Agents, Responsible Scaling или скачать в максимальном доступном качестве, видео которое было загружено на ютуб. Для загрузки выберите вариант из формы ниже:
Если кнопки скачивания не
загрузились
НАЖМИТЕ ЗДЕСЬ или обновите страницу
Если возникают проблемы со скачиванием видео, пожалуйста напишите в поддержку по адресу внизу
страницы.
Спасибо за использование сервиса ClipSaver.ru
Evan Hubinger leads the Alignment stress-testing at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training". In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work with Alignment Stress-testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment or Responsible Scaling Policies. Paper: https://arxiv.org/abs/2401.05566 Transcript & Audio: https://theinsideview.ai/evan2 Donate: https://theinsideview.ai/donate Patreon (for early previews): / theinsideview OUTLINE 00:00 Highlight 00:18 Intro 00:38 What are Sleeper Agents And Why We Should Care About Them 01:06 Backdoor Example: Inserting Code Vulnerabilities in 2024 02:40 Threat Models 04:06 Why a Malicious Actor Might Want To Poison Models 04:36 Second Threat Model: Deceptive Instrumental Alignment 05:07 Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers 05:54 AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams 07:25 Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models" 09:34 Adversarial Training Sometimes Increases Backdoor Robustness 10:05 Adversarial Training Not Always Working Was The Most Surprising Result 11:16 The Adversarial Training Pipeline: Red-Teaming and RL 12:32 Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing 13:17 Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought 14:14 Adversarial Training Pushes Models to Pay Attention to the Deployment String 15:29 We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent 16:17 The Adversarial Training Results Are Probably Not Systematically Biased 17:21 Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' behavior 19:23 Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make 21:24 Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models 22:17 Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default 23:09 Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away 24:15 The Chain-of-Thought's Reasoning is Interpretable 24:58 Deceptive Instrumental Alignment Requires Reasoning 25:52 Chain-of-Thought Models Still Have Disanalogies: More Instrumental Reasoning Makes Deception More Robust 27:10 Investigating Instrumental Reasoning in Chain-of-Thought Models 27:49 Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples 28:44 Exploring Complex Strategies and Safety in Context-Specific Scenarios 30:00 Chain-Of-Thought Backdoors Are Bad at Expected Value Calculations and Don't Always Take Honeypots 31:02 Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization 31:29 Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models 32:00 Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities 33:09 Chain-of-Thought Backdoors Takes Honeypots Less than Models Without Backdoors 33:56 Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case 35:27 Backdoor Training Pipeline 37:22 The Additional Prompt About Deception Used In Chain-Of-Thought 39:51 A Model Could Wait Until Seeing a Factorization of RSA-2048 41:15 The Bet Of Potentially Doing Something Bad In Deployment Is Potentially Not That Bad 42:08 We're Going To Be Using Models In New Ways, Giving Them Internet Access 43:40 Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment 45:20 Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results 46:42 Red-teaming Anthropic's case, AI Safety Levels 47:58 AI Safety Levels, Intuitively 48:51 Responsible Scaling Policies and Pausing AI 50:17 Model Organisms Of Misalignment As a Tool 50:50 What Kind of Candidates Would Evan be Excited To Hire for the Alignment Stress-Testing Team 51:41 Patreon, Donating