See Part I for an intro to steering vectors: • Steering vectors: tailor LLMs without trai... . Code from this video: https://github.com/abrvkh/explainabil...

State-of-the-art foundation models are often treated as black boxes: we send in a prompt and get back an (often useful) answer, but what happens inside the system as the prompt is processed remains a bit of a mystery, and our ability to control or steer that processing in specific directions is limited. Enter steering vectors! By computing a vector that represents a particular feature or concept, we can steer the model to include any property we want in its output: add more love to its answers, make it answer every prompt (even harmful ones!), or make it unable to stop talking about the Golden Gate Bridge. In this video we code up a steering-vector setup fully from scratch and use it to find refusal and hate-love directions.

Disclaimer: the ease with which refusals can be removed (i.e. making the model answer even harmful prompts) is a significant weakness (instability) of these models.

Further reading & references I used:
Activation addition: https://arxiv.org/abs/2308.10248
Refusal directions: https://www.alignmentforum.org/posts/... and https://huggingface.co/posts/mlabonne...
Golden Gate Claude: https://www.anthropic.com/news/golden...
Superposition: https://transformer-circuits.pub/2022...
Sparse autoencoders: https://arxiv.org/pdf/2406.04093v1
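The core recipe from the video can be sketched in a few lines: capture residual-stream activations on two contrastive prompt sets, take the difference of their means as the steering vector, and add it back into a layer's activations during the forward pass. The sketch below uses a toy `nn.Linear` as a stand-in for a transformer block and random tensors as stand-ins for captured activations; the layer names, prompt sets, and steering strength `alpha` are all illustrative assumptions, not the exact code from the repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8  # toy hidden size; real models use thousands of dimensions

# Stand-in for one transformer block. With a Hugging Face model you would
# instead hook a real layer, e.g. one of model.model.layers[k].
block = nn.Linear(d_model, d_model)

# Stand-ins for residual-stream activations captured at layer k on two
# contrastive prompt sets (e.g. "love" vs. "hate" completions).
acts_pos = torch.randn(16, d_model) + 1.0
acts_neg = torch.randn(16, d_model) - 1.0

# Steering vector = difference of the mean activations, optionally normalised.
steer = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
steer = steer / steer.norm()

alpha = 4.0  # steering strength; a hyperparameter to tune by eye

def add_steering(module, inputs):
    # Forward pre-hook: shift the block's input activations along the
    # steering direction before the block processes them.
    (h,) = inputs
    return (h + alpha * steer,)

handle = block.register_forward_pre_hook(add_steering)
out = block(torch.randn(1, d_model))  # steered forward pass
handle.remove()  # detach the hook once you are done steering
```

Subtracting (rather than adding) the vector, or projecting it out of the activations, is the analogous move for removing a direction such as refusal.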