Large Language Models (LLMs) can memorize parts of their training data. This has been a key point in several lawsuits filed against AI companies, with claims of copyright infringement tied to how training data is collected and used. But how much do these models actually memorize? And is their use of prior data really so different from how people take inspiration from books, music, or art to create new things? In this video, I show how training data can sometimes be extracted from LLMs and discuss what this means for the way we think about originality, learning, and intellectual property.

00:00 Introduction
02:43 Fair Use for AI
03:40 Examples of memorization
07:20 Extracting training data
16:44 What do models memorize?
26:40 Aligning models
31:42 Non-adversarial memorization
34:51 Towards resolutions

Main references:
Extracting Training Data from Large Language Models: https://arxiv.org/pdf/2012.07805
Quantifying Memorization Across Neural Language Models: https://arxiv.org/pdf/2202.07646
Scalable Extraction of Training Data from (Production) Language Models: https://arxiv.org/pdf/2311.17035 and https://spylab.ai/blog/training-data-...
Measuring Non-Adversarial Reproduction of Training Data in Large Language Models: https://arxiv.org/pdf/2411.10242 and https://spylab.ai/blog/non-adversaria...
Deduplicating Training Data Makes Language Models Better: https://arxiv.org/pdf/2107.06499

Fair use and lawsuits:
NYT lawsuit: https://www.theverge.com/2023/12/27/2...
Fair use discussion by Suchir Balaji: https://suchir.net/fair_use.html
Meta lawsuit: https://www.wired.com/story/matthew-b...
LibGen story by Alex Reisner: https://www.theatlantic.com/technolog...
Note from Nicholas Carlini: https://nicholas.carlini.com/writing/...
Other references for the motivated reader :)
Driven by Compression Progress: https://arxiv.org/abs/0812.4360
Rethinking LLM Memorization through the Lens of Adversarial Compression: https://arxiv.org/pdf/2404.15146
Language Modeling Is Compression: https://arxiv.org/abs/2309.10668
Language Models May Verbatim Complete Text They Were Not Explicitly Trained On: https://arxiv.org/pdf/2503.17514
Talkin' 'Bout AI Generation: https://arxiv.org/pdf/2309.08133
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4: https://arxiv.org/pdf/2305.00118
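The extraction papers referenced above generally follow the same recipe: sample many generations from a model, then flag any output that overlaps verbatim with known training text over a sufficiently long span. Here is a minimal sketch of that verbatim-overlap check (toy helper names of my own, not code from any of the papers; real pipelines match tokenized n-grams against a deduplicated index rather than raw substrings):

```python
def longest_verbatim_span(generated: str, corpus: str, n: int = 50) -> str:
    """Return the first n-character window of `generated` that appears
    verbatim in `corpus`, or "" if no window matches.

    A long verbatim match (e.g. 50+ characters) is the usual signal
    that a generation reproduces memorized training data rather than
    coinciding by chance.
    """
    for i in range(len(generated) - n + 1):
        window = generated[i:i + n]
        if window in corpus:
            return window
    return ""


def looks_memorized(generated: str, corpus: str, n: int = 50) -> bool:
    """True if `generated` shares at least one n-character span with `corpus`."""
    return longest_verbatim_span(generated, corpus, n) != ""
```

In an actual attack, `generated` would come from prompting the model (with short random prefixes, or with the divergence-inducing prompts of the "Scalable Extraction" paper) and `corpus` would be a large public crawl standing in for the unknown training set; the threshold `n` trades off false positives against missed matches.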