Deep Dive: Quantizing Large Language Models, part 1
Quantization is an excellent technique for compressing Large Language Models (LLMs) and accelerating their inference. In this video, we discuss model quantization: we first introduce what it is and build an intuition for rescaling and the problems it creates. We then introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch, ZeroQuant, and bitsandbytes (minimal code sketches follow the chapter list below). In part 2 • Deep Dive: Quantizing Large Language ... , we look at and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library based on Intel Neural Compressor and Intel OpenVINO.

Slides: https://fr.slideshare.net/slideshow/j...

⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️

00:00 Introduction
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
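The mapping function covered at 08:17 is typically affine (asymmetric) quantization: a float value x is mapped to an int8 code q via x ≈ scale * (q - zero_point). Here is a minimal NumPy sketch of that idea; the helper names quantize_int8 and dequantize are illustrative, not from the video:

```python
import numpy as np

def quantize_int8(x, qmin=-128, qmax=127):
    """Affine (asymmetric) int8 quantization: x ~= scale * (q - zero_point)."""
    x_min, x_max = float(x.min()), float(x.max())
    # The input range [x_min, x_max] sets the scale (12:38). A single large
    # outlier inflates this range, wasting int8 codes -- the problem at 16:15.
    scale = (x_max - x_min) / (qmax - qmin)        # width of one int8 step
    zero_point = int(round(qmin - x_min / scale))  # int8 code representing 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
print(np.abs(weights - dequantize(q, scale, zp)).max())  # quantization error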
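For dynamic post-training quantization with PyTorch (26:00), the standard entry point is torch.quantization.quantize_dynamic, which quantizes weights to int8 ahead of time and quantizes activations on the fly at inference. A minimal sketch, using a toy model as a stand-in for an LLM block:

```python
import torch
import torch.nn as nn

# Toy model; dynamic quantization targets nn.Linear (and nn.LSTM) layers.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)
```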
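bitsandbytes (34:50) is most often used through the Hugging Face Transformers integration. A minimal sketch of loading a model with its 8-bit LLM.int8() scheme, assuming bitsandbytes is installed and a CUDA GPU is available; the checkpoint name is just an example:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_8bit=True)  # LLM.int8() mixed-precision scheme
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",          # example checkpoint, any causal LM works
    quantization_config=config,
    device_map="auto",            # int8 kernels run on GPU
)
```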