ONNX Community Meetup 2023: INT8 Quantization for Large Language Models with Intel Neural Compressor
The explosive growth of large language models (LLMs) has enabled breakthroughs in fields like text analysis, language translation, and chatbot technologies. However, deploying LLMs is a formidable challenge because of their enormous parameter counts (e.g., over 700 GB of memory is required to run the BLOOM-176B model in FP32), which make them impractical to run on commodity hardware. Users therefore have an ongoing demand for compression methods that reduce an LLM's memory footprint while maintaining comparable accuracy, a setting where general quantization recipes may not work. To compress LLMs with reasonable accuracy, Intel® Neural Compressor integrates and enhances the SmoothQuant algorithm, which addresses the compression challenge by compensating for the accuracy loss introduced by activation quantization. Our team has validated the efficacy of this solution on numerous LLMs, such as GPT-J, LLaMA, and BLOOM, achieving promising latency on Intel hardware. Furthermore, Intel® Neural Compressor closes the gap in exporting INT8 PyTorch models to ONNX format, making these models ready for production deployment. We continue to upload ONNX models to the ONNX Model Zoo and the Hugging Face Hub (e.g., GPT-J and Whisper-large), contributing them back to the ONNX community.
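As a rough illustration of the quantization workflow described above, here is a minimal sketch of applying Intel Neural Compressor's post-training SmoothQuant recipe to a Hugging Face causal LM, following the 2.x-style API. The model name, calibration text, and one-sample dataloader are placeholders for illustration (a real run would calibrate on a few hundred representative samples), and exact configuration fields may differ across Neural Compressor releases.

```python
# Hedged sketch: post-training INT8 quantization with the SmoothQuant recipe
# via Intel Neural Compressor (2.x-style API). The model name and calibration
# data below are illustrative assumptions, not part of the original talk.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer
from neural_compressor import PostTrainingQuantConfig, quantization

model_name = "EleutherAI/gpt-j-6b"  # assumption: any HF causal LM could stand in
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torchscript=True)
model.eval()

# Toy one-sample calibration set; real runs calibrate on many more samples.
ids = tokenizer("Intel Neural Compressor compresses LLMs.", return_tensors="pt")
calib_dataloader = DataLoader([(ids["input_ids"][0], 0)], batch_size=1)

# SmoothQuant migrates quantization difficulty from activations (which have
# large outliers in LLMs) to weights; alpha controls that migration strength,
# with 0.5 as a common starting point.
conf = PostTrainingQuantConfig(
    recipes={"smooth_quant": True, "smooth_quant_args": {"alpha": 0.5}}
)
q_model = quantization.fit(model, conf, calib_dataloader=calib_dataloader)
q_model.save("./gptj-int8")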
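The abstract also mentions closing the INT8 PyTorch-to-ONNX export gap. The sketch below shows that step using Neural Compressor's export API with a Torch2ONNXConfig, as documented for INC 2.x; the dummy input shape, tensor names, opset version, and output filename are assumptions for illustration, and q_model is the quantized model from the previous sketch.

```python
# Hedged sketch: export the quantized PyTorch model (q_model from the previous
# step) to an INT8 ONNX model in QDQ format. Input/output names, dummy input,
# and dynamic axes are illustrative assumptions.
import torch
from neural_compressor.config import Torch2ONNXConfig

int8_onnx_config = Torch2ONNXConfig(
    dtype="int8",
    opset_version=14,
    quant_format="QDQ",  # QuantizeLinear/DequantizeLinear node pairs
    example_inputs=torch.ones(1, 32, dtype=torch.long),  # dummy token ids
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "logits": {0: "batch", 1: "sequence"},
    },
)
q_model.export("gptj-int8.onnx", int8_onnx_config)
```

The exported model can then be loaded with ONNX Runtime for production inference.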