Discover the reasons behind the slowdown in saving your model's state_dict during training and learn effective strategies to solve it!

---

This video is based on the question https://stackoverflow.com/q/67620556/ asked by the user 'Penguin' (https://stackoverflow.com/u/14735451/) and on the answer https://stackoverflow.com/a/67624265/ provided by the user 'trialNerror' (https://stackoverflow.com/u/10935717/) on the Stack Overflow website. Thanks to these users and the Stack Exchange community for their contributions. Visit these links for the original content and further details, such as alternate solutions, the latest developments on the topic, comments, and revision history. The original title of the question was: "Why is saving state_dict getting slower as training progresses?"

Content (except music) is licensed under CC BY-SA (https://meta.stackexchange.com/help/l...). The original question post is licensed under CC BY-SA 4.0 (https://creativecommons.org/licenses/...), and the original answer post is licensed under CC BY-SA 4.0 (https://creativecommons.org/licenses/...). If anything seems off to you, please feel free to write me at vlogize [AT] gmail [DOT] com.

---

Understanding the Slowdown in Saving state_dict During Training

When working with deep learning models, it is common to save the state of your model at regular intervals so that you do not lose progress. But what happens when this saving step starts taking significantly longer as training progresses? If you have faced this problem, you are not alone. In this guide, we look at why saving a state_dict in PyTorch slows down as training continues and walk through a clear fix.

The Problem: Slow Saving of Model State

As you save your model's and optimizer's state_dict during training, you might notice that the process initially takes only a few seconds, but after hours of training it can take over two minutes. This is frustrating and hinders your workflow, especially if you need frequent checkpoints.

Dissecting the Code

In the shared code, a checkpoint is written every 50,000 epochs. It includes the model and optimizer state_dicts, the scheduler state, the current loss, and a list of past losses:

[[See Video to Reveal this Text or Code Snippet]]

Potential Culprit: The List of Losses

One factor that can drive up the saving time is the ever-growing list of losses, each stored as a tensor object. The following line in the training step is worth a closer look:

[[See Video to Reveal this Text or Code Snippet]]

The Solution: Optimize Loss Storage

The saving time grows because of how the loss tensor is stored in that list. When you call self.losses.append(loss), you are not saving just the loss value but also the entire computational graph attached to it, including references to every tensor involved in computing the loss. The list therefore balloons quickly, and all of that data has to be processed each time a checkpoint is written.

Recommended Change

Instead of appending the entire tensor, save just the loss value as a plain Python float.
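Before looking at the fix, here is a minimal sketch of the pattern under discussion, reconstructed from the description above rather than taken from the original post (the snippets themselves are only shown in the video). The attribute name self.losses comes from the question; the Trainer class, the stand-in linear model, the MSE loss, and the optimizer/scheduler choices are assumptions for illustration only.

import torch
import torch.nn as nn


class Trainer:
    """Sketch of the problematic pattern; everything except self.losses is assumed."""

    def __init__(self):
        self.model = nn.Linear(10, 1)  # stand-in model
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-3)
        self.scheduler = torch.optim.lr_scheduler.StepLR(self.optimizer, step_size=1000)
        self.losses = []  # grows by one entry every training step

    def training_step(self, x, y):
        self.optimizer.zero_grad()
        loss = nn.functional.mse_loss(self.model(x), y)
        loss.backward()
        self.optimizer.step()
        # Problematic line: the appended tensor still carries its computational
        # graph, so this list (and every checkpoint containing it) keeps growing.
        self.losses.append(loss)
        return loss

    def save_checkpoint(self, epoch, loss, path="checkpoint.pt"):
        # In the original code this runs every 50,000 epochs.
        torch.save(
            {
                "epoch": epoch,
                "model_state_dict": self.model.state_dict(),
                "optimizer_state_dict": self.optimizer.state_dict(),
                "scheduler_state_dict": self.scheduler.state_dict(),
                "loss": loss,
                "losses": self.losses,  # this list is what balloons over time
            },
            path,
        )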
You can achieve this with the following modification (a minimal sketch of both variants also appears after the conclusion below):

[[See Video to Reveal this Text or Code Snippet]]

Alternatively, you can use loss.detach() to detach the loss tensor from its computational graph, which also prevents the bloat:

[[See Video to Reveal this Text or Code Snippet]]

Benefits of the Change

By making either of these changes you significantly reduce the amount of data being saved, since you now store only the scalar value of the loss rather than a tensor object with an attached computational graph. This small adjustment leads to substantial improvements in:

Saving time: the saving process should return to its original speed, typically under 5 seconds.
Memory management: the overall memory footprint shrinks because only the essential information is kept.

Conclusion

If you have encountered a slowdown in saving your model's state_dict during training, look at how you are storing your loss values. Simply converting the loss tensor to a float before appending it to your list solves this frustrating issue. Keeping your code lean not only speeds up your training checkpoints but also improves your overall productivity in model training. Good luck with your deep learning projects, and stay tuned for more tips and tricks!
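As a recap, here is the same assumed training step from the earlier sketch with the recommended change applied; only the append line differs.

    def training_step(self, x, y):
        self.optimizer.zero_grad()
        loss = nn.functional.mse_loss(self.model(x), y)
        loss.backward()
        self.optimizer.step()
        # Store only the scalar value: .item() returns a plain Python float,
        # so no computational graph is kept alive in the list.
        self.losses.append(loss.item())
        # Alternative: keep a tensor but cut it loose from the graph.
        # self.losses.append(loss.detach())
        return loss

With either variant, self.losses holds only scalars (or graph-free tensors), so the checkpoint payload stays small and torch.save should stay close to its original speed.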