The Real Way AI Understands Language
In this comprehensive deep dive into the mathematics and mechanics of neural network optimization, we explore the algorithms that serve as the engine of modern artificial intelligence. Gradient descent remains the most fundamental optimization method: it iteratively updates model parameters in the direction opposite to the gradient of the objective function until it reaches a local minimum. To visualize this, imagine a hiker trapped in thick fog in the mountains; they must use the local steepness of the ground beneath their feet to decide which direction to step to reach the valley floor. In mathematical terms, the gradient is the vector of all partial derivatives and points in the direction of steepest ascent; moving against it, along the negative gradient, therefore gives the fastest local decrease in the cost function.

As we transition from basic methods to the adaptive-learning-rate era, the video explores how plain SGD was enhanced with momentum. By adding a fraction of the previous update to the current one, momentum acts like a heavy ball rolling down a hill, accumulating speed in consistent directions and dampening oscillations in narrow "ravines" of the loss landscape. We then analyze the Adam (Adaptive Moment Estimation) optimizer, currently the de facto standard for many NLP tasks. Adam combines the benefits of momentum and RMSprop, maintaining moving averages of both the gradients (first moment) and the squared gradients (second moment) to adjust the learning rate for each parameter individually.

For developers working with limited hardware or massive models, we highlight Adafactor and Lion. Adafactor reduces memory overhead by storing only per-row and per-column statistics of the squared-gradient moving average, giving sublinear memory cost when training huge Transformer models. Lion (EvoLved sIgn mOmeNtum), an optimizer discovered by Google's AutoML using an evolutionary algorithm, is even leaner. Unlike other optimizers, Lion cares only about the sign of the gradient, applying a constant-magnitude update to every weight. This simplicity lets Lion save roughly 33% of GPU memory compared to AdamW while delivering comparable or superior performance.

The video also touches on the importance of second-order optimization methods such as Newton's method. While first-order methods use only the gradient, second-order methods use the Hessian matrix (the matrix of second derivatives) to account for the curvature of the loss surface. Although Newton's method can converge to a minimum in far fewer steps, it is usually too computationally expensive for deep learning because inverting a large Hessian scales cubically with the number of parameters. This leads us to quasi-Newton methods such as L-BFGS and structured preconditioning methods such as Shampoo, which approximate the Hessian to speed up convergence without the full computational cost.

We further explore the nuances of training stability, particularly the role of the learning-rate schedule. A fixed learning rate is often suboptimal; instead, models typically benefit from a warmup phase, in which the rate gradually increases to prevent early divergence, followed by a decay phase (such as cosine decay) that lets the model settle into a minimum. We also discuss how torch.autograd in PyTorch simplifies the implementation of these steps by automatically recording every operation in a directed acyclic graph (DAG) and computing gradients via the chain rule.
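To make the first-order update rules above concrete, here is a minimal, self-contained sketch (not code from the video) that applies vanilla gradient descent, momentum, and Adam by hand to a toy least-squares problem; the data, step sizes, and beta values are illustrative choices, and torch.autograd supplies the gradients.

```python
import torch

# Toy least-squares objective: L(w) = mean((X w - y)^2)
torch.manual_seed(0)
X = torch.randn(64, 3)
y = X @ torch.tensor([2.0, -1.0, 0.5]) + 0.1 * torch.randn(64)

def loss_fn(w):
    return ((X @ w - y) ** 2).mean()

def grad(w):
    w = w.detach().requires_grad_(True)   # fresh leaf tensor tracked by autograd
    loss_fn(w).backward()                 # backprop through the recorded graph
    return w.grad

lr, beta, b1, b2, eps = 0.1, 0.9, 0.9, 0.999, 1e-8

# 1) Vanilla gradient descent: step against the gradient.
w = torch.zeros(3)
for _ in range(200):
    w = w - lr * grad(w)

# 2) Momentum: accumulate a velocity that dampens oscillations in narrow ravines.
w, v = torch.zeros(3), torch.zeros(3)
for _ in range(200):
    v = beta * v + grad(w)
    w = w - lr * v

# 3) Adam: per-parameter step sizes from first and second moment estimates.
w, m, s = torch.zeros(3), torch.zeros(3), torch.zeros(3)
for t in range(1, 201):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # moving average of gradients (first moment)
    s = b2 * s + (1 - b2) * g ** 2     # moving average of squared gradients (second moment)
    m_hat, s_hat = m / (1 - b1 ** t), s / (1 - b2 ** t)   # bias correction
    w = w - lr * m_hat / (s_hat.sqrt() + eps)

print(w)  # each variant drives w toward the true weights [2.0, -1.0, 0.5]
```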
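The sign-based rule that makes Lion so cheap fits in a few lines. The sketch below follows the published description of the update (interpolate momentum and gradient, take the sign, apply decoupled weight decay); the function name and default hyperparameters are illustrative, and this is not the reference implementation.

```python
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    """One Lion-style update applied in place to a single parameter tensor."""
    # Only the sign of the interpolated gradient/momentum matters, so every
    # weight receives a constant-magnitude step.
    update = torch.sign(beta1 * momentum + (1 - beta1) * grad)
    param.add_(update + wd * param, alpha=-lr)        # step plus decoupled weight decay
    # A single momentum buffer is the only optimizer state, hence the memory
    # savings relative to Adam's two moment buffers.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)

# Toy usage with a fake gradient:
p = torch.randn(5)
m = torch.zeros_like(p)
g = torch.randn(5)
lion_step(p, g, m)
```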
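As an illustration of the warmup-then-cosine-decay schedule and of autograd's role in the training step, here is a small sketch built on torch.optim.lr_scheduler.LambdaLR; the linear model, step counts, and peak learning rate are placeholder choices, not values from the video.

```python
import math
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)   # peak learning rate

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak rate to avoid early divergence.
        return step / max(1, warmup_steps)
    # Cosine decay: glide from the peak back toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)

x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in batch
for step in range(total_steps):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()   # torch.autograd replays the recorded DAG via the chain rule
    opt.step()
    sched.step()      # advance the warmup/cosine multiplier
```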
Finally, the video covers cutting-edge research such as JEST (Joint Example Selection), a technique from Google DeepMind that can be up to 13 times faster than standard training because it selects complementary batches of data that maximize the model's "learnability". We also discuss the theory that Transformers learn in context by implicitly performing gradient descent in their forward pass, effectively acting as "mesa-optimizers". Whether you are a researcher curious about the neurobiological inspiration behind neural networks, which emulate the parallel, fault-tolerant processing style of the brain, or a developer looking for a hyperparameter tuning guide, this video provides the foundational knowledge needed to master AI optimization. By understanding the relationship between loss functions, gradients, and curvature, you can build models that are not only faster to train but also more accurate and robust in real-world applications.

#DeepLearning #GradientDescent #AdamW #LionOptimizer #MachineLearningMath #NeuralNetworks #PyTorch #AIoptimization #Transformers #DataScience
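Purely as a toy illustration of the batch-selection idea summarized above, the sketch below scores candidate batches by a simple "learnability" gap, the learner's loss minus a reference model's loss, and keeps the highest-scoring one; the function names are hypothetical and this is not DeepMind's JEST algorithm or code.

```python
import torch

def learnability(learner, reference, batch_x, batch_y, loss_fn):
    """Score a batch: examples the learner still gets wrong but the reference
    model handles well are the most informative to train on next."""
    with torch.no_grad():
        learner_loss = loss_fn(learner(batch_x), batch_y)
        reference_loss = loss_fn(reference(batch_x), batch_y)
    return (learner_loss - reference_loss).item()

def pick_batch(learner, reference, candidate_batches, loss_fn):
    """Return the candidate (x, y) batch with the highest learnability score."""
    scores = [learnability(learner, reference, x, y, loss_fn)
              for x, y in candidate_batches]
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_batches[best]

# Toy usage with two linear "models" and random candidate batches:
learner = torch.nn.Linear(8, 1)
reference = torch.nn.Linear(8, 1)
candidates = [(torch.randn(16, 8), torch.randn(16, 1)) for _ in range(4)]
x, y = pick_batch(learner, reference, candidates, torch.nn.functional.mse_loss)
```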