Gradient Code Generation for AI Accelerators with Specialized Data Layout Requirements
Authors: Linh H. Tran, Amy Wang, Zichun Ye and Giancarlo Colmenares

Convolution is the most popular and extensively optimized operator in modern deep neural networks. As machine learning frameworks such as TensorFlow [1] enable end users to train networks on commodity hardware, effort is being invested in optimizing the backward (gradient) kernels for convolution to speed up training. The current method of computing the convolution data gradient suffers from low efficiency because it relies on the column-to-image (col2im) function, which performs a multiple-to-one gradient aggregation. Overcoming this inefficiency would require new generations of tensor processors with better support for vector operations. In this paper, we present an alternative approach that generates the convolution backward kernels with respect to the data and weight inputs from the forward convolution kernel, which consists of image-to-column (im2col) and matrix multiplications. This approach requires non-trivial data format (layout) conversions on the inputs and outputs surrounding the use of the forward convolution kernel. Such conversions can become even more complex, and possibly inefficient, on tensor processors or accelerators that require a specialized data format to begin with. We therefore formulate an iterator method that systematically performs the required data conversions while taking hardware-specific optimizations in data movement into account. We illustrate the iterator method using Huawei DaVinci's tensor data layout [2]. Our test results using the shapes from ResNet-50 [3] show that on CPU, using library kernels [4] to perform column-to-image, image-to-column, matrix multiplication and the needed format conversions, our approach brings better performance than the original approach that uses col2im. Further, our approach outperforms TVM's automatically generated backward kernels [5].
We also investigate how native hardware support for a fast image-to-column operation can affect the overall performance of the backward kernels.

Keywords: backward gradient; specialized data format; convolution operator; performance improvement; DaVinci hardware; AI processor
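To make the abstract's terminology concrete, the following is a minimal NumPy sketch (single channel, stride 1, no padding — all simplifying assumptions, not the paper's implementation) of forward convolution as im2col plus a matrix multiplication, and of the col2im-based data gradient whose multiple-to-one scatter-add aggregation is the inefficiency the paper targets:

```python
import numpy as np

def im2col(x, kh, kw):
    # x: (H, W) single-channel image; stride 1, no padding (assumed).
    # Each output column is one flattened kh-by-kw window of x.
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def col2im(cols, H, W, kh, kw):
    # Scatter-add each column back into its window: overlapping
    # windows accumulate into the same pixel, i.e. the multiple-to-one
    # gradient aggregation the abstract describes as inefficient.
    oh, ow = H - kh + 1, W - kw + 1
    x = np.zeros((H, W))
    for i in range(oh):
        for j in range(ow):
            x[i:i + kh, j:j + kw] += cols[:, i * ow + j].reshape(kh, kw)
    return x

# Forward: convolution expressed as im2col + matrix multiplication.
x = np.arange(16.0).reshape(4, 4)
w = np.ones((2, 2))
y = (w.ravel() @ im2col(x, 2, 2)).reshape(3, 3)

# Backward w.r.t. data (col2im-based): dL/dcols is the outer product
# of the flattened weights with the flattened output gradient,
# and dL/dx is its scatter-add back to image form.
dy = np.ones((3, 3))
dx = col2im(np.outer(w.ravel(), dy.ravel()), 4, 4, 2, 2)
```

With an all-ones upstream gradient, each entry of `dx` counts how many windows overlap that pixel (1 at the corners, 4 in the interior), which makes the many-to-one accumulation visible. The paper's alternative replaces this scatter-add with a forward-style im2col-plus-matmul kernel applied to suitably layout-converted tensors.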