
Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)

Coquelin, Daniel; Debus, Charlotte; Götz, Markus; Lehr, Fabrice von der; Kahn, James; Siggel, Martin; Streit, Achim

Abstract (English):

With increasing data and model complexities, the time required to train neural networks has become prohibitively large. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) and large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck. We introduce the Distributed Asynchronous and Selective Optimization (DASO) method, which leverages multi-GPU compute node architectures to accelerate network training while maintaining accuracy. DASO uses a hierarchical and asynchronous communication scheme composed of node-local and global networks while adjusting the global synchronization rate during the learning process. We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks, as compared to current optimized data parallel training methods.
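The hierarchical, asynchronous scheme described in the abstract can be sketched in a few lines of PyTorch. The fragment below is a minimal illustration only, assuming torch.distributed process groups; the names build_groups, daso_style_sync, gpus_per_node, and global_sync_every are hypothetical, and both the adaptive adjustment of the global synchronization rate and the broadcast of globally averaged parameters back to each node are omitted. It is not the authors' implementation.

import torch.distributed as dist

def build_groups(gpus_per_node):
    # One process group per compute node plus a group of node "leaders"
    # (rank 0 of every node). Every rank must take part in every
    # new_group call, even for groups it does not belong to.
    world, rank = dist.get_world_size(), dist.get_rank()
    local_group = None
    for n in range(world // gpus_per_node):
        ranks = list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
        group = dist.new_group(ranks=ranks)
        if n == rank // gpus_per_node:
            local_group = group
    leader_group = dist.new_group(ranks=list(range(0, world, gpus_per_node)))
    return local_group, leader_group

def daso_style_sync(model, step, local_group, leader_group,
                    gpus_per_node, global_sync_every=4):
    # Node-local parameter averaging after every forward-backward pass
    # (cheap, over the fast intra-node interconnect); a non-blocking
    # global average between node leaders only every few steps, so the
    # globally averaged values may be slightly stale.
    rank = dist.get_rank()
    nodes = dist.get_world_size() // gpus_per_node
    pending = []
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM, group=local_group)
        p.data /= gpus_per_node
        if step % global_sync_every == 0 and rank % gpus_per_node == 0:
            work = dist.all_reduce(p.data, group=leader_group, async_op=True)
            pending.append((p, work))
    for p, work in pending:
        work.wait()
        p.data /= nodes

In a real training script these functions would run under a launcher such as torchrun after dist.init_process_group, and the node leaders would broadcast the globally averaged parameters to the remaining ranks on their node before the next step.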


Full text
DOI: 10.5445/IR/1000137220
Published on 06.09.2021
Original publication
DOI: 10.21203/rs.3.rs-832355/v1
Affiliated institution(s) at KIT: Scientific Computing Center (SCC)
Universität Karlsruhe (TH) – Zentrale Einrichtungen
Publication type: Research report/Preprint
Year of publication: 2021
Language: English
Identifier: KITopen-ID: 1000137220
HGF program: 46.21.04 (POF IV, LK 01) HAICU
Publisher: Springer
Publication note: Under review in: Journal of Big Data
Published online in advance on 12.04.2021
Keywords: machine learning, neural networks, data parallel training, multi-node, multi-GPU, stale gradients
Indexed in: Dimensions