A container-based workflow for distributed training of deep learning algorithms in HPC clusters

González-Abad, Jose; López García, Álvaro; Kozlov, Valentin Y. 1
1 Scientific Computing Center (SCC), Karlsruher Institut für Technologie (KIT)


Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In academia, researchers get access to this kind of resource through High Performance Computing (HPC) clusters. This kind of infrastructure makes the training of these models difficult due to its multi-user nature and limited user permissions. In addition, different HPC clusters may have peculiarities that complicate the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters which provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not need any special permissions, allowing researchers to run the entire workflow without relying on any administrator. Horovod ensures the efficient distribution of the training independently of the deep learning framework used. ...
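To make the udocker and Horovod combination concrete, the following is a minimal sketch of the command sequence such a workflow might issue. It is not the authors' actual workflow: the image name, container name, process count and training script are hypothetical placeholders, and launching `horovodrun` inside a single container on one node is an assumption made for illustration. The sketch only builds and prints the command lines; it does not execute them.

```python
import shlex


def build_workflow(image, container, num_procs, script):
    """Return argv lists for a hypothetical single-node udocker + Horovod run."""
    return [
        # Download the image into the user's local udocker repository (no root needed).
        ["udocker", "pull", image],
        # Create a named container from the pulled image.
        ["udocker", "create", f"--name={container}", image],
        # Make the host NVIDIA driver libraries available inside the container.
        ["udocker", "setup", "--nvidia", container],
        # Assumption: Horovod is installed in the image and is launched inside
        # the container, with one process per GPU.
        ["udocker", "run", container,
         "horovodrun", "-np", str(num_procs), "python", script],
    ]


# All four arguments below are placeholders, not values taken from the paper.
commands = build_workflow("example/dl-image:latest", "dl_train", 4, "train.py")
for argv in commands:
    print(shlex.join(argv))
```

On a real cluster these commands would typically be placed in a batch job script, and multi-node runs may require a different launch arrangement than the single-container one assumed here.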

DOI: 10.5445/IR/1000152892
Published on: 21.11.2022
Affiliated institution(s) at KIT: Scientific Computing Center (SCC)
Publication type: Research report/Preprint
Publication year: 2022
Language: English
Identifier: KITopen-ID: 1000152892
HGF program: 46.21.02 (POF IV, LK 01) Cross-Domain ATMLs and Research Groups
Extent: 35 pp.
Published online in advance on: 04.08.2022
Keywords: Distributed Training, Deep Learning, High Performance Computing, udocker, Docker, Horovod
Indexed in: arXiv
