A container-based workflow for distributed training of deep learning algorithms in HPC clusters

González-Abad, Jose; López García, Álvaro; Kozlov, Valentin Y. 1
1 Scientific Computing Center (SCC), Karlsruher Institut für Technologie (KIT)

Abstract:

Deep learning has been proposed as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In academia, researchers get access to these resources through High Performance Computing (HPC) clusters. Such infrastructures make the training of these models difficult due to their multi-user nature and limited user permissions. In addition, different HPC clusters may have peculiarities that complicate the research cycle (e.g., library dependencies). In this paper we develop a workflow and methodology for the distributed training of deep learning models in HPC clusters which provides researchers with a series of novel advantages. It relies on udocker as the containerization tool and on Horovod as the library for distributing the models across multiple GPUs. udocker does not need any special permissions, allowing researchers to run the entire workflow without relying on an administrator. Horovod ensures the efficient distribution of the training independently of the deep learning framework used. ...
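
To illustrate the idea of distributing training across several devices, the following is a minimal sketch of synchronous data-parallel training, in which each worker computes a gradient on its own data shard and the gradients are averaged before every update. It is not the paper's workflow and does not use Horovod, udocker, or GPUs: the workers are simulated in a single process, and the names `local_gradient`, `allreduce_mean`, and `train`, as well as the toy data, are assumptions made for illustration only.

```python
# Simulated synchronous data-parallel SGD: each "worker" holds one data shard,
# computes a local gradient, and the gradients are averaged (the role an
# averaging allreduce plays in a real multi-GPU run).
# Hypothetical helper names; no Horovod or GPU is involved.

def local_gradient(w, shard):
    """Gradient of the mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_mean(values):
    """Stand-in for an averaging allreduce across workers."""
    return sum(values) / len(values)


def train(data, num_workers=4, lr=0.01, steps=200):
    # Split the data into equally sized shards, one per simulated worker.
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    w = 0.0  # every worker starts from the same weight
    for _ in range(steps):
        grads = [local_gradient(w, shard) for shard in shards]
        w -= lr * allreduce_mean(grads)  # identical update applied on every worker
    return w


# Toy data generated with a true slope of 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_fit = train(data)
print(round(w_fit, 3))
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, which is why every simulated worker ends up with the same weight after each step.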


Publisher's version
DOI: 10.5445/IR/1000152890
Published on 21 November 2022
Original publication
DOI: 10.1007/s10586-022-03798-7
Citations (Scopus): 4
Citations (Web of Science): 1
Citations (Dimensions): 6
Affiliated institution(s) at KIT: Scientific Computing Center (SCC)
Publication type: Journal article
Publication year: 2022
Language: English
Identifier: ISSN: 1386-7857, 1573-7543
KITopen-ID: 1000152890
HGF program: 46.21.02 (POF IV, LK 01) Cross-Domain ATMLs and Research Groups
Published in: Cluster Computing
Publisher: Springer
Volume: 26
Issue: 5
Pages: 2815–2834
Published online in advance: 7 November 2022
Keywords: Distributed training, Deep learning, High performance computing, udocker, Docker, Horovod
Indexed in: Scopus, Web of Science, Dimensions