
A scalable architecture for online anomaly detection of WLCG batch jobs

Kuehn, E. 1; Fischer, M. 1; Giffels, M. 1; Jung, C. 1; Petzold, A. 1
1 Scientific Computing Center (SCC), Karlsruher Institut für Technologie (KIT)

Abstract:

For data centres it is increasingly important to monitor network usage and to learn from network usage patterns. In particular, configuration issues or misbehaving batch jobs that prevent smooth operation need to be detected as early as possible. At the GridKa data and computing centre we therefore operate a tool, BPNetMon, for monitoring traffic data and characteristics of WLCG batch jobs and pilots locally on different worker nodes. On the one hand, local information alone is not sufficient to detect anomalies for several reasons, e.g. the underlying job distribution on a single worker node might change or there might be a local misconfiguration. On the other hand, a centralised anomaly detection approach does not scale with regard to network communication and computational costs. We therefore propose a scalable architecture based on concepts of a super-peer network.


DOI: 10.5445/IR/1000063495
Affiliated institution(s) at KIT Institut für Kernphysik (IKP)
Scientific Computing Center (SCC)
Universität Karlsruhe (TH) – Zentrale Einrichtungen (Zentrale Einrichtungen)
Publication type Journal article
Publication year 2016
Language English
Identifier ISSN: 1742-6588, 1742-6596
urn:nbn:de:swb:90-634953
KITopen-ID: 1000063495
HGF program 53.52.02 (POF III, LK 02) GridKa
Published in Journal of physics / Conference Series
Publisher Institute of Physics Publishing Ltd (IOP Publishing Ltd)
Volume 762
Pages 012002
Indexed in Scopus
Dimensions