Dimension-based subspace search for outlier detection

Trittenbach, Holger; Böhm, Klemens

Scientific data are often high-dimensional. In such data, finding outliers is challenging because they are often hidden in subspaces, i.e., lower-dimensional projections of the data. Recent approaches to outlier mining decouple the actual detection of outliers from the search for subspaces likely to contain outliers. However, finding a set of subspaces that contains most or even all outliers of a given data set remains an open problem. While previous proposals use per-subspace measures such as correlation to quantify the quality of subspaces, we explicitly take the relationship between subspaces into account and propose a dimension-based measure of that quality. Based on this measure, we formalize the notion of an optimal set of subspaces and propose the Greedy Maximum Deviation heuristic to approximate this set. Experiments on comprehensive benchmark data show that our concept is more effective in determining the relevant set of subspaces than approaches that use per-subspace measures.

DOI: 10.1007/s41060-018-0137-7
Affiliated institution(s) at KIT: Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Journal article
Publication year: 2018
Language: English
Identifier: ISSN 2364-415X, 2364-4168
KITopen-ID: 1000083489
Published in: International Journal of Data Science and Analytics
Project information: GRK 2153/1 (DFG, DFG KOORD, GRK 2153/1)
Published online in advance: 14 June 2018
Keywords: Outlier mining, Subspace search, High-dimensional data