
A framework for dependency estimation in heterogeneous data streams

Fouché, Edouard; Mazankiewicz, Alan; Kalinke, Florian; Böhm, Klemens

Abstract:
Estimating dependencies from data is a fundamental task of Knowledge Discovery. Identifying the relevant variables leads to a better understanding of data and improves both the runtime and the outcomes of downstream Data Mining tasks. Dependency estimation from static numerical data has received much attention. However, real-world data often occurs as heterogeneous data streams: On the one hand, data is collected online and is virtually infinite. On the other hand, the various components of a stream may be of different types, e.g., numerical, ordinal or categorical. For this setting, we propose Monte Carlo Dependency Estimation (MCDE), a framework that quantifies multivariate dependency as the average statistical discrepancy between marginal and conditional distributions, via Monte Carlo simulations. MCDE handles heterogeneity by leveraging three statistical tests: the Mann–Whitney U, the Kolmogorov–Smirnov and the Chi-Squared test. We demonstrate that MCDE goes beyond the state of the art regarding dependency estimation by meeting a broad set of requirements. Finally, we show with a real-world use case that MCDE can discover useful patterns in heterogeneous data streams.
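The idea of averaging a statistical discrepancy between marginal and conditional distributions over Monte Carlo iterations can be illustrated with a small sketch. This is not the authors' implementation: it covers numerical data only, uses the raw two-sample Kolmogorov-Smirnov statistic as the discrepancy, and the names `ks_statistic` and `mc_dependency` as well as the parameters `iterations`, `slice_fraction` and `seed` are assumptions made for illustration.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def mc_dependency(columns, iterations=200, slice_fraction=0.5, seed=0):
    """Average KS discrepancy between marginal and conditional samples.

    columns: list of at least two equally long lists of numbers, one per variable.
    iterations, slice_fraction and seed are illustrative parameters (assumptions).
    """
    rng = random.Random(seed)
    n, d = len(columns[0]), len(columns)
    # Width of the rank block kept per conditioning variable, chosen so that
    # roughly slice_fraction of the points survive all d - 1 conditions.
    width = max(2, int(n * slice_fraction ** (1.0 / (d - 1))))
    # Point indices of each variable, sorted by value (computed once).
    orders = [sorted(range(n), key=col.__getitem__) for col in columns]
    total = 0.0
    for _ in range(iterations):
        ref = rng.randrange(d)  # variable whose distributions are compared
        selected = set(range(n))
        for j in range(d):
            if j == ref:
                continue
            # Condition on a random contiguous block of ranks of variable j.
            start = rng.randrange(n - width + 1)
            selected &= set(orders[j][start:start + width])
        if not selected:
            continue  # an empty slice adds nothing to the running total
        conditional = [columns[ref][i] for i in selected]
        total += ks_statistic(columns[ref], conditional)
    return total / iterations


if __name__ == "__main__":
    rng = random.Random(1)
    x = [rng.random() for _ in range(300)]
    noise = [rng.random() for _ in range(300)]
    print("dependent pair:  ", mc_dependency([x, list(x)]))
    print("independent pair:", mc_dependency([x, noise]))
```

Under these assumptions, the score for the perfectly dependent pair should come out clearly higher than the score for the independent pair, because conditioning on a slice of one variable shifts the distribution of the other only when the two are related. Ordinal or categorical variables would need a different discrepancy, such as the Mann-Whitney U or Chi-Squared tests named in the abstract.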

Open Access


Publisher's version
DOI: 10.5445/IR/1000120837
Published on: 2 July 2020
Affiliated institution(s) at KIT: Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Journal article
Publication year: 2020
Language: English
Identifier: ISSN 0926-8782, 1573-7578
KITopen-ID: 1000120837
Published in: Distributed and Parallel Databases
Published online in advance: 6 June 2020
Indexed in: Scopus, Web of Science