KIT | KIT-Bibliothek | Impressum | Datenschutz

Discovering Functional Dependencies through Hitting Set Enumeration

Bleifuß, Tobias; Papenbrock, Thorsten; Bläsius, Thomas ORCID iD icon 1; Schirneck, Martin; Naumann, Felix
1 Institut für Theoretische Informatik (ITI), Karlsruher Institut für Technologie (KIT)

Abstract:

Functional dependencies (FDs) are among the most important integrity constraints in databases. They serve
to normalize datasets and thus resolve redundancies, they contribute to query optimization, and they are
frequently used to guide data cleaning efforts. Because the FDs of a particular dataset are usually unknown,
automatic profiling algorithms are needed to discover them. These algorithms have made considerable advances
in the past few years, but they still require a significant amount of time and memory to process datasets of
practically relevant sizes.
We present FDhits, a novel FD discovery algorithm that finds all valid, minimal FDs in a given relational
dataset. FDhits is based on several discovery optimizations that include a hybrid validation approach, effective
hitting set enumeration techniques, one-pass candidate validations, and parallelization. Our experiments show
that FDhits, even without parallel execution, has a median speedup of 8.1 compared to state-of-the-art FD
discovery algorithms while using significantly less memory. This allows the discovery of all FDs even on
datasets that could not be processed by the current state-of-the-art.


Verlagsausgabe §
DOI: 10.5445/IR/1000169608
Veröffentlicht am 27.03.2024
Cover der Publikation
Zugehörige Institution(en) am KIT Institut für Theoretische Informatik (ITI)
Publikationstyp Zeitschriftenaufsatz
Publikationsdatum 26.03.2024
Sprache Englisch
Identifikator ISSN: 2836-6573
KITopen-ID: 1000169608
Erschienen in Proceedings of the ACM on Management of Data
Band 2
Heft 1
Seiten 1–24
Nachgewiesen in Dimensions
KIT – Die Forschungsuniversität in der Helmholtz-Gemeinschaft
KITopen Landing Page