Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)

Willkomm, Jens; Schäler, Martin; Böhm, Klemens

doi:10.5445/IR/1000128678

Abstract:

Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes estimating the cardinality of string search patterns, in particular of multi-word strings such as phrases or text snippets. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose pruning techniques for this setting. Our approaches remove characters with low information value. The variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimates when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.
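To illustrate why pruning a suffix tree to a fixed depth loses more information for long patterns, here is a minimal sketch in Python. It is not the authors' implementation: the function names `build_pruned_counts` and `estimate` are hypothetical, the tree is flattened into a dictionary of substring counts, and "cardinality" is simplified to the number of occurrences of a pattern.

```python
from collections import defaultdict

def build_pruned_counts(strings, max_depth):
    """Count every substring of length <= max_depth.

    This holds the same information as a count suffix trie pruned at
    depth max_depth, stored as a flat dict (a simplification).
    """
    counts = defaultdict(int)
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, min(i + max_depth, len(s)) + 1):
                counts[s[i:j]] += 1
    return counts

def estimate(counts, pattern, max_depth):
    """Exact count for short patterns; for longer patterns, only the
    count of the first max_depth characters is known (an upper bound)."""
    return counts.get(pattern[:max_depth], 0)

# Illustrative data (assumed, not from the paper).
docs = ["the cat sat", "the cat ran"]
counts = build_pruned_counts(docs, max_depth=5)

short_estimate = estimate(counts, "cat", 5)          # 2, exact
long_estimate = estimate(counts, "the cat sat", 5)   # 2, true count is 1
```

The short pattern is answered exactly, while the long pattern falls back to its five-character prefix and is overestimated. This is the effect the abstract describes for depth-based pruning on sets of long strings.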

Published on 29.03.2021

Affiliated institution(s) at KIT	Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type	Research report/Preprint
Publication month/year	03.2021
Language	English
Identifier	KITopen-ID: 1000128678
Publisher	Karlsruher Institut für Technologie (KIT)
Extent	20 pp.
Keywords	Query optimization, cardinality estimation, suffix tree
Indexed in	OpenAlex
Relations in KITopen	Refers to: Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees. Willkomm, Jens; Schäler, Martin; Böhm, Klemens (2021) Conference paper (1000131311)