
On the Usefulness of SQL-Query-Similarity Measures to Find User Interests

Arzamasova, Natalia; Böhm, Klemens; Goldman, Bertrand; Saaler, Christian; Schäler, Martin

In the sciences and elsewhere, the use of relational databases has become ubiquitous. An important challenge is finding hot spots of user interests. In principle, one can discover user interests by clustering the queries in the query log. Such a clustering requires a notion of query similarity. This, in turn, raises the question of what features of SQL queries are meaningful. We have studied the query representations proposed in the literature and corresponding similarity functions and have identified shortcomings of all of them. To overcome these limitations, we propose new similarity functions for SQL queries. They rely on the so-called access area of a query and, more specifically, on the overlap and the closeness of the access areas. We have carried out experiments systematically to compare the various similarity functions described in this article. The first series of experiments measures the quality of clustering and compares it to a ground truth. In the second series, we focus on the query log from the well-known SkyServer database. Here, a domain expert has interpreted various clusters by hand. We conclude that clusters obtained with our new measures of similarity seem to be good indicators of user interests.
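To make the idea of overlap between access areas concrete, here is a minimal sketch. It is not the similarity function defined in the report. It assumes, purely for illustration, that an access area can be simplified to one closed numeric range per attribute, and it scores overlap with a Jaccard-style ratio averaged over attributes. The names `AccessArea`, `interval_overlap`, and `access_area_similarity`, and the attribute names `ra` and `dec`, are hypothetical. The closeness component mentioned in the abstract is not modelled here.

```python
from typing import Dict, Tuple

# Hypothetical simplification: an access area maps each attribute name
# to the closed numeric range [low, high] that the query selects.
Interval = Tuple[float, float]
AccessArea = Dict[str, Interval]


def interval_overlap(a: Interval, b: Interval) -> float:
    """Jaccard-style overlap of two closed intervals, in [0, 1]."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    if union == 0.0:
        # Both intervals are single points: identical points overlap fully.
        return 1.0 if a == b else 0.0
    return inter / union


def access_area_similarity(q1: AccessArea, q2: AccessArea) -> float:
    """Average per-attribute overlap; unshared attributes contribute 0."""
    attrs = set(q1) | set(q2)
    if not attrs:
        return 0.0
    shared = set(q1) & set(q2)
    total = sum(interval_overlap(q1[x], q2[x]) for x in shared)
    return total / len(attrs)


# Two range queries over hypothetical sky-coordinate attributes.
q_a = {"ra": (0.0, 10.0), "dec": (0.0, 10.0)}
q_b = {"ra": (5.0, 15.0), "dec": (0.0, 10.0)}
similarity = access_area_similarity(q_a, q_b)
```

A pairwise score of this kind could feed a standard clustering algorithm over the query log. Which algorithm and which exact measure the authors use is described in the report itself, not in this sketch.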


DOI: 10.5445/IR/1000093761
Published on 16 April 2019
Affiliated institution(s) at KIT: Institut für Programmstrukturen und Datenorganisation (IPD)
Publication type: Research report/Preprint
Publication year: 2019
Language: English
ISSN: 2190-4782
KITopen-ID: 1000093761
Publisher: KIT, Karlsruhe
Extent: 18 pp.
Series: Karlsruhe Reports in Informatics ; 2019,3
Keywords: SQL log analysis, SQL query representations, similarity measures