In the sciences and elsewhere, the use of relational databases has become ubiquitous.
To get maximum profit from a database, one should have in-depth knowledge in both
SQL and a domain (data structure and meaning that a database contains). To assist
inexperienced users in formulating their needs, SQL query recommendation system
(SQL QRS) has been proposed. It utilizes the experience of previous users captured by
SQL query log as well as the user query history to suggest. When constructing such
a system, one should solve related problems: (1) clean the query log and (2) define
appropriate query similarity functions. These two tasks are not only necessary for
building SQL QRS, but they apply to other problems. In what follows, we describe
three scenarios of SQL query log analysis: (1) cleaning an SQL query log, (2) SQL
query log clustering when testing SQL query similarity functions and (3) recommending
SQL queries. We also explain how these three branches are related to each other.
Scenario 1. Cleaning SQL query log as a general pre-processing step
The raw query log is often not suitable for query log analysis tasks such as clustering,
giving recommendations. That is because it contains antipatterns and robotic data
downloads, also known as Sliding Window Search (SWS). An antipattern in software
engineering is a special case of a pattern. While a pattern is a standard solution, an
antipattern is a pattern with a negative effect.
When it comes to SQL query recommendation, leaving such artifacts in the log during
analysis results in a wrong suggestion. Firstly, the behaviour of "mortal" users who
need a recommendation is different from robots, which perform SWS. Secondly, one
does not want to recommend antipatterns, so they need to be excluded from the query
pool. Thirdly, the bigger a log is, the slower a recommendation engine operates. Thus,
excluding SWS and antipatterns from the input data makes the recommendation
better and faster.
The effect of SWS and antipatterns on query log clustering depends on the chosen
similarity function. The result can either (1) do not change or (2) add clusters which
cover a big part of data. In any case, having antipatterns and SWS in an input log
increases only the time one need to cluster and do not increase the quality of results.
Scenario 2. Identifying User Interests via Clustering
To identify the hot spots of user interests, one clusters SQL queries. In a scientific
domain, it exposes research trends. In business, it points to popular data slices which
one might want to refactor for better accessibility. A good clustering result must be
precise (match ground truth) and interpretable.
Query similarity relies on SQL query representation. There are three strategies to
represent an SQL query. FB (feature-based) query representation sees a query as
structure, not considering the data, a query accesses. WB (witness-based) approach
treat a query as a set of tuples in the result set. AAB (access area-based) representation
considers a query as an expression in relational algebra. While WB and FB query
similarity functions are straightforward (Jaccard or cosine similarities), AAB query
similarity requires additional definition. We proposed two variants of AAB similarity
measure – overlap (AABovl) and closeness (AABcl). In AABovl, the similarity of two
queries is the overlap of their access areas. AABcl relies on the distance between two
access areas in the data space – two queries may be similar even if their access areas
do not overlap.
The extensive experiments consist of two parts. The first one is clustering a rather
small dataset with ground truth. This experiment serves to study the precision of
various similarity functions by comparing clustering results to supervised insights. The
second experiment aims to investigate on the interpretability of clustering results with
different similarity functions. It clusters a big real-world query log. The domain expert
then evaluates the results. Both experiments show that AAB similarity functions
produce better results in both precision and interpretability.
Scenario 3. SQL Query Recommendation
A sound SQL query recommendation system (1) provides a query which can be run
directly, (2) supports comparison operators and various logical operators, (3) is scalable
and has low response times, (4) provides recommendations of high quality. The existing
approaches fail to fulfill all the requirements. We proposed DASQR, scalable and
data-aware query recommendation to meet all four needs. In a nutshell, DASQR is
a hybrid (collaborative filtering + content-based) approach. Its variations utilize all
similarity functions, which we define or find in the related work.
Measuring the quality of SQL query recommendation system (QRS) is particularly
challenging since there is no standard way approaching it. Previous studies have
evaluated the results using quality metrics which only rely on the query representations
used in these studies. It is somewhat subjective since a similarity function and a
quality metric are dependent. We propose AAB quality metrics and then evaluate
each approach based on all the metrics.
The experiments test DASQR approaches and competitors. Both performance and
runtime experiments indicate that DASQR approaches outperform the existing ones.