Ranking of Database Query Results
Nitesh Maan, Arjun Saraswat, Nishant Kapoor
Introduction
As the name suggests, ‘ranking’ is the process of ordering a set of values (or data items) based on some parameter that is of high relevance to the user of the ranking process.
Ranking and returning the most relevant results of a user's query is a popular paradigm in information retrieval.
Ranking and databases
Not much work has been done on ranking the results of queries in database systems.
We have all seen examples of ranking of results on the internet. The most common example is internet search engines (such as Google). A set of web pages (satisfying the user's search criteria) is returned, with the most relevant results featuring at the top of the list.
In contrast to the WWW, databases support only a Boolean query model. For example, a selection query on a SQL database returns all tuples that satisfy the conditions specified in the query. Depending on the conditions specified in the query, two situations may arise:
Empty Answers: when the query is too selective, the answer may be empty.
Many Answers: when the query is not too selective, too many tuples may appear in the answer.
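As a rough illustration of the two scenarios (using Python's built-in sqlite3 module and a small hypothetical Cars table that is not taken from the paper):

import sqlite3

# A tiny in-memory table; the Cars schema here is purely illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars (make TEXT, model TEXT, price INTEGER, year INTEGER)")
conn.executemany("INSERT INTO Cars VALUES (?, ?, ?, ?)", [
    ("Honda", "Civic", 18000, 2006),
    ("Honda", "Accord", 24000, 2007),
    ("Toyota", "Camry", 23000, 2006),
])

# Too selective: no tuple satisfies every condition, so the answer is empty.
empty = conn.execute(
    "SELECT * FROM Cars WHERE make = 'Honda' AND price < 10000").fetchall()

# Not selective enough: on a realistic table, far too many tuples qualify.
many = conn.execute("SELECT * FROM Cars WHERE year >= 2006").fetchall()

print(len(empty), len(many))   # 0 rows vs. (almost) every row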
We next consider these two scenarios in detail and look at various mechanisms for producing ranked results in these circumstances.
The Empty Answers Problem
The empty answers problem is the consequence of a very selective query in a database system.
In this case it would be desirable to return a ranked list of ‘approximately’ matching tuples without burdening the user with specifying any additional conditions. In other words, we want an automated approach for ranking and returning approximately matching tuples.
Automated Ranking Functions
Automated ranking of query results is the process of taking a user query and mapping it to a Top-K query with a ranking function that depends on the conditions specified in the user query.
A ranking function should work well even for large databases and have minimal side effects on query processing.
Automated Ranking functions for the
‘Empty Answers Problem’
IDF Similarity
QF Similarity
QFIDF Similarity
IDF Similarity
IDF (inverse document frequency) similarity is an adaptation of a popular IR technique based on the philosophy that frequently occurring words convey less information about a user's needs than rarely occurring words, and thus should be weighted less.
IDF Similarity, formal definition
For every value ‘t’ in the domain of attribute ‘A’, IDF(t) can be defined as log(n/F(t)),
where ‘n’ is the number of tuples in the database and
F(t) is the frequency of tuples in the database where ‘A’ = ‘t’.
The similarity between a tuple ‘T’ and a query ‘Q’ is defined as:
SIM(T, Q) = Σ_{k=1..m} S_k(t_k, q_k)
i.e., the similarity between a tuple T and a query Q is simply the sum of the corresponding similarity coefficients S_k over all m attributes of T.
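As a rough sketch of the above (assuming, as the definition suggests for categorical attributes, that S_k(t_k, q_k) equals IDF_k(q_k) when the tuple matches the query value and 0 otherwise; rows are represented here as plain dictionaries):

import math
from collections import Counter

def idf_weights(rows, attribute):
    # IDF(t) = log(n / F(t)): n tuples in total, F(t) tuples where attribute = t.
    n = len(rows)
    freq = Counter(row[attribute] for row in rows)
    return {value: math.log(n / count) for value, count in freq.items()}

def idf_similarity(row, query, rows):
    # SIM(T, Q) = sum over the specified attributes of S_k(t_k, q_k), with
    # S_k(t_k, q_k) = IDF_k(q_k) if t_k = q_k and 0 otherwise.
    score = 0.0
    for attribute, q_value in query.items():
        if row.get(attribute) == q_value:
            score += idf_weights(rows, attribute).get(q_value, 0.0)
    return score

In practice the IDF weights would be pre-computed once per attribute rather than recomputed for every scored tuple.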
QF Similarity – leveraging workloads
There may be instances where the relevance of an attribute value is due to factors other than the frequency of its occurrence.
QF similarity is based on this very philosophy. According to QF similarity, the importance of an attribute value is directly related to the frequency of its occurrence in the query strings of the workload.
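A minimal sketch, assuming the workload has already been reduced to the list of attribute values referenced by its query strings (how that list is extracted from the workload is not shown here):

from collections import Counter

def qf_weights(workload_values):
    # QF(q) = RQF(q) / RQFMax: RQF(q) is how often value q is referenced in
    # the workload's queries, RQFMax is the largest such count.
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values())
    return {value: count / rqf_max for value, count in rqf.items()}

# e.g. qf_weights(["Honda", "Honda", "Toyota"]) -> {"Honda": 1.0, "Toyota": 0.5}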
QFIDF Similarity
QF is purely workload based, i.e., it does not use the data at all. This may be a disadvantage in situations where we have insufficient or unreliable workloads.
QFIDF similarity is a remedy in such situations. It combines QF and IDF weights, so that even a value never referenced in the workload gets a small non-zero QF weight.
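A possible sketch of the combination (the ‘+1’ smoothing is what gives unseen values a small non-zero weight; idf here is a weight table such as the one produced by idf_weights above):

def qfidf_weight(value, rqf, rqf_max, idf):
    # QFIDF(t) = QF(t) * IDF(t), with QF(t) = (RQF(t) + 1) / (RQFMax + 1) so
    # that a value never referenced in the workload still gets a small weight.
    qf = (rqf.get(value, 0) + 1) / (rqf_max + 1)
    return qf * idf.get(value, 0.0)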
Breaking ties….
In the case of the many answers problem, the ranking functions discussed so far might fail to perform well.
This is because many tuples may tie for the same similarity score. Such a scenario could also arise for the empty answers problem.
Breaking this tie requires looking beyond the attributes specified in the query, i.e., at the missing attributes.
Many Answers Problem
We know by now that the many answers problem in database systems is the consequence of queries that are not very selective.
Such a query on a database system produces a large number of tuples that satisfy the conditions specified in the query.
Let us see how ranking of results is accomplished in such a scenario.
Basic Approach
Any ranking function for the many answers problem has to look beyond the attributes specified in the query, since all or a large number of tuples satisfy the specified conditions.
Determining precisely how to use the unspecified attributes is a challenging task. We show an adaptation of Probabilistic Information Retrieval (PIR) ranking methods.
Ranking function for Many Answers
Problem
The ranking function for the many answers problem is developed by adapting PIR models that best capture data dependencies and correlations.
The ranking function of a tuple depends on two factors: (a) a global score, and (b) a conditional score.
These scores can be computed through workload analysis as well as data analysis.
Ranking Function: Adaptation of PIR
Models for Structured Data
The basic philosophy of PIR models is that, given a document collection ‘D’, the set of relevant documents ‘R’, and the set of irrelevant documents R̄ (= D − R), any document ‘t’ in ‘D’ can be ranked by computing score(t), the probability that ‘t’ belongs to the relevant set ‘R’.
Problem in adapting this approach
The problem in computing score(t) using a PIR model for databases is that the relevant set ‘R’ is unknown at query time.
The approach is well suited to the IR domain, where ‘R’ is usually determined through user feedback.
User-feedback-based estimation of ‘R’ might be attempted in databases as well, but we propose an automated approach.
The ranking formula
Score(t) ∝ ∏_{y∈Y} [ p(y|W) / p(y|D) ] × ∏_{y∈Y} ∏_{x∈X} [ p(x|y, W) / p(x|y, D) ]
where X is the set of attribute values specified in the query, Y is the set of unspecified attribute values of tuple t, W is the workload and D is the database.
This is the final ranking formula that we will use to compute the scores of tuples in order to rank them.
The ranking formula is composed of two factors:
Global part of the score: measures the global importance of the unspecified attribute values
Conditional part of the score: measures the dependencies between the specified and unspecified attribute values
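A rough sketch of how the formula could be evaluated, assuming the atomic probabilities p(y|W), p(y|D), p(x|y,W) and p(x|y,D) have already been estimated and stored in plain dictionaries (the key names below are illustrative, not from the paper):

def score(row, specified, atomic):
    # specified: attribute -> value conditions from the query (the set X).
    # The remaining attribute values of the tuple form the unspecified set Y.
    x_values = list(specified.values())
    y_values = [v for attr, v in row.items() if attr not in specified]

    s = 1.0
    for y in y_values:
        # Global part: importance of the unspecified value on its own.
        s *= atomic["p_y_W"][y] / atomic["p_y_D"][y]
        # Conditional part: dependency between specified and unspecified values.
        for x in x_values:
            s *= atomic["p_xy_W"][(x, y)] / atomic["p_xy_D"][(x, y)]
    return s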
Architecture of Ranking System
Detailed Architecture of the Ranking System
Implementation
Pre-Processing: the pre-processing component is composed of the ‘Atomic Probabilities Module’ and the ‘Index Module’.
Atomic Probabilities Module – responsible for computing the several atomic probabilities necessary for the computation of the ranking function, Score(t).
Index Module – responsible for pre-computing the ranked lists necessary for improving the efficiency of the query processing module.
Implementation
Intermediate Layer: the atomic probabilities and the lists computed by the Index Module are all stored as database tables in the intermediate layer. All the tables in the intermediate layer are indexed on appropriate attributes for fast access during the later stages.
The primary purpose of the intermediate layer is to avoid computing the score from scratch each time a query is received, by storing the pre-computed results of all atomic computations.
The Index Module
The Index Module pre-computes ranked lists of tuples for each possible atomic query.
Its purpose is to take the run-time load off the query processing component.
To assist the query processing component in returning the Top-K tuples, ranked lists of the tuples for all possible “atomic” queries are pre-computed.
Taking the association rules and the database as input, a Conditional List and a Global List are created for each distinct value ‘x’ in the database, as sketched below.
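A rough sketch of what the Index Module could produce, reusing the atomic-probability dictionaries from the scoring sketch above (and, as a simplification, taking the products over all of a tuple's values rather than distinguishing specified from unspecified ones):

from collections import defaultdict

def build_index_lists(rows, atomic):
    global_list = defaultdict(list)   # x -> [(tuple_id, global score), ...]
    cond_list = defaultdict(list)     # x -> [(tuple_id, conditional score), ...]
    for tid, row in enumerate(rows):
        values = list(row.values())
        g = 1.0
        for y in values:
            g *= atomic["p_y_W"][y] / atomic["p_y_D"][y]
        for x in values:
            c = 1.0
            for y in values:
                c *= atomic["p_xy_W"][(x, y)] / atomic["p_xy_D"][(x, y)]
            global_list[x].append((tid, g))
            cond_list[x].append((tid, c))
    # Sort each list by descending score so later stages can scan prefixes.
    for lists in (global_list, cond_list):
        for x in lists:
            lists[x].sort(key=lambda pair: pair[1], reverse=True)
    return global_list, cond_list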
Query Processing Component
The List Merge algorithm is the key player in the query processing component.
Its function is to take the user query, compute scores for all the tuples that satisfy the conditions specified in the query, rank the tuples in sorted order of their scores, and then return the Top-K tuples.
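A simplified sketch of the merge step (a brute-force merge over the pre-computed lists above, not the paper's full threshold-style algorithm):

import heapq

def top_k(query_values, global_list, cond_list, k):
    # Consider only tuples that appear in every specified value's conditional
    # list, combine their global score with the conditional factors, and
    # return the k highest-scoring tuple ids.
    x_values = list(query_values)
    candidates = set(tid for tid, _ in cond_list[x_values[0]])
    for x in x_values[1:]:
        candidates &= set(tid for tid, _ in cond_list[x])

    global_score = dict(global_list[x_values[0]])
    scores = {tid: global_score[tid] for tid in candidates}
    for x in x_values:
        factors = dict(cond_list[x])
        for tid in candidates:
            scores[tid] *= factors[tid]
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])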
Space Requirements
Building the conditional and global lists consumes O(mn) bytes of space (where ‘m’ is the number of attributes and ‘n’ is the number of tuples of the database table).
There may be applications where space is an expensive resource.
In such cases, only a subset of the lists may be stored at pre-processing time, but this comes at the expense of an increase in query processing time.
What’s Next…..
The ranking function presented here works on single-table databases and does not allow the presence of NULL values.
A very interesting but nevertheless challenging extension of this work would be to develop ranking functions that work on multi-table databases and allow NULLs as well as non-text data in database columns.