Sideway Values

Download Report

Transcript Sideway Values

Sideway Value Algebra for Object-Relational
Databases
G. Ozsoyoglu
A. Al-Hamdani
I. S. Altıngovde
S.A. Ozel
O. Ulusoy
Z.M. Ozsoyoglu
Case Western Reserve University, USA
Bilkent University, Turkey
Attach
(a) Functions to relations “Sideway Value Fncts”
(b) Function evaluations to tuples of relations
“Sideway Values”
“Scores” :
Cohen, Sigmod’98
“Preference values” :
Agrawal and Wimmers, Sigmod’00
Hristidis, Papakonstantinou, Koudas, Sigmod’01
“Probabilistic values”: Barbara, Garcia-Molina, Porter, IEEE TKDE’92
Our Motivation: Recent database applications on the web
Sideway Values and Functions
Represent
•
Recommendations of Data Creators,
•
Preferences of users.
Importance Values
Employed for
•
User-guided query output ranking,
•
Top-K “query evaluation”.
Integrated with “approximate text joins”.
Web Resources: DBLP Bibliography and SIGMOD Anthology
Data Model (by Data Provider)
Topics
Tid
TName
TType
T01
Edward F. Codd
Researcher
T08 data models database mngement
TDomain TImp
Database
0.9
ResearchPaper Database
0.8
Edward F. Codd, “Data Models in Database Management”, 1980
Query: Database Researchers with a name similar to E. Codd?
Text similarity between Edward F. Codd and E. Codd is judged to be 0.7
Result: Tuple T01 is returned with importance value 0.63
(0.9 * 0.7)
Data Model
Topics
Relationships (Metalinks)
MId
MType
AntecedentId
ConsequentId
M01
ResearchPaperOf
T01
T08
“T08 is a research paper of T01”
Recursive Metalinks:
RelatedToPapers, PrerequisitePapers
Topic Sources (web documents)
Query (top-k, approximate text similarity)
Use the advice at www.expert.com/advice.
Find the 20 most-important papers
having index terms with similarity to “join algorithms” above 0.9
Query (time constraints, topic closure, threshold )
Use the advice at www.expert.com/advice .
Find in two minutes those Anthology papers that are prerequisites
of T08 with importance values above 0.7.
Summary So Far
• Importance values as sideway values/functions
• Top-k or threshold-based queries
• Regular-expression-based topic closure queries:
PapersWrittenBy(RelatedToPapers*(T08))
• Expensive metalink importance value computation *
RelatedToPapers, PrerequisitePapers
• Time Constraints *
Rest of the talk
• SQL extensions for sideway values
• Sideway Value Algebra (SVA) operators for
Selection
Join
Topic Closure
• Evaluations of SVA operators
Join
Topic Closure
• Experimental results
SVA queries propagate sideway values to output
relations in automated ways.
Query output sizes controlled by
(a) Ranking threshold k
(final output size control)
(b) Sideway value threshold Vt
(intermediate output size control)
(c) k and Vt
SVs attached to base relations
• Open Form: SV stored in column of base relation.
• Closed Form: SV derived thru closed function.
Ex: for relation Employee, we have
Imp (Salary,JobCode) = a.Salary + b.JobCode
• Semi-Closed Form: SV value for a set of
tuples identified through regular expression.
Ex:
Imp(<TopicName = “*kidney complications*”>) = 0.9
Approximate Text Similarity
• Vector space based similarity model.
• Vocabulary W for all words (terms).
• Represent each topic as a vector of |W| terms.
• TF-IDF model of IR, and cosine similarity.
SQL Extensions
• Database
using advice at www.xx as database
• Propagate Sideway Values
propagate importance as <type> function of <arg list>
• Stopping Conditions
stop after k
stop with threshold Vt
stop after k and with threshold Vt
20 highest topic-importance-ranked papers
having index terms with similarity to “join algorithms” above 0.9.
A product-based importance propagation function.
IndexedBy: SetOf IndexTermId  PaperId
select M.ConsequentId
using advice at www.expert.com/advice as database DB
from DB.Topics T, DB.Metalinks M
where T.TType=”Index Term” and
M.MType=”IndexedBy” and
T.TId is-in M.AntecedentId and
T.TName (0.9) ”join algorithms”
propagate importance as product function of T
stop after 20 most important
SVA Operators
For each RA operator,
there is SVA counterpart extended with
(a) Output SV function fout
(b) Output threshold  (ranking or sideway threshold )
 Operators with superscript * : SVA operators.
 Operators without superscript * : RA operators.
They carry sideway values of the left or right relation.
SVA Selection: σ*C, fout,  (R)
Input
•
•
•
•
Input relation R with sideway function fin
Selection Condition C
Output sideway propagation function fout
Output threshold  (k or Vt)
Output
=k : Top-k fout-ranking tuples that satisfy C
=Vt : Tuples with fout value greater than Vt and satisfy C
 = (k,Vt)
20 most-important papers
having index terms with similarity to “join algorithms” above 0.9
STOP AFTER (20)
ΠConsequentId
ORDER BYImportance
L
Logical Query Tree
L.TId in R.AntecedentId
σ MType "IndexedBy"
*TName (0.9) “join algorithms” and TType=”Index Term”,
fout=fin*sim(TName,”join algorithms”),  = 0.0
DB.Topics
DB.Metalinks
SVA Join: L
*A  B, fout,  R
Input
•
L and R with two functions flin and frin
•
Join Condition  on attributes A and B
•
Output sideway propagation function fout
•
Output threshold  (k or Vt)
Output: join of L and R
Sideway values of output tuples computed by fout.
Output threshold  satisfied.
Query: Five researchers with
(a) papers having index terms similar to “join algorithms” above 0.9
(b) geometric averages of highest importance values.
ResearchTopicOf: SetOf Index-termId  ResearcherId
select M.ConsequentId
using advice at www.expert.com/advice as database DB
from DB.Topics T, DB.Metalinks M
where TType = ”Index Term” and
M.MType = ”ResearchTopicOf” and
T.Td is-in M.AntecedentId and
T.TName (0.9) “join algorithms”
propagate importance as gmtrc-average function of T, M
stop after 5 most important
STOP AFTER (5)
 ConsequentId
ORDER BYImportance
*
L.TId in R.AntecedentId, fout=sqrt(flin*frin),  = 5
*

TName (0.9) “join algorithms”
σ MType "ResearchTopicOf"
and TType=”Index Term”,
fout=fin*sim(TName,”join algorithms”),  = 0.0
DB.Topics
DB.Metalinks
Semantics of TClosure
X
.
FPath:
P1
.
.
.
FPathMerge
. . .
Pn
t
Metalink types have associated axioms
Ex.
RelatedTo is transitive and reflexive
SubTopicOf is transitive but not reflexive
Computing topic closure requires
• Sound and complete set of axioms for metalink types,
and
• Polynomial-time algorithm to compute the topic closure
using the axioms.
Evaluation of SVA Join Operator
•
Nested-Loops-Based algorithms for top-k-based and
sideway-value-based threshold Join operators
•
Sorted input relations wrpt sideway values.
•
fout() function is monotone
e.g. Product, numeric average, geometric average
Nested-Loops Threshold Join Algorithm
• Enforce new stopping conditions while processing inner
and outer loops.
– Inner loop (S) exits: fout() value of the tuple ri*sj < Vt
(ri is in R and sj is in S).
– Outer loop (R) exits at the ith iteration: fout() value of
the output tuple ri.s1 < Vt
(ri is in R and s1 is the first tuple in S)
R
ri
S
sj
Nested-Loops Threshold Join Algorithm with Approximate
Text Match Join Conditions
• Similarity of text-valued join attributes > threshold tsim
• Use vector-space model.
– Term vector ur = <u1 u2 … ux> corresponding to join attribute A of
tuple r in R, where ui represents weight of term i in A
– Filter vector fS = <w1 ... wx>: each value wi is the max weight of
the corresponding term i among all vectors of S
• Cosine (ur, fS) : the maximal similarity of a record r in R
to any other record s in S
• if Cosine (ur, fS) < Vt then r can not be similar to any
tuple s in S with similarity > Vt
Evaluation of SVA TClosure Operator
• Compute Topic Closure X+ where regular expression R is a single
metalink type M
• Each metalink VM Tid is represented by a tuple in the Metalinks
table
V is a set of topic identifiers and Tid is a topic identifier
• All metalinks are RHS-decomposable.
• If a metalink type is LHS-decomposable then each metalink with V in
LHS is decomposed into multiple metalinks with a single topic in LHS
• Create index MIndex for all metalink instances
• TClosure algorithms use only MIndex to find
closure of a given set of topics.
• Create a second index HIndex to maintain all
nodes that are not decomposable.
0.8
Example 5.
T2
RelatedTo
0.6
0.9
T1
11
Pre
0..95
Pre
0.9
0.85
T3
33
0.7
0.95
Pre
0.9
T5
T4
MIndex
HIndex
ChildList
<Tid2, Imp(Tid2),
Imp(Mid)> triplets
MType
Tid1
Imp(Tid1)
ParentList
Pre
T1
0.9
{}
<T3,0.85,0.95>, <T4,0.95,0.9>
Pre
T3
0.85
{T1}
-
Pre
H1
Avg(0.9, 0.95) =
0.925
{T3, T4}
<T5,0.7,0.9>
RelatedTo
T1
0.9
{T2, T3, T4}
<T2,0.8,0.6>,<T3,0.85, 0.95>,
<T4,0.95,0.9>
Tid
NodeList <TidList, Hid>
T3
<{T3,T4}, H1>
T4
<{T3,T4}, H1>
Ranking-based Topic Closure
Algorithm
• First compute the initial candidate top k ranked topics
from input topics X.
• Then, in each iteration i,
– Extract ith top-ranked topic from current k-i+1
candidate top-ranked topics, and
– Update current candidate topics by processing all
emanating metalinks from ith topic.
• Algorithm requires k iterations in order to compute the
top-k-ranked topic closure of a set X of input topics.
Experimental Results
SVA Join Operator
• Extracted titles of journals and conf. papers from the
DBLP data set into two different files, R and S.
• For arbitrary predicates and monotone SV functions,
algorithms NLoopSVT and NLoopTop-k improve the
performance of (block-nested loops) BNL considerably.
•
For similarity-based joins, the algorithms are further
optimized, and more gains are obtained.
Comparison # (Millions)
Tuple Comparisons
12000
10500
9000
7500
6000
4500
3000
1500
0
BNL
NLoopTopK
NLoopSimTopK
0
100
200
300
400
500
Top-k
Tuple Comparisons
Comparison # (Millions)
12000
10500
BNL
NLoopSVT
NLoopSimSVT
9000
7500
6000
4500
3000
1500
0
0
0.2
0.4
0.6
SV threshold
0.8
1
Experimental Results
SVA Topic Closure Operator
• Synthetically generated data for
– Topics file with N, 1000 ≤ N ≤ 100,000, topics.
Topic importance values in [0.4, 1.0]
– Metalink file with metalinks/topics ratio of 3.
Metalink importance values in [0.4, 1.0]
– Set X of input topics with size= 100 topics.
Topic importance values in [0.5, 1.0]
• Disk-based MIndex file generated from topic and metalink files.
• In-memory buffer of size 100KB.
A sparse index table of size 1000 tuples.
Conclusions
• Sideway Values and Functions
• SQL Extensions and SVA
• Application: Web Querying