Talk - CSE, IIT Bombay

Download Report

Transcript Talk - CSE, IIT Bombay

XRANK: Ranked Keyword
Search over XML Documents
Lin Guo Feng Shao
Chavdar Botev Jayavel Shanmugasundaram
Presentation by:
Meghana Kshirsagar Nitin Gupta
Indian Institute of Technology, Bombay
Outline
Motivation
● Problem Definition, Query Semantics
● Ranking Function
● A New Datastructure – Dewey Inverted List
(DIL)
● Algorithms
● Performance Evaluation
●
Motivation
Motivation - I
●
●
Why do we need search over XML data?
Why not use search techniques used on
WWW (keyword search on HTML)?
Motivation - II
Keyword Search: XML Vs HTML
●
●
HTML
XML
structural
structural
Links: document-to-document
Tags: Format specifiers
●
●
ranking
ranking
●
●
●
Result: Document
Page-level ranking
Proximity:
● width: distance between
words
Links: IDREFs and Xlinks
Tags: Content specifiers
●
●
●
Result: XML element (a tree)
Element-level ranking
Proximity:
● width
● height
Problem Definition,
Query Semantics,
and Ranking
Problem Definition
●
Input:
Set of keywords
●
Output:
Ranked XML elements
What is a result?
How to rank results ?
Bird's eye view of the system
Query Keywords
Results
XML doc
repository
Query Evaluator
Preprocessing
(ElemRank computation)
Data Structures
(DIL)
What is a result?
●
A minimal Steiner tree of XML elements
●
Result-set is a set of XML elements that
●
includes a subset of elements containing all
query-keywords at least once, after excluding
the occurrences of keywords in contained
results (if any).
result 1
result 2
Result: Graphical representation
containment edge
ancestor
descendant
Ranking: Which results to return first?
Properties:
The Ranking function should
● reflect Result Specificity
● consider Keyword-Proximity
● be Hyperlink Aware
Ranking function:
f (height, width, link-structure)
Less specific result
More specific result
Ranking Function
For a single XML element (node):
t-1
r (v1, ki) = ElemRank ( vt ) . decay
v1
v
ki
t
Ranking Function
Combining ranks in case of multiple occurrences:
Overall Rank:
Semantics of the ranking function
Link structure
t-1
r (v1, ki) = ElemRank ( vt ) . decay
Specificity
(height)
Proximity
ElemRank Computation – adopt PageRank??
●
PageRank
●
Short-comings:
Fails to capture:
✗ bidirectional transfer of “ElemRanks”
✗ discrimination between edge-types (containment and
hyperlink)
✗ doesn't aggregate “ElemRanks” for reverse containment
relationships
ElemRank Computation - I
●
Consider Both forward and reverse ElemRank propagation.
Ne = total # of XML elements
● N (u) = # hyperlinks from 'u'
h
● N (u) = # children of 'u'
c
● E = HE U CE U CE'
● CE' = reverse containment edges
●
ElemRank Computation - II
●
Seperate containment and hyperlink edges
CE = containment edges
● HE = hyperlink edges
● ElemRank (sub elements)
α 1 / ( # sibling sub-elements )
●
ElemRank Computation - III
Sum over the reverse-containment edges, instead of
distributing the weight
●
Nd(u) = total # XML documents
● N
de(v) = # elements in the XML doc containing v
● ElemRank (parent)
α Sum (ElemRank(sub-elements))
●
Datastructures and Algorithms
Naïve Algorithm
Approach:
●
●
XML element ~ doc
Use “keyword search on WWW”
Limitations:
●
●
●
Space overhead (in inverted indices)
Failure to model Hierarchical relationships
(ancestor~decendent).
Inaccurate Ranking
Need a new datastructure which can model
hierarchical relationships !!
Answer: Dewey Inverted Lists
Labeling nodes using Dewey Ids
Dewey Inverted Lists
●
One entry per keyword
Entry for keyword 'k'
has Dewey-IDs of
elements directly
containing 'k'
●
Simple equi merge-join of Dewey-ID-lists won't work !
Need to compute prefixes.
System Architecture
DIL : Query Processing
●
●
●
●
●
Simple equality merge-join will not work
Need to find LCP (longest common prefix) over
all elements with query keyword-match.
Single pass over the inverted lists suffices!
Compute LCP while merging the ILs of
individual keywords.
ILs are sorted on Dewey-IDs
Datastructures
●
Array of all inverted lists : invertedList[]
●
●
●
●
invertedList[i] for keyword 'i'
each invertedList[i] is sorted on Dewey-ID
Heap to maintain top-m results : resultHeap
Stack to store current Dewey-ID, ranks,
position List, longest common prefixes :
deweyStack
Algorithm on DILs - Abstract
While all inverted-lists are not processed
●
Read the next entry from DIL having smallest Dewey-ID
●
call this 'currentEntry'
Find the longest common prefix (lcp) between stack components
and entry read from DIL
●
●
lcp (deweyStack , currentEntry)
Pop non-matching entries from Dewey-stack; Add result to heap if
appropriate
●
●
●
Push non-matching part of 'currentEntry' to 'deweyStack'
●
●
check if current top-of-stack contains all keywords
● if yes, compute OverallRank, put this result onto heap
● else
● non-matching entries are popped one component at a time and
update (rank, posList) on each pop
non-matching components of 'currentEntry.deweyID' are pushed onto
stack
Update components of top entry of deweyStack
Example
Query: “XQL Ricardo”
Algorithm Trace – Step 1
Rank[i] = Rank due to keyword 'i'
PosList[i] = List of occurrences of keyword 'i'
Smallest ID: 5.0.3.0.0
DIL: invertedList[]
DeweyStack
push all
components
and find
rank, posL
Algorithm Trace – Step 2
Smallest ID: 5.0.3.0.1
DIL: invertedList[]
DeweyStack
find lcp
and pop
nonmatching
components
Algorithm Trace – Step 3
Smallest ID: 5.0.3.0.1
DIL: invertedList[]
DeweyStack
updated
rank, posL
Algorithm Trace – Step 4
Smallest ID: 5.0.3.0.1
DIL: invertedList[]
DeweyStack
push
non-matching
components
Algorithm Trace – Step 5
Smallest ID: 6.0.3.8.3
DIL: invertedList[]
DeweyStack
find lcp,
update,
finally pop
all components
Problems with DIL
●
●
Scans the entire inverted-list for all keywords
before a result is output
Very inefficient for top-k computation
Other Techniques - RDIL
●
Ranked Dewey Inverted List:
● For efficient top-k result computation
● IL is ordered by ElemRank
● Each IL has a B+ tree index on the DeweyIDs
● Algorithm with RDIL uses a threshold
Algorithm using RDIL (Abstract)
●
Choose the next entry from one of the invertedList[] in a RoundRobin fashion.
●
●
●
Find the longest common prefix that contains all query-keywords
●
●
●
●
●
say chosen IL = invertedList[i]
d = top-ranked Dewey-ID from invertedList[i]
Probe the B+ tree index of all other keyword ILs, for the longest
common prefix
Claim:
● d2 = smallest Dewey-ID in invertedList[j] of query-keyword 'j'
● d3 = immediate predecessor of d2
● lcp = max_prefix (lcp ( d, d2) , lcp ( d, d3))
Check if 'lcp' is a complete result
Recompute 'threshold' = sum (ElemRank of last processed element
in each query keyword IL)
If (rank of top-k results on heap) >= threshold)
return;
Performance of RDIL
●
●
●
Works well for queries with highly correlated
keywords
BUT ! becomes equivalent (actually worse) to
DIL for totally uncorrelated keywords
Need an intermediate technique
HDIL
●
Uses both DIL and RDIL
●
Adaptive strategy:
●
–
Start with RDIL
–
Switch to DIL if performance is bad
Performance?
–
–
Estimated remaining time for RDIL = (m – r ) * t / r
●
t = time spent so far
●
r = no. of results above threshold so far
●
m = desired no. of results
Estimated remaining time for DIL ?
●
No. of query-keywords is known
●
Size of each IL is known
HDIL
●
Datastructures?
–
Store full IL sorted on Dewey-ID
–
Store small fraction of IL sorted on ElemRank
–
Share the leaf level between IL and B+ tree (in RDIL)
–
Overhead : top levels of B+ tree
Updating the lists
●
Updation is easy
●
Insertion – very bad!
–
techniques from Tatarinov et al.
–
we've seen a better technique in this course :) –
OrdPath
Evaluation
●
●
Criteria:
● no. of query-keywords
● correlation between query-keywords
● desired no. of query results
● selectivity of keywords
Setup:
● Datasets used: DBLP, Xmark
● d1 = 0.35, d2 = 0.25, d3 = 0.25
● 2.8GHz Pentium IV + 1GB RAM + 80GB HDD
Performance - 1
Performance - 2
Critique
New datastructure (DIL) defined to represent
hierarchical relationships accurately and
efficiently.
● Hyperlinks and IDREFs are considered only
while computing ElemRank. Not used while
returning results.
● Only containment edges (ancestor-descendant)
are considered while computing result trees.
● Works only on trees, can't handle graphs.
●
The SphereSearch Engine for
Unified Banked Retrieval of
Heterogenous XML and Web
Documents
Jens Graupmann
Ralf Schenkel
Gerhard Weikum
Max-Plack-Institut fur Informatik
Presentation by:
Nitin Gupta
Meghana Kshirsagar
Indian Institute of Technology Bombay
Why another search engine ?
●
●
●
To cope with diversity in the structures and
annotations of the data
Ranked retrieval paradigm for producing
relevance ordered results lists rather than a
mere boolean retrieval.
Short comings of the current search engines
–
Concept aware
–
Context aware (or link-awareness)
–
Abstraction aware
–
Query Language
Concept awareness
●
●
●
Example: researcher max planck yields many
results about researchers who work at the
institute “Max Plack” Society
Better formulation would be researcher
person=“max planck”
Objective attained by
–
Transformation to XML
–
Data Annotation
Concept awareness ::
Transformation
<Experiments>
<H1>Experiments</
H1>
... Text1 ...
<Settings>
... Text1 ...
... Text2 ...
<H2>Settings</H2>
</Settings>
... Text2 ...
<H1> ...
</Experiments>
...
Abstraction Awareness
●
●
●
Example: Synonyms, Ontologies
Is connection to various encyclopedias/ Wiki's
possible?
Objective attained by using
–
Ontology Service: provides quantified ontological
information to the system
–
Preprocessed information based on focused web
crawls to estimate statistical correlations between
the characteristic words of related concepts
Context Awareness
●
●
●
●
Query may not be answered by web search
engines as no single web page may be a match
Unlike usual navigation axes in XML, context
should go beyond trees.
Consider graph structure spanned by
Xlink/XPointer references and href hyperlinks
Objective attained by
–
introduction of the concept of a SPHERE
Context Awareness :: Sphere
●
What is a sphere?
–
Relevance of an element for a group of query
conditions is not just determined by its own content,
but also by the content of other neighboring
elements, including linked documents, in an
environment - called Sphere - of the element.
Query Language
●
●
●
Query S = (Q, J) consists of
–
set Q = { G1 .. Gq } of query groups
–
set J = { J1 .. Jm } of join conditions
Each Qi consists of
–
set of keyword conditions t1 .. tk
–
set of concept value conditions c1 = v1 ... cl = vl
Each join has the form Qi.v = (or ~) Qj.w
Query Language
●
Example:
German professors who
teach database courses
– P(professor, location=~Germany)
and have projects on
– C(course, ~databases)
XML
–
R(~project, ~XML)
–
A(gothic, church)
–
B(romanic, church)
–
A.location = B.location
Gothic and Romanic
churches at the same location
Data Model
●
●
●
Collection X = (D, L) of XML documents D together
with a set L of (href, Xpointer, or Xlink) links between
their elements
Consider all attributes as elements: then element level
graph GE(X) = (VE(X), EE(X)) has the union of all the
elements of the document as nodes and undirected
edges between them
Each edge has nonnegative weight
–
●
1 for parent-child; ‘λ’ for links
A distance function δX(x,y) : computes weight of a
shortest path in GE(X) between x and y
Spheres and Query Groups
●
●
Node-score ns(n,t) is computed using Okapi BM25 model
Similarity condition ~K: Compute exp(K) for the keyword. The
node score is defined as max xЄexp(K) sim(K,x) * ns(n,x)
where sim(K,x) is the ontological similarity
●
Concept value:
–
–
{
ns(n, c=v) =
0
ns(n,v)
if name(n) ≠ c
otherwise
●
Similarity concept value: ~c = v: sim(name(n), c) * ns(n,v)
●
This is insufficient
–
in the presence of linked documents
–
when content is spread over several elements
Spheres and Query Groups
Sphere Sd(n): set of nodes at distance d from node n
sd(n,t) = ∑ v Є Sd(n) ns(v,t)
s(n,t) = ∑ si(n,t) * αi
s(1,t) =
1 + 4*0.5 + 2*0.25 + 5*0.125
= 4.175
s(2,t) =
3 + 0*0.5 + 0*0.25 + 1*0.125 =
3.125
s(n, G) = ∑ j s(n,tj) + ∑ j s(n, cj=vj)
Spheres and Query Groups ::
Ranking
Create a connection graph G(N) = (V(N), E(N))
Weight of an edge between x,y:
0
1/ δx(x,y)+1
if x and y are not connected
otherwise
Compactness C(N) of a potential answer N is then the sum of the total edge
weights of a maximal spanning tree for G(N), and the score is given by:
s(N, S) = β C(N) + (1- β) ∑i s(ni, Gi)
Spheres and Query Groups ::
Joins
New virtual links to form an extended collection X' = (D, L')
–
Connect the elements that match the join
–
Similarity join: For Qi.v ~ Qj.w, consider sets N(v) (resp N(w)) with
name v (w) or contain v (w) in their content. For each pair x N(v),
y N(w) add a link {x,y} with weight 1/csim(x,y)
System Architecture
Focused web crawls used
to estimate statistical
correlations between the
characteristic words of
related concepts. Current
version uses Dice
coefficient.
Content stored in
inverted lists with
corresponding tf*idfstyle term statistics
Indexer stores with each element the
corresponding Dewer encoding of its
position within the document
Query Processor
1. First compute a result list for each query group
2. Add virtual links for join conditions
3. Compute the compactness of a subset of all potential
answers of the query in order to return the top-k results
1. Compute a list of results for each of query keywords and concept-value
conditions.
2. Candidate nodes: Nodes that are at distance at most D from any node that
occurs in at least one of the lists. Sphere score is computed only for these
nodes since only these can have a non-zero score!
3. For eachl candidate node N, look up the node scores of nodes in the sphere of
N, and adding these scores with a proper damping factor.
Query Processor
●
●
●
Virtual links: Processor considers only a limited set
of possible end points for efficient computation
Nodes in the spheres upto distance D around nodes
with nonzero sphere score for any query group
–
Why? Any other node will have distance atleast D+1 to
any results node and thus contributes at most 1/ (D+1)+1
to the compactness, which is negligible
–
This set of candidate nodes can be computed on the fly
Set further reduced by testing join attributes, for
example A.x = B.y results in two sets of potential end
points.
Query Processor
●
Generating answers
–
Naïve method: generate all possible potential
answers from the answers to query groups,
compute connection graphs and compactness, and
finally their score
–
For top-k answers, use Fagin's Threshold Algorithm
with sorted lists only
●
●
Input: Sorted list of node scores and pairwise node
scores (edges)
Output: k potential answers with the best scores
Experiments
●
●
Sun V40z, 16GB RAM, Windows 2003 Server, Tomcat
4.0.6 environment, Oracle 10g database
Benchmarks: XMach, Xmark, INEX, TREC
Does not consider
Designed for XQuery-style exact match Semantically poor
tags
XML at all
Wikipedia Collection from the Wikipedia project: HTML Collection transformed into
XML and annotated
Wikipedia++ Collection: Extension of Wikipedia with IMDB data, with generated XML
files for each movie and actor
DBLP++ Collection: Based on the DBLP project which indexes more than 480,000
publications
INEX: Set of 12,107 XML documents, a set of queries with and without structural
constraints
Experiments
Conversion from HTML to XML
Dataset Statistics
Experiments
●
SSE-basic: basic version limited to keyword conditions using spherebased scoring
●
SSE-CV: basic version plus concept-value conditions
●
SSE-QC: CV version plus query groups (full contest awareness)
●
SSE-Join: full version will all features
●
SSE-KW: very restricted version with simple keyword search
●
GoogleWiki: Google search restricted to Wikipedia.org
●
Google~Wiki: Google on wikipedia.org with Google's ~ operator for
query expansion
●
GoogleWeb: Google search on the entire web
●
Google~Web: Google search on the entire web with query expansion
Experiments
Aggregated results for Wikipedia
Experiments
Aggregated results for Wikipedia++ and DBLP++
Experiments
Graph showing the average runtimes for different versions
Thank you