CS561 - Research Paper Presentation

Download Report

Transcript CS561 - Research Paper Presentation

CS561 - Research Paper
Presentation
Kourdounakis Harry MET. 567
12/05/2009
CS561
1
Topic Introduction


The topic under consideration is full-text search
languages and more specifically ‘structered’ full-text
queries.
Why bother?



Many information systems deal with large document
collections with unknown or ill defined structure
Just keyword search isn’t enough. ( what about ordering,
distance etc? )
It is a rather new area of research trying to bridge the gap
between the IR/XML community.
12/05/2009
CS561
2
Expressiveness and
Performance of Full-Text
Search Languages
Chavdar Botev,Sihem Amer-Yahia, and
Jayavel Shamnugasindaram
12/05/2009
CS561
3
Contents





Introduction
The proposed model for full-text search
Incompleteness of Existing Full-Text Search
Languages
PPRED: Language and Query Evaluation
Proof of idea (Experiments)
12/05/2009
CS561
4
The Problem to be addressed




There is no formal basis for comparing full-text
languages.
Unlike SQL, there is no well-accepted language for
expressing complex full-text search queries
Existing languages are incomplete.
Some practical problems on full-text queries:



How to model?
How to optimize?
How to efficiently evaluate?
12/05/2009
CS561
5
Existing Work. What is missing (1/2)?

Some study has been made by the IR
community, but



it’s focus was primarily on simple structured full
text queries,
no fully composable language was developed,
no study of expressiveness, complexity
whatsoever.
12/05/2009
CS561
6
Existing Work. What is missing (2/2)?

Text Region Algebras (TRA’s).


A region s is represented as the ordered pair (s.l,s.r)
A query operator has the form:


{ sES | exists dED Pred(s,d) } where S,D are sets of regions
and Pred is a Boolean expression with the logical operators
‘and’ ‘or’ containing clauses of the form (x*y) where * is one of
{ =,<,>,<=,>= } and x E{ s.l, s.r, s.l+const, s.l-const, s.r+const,
s.r-const }, yE{ d.l,d.r } and const is a constant.
Although efficient algorithms have been devised to
evaluate TRA queries, their expressive power is
limited ( simultanious inclusion and ordering? )
12/05/2009
CS561
7
Main Contributions of research work




A new formal model for structured full-text search
and scoring, based on first order logic (FTC) and the
relational algebra (FTA) is introduced.
A notion of completeness of full text languages is
defined.
It is shown that existing languages are incomplete
with respect to the above completeness
A practical subset of the FTC and FTA which
subsumes TRA’s is defined and experimental
studies are done upon the performance of it.
12/05/2009
CS561
8
Contents





Introduction
A formal basis for Full-Text search languages
Incompleteness of Existing Full-Text Search
Languages
PPRED: Language and Query Evaluation
Proof of idea (Experiments)
12/05/2009
CS561
9
Approach to formalism and the full-text
search model (1/3)


The full-text queries are not modeled in one
specific syntax. A more general approach is
taken using FTC and FTA
The Full-Text Search Model:

Let N be the set of nodes, P the set of positions,
and T the set of tokens. The following functions
are defined:


12/05/2009
Positions: N -> 2^p that maps a node to the set of
positions in that node
Token: P -> T that maps each position to the token at
that position
CS561
10
Approach to formalism and the full-text
search model (2/3)




Queries are specified over a collection of nodes (
text docs, html, XML elements, relational tuples, etc
).
Positions uniquely identifies a token in a node.
Each context node is considered separately
Completeness requirement: The full-text search
language should be at least as expressive as first
order logic formulas specified over the positions of
tokens in a context node
12/05/2009
CS561
11
Approach to formalism and the full-text
search model (3/3)
Positions(cs561) = {1,2,4,5,6}
Token(1) = “An”
12/05/2009
CS561
12
Full-Text Calculus (FTC) 1/2

FTC base predicates:




SearchContext(node) is true iff node inN
hasPos(node,pos) is true iff pos in
Positions(node)
hasAsToken(pos,tok) is true iff tok=Token(pos)
FTC extensions:


Position based predicates,
pred(p1,…,pm,const1,…,constn).
Example: preds = { distance(pos1,pos2,dist),
ordered(pos1,pos2) }
12/05/2009
CS561
13
Full-Text Calculus (FTC) 2/2

An FTC query has the form:



{ node| SearchContext(node) and QueryExpresion(node) }
where QueryExpression is a FOL expression that specifies
the full-text search condition
A query example:


“find context nodes that contain the words ‘IR’ and ‘XML’ in
that order”
exists p1,p2 ( hasPos(node,p1) and hasAsToken(p1,”IR”)
and hasPos(node,p2) and hasAsToken(pos2,”XML”) and
ordered(p1,p2) )
12/05/2009
CS561
14
Full-Text Algebra (FTA) 1/2



FTA is defined upon the full-text relation data model which is of
the form R[node,attr1,…,attrN] m>=0 with node in N and atti in P.
FTA can be defined recursively as follows:
 Rtoken(node,attr1) is an expression. (node,pos) in R iff: node in
D and pos in Positions(node) and token= Token(pos)
 If Expr1 is a expression then π_node,attr1…,attrM(Expr1) is an
expressio
 If Expr1 and Expr2 are expression then Expr1 join Expr2 is an
expression.
 ... same for all other operators
An FTA query is an FTA expression that produces a s full-text
relation with a single attribute (the node)
12/05/2009
CS561
15
Full-Text Algebra (FTA) 2/2

An FTA query example

Π_node( select_ordered(att1,att2) ((R_”ir” join R_”xml”) ))
R_”IR”
cs561
cs463
2
7
cs561
R_”XML”
2
5
cs561 5
12/05/2009
CS561
16
Relevance Scoring


Instead of hard-coding the scoring method in
the framework, a general scoring framework
is taken
How is the score incorporated in the full-text
search model?


A per tuple scoring approach is taken
How do FTA operators transform tuple
scores? (if TF*IDF is used)


Join: t.score = t1.score/|r2| + t2.score/|r1|
Projection: t.score = t1.score+ … + tn.score
12/05/2009
CS561
17
Contents





Introduction
A formal basis for Full-Text search languages
Incompleteness of Existing Full-Text Search
Languages
PPRED: Language and Query Evaluation
Proof of idea (Experiments)
12/05/2009
CS561
18
Completeness

A full-text language L is said to be full-text
complete with respect to a set of positionbased predicates Preds iff all queries that can
be expressed in the FTC using Preds can
also be expressed in L
12/05/2009
CS561
19
Incompleteness of predicate-based
languages 1/2

Let DIST be the following predicate-based query
language:





Query := Token | Query op Query |
dist(Token,Token,Integer)
op := ‘AND’ | ‘OR’ | ‘AND NOT’
token := StringLiteral | ‘ANY’
The semantics of DIST can be recursively defined in
terms of the FTC
Theorem1: if |T| >=3, where T is the set of tokens
then there exists a query that can be expressed in
FTC with Preds={ distance(p1,p2,d) } that cannot be
expressed by DIST
12/05/2009
CS561
20
Incompleteness of predicate-based
languages 2/2

No query in DIST can express the following FTC:


The proof is by contradiction on the context nodes:




exists p1,p1,p3 ( (hasPos(n,p1) and hasAsToken(p1,t1))
and (hasPos(n,p2) and hasAsToken(p2,t2)) and
(hasPos(n,p3) and hasAsToken(p3,t3)) and
distance(p1,p2,0) and distance(p2,p3,0) )
CN1 = “t1 t2 t3”
CN2 = “t1 t2 t2 t3 t3 t1”
The previous FTC would only return CN1. This is not
the case of DIST which returns either both CN1,CN2
or none.
Why? ( dist(token1,token2,int) as opposed to
dist(p1,p2,int) )
12/05/2009
CS561
21
Incompleteness of text region algebras
1/2


Theorem2 : There exists a query that can be
expressed in FTC with Preds = {
ordered(p1,p2), samepara(p1,p2) } that
cannot be expressed in TRA
The following FTC query cannot be
expressed using TRA:

Exists p1,p2( hasPos(node,p1) and
hasAsToken(p1,t1) and hasPos(node,p2) and
hasAsToken(p2,t2) and ordered(p1,p2) and
samepara(p1,p2) )
12/05/2009
CS561
22
Incompleteness of text region algebras
2/2


It is proven that TRA’s cannot represent
simultaneously inclusion and ordering constraints.
Example?


Find documents with regions sES that contains two other
regions tET and uEU such that t comes before u.
Making the assumption that s,t are regions with
same start and end and set U to be regions
representing paragraphs concludes to the previously
defined FTC query. The theorem therefore follows
12/05/2009
CS561
23
Contents





Introduction
A formal basis for Full-Text search languages
Incompleteness of Existing Full-Text Search
Languages
PPRED: Language and Query Evaluation
Proof of idea (Experiments)
12/05/2009
CS561
24
Defining Positive Predicates

An FTC query evaluation algorithm can only be:



polynomial in the size of the data and
exponential in the size of the query
Can we do better?


Observation: many full-text search predicates are true in a
contiguous region of the position space.
These predicates are called Positive predicates and can be
efficiently evaluated by scanning nodes in increasing order
of positions. (single scan)
12/05/2009
CS561
25
PPRED Language




Based on previous observations the PPRED
language is defined which is a subset of FTC
The syntax is as follows:
PPRED is a strict super set if DIST and TRA’s
Example: some p1 has ‘IR’ some p2 has
‘XML’ ordered(p1,p2)
12/05/2009
CS561
26
Query Evaluation 1/3
Each R_token relation is represented as an inverted
list associated to token. Each entry in R_token is of
the form (node,PosList,score)

“judge” inverted list
“assignment” inverted list
“District” inverted list
n
PosList
n PosList
n PosList
1
75
90 105 140
81
1
75
1
51
89
12/05/2009
85 97
83 210
CS561
80 99 139
56 59
96 102 108
27
Query Evaluation 2/3



A query is first converted to FTA operators
Since it is not wanted to materialize the out put
relation each operator exposes a new API for
traversing it’s output
The API maintains the following state:


node, which is the current node and
list<Position> which track the current positions in the
node
12/05/2009
CS561
28
Query Evaluation 3/3

The API defines the following methods

advanceNode():



node = projection_node(R)
List<Position> = the smallest positions that appear in R
for that node
andvancePosition(i,pos):

12/05/2009
List<Position> = the smallest positions that appear in R
and satisfy condition pi>pos
CS561
29
API Example
advancePosition(p1,4)
Node1
Node1
Node1
Node2
Node2
12/05/2009
advanceNode()
5
3
7
19
23
6
4
8
20
21
CS561
30
Query Evaluation 4/4

Given the specific query plan the
execution would be as follows:
 advanceNode() is called on the
top project which forwards the
call to the distance selection
operator below it
 The latter continiously calls
advancePosition() on the
ordered predicate until it finds a
satisfying tuple of positions
 The ordered predicates
advance through the result of
the underlying operator until it
finds a tuple that satisfies it
 The evaluation proceeds down
the tree.
12/05/2009
CS561
31
Contents





Introduction
A formal basis for Full-Text search languages
Incompleteness of Existing Full-Text Search
Languages
PPRED: Language and Query Evaluation
Proof of idea (Experiments)
12/05/2009
CS561
32
Query Evaluator Complexity






O(entries_per_token*pos_per_entry*toksQ*(predsq+ops+1))
entries_per_token = maximum num of entries in the inverted
list
pos_per_entry = maximum number of positions in an entry
in an inverted list
toksQ = number of keywords in q
predsQ = number of predicates in query
opsQ = number of operators in a query
12/05/2009
CS561
33
Experimental Results
•The performance of PPRED was compared to:
•BOOL which is a restriction of DIST
•REL which is the implementation of FTA
using regular relational operators.
12/05/2009
CS561
34
QUESTIONS?
12/05/2009
CS561
35
TexQuery: A Full-Text Search
Extension to XQuery
Chavdar Botev,Sihem Amer-Yahia, and
Jayavel Shamnugasindaram
12/05/2009
CS561
36
Contents






Introduction
Design Goals
Limitations of existing framework
The TexQuery Language
TexQuery Fromal Semantics
Conclusions
12/05/2009
CS561
37
Current State and facts


XML also has unstructured information ( ie.
Library of Congress in XML, etc)
Current query languages such as
Xpath/Xquery provide powerful structured
queries but cannot express complex full-text
queries.

Expressiveness limited to contains() function
12/05/2009
CS561
38
What is missing?

A full-text query language that:




can provide complex full-text search conditions ( Boolean
connectives, distance predicates, etc )
could be easily integrated with existing query languages
such as XQuery ( respect to sequence of items’ data model
)
is fully compositional ( closed under a particular data model
)
Research work’s contribution:

TexQuery and the underlying FullMatch data model
12/05/2009
CS561
39
Challenges

Integrating full-text search languages in
XQuery is challenging because:



Need to identify powerful and composable set of
full-text primitives
seamlessly integrate regular XQuery with full-text
search capabilities is not trivial. ( nodes vs
keyword tokens ).
A notion of ranking should be supported
12/05/2009
CS561
40
Contents






Introduction
Design Goals
Limitations of existing framework
The TexQuery Language
TexQuery Fromal Semantics
Conclusions
12/05/2009
CS561
41
Searching over semi-structured data


Users should be able to specify the search
context, or the context over which the full-text
predicate is to be performed
Users should be able to specify the return
context, or part of the document collection to
be returned
12/05/2009
CS561
42
Expressive power, extensibility and
ranking






Users should be able to express complex full-text searches
The language should be extensible with respect to new full-text
primitives
Users should be able to obtain relevance scores for the results
Users should be able to control how scores are computed
Users should be able to obtain the top-k results based on their
relevance score
Users should be able to specify a scoring condition which is
possibly different from the full-text search condition
12/05/2009
CS561
43
Integrating with XQuery, Language Syntax
and Efficiency






Users should be able to embed full-text searches in XQuery
expressions and vice versa
XQuery’s capabilities should be leveraged whenever possible
There should be no extension to the XQuery data model
It should be possible to statically verify that a query is
syntactically correct
The language syntax should allow for static type checking and
inference
The language should allow an efficient implementation
12/05/2009
CS561
44
Contents






Introduction
Design Goals
Limitations of existing framework
The TexQuery Language
TexQuery Fromal Semantics
Conclusions
12/05/2009
CS561
45
Is extending current XQuery functions
enough?

Why not create a a function for every complex full-text searsch
predicate?
The XQuery data model would need to be changed. Search token
positions would need to be captured

How about creating a ‘universal’ contains function for all complex
full-text predicates?
Uninterpreted string that is opaque to the XQuery language. Problems
When trying to embed with XQuery
12/05/2009
CS561
46
Contents






Introduction
Design Goals
Limitations of existing framework
The TexQuery Language
TexQuery Fromal Semantics
Conclusions
12/05/2009
CS561
47
TexQuery Overview

TexQuery introduces 2 new expressions:




FTConatinsExpr and
FTScoreExpr
These expressions use a set of fully
composable full-text primitives called
FTSelections
As all other XQuery expressions:
‘Sequence
of items’
12/05/2009
TexQuery
Expr
CS561
‘Sequence
of items’
48
TexQuery Expressions 1/2

FTContainsExpr




expr ‘ftcontains’ FTSelection
expr is an XQuery expression that defines the search
context and FTSelection specifies the search condition
This condition returns true or flase
Examples:
12/05/2009
CS561
49
TexQuery Expressions 2/2

FTScoreExpr




expr ‘ftscore’ FTSelectionWithWeights
expr is an XQuery expression that defines the search
context and FTSelectionWithWeights specifies the search
condition. Weights are used for scoring
This condition returns: Seq<float(0,1,inclusive)>, A score
for every node returned.
Examples:
12/05/2009
CS561
50
FTSelections and FTContextModifiers


FTSelection can be either a simple word token or a
more complex ‘predicate’
FTContextModiefiers can be applied to on any
FTSelection to modify how the full-text search is
performed
12/05/2009
CS561
51
Contents






Introduction
Design Goals
Limitations of existing framework
The TexQuery Language
TexQuery Fromal Semantics
Conclusions
12/05/2009
CS561
52
The FullMatch Data Model



The XQuery’s data model is inadequate for fully
compositional FTSelections since it is defined at the
granularity of XML nodes
The solution? A new data model based on the
positions of the linguistic tokens within XML nodes
A position represents an occurrence of a linguistic
token in an XML and contains information about:

the token, a unique identifier, the xml node containing it, the
relative position of the sentence, the relative position of the
paragraph, the context ( tag name, attribute, etc )
12/05/2009
CS561
53
The Full Match Data Model
The building block of FullMatch
An annotated example of XML
12/05/2009
CS561
54
The Full Match Data Model


A Full match is
essentially a
propositional logic
disjunctive normal form
(DNF) predicate
specified using XML
positions
The predicates capture
the precise condition
that an XML node
needs to satisfy in order
to be a result of for a
full-text search.
12/05/2009
CS561
Each simple match represents a
Possible solution
55
The FullMatch Data Model


It is obvious that full match has a hierarchical
structure thus can be expressed in XML. A
possible DTD follows:
The benefits that derive from the definition of FullMatch are:
 It ensures that FTSelections are fully composable
 TexQuery fully extensible. Only semantics in FullMatch
need to be defined for new FTSelection primitives
12/05/2009
CS561
56
Semantics of TexQuery Expressions

The big picture:

In order to define the semantics of TexQuery
expression the following 2 implementation
defined functions are used:
12/05/2009
CS561
57
The Semantics of FTContainsExpr

The function
defined returns true
iff some node in
the search context
satisfies at least
one of the
SimpleMatches
FullMatch
BOOL
FTContainsExprs
Node*
12/05/2009
CS561
58
The Semantics of FTScoreExpr

This function
returns a score in
the interval [0,1]
using a call to
implementation
specific function
fts:score
FullMatch
FLOAT*
FTScoreExprs
Node*
12/05/2009
CS561
59
The Semantics of FTStringSelection

FTStringSelection ::= Expr
Transforms a search token
into a FullMatch
12/05/2009
CS561
60
FTAndConnective semantics 1/2


FTAndConnective ::= FTSelection ‘&&’ FTSelection
Each SimpleMatch in
the Resulting FullMatch
is a combination of on
SimpleMatch from the
first input FullMatch and
on SimpleMatch from
the second input
FullMatch
12/05/2009
CS561
61
FTAndConnective semantics 2/2
12/05/2009
CS561
62
FTScopeSelection semantics

The
FTScopeSelectionSemantics
take takes the FullMatch
corresponding to its input and
restricts the SimpleMatches
12/05/2009
CS561
63
Contents






Introduction
Design Goals
Limitations of existing framework
The TexQuery Language
TexQuery Fromal Semantics
Questions?
12/05/2009
CS561
64