Transcript Document

Chapter 4
Query Languages
..
.
Introduction


Cover different kinds of queries posed
to text retrieval systems
Keyword-based query languages


Pattern matching


include simple words and phrases as
well as Boolean operators
complement keyword searching with
data retrieval capabilities
Structural queries

querying on structure of text
Keyword-Based Querying


Query is formulation of user information
need
Keyword-based queries are popular
intuitive
 easy to express
 allow for fast ranking


Query can be simply a word

in general more complex combination of
operations involving several words
Single-Word Queries


Most elementary query is a word
Word is sequence of letters surrounded
by separators


some characters are not letters but do not
split a word, e.g. hyphen in on-line
Result of word queries is


set of documents containing at least one of
the words in query
resulting documents are ranked


term frequency (count of word in document)
inverse document frequency (count of no. of
documents in which word appears)
Context Queries



Complement single-word queries with
ability to search words in given
context, I.e. near other words
words near each other signal higher
likelihood of relevance than if they
appear apart
form phrases of words or find words
which are proximal in text
Phrase



Sequence of single-word queries
for instance, possible to search for
word ‘enhance’ and then word
‘retrieval’
uninteresting words in text are not
considered at all

e.g. above example query could match
text such as ‘…enhance the retrieval…’
Proximity


Sequence of single words or phrases
is given together with maximum
allowed distance between them
For instance, above query stated as
‘enhance’ and ‘retrieval’ should occur
within four words
 a possible match could be ‘… enhance
the power of retrieval…’


Distance can be measured in characters
or words
Boolean Queries


Oldest form of combining keyword
queries is to use Boolean operators
Boolean query has following syntax
 atoms (I.e. basic queries) that retrieve
documents, and of

Boolean operators which work on their
operands (sets of documents)

query syntax tree can be defined
 leaves
are basic queries
 internal nodes are operators
Boolean Queries (Cont.)

Retrieve all documents that contain the word
‘translation’ as well as either the word ‘syntax’ or
the word ‘syntactic’
AND
OR
translation
syntax
syntactic
Boolean Queries (Cont.)

No ranking of retrieved documents
provided



document either satisfies query (retrieved)
or does not (not retrieved)
does not allow partial matching between
document and user query
to overcome this limitation, idea of ‘fuzzy
Boolean’ set of operators proposed

instead of all the operands (AND) or at
least in one of operands (OR), retrieve
elements in some operands
Natural Language





Distinction between AND and OR
completely blurred
simply an enumeration of words and
context queries
all documents matching portion of user
query are retrieved
higher ranking assigned to documents
matching more parts of query
eliminated any reference to Boolean
operators
Pattern Matching




Query formulation based on concept of
pattern that allow retrieval of pieces of
text that have some property
Pattern is set of syntactic features that
occur in text segment
Segments satisfying pattern specification
said to ‘match’ the pattern
We are interested in documents
containing segments that match given
search pattern
Pattern Matching (Cont.)

Most used types of pattern are

words


prefixes



string (sequence of characters) that is a word
in text
string that form beginning of text word
prefix ‘comput’ retrieve documents with words
such as ‘computer’, ‘computation’
suffixes


string that form termination of word
suffix ‘ters’ retrieve documents with words such
as ‘testers’, ‘computers’

substrings



string which can appear within word
substring ‘tal’ retrieve documents with words
such as ‘coastal’, ‘talk’, ‘metallic’
ranges



pair of strings that match any word lying
between them in lexicographical order
alphabets sorted to order string into
lexicographical order (dictionary order)
range between words ‘held’ and ‘hold’ retrieve
strings such as ‘hoax’, ‘hissing’

allowing errors





word together with error threshold
retrieves all text words ‘similar’ to given word
pattern may have errors (typing, spelling) and
documents with words with erroneous variants
are retrieved (with edit distance)
if typing error splits ‘flower’ into ‘flo wer’, still
found with one error
regular expression (r.e.)


r.e. is built up by simple strings and operators
like union, concatenation and repetition
query like ‘pro (plem | tein) (s | ) (0 | 1 | 2)*’
will match words like ‘problem02’, ‘proteins’

Extended patterns




subset of regular expressions expressed with
simpler syntax
classes of characters, I.e. some position in
pattern matched by any character from predefined set (e.g. some characters must be digit,
not a letter, vowel, etc.)
conditional expressions, I.e. part of pattern
may or may not appear
wild characters, I.e. match any sequence in
text (e.g. any word starts as ‘flo’ and ends with
‘ers’ which match ‘flowers’ as well as ‘flounders’
Structural Queries



Allowing user to query texts based on
structure, and not content
mixing contents and structure in queries
can pose powerful queries (much more
expressive)
An example


select set of documents that satisfy certain
constraints on content (using word, phrase, or
patterns) and then
structural constraints expressed using containment,
proximity, or chapters, sections present in
documents

Types of structures

fixed structure

hypertext

hierarchical structure
Fixed Structure






Document has fixed set of fields
each field has some text inside
some fields not present in all documents
fields not allowed to nest or overlap
retrieval activity restricted to specifying
that given pattern was to be found only in
given fields
this model reasonable when text
collection has fixed structure
Hypertext




Retrieval from hypertext began as
navigational activity
user manually traverse hypertext nodes
following links to search what he wanted
not possible to query hypertext based on
its structure
WebGlimpse - interesting proposal to
allow navigation plus ability to search by
content in neighborhood of current node
Hierarchical Structure



Represent recursive decomposition of
text
natural model for many text collections
Figure 4.3 shows example of hierarchical
structure that consists of page of a book,
its schematic view and parsed query to
retrieve figure
Hierarchical Models
PAT Expressions

Structure marked in the text by tags (as
in HTML)


each pair of initial and final tags defines
a region, set of contiguous text areas


defined in terms of initial and final tags
area of region cannot nest or overlap
possible to select areas containing other
areas, contained in other areas, or
followed by other areas
Overlapped Lists



Allows area of regions to overlap, but not
to nest
considers use of inverted list where
words and also regions are indexed
allows to perform set union, and to
combine regions
List of References



Attempt to make definition and querying
of structured text uniform, using common
language
the language allows for querying on ‘path
expressions’, which describe paths in
structure tree
answers to queries are list of ‘references’

reference is pointer to region of database
Proximal Nodes




Tries to find good compromise between
expressiveness and efficiency
specifies fully compositional language
where leaves of query syntax tree
formed by basic queries on contents or
names of structural elements (e.g. all
chapters)
internal nodes combine results
for efficiency, operations at internal
nodes must relate nodes close in text
Tree Matching


Relies on single primitive: tree inclusion
interpret structure of text database and
query as trees


determine embedding of query into
database which respects hierarchical
relationships between nodes of query
simple queries return roots of the
matches