Transcript Document
Chapter 4
Query Languages
..
.
Introduction
Cover different kinds of queries posed
to text retrieval systems
Keyword-based query languages
Pattern matching
include simple words and phrases as
well as Boolean operators
complement keyword searching with
data retrieval capabilities
Structural queries
querying on structure of text
Keyword-Based Querying
Query is formulation of user information
need
Keyword-based queries are popular
intuitive
easy to express
allow for fast ranking
Query can be simply a word
in general more complex combination of
operations involving several words
Single-Word Queries
Most elementary query is a word
Word is sequence of letters surrounded
by separators
some characters are not letters but do not
split a word, e.g. hyphen in on-line
Result of word queries is
set of documents containing at least one of
the words in query
resulting documents are ranked
term frequency (count of word in document)
inverse document frequency (count of no. of
documents in which word appears)
Context Queries
Complement single-word queries with
ability to search words in given
context, I.e. near other words
words near each other signal higher
likelihood of relevance than if they
appear apart
form phrases of words or find words
which are proximal in text
Phrase
Sequence of single-word queries
for instance, possible to search for
word ‘enhance’ and then word
‘retrieval’
uninteresting words in text are not
considered at all
e.g. above example query could match
text such as ‘…enhance the retrieval…’
Proximity
Sequence of single words or phrases
is given together with maximum
allowed distance between them
For instance, above query stated as
‘enhance’ and ‘retrieval’ should occur
within four words
a possible match could be ‘… enhance
the power of retrieval…’
Distance can be measured in characters
or words
Boolean Queries
Oldest form of combining keyword
queries is to use Boolean operators
Boolean query has following syntax
atoms (I.e. basic queries) that retrieve
documents, and of
Boolean operators which work on their
operands (sets of documents)
query syntax tree can be defined
leaves
are basic queries
internal nodes are operators
Boolean Queries (Cont.)
Retrieve all documents that contain the word
‘translation’ as well as either the word ‘syntax’ or
the word ‘syntactic’
AND
OR
translation
syntax
syntactic
Boolean Queries (Cont.)
No ranking of retrieved documents
provided
document either satisfies query (retrieved)
or does not (not retrieved)
does not allow partial matching between
document and user query
to overcome this limitation, idea of ‘fuzzy
Boolean’ set of operators proposed
instead of all the operands (AND) or at
least in one of operands (OR), retrieve
elements in some operands
Natural Language
Distinction between AND and OR
completely blurred
simply an enumeration of words and
context queries
all documents matching portion of user
query are retrieved
higher ranking assigned to documents
matching more parts of query
eliminated any reference to Boolean
operators
Pattern Matching
Query formulation based on concept of
pattern that allow retrieval of pieces of
text that have some property
Pattern is set of syntactic features that
occur in text segment
Segments satisfying pattern specification
said to ‘match’ the pattern
We are interested in documents
containing segments that match given
search pattern
Pattern Matching (Cont.)
Most used types of pattern are
words
prefixes
string (sequence of characters) that is a word
in text
string that form beginning of text word
prefix ‘comput’ retrieve documents with words
such as ‘computer’, ‘computation’
suffixes
string that form termination of word
suffix ‘ters’ retrieve documents with words such
as ‘testers’, ‘computers’
substrings
string which can appear within word
substring ‘tal’ retrieve documents with words
such as ‘coastal’, ‘talk’, ‘metallic’
ranges
pair of strings that match any word lying
between them in lexicographical order
alphabets sorted to order string into
lexicographical order (dictionary order)
range between words ‘held’ and ‘hold’ retrieve
strings such as ‘hoax’, ‘hissing’
allowing errors
word together with error threshold
retrieves all text words ‘similar’ to given word
pattern may have errors (typing, spelling) and
documents with words with erroneous variants
are retrieved (with edit distance)
if typing error splits ‘flower’ into ‘flo wer’, still
found with one error
regular expression (r.e.)
r.e. is built up by simple strings and operators
like union, concatenation and repetition
query like ‘pro (plem | tein) (s | ) (0 | 1 | 2)*’
will match words like ‘problem02’, ‘proteins’
Extended patterns
subset of regular expressions expressed with
simpler syntax
classes of characters, I.e. some position in
pattern matched by any character from predefined set (e.g. some characters must be digit,
not a letter, vowel, etc.)
conditional expressions, I.e. part of pattern
may or may not appear
wild characters, I.e. match any sequence in
text (e.g. any word starts as ‘flo’ and ends with
‘ers’ which match ‘flowers’ as well as ‘flounders’
Structural Queries
Allowing user to query texts based on
structure, and not content
mixing contents and structure in queries
can pose powerful queries (much more
expressive)
An example
select set of documents that satisfy certain
constraints on content (using word, phrase, or
patterns) and then
structural constraints expressed using containment,
proximity, or chapters, sections present in
documents
Types of structures
fixed structure
hypertext
hierarchical structure
Fixed Structure
Document has fixed set of fields
each field has some text inside
some fields not present in all documents
fields not allowed to nest or overlap
retrieval activity restricted to specifying
that given pattern was to be found only in
given fields
this model reasonable when text
collection has fixed structure
Hypertext
Retrieval from hypertext began as
navigational activity
user manually traverse hypertext nodes
following links to search what he wanted
not possible to query hypertext based on
its structure
WebGlimpse - interesting proposal to
allow navigation plus ability to search by
content in neighborhood of current node
Hierarchical Structure
Represent recursive decomposition of
text
natural model for many text collections
Figure 4.3 shows example of hierarchical
structure that consists of page of a book,
its schematic view and parsed query to
retrieve figure
Hierarchical Models
PAT Expressions
Structure marked in the text by tags (as
in HTML)
each pair of initial and final tags defines
a region, set of contiguous text areas
defined in terms of initial and final tags
area of region cannot nest or overlap
possible to select areas containing other
areas, contained in other areas, or
followed by other areas
Overlapped Lists
Allows area of regions to overlap, but not
to nest
considers use of inverted list where
words and also regions are indexed
allows to perform set union, and to
combine regions
List of References
Attempt to make definition and querying
of structured text uniform, using common
language
the language allows for querying on ‘path
expressions’, which describe paths in
structure tree
answers to queries are list of ‘references’
reference is pointer to region of database
Proximal Nodes
Tries to find good compromise between
expressiveness and efficiency
specifies fully compositional language
where leaves of query syntax tree
formed by basic queries on contents or
names of structural elements (e.g. all
chapters)
internal nodes combine results
for efficiency, operations at internal
nodes must relate nodes close in text
Tree Matching
Relies on single primitive: tree inclusion
interpret structure of text database and
query as trees
determine embedding of query into
database which respects hierarchical
relationships between nodes of query
simple queries return roots of the
matches