Selecting Relevant Documents
• Assume:
– we already have a corpus of documents defined.
– goal is to return a subset of those documents.
– Individual documents have been separated into
individual files
• Remaining components must parse, index,
find, and rank documents.
• Traditional approach is based on the words
in the documents
Extracting Lexical Features
• Process a string of characters
– assemble characters into tokens
– choose tokens to index
• In place (problem for www)
• Standard lexical analysis problem
• Use a Lexical Analyser Generator, such as lex
Lexical Analyser
• Basic idea is a finite state machine
• Triples of input state, transition token,
output state
[State diagram: start state 0, in-token state 1, final state 2; transitions labelled A-Z, blank, and blank/EOF]
• Must be very efficient; gets used a LOT
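A minimal sketch of such a machine in Python (the state numbering follows the diagram above; restricting input characters to letters is an assumption for illustration):

```python
# Sketch of a three-state tokenizer: state 0 = between tokens,
# state 1 = inside a token, state 2 = finished (EOF).
def tokenize(text):
    tokens = []
    current = []
    state = 0
    for ch in text + "\0":          # "\0" stands in for EOF
        if state == 0:
            if ch.isalpha():        # transition (0, A-Z, 1)
                current.append(ch)
                state = 1
        elif state == 1:
            if ch.isalpha():        # transition (1, A-Z, 1)
                current.append(ch)
            else:                   # transition (1, blank/EOF, 0 or 2)
                tokens.append("".join(current))
                current = []
                state = 2 if ch == "\0" else 0
    return tokens

print(tokenize("the cat sat"))      # ['the', 'cat', 'sat']
```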
Design Issues for Lexical Analyser
• Punctuation
– treat as whitespace?
– treat as characters?
– treat specially?
• Case
– fold?
• Digits
– assemble into numbers?
– treat as characters?
– treat as punctuation?
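One way to make these design choices explicit is to expose them as tokenizer options; this is a sketch only, with invented parameter names:

```python
import re

def tokenize(text, fold_case=True, keep_digits=True, punct_as_space=True):
    # Punctuation: either treat as whitespace or keep as characters.
    if punct_as_space:
        text = re.sub(r"[^\w\s]", " ", text)
    # Case: optionally fold everything to lower case.
    if fold_case:
        text = text.lower()
    # Digits: optionally assemble into number tokens, otherwise drop them.
    pattern = r"[a-z]+|\d+" if keep_digits else r"[a-z]+"
    return re.findall(pattern, text, flags=re.IGNORECASE)

print(tokenize("IBM sold 4,000 PCs!"))   # ['ibm', 'sold', '4', '000', 'pcs']
```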
Lexical Analyser
• Output of lexical analyser is a string of
tokens
• Remaining operations are all on these
tokens
• We have already thrown away some
information; this makes processing more
efficient, but limits the power of our search
Stemming
• Additional processing at the token level
• Turn words into a canonical form:
– “cars” into “car”
– “children” into “child”
– “walked” into “walk”
• Decreases the total number of different
tokens to be processed
• Decreases the precision of a search, but
increases its recall
Stemming -- How?
• Plurals to singulars (eg, children to child)
• Verbs to infinitives (eg, talked to talk)
• Clearly non-trivial in English!
• Typical stemmers use a context-sensitive
transformation grammar:
– (.*)SSES -> \1SS
• 50-250 rules are typical
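A minimal sketch of a handful of such suffix-rewrite rules; this toy rule set is illustrative and nowhere near the 50-250 rules of a real stemmer:

```python
import re

# Each rule is (pattern, replacement); applied in order, first match wins.
RULES = [
    (r"(.*)sses$", r"\1ss"),     # caresses -> caress
    (r"(.*)ies$",  r"\1y"),      # ponies   -> pony
    (r"(.*[a-z])s$", r"\1"),     # cars     -> car
    (r"(.*)ed$",   r"\1"),       # walked   -> walk
    (r"(.*)ing$",  r"\1"),       # walking  -> walk
]

def stem(token):
    for pattern, repl in RULES:
        if re.fullmatch(pattern, token):
            return re.sub(pattern, repl, token)
    return token

print([stem(w) for w in ["caresses", "ponies", "cars", "walked"]])
# ['caress', 'pony', 'car', 'walk']
```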
Noise Words (Stop Words)
• Function words that contribute little or
nothing to meaning
• Very frequent words
– If a word occurs in every document, it is not
useful in choosing among documents
– However, need to be careful, because this is
corpus-dependent
• Often implemented as a discrete list (stop.wrd on CD)
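A sketch of stop-word filtering; the short list below is a stand-in for a real, corpus-dependent list such as stop.wrd:

```python
# Small illustrative stop list; a real system would load one (e.g. stop.wrd)
# or derive it from the corpus by dropping words that occur in every document.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def corpus_stop_words(docs):
    """Words that appear in every document: useless for discriminating."""
    doc_sets = [set(doc) for doc in docs]
    return set.intersection(*doc_sets) if doc_sets else set()

print(remove_stop_words(["the", "cat", "sat", "on", "the", "mat"]))
# ['cat', 'sat', 'on', 'mat']  ("on" survives because this tiny list omits it)
```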
Example Corpora
• We are assuming a fixed corpus. Text uses
two sample corpora:
– AI Abstracts
– Email. Anyone’s email.
• Textual fields, structured attributes
• Textual: free, unformatted, no metainformation
• Structured: additional information beyond
the content
Structured Attributes for AI Theses
• Thesis #
• Author
• Year
• University
• Advisor
• Language
Textual Fields for AIT
• Abstract
– Reasonably complete standard academic
English capturing the basic meaning of the
document
• Title
– Short, formalized, captures most critical part of
meaning
– (proxy for abstract)
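One way to picture an AIT record combining the structured attributes and textual fields; the field names mirror the slides above, and the sample values are invented:

```python
from dataclasses import dataclass

@dataclass
class ThesisRecord:
    # Structured attributes
    thesis_id: int
    author: str
    year: int
    university: str
    advisor: str
    language: str
    # Textual fields
    title: str       # short proxy for the abstract
    abstract: str    # standard academic English summary

doc = ThesisRecord(1, "A. Student", 1994, "Some University", "B. Advisor",
                   "English", "Learning by Example", "This thesis ...")
```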
Indexing
• We have a tokenized, stemmed sequence of
words
• Next step is to parse document, extracting
index terms
– Assume that each token is a word and we don’t
want to recognize any more complex structures
than single words.
• When all documents are processed, create
index
Basic Indexing Algorithm
• For each document in the corpus
– get the next token
– save the posting in a list
• doc ID, frequency
• For each token found in the corpus
– calculate #doc, total frequency
– sort by frequency
(p 53-54)
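A minimal sketch of this indexing loop, assuming the corpus is a mapping from document IDs to already tokenized and stemmed word lists:

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """corpus: {doc_id: [token, ...]} -> postings plus per-token statistics."""
    postings = defaultdict(list)          # token -> [(doc_id, frequency), ...]
    for doc_id, tokens in corpus.items():
        counts = Counter(tokens)
        for token, freq in counts.items():
            postings[token].append((doc_id, freq))   # save the posting

    stats = []                            # (token, #docs, total frequency)
    for token, plist in postings.items():
        stats.append((token, len(plist), sum(f for _, f in plist)))
    stats.sort(key=lambda t: t[2], reverse=True)      # sort by frequency
    return postings, stats

corpus = {1: ["car", "drive", "car"], 2: ["car", "child"]}
postings, stats = build_index(corpus)
print(stats)   # [('car', 2, 3), ('drive', 1, 1), ('child', 1, 1)]
```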
Fine Points
• Dynamic Corpora
• Higher-resolution data (eg, char position)
• Giving extra weight to proxy text (typically
by doubling or tripling frequency count)
• Document-type-specific processing
– In HTML, want to ignore tags
– In email, maybe want to ignore quoted material
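A sketch of two of these refinements: stripping HTML tags before tokenizing, and counting each proxy-text (title) occurrence as several body occurrences (the weight of 3 here is an assumption):

```python
import re
from collections import Counter

def strip_html_tags(text):
    # Document-type-specific processing: in HTML we ignore the tags themselves.
    return re.sub(r"<[^>]+>", " ", text)

def weighted_counts(title_tokens, body_tokens, proxy_weight=3):
    # Each title occurrence counts as proxy_weight body occurrences.
    counts = Counter(body_tokens)
    for token in title_tokens:
        counts[token] += proxy_weight
    return counts

print(weighted_counts(["car"], ["car", "drive", "car"]))
# Counter({'car': 5, 'drive': 1})
```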
Choosing Keywords
• Don’t necessarily want to index on every word
– Takes more space for index
– Takes more processing time
– May not improve our resolving power
• How do we choose keywords?
– Manually
– Statistically
• Exhaustivity vs specificity
Manually Choosing Keywords
• Unconstrained vocabulary: allow creator of
document to choose whatever he/she wants
– “best” match
– captures new terms easily
– easiest for person choosing keywords
• Constrained vocabulary: hand-crafted ontologies
– can include hierarchical and other relations
– more consistent
– easier for searching; possible “magic bullet” search
Examples of Constrained Vocabularies
• ACM headings
• MEDLINE's Medical Subject Headings (MeSH)
Automated Vocabulary Selection
• Frequency: Zipf’s Law
– Within one corpus, words with middle
frequencies are typically “best”
• Document-oriented representation bias: lots
of keywords/document
• Query-Oriented representation bias: only
the “most typical” words. Assumes that we
are comparing across documents.
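A sketch of picking middle-frequency words in the spirit of Zipf's Law; the frequency cutoffs are arbitrary assumptions:

```python
from collections import Counter

def middle_frequency_keywords(corpus_tokens, low_cut=2, high_cut=100):
    """Keep words in the middle band of the Zipf curve: at least low_cut
    occurrences (not too rare) and at most high_cut (not too common)."""
    counts = Counter(corpus_tokens)
    return {w for w, c in counts.items() if low_cut <= c <= high_cut}
```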
Choosing Keywords
• “Best” depends on actual use; if a word
only occurs in one document, may be very
good for retrieving that document; not,
however, very effective overall.
• Words which have no resolving power
within a corpus may be best choices across
corpora
• Not very important for web searching; will
be more relevant for text mining.
Keyword Choice for WWW
• We don’t have a fixed corpus of documents
• New terms appear fairly regularly, and are
likely to be common search terms
• Queries that people want to make are wide-ranging and unpredictable
• Therefore: can’t limit keywords, except
possibly to eliminate stop words.
• Even stop words are language-dependent.
So determine language first.