Information Retrieval 2 and DIALOG

Download Report

Transcript Information Retrieval 2 and DIALOG

LIS618 lecture 2
Thomas Krichel
2002-09-22
Structure of talk
• General round trip on theoretical matters, part
– Information retrieval models
• vector model
• Probabilistic model
– Retrieval performance evaluation
– Query languages
• Introduction to online searching
• Introduction to DIALOG
– Overview
– bluesheets
vector model
• associates weights with each index term
appearing in the query and in each
database document.
• relevance can be calculated as the cosine
between the two vector, i.e. their cross
product divided be the square roots of the
squares of each vector. This measure
varies between 0 and 1.
tf/idf weighting technique
• Let n_i be the number of documents
where the term d_i appears. Let F_i_j be
the number of times term i appears in the
document j.
• The normalized frequency is f_i_j, given by
f_i_j=(F_i_j/max_l(F_l_j)
that is the raw frequency divided by the
maximum raw frequency achieved by any
term in the document j.
tf/idf weighting technique
• Let N be the number of documents.
• Then the most frequently used weighting
scheme is
w_i_j=f_i_j * log(N/n_i)
• There are other methods, but these are
variations on this one.
advantages of vector model
•
•
•
•
term weighting improves performance
sorting is possible
easy to compute, therefore fast
results are difficult to improve without
– query expansion
– user feedback circle
probabilistic model (outline only)
• starts with the assumption that there is a subset
of documents that form the ideal answer set
• query process specifies properties of the answer
set
• query terms can be used to form a probability
that a document is part of the answer
• then we start an iterative process with the user
to gain more characteristics about the answer
set
recursive method
• If we assume that the probability that the
documents that are relevant among a set
of initially retrieved documents is
proportional to the appearance of index
terms that are part of the query, the
probability can further be refined.
probabilistic model
• For any user requirement, we assume that there
is an answer set and that the probability that the
user finds a document interesting only depends
on the document and the query
• Then the similarity of the document to the query
can be expressed as
• s=(probability that the document is part of the
answer set / probability that it is not part of the
answer set).
• There are ways to calculate this (with some
more assumptions).
retrieval performance evaluation
• There are two classic measures. Both
assume that there is an answer set.
• Recall is the fraction of the relevant
documents that the query result has
captured.
• Precision is the fraction of the retrieved
documents that is relevant.
recall and precision curves
• assume that all the retrieved documents
arrive at once and are being examined.
• during that process, the user discover
more and more relevant documents.
Recall increases.
• during the same process, at least
eventually, there will be less and less
useful document. Precision declines
(usually).
Example
• let the answer set be {4,7,5,3,6,1,0,8,9}
and non-relevant documents represented
by letters.
• A query reveals the following result:
7,a,3,b,c,9,n,j,l,5,r,o,s,e,4.
• for the first document, (recall, precison) is
(10%,100%), for the third, (20%,60%),
then follow (30%,50%), (40%,40%),
(50%,27%)
recall/precision curves
• Such curves can be formed for each
query.
• An average curve, for each recall level,
can be calculated for several queries.
• Recall and precision levels can also be
used to calculate two single-valued
summaries.
average precision at seen
document
• sum all the precision level for each new
relevant document and divide by the total
number of relevant documents is the
query.
• In our example, it is 0.57
• This measure favors retrieval methods that
get the relevant documents to the top.
R-precision
•
•
•
•
•
•
•
a more ad-hoc measure.
Let R be the size of the answer set.
Take the first R results of the query.
Find the number of relevant documents
Divide by R.
In our example, the R-precision is .4.
An average can be calculated for a
number of queries.
critique of recall & precision
• recall has to be estimated by an expert
• recall is very difficult to estimate in a large
collection
• measures most appropriate to a situation
where queries are run in batch mode, they
are difficult to reconcile with the idea of
interactive use.
• there are some other measures.
simple queries
• single-word queries
– one word only
– Hopefully some word combinations are
understood as one word, e.g. on-line
• Context queries
– phrase queries (be aware of stop words)
– proximity queries, generalize phrase queries
• Boolean queries
simple pattern queries
•
•
•
•
•
prefix queries (e.g. "anal" for analogy)
suffix queries (e.g. "oral" for choral)
substring (e.g. "al" for talk)
ranges (e.g. form "held" to "hero")
within a distance, usually Levenshtein
distance (i.e. the minimum number of
insertions, deletions, and replacements) of
query term
regular expressions
• come from UNIX computing
• build form strings where certain characters are
metacharacters.
• example: "pro(blem)|(tein)s?" matches problem,
problem, protein and proteins.
• example: New .*y matches "New Jersey" and
"New York City", and "New Delhy".
• great variety of dialects, usually very powerful.
• Extremely important in digital libraries.
structured queries
• make use of document structures
• simplest example is when the documents
are database records, we can search for
terms is a certain field only.
• if there is sufficient structure to field
contents, the field can be interpreted as
meaning something different than the word
it contains. example: dates
query protocols
• There are some standard languages
– Z39.50 queries
– CCL, "common command language" is a
development of Z39.50
– CD-RDx "compact disk read only data
exchange" is supported by US government
agencies such as CIA and NASA
– SFQL "structure full text query language" built
on SQL
document preprocessing
• operations done on the documents before
indexing
– lexical analysis
– elimination of stop words
– stemming of words
– selection of index term
– construction of term categorization structures
• receives a decline in attention
lexical analysis
• divides a stream of characters into a
stream of words
• seems easy enough but….
• should we keep numbers?
• hyphens. compare "state-of-the-art" with
"b-52"
• removal of punctuation, but "333B.C."
• casing. compare "bank" and "Bank"
elimination of stop words
• some words carry no meaning and should
be eliminated
• in fact any word that appears in 80% of all
documents is pretty much useless, but
• consider a searcher for "to be or not to
be".
stemming
• in general, users search for the
occurrence of a term irrespective of
grammar
• plural, gerund forms, past tense can be
subject to stemming
• important algorithm by Porter
• evidence on effect on retrieval is mixed
index term selection
• some engines try to capture nouns only
• some nouns that appear heavily together
can be considered to be one index term,
such as "computer science"
• Most web engines, however, index all
words, why?
thesauri
• a list of words and for each word, a list of related
words
– synonyms
– broader terms
– narrower terms
• used
– to provide a consistent vocabulary for indexing and
searching
– to assist users with locating terms for query
formulation
– allow users to broaden or narrow query
use of thesauri
• most users want to get a quick response
• often the selection of terms is erroneous
• frequently the relationship between terms
in the query is badly served by the
relationships in the query. Thus thesaurus
expansion of an initial query (if performed
automatically) can lead to bad results.
Online database searching
before a search
• what is purpose
– brief overview
– comprehensive search
• What perspective on the topic
– scholarly
– technical
– business
– popular
I
before search
• What type of information
–
–
–
–
Fulltext
Bibliographic
Directory
Numeric
• Are there any known sources?
–
–
–
–
Authors
Journals
Papers
Conferences
II
before search
•
•
•
•
III
What are the language restrictions?
What, if any, are the cost restrictions?
How current need the data to be?
How much of each record is required?
DIALOG
Literature
http://training.dialog.com/sem_info/courses/
pdf_sem/dlg1.pdf
http://training.dialog.com/sem_info/courses/
pdf_sem/dlg2.pdf
http://training.dialog.com/sem_info/courses/
pdf_sem/dlg3.pdf
http://training.dialog.com/sem_info/courses/
pdf_sem/dlg4.pdf
databank
• over 500 different databases
–
–
–
–
–
references and abstracts for published literature,
business information and financial data;
complete text of articles and news stories;
statistical tables
Directories
• Two interfaces to all this stuff.
– Guided search (for neophytes)
– Command search (for Masters)  for us!!
Four steps in a search
• Use the Databases Selection Tool to
select databases
• Identify search terms
• Use Dialog basic commands to conduct a
search
• View records online
B E S T strategy
• begin
– b 630,636
– b papersmj, not 630
• expand
– e co=long island university
– e au=krichel, t
I
BEST
II
• select
– s (mate?(N)drink?) or (lex(N)para?)
– s s1 and s2
• type
– type s1/3/1,6
Command search
• The first thing to be done is to select a
database.
• 8 categories
–
–
–
–
Government
News
Business
Reference
– Medicine & Pharmaceuticals
– Science & Technology
– Intellectual property
– Social Sciences and Humanities
• there we go to command search
databases menus
• databases are ordered in hierarchical
fashion
• at each level a Boolean search can be
executed
– on all of Dialog
– on the databases in the current hierarchical
level
searching
• result may be a just a blank screen
• otherwise, a table with the file number, the
database name and the number of hits
appears
• wait until the display is complete….
• sorting of database is possible by the
number of hits for the current query
blue sheet
• each database name is linked to a blueish
pop-up window called the blue sheet for
the database
• Contents of bluesheet is covered later
• at this stage we choose a database and hit
"begin". We see that there is a command
selected: "be numbers" where numbers
are the ones for the databases selected,
separated by comma.
finding a database
• file 411 contains the database of
databases
• 'sf category' selects files belonging to a
category category
• categories are listed at
http://library.dialog.com/bluesheets
• 'rank files' will rank the results
• 'b ref,ref' will select databases using rank
references.
closer look at the bluesheet
• file description
• subject coverage (free vocabulary)
• format options, lists all formats
– by number (internal)
– by dialog web format (external, i.e. crossdatabase)
• search options
– basic index, i.e. subject contents
– additional index, i.e. non-subject
search options: basic index
• select without qualifiers searches in all
fields in the basic index
• bluesheet lists field indicators available for
a database
• also note if field is indexed by word or
phrase. proximity searching only works
with word indices.
other search options
• additional indices lists those terms that
can lead a query. Often, these are phrase
indexed.
• special features will list other features that
the database has and that can be used in
queries
http://openlib.org/home/krichel
Thank you for your attention!