Information Retrieval 1

LIS618 lecture 1
Thomas Krichel
2002-09-15
Organization
• homepage
http://wotan.liu.edu/home/krichel/lis618n02a
• Contents to be discussed today.
• Send mail to [email protected]
– Your name
– Your secret word for grades delivery
• Interrupt me with as many questions as
possible!
• Ask for breaks!
Proposed Organization
• Normal lecture
• Quiz at the beginning of every lecture.
• Main quiz next week (25% of grade)
• Search exercise 55%
• Other quizzes 10%
• Formal syllabus to be made early next
week!
Search exercise
• find victim
• conduct interview about an information
need experienced by the victim, write
down expectations
• search in Dialog and on web
• discuss results with the victim
• write essay, no longer than 7 pages.
Structure of talk
• First talk about me, then about you and the
course
• General round trip on theoretical matters.
– Context of database searching
– Database searching and information retrieval
– The retrieval process
– Information retrieval models
– Retrieval performance evaluation
– Query languages
• Logging on to Dialog
• Web searching exercise (if time permits)
About me
• Born 1965, in Völklingen (Germany)
• Studied economics and social sciences at
the Universities of Toulouse, Paris, Exeter
and Leicester.
• PhD in theoretical macroeconomics
• Lecturer in Economics at the University of
Surrey from 1993 to 2001
• Since 2001 assistant professor at the
Palmer School
Why?
• During research assistantship period,
(1990 to 1993) I was constantly frustrated
with difficult access to scientific literature.
• At the same time, I discovered easy
access to freely downloadable software
over the Internet.
• I decided to work towards downloadable
scientific documents. This led to my
library career (eventually).
Steps taken I
• 1993 founded the NetEc project at
http://netec.mcc.ac.uk, later available at
http://netec.ier.hit-u.ac.jp as well as at
http://netec.wustl.edu.
• These are networking projects targeted to
the economics community. The bulk is
– Information about working papers
– Downloadable working papers
– Journal articles were added later
Steps taken II
• Set up RePEc, a digital library for
economics research. Catalogs
– Research documents
– Collections of research documents
– Researchers themselves
– Organizations that are important to the
research process
• Decentralized collection, model for the
open archives initiative
Steps taken III
• Co-founder of Open Archives Initiative
• Work on the Academic Metadata Format
• Co-founded rclis, a RePEc clone for
(Research in Computing, Library and
Information Science)
summary
• There are three basic types of models in
classic information retrieval.
• Extensions of these types are a matter of
research concern and require good
mathematical skills.
• All classic models treat documents as
individual pieces.
Database searching (DS)
• subset of the subject of information
retrieval (IR)
• DS is mainly thought of as applying to
large structured databases, as opposed
to web searching
• for those, a general knowledge of what
databases are seems useful
• Concentrate on textual databases
traditional social model
• user goes to a library
• describes problem to the librarian
• librarian does the search
– without the user present
– with the user present
• hands over the result to the user
• user fetches full-text or asks a librarian to
fetch the full text.
economic rationale for traditional
model
• In olden days the cost of
telecommunication was high.
• database use costs
– cost of communication
– cost of access time to the database
• the traditional model puts an upper
bound on costs
disintermediation
• with the cost of access time gone, the
traditional model is under threat
• there is disintermediation where the
librarian loses her role
• but that may not be good news for
information retrieval results
– user knows subject matter best
– librarian knows searching best
Web searching
• IR has received a lot of impetus through
the web, which poses unprecedented
search challenges.
• with more and more data appearing on the
web DS may be a subject in decline,
because it is primarily concerned with non-web databases
Main theory part
• Literature: "Modern Information Retrieval"
by Ricardo Baeza-Yates and Berthier
Ribeiro-Neto
• Don't buy it. It is not a good book.
before the IR process
• provider
– define data that is available
• documents that can be used
• document operations
• document structure
– index
• user
– user need
– IR system familiarity
the IR process
• query expresses user need in a query
language
• processing of query yields retrieved
documents
• calculation of relevance ranking
• examination of retrieved documents
• possible relevance cycle
main problem
• user is not an expert at the formulation of
a query
• garbage in, garbage out: the retrieval
yields poor results
• ways out
– design very intuitive interface
– give expert guidance
key aid: index
• index term is a part of the document that has a
meaning on its own (usually a noun)
• retrieval based on index term raises questions
– semantics in query or document is lost
– matching done in imprecise space of index terms
• predicting relevance is a central problem
• the IR model determines the process of
relevance ranking
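A minimal sketch of what an index can look like, assuming the common inverted-index layout: a mapping from each index term to the set of documents that contain it. The slides do not prescribe a data structure. The toy documents and the helper name build_inverted_index are made up for illustration, and every word is treated as an index term rather than nouns only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

# Hypothetical toy collection, not from the lecture.
docs = {
    1: "library catalog search",
    2: "web search engine",
    3: "library web portal",
}

index = build_inverted_index(docs)
print(sorted(index["search"]))   # ids of documents containing "search"
print(sorted(index["library"]))  # ids of documents containing "library"
```

Note that the index only records which terms occur where. Word order and sentence meaning are gone, which is the loss of semantics mentioned above.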
taxonomy of classic IR models
• Boolean, or set-theoretic
– fuzzy set models
– extended Boolean
• vector, or algebraic
– generalized vector model
– latent semantic indexing
– neural network model
• probabilistic
– inference network
– belief network
basic concepts: index term
• an index term is a word whose semantics
help to remember the document's main
themes.
• nouns are mainly used
• if all words are index terms, the logical
view of the document is full text
basic concept: weight of index term
• given all nouns, not all appear to have the same
relevance to the text
• sometimes, we can have a simple measure of
the importance of a term, example?
• more generally, for each indexing term and each
document we can associate a weight with the
term and the document.
• usually, if the document does not contain the
term, its weight is zero
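One possible simple measure, sketched here as an assumption rather than as the answer intended on the slide, is the relative frequency of a term within the document. The function names term_weights and weight and the sample document are hypothetical.

```python
from collections import Counter

def term_weights(text):
    """Return {term: weight}, using relative frequency in the document as the weight."""
    terms = text.lower().split()
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}

def weight(term, weights):
    """Weight of a term in a document; zero if the document does not contain it."""
    return weights.get(term, 0.0)

# Hypothetical document, not from the lecture.
doc = "database search database index"
w = term_weights(doc)
print(weight("database", w))  # 2 of the 4 words
print(weight("library", w))   # term absent from the document
```

Each weight here depends only on the term and the document, not on which other terms are present.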
basic concept: mutual term
independence
• Thinking of the weight of a term as a
function of the document and the term only
implies that it is independent of other
terms.
• This is an important oversimplification.
• But it allows for fast computation.
• No study has shown that not assuming
independence brings significant
performance gain.
Boolean model
• in the Boolean model, the weight of an
index term for any document is 1 if the
term appears in the document. It is 0
otherwise.
• This allows query terms to be combined with
the Boolean operators AND, OR, and NOT
• thus powerful queries can be written
example: a AND (b OR NOT c)
• 1: abc
• 2: ab
• 3: ac
• 4: cb
• 5: c
• 6: b
• 7: a
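The query a AND (b OR NOT c) can be evaluated over the seven example documents (1 to 7, with index terms abc, ab, ac, cb, c, b, a) in a few lines. This is a sketch under the assumption that each letter is one index term; the helper containing is a made-up name. AND is computed as set intersection, OR as union, and NOT as the complement within the collection.

```python
# The seven documents from the example; each letter is an index term.
docs = {1: "abc", 2: "ab", 3: "ac", 4: "cb", 5: "c", 6: "b", 7: "a"}

all_ids = set(docs)

def containing(term):
    """Set of document ids whose binary weight for `term` is 1."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

a, b, c = containing("a"), containing("b"), containing("c")

# AND is intersection, OR is union, NOT is complement within the collection.
result = a & (b | (all_ids - c))
print(sorted(result))  # [1, 2, 7]
```

Documents 1 and 2 match because they contain both a and b. Document 7 matches because it contains a and lacks c. Document 3 contains a but fails the bracketed part, since it has c and no b.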
advantages of Boolean model
• supposedly easy to grasp by the user
• precise semantics of queries
• implemented in the majority of commercial
systems
• why is it set-theoretic ?
problems of Boolean model
• sharp distinction between relevant and
irrelevant documents
• no ranking possible
• users find it difficult to formulate Boolean
queries
http://openlib.org/home/krichel
Thank you for your attention!