Introduction to the course and to Information Retrieval


LIS618 lecture 0
Thomas Krichel
2003-09-14
today's lecture
• I will not talk about the strike.
• A look at the course home page
http://wotan.liu.edu/home/krichel/lis618n03a
• administrative stuff
• historical matters about the course
• about me
• business of database searching
• indexes
• the Boolean information retrieval model
• practice example on Dialog
Organization
• homepage
http://wotan.liu.edu/home/krichel/lis618n03a
• Contents to be discussed today.
• Send mail to [email protected]
– Your name
– Your secret word for grades delivery
• Interrupt me with as many questions as
possible!
• Ask for breaks!
Proposed Organization
• Normal lecture
• Quiz at the beginning of every lecture
– Factually oriented, around 15 minutes
– Remove worst performance
– Average to form 50%
• Search exercise 50%
• Formal syllabus to be made early next
week!
Search exercise
• find victim of an information need
• best to take someone you know in a
professional capacity
• conduct interview about an information
need experienced by the victim, write
down expectations
• search in formal database and on web
• discuss results with the victim
• write essay, no longer than 7 pages.
about the course
• This course is new wine in an old bottle
• Officially a merger of
– lis566 information resources on the Internet
• mailing lists
• usenet news
• web searching
– lis618 database searching
• access and use of commercial databases
mix of theory and practice
• I am not a database search practitioner.
• Each database is different, practical skills
are not easily transferable.
• Thus my emphasis in the course is more
on theory.
• In the past, I taught theory first, then practice.
• This year I will try to mix. Some theory and
some practice in every session.
What databases?
• Dialog has been the traditional database
covered.
– They were the market leaders in online
databases in the past.
– Nowadays the field is much more open
• In addition I have done Nexis, FirstSearch
(OCLC) in the past.
• But I am open to suggestions.
About me
• Born 1965, in Völklingen (Germany)
• Studied economics and social sciences at
the Universities of Toulouse, Paris, Exeter
and Leicester.
• PhD in theoretical macroeconomics
• Lecturer in Economics at the University of
Surrey from 1993 to 2001
• Since 2001 assistant professor at the
Palmer School
Why?
• During research assistantship period,
(1990 to 1993) I was constantly frustrated
with difficult access to scientific literature.
• At the same time, I discovered easy
access to freely downloadable software
over the Internet.
• I decided to work towards downloadable
scientific documents. This led to my
library career (eventually).
Steps taken I
• 1993 founded the NetEc project at
http://netec.mcc.ac.uk, later available at
http://netec.ier.hit-u.ac.jp as well as at
http://netec.wustl.edu.
• These are networking projects targeted to
the economics community. The bulk is
– Information about working papers
– Downloadable working papers
– Journal articles were added later
Steps taken II
• Set up RePEc, a digital library for
economics research. Catalogs
– Research documents
– Collections of research documents
– Researchers themselves
– Organizations that are important to the
research process
• Decentralized collection, model for the
Open Archives Initiative
Steps taken III
• Co-founder of Open Archives Initiative
• Work on the Academic Metadata Format
• Co-founded rclis, a RePEc clone for
Research in Computing, Library and
Information Science
Interest in databases
• From my point of view I have two interests
in database searching
– As a provider, I must understand how people
search in order to provide some data that they
can use and will use.
– As an economist, I have a strong interest in
information as a commodity. The database
market is an important market place.
• Main emphasis of course is still on
databases.
Database searching (DS)
• subset of the subject of information
retrieval (IR)
• DS is mainly thought of as applying to
large structured databases, as opposed
to web searching
• for those, a general knowledge of what
databases are seems useful
• Concentrate on textual databases
traditional social model
• user goes to a library
• describes problem to the librarian
• librarian does the search
– without the user present
– with the user present
• hands over the result to the user
• user fetches full-text or asks a librarian to
fetch the full text.
economic rationale for traditional
model
• In olden days the cost of
telecommunication was high.
• database use costs
– cost of communication
– cost of access time to the database
• the traditional model puts an upper
bound on costs
disintermediation
• with the cost of access time gone, the
traditional model is under threat
• there is disintermediation where the
librarian loses her role
• but that may not be good news for
information retrieval results
– user knows subject matter best
– librarian knows searching best
Web searching
• IR has received a lot of impetus through
the web, which poses unprecedented
search challenges.
• with more and more data appearing on the
web DS may be a subject in decline
– it is primarily concerned with non-web
databases
– There are more and more web-based methods
of searching
Public access vs quality
• Now the public at large is able to do online
searching.
• At the same time the need for quality answers has
grown.
• Quality-filtered services will become more
important.
• In the current databases, there is a lot that
would already be available for free, mixed with
quality-controlled stuff.
• Publishers have direct offerings and
intermediated vending is in decline.
Main theory part
• Literature: "Modern Information Retrieval"
by Ricardo Baeza-Yates and Berthier
Ribeiro-Neto
• Don't buy it. It is not a good book.
before the IR process
• provider
– define data that is available
• documents that can be used
• document operations
• document structure
– index
• user
– user need
– IR system familiarity
the IR process
• query expresses user need in a query
language
• processing of query yields retrieved
documents
• calculation of relevance ranking
• examination of retrieved documents
• possible relevance cycle
main problem
• user is not an expert at the formulation of
a query
• garbage in, garbage out: the retrieval yields
poor results
• ways out
– design very intuitive interface for the query
– give expert guidance
taxonomy of classic IR models
• Boolean, or set-theoretic
– fuzzy set models
– extended Boolean
• vector, or algebraic
– generalized vector model
– latent semantic indexing
– neural network model
• probabilistic
– inference network
– belief network
summary
• There are three basic types of models in
classic information retrieval.
• Extensions of these types are a matter of
research concern and require good
mathematical skills.
• All classic models treat documents as
individual pieces.
key aid: index
• an index is a list of terms, with a list of locations
where the term is to be found.
• The way to express locations usually depends
on the form that the indexed data takes.
– for a book, it is usually the page number, e.g.
"shmoo 34, 75"
– for computer files it is usually the name of the file plus
the number of the byte where the indexed term starts,
e.g. "krichel index.html 34, cv.html 890 1209"
• there is usually more than one location of the
term.
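The index idea for computer files can be sketched in a few lines of Python. This is only an illustration: the file names and their contents are invented, and the positions are character offsets in a string (which coincide with byte offsets for plain ASCII text).

```python
import re
from collections import defaultdict


def build_index(files):
    """Map each term to {file name: [offsets where the term starts]}."""
    index = defaultdict(dict)
    for name, text in files.items():
        # Every run of letters counts as a term; lower-casing merges variants.
        for match in re.finditer(r"[a-z]+", text.lower()):
            index[match.group()].setdefault(name, []).append(match.start())
    return dict(index)


# Hypothetical file contents, invented for illustration.
files = {
    "index.html": "Home page of Thomas Krichel",
    "cv.html": "Krichel, Thomas. CV of Krichel",
}

index = build_index(files)
print(index["krichel"])
```

Looking up a term returns every file that contains it together with all of its locations there, which is why one entry usually lists more than one location.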
key aid: index terms
• index term is a part of the document that has a
meaning on its own.
• it is usually a noun.
• retrieval based on index term raises questions
– semantics in query or document is lost
– matching done in imprecise space of index terms
• predicting relevance is a central problem
• the IR model determines the process of
relevance ranking
basic concept: weight of index term
• given all nouns, not all appear to have the same
relevance to the text
• sometimes, we can have a simple measure of
the importance of a term, example?
• more generally, for each indexing term and each
document we can associate a weight with the
term and the document.
• usually, if the document does not contain the
term, its weight is zero
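One possible simple measure is sketched below. It assumes that the weight of a term is its share of all the words in the document; this choice is an assumption made for illustration, not something the lecture prescribes. The sample document is invented.

```python
from collections import Counter


def term_weights(text):
    """Weight each term by its share of all words in the document."""
    words = text.lower().split()
    counts = Counter(words)
    return {term: count / len(words) for term, count in counts.items()}


def weight(weights, term):
    """Return the weight of a term; an absent term weighs zero."""
    return weights.get(term, 0.0)


doc = "index term index weight"  # hypothetical document
w = term_weights(doc)
print(weight(w, "index"))    # appears in 2 of 4 words
print(weight(w, "boolean"))  # not in the document
```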
Boolean model
• in the Boolean model, the weight of
any index term for any document is 1 if the
term appears in the document. It is 0
otherwise.
• This allows query terms to be combined with
the Boolean operators AND, OR, and NOT
• thus powerful queries can be written
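A minimal sketch of the Boolean model with Python sets follows. The three documents and their index terms are invented, and NOT is taken relative to the whole collection.

```python
# Hypothetical mini-collection: document id -> set of index terms.
docs = {
    1: {"digital", "library"},
    2: {"digital", "camera"},
    3: {"library", "catalog"},
}


def weight(term, doc_id):
    """Boolean weight: 1 if the term appears in the document, else 0."""
    return 1 if term in docs[doc_id] else 0


def matching(term):
    """Set of documents in which the term has weight 1."""
    return {d for d in docs if weight(term, d) == 1}


all_docs = set(docs)
both = matching("digital") & matching("library")    # digital AND library
either = matching("digital") | matching("library")  # digital OR library
not_camera = all_docs - matching("camera")          # NOT camera
print(both, either, not_camera)
```

Because every weight is 0 or 1, a document either matches a query or it does not. The model gives no ranking among the matching documents.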
Classic implementation: dialog
http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf
http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf
http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf
http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf
Dialog is a databank
• over 500 databases
• these are also known as files and cover
– references and abstracts for published
literature,
– business information and financial data;
– complete text of articles and news stories;
– statistical tables
– Directories
• DIALOG uses the Boolean model
DIALOG interface
• is still rooted in "traditional" database
systems
• dismissed as "dial-a-dog"
• it uses a command-driven interface
• it is very complicated to learn fully
• it is not suitable for the end-user
• it therefore offers a valuable skill to the
information professional
• it is a challenge for a professor to teach
Accessing DIALOG
• On the web, go to
http://www.dialogweb.com/
• Enter username and password
• Forget about subaccount
• then click on logon
• On the next screen go to command search
• "continue" at the next screen
two steps in DIALOG
• step one: select databases (aka files) to
look at
• step two: perform searches on the
selected databases
• You may wonder why one does not have
one single step like in a search engine.
Discuss.
sample search
• We want to know something about "current
awareness in digital libraries"
• From dialogweb command search:
– databases
– social sciences and humanities
– library and information science
• leads you to
http://www.dialogweb.com/cgi/logoff?mode=guided&url=/cgi/dwframe?href=search.html
This is database selection…
• At that screen you see a number of "files"
with their number.
• You can select those that you want to
search
• then you click "begin databases"
• and you get back to the command search
• "b numbers" it will say. That is the
command to begin working with files.
Boolean search
• Do a number of searches
– s current(N)awareness
– s digital(N)library
– s digital(N)libraries
• Each search retrieves a set of documents
• The sets can be combined
– s s1 and (s2 or s3)
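The way numbered result sets combine can be imitated with Python sets. This sketch is not Dialog itself: the four records are invented, and the adjacent() helper assumes that (N) matches two words standing next to each other in either order.

```python
import re

# Hypothetical records, invented for illustration.
records = {
    1: "Current awareness services in the digital library",
    2: "Building digital libraries for economics",
    3: "Awareness of current trends in cataloging",
    4: "A current awareness tool for digital libraries",
}


def adjacent(a, b):
    """Records where a and b are neighbouring words, in either order.

    Assumption: this is how the (N) operator behaves.
    """
    hits = set()
    for rec_id, text in records.items():
        words = re.findall(r"[a-z]+", text.lower())
        pairs = set(zip(words, words[1:]))
        if (a, b) in pairs or (b, a) in pairs:
            hits.add(rec_id)
    return hits


s1 = adjacent("current", "awareness")  # s current(N)awareness
s2 = adjacent("digital", "library")    # s digital(N)library
s3 = adjacent("digital", "libraries")  # s digital(N)libraries
s4 = s1 & (s2 | s3)                    # s s1 and (s2 or s3)
print(sorted(s4))
```

Record 3 contains both "current" and "awareness" but not next to each other, so it is left out of s1.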
What is the deal?
• There are two stages.
• At stage two we make Boolean queries.
• Each query splits the records into
matching and non-matching records.
• The set of matching records is returned.
• It can be further searched or combined
with other sets using Boolean operators.
• Try this at home.
http://openlib.org/home/krichel
Thank you for your attention!