Retrieval - Faculty of Computer Science and Information Technology


WMES3103
INFORMATION RETRIEVAL
WEEK 1 AND 2
WHAT IS INFORMATION RETRIEVAL?




Information Retrieval (IR)
Lancaster (1968):
"An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request."





IR: the process of getting (retrieving) information
Now: a lot of information is available, both print and electronic
Requirement: obtain information quickly and accurately
IR aims to provide fast, effective and efficient methods of representing, managing, searching, retrieving and presenting such information
IR = the representation, storage and organization of, and access to, information items

Computer science perspective
• Design and build a large-scale system that will store, manipulate, retrieve and display electronic information of any kind
• Text, audio, images and graphics are stored in such a way that they are available for interaction with humans or machines

Library and information perspectives
• Search features: author (au), title (ti), subject (su), keywords
• Relevance of retrieved items
Examples of IRS
3 challenges for IR researchers and practitioners



Technical challenge: what tools should IR systems provide to allow effective and efficient manipulation of information within such diverse media as text, image, video and audio?
Interaction challenge: what features should IR systems provide in order to support a wide variety of users in their search for relevant information?
Evaluation challenge: how can we evaluate which tools and features are effective and usable, given the increasing diversity of end-users and information-seeking situations?
3 basic areas of research
• Content analysis: describing the contents of the documents in a form suitable for computer processing
• Information structures: exploiting relationships between documents to improve the efficiency and effectiveness of retrieval strategies
• Evaluation: measurement of the effectiveness of retrieval

Information Retrieval System
Information Retrieval System = IRS
• Before: index documents and retrieve them
  • E.g. the OPAC of a library (cataloguing)
• Now: modelling, document classification and categorization, system architecture, user interfaces, data visualization, filtering, languages
  • E.g. the WWW

Basic Information Retrieval Process
1. The user starts with a question, or a full description of the information need
2. This is translated into a query, or keywords, that summarizes the description of the information need
3. The query is processed by a search engine or IRS
4. The IRS retrieves information which is useful or relevant to the user
Basic Concepts in Information Retrieval
• User task
• Logical view of documents

User Task
• A user has to translate his information need into a query in the language provided by the system
• Usually this means specifying a set of words
• English-language statement: "I want a book by J. K. Rowling titled The Chamber of Secrets"


Query entered in a computer system
• Au = Rowling
• Ti = Chamber of Secrets
• "Chamber of Secrets"
• Rowling AND Stone
• Au rowling ti chamber of secrets ti stone
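The fielded forms of the query (Au = Rowling, Ti = Chamber of Secrets) can be illustrated with a minimal sketch. The catalogue records and the function name `match_fielded` are assumptions made up for this example; a real OPAC would use its own record format and query language.

```python
# Hypothetical catalogue records; the field names mirror the au/ti search features.
CATALOGUE = [
    {"au": "J. K. Rowling", "ti": "Harry Potter and the Chamber of Secrets"},
    {"au": "J. K. Rowling", "ti": "Harry Potter and the Philosopher's Stone"},
    {"au": "J. R. R. Tolkien", "ti": "The Hobbit"},
]


def match_fielded(records, **fields):
    """Return the records whose named fields contain every given value.

    Matching is a case-insensitive substring test, which is an
    illustrative simplification of a fielded search.
    """
    hits = []
    for record in records:
        if all(value.lower() in record.get(field, "").lower()
               for field, value in fields.items()):
            hits.append(record)
    return hits


# Au = Rowling, Ti = Chamber of Secrets
results = match_fielded(CATALOGUE, au="Rowling", ti="Chamber of Secrets")
for record in results:
    print(record["au"], "|", record["ti"])
```

Searching on the author field alone returns both Rowling records, which shows why adding the title field narrows the result to the book the user actually wants.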
2 User Tasks
The 2 user tasks are browsing and retrieval
Browsing: the process of retrieving information where the main objective is not clearly defined from the beginning and may change during the interaction with the system
• E.g. a user searches the internet for information about marine organisms, then goes on to look for information about Australian aborigines. The user is said to be browsing the collection, not searching it
• E.g. searching for a book on the library shelves



Retrieval: the process of retrieving information where the main objective is clearly defined from the onset of the search, e.g. searching for a book on the library shelves
2 actions when user interacts with an IRS





2 actions can be identified when a user interacts with an IRSYS: pulling and pushing
Pulling action: the user requests information in an interactive way, e.g. browsing and retrieval
Pushing action: information is pushed towards the user periodically by specially designed software; also known as filtering
E.g. Yahoo Messenger service alerts the user each time a new message arrives
E.g. online stock exchange
[Figure: Interaction of the user with the IRSYS through distinct tasks. Labels: User, IR, Browsing, DB]
Logical View of Documents

Documents in a collection are represented by a set of index terms or keywords
• Keywords
• Abstract
• Full text
Logical View of Documents
• Documents in a collection are represented by a set of index terms (keywords)
• Indexing process: keywords are either extracted from the text of the document or assigned by humans
• Keywords/subject headings = logical view of the document
• E.g. LISANET (search by abstract), MJLIS (e-journal)

If full text:
• Each word in the text is a keyword
• Most complex form
• Expensive
• If the full text is too large, mechanisms built into the IRS reduce the number of keywords:
Logical view of documents (continued)
1. Stop words: remove articles and connectives (e.g. a, the, an, and, of)
2. Stemming: reduce distinct words to their common grammatical root, e.g. diary will find diary or diaries, and run will represent runs, running
3. Truncation: e.g. catalog* will retrieve catalog, catalogs, catalogue, catalogues
4. Noun words: eliminate adjectives, adverbs and verbs
5. Compression
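The stop-word, stemming and truncation steps can be sketched in a few lines. This is only an illustration under stated assumptions: the stop list is tiny, `naive_stem` is a made-up suffix stripper rather than a real stemming algorithm, and the function names are hypothetical.

```python
import re

# Small illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "the", "and", "of"}


def naive_stem(word):
    """Crude suffix stripping, only to illustrate the idea of stemming."""
    for suffix in ("ies", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "diaries" -> "diar" + "y" so that it meets "diary"
            return stem + "y" if suffix == "ies" else stem
    return word


def index_terms(text):
    """Turn full text into index terms: lowercase, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


def truncation_match(pattern, vocabulary):
    """Expand a right-truncated pattern such as 'catalog*' against a vocabulary."""
    prefix = pattern.rstrip("*")
    return sorted(word for word in vocabulary if word.startswith(prefix))


print(index_terms("The diaries of an author"))
# ['diary', 'author']
print(truncation_match("catalog*", {"catalog", "catalogs", "catalogue",
                                    "catalogues", "category"}))
# ['catalog', 'catalogs', 'catalogue', 'catalogues']
```

Note the difference in where the work happens: stop-word removal and stemming are applied when the text is converted to index terms, while truncation is applied to the query at search time.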
Conversion Process
Logical view of documents (continued)
• This conversion process is known as text operation or transformation
• It reduces the complexity of the document representation and allows the logical view to move from that of a full text to a set of index terms
• On the other hand, human-assigned keywords provide the most concise logical view of a document but might lead to retrieval of poor quality, because of different interpretations and the limited keywords available when using a thesaurus

2 modes of retrieval
• Ad hoc: the documents in the IRS remain static while new queries are submitted to the system, e.g. a CD-ROM database
• Filtering: the queries remain relatively static while new documents come into the IRS, e.g. the stock market
Filtering





Construct a user profile that reflects the user's preferences; the profile is matched against incoming documents to find a match or hit
Retrieve only documents of interest to the user, as specified in the user profile
The user selects relevant documents from the list
Filtered documents can also be ranked to further guide the user as to relevance
Construction of a user profile: the user provides the necessary keywords, or the system collects information about the user's preferences and uses it to construct the profile dynamically
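Matching a static user profile against incoming documents can be sketched as follows. The profile terms, the sample documents and the function names are assumptions for illustration, and the score (number of shared words) is just one simple way to rank the hits.

```python
def profile_score(profile_terms, text):
    """Count how many profile terms occur in the document text."""
    return len(profile_terms & set(text.lower().split()))


def filter_documents(profile_terms, documents, threshold=1):
    """Keep documents that match the profile, best match first."""
    scored = [(profile_score(profile_terms, doc), doc) for doc in documents]
    hits = [(score, doc) for score, doc in scored if score >= threshold]
    # sort() is stable, so equally scored documents keep their arrival order.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]


# Hypothetical profile and incoming document stream.
profile = {"stock", "shares", "dividend"}
incoming = [
    "new stock listed today",
    "football final tonight",
    "bank shares rise as dividend announced",
]

for doc in filter_documents(profile, incoming):
    print(doc)
# bank shares rise as dividend announced
# new stock listed today
```

The profile plays the role that the query plays in ad hoc retrieval: it stays fixed while the documents change.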
INFORMATION RETRIEVAL PROCESS
A. DEFINE TEXT DATABASE
• The text database has to be defined before the retrieval process begins
• Done by the database manager, who specifies the documents to be used, the operations to be performed on the text, and the text model
• The original documents are transformed into a logical view of the documents via the various text operations
• The database manager then builds the index of the text, manually or computer-generated
• The retrieval system is tested
B. RETRIEVAL PROCESS
• The IRS can be used once the document database has been indexed
• The user presents his question (user need) to the IRS
• The question is changed to a logical view, like that of the documents, via the text operations
• The query operations then present it to the system in a form the system understands
• The query is processed to obtain the retrieved documents





B. RETRIEVAL PROCESS (continued)
The retrieved documents are ranked according to relevance
The retrieved documents are sent to the user
The user looks through the ranked documents and can modify the question (user need, query) via the user feedback cycle
The same process is repeated
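The retrieval steps (apply the text operations to the query, process it against the documents, rank the results, then let the user refine the query) can be sketched as below. Ranking by the number of shared terms is an assumption chosen for simplicity, not the ranking method of any particular IRS; the documents and names are hypothetical.

```python
STOP_WORDS = {"a", "an", "the", "and", "of"}


def to_terms(text):
    """Text operation applied to both documents and queries."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}


def retrieve(query, documents):
    """Return document ids ranked by how many query terms they share."""
    query_terms = to_terms(query)
    ranked = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & to_terms(text))
        if overlap:
            ranked.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked]


# Hypothetical document collection.
documents = {
    "d1": "marine organism survey of the reef",
    "d2": "history of the marine trade",
    "d3": "card catalog automation",
}

first_pass = retrieve("marine organism", documents)
print(first_pass)   # ['d1', 'd2']

# User feedback cycle: the user reformulates the query and the process repeats.
second_pass = retrieve("marine trade history", documents)
print(second_pass)  # ['d2', 'd1']
```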
DEVELOPMENT





For the past 4000 years, man has been organizing information for retrieval and usage
It started out with a table of contents for a book
Then the amount of information extended over a number of books
A specialized data structure is needed to ensure faster access to the stored information
The oldest and most popular form of data structure for fast IR is a collection of words or concepts with which are associated pointers to the related information = INDEX
Previously: indexes were built manually
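An index in this sense, words with pointers to the related information, can be sketched as a mapping from each word to the set of documents containing it. The documents and function names here are hypothetical and the code is a minimal illustration only.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup_all(index, words):
    """Return ids of documents containing every word (a Boolean AND)."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()


# Hypothetical collection of three short documents.
documents = {
    1: "chamber of secrets",
    2: "philosopher stone",
    3: "secrets of the stone",
}
index = build_index(documents)

print(sorted(lookup_all(index, ["secrets"])))           # [1, 3]
print(sorted(lookup_all(index, ["secrets", "stone"])))  # [3]
```

The lookup never scans the document text: it follows the pointers stored under each word, which is what makes an index faster than reading every document.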
Development (continued)
• Now, with the advent of computers, large indexes can be generated automatically. These automatic indexes provide the logical view of the document as perceived by the system and not the user
• 2 different views of the IR problem:
  • Computer-centered: building efficient indexes, processing user queries with high performance, developing ranking algorithms which improve the quality of the answer set
  • Human-centered: studying the behavior of the user, understanding his main needs, and determining how such understanding affects the organization and operation of the IRSYS
IR in the Library





Libraries were among the first users of IRSYS to retrieve information
Usually developed by academic institutions and later by commercial vendors
1st generation: automation of the card catalog; allowed searches based on author and title
2nd generation: increased search functionality, such as searching by subject headings, keywords and complex queries (OPAC)
3rd generation: graphical interfaces, electronic forms, hypertext features, open system architecture (digital libraries)
The Web and Digital Libraries
• Search engines on the web still use indexes similar to the ones used by libraries years ago
• So, what has changed?
• Advances in computer technology have led to:
  • Cheaper access to various sources of information
  • Greater access to networks due to advances in all kinds of digital communication
  • Freedom to post information on the web

Problems
• People still find it difficult to retrieve information relevant to their information needs from the web
• Issues to address:
  • The dynamic world of the web
  • Demand for access and quick response
  • Quality of the retrieval task is affected by user interaction with the system
THANK YOU