Retrieval - Faculty of Computer Science and Information Technology
WMES3103
INFORMATION RETRIEVAL
WEEK 1 AND 2
WHAT IS INFORMATION RETRIEVAL?
Information Retrieval – IR
Information
Retrieval
Lancaster (1968):
An information retrieval system does not
inform (i.e. change the knowledge of) the user
on the subject of his inquiry. It merely informs on
the existence (or non-existence) and
whereabouts of documents relating to his
request.
IR – the process of getting/retrieving information
Now: a lot of information, print and electronic
Requirement: obtain information quickly and
accurately
IR aims to provide fast, effective and efficient
methods of representing, managing, searching,
retrieving and presenting such information
IR = the representation, storage, organization
of and access to information items
Computer science perspective
Design and build a large-scale system that will
store, manipulate, retrieve and display
electronic information of any kind
Text, audio, image and graphics that are
stored in such a way that they are available
for interaction with human or machine
Library and information perspectives
Search features – author (au), title (ti), subject (su), keywords
Relevance of retrieved items
Examples of IRS
3 challenges for IR researchers and practitioners
Technical challenge: what tools should IR systems
provide to allow effective and efficient manipulation of
information within such diverse media as text, image,
video and audio?
Interaction challenge: what features should IR systems
provide in order to support a wide variety of users in their
search for relevant information?
Evaluation challenge: how can we evaluate which tools
and features are effective and usable, given the
increasing diversity of end-users and information-seeking
situations?
3 basic areas of research
Content analysis – describing the contents
of the documents in a form suitable for
computer processing
Information structures – exploiting
relationships between documents to
improve the efficiency and effectiveness of
retrieval strategies
Evaluation – measurement of effectiveness
of retrieval
Information Retrieval System
Information Retrieval System = IRS
Before: index documents and retrieve them
E.g. OPAC of a library – cataloguing
Now: modelling, document classification
and categorization, system architecture,
user interface, data visualization, filtering,
languages
E.g. WWW
Basic Information Retrieval Process
Question OR full description of
the user's information needs
Translated into a query OR keywords
that summarize the description
of the user's information needs
Query processed by a search engine
or IRS
IRS retrieves information that is
useful/relevant to the user
Basic Concepts in Information
Retrieval
User Task
Logical View of documents
User Task
A user has to translate his information
needs into a query in the language provided
by the system
Specify a set of words
English language statement:
I want a book by J. K. Rowling titled The
Chamber of Secrets
Queries entered in a computer system:
Au = Rowling
Ti = Chamber of Secrets
“Chamber of Secret”
Rowling AND Stone
Au rowling ti chamber of secrets ti stone
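The fielded queries above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalogue records, the tag set and the function names parse_query and search are all hypothetical, and a real OPAC query language is considerably richer. Repeated tags are treated as an implicit AND.

```python
# Hypothetical catalogue records; the field names mirror the au/ti tags above.
CATALOGUE = [
    {"au": "J. K. Rowling", "ti": "Harry Potter and the Chamber of Secrets"},
    {"au": "J. K. Rowling", "ti": "Harry Potter and the Philosopher's Stone"},
    {"au": "F. W. Lancaster", "ti": "Information Retrieval Systems"},
]

FIELD_TAGS = {"au", "ti"}  # assumed tag set for this sketch


def parse_query(query):
    """Split 'au rowling ti chamber of secrets' into (field, phrase) pairs.

    A tag starts a new field and '=' signs are ignored. Words that appear
    before the first tag are dropped. A phrase that itself contains a tag
    word would be misparsed, which is acceptable for a sketch.
    """
    pairs = []
    field, words = None, []
    for token in query.replace("=", " ").lower().split():
        if token in FIELD_TAGS:
            if field and words:
                pairs.append((field, " ".join(words)))
            field, words = token, []
        else:
            words.append(token)
    if field and words:
        pairs.append((field, " ".join(words)))
    return pairs


def search(query):
    """Return the records that satisfy every (field, phrase) pair (implicit AND)."""
    pairs = parse_query(query)
    return [
        record for record in CATALOGUE
        if all(phrase in record[field].lower() for field, phrase in pairs)
    ]
```

With these sample records, "Au = Rowling Ti = Chamber of Secrets" matches a single record, while "Au rowling ti chamber of secrets ti stone" matches none, because no single title contains both phrases.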
2 User Tasks
2 user tasks – browsing and retrieval
Browsing – the process of retrieving info
whereby the main objective is not clearly
defined from the beginning and whose purpose
might change during the interaction with the
system.
E.g. a user searches the Internet for info about
marine organisms, then looks for info about
Australian aborigines. The user is said to be
browsing the collection and not searching.
E.g. searching for a book on the library shelves
Retrieval – the process of retrieving info
whereby the main objective is clearly defined
from the onset of the searching process.
E.g. searching for a book on the library
shelves
2 actions when a user interacts with an IRS
2 actions can be identified when a user interacts
with an IRSYS – pulling and pushing actions.
Pulling action – the user requests info in an
interactive way, e.g. browsing and retrieval
Pushing action – info is pushed towards the user
periodically through the use of specified or
specially designed software; also known as
filtering
E.g. Yahoo Messenger service alerts the user each time
a new message arrives
Online stock exchange
Interaction of the user with the IRSYS
through distinct tasks
[Figure: diagram with labels USER, IR, Browsing, DB]
Logical View of Documents
Documents in a collection are represented
by a set of index terms or keywords
Keywords
Abstract
Full text
Logical View of Documents
•Documents in a collection are represented by a set of index
terms/keywords
[Figure: documents go through an indexing process; index terms are extracted from the text
of the document or assigned by humans]
Keywords/subject headings = logical view of the document
LISANET – search by abstract
MJLIS - EJournal
If full text:
Each word in the text is a keyword
Most complex form
Expensive
If the full text is too large, there are mechanisms
built into the IRS to reduce the number of
keywords:
Logical view of documents - continue
1. Stop words (e.g. articles and connectives – a,
the, an, and, of, etc.)
2. Stemming (reduce distinct words to their
common grammatical root), e.g. diary** will find
diary or diaries; run will represent runs, running
3. Truncation – e.g. catalog* will retrieve catalog,
catalogs, catalogue, catalogues
4. Noun words (eliminates adjectives, adverbs,
verbs)
5. Compression
Conversion Process
Logical view of documents - continue
This conversion process is known as
text operation or transformation
It reduces the complexity of the
document representation and allows the
logical view to move from that of the full text to
that of a set of index terms
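A rough Python sketch of such text operations follows. It is not how any particular IRS implements them: the stop-word list is a small assumed sample, naive_stem is a crude stand-in for a real stemmer such as Porter's, and all function names are hypothetical.

```python
# Assumed sample stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}


def naive_stem(word):
    """Very rough suffix stripping, only meant to illustrate stemming."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("s", "")):
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            stem += replacement
            # Undo a doubled consonant left by '-ing' (running -> runn -> run).
            if suffix == "ing" and stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


def index_terms(text):
    """Full text -> set of index terms: lowercase, strip punctuation,
    drop stop words, stem what is left."""
    terms = set()
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        if word and word not in STOP_WORDS:
            terms.add(naive_stem(word))
    return terms


def truncation_match(pattern, vocabulary):
    """Expand a right-truncated pattern such as 'catalog*' against known words."""
    prefix = pattern.rstrip("*")
    return sorted(word for word in vocabulary if word.startswith(prefix))
```

For example, index_terms("The diaries of a running man.") reduces six words to the three index terms diary, run and man, which is the reduction in complexity described above.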
On the other hand, human-assigned
keywords provide the most concise
logical view of a document but might
lead to retrieval of poor quality –
different interpretations, limited
keywords if using a thesaurus
2 modes of retrieval
Ad hoc – the documents in the IRS
remain static but new queries are
submitted to the system, e.g. a CD-ROM
database
Filtering – the queries remain relatively
static but new documents come into the
IRS, e.g. stock market
Filtering
Construct a user profile that reflects the
user’s preferences; the profile is matched
against incoming documents to find a match
or a hit
Retrieve only documents of interest to the
user, as specified in the user profile
The user selects relevant documents from the list.
Filtered documents can also be ranked to
further assist the user in judging relevance
Construction of a user profile – the user provides the
necessary keywords, or the system collects info about
preferences from the user and uses this to
construct a user profile dynamically
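The filtering steps above can be sketched as follows. This is a minimal illustration, assuming the profile is a plain set of keywords and the score is the number of profile keywords found in a document. The names filter_stream and update_profile, the threshold parameter and the stop-word list are all hypothetical.

```python
def tokenize(text):
    """Lowercase words of a document with surrounding punctuation stripped."""
    return {word.strip(".,;:!?").lower() for word in text.split()} - {""}


def filter_stream(profile, incoming, threshold=1):
    """Match each incoming document against the profile keywords.

    Returns (score, document) pairs with score >= threshold, best first,
    so the filtered documents are also ranked for the user.
    """
    hits = []
    for doc in incoming:
        score = len(profile & tokenize(doc))
        if score >= threshold:
            hits.append((score, doc))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


def update_profile(profile, relevant_doc,
                   stop_words=frozenset({"the", "a", "of", "in", "on", "and"})):
    """Dynamic profile construction: fold the terms of a document the user
    marked as relevant into the profile."""
    return profile | (tokenize(relevant_doc) - stop_words)
```

The profile stays relatively static while new documents keep arriving, which is the reverse of ad hoc retrieval. update_profile shows one simple way a profile could be adjusted from the user's selections.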
INFORMATION RETRIEVAL PROCESS
A. DEFINE TEXT DATABASE
The text database has to be defined before the
retrieval process begins
Done by the database manager – documents to be used,
operations to be performed on the text, text model
The original documents are transformed into a logical view
of the documents via the various text operations
The database manager will then build up the index of
the text – manually or computer generated
The retrieval system is tested
B. RETRIEVAL PROCESS
The IRS can be used once the document
database has been indexed
The user presents his question/user need to
the IRS
The question is changed into a logical view,
as the documents were, via the text operations
The query operations will present this to the
system in a form understandable by the system
The query is processed to obtain the retrieved
documents.
Continue…
The retrieved documents are ranked according to
relevance
The retrieved documents are sent to the user
The user looks through the ranked documents and
can modify the question/user need/query via the
user feedback cycle
The same process is repeated
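The retrieval process above can be sketched in Python as below. The ranking rule used here (count how often the query terms occur in each document) is only an assumed stand-in for a real ranking algorithm, and the names to_terms, rank and refine are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "and", "of"}  # assumed sample list


def to_terms(text):
    """Text operation applied to both the documents and the user's question."""
    words = (token.strip(".,;:!?").lower() for token in text.split())
    return [word for word in words if word and word not in STOP_WORDS]


def rank(query, documents):
    """Score each document by the number of query-term occurrences.

    Returns (doc_id, score) pairs, best first; documents scoring 0 are
    not retrieved. Ties are broken by doc_id.
    """
    query_terms = set(to_terms(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(to_terms(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


def refine(query, extra_terms):
    """User feedback cycle: the user adds terms after reading the ranking."""
    return query + " " + " ".join(extra_terms)
```

A typical round would call rank, let the user inspect the ranked list, then call refine and rank again, which corresponds to the repeated process in the last step.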
DEVELOPMENT
For the past 4000 years, man has been
organizing information for retrieval and usage.
It started out with a table of contents for a book.
Then, the amount of information extended over a
number of books
A specialized data structure is needed to ensure
faster access to the stored info.
The oldest and most popular form of data
structure for fast IR is a collection of words or
concepts with which are associated pointers to the
related info = INDEX
Previously – manual
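An index of this kind, words with pointers to the related info, can be sketched as an inverted index in Python. This is a minimal illustration: the document ids act as the pointers, and the function names build_index and lookup are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            word = token.strip(".,;:!?")
            if word:
                postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def lookup(index, *words):
    """Ids of the documents containing all the given words
    (intersection of the posting lists)."""
    if not words:
        return []
    result = set(index.get(words[0].lower(), []))
    for word in words[1:]:
        result &= set(index.get(word.lower(), []))
    return sorted(result)
```

Looking a word up is a single dictionary access instead of a scan over every document, which is why the index gives faster access to the stored info.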
Development…continue
Now, with the advent of computers, large
indexes can be generated automatically. These
automatic indexes provide the logical view of
the document as perceived by the system
and not the user
2 different views of the IR problem:
Computer-centered – building efficient
indexes, processing user queries with high
performance, developing ranking algorithms which
will improve the quality of the answer set
Human-centered – studying the behavior of
the user, understanding his main needs, and
determining how such understanding affects
the organization and the operation of the
IRSYS.
IR in the Library
Libraries were among the first users of IRSYS to
retrieve information
Usually developed by academic institutions and
later by commercial vendors
1st generation – automation of the card
catalog, allowing searches based on
author and title
2nd generation – increased search
functionality: searching by subject headings,
keywords, complex queries – OPAC
3rd generation – graphical interfaces,
electronic forms, hypertext features, open
system architecture – Digital Libraries
The Web and Digital Libraries
Search engines on the Web still use
indexes similar to the ones used by
libraries years ago.
So, what has changed?
Advances in computer technology have led to:
Cheaper access to various sources of
information
Greater access to networks due to
advances in all kinds of digital
communication
Freedom to post information on the web
Problems
People still find it difficult to retrieve
info relevant to their information needs
from the Web
Issues to address:
Dynamic world on the Web
Demand for access and quick response
Quality of retrieval task is affected by
user interaction with the system
THANK YOU