Information Retrieval

Unit 1
Seema Chandak
Unit 1 : Objective & Content
• Objective
To deal with the representation, storage, and organization of,
and access to, information items in IR.
Unit 1 : Content
• Basic Concepts of IR
• Data Retrieval & Information Retrieval
• IR system block diagram
• Automatic Text Analysis
• Luhn's ideas
• Conflation Algorithm
• Indexing and Index Term Weighting
• Probabilistic Indexing
Unit 1 : Content (contd.)
• Automatic Classification
• Measures of Association
• Different Matching Coefficients
• Classification Methods
• Cluster Hypothesis
• Clustering Algorithms
• Single Pass Algorithm
• Single Link Algorithm
• Rocchio's Algorithm
• Dendrogram
What is IR
Information retrieval:
Subfield of computer science that deals with the automated
retrieval of information (especially text) based on its
content and context.
The term Information Retrieval was coined by Calvin
Mooers (1950).
“It is concerned with the representation, storage,
organization, and accessing of information items.”
Need for IR
• Information is considered the most important
resource for most activities.
• Example: timely weather reports.
• Timely sharing of information.
• The timely retrieval of information plays a major role,
in keeping with the motto “right information at the right
time”.
Types of IR
– Structured (All Database management systems)
– Unstructured (Search engines)
– Semi-structured (Data warehouses)
IR Based on Structured Data
• Recollect Terms related to DBMS ..
– Data Organization in the form of schema, keys,
index, metadata….
– Query structure
– Results set
– …..
– ….
Why IR? Why not Database?
What are some limitations of Database Systems?
IR Vs. DR
• Information Retrieval System: a system that allows a
user to retrieve documents that match her
“information need” from a large corpus.
• Example: Get documents about Java, except for ones
that are about the Java coffee.
• Data Retrieval System: a system that allows a user to
retrieve all documents that match her query from a
large corpus.
• Example: Get all documents containing the term
“Java” but not containing the term “coffee”.
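The data retrieval example above can be sketched as a Boolean filter. This is a minimal illustration, assuming a tiny made-up corpus and a hypothetical helper `boolean_filter`; it is not taken from the slides.

```python
import re

# Hypothetical toy corpus, invented for illustration only.
DOCS = {
    1: "Java is an object oriented programming language",
    2: "Java coffee is grown on the island of Java",
    3: "Drink coffee while you debug Java programs",
    4: "Python is a programming language",
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_filter(docs, must_have, must_not_have):
    """Return ids of documents containing `must_have` but not `must_not_have`."""
    hits = []
    for doc_id, text in docs.items():
        terms = tokenize(text)
        if must_have in terms and must_not_have not in terms:
            hits.append(doc_id)
    return sorted(hits)

print(boolean_filter(DOCS, "java", "coffee"))
```

The filter only tests term presence, so document 3 is rejected even though it is about Java programming. An IR system tries to close that gap between the literal query and the information need.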
IR Vs. DR
1. Matching.
– In data retrieval we normally look for an exact
match, that is, we check whether an
item is or is not present in the file.
– E.g. SELECT * FROM Student WHERE per >= 75.0
– In information retrieval we more generally want to
find those items which partially match the request
and then select from those a few of the best
matching ones.
– E.g. students of PICT college with a percentage of
75 or more.
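The contrast between exact and partial matching can be sketched as follows. The records, the documents, and the scoring rule (fraction of query terms found in a document) are assumptions chosen for illustration, not a method given in the slides.

```python
# Hypothetical student records, invented for illustration only.
STUDENTS = [
    {"name": "Asha", "college": "pict", "per": 82.5},
    {"name": "Ravi", "college": "pict", "per": 74.9},
    {"name": "Meena", "college": "coep", "per": 91.0},
]

# Hypothetical free-text documents.
DOCS = {
    "d1": "pict students scoring above 75 percent",
    "d2": "list of pict college students",
    "d3": "weather report for pune",
}

def exact_match(records, threshold=75.0):
    """Data retrieval: a record either satisfies the condition or it does not."""
    return [r["name"] for r in records if r["per"] >= threshold]

def best_match(docs, query, top_k=2):
    """Information retrieval: score each document by the fraction of query
    terms it contains, then keep the few best-scoring ones."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap / len(q_terms), doc_id))
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

print(exact_match(STUDENTS))
print(best_match(DOCS, "pict college students percentage"))
```

Note that `d1` is still returned although it does not contain every query term; an exact-match system would have rejected it.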
IR Vs. DR
2. Inference
– In data retrieval, inference is of the simple deductive kind, that is,
if aRb and bRc then aRc.
– In information retrieval, inference is inductive:
– relations are only specified with a degree of certainty or
uncertainty, and hence our confidence in the inference is
variable.
3. Model
– Data retrieval is deterministic but information retrieval is
probabilistic.
– Bayes' Theorem is frequently invoked to carry out inferences
in IR, but in DR probabilities do not enter into the processing.
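As a small worked example of how Bayes' Theorem can enter an IR inference, the sketch below updates the probability that a document is relevant after observing that it contains a query term. The function name and all the probability values are made up for illustration.

```python
def posterior_relevance(prior, p_term_given_rel, p_term_given_nonrel):
    """Bayes' Theorem: P(R | t) = P(t | R) * P(R) / P(t),
    where P(t) = P(t | R) * P(R) + P(t | not R) * (1 - P(R))."""
    evidence = p_term_given_rel * prior + p_term_given_nonrel * (1.0 - prior)
    return p_term_given_rel * prior / evidence

# Assumed values: 10% of documents are relevant, the term occurs in 80% of
# relevant documents and in 20% of non-relevant ones.
p = posterior_relevance(prior=0.1, p_term_given_rel=0.8, p_term_given_nonrel=0.2)
print(round(p, 3))
```

Seeing the term raises the estimated probability of relevance above the prior, but it remains a probability rather than a yes/no answer.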
IR Vs. DR
4. Classification:
– In DR a monothetic classification is most likely to be used,
that is, one with classes defined by objects
possessing attributes both necessary and sufficient to
belong to a class.
– In IR such a classification is not very useful.
– A polythetic classification is mostly used.
– Each individual in a class will possess only a proportion
of all the attributes possessed by all the members of
that class.
– Hence no attribute is either necessary or sufficient for
membership of a class.
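The two kinds of class membership can be contrasted in a short sketch. The attribute sets and the 50% threshold are assumptions for illustration; the slides do not fix a particular proportion.

```python
def monothetic_member(item_attrs, defining_attrs):
    """Monothetic class: the item must possess every defining attribute."""
    return defining_attrs <= item_attrs

def polythetic_member(item_attrs, class_attrs, min_fraction=0.5):
    """Polythetic class: the item needs only a large enough share of the
    attributes found in the class; no single attribute is required."""
    shared = len(item_attrs & class_attrs)
    return shared / len(class_attrs) >= min_fraction

# Hypothetical attribute sets (e.g. index terms).
CLASS_ATTRS = {"retrieval", "index", "query", "ranking"}
doc_a = {"retrieval", "index", "query", "ranking"}
doc_b = {"index", "ranking", "cluster"}

print(monothetic_member(doc_a, CLASS_ATTRS), monothetic_member(doc_b, CLASS_ATTRS))
print(polythetic_member(doc_a, CLASS_ATTRS), polythetic_member(doc_b, CLASS_ATTRS))
```

Here `doc_b` fails the monothetic test because it lacks some defining attributes, yet it passes the polythetic test by sharing half of the class attributes.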
IR Vs. DR
5. Query Language:
– The query language for DR is one with restricted
syntax and vocabulary.
– In IR we prefer to use natural language although there
are some notable exceptions.
6. Query Specification :
– In DR the query is generally a complete specification
of what is wanted.
– In IR it is invariably incomplete.
IR Vs. DR
7. Items wanted :
– In IR we are searching for relevant documents as
opposed to exactly matching items in DR.
8. Error response :
– DR is more sensitive to error, in the sense that an
error in matching will not retrieve the wanted item,
which implies a total failure of the system.
– In IR small errors in matching generally do not affect
the performance of the system significantly.
IR Vs. DR
Aspect              | Data Retrieval (DR)                  | Information Retrieval (IR)
Matching            | Exact match                          | Partial match, best match
Inference           | Deduction                            | Induction
Model               | Deterministic                        | Probabilistic
Classification      | Monothetic                           | Polythetic
Data                | Database tables, structured          | Free text, unstructured
Query language      | Artificial, SQL, relational algebras | Natural, keywords, free text
Query specification | Complete                             | Incomplete
Items wanted        | Matching                             | Relevant
IR vs. DR
Aspect         | Information Retrieval | Data Retrieval
Error response | Insensitive           | Sensitive
Results        | Approximate matches   | Exact matches
Results        | Ordered by relevance  | Unordered
Accessibility  | Non-expert humans     | Knowledgeable users or automatic processes
Issues with Information Retrieval?
Information Retrieval deals with uncertainty and
vagueness in information systems.
• Uncertainty: the available representation typically does not
reflect the true semantics/meaning of objects (text, images,
video, etc.)
• Vagueness: the information need of the user lacks clarity and is only
vaguely expressed in the query, feedback or user actions.
• Differs conceptually from database queries!
Recall the Definition
• What Is IR?
• “Finding some desired information in large data sets or stores
of information”
• Means:
– Searching for documents
– Searching for information in documents
– Searching for metadata which describes documents
– Searching within databases
• Web search engines like Google and Lycos are the most visible
IR applications.
• IR systems are used to reduce information overload.
Definition
• Automatic Information Retrieval
– Automatic, as against ‘manual’.
– Information, as against ‘data’.
• Defn: An information retrieval system does not inform
(i.e. change the knowledge of) the user on the subject of
his inquiry.
• It merely informs on the existence (or non-existence)
and whereabouts of documents relating to his request.
Media – Where Does Information Reside?
• Text documents: web pages, books, articles, papers,
emails, etc.
• Manuscripts
• Graphics & Images
• Speech & Video
• Maps & Satellite Imagery
• Local Information, Yellow Pages
• Mismatch: given representation in specific medium vs.
semantic description of information (semantic gap)
Scale - How Much Information is out there?
• World Wide Web
– Tens or hundreds of billions of documents?
– Approx. 10 KB/doc, giving 100s of TB
• Then there is everything else
– Email, personal files, proprietary databases,
broadcast media, print
– Estimated 5 exabytes p.a. (growing at 30%)
– 800 MB p.a. per person
• Web is just a tiny starting point….
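The figures on this slide can be sanity-checked with simple arithmetic. The sketch below assumes decimal (SI) units and reads the web estimate as document count times average document size; the population figure it derives is only what the slide's own numbers imply, not an independent statistic.

```python
KB, MB, TB, EB = 10**3, 10**6, 10**12, 10**18  # decimal byte units (assumed)

def web_size_tb(num_docs, bytes_per_doc=10 * KB):
    """Total size in terabytes of `num_docs` documents of a given average size."""
    return num_docs * bytes_per_doc / TB

tens_of_billions = web_size_tb(10 * 10**9)       # ten billion documents
hundreds_of_billions = web_size_tb(100 * 10**9)  # one hundred billion documents
print(tens_of_billions, hundreds_of_billions)

# 5 EB per year at 800 MB per person implies this many people.
implied_people = 5 * EB / (800 * MB)
print(implied_people)
```

Ten billion documents at 10 KB each come to 100 TB, and a hundred billion to 1000 TB, so "100s of TB" sits between the two. Dividing 5 exabytes by 800 MB gives 6.25 billion people, so the two yearly figures are consistent with each other.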
IR problem
• IR mainly deals with a very large, mostly
unstructured data set.
• The IR problem consists of:
– building efficient indexes.
– processing user queries with high performance.
– improving the ‘quality’ of the answer set.
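The first two parts of the IR problem, building an index and processing a query against it, can be illustrated with an inverted index. This is a minimal sketch on a made-up corpus; the function names are hypothetical and a real system would add compression, ranking and much more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all_terms(index, query):
    """Return ids of documents that contain every term of the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical toy corpus.
DOCS = {
    1: "information retrieval systems",
    2: "database systems and data retrieval",
    3: "weather information",
}
index = build_inverted_index(DOCS)
print(index["retrieval"])
print(search_all_terms(index, "information retrieval"))
```

The index is built once; each query then touches only the posting lists of its own terms instead of scanning every document.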
Basic Concepts
• Information retrieval is directly affected by the:
– User Tasks
– Document Logical view
User Tasks
• Interaction of the user with the retrieval system.
[Figure: user tasks, labelled Retrieval, Browsing and Documents]
User Tasks
• Classical information retrieval systems allow retrieval.
• Hypertext systems are usually tuned for quick
browsing.
• Modern digital libraries and Web interfaces might
attempt to combine these tasks.
Logical view of the document
• The representation of documents by keywords or
index terms is known as the logical view of the documents.
• Keywords are either extracted directly from the text of the
document or specified by a human.
• Modern computers represent a document by its set of:
– Full words.
– A smaller set of words, obtained by operations such as:
• Stopwords: elimination of articles and
connectives.
• Stemming: reduces distinct words to their
common grammatical roots.
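How a reduced logical view might be derived from the full text can be sketched as follows. The stopword list is a small assumed sample, and `crude_stem` is a deliberately naive suffix stripper written for this example; it is not Porter's algorithm or any other published conflation algorithm.

```python
# Assumed sample of stopwords (articles and connectives); real lists are longer.
STOPWORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "to", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order

def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def logical_view(text):
    """Lowercase, drop punctuation and stopwords, then stem what remains."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    return [crude_stem(w) for w in words if w and w not in STOPWORDS]

view = logical_view("The user is searching and indexing the retrieved documents.")
print(view)
```

A stem such as `retriev` need not be a dictionary word; it only has to be shared by the variants (retrieve, retrieved, retrieving) so that they map to the same index term.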
Introduction…
• Information Retrieval System:
[Figure: A typical IR system, with blocks labelled Queries, Documents, Input, Processor, Output and Feedback (sample retrieval)]
Introduction…
• Information Retrieval System:
– Input: Only a representation of the document (or query) is stored,
which means that the text of a document is lost once it has been
processed for the purpose of generating its representation.
– A document representative could be a list of extracted words
considered to be significant.
– The user has to express the needed information in the
query language.
– Processor: Performs the actual retrieval function,
executing the search strategy in response to a query.
– Feedback: Improves the subsequent run after a sample retrieval.
– Output: A set of document numbers, on which evaluation can
then be done.
Information Retrieval Process
[Figure: Information Retrieval Process. Query side ("How is the query constructed?"): Information need, Parse, Query. Document side ("How is the text processed?"): Collections, text input, Pre-process, Index. Both feed into Rank.]
Definitions
• Searching: Seeking specific information within
a body of information. The result of a search is a
set of hits.
• Browsing: Unstructured exploration of a body of
information.
• Linking: Moving from one item to another
following links, such as citations, references, etc.
The Basics of Information Retrieval
Query: A string of text, describing the information that
the user is seeking. Each word of the query is called a
search term.
A query can be a single search term, a string of terms, a
phrase in natural language, or a stylized expression using
special symbols.
Full text searching: Methods that compare the query
with every word in the text, without distinguishing the
function of the various words.
Fielded searching: Methods that search on specific
bibliographic or structural fields, such as author or
heading.
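The difference between full text and fielded searching can be sketched on a couple of records. The records, names and field layout below are invented for illustration only.

```python
# Hypothetical bibliographic records, invented for illustration only.
RECORDS = [
    {"id": 1, "author": "Rao", "heading": "Index construction",
     "body": "Notes on building inverted indexes."},
    {"id": 2, "author": "Iyer", "heading": "Ranking methods",
     "body": "Extends the earlier notes by Rao on ranking."},
]

def words_of(text):
    """Lowercase word set of a text, with simple punctuation stripped."""
    return {w.strip(".,") for w in text.lower().split()}

def full_text_search(records, term):
    """Compare the term with every word of the record, ignoring field roles."""
    term = term.lower()
    hits = []
    for rec in records:
        text = " ".join(str(v) for k, v in rec.items() if k != "id")
        if term in words_of(text):
            hits.append(rec["id"])
    return hits

def fielded_search(records, field, term):
    """Match the term only against the words of one named field."""
    term = term.lower()
    return [rec["id"] for rec in records if term in words_of(rec[field])]

print(full_text_search(RECORDS, "Rao"))
print(fielded_search(RECORDS, "author", "Rao"))
```

Full text searching finds both records because the name also appears in the body of record 2, while the fielded search on `author` returns only the record actually written by that author.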
SORTING AND RANKING HITS
When a user submits a query to a search system, the
system returns a set of hits. With a large collection of
documents, the set of hits may be very large.
The value to the user depends on the order in which the
hits are presented.
Three main methods:
• Sorting the hits, e.g., by date
• Ranking the hits by similarity between query and
document
• Ranking the hits by the importance of the documents
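The three orderings listed above can be compared on one small set of hits. This is only a sketch: the hits are made up, the Jaccard coefficient is an assumed choice of similarity measure, and `importance` stands in for any query-independent score.

```python
from datetime import date

# Hypothetical hits for the query "weather report".
HITS = [
    {"id": "d1", "date": date(2009, 5, 1), "text": "weather report",
     "importance": 0.5},
    {"id": "d2", "date": date(2011, 3, 9),
     "text": "timely weather report retrieval systems", "importance": 0.2},
    {"id": "d3", "date": date(2010, 7, 4), "text": "weather in pune today",
     "importance": 0.9},
]

def jaccard(query, text):
    """Jaccard coefficient between the word sets of query and document."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t)

def sort_by_date(hits):
    """Newest hit first."""
    return [h["id"] for h in sorted(hits, key=lambda h: h["date"], reverse=True)]

def rank_by_similarity(hits, query):
    """Most similar to the query first."""
    key = lambda h: jaccard(query, h["text"])
    return [h["id"] for h in sorted(hits, key=key, reverse=True)]

def rank_by_importance(hits):
    """Most important document first, regardless of the query."""
    return [h["id"] for h in sorted(hits, key=lambda h: h["importance"], reverse=True)]

print(sort_by_date(HITS))
print(rank_by_similarity(HITS, "weather report"))
print(rank_by_importance(HITS))
```

The same three hits come out in three different orders, which is why the choice of ordering determines what the user sees first.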
Examples of Search Systems
Finding files on a computer system (Spotlight for
Macintosh).
Library catalog for searching bibliographic records
about books and other objects (Library of Congress
catalog).
Abstracting and indexing system for finding
research information about specific topics (Medline
for medical information).
Web search service for finding web pages (Google).