introductionx - COW :: Ceng

Download Report

Transcript introductionx - COW :: Ceng

CENG 776 Information Retrieval
Nihan Kesim Çiçekli
email: [email protected]
URL: http://www.ceng.metu.edu.tr/~nihan
/60
1
Course Description
This course is about
• Processing
• Indexing
• Retrieving
• … textual data
Fits in four lines, but much more complex and
interesting than that
2 /60
Course Resources
Course web page
• cow.ceng.metu.edu.tr
Text Book
• Introduction to Information Retrieval by Christopher D. Manning et
al., Cambridge University Press, 2008.
Reference Material
• Managing Gigabytes, by I. Witten, A. Moffat, and T. Bell.
• Information Retrieval: Algorithms and Heuristics by D. Grossman
and O. Frieder.
• Modern Information Retrieval, by R. Baeza-Yates and B. RibeiroNeto.
• Search Engines: Information Retrieval in Practice, B. Croft et al.
3 /60
Grading
•
•
•
•
Written assignments
Project
Presentation
Final Exam
20 %
30 %
20 %
30 %
4 /60
Course Content
•
•
•
•
•
•
•
•
•
•
Boolean Retrieval
Indexing and TF-IDF
Components of an IR system
Evaluation in information retrieval
Retrieval Models: Boolean, Vector Space Model,
Probabilistic Model
Web search basics
Web crawling, indexes and link analysis
XML-retrieval
Text Classification
Text clustering
5 /60
The task of Information Retrieval (IR)
• A large repository of documents is stored on
computers (corpus)
• There is a topic about which the user wants to
get some information (information need)
• Some of those documents may contain the
information that satisfies user’s need (relevance)
• How does the user retrieve those documents?
• The user communicates information need to the
computer by expressing it in the form of a query.
6 /60
Structured data
• How the query is expressed will depend on
whether the data in the document corpus is
structured or unstructured.
• Structured data tends to refer to information
in ˝tables˝ and has a clear semantic structure.
Employee
Manager
Salary
Smith
Jones
5000
Hopkins
Smith
6000
Black
Jones
5000
7 /60
Structured data
• Structured data allows for expressive queries
like:
List the social security numbers of all employees who have
worked more than 5 years and whose salaries are above
the average salary of all employees.
8 /60
Unstructured data
• Unstructured data does not have a clear
semantic structure (e.g. Free text on a web
page, audio, video).
• Allows less expressive queries of the form:
Give me all documents that have the keywords ‘culture of
cities’
Structured data
Unstructured data
Database systems
Information retrieval
9 /60
Information Need
• Example of an information need in the context
of the world wide web:
Find all documents (information!) about universities in
Turkey that (1) offer master degrees in Computer
Engineering and (2) are registered with ACM SIGIR. The
information (the document!) should include full
curriculum, student campus, e-mail and other con- tact
details.
• Formal representation of an information need
= Query
10 /60
Information Retrieval / Data Retrieval
Information Retrieval
Data Retrieval
Matching
vague
exact
Model
probabilistic
deterministic
Query language
natural
artificial
Query specification
incomplete
complete
Items wanted
relevant
all (matching)
Error handling
insensitive
sensitive
11 /60
12 /60
13 /60
Semi-structured data
• In reality, almost no data is truly unstructured
• Even this slide has identified zones such as
title and bullets.
• Headings, paragraphs and footnotes are
represented by explicit markup.
• Semi-structured data facilitates queries such
as:
List all documents in which Title contains data and Bullets
contain markup.
14 /60
Definition (IR)
• Information retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) from a corpus, that satisfies an
information need.
15 /60
What Types of Information?
•
•
•
•
•
•
•
Text (Documents)
XML and structured documents
Images
Audio (sound effects, songs, etc.)
Video
Source code
Applications/Web services
16 /60
The Information Retrieval Cycle
Source
Selection
Resource
Query
Formulation
Query
Search
Ranked List
Selection
Documents
query reformulation,
relevance feedback
result
17 /60
Information Retrieval
Representation, storage, organisation and access of information
(information items, information objects, documents).
Find relevant (useful) information
• Goal of an IR system – RECALL: Retrieve all relevant
documents (e.g. Law documents)
• Goal of an IR system – PRECISION: Retrieve the most relevant
documents (e.g. web).
• Goal of an IR system:
– Retrieve as few non-relevant documents as possible.
– Retrieve relevant documents before non-relevant documents.
18 /60
The IR Black Box
Query
Documents
Results
19 /60
Inside The IR Black Box
Query
Documents
Representation
Representation
Query Representation
Document Representation
Comparison
Function
Index
Results
20 /60
The Central Problem in IR
Information Seeker
Authors
Concepts
Concepts
Query Terms
Document Terms
Do these represent the same concepts?
Why IR is hard? Because Language is hard!!!
/60
21
Relevance
• Relevance is a subjective judgment and may
include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her
intended use of the information (information
need).
22 /60
Keyword Search
• Simplest notion of relevance is that the query
string appears verbatim in the document.
• Slightly less strict notion is that the words in
the query appear frequently in the document,
in any order (bag of words).
23 /60
Problems with Keywords
• May not retrieve relevant documents that
include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
• May retrieve irrelevant documents that
include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)
24 /60
Beyond Keywords
• We will cover the basics of keyword-based IR
• We will focus on extensions and recent
developments that go beyond keywords.
• We will cover the basics of building an
efficient IR system
• We will focus on basic capabilities and
algorithms rather than systems issues that
allow scaling to industrial size databases.
25 /60
Intelligent IR
• Taking into account the meaning of the words
used.
• Taking into account the order of words in the
query.
• Adapting to the user based on direct or
indirect feedback.
• Taking into account the authority of the
source.
26 /60
IR System Architecture
User Interface
Text
User
Need
Text Operations
Logical View
User
Feedback
Query
Ranked
Docs
Query
Operations
Searching
Ranking
Indexing
Index
Database
Manager
Inverted
file
Retrieved
Docs
Text
Database
27 /60
IR System Components
• Text Operations forms index words (tokens).
– Stopword removal
– Stemming
• Indexing constructs an inverted index of
word to document pointers.
• Searching retrieves documents that contain
a given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
28 /60
IR System Components (continued)
• User Interface manages interaction with the
user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to
improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
29 /60
What to Evaluate?
– Coverage of information
– Form of presentation
– Effort required/ease of use
– Time and space efficiency
– Recall
• proportion of relevant material actually retrieved
– Precision
• proportion of retrieved material actually relevant
30 /60
Precision vs. Recall
All docs
Retrieved
Relevant
| RelRetriev ed |
Recall 
| Rel in Collection |
| RelRetriev ed |
Precision 
| Retrieved |
31 /60
Beyond Just Search
• Text Categorization
• Document/Term Clustering
• Text Summarization
32 /60
Web Search
• Application of IR to HTML documents on the
World Wide Web.
• Differences:
– Must assemble document corpus by spidering the
web.
– Can exploit the structural layout information in
HTML (XML).
– Documents change uncontrollably.
– Can exploit the link structure of the web.
33 /60
IR System
Document
corpus
Query
String
IR
System
Ranked
Documents
1. Doc1
2. Doc2
3. Doc3
.
.
34 /60
Web Search System
Web
Spider
Document
corpus
IR
System
Query
String
1. Page1
2. Page2
3. Page3
.
.
Ranked
Documents
35 /60
Other IR-Related Tasks
•
•
•
•
•
•
•
Automated document categorization
Information filtering (spam filtering)
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering
36 /60
Information Retrieval vs Information Extraction
• Information Retrieval:
Given a set of terms and a set of document terms select
only the most relevant document (precision), and
preferably all the relevant ones (recall)
• Information Extraction:
Extract from the text what the document means
• IR can FIND documents but needs not “understand” them.
37 /60
History of IR
• 1960-70’s:
– Initial exploration of text retrieval systems for
“small” corpora of scientific abstracts, and law and
business documents.
– Development of the basic Boolean and vectorspace models of retrieval.
38 /60
IR History (continued)
• 1980’s:
– Large document database systems, many run by
companies:
• Lexis-Nexis
• Dialog
• MEDLINE
39 /60
IR History (continued)
• 1990’s:
– Searching FTPable documents on the Internet
• Archie
• WAIS
– Searching the World Wide Web
• Lycos
• Yahoo
• Altavista
40 /60
IR History (continued)
• 1990’s continued:
– Organized Competitions
• NIST TREC
– Recommender Systems
• Ringo
• Amazon
• NetPerceptions
– Automated Text Categorization & Clustering
41 /60
Recent IR History
• 2000’s
– Link analysis for Web Search
• Google
– Automated Information Extraction
• Whizbang
• Fetch
• Burning Glass
– Question Answering
• TREC Q/A track
42 /60
Recent IR History
• 2000’s continued:
– Multimedia IR
• Image
• Video
• Audio and music
– Cross-Language IR
• DARPA Tides
– Document Summarization
43 /60
Related Areas
•
•
•
•
•
Database Management
Library and Information Science
Artificial Intelligence
Natural Language Processing
Machine Learning
44 /60
Database Management
• Focused on structured data stored in relational tables
rather than free-form text.
• Focused on efficient processing of well-defined
queries in a formal language (SQL).
• Clearer semantics for both data and queries.
• Recent move towards semi-structured data (XML)
brings it closer to IR.
45 /60
Library and Information Science
• Focused on the human user aspects of information
retrieval (human-computer interaction, user
interface, visualization).
• Concerned with effective categorization of human
knowledge.
• Concerned with citation analysis and bibliometrics
(structure of information).
• Recent work on digital libraries brings it closer to CS
& IR.
46 /60
Artificial Intelligence
• Focused on the representation of knowledge,
reasoning, and intelligent action.
• Formalisms for representing knowledge and
queries:
– First-order Predicate Logic
– Bayesian Networks
• Recent work on web ontologies and intelligent
information agents brings it closer to IR.
47 /60
Natural Language Processing
• Focused on the syntactic, semantic, and pragmatic
analysis of natural language text and discourse.
• Ability to analyze syntax (phrase structure) and
semantics could allow retrieval based on meaning
rather than keywords.
48 /60
Natural Language Processing:
IR Directions
• Methods for determining the sense of an ambiguous
word based on context (word sense disambiguation).
• Methods for identifying specific pieces of
information in a document (information extraction).
• Methods for answering specific NL questions from
document corpora.
49 /60
Machine Learning
• Focused on the development of computational
systems that improve their performance with
experience.
• Automated classification of examples based on
learning concepts from labeled training examples
(supervised learning).
• Automated methods for clustering unlabeled
examples into meaningful groups (unsupervised
learning).
50 /60
Machine Learning:
IR Directions
• Text Categorization
– Automatic hierarchical classification (Yahoo).
– Adaptive filtering/routing/recommending.
– Automated spam filtering.
• Text Clustering
– Clustering of IR query results.
– Automatic formation of hierarchies (Yahoo).
• Learning for Information Extraction
• Text Mining
51 /60
IR systems on the Web
• Search for Web pages http://www.google.com
• Search for images http://www.picsearch.com
• Search for answers to questions
http://www.askjeeves.com
• Music retrieval
http://www.rotorbrain.com/foote/musicr/
52 /60
53 /60
54 /60
55 /60
56 /60
57 /60
58 /60
Open-source IR toolkits
• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato,
New Zealand),
• Lemur (CMU/Univ. of Massachusetts)
• Lucene (Nutch)
59 /60
Information Retrieval Models
•
•
•
•
The Boolean Model
The Vector Space Model
Probabilistic models
Language models
60 /60