A Risk Minimization Framework for Information Retrieval
Overview of IR Research
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
What is Information Retrieval (IR)?
• Salton’s definition (Salton 68): “information retrieval is
a field concerned with the structure, analysis, organization, storage,
searching, and retrieval of information”
– Information: mostly text, but can be anything (e.g.,
multimedia)
– Retrieval:
• Narrow sense: search/querying
• Broad sense: filtering, classification, summarization, ...
• In more general terms
– Information access
– Information seeking
– Help people manage and make use of all kinds of information
Who is working on IR?
(IR and Related Areas)
[Figure: Information Retrieval shown at the center of its related areas.
Region labels: Applications, Models, Algorithms, Systems.
Area labels: Machine Learning, Pattern Recognition, Data Mining;
Statistics, Optimization; Natural Language Processing;
Applications (Web, Bioinformatics…); Library & Info Science;
Databases; Software engineering; Computer systems.]
IR and NLP
• The two fields were closely related from day one, but
became somewhat disconnected later, when NLP focused more
on cognitive and symbolic approaches while IR
focused more on purely statistical approaches
• Most recently, the two fields have regained close
interactions
– More complex retrieval tasks (question answering,
opinions)
– More scalable/robust NLP techniques (parsing,
extraction)
• IR researchers pioneered statistical approaches to
NLP in the 1950s (e.g., H. P. Luhn), which only became
popular among NLP researchers in the 1990s
IR and Databases
• “Sibling” fields, but they didn’t get along well with
each other
• IR and DB share many common tasks, but the
differences in the form of data and the nature of
queries were large enough to keep the two
fields separate for most of their history
• Major differences in data, users, queries, and what
counts as an answer; DB emphasizes efficiency, IR
emphasizes effectiveness
• The two fields are now getting closer and
closer (DB researchers realized the importance of the 80% of
data that is unstructured, and IR researchers realized the
importance of semantic search)
IR and Machine Learning
• IR as a subfield of AI (IR = intelligent text access)?
– AI is too big to have a coherent community (e.g.,
ML, NLP, Computer Vision have all “spun off”)
• IR researchers did machine learning as early as the
1960s (Rocchio 1965, relevance feedback), but
supervised learning didn’t become popular in IR until the
early 1990s, when text categorization started getting a
lot of attention
– Lack of training data for search (no large-scale online systems;
users don’t like to spend effort on judgments)
– Learning-based approaches didn’t prevail for ad hoc retrieval
• Machine learning is now very important for IR
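Relevance feedback in the style of Rocchio can be sketched as follows. This is a minimal illustration rather than the original formulation: the function name `rocchio`, the sparse dict representation of vectors, and the default weights `alpha`, `beta`, and `gamma` are assumptions made for this example. The idea is to move the query vector toward the centroid of documents judged relevant and away from the centroid of documents judged non-relevant.

```python
from collections import Counter


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector as a dict of term -> weight.

    query: dict term -> weight (the original query vector)
    relevant / nonrelevant: lists of dict term -> weight (judged documents)
    alpha, beta, gamma: illustrative weights (assumed, not from the slides)
    """
    new_query = Counter()
    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Add the centroid of the relevant documents, scaled by beta.
    for doc in relevant:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    # Subtract the centroid of the non-relevant documents, scaled by gamma.
    for doc in nonrelevant:
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    # Terms whose weight ends up non-positive are dropped in this sketch.
    return {term: w for term, w in new_query.items() if w > 0}


if __name__ == "__main__":
    q = {"jaguar": 1.0}
    rel = [{"jaguar": 1.0, "car": 2.0}]
    nonrel = [{"jaguar": 1.0, "cat": 2.0}]
    print(rocchio(q, rel, nonrel))
```

In the toy call, the term "car" enters the query because it occurs in the relevant document, while "cat" is pushed below zero and dropped. The judgments that feed this update are exactly the training data the slide describes as hard to obtain.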
IR and Library & Information Science
• Inseparable from day one (“Information
Science” vs. “Computer Science”)
• Early IR work was mostly done in the context
of library and information science (LIS)
• I-School initiative/movement: drop “library”
and enlarge the scope to “informatics”,
leading to merger of CS + LIS
• Another example where the boundary between
fields is disappearing (setting boundaries is
generally harmful for research, but is
sometimes needed in practice)
IR and Software Engineering
• Scalability of IR wasn’t a major concern until
the Web
– Data collections were relatively small and didn’t grow
quickly until the Web
– The most effective retrieval models remain simple
models based on bag-of-words representation
• However, scalability has always been a core
issue in IR, and how to engineer an IR system
optimally is extremely important for IR
applications
• Nowadays, data-intensive computing is
essential for large-scale IR applications
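To make "simple models based on bag-of-words representation" concrete, here is a small sketch of TF-IDF weighting with cosine ranking. It is one common instance of such a model, not a specific system from these slides; the helper names (`tokenize`, `tfidf_vectors`, `cosine`, `rank`), the whitespace tokenizer, and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; word order is discarded (bag of words).
    return text.lower().split()


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict term -> weight) for each document."""
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(bags)
    # Document frequency: number of documents containing each term.
    df = Counter(term for bag in bags for term in bag)
    # Smoothed IDF (an assumed variant; many others are in use).
    idf = {t: math.log((n + 1) / (df_t + 1)) + 1.0 for t, df_t in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by decreasing cosine similarity."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    docs = [
        "information retrieval models",
        "database query processing",
        "retrieval of text information",
    ]
    for score, i in rank("information retrieval", docs):
        print(f"{score:.3f}  {docs[i]}")
```

This sketch scores every document for every query, which is fine for a handful of documents but is exactly what does not scale; large collections rely on index structures and distributed computation instead.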
IR and Applications
• Early days: library search, literature
• 1970s: small-scale online search systems
• 1990s: large-scale systems
– TREC (mostly news data, later other kinds of data)
– Web search engines
• 2010s: search is everywhere!
• More and more applications in the future
Publications/Societies (broad view)
[Figure: publication venues grouped by area]
• Learning/Mining: ICML, NIPS, UAI; ACM SIGKDD, ICDM, SDM; AAAI
• Statistics
• NLP: ACL, HLT, COLING, EMNLP, NAACL
• Applications: ISMB, RECOMB, PSB; WWW, WSDM
• Info Retrieval: ACM SIGIR, ECIR, CIKM, TREC; TOIS, IRJ, IPM; JCDL
• Info. Science: JASIS
• Software/systems: OSDI
• Databases: ACM SIGMOD, VLDB, ICDE, EDBT, TODS
Major IR Publication Venues
[Figure: timeline from before 1960 to 2010]
• JDoc: 1945
• JASIST: 1950
• IPM (ISR): 1965
• ACM SIGIR: 1978
• ECIR: 1978
• ACM TOIS: 1983
• TREC: 1992
• CIKM: 1994
• WWW: 1994
• IRJ: 1998
• WSDM: 2008
IR Research Topics (Broad View)
[Figure: IR research topics, broad view.
Labels: Users; Retrieval Applications; Analytics Applications;
Visualization; Summarization; Filtering; Search; Information Access;
Mining; Information Organization; Categorization; Extraction; Clustering;
Natural Language Content Analysis; Text Acquisition; Text; Text Mining.]
IR Topics (narrow view)
[Figure: components of a retrieval system (INDEXING, SEARCHING, INTERFACE,
QUERY MODIFICATION, LEARNING) connecting docs, Doc Rep, Query Rep, query,
results, judgments, Feedback, and the User, annotated with numbered topics]
1. Evaluation
2. Retrieval (ranking) models
3. Document representation/structure
4. Efficiency & scalability
5. Search result summarization/presentation
6. User interface (browsing)
7. Feedback/Learning
“core” topics: 1-4, 7, especially 1, 2, 7
Major Research Milestones
• Early days (late 1950s to 1960s): foundation and founding of the field
– Luhn’s work on automatic encoding (Indexing: auto vs. manual)
– Cleverdon’s Cranfield evaluation methodology and index experiments (Evaluation)
– Salton’s early work on SMART system and experiments (System: Indexing + Search)
• 1970s-1980s: a large number of retrieval models (Theory)
– Vector space model
– Probabilistic models
• 1990s: further development of retrieval models and new tasks
– Language models
– TREC evaluation (Large-scale evaluation, beyond ad hoc retrieval)
• 2000s-present: more applications, especially Web search and interactions
with other fields
– Web search (Web search)
– Learning to rank (Machine learning)
– Scalability (e.g., MapReduce) (Scalability)
Frontier Topics in IR: Overview
• Two types of topics
– 30%: Fundamental challenges: IR models, evaluation, efficiency,
user models/studies
– 70%: Application-driven challenges: Web (1.0, 2.0, 3.0?),
Enterprise (text analytics), Scientific Research (bioinformatics,
…)
• Methodology
– 50%: Machine learning (feature set + supervised)
– 30%: Language models (unigram + unsupervised)
– 20%: Others (user studies, empirical experiments)
• Trends
– More interdisciplinary and internationalized
– More diversification of topics (new applications, new methods)
– Hard fundamental problems regularly revisited
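The "unigram + unsupervised" language-model methodology listed above can be illustrated with query-likelihood scoring. The sketch below is one common way to do it, not a method taken from these slides: the function name `query_log_likelihood`, the choice of Dirichlet prior smoothing, and the value of `mu` are assumptions for illustration. Each document is scored by the log probability that its smoothed unigram model generates the query; no relevance judgments are needed.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, mu=2000.0):
    """Log-likelihood of the query under a smoothed unigram document model.

    query_terms, doc_terms: lists of tokens
    collection_counts: dict term -> count over the whole collection
    collection_len: total number of tokens in the collection
    mu: Dirichlet prior strength (illustrative default, an assumption)
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_collection = collection_counts.get(term, 0) / collection_len
        if p_collection == 0.0:
            # Term never seen in the collection; skipped in this sketch.
            continue
        # Dirichlet smoothing: interpolate document and collection statistics.
        p_term = (doc_counts[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score


if __name__ == "__main__":
    docs = [["text", "retrieval", "models"], ["database", "systems"]]
    collection = Counter(t for d in docs for t in d)
    total = sum(collection.values())
    for d in docs:
        print(d, query_log_likelihood(["retrieval"], d, collection, total))
```

Smoothing with the collection model keeps a document from receiving zero probability just because it lacks a query term, so the document containing "retrieval" scores higher but the other document is still ranked.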
Topics in SIGIR 2011/2012 CFP
• Document Representation and Content Analysis (e.g., text representation,
document structure, linguistic analysis, non-English IR, cross-lingual IR, information
extraction, sentiment analysis, clustering, classification, topic models, facets)
• Queries and Query Analysis (e.g., query representation, query intent, query log analysis,
question answering, query suggestion, query reformulation)
• Users and Interactive IR (e.g., user models, user studies, user feedback, search interface,
summarization, task models, personalized search)
• Retrieval Models and Ranking (e.g., IR theory, language models, probabilistic retrieval
models, feature-based models, learning to rank, combining searches, diversity)
• Search Engine Architectures and Scalability (e.g., indexing, compression,
MapReduce, distributed IR, P2P IR, mobile devices)
• Filtering and Recommending (e.g., content-based filtering, collaborative filtering,
recommender systems, profiles)
• Evaluation (e.g., test collections, effectiveness measures, experimental design)
• Web IR and Social Media Search (e.g., link analysis, query logs, social tagging, social
network analysis, advertising and search, blog search, forum search, CQA, adversarial IR,
vertical and local search)
• IR and Structured Data (e.g., XML search, ranking in databases, desktop search, entity
search)
• Multimedia IR (e.g., image search, video search, speech/audio search, music IR)
• Other Applications (e.g., digital libraries, enterprise search, genomics IR, legal IR, patent
search, text reuse)
My View of the Future of IR
[Figure: directions for moving beyond the Current Search Engine]
• Full-Fledged Text Info. Management: Search, Access, Mining, Task Support
• Personalization (User Modeling): Keyword Queries, Search History, Complete User Model
• Large-Scale Semantic Analysis: Bag of words, Entities-Relations, Knowledge Representation
What You Should Know
• IR is a highly interdisciplinary area interacting
with many other areas, especially NLP, ML, DB,
HCI, software systems, and Information Science
• Major publication venues, especially ACM
SIGIR, ACM CIKM, ACM TOIS, IRJ, IPM, WSDM