Beyond Content-Based Retrieval:
Modeling Domains, Users and Interaction
Susan Dumais
Microsoft Research
http://research.microsoft.com/~sdumais
IEEE ADL’99 - May 21, 1999
Research in IR at MS
Microsoft Research (http://research.microsoft.com)
Adaptive Systems and Interaction - IR/UI
Machine Learning and Applied Statistics
Data Mining
Natural Language Processing
Collaboration and Education
MSR Cambridge; MSR Beijing
Microsoft Product Groups … many IR-related
IR Themes & Directions
Improvements in representation and content matching
Probabilistic/Bayesian models - p(Relevant|Document), p(Concept|Words)
NLP: Truffle, MindNet
Beyond content-matching
User/Task modeling
Domain/Object modeling
Advances in presentation and manipulation
WWW8 Panel: Finding Anything in the Billion-Page Web: Are Algorithms the Answer?
Moderator: Prabhakar Raghavan, IBM Almaden
Traditional View of IR
[Diagram: Query Words → Ranked List]
What’s Missing?
[Diagram: Query Words → Ranked List, augmented by User Modeling, Domain Modeling, and Information Use]
Domain/Obj Modeling
Not all objects are equal … potentially big win
A priori importance
Information use (“readware”; collab filtering)
Inter-object relationships
Link structure / hypertext
Subject categories - e.g., text categorization, text clustering
Metadata
E.g., reliability, recency, cost -> combining
User/Task Modeling
Demographics
Task -- What’s the user’s goal?
e.g., Lumiere
Short and long-term content interests
e.g., Implicit queries
Interest model = f(content_similarity, time, interest) -- see the sketch below
e.g., Letizia, WebWatcher, Fab
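As a rough illustration of the interest model above, the sketch below blends content similarity to a user profile with a recency decay and an explicit-interest term. The weights, the exponential half-life, and the function names are assumptions for illustration, not the models used in Letizia, WebWatcher, or Fab.

```python
import math
import time

def interest_score(item_vec, profile_vec, last_interaction_ts,
                    explicit_interest=0.0, half_life_days=14.0):
    """Toy interest model: f(content_similarity, time, interest).

    item_vec and profile_vec are dicts of term -> weight. All weights
    and the exponential decay form are illustrative assumptions.
    """
    # Cosine similarity between the item and the user's interest profile.
    dot = sum(w * profile_vec.get(t, 0.0) for t, w in item_vec.items())
    norm = (math.sqrt(sum(w * w for w in item_vec.values())) *
            math.sqrt(sum(w * w for w in profile_vec.values())) or 1.0)
    content_similarity = dot / norm

    # Exponential decay: older interests count for less.
    age_days = (time.time() - last_interaction_ts) / 86400.0
    recency = 0.5 ** (age_days / half_life_days)

    return 0.6 * content_similarity + 0.3 * recency + 0.1 * explicit_interest
```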
Information Use
Beyond batch IR model (“query->results”)
Consider larger task context
Knowledge management
Human attention is critical resource
Techniques for automatic information summarization, organization, discovery, filtering, mining, etc.
Advanced UIs and interaction techniques
E.g., tight coupling of search and browsing to support information management
The Broader View of IR
[Diagram: Query Words → Ranked List, augmented by User Modeling, Domain Modeling, and Information Use]
Beyond Content Matching
Domain/Object modeling
A priori importance
Text classification and clustering
User/Task modeling
Implicit queries and Lumiere
Advances in presentation and manipulation
Combining structure and search (e.g., DM)
Estimating Priors
Access Count
e.g., DirectHit
Collaborative Filtering
Link Structure (citation analysis)
Clever (Hubs/Authorities) - Kleinberg et al.
Google - Brin & Page
Web Archeologist - Bharat, Henzinger
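The link-structure priors behind systems like Google can be conveyed with a small power-iteration sketch. This toy PageRank-style computation is only meant to illustrate a priori page importance, not any production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration. links maps page -> list of outlinks."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            targets = outs if outs else pages      # dangling pages spread their mass evenly
            share = damping * rank[page] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# A page that many others point to ends up with a higher prior importance.
print(pagerank({"a": ["c"], "b": ["c"], "c": ["a"]}))
```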
New Relevance Ranking
Relevance ranking can include:
content-matching … of course
page/site popularity
<external link count; proxy stats>
page quality <site quality, dates, depth>
spam or porn or other downweighting
etc.
Combining these: the relative weighting of these factors is tricky and subjective (see the sketch below)
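A minimal sketch of one such combination, assuming a simple weighted linear blend; the feature names and weights are illustrative placeholders, not a tuned ranking function.

```python
def rank_score(doc, content_score, weights=None):
    """Blend content matching with document priors (all weights are guesses)."""
    w = weights or {"content": 0.6, "popularity": 0.2, "quality": 0.15, "spam": 0.05}
    return (w["content"] * content_score
            + w["popularity"] * doc.get("popularity", 0.0)     # e.g., link count / proxy stats, scaled 0-1
            + w["quality"] * doc.get("quality", 0.0)           # e.g., site quality, recency, depth
            - w["spam"] * doc.get("spam_likelihood", 0.0))     # downweight likely spam/porn
```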
Text Classification
Text Classification: assign objects to one or
more of a predefined set of categories using
text features
E.g., News feeds, Web data, OHSUMED, Email - spam/no-spam
Approaches:
Human classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol)
Hand-crafted knowledge engineered systems (e.g., CONSTRUE)
Inductive learning methods
(Semi-) automatic classification
Inductive Learning Methods
Supervised learning from examples
Examples are easy for domain experts to provide
Models easy to learn, update, and customize
Example learning algorithms
Relevance Feedback, Decision Trees, Naïve Bayes,
Bayes Nets, Support Vector Machines (SVMs)
Text representation
Large vector of features (words, phrases, hand-crafted)
Support Vector Machine
Optimization Problem
Find hyperplane, h, separating positive and negative examples
Optimization for maximum margin: minimize ||w||² subject to w·xᵢ + b ≥ +1 for positive examples and w·xᵢ + b ≤ −1 for negative examples
Classify new items using: f(x) = sign(w·x + b), where w is a weighted combination of the support vectors
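In practice, a linear SVM text classifier of this kind can be trained from labeled examples in a few lines. The sketch below uses scikit-learn as a stand-in for the tools actually used in this work; the tiny training set is invented for illustration.

```python
# Sketch: learn a linear SVM text classifier from labeled example documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = ["bank raises discount rate", "wheat and corn exports fall"]
train_labels = ["interest", "grain"]          # one category label per article

classifier = make_pipeline(
    TfidfVectorizer(),                        # large vector of word features
    LinearSVC(),                              # finds the maximum-margin hyperplane (w, b)
)
classifier.fit(train_docs, train_labels)
print(classifier.predict(["central bank leaves discount rate unchanged"]))
```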
Reuters Data Set
(21578 - ModApte split)
9603 training articles; 3299 test articles
Example “interest” article
2-APR-1987 06:35:19.50
west-germany
b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
FRANKFURT, March 2
The Bundesbank left credit policies unchanged after today's regular
meeting of its council, a spokesman said in answer to enquiries. The
West German discount rate remains at 3.0 pct, and the Lombard
emergency financing rate at 5.0 pct.
REUTER
Average article 200 words long
Example: Reuters news
118 categories (article can be in more than one category)
Most common categories (#train, #test)
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Overall Results
Linear SVM most accurate: 87% precision at
87% recall
Reuters ROC - Category Grain
[Figure: ROC-style curve (precision vs. recall) for the Grain category, comparing LSVM, Decision Tree, Naïve Bayes, and Find Similar]
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
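Both measures can be computed directly from the predicted and true category assignments. A minimal helper, assuming documents are identified by ids held in sets:

```python
def precision_recall(predicted, actual):
    """Precision and recall for one category, given sets of document ids."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# e.g., precision_recall({1, 2, 3, 4}, {2, 3, 5}) -> (0.5, 0.667)
```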
Text Categ Summary
Accurate classifiers can be learned automatically from training examples
Linear SVMs are efficient and provide very good classification accuracy
Widely applicable, flexible, and adaptable representations
Email spam/no-spam, Web, Medical abstracts, TREC
Text Clustering
Discovering structure
Vector-based document representation
EM algorithm to identify clusters
Interactive user interface
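A rough sketch of EM-based clustering over vector-space document representations, using scikit-learn's Gaussian mixture (with a dimensionality-reduction step) as a stand-in for the actual clustering system; the documents and parameters are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture

docs = ["credit policy and interest rates", "wheat corn and grain exports",
        "discount rate left unchanged", "corn harvest forecast rises"]

vectors = TfidfVectorizer().fit_transform(docs)                  # vector-based representation
reduced = TruncatedSVD(n_components=2).fit_transform(vectors)    # densify for the mixture model
gm = GaussianMixture(n_components=2, random_state=0).fit(reduced)  # EM fits the clusters

print(gm.predict(reduced))        # hard cluster assignments
print(gm.predict_proba(reduced))  # soft (probabilistic) memberships from EM
```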
Text Clustering
Beyond Content Matching
Domain/Object modeling
Text classification and clustering
User/Task modeling
Implicit queries and Lumiere
Advances in presentation and manipulation
Combining structure and search (e.g., DM)
Lumiere
Inferring beliefs about user’s goals and ideal
actions under uncertainty
Pr(Goals, Needs)
• User query
• User activity
• User profile
• Data structures
E. Horvitz, et al.
Bayesian Models for
Information Access
[Diagram: Bayesian network in which previously demonstrated interests, context, acute goals/tasks, and current general interests determine the user's acute interests/needs, with evidence from the user's actions, the explicit user query, and applications, documents, and data structures]
Lumiere Office Assistant
Implicit Queries (IQ)
Explicit queries:
Search is a separate, discrete task
User types query, Gets results, Tries again …
Implicit queries:
Search as part of normal information flow
Ongoing query formulation based on user activities
Non-intrusive results display
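One way such an implicit query could be formed is to take the highest-weight terms from the text the user is currently reading and run them as a background query. The function below is an illustrative sketch (the top-k heuristic and all names are assumptions), not the IQ implementation itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def implicit_query(current_text, corpus, top_k=5):
    """Build a query from the user's current reading context (no typing needed).

    corpus is a list of strings (e.g., the user's stored pages).
    """
    vec = TfidfVectorizer(stop_words="english").fit(corpus + [current_text])
    weights = vec.transform([current_text]).toarray()[0]
    terms = vec.get_feature_names_out()
    top = sorted(zip(weights, terms), reverse=True)[:top_k]
    return [t for w, t in top if w > 0]

# The resulting terms would be matched against stored pages and the best
# matches shown non-intrusively alongside the user's normal work.
```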
Figure 2: Data Mountain with 100 web pages.
Data Mountain with Implicit Query results shown
(highlighted pages to left of selected page).
IQ Study: Experimental Details
Store 100 Web pages
50 popular Web pages; 50 random pages
With or without Implicit Query
IQ1: Co-occurrence based IQ
IQ2: Content-based IQ
Retrieve 100 Web pages
Title given as retrieval cue -- e.g., “CNN Home Page”
No implicit query highlighting at retrieval
[Figure: Data Mountain with 100 web pages. Find: "CNN Home Page"]
Results: Information Storage
Filing strategies
IQ Condition            Semantic   Alphabetic   No Org
IQ0: No IQ                 11          1           1
IQ1: Co-occur based         3         10           0
IQ2: Content-based          8          1           0
Results: Information Storage
Number of categories (for semantic organizers)
IQ Condition            Average Number of Categories (std error)
IQ0: No IQ              12.0 (3.6)
IQ1: Co-occur based     15.8 (5.8)
IQ2: Content-based      13.6 (5.9)
[Screenshots: example organizations with 7 and 23 categories]
Results: Retrieval Time
[Figure: Average web page retrieval time (seconds) by Implicit Query condition: IQ0, IQ1, IQ2]
Implicit Query Highlights
IQ built by tracking user’s reading behavior
No explicit search required
Good matches returned
IQ user model:
Combines present context + previous interests
New interfaces for tightly coupling search results with structure -- user study
Summary
Improving content-matching
And, beyond ...
Domain/Object Models
User/Task Models
Information Presentation and Use
Also …
non-text, multi-lingual, distributed
http://research.microsoft.com/~sdumais
The View from Wired, May 1996
“information retrieval seemed like the
easiest place to make progress …
information retrieval is really only a
problem for people in library science -- if
some computer scientists were to put
their heads together, they’d probably have
it solved before lunchtime” [GS]
DLI-1
UC Berkeley - Environmental planning and GIS
UC Santa Barbara - The Alexandria Project: spatially referenced map information
CMU - Informedia digital video library
UIllinois - Federating repositories of scientific literature
UMichigan - Intelligent agents for information location
Stanford - Interoperation mechanisms among
heterogeneous services
DLI-2
OHSU/OGI: Tracking footprints through an information space:
Leveraging the document selections of expert problem solvers
Stanford: Image filtering for secure distribution of medical information
UWashington: Automatic reference librarians for the WWW
UKentucky: The digital atheneum: New techniques for restoring, searching and editing humanities collections
USCarolina: A software and data library for experiments, simulation
and archiving
Johns Hopkins: Digital workflow management: The Lester S. Levy
collection of Music
UArizona: High-performance digital library classification systems: From
information retrieval to knowledge management
Popularity rankings - e.g., DirectHit
Link-based rankings - e.g., Google, Clever
Question answering - e.g., AskJeeves
Paid advertising - e.g., GoTo
Steve Kirsch’s talk at SIGIR
Example Web Searches
Excite log: user = A1D6F19DB06BD694, date = 970916
150052  lion
152004  lions
152036  lions lion
152219  lion facts
153747  roaring
153848  lions roaring
160232  africa lion
160642  lions, tigers, leopards and cheetahs
161042  lions, tigers, leopards and cheetahs cats
161144  wild cats of africa
161414  africa cat
161602  africa lions
161308  africa wild cats
161823  mane
161840  lion
161858  lion lions
163041  lion facts
163919  picher of lions
164040  lion picher
165002  lion pictures
165100  pictures of lions
165211  pictures of big cats
165311  lion photos
170013  video in lion
172131  pictureof a lioness
172207  picture of a lioness
172241  lion pictures
172334  lion pictures cat
172443  lions
172450  lions
Earlier IR Accomplishments in DTAS
Bayesian IR
Answer Wizard: words --> p(Concept_i | Words)
User Modeling
Lumiere, Office Assistant: user words, user
profile, user behavior (events), active data
structures --> p(User needs, Goals | Evidence)
Collaborative Filtering
Current DTAS IR Projects
Domain/Object Modeling
Text Classification (Sue, David, Eric, John, Mehran)
Combining content and collaborative matches
(Carl)
User/Task Modeling
Implicit query (Sue, Eric, Krishna)
Prefetching information (Eric)
Cost-benefit analysis of a search result (Jack)
User Modeling for IQ/IR
IQ: Model of user interests based on actions
Explicit search activity (query or profile)
Patterns of scroll / dwell on text
Copying and pasting actions
Interaction with multiple applications
[Diagram: "Implicit Query (IQ)" -- explicit queries or profile, scroll/dwell on text, copy and paste, and activity in other applications feed the user's short- and long-term interests/needs]
Improvements:
Using Probabilistic Model
MSR-Cambridge (Steve Robertson)
Probabilistic Retrieval (e.g., Okapi)
Theory-driven derivation of matching function
Estimate: PQ(ri=Rel or NotRel | d=document)
Using Bayes Rule and assuming conditional
independence given Rel/NotRel
P_Q(r_i | d) = P(r_i) · P(d | r_i) / P(d)
P_Q(r_i | d) = P(r_i) · ∏_{j=1..t} P(x_j | r_i) / P(d)
Improvements:
Using Probabilistic Model
Good performance for uniform length
document surrogates (e.g., abstracts)
Enhanced to take into account term frequency and document length
"BM25" is one of the best ranking functions at TREC (sketch below)
Easy to incorporate relevance feedback
Now looking at adaptive filtering/routing
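The BM25 ranking function mentioned above has a compact standard form. The sketch below is the textbook formulation with typical parameter values (k1 = 1.2, b = 0.75), not the exact Okapi implementation.

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document (list of terms) for a query.

    corpus is a list of documents, each a list of terms, used for idf
    and average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)       # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)                     # term frequency in this document
        score += idf * tf * (k1 + 1) / (
            tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```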
Improvements: Using NLP
Current search techniques use word forms
Improvements in content-matching will
come from:
-> Identifying relations between words
-> Identifying word meanings
Advanced NLP can provide these
http://research.microsoft.com/nlp
NLP System Architecture
[Diagram: NL text flows through word breaking, morphology (dictionary), logical form, and discourse to a meaning representation (MindNet); this technology supports projects including intelligent summarizing, search and retrieval, grammar & style checking, document understanding, generation, machine translation, portrait, sketch, indexing, and smart selection]
“Truffle”: Word Relations
Result: 2-3 times as many relevant documents in the top 10 with Microsoft NLP
[Figure: % relevant in top ten docs -- Engine X: 21.5%, Engine X+: 33.1%, NLP: 63.7%]
“MindNet”: Word Meanings
A huge knowledge base
Automatically created from dictionaries
Words (nodes) linked by relationships
7 million links and growing
MindNet
P1 Panel (WWW8) - Finding Anything in the
Billion-Page Web:
Are Algorithms the Answer?
Moderator: Prabhakar Raghavan, IBM Almaden Research Center, USA
Panelists:
- Andrei Z. Broder, Compaq SRC
- Monika R. Henzinger, Compaq SRC
- Udi Manber, Yahoo! Inc.
- Brian Pinkerton, Excite Inc.
Web queries
average length 2.2 words
10% use syntax (often incorrectly)
1% use advanced search
noun phrase only
Precision more important than recall?
Computer Search Today: large central
indices
Human Search Today:
ask friends - 6 degrees of separation
human, decentralized
History-Enriched Digital Objects
Hill, W., Hollan, J., Wroblewski, D., and McCandless, T. Edit Wear and Read Wear: Their Theory and Generalization. ACM CHI'92 Human Factors in Computing Systems Proceedings, ACM Press (1992), 3-9.