
Beyond Content-Based Retrieval:
Modeling Domains, Users and Interaction
Susan Dumais
Microsoft Research
http://research.microsoft.com/~sdumais
IEEE ADL’99 - May 21, 1999
Research in IR at MS
Microsoft Research (http://research.microsoft.com)
Adaptive Systems and Interaction - IR/UI
Machine Learning and Applied Statistics
Data Mining
Natural Language Processing
Collaboration and Education
MSR Cambridge; MSR Beijing
Microsoft Product Groups … many IR-related
IR Themes & Directions
Improvements in representation and content-matching
Probabilistic/Bayesian models - p(Relevant|Document), p(Concept|Words)
NLP: Truffle, MindNet
Beyond content-matching
User/Task modeling
Domain/Object modeling
Advances in presentation and manipulation
WWW8 Panel: Finding Anything in the Billion-Page Web: Are Algorithms the Answer? (Moderator: Prabhakar Raghavan, IBM Almaden)
Traditional View of IR
[Diagram: Query Words -> Ranked List]
What’s Missing?
[Diagram: Query Words -> Ranked List, with User Modeling, Domain Modeling, and Information Use missing from the loop]
Domain/Obj Modeling
Not all objects are equal … potentially big win
A priori importance
Information use (“readware”; collab filtering)
Inter-object relationships
Link structure / hypertext
Subject categories - e.g., text categorization, text clustering
Metadata
E.g., reliability, recency, cost -> combining
User/Task Modeling
Demographics
Task -- What’s the user’s goal?
e.g., Lumiere
Short and long-term content interests
e.g., Implicit queries
Interest model = f(content_similarity, time, interest)
e.g., Letizia, WebWatcher, Fab
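The interest model on the slide, Interest = f(content_similarity, time, interest), might be realized as follows; the exponential recency decay, the half-life, and the equal blend weights are illustrative assumptions, not the model from the talk:

```python
import math

def interest_score(content_similarity, seconds_since_view, prior_interest,
                   half_life=3600.0):
    """Hypothetical interest model: content similarity weighted by an
    exponential recency decay, blended with the user's prior interest.
    The decay form and the 50/50 blend are invented for illustration."""
    recency = math.exp(-math.log(2) * seconds_since_view / half_life)
    return 0.5 * content_similarity * recency + 0.5 * prior_interest
```

With a one-hour half-life, a page viewed an hour ago contributes half the similarity weight of one viewed just now.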
Information Use
Beyond batch IR model (“query->results”)
Consider larger task context
Knowledge management
Human attention is critical resource
Techniques for automatic information summarization, organization, discovery, filtering, mining, etc.
Advanced UIs and interaction techniques
E.g., tight coupling of search and browsing to support information management
The Broader View of IR
[Diagram: Query Words -> Ranked List, now augmented with User Modeling, Domain Modeling, and Information Use]
Beyond Content Matching
Domain/Object modeling
A priori importance
Text classification and clustering
User/Task modeling
Implicit queries and Lumiere
Advances in presentation and manipulation
Combining structure and search (e.g., DM)
Estimating Priors
Access Count
e.g., DirectHit
Collaborative Filtering
Link Structure (citation analysis)
Clever (Hubs/Authorities) - Kleinberg et al.
Google - Brin & Page
Web Archeologist - Bharat, Henzinger
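As a concrete illustration of link-structure priors in the spirit of Google and Clever, here is a toy PageRank power iteration; the graph, damping factor, and iteration count are arbitrary choices, not anything from the talk:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [outgoing links]}.
    A toy sketch of link-based a priori importance."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Page "c", which is linked to by both "a" and "b", ends up with more rank than "b", which has a single inlink.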
New Relevance Ranking
Relevance ranking can include:
content-matching … of course
page/site popularity
<external link count; proxy stats>
page quality <site quality, dates, depth>
spam or porn or other downweighting
etc.
Combining these - the relative weighting of the factors is tricky and subjective
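A minimal sketch of such a combination, with placeholder weights that make the subjectivity explicit; the signal names and weights are invented for illustration:

```python
def combined_score(content, popularity, quality, spam_penalty,
                   weights=(0.6, 0.2, 0.2)):
    """Illustrative linear blend of ranking signals, each assumed in [0, 1].
    The weights are arbitrary placeholders -- exactly the 'tricky and
    subjective' part the slide warns about."""
    w_content, w_pop, w_qual = weights
    score = w_content * content + w_pop * popularity + w_qual * quality
    return score * (1.0 - spam_penalty)
```

A page flagged as spam (spam_penalty = 1.0) is zeroed out regardless of how well its content matches.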
Text Classification
Text Classification: assign objects to one or
more of a predefined set of categories using
text features
E.g., News feeds, Web data, OHSUMED, Email - spam/no-spam
Approaches:
Human classification (e.g., LCSH, MeSH, Yahoo!, CyberPatrol)
Hand-crafted knowledge engineered systems (e.g., CONSTRUE)
Inductive learning methods
(Semi-) automatic classification
Inductive Learning Methods
Supervised learning from examples
Examples are easy for domain experts to provide
Models easy to learn, update, and customize
Example learning algorithms
Relevance Feedback, Decision Trees, Naïve Bayes,
Bayes Nets, Support Vector Machines (SVMs)
Text representation
Large vector of features (words, phrases, hand-crafted)
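One of the learners named above, multinomial Naïve Bayes with add-one smoothing, can be sketched in a few lines; the tiny spam/ham training set is invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """Train multinomial Naive Bayes from (word list, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for words, label in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, words):
    """Pick argmax_label log P(label) + sum log P(word | label),
    with add-one smoothing over the vocabulary."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [(["free", "money", "offer"], "spam"),
         (["meeting", "agenda", "notes"], "ham"),
         (["free", "offer", "now"], "spam"),
         (["project", "notes", "draft"], "ham")]
model = train_naive_bayes(train)
```

This mirrors the slide's point that examples are cheap to provide and models cheap to update: retraining is just recounting.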
Support Vector Machine
Optimization problem: find a hyperplane (w, b) separating positive and negative examples with maximum margin:
min ||w||^2, subject to w·x_i + b ≥ +1 for positive examples and w·x_i + b ≤ −1 for negative examples
Classify new items using: f(x) = sign(w·x + b)
[Diagram: separating hyperplane w with support vectors lying on the margin]
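A toy version of the margin optimization, using sub-gradient descent on the primal hinge-loss objective rather than the dual quadratic program that real SVM solvers use; data, learning rate, and regularization are illustrative:

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss.
    A sketch of the max-margin idea, not a production solver."""
    d = len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1 - lr * lam) * wj for wj in w]  # L2 shrinkage
            if margin < 1:  # inside the margin: hinge subgradient step
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
```

On this linearly separable toy set the learned hyperplane classifies all four training points correctly.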
Reuters Data Set
(21578 - ModApte split)
9603 training articles; 3299 test articles
Example “interest” article
2-APR-1987 06:35:19.50
west-germany
b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
FRANKFURT, March 2
The Bundesbank left credit policies unchanged after today's regular
meeting of its council, a spokesman said in answer to enquiries. The
West German discount rate remains at 3.0 pct, and the Lombard
emergency financing rate at 5.0 pct.
REUTER
Average article 200 words long
Example: Reuters news
118 categories (article can be in more than one category)
Most common categories (#train, #test)
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Overall Results
Linear SVM most accurate: 87% precision at
87% recall
Reuters ROC - Category Grain
[Figure: precision-recall curves for category Grain comparing LSVM, Decision Tree, Naïve Bayes, and Find Similar]
Recall: % labeled in category among those stories that are really in category
Precision: % really in category among those stories labeled in category
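The two definitions above translate directly into code; the document-id sets are invented for illustration:

```python
def precision_recall(predicted, actual):
    """Precision and recall for one category, per the slide's definitions.
    predicted: set of doc ids labeled in the category by the classifier;
    actual: set of doc ids really in the category."""
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

p, r = precision_recall({1, 2, 3, 4}, {2, 3, 5})
```

Here two of the four predicted documents are correct (precision 0.5) and two of the three relevant documents were found (recall 2/3).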
Text Categ Summary
 Accurate classifiers can be learned automatically from training examples
 Linear SVMs are efficient and provide very good classification accuracy
 Widely applicable, flexible, and adaptable representations
Email spam/no-spam, Web, Medical abstracts, TREC
Text Clustering
Discovering structure
Vector-based document representation
EM algorithm to identify clusters
Interactive user interface
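A simplified sketch of the EM clustering step, using a mixture of unit-variance spherical Gaussians rather than the multinomial text models a real document-clustering system would use; the initialization scheme and data points are illustrative:

```python
import math

def em_cluster(points, k, iterations=30):
    """EM for a mixture of unit-variance spherical Gaussians, equal priors.
    A stand-in for the vector-space document clustering on the slide."""
    # Deterministic illustrative init: evenly spaced points as seeds.
    means = [list(points[i * len(points) // k]) for i in range(k)]
    for _ in range(iterations):
        # E-step: soft responsibility of each component for each point.
        resp = []
        for p in points:
            weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(p, m)) / 2)
                       for m in means]
            total = sum(weights) or 1.0
            resp.append([w / total for w in weights])
        # M-step: re-estimate each mean from its responsibility-weighted points.
        for j in range(k):
            norm = sum(r[j] for r in resp) or 1.0
            means[j] = [sum(r[j] * p[i] for r, p in zip(resp, points)) / norm
                        for i in range(len(points[0]))]
    return means

pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centers = em_cluster(pts, 2)
```

On these two well-separated toy clusters the means converge to roughly (0.1, 0.05) and (5.05, 4.95).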
Text Clustering
Beyond Content Matching
Domain/Object modeling
Text classification and clustering
User/Task modeling
Implicit queries and Lumiere
Advances in presentation and manipulation
Combining structure and search (e.g., DM)
Lumiere
Inferring beliefs about user’s goals and ideal
actions under uncertainty
Pr(Goals, Needs)
• User query
• User activity
• User profile
• Data structures
E. Horvitz, et al.
Bayesian Models for Information Access
[Diagram: Bayes net in which Previously Demonstrated Interests, Context, Acute Goals/Tasks, and Current General Interests influence the User’s Acute Interests/Needs; observed evidence includes Applications/Docs/Data Structures, the User’s Actions, and an Explicit User Query]
Lumiere Office Assistant
Implicit Queries (IQ)
Explicit queries:
Search is a separate, discrete task
User types query, Gets results, Tries again …
Implicit queries:
Search as part of normal information flow
Ongoing query formulation based on user activities
Non-intrusive results display
Figure 2: Data Mountain with 100 web pages.
Data Mountain with Implicit Query results shown
(highlighted pages to left of selected page).
IQ Study: Experimental Details
Store 100 Web pages
50 popular Web pages; 50 random pages
With or without Implicit Query
IQ1: Co-occurrence based IQ
IQ2: Content-based IQ
Retrieve 100 Web pages
Title given as retrieval cue -- e.g., “CNN Home Page”
No implicit query highlighting at retrieval
Figure 2: Data Mountain with 100 web pages. Find: “CNN Home Page”
Results: Information Storage
Filing strategies

IQ Condition         | Semantic | Alphabetic | No Org
IQ0: No IQ           | 11       | 3          | 1
IQ1: Co-occur based  | 8        | 1          | 0
IQ2: Content-based   | 10       | 1          | 0
Results: Information Storage
Number of categories (for semantic organizers)

IQ Condition         | Average Number of Categories (std error)
IQ0: No IQ           | 12.0 (3.6)
IQ1: Co-occur based  | 15.8 (5.8)
IQ2: Content-based   | 13.6 (5.9)

[Screenshots: example layouts with 7 and 23 categories]
Results: Retrieval Time
[Figure: average web page retrieval time in seconds by Implicit Query condition (IQ0, IQ1, IQ2); y-axis 0-14 s]
Implicit Query Highlights
IQ built by tracking user’s reading behavior
No explicit search required
Good matches returned
IQ user model:
Combines present context + previous
interests
New interfaces for tightly coupling search
results with structure -- user study
Summary
Improving content-matching
And, beyond ...
Domain/Object Models
User/Task Models
Information Presentation and Use
Also …
 non-text, multi-lingual, distributed
http://research.microsoft.com/~sdumais
The View from Wired, May 1996
“information retrieval seemed like the
easiest place to make progress …
information retrieval is really only a
problem for people in library science -- if
some computer scientists were to put
their heads together, they’d probably have
it solved before lunchtime” [GS]
DLI-1
UC Berkeley - Environmental planning and GIS
UC Santa Barbara - The Alexandria project: spatially referenced map information
CMU - Informedia digital video library
UIllinois - Federating repositories of scientific literature
UMichigan - Intelligent agents for information location
Stanford - Interoperation mechanisms among heterogeneous services
DLI-2
 OHSU/OGI: Tracking footprints through an information space: Leveraging the document selections of expert problem solvers
 Stanford: Image filtering for secure distribution of medical information
 UWashington: Automatic reference librarians for the WWW
 UKentucky: The digital atheneum: New techniques for restoring, searching and editing humanities collections
 USCarolina: A software and data library for experiments, simulation and archiving
 Johns Hopkins: Digital workflow management: The Lester S. Levy collection of music
 UArizona: High-performance digital library classification systems: From information retrieval to knowledge management
Popularity rankings - e.g., DirectHit
Link-based rankings - e.g., Google, Clever
Question answering - e.g., AskJeeves
Paid advertising - e.g., GoTo
Steve Kirsch’s talk at SIGIR
Example Web Searches
user = A1D6F19DB06BD694, date = 970916 (Excite log)
150052 lion
152004 lions
152036 lions lion
152219 lion facts
153747 roaring
153848 lions roaring
160232 africa lion
160642 lions, tigers, leopards and cheetahs
161042 lions, tigers, leopards and cheetahs cats
161144 wild cats of africa
161414 africa cat
161602 africa lions
161308 africa wild cats
161823 mane
161840 lion
161858 lion lions
163041 lion facts
163919 picher of lions
164040 lion picher
165002 lion pictures
165100 pictures of lions
165211 pictures of big cats
165311 lion photos
170013 video in lion
172131 pictureof a lioness
172207 picture of a lioness
172241 lion pictures
172334 lion pictures cat
172443 lions
172450 lions
Earlier IR
Accomplishments in DTAS
Bayesian IR
Answer Wizard: words --> p(Concept_i | Words)
User Modeling
Lumiere, Office Assistant: user words, user
profile, user behavior (events), active data
structures --> p(User needs, Goals | Evidence)
Collaborative Filtering
Current DTAS IR Projects
Domain/Object Modeling
Text Classification (Sue, David, Eric, John, Mehran)
Combining content and collaborative matches
(Carl)
User/Task Modeling
Implicit query (Sue, Eric, Krishna)
Prefetching information (Eric)
Cost-benefit analysis of a search result (Jack)
User Modeling for IQ/IR
IQ: Model of user interests based on actions
Explicit search activity (query or profile)
Patterns of scroll / dwell on text
Copying and pasting actions
Interaction with multiple applications
[Diagram: “Implicit Query (IQ)” - explicit queries or profile, scroll/dwell on text, copy and paste, and other applications feed the user’s short- and long-term interests/needs]
Improvements:
Using Probabilistic Model
MSR-Cambridge (Steve Robertson)
Probabilistic Retrieval (e.g., Okapi)
Theory-driven derivation of matching function
Estimate: P_Q(r = Rel or NotRel | d = document)
Using Bayes Rule and assuming conditional independence of terms given Rel/NotRel:
P_Q(r | d) = P(r) P(d | r) / P(d)
P_Q(r | d) = P(r) ∏_{i=1..t} P(x_i | r) / P(d)
Improvements:
Using Probabilistic Model
Good performance for uniform-length document surrogates (e.g., abstracts)
Enhanced to take into account term frequency and document length
“BM25” one of the best ranking functions at TREC
Easy to incorporate relevance feedback
Now looking at adaptive filtering/routing
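One common formulation of the BM25 ranking function (variants differ in details such as the exact IDF form); the toy corpus is invented:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a query.
    k1 controls term-frequency saturation; b controls length
    normalization against the average document length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["credit", "policy", "unchanged"],
        ["discount", "rate", "remains", "unchanged"],
        ["lion", "pictures"]]
```

A document containing the query term scores above zero; one without it contributes nothing for that term.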
Improvements: Using NLP
Current search techniques use word forms
Improvements in content-matching will
come from:
-> Identifying relations between words
-> Identifying word meanings
Advanced NLP can provide these
http://research.microsoft.com/nlp
NLP System Architecture
[Diagram: NL text flows through word breaking, morphology, and dictionary lookup to grammar & style checking, logical form, discourse, and a meaning representation (MindNet); projects built on the pipeline include intelligent summarizing, search and retrieval, document understanding, generation, machine translation, portrait, sketch, indexing, and smart selection]
“Truffle”: Word Relations
[Bar chart: % relevant hits in top ten docs - Engine X: 21.5%, X+: 33.1%, NLP: 63.7%]
Result: 2-3 times as many relevant documents in the top 10 with Microsoft NLP
“MindNet”: Word Meanings
A huge knowledge base
Automatically created from dictionaries
Words (nodes) linked by relationships
7 million links and growing
[Screenshot: MindNet]
 P1 Panel (WWW8) - Finding Anything in the Billion-Page Web: Are Algorithms the Answer?
 Moderator: Prabhakar Raghavan, IBM Almaden Research Center, USA
 Panelists:
- Andrei Z. Broder, Compaq SRC
- Monika R. Henzinger, Compaq SRC
- Udi Manber, Yahoo! Inc.
- Brian Pinkerton, Excite Inc.
Web queries
average length 2.2 words
10% use syntax (often incorrectly)
1% use advanced search
noun phrase only
Precision more important than recall?
Computer Search Today: large central indices
Human Search Today: ask friends - 6dos (six degrees of separation); human, decentralized
History-Enriched Digital Objs
Hill, W., Hollan, J., Wroblewski, D., & McCandless, T. Edit Wear and Read Wear: Their Theory and Generalization. ACM CHI’92 Human Factors in Computing Systems Proceedings, ACM Press (1992), 3-9.