
Machine learning for the Web:
Applications and challenges
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
IIT Bombay
www.cse.iitb.ernet.in/~soumen

Traditional supervised learning
• Training instance: $\langle x_1, x_2, \ldots, x_n; y \rangle$
• Independent variables $x_i$: mostly continuous, maybe categorical
• Predicted variable $y$: discrete (classification) or continuous (regression)
• The learner builds statistical models, inference rules, or separators
• Test instance $\langle x_1, x_2, \ldots, x_n \rangle$: the learner outputs a prediction $y$

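To make the train/test split concrete, here is a minimal sketch of a supervised learner, a 1-nearest-neighbor classifier in plain Python. The toy data and the choice of learner are illustrative, not from the talk.

```python
import math

def train(instances):
    """1-NN 'training' is just memorizing the labeled instances (x_vector, y)."""
    return list(instances)

def predict(model, x):
    """Predict y for a test instance by copying the label of the closest training point."""
    nearest = min(model, key=lambda inst: math.dist(inst[0], x))
    return nearest[1]

# Toy data: two continuous independent variables, one discrete label (classification).
training = [((0.0, 0.1), "blue"), ((0.2, 0.0), "blue"), ((1.0, 0.9), "red")]
model = train(training)
print(predict(model, (0.9, 1.0)))  # -> "red"
```
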
Traditional unsupervised learning
• No training / testing phases
• Input is a collection of records with independent attributes alone: $\langle x_1, x_2, \ldots, x_n \rangle$
• A measure of similarity between instances
• Partition or cover instances using clusters with large “self-similarity” and small “cross-similarity”
• Hierarchical partitions

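As one concrete instance of similarity-based clustering, here is a minimal k-means sketch in plain Python (k-means is a standard example, not an algorithm specific to this talk): instances end up close to their own centroid (large self-similarity) while centroids stay apart (small cross-similarity).

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Partition points into k clusters by alternating assignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to the nearest centroid (similarity here is
        # negative Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return clusters

points = [(0.1, 0.2), (0.0, 0.1), (0.9, 1.0), (1.0, 0.8)]
print(kmeans(points, k=2))
```
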
Learning hypertext models
• Entities are pages, sites, paragraphs, links, people, bookmarks, clickstreams…
• Transformed into simple models and relations:
  • Vector space / bag-of-words: occurs(term, page, cnt)
  • Hyperlink graph: cites(page, page)
  • Topic directories: is-a(topic, topic), example(topic, page)
  • Discrete time series

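A minimal sketch of how these relations might be held in memory, using plain Python containers; the representation is illustrative, not prescribed by the talk.

```python
# occurs(term, page, cnt): bag-of-words counts per page.
occurs = {("portal", "p1"): 3, ("search", "p1"): 1, ("search", "p2"): 5}

# cites(page, page): the hyperlink graph as a set of directed edges.
cites = {("p1", "p2"), ("p2", "p3")}

# is-a(topic, topic) and example(topic, page): a topic directory.
is_a = {("web-portals", "internet"), ("search-engines", "internet")}
example = {("web-portals", "p1"), ("search-engines", "p2")}

# Typical derived view: the out-links of a page.
def out_links(page):
    return {dst for (src, dst) in cites if src == page}

print(out_links("p1"))  # -> {'p2'}
```
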
Challenges
• Large feature space in raw data
  • Structured data sets: 10s to 100s
  • Text (Web): 50 to 100 thousand
• Most features not completely useless
  • Feature elimination / selection not perfect
  • Beyond linear transformations?
• Models used today are simplistic
  • Good accuracy on simple labeling tasks
  • Lose a lot of detail present in hypertext to fit known learning techniques

Challenges
• Complex, interrelated objects
  • Not a structured tuple-like entity
  • Explicit and implicit connections
  • Document markup sub-structure
  • Site boundaries and hyperlinks
  • Placement in popular directories like Yahoo!
• Traditional distance measures are noisy
  • How to combine diverse features? (Or, a link is worth a ? words)
  • Unreliable clustering results

This session
• Semi-supervised clustering (Rich Caruana)
  • Enhanced clustering via user feedback
• Kernel methods (Nello Cristianini)
  • Modular learning systems for text and hypertext
• Reference matching (Andrew McCallum)
  • Recovering and cleaning implicit citation graphs from unstructured data

This talk: Two examples
• Learning topics of hypertext documents
  • Semi-supervised learning scenario
  • Unified model of text and hyperlinks
  • Enhanced accuracy of topic labeling
• Segmenting hierarchical tagged pages
  • Topic distillation (hubs and authorities)
  • Minimum description length segmentation
  • Better focused topic distillation
  • Extract relevant fragments from pages

Classifying interconnected entities
• Early examples:
  • Some diseases have complex lineage dependency
  • Robust edge detection in images (finding edge pixels in a differentiated image)
• How are topics interconnected in hypertext?
• Maximum likelihood graph labeling with many classes
[Figure: a graph of unlabeled nodes marked “?”, one with estimated class probabilities 0.3 red / 0.7 blue, and a 2×2 class-coupling matrix (0.6 0.4; 0.3 0.7)]

Naïve Bayes classifiers
• Decide topic: topic $c$ is picked with prior probability $\pi(c)$, $\sum_c \pi(c) = 1$
• Each $c$ has parameters $\theta(c,t)$ for terms $t$
• A coin with face probabilities $\theta(c,t)$, $\sum_t \theta(c,t) = 1$
• Fix document length $n(d)$ and toss the coin
• Naïve yet effective; can use other algorithms
• Given $c$, the probability of the document is
$$\Pr[d \mid n(d), c] = \binom{n(d)}{\{n(d,t)\}} \prod_{t \in d} \theta(c,t)^{n(d,t)}$$

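A minimal sketch of scoring under this model in plain Python. In practice one works with log-probabilities, and the multinomial coefficient can be dropped when comparing classes because it does not depend on $c$. The toy parameters below are illustrative.

```python
import math

def log_class_score(doc_counts, log_prior, log_theta):
    """log pi(c) + sum_t n(d,t) * log theta(c,t); the multinomial coefficient
    is omitted since it is identical for every class c."""
    return log_prior + sum(n * log_theta[t] for t, n in doc_counts.items())

def classify(doc_counts, priors, thetas):
    """Return argmax_c Pr[c | d] under the naive Bayes model."""
    return max(
        priors,
        key=lambda c: log_class_score(
            doc_counts,
            math.log(priors[c]),
            {t: math.log(p) for t, p in thetas[c].items()},
        ),
    )

# Toy parameters (hypothetical): two topics over a three-term vocabulary.
priors = {"sports": 0.5, "politics": 0.5}
thetas = {
    "sports":   {"game": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"game": 0.1, "vote": 0.7, "team": 0.2},
}
print(classify({"game": 2, "team": 1}, priors, thetas))  # -> "sports"
```
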
Enhanced models for hypertext
• $c$ = class, $d$ = text, $N$ = neighbors
• Text-only model: $\Pr(d \mid c)$
• Using neighbors’ text to judge my topic: $\Pr(d, d(N) \mid c)$
• Better recursive model: $\Pr(d, c(N) \mid c)$
• Relaxation labeling over Markov random fields
• Or, an EM formulation

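A schematic sketch of relaxation labeling under simplifying assumptions: each page keeps a class distribution that is repeatedly re-estimated from its own text likelihood and its neighbors’ current distributions. The coupling matrix and text likelihoods here are hypothetical toy values, not the talk’s parameters.

```python
def relax(pages, neighbors, text_like, coupling, iters=10):
    """pages: page ids; neighbors[p]: pages linked to p;
    text_like[p][c]: Pr(text of p | c); coupling[c][c2]: affinity of class c
    for a neighbor of class c2 (hypothetical). Returns a distribution per page."""
    classes = list(coupling)
    # Start from text-only posteriors.
    dist = {p: _norm({c: text_like[p][c] for c in classes}) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            score = {}
            for c in classes:
                s = text_like[p][c]
                for q in neighbors[p]:
                    # Weight by how compatible c is with q's current labeling.
                    s *= sum(coupling[c][c2] * dist[q][c2] for c2 in classes)
                score[c] = s
            new[p] = _norm(score)
        dist = new
    return dist

def _norm(score):
    z = sum(score.values())
    return {c: v / z for c, v in score.items()}

pages = ["a", "b"]
neighbors = {"a": ["b"], "b": ["a"]}
text_like = {"a": {"red": 0.7, "blue": 0.3}, "b": {"red": 0.4, "blue": 0.6}}
coupling = {"red": {"red": 0.6, "blue": 0.4}, "blue": {"red": 0.3, "blue": 0.7}}
print(relax(pages, neighbors, text_like, coupling))
```
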
Hyperlink modeling boosts accuracy
• 9600 patents from 12 classes marked by USPTO
• Patents have text and prior-art links
• Expand test patent to include its neighborhood
• ‘Forget’ and re-estimate a fraction of the neighbors’ classes
• (Even better for Yahoo!)
[Figure: %Error vs. %Neighborhood known (0–100) for Text, Link, and Text+Link models]

Hyperlink Induced Topic Search (HITS)
• A keyword search engine responds to the query with a root set, which is expanded into a radius-1 graph
• ‘Hubs’ ($h$) and ‘authorities’ ($a$) reinforce each other over the edge matrix $E$:
$$a = E^T h, \qquad h = E a$$
[Figure: query → keyword search engine → response; hub and authority nodes in the radius-1 expanded graph]

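A minimal power-iteration sketch of HITS in plain Python; the graph is a toy example, and normalization details vary across presentations.

```python
def hits(edges, nodes, iters=20):
    """edges: set of (src, dst) pairs (the matrix E).
    Iterate a = E^T h, h = E a, normalizing each round."""
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # a = E^T h: a page is a good authority if good hubs point to it.
        auth = {n: sum(hub[s] for (s, d) in edges if d == n) for n in nodes}
        # h = E a: a page is a good hub if it points to good authorities.
        hub = {n: sum(auth[d] for (s, d) in edges if s == n) for n in nodes}
        auth = _normalize(auth)
        hub = _normalize(hub)
    return hub, auth

def _normalize(score):
    z = sum(v * v for v in score.values()) ** 0.5 or 1.0
    return {n: v / z for n, v in score.items()}

nodes = {"h1", "h2", "a1", "a2"}
edges = {("h1", "a1"), ("h1", "a2"), ("h2", "a1")}
hub, auth = hits(edges, nodes)
print(sorted(auth.items(), key=lambda kv: -kv[1]))  # a1 ranks above a2
```
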
“Topic drift” and various fixes
• Some hubs have ‘mixed’ content: authority ‘leaks’ through mixed hubs, as activation flows from good to bad pages over ‘thick’ links
• Clever: match the query with anchor text (query terms in a window around the link) to favor some edges
• B&H: eliminate outlier documents (beyond a cut-off radius from the centroid in the vector-space model)

Document object model (DOM)
• Hierarchical graph model for semi-structured data
• Can extract a reasonable DOM from HTML
• A fine-grained view of the Web
• Valuable because page boundaries are less meaningful now

<html><head>
<title>Portals</title>
</head><body><ul>
<li><a href=“…”>Yahoo</a></li>
<li><a href=“…”>Lycos</a></li>
</ul></body></html>

[Figure: the corresponding DOM tree — html → (head → title) and (body → ul → two li → a)]

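A minimal sketch of extracting a DOM-like tree from HTML using Python’s standard-library parser; real extraction must cope with malformed markup, which html.parser tolerates only partially.

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Build a nested (tag, children) tree from start/end tag events."""
    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(("#text", data.strip()))

builder = DomBuilder()
builder.feed("<html><head><title>Portals</title></head>"
             "<body><ul><li><a href='#'>Yahoo</a></li>"
             "<li><a href='#'>Lycos</a></li></ul></body></html>")
print(builder.root)
```
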
A model for hub generation
• Global hub score distribution $\Theta_0$ w.r.t. the given query
• Authors use DOM nodes to specialize $\Theta_0$ into local distributions $\Theta_v$
• At a certain ‘cut’ (the model frontier) in the DOM tree, the local distribution directly generates hub scores
[Figure: the global distribution is progressively ‘distorted’ down the DOM tree to a model frontier, below which hub scores are generated; other pages lie outside]

Optimizing a cost measure
• Reference distribution $\Theta_0$
• Distribution distortion cost (for Poisson distributions):
$$KL(\Theta_v \,\|\, \Theta_0) = \Theta_v \log\frac{\Theta_v}{\Theta_0} + \Theta_v\!\left(\frac{\Theta_0}{\Theta_v} - 1\right)$$
• Data encoding cost is roughly
$$-\sum_{h \in H_v} \log \Pr(h \mid \Theta_v)$$
where $H_v$ is the set of hub scores under DOM node $v$

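A sketch of evaluating these two costs in plain Python, assuming (as a simplification) that $\Theta$ is the rate of a Poisson distribution over integer hub scores; the function names and toy data are illustrative.

```python
import math

def poisson_kl(rate_v, rate_0):
    """KL(Poisson(rate_v) || Poisson(rate_0)) =
    rate_v * log(rate_v / rate_0) + rate_v * (rate_0 / rate_v - 1)."""
    return rate_v * math.log(rate_v / rate_0) + rate_0 - rate_v

def poisson_log_pmf(h, rate):
    """log Pr(h | rate) for the Poisson distribution."""
    return h * math.log(rate) - rate - math.lgamma(h + 1)

def total_cost(hub_scores, rate_v, rate_0):
    """Distortion cost of specializing rate_0 into rate_v, plus the cost
    of encoding the hub scores under node v with the local model."""
    distortion = poisson_kl(rate_v, rate_0)
    encoding = -sum(poisson_log_pmf(h, rate_v) for h in hub_scores)
    return distortion + encoding

scores = [0, 1, 0, 3, 2]  # toy hub scores in the subtree of some DOM node v
print(total_cost(scores, rate_v=sum(scores) / len(scores), rate_0=0.5))
```
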
Modified topic distillation algorithm
• Initialize DOM graph
• Let only root-set authority scores be 1
• Repeat until reasonable convergence:
  • Authority-to-hub score propagation
  • MDL-based hub score smoothing
  • Hub-to-authority score propagation
  • Normalization of authority scores
• Segment and rank micro-hubs
• Present annotated results
• Will this (non-linear) system converge?
• Will segmentation help in reducing drift?

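A schematic skeleton of this loop in Python. The smoothing step below is a trivial stand-in (clipping tiny hub scores), NOT the MDL-based segmentation the talk describes; the graph and thresholds are toy values.

```python
def distill(edges, nodes, root_set, max_iters=15, tol=1e-6):
    """Schematic modified topic distillation loop over a toy link graph."""
    auth = {v: (1.0 if v in root_set else 0.0) for v in nodes}
    hub = {v: 0.0 for v in nodes}
    for _ in range(max_iters):
        # Authority-to-hub propagation: h = E a.
        hub = {v: sum(auth[d] for (s, d) in edges if s == v) for v in nodes}
        # MDL-based hub score smoothing would go here; as a placeholder we
        # clip tiny hub scores, which is NOT the paper's method.
        hub = {v: (h if h > 1e-3 else 0.0) for v, h in hub.items()}
        # Hub-to-authority propagation: a = E^T h.
        new_auth = {v: sum(hub[s] for (s, d) in edges if d == v) for v in nodes}
        # Normalization of authority scores.
        z = sum(a * a for a in new_auth.values()) ** 0.5 or 1.0
        new_auth = {v: a / z for v, a in new_auth.items()}
        change = sum(abs(new_auth[v] - auth[v]) for v in nodes) / len(nodes)
        auth = new_auth
        if change < tol:  # "reasonable convergence"
            break
    return hub, auth

nodes = {"r", "p", "q"}
edges = {("r", "p"), ("r", "q"), ("p", "q")}
print(distill(edges, nodes, root_set={"p", "q"}))
```
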
Convergence
[Figure: mean authority score change per iteration, dropping from about 1e-1 to 1e-7 over roughly 10 iterations]
• 28 queries used in Clever and by B&H
• 366k macro-pages, 10M micro-links
• Rank converges within 15 iterations

Smoothing statistics
• ‘Expanded’ implies authority diffusion arrested
• As nodes outside the root set start participating in the distillation…
  • #Expanded increases
  • #Pruned decreases
• Prevents authority leaks via mixed hubs
[Figure: effect of micro-hub segmentation — counts of Expanded vs. Pruned nodes (0–2500) over iterations 1–9]

Rank correlation with B&H
• Positively correlated
• Some negative deviations
• Pseudo-authorities downgraded by our algorithm
• These were earlier favored by mixed hubs
[Figure: our authority score vs. B&H authority score; axes not to the same scale]

Conclusion
• Hypertext and the Web pose new modeling and algorithmic challenges
• Locality exists in many guises
• Diverse sources of information: text, links, markup, usage
• Unifying models needed
• Anecdotes suggest that synergy can be exploited