nips2000a - Department of Computer Science and Engineering
Download
Report
Transcript nips2000a - Department of Computer Science and Engineering
Machine learning for the Web:
Applications and challenges
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
IIT Bombay
www.cse.iitb.ernet.in/~soumen
Traditional supervised learning
Training instance
x1 , x2 ,, xn ; y
Learner
Test instance
x1 , x2 ,, xn
Independent variables x
mostly continuous,
maybe categorical
Predicted variable y
discrete (classification)
or continuous
(regression)
Statistical
models,
inference
rules, or
separators
Learner
Prediction y
2
Traditional unsupervised learning
No training / testing phases
Input is a collection of
records with independent
attributes alone
x1 , x2 ,, xn
Measure of similarity
Partition or cover instances
using clusters with large
“self-similarity” and small
“cross-similarity”
Hierarchical partitions
Small crosssimilarity
Large selfsimilarity
3
Learning hypertext models
Entities are pages, sites,
paragraphs, links, people,
bookmarks,
clickstreams…
occurs(term, page, cnt)
cites(page, page)
Transformed into
simple models and
relations
•
•
•
•
Vector space/bag-of-words
Hyperlink graph
Topic directories
is-a(topic, topic)
Discrete time series
example(topic, page)
4
Challenges
Large feature space in raw data
• Structured data sets: 10s to 100s
• Text (Web): 50 to 100 thousand
Most features not completely useless
• Feature elimination / selection not perfect
• Beyond linear transformations?
Models used today are simplistic
• Good accuracy on simple labeling tasks
• Lose a lot of detail present in hypertext to
fit known learning techniques
5
Challenges
Complex, interrelated objects
• Not a structured tuple-like entity
• Explicit and implicit connections
• Document markup sub-structure
• Site boundaries and hyperlinks
• Placement in popular directories like Yahoo!
Traditional distance measures are noisy
• How to combine diverse features? (Or, a
link is worth a ? words)
• Unreliable clustering results
6
This session
Semi-supervised clustering
(Rich Caruana)
• Enhanced clustering via user feedback
Kernel methods (Nello Cristianini)
• Modular learning systems for text and
hypertext
Reference matching(Andrew McCallum)
• Recovering and cleaning implicit citation
graphs from unstructured data
7
This talk: Two examples
Learning topics of hypertext documents
• Semi-supervised learning scenario
• Unified model of text and hyperlinks
• Enhanced accuracy of topic labeling
Segmenting hierarchical tagged pages
•
•
•
•
Topic distillation (hubs and authorities)
Minimum description length segmentation
Better focused topic distillation
Extract relevant fragments from pages
8
Classifying interconnected entities
Early examples:
• Some diseases have
complex lineage
dependency
• Robust edge
detection in images
?
How are topics
interconnected in
hypertext?
Maximum likelihood
graph labeling with
many classes
Finding edge
pixels in a
differentiated
image
?
.3 red
? .7 blue
?
?
?
0.6 0.4
0.3 0.7
9
Naïve Bayes classifiers
Decide topic; topic c is picked with prior
probability (c); c(c) = 1
Each c has parameters (c,t) for terms t
Coin with face probabilities t (c,t) = 1
Fix document length n(d) and toss coin
Naïve yet effective; can use other algos
Given c, probability of document is
n( d )
(c, t ) n ( d ,t )
Pr[ d | n(d ), c]
{n(d , t )} td
10
Enhanced models for hypertext
c=class, d=text,
N=neighbors
Text-only model: Pr(d|c)
Using neighbors’ text to
judge my topic:
Pr(d, d(N) | c)
Better recursive model:
Pr(d, c(N) | c)
Relaxation labeling over
Markov random fields
Or, EM formulation
?
11
Hyperlink modeling boosts accuracy
40
35
30
%Error
9600 patents from 12
classes marked by
USPTO
Patents have text and
prior art links
Expand test patent to
include neighborhood
‘Forget’ and reestimate fraction of
neighbors’ classes
(Even better for Yahoo)
25
20
15
10
5
0
0
50
100
%Neighborhood known
Text
Link
Text+Link
12
Hyperlink Induced Topic Search
Query
h
h
Keyword
Search
engine
a
h
a
a = ETh
h = Ea
‘Hubs’ and
‘authorities’
h
a
Response
a
Radius-1 expanded graph
13
“Topic drift” and various fixes
Some hubs have
‘Thick’ links
‘mixed’ content
Authority ‘leaks’
through mixed hubs
Activation
from good to bad
window
Query term
pages
Clever: match query Vector-space Centroid
with anchor text to
document
Cut-off
model
favor some edges
radius
B&H: eliminate
outlier documents
×
14
Document object model (DOM)
Hierarchical graph <html><head>
<title>Portals</title>
model for semi</head><body><ul>
structured data
<li><a href=“…”>Yahoo</a></li>
<li><a href=“…”>Lycos</a></li>
Can extract
</ul></body></html>
reasonable DOM
from HTML
html
A fine-grained view
head
body
of the Web
title
ul
Valuable because
li
li
page boundaries are
a
a
less meaningful now
15
A model for hub generation
Global hub score
distribution 0 w.r.t.
given query
Authors use DOM
nodes to specialize
0 into local I
At a certain ‘cut’ in
the DOM tree, local
distribution directly
generates hub
scores
Global distribution
Model
frontier
Progressive
‘distortion’
Other pages
16
Optimizing a cost measure
Reference distribution 0
Distribution distortion cost is
v
0
KL(v || 0 ) log
v
v
1
v
(for Poisson distribution)
Hv
Data encoding cost is roughly
log Pr(h | )
hH v
v
17
Modified topic distillation algorithm
Initialize DOM graph
Let only root set authority scores be 1
Repeat until reasonable convergence:
Authority-to-hub score propagation
MDL-based hub score smoothing
Hub-to-authority score propagation
Normalization of authority scores
Segment and rank micro-hubs
Present annotated results
Will this (non-linear) system converge?
Will segmentation help in reducing drift?
18
Convergence
1.00E-01
Mean auth score change
1.00E-02
1.00E-03
1.00E-04
1.00E-05
1.00E-06
1.00E-07
0
2
4
6
Iterations
8
10
28 queries used in Clever and by B&H
366k macro-pages, 10M micro-links
Rank converges within 15 iterations
19
‘Expanded’ implies
authority diffusion
arrested
As nodes outside
rootset start
participating in the
distillation…
• #Expanded increases
• #Pruned decreases
Prevents authority
leaks via mixed hubs
Smoothing statistics
Effect of micro-hub segmentation
2500
2000
1500
1000
500
0
Expanded
1
Pruned
2
3
4
5
6
7 8 9
Iterations
20
Rank correlation with B&H
0.025
0.02
Our authority score
Positively
correlated
Some negative
deviations
Pseudoauthorities
downgraded by
our algorithm
These were
earlier favored
by mixed hubs
0.015
0.01
0.005
0
0
0.005
0.01
Authority score B&H
0.015
(Axes not to same scale)
21
Conclusion
Hypertext and the Web pose new
modeling and algorithmic challenges
Locality exists in many guises
Diverse sources of information: text,
links, markup, usage
Unifying models needed
Anecdotes suggest that synergy can be
exploited
22