Transcript adma08

Automatic Web Tagging and
Person Tagging
Using Language Models
- Qiaozhu Mei†, Yi Zhang‡
Presented by Jessica Gronski‡
† University
of Illinois at Urbana-Champaign
‡ University of California at Santa Cruz
Tagging a Web Document
• The dual problem of search/retrieval:
[Mei et al. 2007]
– Retrieval: short description (query)  relevant
documents
– Tagging: document  short description (tag)
• To summarize the content of documents
• To access the document in the future
retrieval
Text
Document
Query/Tag
tagging
2
Social Bookmarking of Web
Documents
Web documents
Social bookmarks (tags)
3
Existing Work on Social Bookmarking
• Social Bookmarking Systems
– Del.icio.us, Digg, Citeulike, etc.
• Enhance Social bookmarking systems
– Anti-spam [Koutrika et al 2007]
– Search& ranking tags [Hotho et al 2006]
• Utilize social bookmarks
– Visualization [Dubinko et al. 2006]
– Summarization [Boydell et al. 2007]
– Use tags to help web search:
[Heymann et al. 2008]; [Zhou et al. 2008]
4
Research Questions
• Can we automatically generate tags for web
documents?
– Meaningful, compact, relevant
• Can we generate tags for other web objects,
such as web users?
5
Applications of Automatic Tagging
• Summarizing documents/ web objects
• Suggest social bookmarks
• Refine queries for web search
– Finding good queries to a document
• Suggest good keywords for online advertising
6
Rest of the Talk
• A probabilistic approach to tag generation
– Candidate Tag Selection
– Web document representation
– Tag ranking
• Experiments
– Web documents tagging;
– web user tagging
• Summary
7
Our Method
User-Generated Corpus
(e.g., Del.icio.us, Wikipedia)
ipod nano, data mining, presidential campaign
index structure, statistics tutorial,
computer science…
Candidate tag pool
representation
data
statistics
tutorial
analysis
software
model
frequent
probabilistic
algorithm
…
Web Documents
0.1599
0.0752
0.0660
0.0372
0.0311
0.0310
0.0233
0.0188
0.0173
Multinomial word
Distribution
Ranking candidate tags
data mining
0.26
statistics tutorial
0.19
computer science 0.17
index structure
0.01
……
ipod nano
0.00001
presidential campaign 0.0
……
8
Candidate Tag Selection
• Meaningful, compact, user-oriented
• From social bookmarking data
– E.g., Del.icio.us
– Single tags  tags that other people used
– “phrases”  statistically significant bigrams
• From other user-generated web contents
– E.g., Wikipedia
– Titles of entries in wikipedia
9
Representation of Web Documents
• Multinomial distribution of words
(unigram language models)
– Commonly used in retrieval and
text mining
• Can be estimated from the
content of the document, or
from social bookmarks (our approach)
– What other people used to tag that
document
text
mining
data
probabilistic
independence
model
…
0.16
0.08
0.07
0.04
0.03
0.03
Baseline: Use the top
words in that distribution
to tag a document
10
Tag Ranking: A Probabilistic Approach
• Web documents d  a language model
• A candidate tag t  a language model from its
co-occurring tags
• Score and rank t by KL-divergence of these two
language models
p(w | t , C )
p( w | t )
f (t , d )   D(d || t )   p( w | d ) log
p( w | d )
w
Social Bookmark
Collection
11
Rewriting and Efficient Computation
p( w | t )
p(w | d )
w
p(w | t , C )
p( w | d )
p( w | t , C )
  p ( w | d ) log
  p ( w | d ) log
  p ( w | d ) log
p(w | C )
p(w | C ) w
p( w | t )
w
w
p ( w, t | C )
  p ( w | d ) log
 D(d || C )  Bias (t , C )
p
(
w
|
C
)
p
(
t
|
C
)
w
f (t , d )   p ( w | d ) log
PMI (t , w | C )
Bias of using C to represent
document d (e.g., del.icio.us)
Bias of using C to represent
candidate tag t
rank
f (t , d )  Ed [ PMI ( w, t | C )]
1. Can be pre-computed from corpus;
2. Only store those PMI(w,t|C) > 0
12
Tagging Web Users
• Summarize the interests and bias of a user
• Web user  a pseudo document
• Estimate a language model from all tags that he
used
• The rest is similar to web document tagging
13
Experiments
• Dataset:
– Two-week tagging records from Del.icio.us
Time Span
Bookmarks Distinct
Tags
Distinct
Users
02/13/07 ~ 02/26/07
579,652
20,138
111,381
– Candidate tags:
• Top 15,000 Significant 2-grams from del.icio.us;
• titles of all wikipedia entries (5,836,166 entries,
around 48,000 appeared in del.icio.us)
14
Tagging Web Documents
Urls
LM p(w|d)
Tag = Word TagMeaningful,
= bigram
Relevant, precise
relevant
http://kuler.adobe.com/
(158 bookmarks)
Too general,
sometimes not
relevant
http://www.youtube.com/
watch?v=6gmP4nk0EOE
(157 bookmarks)
color
color
design
colour
webdesign
palette
colorscheme
tools
data, not
adobe overfitcolours
phrases
graphics real picker
cor
flash
web2.0
video
But
sometimes
youtube
not meaningful
web
internet
xml
community
youtube
revver
vodcast
primer
comunidad
participation
ethnograpy
Meaningful,
relevant
Tag =
wikipedia
title
But partially
adobe color
color
covers good
color design
colour
color colour
palettetags
color colors
web color
colour design
colours
Meaningful,
inspiration palette cor
relevant,
webdesign color
rgb real
xml youtube
web2.0 youtube
video web2.0
web2.0 xml
online presentation
social video
youtube video
internet video
youtube
revver
research video
vodcast
primer
p2p TV
overfit
data, not
real phrases
15
Tagging Web Documents (Cont.)
Urls
http://pipes.yahoo.com
(386 bookmarks)
http://www.miniajax.com/
(349 bookmarks)
Too general,
sometimes not
relevant
LM p(w|d)
Tag = Word
Relevant, precise
yahoo
pipes
rss
feeds
web2.0
yahoo
mashup
mashup
feeds
rss
programming syndication
Meaningful,
pipes
mashups
relevant
ajax
javascript
web2.0
webdesign
programming
code
webdev
ajax
dhtml
javascript
moo.fx
dragdrop
phototype
autosuggest
But sometimes
not meaningful
Tag = bigram
Tag =
wikipedia title
feeds mashups
mashup pipes
web2.0 yahoo
rss web2.0
mashup rss
api feeds
pipes programming
ajax code
code javascript
javascript ajax
javascript web2.0
css ajax
javascript programming
overfit
data, not
real phrases
pipes
yahoo
mashups
rss
syndication
mashups
Meaningful,
blog
feeds real
relevant,
ajax
dhtml
javascript
moo.fx
javascript library
javascript framework
16
Tagging Web Users
Users
LM p(w|d)
photography
art
portraits
tools
web data, not
overfit
design
real
phrases
geek
User 1
humor
programming
photography
Partially covers blog
the interest webdesign
security
funny
User 2
Tag = bigram
Tag = wikipedia title
art photography
photography portraits
digital flickr
photoblog photography
art photo
flickr photography
weblog wordpress
art photography
photoblog
portraits
photography
landscapes
flickr
art contest
geek hack
humor programming
hack hacking
network programming
tweak
Meaningful,
hacking
relevant, real
security
geek humor
sysadmin
digitalcamera
networking programming
geek html
geek hacking
reference security
17
Tagging Web Users (Cont.)
Users
LM p(w|d)
Tag = bigram
Tag = wikipedia title
User 3
games
arg
tools
programming
sudoku
cryptography
software
arg games
games puzzles
games internet
arg code
games sudoku
code generator
community games
arg
games research
games
puzzles
storytelling
code generator
community games
User 4
web
reference
css
development
rubyonrails
tools
design
rubyonrails web
css development
brower development
development editor
development forum
development firefox
javascript tools
javascript
css
webdev
xhtml
Missed many
dhtml
good tags
css3
dom
18
Discussions
• Using top tags: too general, sometimes not relevant
• Ranking tags by labeling language models:
– Candidate = Social bookmarking words
• Pros: relevant, compact
• Cons: ambiguous, not so meaningful
– Candidate = Social bookmarking bigrams
• Pros: more meaningful, relevant
• Cons: overfiting the data, sometimes not real phrases
– Candidate = Wikipedia Titles:
• Pros: meaningful, relevant real phrases
• Cons: biased, missed potential good tags. (Bias(t, C))
19
Summary
• Automatic tagging of web documents and web
users
• A probabilistic approach based on labeling
language models
• Effective when the candidate tags are of high
quality
• Future work:
– A robust way of generating candidate tags
– Large scale evaluation
20
Thanks!
21