Extracting User Profiles from Large Scale Data
David Konopnicki
Joint work with Michal Shmueli-Scheuer, Haggai Roitman,
David Carmel and Yosi Mass
IBM Haifa Research Lab
© 2009 IBM Corporation
Motivating Example

[Diagram: user browsing activity (example keywords: "san-francisco", "peer", "michael jackson alive") flows into a large-scale content-analysis component for a mass amount of users; the analysis updates user profiles in a profiles database, which feeds a dashboard tracking statistics about readers' interests and an advertisement system.]

Keywords modeling: for each user, report the most meaningful keywords that describe her profile.
Contributions
• User Profiling Framework:
  – User profile model
  – KL approach to weight the user profile
• Large scale implementation:
  – MapReduce flow
• Experiments:
  – Quality analysis
  – Scalability analysis
User Profiling Framework – Setting

[Diagram: a logging component produces <userID, docID> records (e.g. <u1,d1>, <u1,d2>, <u2,d2>) and <docID, content> records (e.g. <d1,{bla,bla,bla}>, <d2,{foo,foo}>); the computed profiles are then used for targeting.]
User Profiling – Definitions

• Bag of words model (BOW):
  p(u) = (w_u^j(t_1), w_u^j(t_2), ..., w_u^j(t_m)),  t_i ∈ Vocabulary

• Profile maintenance:
  p_j(u) = α_j · p_{j-1}(u) + (1 − α_j) · p̃_j(u)

• User snapshot: D_j(u)

• Community snapshot: D_j
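A minimal sketch of the profile model and the maintenance step, assuming the BOW profile is a plain dict from terms to weights and that p̃_j(u) denotes the profile computed from the current snapshot; the names Profile and update_profile are mine, not from the slides.

```python
from typing import Dict

Profile = Dict[str, float]  # BOW model: term -> weight

def update_profile(prev: Profile, snapshot: Profile, alpha: float) -> Profile:
    """Profile maintenance: p_j(u) = alpha * p_{j-1}(u) + (1 - alpha) * p~_j(u)."""
    terms = set(prev) | set(snapshot)
    return {t: alpha * prev.get(t, 0.0) + (1 - alpha) * snapshot.get(t, 0.0)
            for t in terms}
```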
User Profiling – Intuition

• Find terms that are highly frequent in the user snapshot and that best separate the user snapshot from the community snapshot

{ Travel, Tennis, Sport }
User Profiling – Naïve approach
• Term frequency tf(t,d): the number of times a term t appears in document d
• Document frequency df(t,D): the number of documents in D containing the term t
w_u^j(t) = tf(t, D_j(u)) · idf(t, D_j) · udf(t, D_j(u))

where
– tf(t, D_j(u)): average tf of t over the user snapshot ("frequent")
– idf(t, D_j): inverse document frequency of t in the community snapshot ("separate")
– udf(t, D_j(u)): probability to find t in a document of the user snapshot
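A rough sketch of this naïve weighting, assuming each snapshot is a list of tokenized documents; the helper names are illustrative, and the exact idf variant is not specified on the slide.

```python
import math
from typing import List

Doc = List[str]  # one tokenized document

def avg_tf(term: str, docs: List[Doc]) -> float:
    """Average term frequency of `term` over a snapshot."""
    return sum(d.count(term) for d in docs) / max(len(docs), 1)

def idf(term: str, docs: List[Doc]) -> float:
    """Inverse document frequency of `term` in the community snapshot
    (a common smoothed variant; the slide does not fix the exact formula)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / (1 + df))

def udf(term: str, docs: List[Doc]) -> float:
    """Probability of finding `term` in a document of the user snapshot."""
    return sum(1 for d in docs if term in d) / max(len(docs), 1)

def naive_weight(term: str, user_docs: List[Doc], community_docs: List[Doc]) -> float:
    """w_u(t) = tf(t, D_j(u)) * idf(t, D_j) * udf(t, D_j(u))."""
    return avg_tf(term, user_docs) * idf(term, community_docs) * udf(term, user_docs)
```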
Kullback-Leibler (KL) Divergence
• Measures the difference between two probability distributions P1 and P2:

  D_KL(P1 || P2) = Σ_{t∈V} P1(t) · log( P1(t) / P2(t) )

• KL measures the distance between the community distribution and the user distribution

• Each term is scored according to its contribution to the KL distance between the community and the user distributions.

• The top-scored terms are then selected as the user's important terms.
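A small sketch of scoring terms by their contribution to the KL distance, assuming both distributions are given as dicts over the vocabulary; the function names are mine.

```python
import math
from typing import Dict

def kl_contributions(p_user: Dict[str, float], p_comm: Dict[str, float]) -> Dict[str, float]:
    """Per-term contribution P1(t) * log(P1(t) / P2(t)) to D_KL(P1 || P2)."""
    return {t: p * math.log(p / p_comm[t])
            for t, p in p_user.items() if p > 0 and p_comm.get(t, 0) > 0}

def top_k_terms(p_user: Dict[str, float], p_comm: Dict[str, float], k: int = 5):
    """Select the k terms that contribute most to the KL distance."""
    scores = kl_contributions(p_user, p_comm)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```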
User Profiling – KL method
• Community marginal term distribution:

  P(t | D_j) = tf(t, D_j) · cdf(t, D_j) / N_j

  – tf(t, D_j): average tf of t over the community snapshot
  – cdf(t, D_j): probability to find term t in the community snapshot
  – N_j: probability normalization factor

• User marginal term distribution (smoothed with the community snapshot):

  P(t | D_j(u)) = (1 − λ) · ( w_u^j(t) / Σ_{t'∈V} w_u^j(t') ) + λ · P(t | D_j),  λ = 0.001

  – w_u^j(t) / Σ_{t'∈V} w_u^j(t'): relative initial weight of term t
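Combining the two distributions, a hedged sketch of the KL-based term scoring with λ = 0.001 as on the slide; dict-based, with helper names of my own choosing.

```python
import math
from typing import Dict

LAMBDA = 0.001  # smoothing weight from the slide

def user_distribution(weights: Dict[str, float], p_comm: Dict[str, float]) -> Dict[str, float]:
    """P(t | D_j(u)) = (1 - lambda) * w(t) / sum(w) + lambda * P(t | D_j)."""
    total = sum(weights.values()) or 1.0
    return {t: (1 - LAMBDA) * (weights.get(t, 0.0) / total) + LAMBDA * p
            for t, p in p_comm.items()}

def kl_term_scores(weights: Dict[str, float], p_comm: Dict[str, float]) -> Dict[str, float]:
    """w~(t) = P(t | D_j(u)) * log( P(t | D_j(u)) / P(t | D_j) )."""
    p_user = user_distribution(weights, p_comm)
    return {t: p_user[t] * math.log(p_user[t] / p_comm[t])
            for t in p_comm if p_comm[t] > 0 and p_user[t] > 0}
```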
MapReduce Flow

Input on HDFS: <d, text> and <u, d> records.

• TF job:
  – Mapper: input (d, text); output ({t,d}, 1)
  – Reducer: output ({t,d}, tf(t,d))  // sum
• |Dj(u)| job:
  – Mapper: input (u, d); output (u, 1)
  – Reducer: output (u, |Dj(u)|)  // sum
• TF-bar job (average tf over the community snapshot):
  – Mapper: input (t, tf(t,d), |Dj|); output (t, {tf(t,d), |Dj|, 1})
  – Reducer: output (t, tf(t,Dj))  // average
• DF job:
  – Mapper: input ({t,d}, tf(t,Dj)); output (t, 1)
  – Reducer: output (t, {df(t,Dj), idf(t,Dj), cdf(t,Dj)})
• UDF job:
  – Mapper: input ({u,t,d}, {tf(t,Dj(u)), |Dj(u)|}); output ({u,t,|Dj(u)|}, 1)
  – Reducer: output ({u,t}, udf(t,Dj(u)))
• Nj job:
  – Mapper: input (t, {tf(t,Dj), cdf(t,Dj)}); output (t, Nj)
  – Reducer: identity
• P(t|Dj) job:
  – Mapper: input (t, {tf(t,Dj), |Dj|, cdf(t,Dj), Nj}); output (t, P(t|Dj))
  – Reducer: identity

Each job reads its input from and writes its output to HDFS (intermediate records such as <d, t, tf(t,d)>, <u, d, |Dj(u)|>, <u, t, udf(t,Dj(u))>, <t, Nj>, and <t, P(t|Dj)>).
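As a concrete illustration of the first step in this flow, a minimal Hadoop Streaming style mapper/reducer for the TF job, written as a sketch rather than the authors' actual code; it assumes tab-separated <docID, text> input lines and that the stream key is configured to cover the first two fields.

```python
#!/usr/bin/env python
"""TF job sketch: mapper emits (term, doc, 1), reducer sums to tf(t, d)."""
import sys
from itertools import groupby

def mapper(lines):
    # Emit one count per term occurrence, keyed by (term, doc).
    for line in lines:
        doc_id, _, text = line.rstrip("\n").partition("\t")
        for term in text.split():
            print(f"{term}\t{doc_id}\t1")

def reducer(lines):
    # Hadoop sorts mapper output by key, so identical (term, doc) pairs are adjacent
    # (assumes stream.num.map.output.key.fields=2 so both fields form the key).
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for (term, doc_id), group in groupby(rows, key=lambda r: (r[0], r[1])):
        tf = sum(int(count) for _, _, count in group)
        print(f"{term}\t{doc_id}\t{tf}")

if __name__ == "__main__":
    if sys.argv[1] == "map":
        mapper(sys.stdin)
    else:
        reducer(sys.stdin)
```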
MapReduce Flow – cont.

• w job: compute w_u^j(t) = tf(t, D_j(u)) · idf(t, D_j) · udf(t, D_j(u))
  – Input: <u, t, d, tf(t,d), udf(t,Dj(u)), idf(t,Dj)>; output: <u, t, w>
• Σw job: sum the weights per user
  – Output: <u, t, Σ_{t∈V} w_u^j(t)>
• P(t|Dj(u)) job: compute the smoothed user distribution
  P(t | D_j(u)) = (1 − λ) · ( w_u^j(t) / Σ_{t'∈V} w_u^j(t') ) + λ · P(t | D_j)
  – Input: <u, t, Σw, P(t|Dj), w, λ>; output: <u, t, P(t|Dj), P(t|Dj(u))>
• w̃ job: compute the KL term score
  w̃_u^j(t) = P(t | D_j(u)) · log( P(t | D_j(u)) / P(t | D_j) )
  – Input: <u, t, P(t|Dj), P(t|Dj(u))>; output: <u, t, w̃>

As before, each job reads its input from and writes its output to HDFS.
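For the final step, a hedged sketch of a map-only streaming job that turns <u, t, P(t|Dj), P(t|Dj(u))> records into the KL term scores <u, t, w̃>; the tab-separated field layout is an assumption of mine.

```python
#!/usr/bin/env python
"""Final scoring job sketch: w~(t) = P(t|Dj(u)) * log( P(t|Dj(u)) / P(t|Dj) )."""
import math
import sys

for line in sys.stdin:
    user, term, p_comm, p_user = line.rstrip("\n").split("\t")
    p_comm, p_user = float(p_comm), float(p_user)
    if p_comm > 0 and p_user > 0:
        score = p_user * math.log(p_user / p_comm)
        print(f"{user}\t{term}\t{score:.6g}")
```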
Experimental Data – Quality Analysis

• Open Directory Project (ODP):
  – Categories are associated with manual labels
  – Considered as "ground truth" in this work
  – Examples:
    • ODP: Science/Technology/Electronics; manual label: "Electronics"
    • ODP: Society/Religion and Spirituality/Buddhism; manual label: "Buddhism"
• Data collection:
  – 100 different categories randomly selected from ODP
  – 100 documents randomly selected per category
  – A total collection size of about 10,000 Web pages
• Evaluation:
  – A match is counted if the suggested term is identical to, an inflection of, or a WordNet synonym of the manual label
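A possible sketch of this matching rule using NLTK's WordNet interface, with a Porter stemmer as a proxy for inflection matching (the paper's exact procedure may differ; requires the NLTK wordnet corpus to be downloaded).

```python
from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def synonyms(word: str) -> set:
    """All WordNet lemma names for the word, lower-cased."""
    return {lemma.name().lower().replace("_", " ")
            for synset in wn.synsets(word) for lemma in synset.lemmas()}

def is_match(suggested: str, label: str) -> bool:
    """Identical, an inflection (same stem), or a WordNet synonym of the label."""
    s, l = suggested.lower(), label.lower()
    if s == l or stemmer.stem(s) == stemmer.stem(l):
        return True
    return s in synonyms(l) or l in synonyms(s)
```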
Results
ODP category label | Top-5 KL important terms
Bowling            | bowl, bowler, lane, bowl center, league
Buddhism           | Buddhist, Buddhism, Buddha, Zen, dharma
Ice Hockey         | hockey, nhl, hockey league, coach, head coach
Electronics        | voltage, high voltage, circuit, laser, power supply
[Plot: Match@K for K = 1..10 (in how many cases at least one correct term appears among the top-K terms), comparing Mutual Information, Chi-square, tf-udf-idf, and KL.]
KL outperforms all other feature-selection approaches.
Experimental Data – Scalability Analysis

• Blogger.com (example blog entry: http://grannyalong.blogspot.com/)
• Data collection:
  – We crawled 973,518 blog posts from March 2007 until January 2009
  – Total collection size of 5.45GB, with ~120,000 users
• Cluster setting:
  – 4-node cluster of commodity machines (each with 4GB RAM, 60GB HD, 4 cores)
  – Hadoop 0.20.1
Number of User Profiles
[Plot: document ratio, user-profile ratio, and time ratio for datasets #2, #3, and #4.]
The runtime ratio is correlated with the user-profile ratio.
Data Size
[Plot: running time [min.] vs. number of documents [thousands]; 18,000 users chosen between March and April 2007.]
Runtime increases linearly with data size.
Related Work
• Content-based user profiling:
  – The profile contains a taxonomic hierarchy for the long-term model; the taxonomy is taken from the ODP, and short-term activities update the hierarchy.
  – Adaptive user profiles: use words that appear in the browsed Web pages and combine them with tf-idf over some window, weighting terms by the recency of the browsing.
• KL approach to user tasks:
  – Filter out new documents that are not related to the user, based on her profile.
  – Annotate a URL with the most descriptive query term for a given user, based on her profile.
• User targeting in large-scale systems:
  – A behavioral-targeting system built over Hadoop MapReduce.
  – A large-scale collaborative filtering (CF) technique for movie recommendations.
  – An incremental algorithm that constructs user profiles from monitoring and user feedback, trading off profile complexity against quality.
Conclusions & Future Work
• We proposed a scalable user-profiling solution implemented on top of Hadoop MapReduce
• We showed quality and scalability results
• We plan to extend the user model into a semantic model
• We plan to extend the user profile to include structured data
Thank You!