Spiro Title Slide
Download
Report
Transcript Spiro Title Slide
Extracting User Profiles from Large
Scale Data
David Konopnicki
Joint work with Michal Shmueli-Scheuer, Haggai Roitman,
David Carmel and Yosi Mass
IBM Haifa Research Lab
© 2009 IBM Corporation
User Browsing
Motivating Example
san-francisco
peer
michael
jackson alive
analysis
Keywords Modeling:
for each user, report the
most meaningful keywords
to describe her profile.
Large scale content
analysis for mass
amount of users.
Dashboard
Track statistics
about readers
interests
2
Update users
profiles
Profiles
database
Advertisement
System
© 2009 IBM Corporation
Contributions
User Profiling Framework:
– User profile model
– KL approach to weight user profile
Large scale implementation:
– MapReduce flow
Experiments:
– Quality analysis
– Scalability analysis
3
© 2009 IBM Corporation
User Profiling Framework- Setting
<userID, docID>
<docID, content>
<u1,d1>
<d1,{bla,bla,bla}>
<u1,d2>
<d2,{foo,foo}>
<u2,d2>
logging
4
targeting
© 2009 IBM Corporation
User Profiling - Definitions
Bag of words model (BOW)
p(u) (wuj (t1 ), wuj (t2 ),, wuj (tm )), ti Vocabulary
Profile maintenance
~
p j (u) j p j 1 (u) (1 j ) p j (u)
User snapshot
D j (u)
Community snapshot
Dj
5
© 2009 IBM Corporation
User Profiling - Intuition
Find terms that are highly frequent in the user
snapshot and separate the most between the
user and the community snapshots
{ Travel, Tennis ,Sport }
6
© 2009 IBM Corporation
User Profiling – Naïve approach
Term frequency: number of times a term t appears in document d- tf(t,d)
Document frequency: the number of documents containing the term t – df(t,D)
wuj (t ) tf (t , D j (u)) idf (t , D j ) udf (t , D j (u))
frequent
average tf over the
user snapshot
7
separate
inverse document
frequency (df) of a
term in the
community
snapshot
probability to find a
term in the user
snapshot
© 2009 IBM Corporation
Kullback-Leibler (KL) Divergence
Measures the difference between two probability distributions
P1 and P2 :
DKL ( P1 || P2 ) tV P1 (t ) log
KL measures the distance between the Community
distribution and the User distribution
P1 (t )
P2 (t )
Community
User
Each term is scored according to its contribution to
the KL distance between the community and the
user distributions.
The top scored terms are then selected as the
user important terms.
8
© 2009 IBM Corporation
User Profiling – KL method
Community marginal term distribution:
P(t || D j ) tf (t , D j ) cdf (t , D j ) N j
average tf over the
community snapshot
Probability to find a term
t in community snapshot
probability
normalization factor
User marginal term distribution
P (t || D j (u )) (1 )(
wuj (t )
tV
u
j
w (t )
) P (t | D j )
=0.001
Relative initial weight
of term t
9
Smoothing with the
community snapshot
© 2009 IBM Corporation
MapReduce Flow
Mapper:
input: (d,text)
output ({t,d},1)
Reducer:
output ({t,d}, tf(t,d)) // Sum
HDFS
d, text
u, d
TF
|Dj(u)|
d , t , tf (t , d )
u, d , | D j (u) |
Mapper:
input: (u,d)
output (u,1)
Reducer:
output (u,|Dj(u)|) // Sum
HDFS
Mapper:
input: (t,tf(t,d),|Dj|)
output (t,{tf(t,d),|Dj|,1})
Reducer:
output (t, tf(t, Dj)) //Avg
d , t, tf (t, d ),| Dj |
u, t, d , tf (t, d ),| Dj (u) |
¯
TF
t , tf (t , D j ),| D j |
UDF
DF
Mapper:
input: ({t,d},tf(t,Dj))
output (t,1})
Reducer:
output (t, {df(t,Dj),idf(t,Dj),cdf(t,Dj})
t , tf (t , D j ), df (t , D j ),| D j |, idf (t , D j ), cdf (t , D j )
Mapper:
input: ({u,t,d},{tf(t,Dj(u)),|Dj(u)|})
output ({u,t,|Dj(u)},{1})
Reducer:
output ({u,t},{udf(t,Dj(u))})
u, t, udf (t, D j (u))
HDFS
t , tf (t , D j ), cdf (t , D j )
Nj
t, N j
HDFS
Mapper:
input: ({t},{tf(t,Dj),cdf(t,Dj)})
output (t,Nj})
Reducer: identity
t , tf (t , D j ), cdf (t , D j ), N j
P(t|Dj)
Mapper:
input: ({t},{tf(t,Dj),|Dj|,cdf(t,Dj),Nj})
output (t,P(t|Dj)})
10
Reducer:
identity
t, P(t || Dj )
HDFS
© 2009 IBM Corporation
MapReduce Flow- cont.
HDFS
wuj (t ) tf (t , D j (u)) idf (t , D j ) udf (t , D j (u))
u, t, d , tf (t, d ),udf (t, Dj (u)),idf (t, Dj )
w
u, t , w
HDFS
u, t , w
∑w
u, t , tV w
HDFS
P (t || D j (u )) (1 )(
wuj (t )
tV
wuj (t )
) P (t | D j )
u, t , tV w, P(t || D j ), w,
P(t|Dj(u))
u, t, P(t || D j ), P(t || D j (u))
HDFS
~ u (t ) P(t | D (u)) log
w
j
j
P(t | D j (u))
~
w
u, t, P(t || D j ), P(t || D j (u))
~
u, t , w
P(t | D j )
HDFS
11
© 2009 IBM Corporation
Experimental Data- quality analysis
Open Directory Project (ODP):
– Categories are associated with manual labels
– Considered as “ground-truth” in this work
– Examples:
• ODP: Science/Technology/Electronics: Manual label: “Electronics”
• ODP: Society/Religion/and/Spirituality/Buddhism: Manual label: “Buddhism”
Data Collection:
– 100 different categories randomly selected from ODP
– 100 documents randomly selected per category
– A total collection size of about 10,000 Web pages
Evaluation:
– A match is considered if the suggested label is identical, an inflection, or a Wordnet’s
synonym to the manual label
12
© 2009 IBM Corporation
Results
ODP Category Label
Top-5 KL important terms
Bowling
bowl, bowler, lane, bowl center, league
Buddhism
Buddhist, Buddhism, Buddha, Zen, dharma
Ice Hockey
hockey, nhl, hockey league, coach, head coach
Electronics
voltage, high voltage, circuit, laser, power supply
1
Mutual Information
Xi^2
tf-udf-idf
KL
0.9
0.8
In how many
cases, we got at
least one correct
term from the topK terms.
Match@K
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1
2
3
4
5
6
7
8
9
10
K
KL outperforms all other approaches for features selection
13
© 2009 IBM Corporation
Experimental Data- scalability analysis
Blogger.com
http://grannyalong.blogspot.com/
Blog entry
Data Collection:
– We crawled 973,518 blog posts from March 2007 until January 2009
– Total collection size of 5.45GB, with ~120,000 users
Cluster setting:
14
– 4-node commodity machines cluster (each machine with 4GB RAM, 60GB HD, 4 cores)
© 2009 IBM Corporation
– Hadoop 0.20.1
Number of User Profiles
Document ratio
User profile ratio
Time ratio
13
11
Ratio
9
Document Ratio
User profile Ratio
Time Ratio
7
5
3
1
#2
#3
#4
Dataset
Runtime ratio is correlated with the number of user profiles ratio
15
© 2009 IBM Corporation
Data Size
#user: chose 18,000 users
between March-Apr 2007
Running time [min.]
120
90
60
30
40
80
120
160
200
Number of documents [Thousands]
Runtime linearly increases with the increasing of data size
16
© 2009 IBM Corporation
Related Work
Content-based user profiling:
– Profile contains a taxonomic hierarchy for the long-term model. The Taxonomy
is taken from the ODP. Short-term activities update the hierarchy.
– Adaptive user profile: Use words that appear in the Web pages and combine
them using tfidf, looking on some window and giving different weights
according to the recency of the browsing
KL approach to user tasks:
– Filter new documents that are not related to the user based on his profile.
– Annotate a url with the most descriptive query term for a given user, based on
his profile.
User targeting in large-scale systems:
– Behavioral targeting system over Hadoop MapReduce.
– Large scale CF technique for movies recommendations for users.
– Incremental algorithm to construct user profile based on monitoring and user
feedback which trades-off between complexity and quality of the profile.
17
© 2009 IBM Corporation
Conclusions & Future Work
We proposed a scalable user profiling solution Implemented
on top of Hadoop MapReduce
We showed quality and scalability results
We plan to extend the user model into semantic model
Extend the user profile to include structured data
18
© 2009 IBM Corporation
Thank You !
© 2009 IBM Corporation