Transcript slides
Discovery of Aggregate Usage
Profiles for Web Personalization
Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, Jim Wiltshire
School of Computer Science, Telecommunications, and Information Systems
DePaul University
Web Personalization
The Problem
dynamically serve customized content (pages, products, etc.) to users based on
their profiles, preferences, or expected interests
Current Approaches
rule-based filtering
usually relies on static profile for users in part obtained through explicit registration
collaborative filtering
usually requires explicit ratings from users on similar types of objects
content-based filtering: learn/store personal profiles locally or on server-side
based on content similarity of user profile to pages or product descriptions
Limitations of Current Technologies
user input may be subjective and prone to bias
explicit (and non-binary) user ratings may not be available
profiles may be static and can become outdated quickly
collaborative filtering: problems with scalability due to sparse data
content-based filtering: may miss other semantic relationships among objects
Usage-Based Web Personalization
Basic Idea
find aggregate user profiles by automatically discovering user access patterns
through Web usage mining (offline process)
data sources for mining include server logs, other click-stream data (e.g., productoriented user events), and site structure
match a user’s active session against the discovered profiles to provide dynamic
content (online process)
Advantages / Goals
profiles are based on objective information (how users actually use the site)
no explicit user ratings or interaction with users (to enter a profile, etc.)
helps preserve user privacy, by making effective use of anonymous data
usage data captures relationships missed by content-based approaches
can help enhance the effectiveness of collaborative or content-based filtering
techniques
Automatic Web Personalization:
Offline Process
Data Preparation
Usage Mining
Transaction Clustering
Pageview Clustering
Site Files
Data Cleaning
Session Identification
Pageview Identification
Transaction Identification
Support Filtering
Server Logs & Other
Click-Stream Data
Usage
Profiles
User Transaction File
Association-Rule
Discovery
Domain Knowledge
Frequent Itemsets
Automatic Web Personalization:
Online Process
Recommendation Engine
Input from the
batch process
Usage Profiles
Recommendations
Active Session
Web Server
Client Browser
Data Preparation Tasks
Preprocess and filter logs and other usage data
remove redundant references and create pageviews
domain knowledge to assign types to pageviews
handle references to scripts creating dynamic pages
map logs against site topology
Identify user sessions and transactions
heuristics based on IP, referrer, agent fields, and session time-outs used to identify
unique user sessions (may need to infer missing references)
intra-session transactions can be obtained based on a model of user behavior
(involves classifying references as “content” or “navigational” for each user)
weights are assigned to each pageview based on static pageview types as well as
some measure of user interest (e.g., duration of pageview)
Support filtering - remove very low/high support pageviews
Aggregate Usage Profiles
Characteristics of Aggregate Profiles
the goal is to effectively capture common usage patterns from potentially
anonymous click-stream data
profiles are represented as weighted collections of pageviews
weights represent the significance of pageviews within each profile
profiles are overlapping in order to capture common interests among different
groups/types of users
multiple profiles may contribute to the recommendation set for a given user
Example Profiles from the ACR (Assoc. for Consumer Research) Site:
1.00
0.67
0.67
0.67
0.67
0.67
Call for Papers
ACR News Special Topics
CFP: Journal of Psychology and Marketing I
CFP: Journal of Psychology and Marketing II
CFP: Journal of Consumer Psychology II
CFP: Journal of Consumer Psychology I
1.00
1.00
0.36
0.30
0.25
0.24
CFP: Winter 2000 SCP Conference
Call for Papers
CFP: ACR 1999 Asia-Pacific Conference
ACR 1999 Annual Conference
ACR News Updates
Conference Update
Methodologies for the Discovery
of Aggregate Profiles
Discovery of Profiles Based on Transaction Clusters
cluster user transactions - features are significant pageviews identified in the
preprocessing stage
derive usage profiles (set of pageview-weight pairs) based on characteristics of
each transaction cluster
Cluster Pageviews
directly compute overlapping clusters of pageviews based on co-occurrence
patterns across transactions
features are user transactions, so dimensionality poses a problem for traditional
clustering algorithms
we use Association-Rule Hypergraph Partitioning with an overlap factor
Profile Aggregation Based on
Clustering Transactions (PACT)
Input
set of relevant pageviews in preprocessed log
set of user transactions
T {t1 , t2 ,
each transaction is a pageview vector
P { p1 , p2 ,
, pn }
, tm }
t w( p1 , t ), w( p2 , t ),..., w( pn , t )
Transaction Clusters
each cluster contains a set of transaction vectors
for each cluster compute centroid as cluster representative
c u1c , u2c , , unc
Aggregate Usage Profiles
a set of pageview-weight pairs: for transaction cluster C, select each pageview pi
such that
uic
(in the cluster centroid) is greater than a pre-specified threshold
Hypergraph-Based Clustering
Construct a hypergraph from sets of related items
Each hyperedge represents a frequent itemset
Weight of each hyperedge can be based on the
characteristics of frequent itemsets or association rules
Recursively partition hypergraph so that each partition contains only highly
connected data items
Given a hypergraph G=(V,E) we find a k-way partitioning such that the weight of
the hyperedges that are cut is minimized
The fitness of partitions measured in terms of the ratio of weights of cut edges to
the weights of uncut edges within the partitions
The connectivity measures the percentage of edges within the partition with which
the vertex is associated -- used for filtering partitions
Vertices from partial edges can be added back to clusters based on a user-specified
overlap factor
Profiles Based on Hypergraph
Clusters of Pageviews
Input
input for clustering is the set of large itemsets from association rule module
each itemset is a hyperedge (weights are a function of the interest of the itemset)
Interest ( I )
support( I )
iI support(i)
Aggregate Profiles (Pageview Clusters)
hMETIS used as the underlying hypergraph partitioning algorithm
clustering program directly outputs a set of overlapping pageview clusters
the weight associated with pageview p in a cluster C is based on the connectivity
value of p in hypergraph partition:
{e | e C , p e}
conn( p, C )
{e | e C}
Recommendations Based on
Usage Profiles
Match current user’s activity against the discovered usage profiles
a sliding window over the active session to capture the current user’s “short-term”
history depth
usage profiles and the active session are treated as vectors
matching score is computed based on the similarity between vectors (e.g,
normalized cosine similarity)
Recommendations
each pageview is assigned a recommendation score based on
matching score to aggregate profiles
“information value” of the pageview based on domain knowledge (e.g., link
distance of the candidate recommendation to the active session)
recommendations are contributed by multiple matching aggregate profiles
Experimental Set-up
The Data Sets
Log data from the Association for Consumer Research Web site
18342 transactions, 62 pageview URLs (after filtering)
Data set divided into training and evaluation sets
Evaluation Methodology
Portion of each transaction (based on a specified window size) in evaluation set was
used to generate a recommendation set (based on a given recommendation
threshold)
For each transaction, the overall coverage of the recommendation set was divided by
the number of recommendations to produce an accuracy measure
The overall score was computed (for each threshold) by taking the average scores
over all transactions in the evaluation set
Average Visit Percentage
AVP measures the likelihood that a user who visits any page in a
Given profile, also visits other pages in that profile
Evaluation: Measuring
Recommendation Accuracy
Recommendation
accuracy results,
using a active
session window of
size 3.
Evaluation: Impact of Filtering
Comparison of PACT and
Hypergraph (using window
size 2) for filtered and
unfiltered data sets.
Filtering involved the
removal of top-level
navigational pages from
the data set, leaving only
deeper content-oriented
pages.
Conclusions
Usage-Based Web Personalization
results suggest that effective personalization can be achieved even with anonymous
and short-term click-stream data
possibly useful in the early stages of personalization when more detailed profiles
are not available for individual users
could be used effectively in conjunction with other methods based on contentbased or collaborative filtering
Which Method is Best?
PACT may be most appropriate when the goal is to provide a more general
personalization solution involving a variety of objects across the whole site
Hypergraph may be most appropriate when the goal is to provide a highly focused
set of recommendations for specific portions of the site
In practice, usage-based methods need to be combined with other techniques to
provide an integrated solution