Transcript slides

Discovery of Aggregate Usage
Profiles for Web Personalization
Bamshad Mobasher, Honghua Dai, Tao Luo, Miki Nakagawa, Jim Wiltshire
School of Computer Science, Telecommunications, and Information Systems
DePaul University
Web Personalization

The Problem
 dynamically serve customized content (pages, products, etc.) to users based on
their profiles, preferences, or expected interests

Current Approaches
 rule-based filtering
 usually relies on static profile for users in part obtained through explicit registration
 collaborative filtering
 usually requires explicit ratings from users on similar types of objects
 content-based filtering: learn/store personal profiles locally or on server-side
 based on content similarity of user profile to pages or product descriptions

Limitations of Current Technologies
 user input may be subjective and prone to bias
 explicit (and non-binary) user ratings may not be available
 profiles may be static and can become outdated quickly
 collaborative filtering: problems with scalability due to sparse data
 content-based filtering: may miss other semantic relationships among objects
Usage-Based Web Personalization

Basic Idea
 find aggregate user profiles by automatically discovering user access patterns
through Web usage mining (offline process)
 data sources for mining include server logs, other click-stream data (e.g., productoriented user events), and site structure
 match a user’s active session against the discovered profiles to provide dynamic
content (online process)

Advantages / Goals
 profiles are based on objective information (how users actually use the site)
 no explicit user ratings or interaction with users (to enter a profile, etc.)
 helps preserve user privacy, by making effective use of anonymous data
 usage data captures relationships missed by content-based approaches
 can help enhance the effectiveness of collaborative or content-based filtering
techniques
Automatic Web Personalization:
Offline Process
Data Preparation
Usage Mining
Transaction Clustering
Pageview Clustering
Site Files
Data Cleaning
Session Identification
Pageview Identification
Transaction Identification
Support Filtering
Server Logs & Other
Click-Stream Data
Usage
Profiles
User Transaction File
Association-Rule
Discovery
Domain Knowledge
Frequent Itemsets
Automatic Web Personalization:
Online Process
Recommendation Engine
Input from the
batch process
Usage Profiles
Recommendations
Active Session
Web Server
Client Browser
Data Preparation Tasks

Preprocess and filter logs and other usage data
 remove redundant references and create pageviews
 domain knowledge to assign types to pageviews
 handle references to scripts creating dynamic pages
 map logs against site topology

Identify user sessions and transactions
 heuristics based on IP, referrer, agent fields, and session time-outs used to identify
unique user sessions (may need to infer missing references)
 intra-session transactions can be obtained based on a model of user behavior
(involves classifying references as “content” or “navigational” for each user)
 weights are assigned to each pageview based on static pageview types as well as
some measure of user interest (e.g., duration of pageview)

Support filtering - remove very low/high support pageviews
Aggregate Usage Profiles

Characteristics of Aggregate Profiles
 the goal is to effectively capture common usage patterns from potentially
anonymous click-stream data
 profiles are represented as weighted collections of pageviews
 weights represent the significance of pageviews within each profile
 profiles are overlapping in order to capture common interests among different
groups/types of users
 multiple profiles may contribute to the recommendation set for a given user

Example Profiles from the ACR (Assoc. for Consumer Research) Site:
1.00
0.67
0.67
0.67
0.67
0.67
Call for Papers
ACR News Special Topics
CFP: Journal of Psychology and Marketing I
CFP: Journal of Psychology and Marketing II
CFP: Journal of Consumer Psychology II
CFP: Journal of Consumer Psychology I
1.00
1.00
0.36
0.30
0.25
0.24
CFP: Winter 2000 SCP Conference
Call for Papers
CFP: ACR 1999 Asia-Pacific Conference
ACR 1999 Annual Conference
ACR News Updates
Conference Update
Methodologies for the Discovery
of Aggregate Profiles

Discovery of Profiles Based on Transaction Clusters
 cluster user transactions - features are significant pageviews identified in the
preprocessing stage
 derive usage profiles (set of pageview-weight pairs) based on characteristics of
each transaction cluster

Cluster Pageviews
 directly compute overlapping clusters of pageviews based on co-occurrence
patterns across transactions
 features are user transactions, so dimensionality poses a problem for traditional
clustering algorithms
 we use Association-Rule Hypergraph Partitioning with an overlap factor
Profile Aggregation Based on
Clustering Transactions (PACT)

Input
 set of relevant pageviews in preprocessed log
 set of user transactions
T  {t1 , t2 ,
 each transaction is a pageview vector

P  { p1 , p2 ,
, pn }
, tm }
t  w( p1 , t ), w( p2 , t ),..., w( pn , t )
Transaction Clusters
 each cluster contains a set of transaction vectors
 for each cluster compute centroid as cluster representative
c  u1c , u2c , , unc

Aggregate Usage Profiles
 a set of pageview-weight pairs: for transaction cluster C, select each pageview pi
such that
uic
(in the cluster centroid) is greater than a pre-specified threshold
Hypergraph-Based Clustering

Construct a hypergraph from sets of related items
 Each hyperedge represents a frequent itemset
 Weight of each hyperedge can be based on the
characteristics of frequent itemsets or association rules

Recursively partition hypergraph so that each partition contains only highly
connected data items
 Given a hypergraph G=(V,E) we find a k-way partitioning such that the weight of
the hyperedges that are cut is minimized
 The fitness of partitions measured in terms of the ratio of weights of cut edges to
the weights of uncut edges within the partitions
 The connectivity measures the percentage of edges within the partition with which
the vertex is associated -- used for filtering partitions
 Vertices from partial edges can be added back to clusters based on a user-specified
overlap factor
Profiles Based on Hypergraph
Clusters of Pageviews

Input
 input for clustering is the set of large itemsets from association rule module
 each itemset is a hyperedge (weights are a function of the interest of the itemset)
Interest ( I ) 

support( I )
iI support(i)
Aggregate Profiles (Pageview Clusters)
 hMETIS used as the underlying hypergraph partitioning algorithm
 clustering program directly outputs a set of overlapping pageview clusters
 the weight associated with pageview p in a cluster C is based on the connectivity
value of p in hypergraph partition:
{e | e  C , p  e}
conn( p, C ) 
{e | e  C}
Recommendations Based on
Usage Profiles

Match current user’s activity against the discovered usage profiles
 a sliding window over the active session to capture the current user’s “short-term”
history depth
 usage profiles and the active session are treated as vectors
 matching score is computed based on the similarity between vectors (e.g,
normalized cosine similarity)

Recommendations
 each pageview is assigned a recommendation score based on
matching score to aggregate profiles
“information value” of the pageview based on domain knowledge (e.g., link
distance of the candidate recommendation to the active session)
 recommendations are contributed by multiple matching aggregate profiles
Experimental Set-up

The Data Sets
 Log data from the Association for Consumer Research Web site
 18342 transactions, 62 pageview URLs (after filtering)
 Data set divided into training and evaluation sets

Evaluation Methodology
 Portion of each transaction (based on a specified window size) in evaluation set was
used to generate a recommendation set (based on a given recommendation
threshold)
 For each transaction, the overall coverage of the recommendation set was divided by
the number of recommendations to produce an accuracy measure
 The overall score was computed (for each threshold) by taking the average scores
over all transactions in the evaluation set
Average Visit Percentage
AVP measures the likelihood that a user who visits any page in a
Given profile, also visits other pages in that profile
Evaluation: Measuring
Recommendation Accuracy
Recommendation
accuracy results,
using a active
session window of
size 3.
Evaluation: Impact of Filtering
Comparison of PACT and
Hypergraph (using window
size 2) for filtered and
unfiltered data sets.
Filtering involved the
removal of top-level
navigational pages from
the data set, leaving only
deeper content-oriented
pages.
Conclusions

Usage-Based Web Personalization
 results suggest that effective personalization can be achieved even with anonymous
and short-term click-stream data
 possibly useful in the early stages of personalization when more detailed profiles
are not available for individual users
 could be used effectively in conjunction with other methods based on contentbased or collaborative filtering

Which Method is Best?
 PACT may be most appropriate when the goal is to provide a more general
personalization solution involving a variety of objects across the whole site
 Hypergraph may be most appropriate when the goal is to provide a highly focused
set of recommendations for specific portions of the site
 In practice, usage-based methods need to be combined with other techniques to
provide an integrated solution