ICS 278: Data Mining Lecture 1: Introduction to Data Mining

Download Report

Transcript ICS 278: Data Mining Lecture 1: Introduction to Data Mining

Project Presentations
•
Thursday next week, each student will make a 4-minute presentation on
their project in class (with 1 or 2 minutes for questions)
•
Email me your Powerpoint or PDF slides, with your name (e.g.,
joesmith.ppt), before 10am next Thursday
•
Suggested content:
–
–
–
–
–
•
Definition of the task/goal
Description of data sets
Description of algorithms
Experimental results and conclusions
Be visual where possible! (i.e., use figures, graphs, etc)
Final project report will be due by 12 noon Tuesday of finals week – more
details to come later
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
ICS 278: Data Mining
Lecture 18: Analysis of Web User Data
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Outline
• Basic concepts in Web mining
• Analyzing user navigation or clickstream data
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
– Automated recommender systems
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Further Reading
• Modeling the Internet and the Web, P. Baldi, P. Frasconi, P. Smyth,
Wiley, 2003.
• ACM Transactions on Internet Technology (ACM TOIT) – can be
accessed via ACM Digital Library (available from UCI IP addresses).
• Annual WebKDD workshops at the ACM SIGKDD conferences.
• Papers on Web page prediction
– Selective Markov models for predicting Web page accesses, M.
Deshpande, G. Karypis, ACM Transactions on Internet Technology, May
2004.
– Model-based clustering and visualization of navigation patterns on a
Web site, Cadez et al, Journal of Data Mining and Knowledge Discovery,
2003.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Introduction to Web Mining
• Useful to study human digital behavior, e.g. search engine data can
be used for
– Exploration e.g. # of queries per session?
– Modeling e.g. any time of day dependence?
– Prediction e.g. which pages are relevant?
• Applications
– Understand social implications of Web usage
– Design of better tools for information access
– E-commerce applications
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Advertising Applications
•
Revenue of many internet companies is driven by advertising
•
Key problem:
– Given user data:
• Pages browsed
• Keywords used in search
• Demographics
– Determine the most relevant ads (in real-time)
– Currently about 50% of keyword searches can not be matched effectively to any
ads
– (other aspects include bidding/pricing of ads)
•
Another major problem: “click fraud”
•
Understanding the user is key to these types of applications
– Algorithms that can automatically detect when online advertisements are being
manipulated (this is a major problem for Internet advertising)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Sources for Web Mining
•
Web content
– Text and HTML content on Web pages, e.g., categorization of content
•
Web connectivity
– Hyperlink/directed-graph structure of the Web
– e.g., using PageRank to infer importance of Web pages
– e.g., using links to improve accuracy in classification of Web pages
•
Web user data
– Data on how users interact with the Web
• Navigation data, aka “clickstream” data
• Search query data (keywords for users)
• Online transaction data (e.g., purchases at an ecommerce store)
– Volume of data?
• Large portals (e.g., Yahoo!, MSN) report 100’s of millions of users per month
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Flowchart
of a typical
Web Mining
process
(From Cooley,
ACM TOIT,
2003)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
How our Web navigation is recorded…
•
Web logs
– Record activity between client browser and a specific Web server
– Easily available
– Can be augmented with cookies (provide notion of “state”)
•
Search engine records
– Text in queries, which pages were viewed, which snippets were clicked on, etc
•
Client-side browsing records
– Automatically recorded by client-side software
– Harder to obtain, but much more accurate than server-side logs
•
Other sources
– Web site registration, purchases, email, etc
– ISP recording of Web browsing
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Web Server Log Files
• Server Transfer Log:
– transactions between a browser and server are logged
–
–
–
–
IP address, the time of the request
Method of the request (GET, HEAD, POST…)
Status code, a response from the server
Size in byte of the transaction
• Referrer Log:
– where the request originated
• Agent Log:
– browser software making the request (spider)
• Error Log:
– request resulted in errors (404)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
W3C Extended Log File Format
Field
Date
Description
Date
Time
Client IP address
date
time
c-ip
User Name
Servis Name
Server Name
Server IP Address
Server Port
Method
URI Stem
URI Query
Protocol Status
Win32 Status
Bytes Sent
Bytes Received
Time Taken
Protocol Version
Host
cs-username
s-sitename
s-computername
s-ip
s-port
cs-method
cs-uri-stem
cs-uri-query
sc-status
sc-win32-status
sc-bytes
cs-bytes
time-taken
cs-version
cs-host
The date that the activity occurred
The time that the activity occurred
The IP address of the client that accessed your server
The name of the autheticated user who access your server, anonymous
users are represented by The Internet service and instance number that was accessed by a client
The name of the server on which the log entry was generated
The IP address of the server that accessed your server
The port number the client is connected to
The action the client was trying to perform
The resource accessed
The query, if any, the client was trying to perform
The status of the action, in HTTP or FTP terms
The status of the action, in terms used by Microsoft Windows
The number of bytes sent by the server
The number of bytes received by the server
The duration of time, in milliseconds, that the action consumed
The protocol (HTTP, FTP) version used by the client
Display the content of the host header
User Agent
Cookie
Referrer
cs(User Agent)
cs(Cookie)
cs(Referrer)
The browser used on the client
The content of the cookie sent or received, if any
The previous site visited by the user. This site provided a link to the current
site
s = server actions
c = client actions
cs = client-to-server actions
sc = server-to-client actions
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Example of Web Log entries
Apache web log:
205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET
/~sophal/whole5.gif HTTP/1.0" 200 9609
"http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0
(compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET
/~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0
(Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)“
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET
/~tahir/indextop.html HTTP/1.1" 200 3510
"http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible;
MSIE 6.0; Windows NT 5.1)“
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js
HTTP/1.1" 200 14261
"http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1)“
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Routine Server Log Analysis
• Typical statistics/histograms that are computed
–
–
–
–
–
–
Most and least visited web pages
Entry and exit pages
Referrals from other sites or search engines
What are the searched keywords
How many clicks/page views a page received
Error reports, like broken links
• Many software products that produce standard reports of this type
of data
– Very useful for Web site managers
– But does not provide “deep” insights
– e.g., are there clusters/groups of users that use the site in different
ways?
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Visualization of Web Log Data over Time
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Descriptive Summary Statistics
• Histograms, scatter plots, time-series plots
– Very important!
– Helps to understand the big picture
– Provides “marginal” context for any model-building
• models aggregate behavior, not individuals
– Challenging for Web log data
• Examples
– Session lengths (e.g., power laws)
– Click rates as a function of time, content
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
L = number of page requests in a
single session
from visitors to www.ics.uci.edu
over 1 week in November 2002
(robots removed)
0
Empirical Frequency of L
10
-1
10
-2
10
-3
10
-4
10
-5
10
-6
10
0
10
1
10
2
10
Session Length L
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Best fit of simple power law model
0
Log P(L) = -a Log L + b
10
or
-1
P(L) = b L-a
Probability of L
10
-2
10
-3
10
-4
10
-5
10
-6
10
0
10
1
10
2
10
3
10
Session Length L
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
0
10
POISSON
-1
10
Probability of L
GEOMETRIC
-2
10
INVERSE GAUSSIAN
-3
10
POWER-LAW
-4
10
-5
10
-6
10
0
10
1
10
2
10
Session Length L
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Web data measurement issues
• Important to understand how data is collected
• Web data is collected automatically via software logging tools
– Advantage:
• No manual supervision required
– Disadvantage:
• Data can be skewed (e.g. due to the presence of robot traffic)
• Important to identify robots (also known as crawlers, spiders)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
A time-series plot of ICS Website data
Number of page requests per hour as a function of time from page
requests in the www.ics.uci.edu Web server logs during the first week of
April 2002.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Example: Web Traffic from Commercial Site
(slide from Ronny Kohavi, Amazon)
Weekends
Sept-11
Note significant drop in
human traffic, not bot
traffic
Internal
Performance bot
Registration
at Search
Engine sites
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Robot / human identification
•
•
Removal of robot data is important preprocessing step before clickstream
analysis
Robot page-requests often identified using a variety of heuristics
– e.g. some robots self-identify themselves in the server logs
• All robots in principle should visit robots.txt on the Web Server
• Also, robots should identify themselves via the User Agent field in page requests
• But other robots actively try to disguise that they are robots
– Patterns of access
•
•
•
•
•
Robots explore the entire website in breadth first fashion
Humans access web-pages more typically in depth-first fashion
Timing between page-requests can be more regular for robots (e.g., every 5 seconds)
Duration of sessions, number of page-requests per day: often unusually large (e.g.,
1000’s of page-requests per day) for robots.
Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002)
provide a detailed description of using classification techniques to learn how
to detect robots
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Fractions of Robot Data
(from Tan and Kumar, 2002)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
From Tan and
Kumar, 2002
Overall
accuracies
of around 90%
were obtained
using decision
tree classifiers,
trained on sessions
of lengths 1, 2, 3, 4,..
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Page requests, caching, and proxy servers
• In theory, requester browser requests a page from a Web server
and the request is processed
• In practice, there are
–
–
–
–
Data Mining Lectures
Other users
Browser caching
Dynamic addressing in local network
Proxy Server caching
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be
masked at various stages between the user’s local computer and the Web
server.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Page requests, caching, and proxy servers
• Web server logs are therefore not so ideal in terms of a complete
and faithful representation of individual page views
• There are heuristics to try to infer the true actions of the user: – Path completion (Cooley et al. 1999)
• e.g. If known B -> F and not C -> F, then session ABCF can be interpreted
as ABCBF
• Anderson et al. 2001 for more heuristics
• In general case, it is hard to know what exactly the user viewed
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Identifying individual users from Web server logs
• Useful to associate specific page requests to specific individual users
• IP address most frequently used
• Disadvantages
– One IP address can belong to several users
– Dynamic allocation of IP address
• Better to use cookies (or login ID if available)
– Information in the cookie can be accessed by the Web server to identify
an individual user over time
– Actions by the same user during different sessions can be linked
together
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Identifying individual users from Web server logs
• Commercial websites use cookies extensively
• 97 % of users have cookies enabled permanently on their browsers
(source: Amazon.com, 2003)
• However …
– There are privacy issues – need implicit user cooperation
– Cookies can be deleted / disabled
• Another option is to enforce user registration
– High reliability
– But can discourage potential visitors
– Large portals (such as Yahoo!) have high fraction of logged-in users
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Sessionizing
• Time oriented (robust)
– e.g., by gaps between requests
• not more than 20 minutes between successive requests
• this is a heuristic – but is a standard “rule” used in practice
• Navigation oriented (good for short sessions and when timestamps
unreliable)
– Referrer is previous page in session, or
– Referrer is undefined but request within 10 secs, or
– Link from previous to current page in web site
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Client-side data
• Advantages of collecting data at the client side:
– Direct recording of page requests (eliminates ‘masking’ due to caching)
– Recording of all browser-related actions by a user (including visits to
multiple websites)
– More-reliable identification of individual users (e.g. by login ID for
multiple users on a single computer)
•
Preferred mode of data collection for studies of navigation behavior on the
Web
•
Companies like ComScore and Nielsen use client-side software to track
home computer users
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Client-side data
• Statistics like ‘Time per session’ and ‘Page-view duration’ are more
reliable in client-side data
• Some limitations
– Still some statistics like ‘Page-view duration’ cannot be totally reliable
e.g. user might go to fetch coffee
– Need explicit user cooperation
– Typically recorded on home computers – may not reflect a complete
picture of Web browsing behavior
• Web surfing data can be collected at intermediate points like ISPs,
proxy servers
– Can be used to create user profile and target advertise
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Modeling Clickrate Data
• Data
– 200k Alexa users, client-side, over 24 hours
– ignore URLs requested
– goal is to build a time-series model that characterizes user click rates
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
300
NUMBER OF CLICKS
250
200
150
100
50
0
0
5
10
15
20
HOURS
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
120
NUMBER OF CLICKS
100
80
60
40
20
0
-20
-40
-60
5
5.5
6
6.5
7
HOURS
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
300
NUMBER OF CLICKS
250
200
150
100
50
0
0
5
10
15
20
HOURS
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov-Poisson Model
(Scott and Smyth, 2003)
• Doubly stochastic process
– Locally constant Poisson rate
– indexed by M Markov states
• Fit a model with M = 3 states
• absence of a Web session
• Web session with slow click rate: 1 minute rate
• Web session with rapid click rate: 10 second rate
– Used hierarchical Bayes on individuals
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Early studies from 1995 to 1997
•
Earliest studies on client-side data are Catledge and Pitkow (1995) and
Tauscher and Greenberg (1997)
•
In both studies, data was collected by logging Web browser commands
•
Population consisted of faculty, staff and students
•
Both studies found
– clicking on the hypertext anchors as the most common action
– using ‘back button’ was the second common action
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Early studies from 1995 to 1997
• high probability of page revisitation (~0.58-0.61)
• Lower bound because the page requests prior to the start of the studies are
not accounted for
• Humans are creatures of habit?
• Content of the pages changed over time?
• strong recency (page that is revisited is usually the page that was
visited in the recent past) effect
• Correlates with the ‘back button’ usage
• Similar repetitive actions are found in telephone number dialing etc
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
The Cockburn and McKenzie study from 2002
•
Earlier studies were outdates
•
Web has changed dramatically in the past few years
•
Cockburn and McKenzie (2002) provides a more up-to-date analysis
•
Study found revisitation rates higher than past 94 and 95 studies (~0.81)
– Analyzed the daily history.dat files produced by the Netscape browser for 17
users for about 4 months
– Population studied consisted of faculty, staff and graduate students
– Time-window is three times that of past studies
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
The Cockburn and McKenzie study from 2002
• Revisitation rate less biased than the previous studies?
• Human behavior changed from an exploratory mode to a utilitarian
mode?
– The more pages user visits, the more are the requests for new pages
– The most frequently requested page for each user can account for a
relatively large fraction of his/her page requests
• Useful to see the scatter plot of the distinct number of pages
requested per user versus the total pages requested
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
The Cockburn and McKenzie study from 2002
The number of distinct pages visited versus page vocabulary size of each
of the 17 users in the Cockburn and McKenzie (2002) study (log-log plot)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
The Cockburn and McKenzie study from 2002
Bar chart of the ratio of the number of page requests for the most frequent
page divided by the total number of page requests, for 17 users in the
Cockburn McKenzie (2002) study
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
• General approach is to use a finite-state Markov chain
– Each state can be a specific Web page or a category of Web pages
– If only interested in the order of visits (and not in time), each new
request can be modeled as a transition of states
• Issues
– Self-transition
– Time-independence
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
•
For simplicity, consider order-dependent, time-independent finite-state
Markov chain with M states
•
Let s be a sequence of observed states of length L. e.g. s =
ABBCAABBCCBBAA with three states A, B and C. st is state at position t
(1<=t<=L). In general,
L
P( s )  P( s1 ) P( st | st 1 ,..., s1 )
t 2
•
first-order Markov assumption
•
This provides a simple generative model toL produce sequential data
P( st | st 1 ,..., s1 )  P( st | st 1 )
P ( s )  P ( s1 ) P ( st | st 1 )
t 2
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
•
•
If we denote Tij = P(st = j|st-1 = i), we can define a M x M transition matrix
Properties
– Strong first-order assumption
– Simple way to capture sequential dependence
•
If each page is a state and if W pages, O(W2), W can be of the order 105 to
106 for a CS dept. of a university
•
To alleviate, we can cluster W pages into M clusters, each assigned a state
in the Markov model
•
Clustering can be done manually, based on directory structure on the Web
server, or automatic clustering using clustering techniques
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
•
•
•
Tij = P(st = j|st-1 = i) represents the probability that an individual user’s
next request will be from category j, given they were in category i
We can add E, an end-state to the model
E.g. for three categories with end state:  P(1 | 1) P(2 | 1) P(3 | 1) P( E | 1) 


P
(
1
|
2
)
P
(
2
|
2
)
P
(
3
|
2
)
P
(
E
|
2
)


T 
P(1 | 3) P(2 | 3) P(3 | 3) P( E | 3) 


 0
0
0
1 

•
E denotes the end of a sequence, and start of a new sequence
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
•
•
First-order Markov model assumes that the next state is based only on the
current state
Limitations
– Doesn’t consider ‘long-term memory’
•
We can try to capture more memory with kth-order Markov chain
P(st | st 1 ,.., s1 )  P(st | st 1 ,.., st k )
•
Limitations
– Inordinate amount of training data O(Mk+1)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Parameter estimation for Markov model transitions
•
Smoothed parameter estimates of transition probabilities are
T
ij

nij  qij
ni  
•
If nij = 0 for some transition (i, j) then instead of having a parameter
estimate of 0 (ML), we will have qij /( ni   ) allowing prior knowledge to
be incorporated
•
If nij > 0, we get a smooth combination of the data-driven information (nij)
and the prior
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Parameter estimation for Markov models
• One simple way to set prior parameter is
– Consider alpha as the effective sample size
– Partition the states into two sets, set 1 containing all states directly
linked to state i and the remaining in set 2
– Assign uniform probability r/K to all states in set 2 (all set 2 states are
equally likely)
– The remaining (1-r) can be either uniformly assigned among set 1
elements or weighted by some measure
– Prior probabilities in and out of E can be set based on our prior
knowledge of how likely we think a user is to exit the site from a
particular state
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Predicting page requests with Markov models
•
Deshpande and Karypis (2004) propose schemes to prune kth-order Markov
state space
– Provide systematic but modest improvements
•
Another way is to use empirical smoothing techniques that combine
different models from order 1 to order k (Chen and Goodman 1996)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Mixtures of Markov Chains
•
Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order
Markov chain:
P( st | st 1 ,..., s1 )  P( st | st 1 )
with a mixture of first-order Markov chains
K
P( st | st 1 ,..., s1 )   P( st | st 1 , c  k )P(c  k )
k 1
where c is a discrete-value hidden variable taking K values Sk P(c = k) = 1
and
P(st | st-1, c = k) is the transition matrix for the kth mixture component
•
One interpretation of this is user behavior consists of K different navigation
behaviors described by the K Markov chains
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Modeling Web Page Requests with Markov chain
mixtures
•
MSNBC Web logs
– Order of 2 million individual users per day
– different session lengths per individual
– difficult visualization and clustering problem
•
WebCanvas
– uses mixtures of Markov chains to cluster individuals based on their observed
sequences
– software tool: EM mixture modeling + visualization
•
Next few slides are based on material in:
I. Cadez et al, Model-based clustering and visualization of navigation patterns on a
Web site, Journal of Data Mining and Knowledge Discovery, 2003.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
From Web logs to sequences
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
User 1
User 2
User 3
User 4
User 5
…
Data Mining Lectures
2
3
7
1
5
3
3
7
5
1
…
2
3
7
1
1
2
1
7
1
5
3
1
7
1
3
1
7
5
3
1
1
1
3
1
3
3
7
1
7
5
1
1
1
1
1
1
Analysis of Web User Data
3
3
Padhraic Smyth, UC Irvine
Clusters of Finite State Machines
Cluster 1
A
B
Cluster 2
A
B
D
E
D
E
A
B
Cluster 3
Data Mining Lectures
D
E
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Learning Problem
•
Assumptions
– data is being generated by K different groups
– Each group is described by a stochastic finite state machine (SFSM)
• aka, a Markov model with an end-state
•
Given
– A set of sequences from different users of different lengths
•
Learn
– A “mixture” of K different stochastic finite state machines
•
Solution
– EM is very easy: fractional counts of transitions
– efficient and accurate, scales as O(KN)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Sketch of EM Algorithm for Mixtures of Markov Chains
•
•
Model = mixture of K Markov chains (K fixed)
Input data = N categorical sequences (can be variable length)
•
Initialization:
•
E-step:
•
–
–
–
Generate random initial transition matrices for each of the K groups
Compute p( sequence i | model k), for i=1,..N, k = 1,…K
Use Bayes rule to compute p(model k | sequence i)
•
Yields membership probabilities for each sequence
M-step:
–
–
Estimate the transition probabilities for each cluster, given membership probabilities
Consists of “fractional counting of transitions”,
•
e.g., sequence with probability 0.8 in cluster k, results in transition counts weighted by 0.8
•
Repeat E and M steps until convergence
•
Complexity of each iteration is O(K N L) where L is the average sequence length
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Prediction with Markov mixtures
P(st+1 | s[1,t] ) =
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Prediction with Markov mixtures
P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] )
= S P(st+1 | k , s[1,t] ) P(k | s[1,t] )
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Prediction with Markov mixtures
P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] )
= S P(st+1 | k , s[1,t] ) P(k | s[1,t] )
= S P(st+1 | k , st ) P(k | s[1,t] )
Prediction of
kth component
Membership, based
on sequence history
=> Predictions are a convex combination of
K different component transition matrices,
with weights based on sequence history
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Experimental Methodology
•
Model Training:
– fit 2 types of models
• mixtures of histograms (multinomials)
• mixtures of finite state machines
– Train on a full day’s worth of MSNBC Web data
•
Model Evaluation:
– “one-step-ahead” prediction on unseen test data
• Test sequences from a different day of Web logs
– compute log P(user’s next click | previous clicks, model)
– Using equation on the previous slide
–
logP score:
• Rewarded if next click was given high P by the model
• Punished if next click was given low P by the model
– negative average of logP scores ~ “predictive entropy”
• Has a natural interpretation
• Lower bounded by 0 bits (perfect prediction)
• Upper bounded by log M bits, where M is the number of categories
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Predictive Entropy Out-of-Sample
4
Negative log-likelihood [bits/token]
3.8
3.6
3.4
3.2
3
2.8
2.6
Mixtures of Multinomials
2.4
Mixtures of SFSMs
2.2
2
Data Mining Lectures
20
40
60
80
100
120
140
Number of
mixture
[K]
Analysis
of Web Usercomponents
Data
160
180
200
Padhraic Smyth, UC Irvine
log count(R)
RUN LENGTH DISTRIBUTIONS WITHIN MARKOV CLUSTERS
10
5
5
0
0
log count(R)
0
10
5
10
15
20
5
0
0
10
10
15
20
5
0
0
log count(R)
10
log count(R)
10
5
10
15
20
10
15
20
Cluster 2: Category 7
-2
5
10
Cluster 3: Category 1
-2
5
10
-2
Cluster 4: Category 1
0
0
10
5
10
Cluster 5: Category 12
5
0
0
0
10
15
R = Run Length
Data Mining Lectures
20
0
1
2
3
R = Run Length
Analysis of Web User Data
4
5
5
20
30
40
10
15
20
Cluster 4: Category 3
0
5
5
0
10
5
0
10
Cluster 3: Category 13
10
0
Cluster 5: Category 9
40
0
0
40
30
2
0
30
0
4
5
20
20
0
5
10
10
Cluster 2: Category 8
4
5
0
0
2
10
Cluster 4: Category 2
5
0
5
0
0
10
Cluster 3: Category 12
Cluster 1: Category 8
4
2
0
5
5
Cluster 1: Category 14
10
Cluster 2: Category 1
0
log count(R)
10
Cluster 1: Category 13
5
10
Cluster 5: Category 6
0
1
2
3
4
R = Run Length
Padhraic Smyth, UC Irvine
Timing Results
2500
2000
N=150,000
Time [sec]
1500
N=110,000
1000
N = 70,000
500
0
-500
0
20
40
60
80
100
120
140
160
180
200
Number of mixture components [K]
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
WebCanvas
•
Software tool for Web log visualization
–
–
–
–
•
uses Markov mixtures to cluster data for display
extensively used within Microsoft
also applied to non-Web data (e.g., how users navigate in Word, etc)
Algorithm and visualization are in latest release of SQLServer (the
“sequence mining” tool)
Model-based visualization
– random sample of actual sequences
– interactive tiled windows displayed for visualization
– more effective than
• planar graphs
• traffic-flow movie in Microsoft Site Server v3.0
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Insights from WebCanvas for MSNBC data
• From msnbc.com site adminstrators….
– significant heterogeneity of behavior
– relatively focused activity of many users
• typically only 1 or 2 categories of pages
– many individuals not entering via main page
– detected problems with the weather page
– missing transitions (e.g., tech <=> business)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Possible Extensions
• Adding time-dependence
– adding time-between clicks, time of day effects
• Uncategorized Web pages
– coupling page content with sequence models
• Modeling “switching” behaviors
– allowing users to switch between behaviors
– Could use a topic-style model: users = mixtures of behaviors
– e.g., Girolami M & Kaban A., Sequential Activity Profiling: Latent Dirichlet
Allocation of Markov Chains, Journal of Data Mining and Knowledge
Discovery, Vol 10, 175-196.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Related Work
• Mixtures of Markov chains
– special case: Poulsen (1990)
– general case: Ridgeway (1997), Smyth (1997)
• Clustering of Web page sequences
– non-probabilistic approaches (Fu et al, 1999)
• Markov models for prediction
– Anderson et al (IJCAI, 2001):
• mixtures of Markov outperform other sequential models for page-request
prediction
– Sen and Hansen 2003
– Zukerman et al. 1999
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Analysis of Search Engine Query Logs
# of Queries
Source
Time Period
Lau & Horvitz
4690
Excite
Sep 1997
Silverstein et al
1 Billion
AltaVista
6 weeks in Aug & Sep
1998
Spink et al
(series of studies)
1 Million for each time
period
Excite
Sep 1997
Dec 1999
May 2001
Xie & O’Hallaron
110,000
Vivisimo
35 days Jan & Feb 2001
1.9 Million
Excite
8 hrs in a day, Dec 1999
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Main Results
• Average number of terms in a query ranges from a low of 2.2 to a
high of 2.6
• The most common number of terms in a query was 2
• The majority of users don’t refine their query
– The number of users who viewed only a single page increased from 29%
(1997) to 51% (2001) (Excite)
– 85% of users viewed only first page of search results (AltaVista)
• 45% (2001) of queries are about Commerce, Travel, Economy, People
(was 20% in 1997)
• All four studies produced a generally consistent set of findings about
user behavior in a search engine context
– most users view relatively few pages per query
– most users don’t use advanced search features
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Xie and O Halloran Study (2002)
- Query Length
Distributions (bars)
- Poisson Model
(dots & lines)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Power-law Characteristics of Common Queries
Power-Law
in log-log space
•
Frequency f(r) of Queries with Rank r
– 110000 queries from Vivisimo
– 1.9 Million queries from Excite
•
There are strong regularities in terms of patterns of behavior in how we
search the Web
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Outline
• Basic concepts in Web log data analysis
• Predictive modeling of Web navigation behavior
– Markov modeling methods
• Analyzing search engine data
• Ecommerce aspects of Web log mining
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Ecommerce Data
•
Page request Web logs combined with
•
Main focus here is to increase revenue
•
This is a very rich source of problems for data mining
•
Additional Reading
–
–
–
–
–
–
–
–
–
–
–
–
–
Data Mining Lectures
Purchase (market-basket) information
User address information (if they make a purchase)
Demographics information (can be purchased)
Emails to/from the customer
Search query information
Product ratings information
Data mining widely used by online commerce companies like Amazon
What products should we advertise to this person?
Can we do dynamic pricing?
If a person buys X should we also suggest Y?
Who are our best customers?
etc
Kohavi, Mason, Parekh, Zheng, Lessons and challenges from mining ecommerce retail data,
Machine Learning Journal, 2004
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Recommender Systems
• “Vote data” = n x m sparse binary matrix
– m columns = “products”, e.g., books for purchase or movies for viewing
– n rows = users
– Interpretation:
•
•
•
•
Implicit Ratings: v(i,j) = user i’s rating of product j (e.g. on a scale of 1 to 5)
Explicit Purchases: v(i,j) = 1 if user i purchased product j
entry = 0 if no purchase or rating
We will refer to non-zero entries generically as “votes”
• Automated recommender systems
– Given votes by a user on a subset of items, recommend other items
that the user may be interested in
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Examples of Recommender Systems
•
Books and movies purchasing:
– Amazon.com, Cdnow.com, etc
•
Movie recommendations:
– Netflix
– MovieLens (movielens.umn.edu)
•
Digital library recommendations
– CiteSeer (Popescul et al, 2001):
• m = 177,000 documents
• N = 33,000 users
• Each user accessed 18 documents on average (0.01% of the database -> very sparse!)
•
Web page recommendations
– E.g., Alexa toolbar (www.alexa.com)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Treatment of Zero’s in Ratings Data
• Ratings data (e.g., rating movies on Netflix)
– User voluntarily assigns scores to movies viewed
• e.g., 5 for best and 1 for worst
• Interpretation of a score of 0
– The user has not seen this movie
– The user has seen the movie but has not rated it
– A 0 score is not necessarily the same as “missing” but often treated that
way
• In much research work on recommender systems, ratings data is
converted into binary votes
– e.g., ratings from >3 mapped to a “vote” of 1, <3 “mapped” to 0
– Not ideal since now the 0 score can represent low ratings or “unrated”
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Different recommender algorithms
• Nearest-neighbor/collaborative filtering algorithms
• Cluster-based algorithms
• Probabilistic model-based algorithms
• Details discussed in class….
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Additional Aspects of Recommender Systems
•
Dimension reduction
•
Content-based recommender systems
– Techniques like SVD can be used to perform predictions in a lower-dimensional
space
– In many cases there is additional information about the items
• E.g., reviews and synposes of movies
– A different approach to recommender algorithms is to make predictions on new
items based on properties of rated items
– This approach can be combined with collaborative/user data
• Particularly useful (e.g.) when many items have no ratings
• e.g., Decoste et al (IUI, 2005) report that 85% of movies have no ratings in a Yahoo!
recommender system
•
Additional data on users, e.g., demographic data
•
Sequential aspect of recommendations
– May be useful, e.g., in clustering users
– e.g., novel Markov Decision Process approach by Shani et al, JMLR, 2005
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
General Issues
• The “cold start” problem
– How to make accurate recommendations for new users
• Sparsity of data
• Computational issues
– For real-time applications need to be able to make recommendations
very quickly
– Significant engineering involved, many tricks
• Algorithm evaluation
– Not always clear what the evaluation metric (score) should be
– See next slide
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Evaluation of Recommender Systems
•
Research papers use historical data to evaluate and compare different
recommender algorithms
– predictions typically made on items whose ratings are known
– e.g., leave-1-out method,
• each positive vote for each user in a test data set is in turn “left out”
• predictions on left-out items made given rated items
– e.g., predict-given-k method
• Make predictions on rated items given k=1, k=5, k=20 ratings
– See Herlocker et al (2004) for detailed discussion of evaluation
•
Approach 1: measure quality of rankings
• Score = weighted sum of true votes in top 10 predicted items
•
Approach 2: directly measure prediction accuracy
• Mean-absolute-error (MAE) between predictions and actual votes
• Typical MAE on large data sets ~ 20% (normalized)
– E.g., on a 5-point scale predictions are within 1 point on average
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Evaluation of Recommender Systems
•
Cautionary note:
– It is not clear that prediction on historical data is a meaningful way to evaluate
recommender algorithms, especially for purchasing
– Consider:
• User purchases products A, B, C
• Algorithm ranks C highly given A and B, gets a good score
• However, what if the user would have purchased C anyway, i.e., making this
recommendation would have had no impact? (or possibly a negative impact!)
– What we would really like to do is reward recommender algorithms that lead the
user to purchase products that they would not have purchased without the
recommendation
• This can’t be done based on historical data alone
– Requires direct “live” experiments (which is often how companies evaluate
recommender algorithms)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Additional Reading on Recommender Systems
•
GroupLens research group, http://www.grouplens.org/
– Papers, demo systems, data sets
•
Breese et al, Empirical analysis of predictive algorithms for collaboration
filtering, 1998
•
Schafer et al, Recommender systems in e-commerce, 1999
•
Sarwar et al, Analysis of recommendation algorithms for e-commerce, 2000
•
Herlocker et al, Evaluating collaborative filtering recommender systems,
ACM TOIS, 2004
•
Shani et al, An MDP-based recommender system, 2005
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine