PPT - Donald Bren School of Information and Computer Sciences

Download Report

Transcript PPT - Donald Bren School of Information and Computer Sciences

CS 277: Data Mining
Analysis of Web User Data
Padhraic Smyth
Department of Computer Science
University of California, Irvine
Data Sources for Web Mining
•
Web content
– Text and HTML content on Web pages, e.g., categorization of content
•
Web connectivity
– Hyperlink/directed-graph structure of the Web
– e.g., using PageRank to infer importance of Web pages
– e.g., using links to improve accuracy in classification of Web pages
•
Web user data (focus of this set of slides)
– Data on how users interact with the Web
• Navigation data, aka “clickstream” data
• Search query data (keywords for users)
• Online transaction data (e.g., purchases at an ecommerce store)
– Volume of data?
• Large portals (e.g., Yahoo!, MSN) report 100’s of millions of users per month
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Introduction to Web Mining
Useful to study user behavior on the Web
e.g. search engine data can be used for
– Exploration: e.g. # of queries per session?
– Modeling: e.g. time of day dependence?
– Prediction: e.g. which pages are relevant?
Applications
– Understand social implications of Web usage
– Design of better tools for information access
– E-commerce applications
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
“Computational Advertising”
•
Revenue of many internet companies is driven by advertising
•
Key problem:
– Given user data:
• Pages browsed
• Keywords used in search
• Demographics
– Determine the most relevant ads (in real-time)
– About 50% of keyword searches can not be matched effectively to any ads
– (other aspects include bidding/pricing of ads)
•
Another major problem: “click fraud”
•
New research area of “computational advertising”
– Algorithms that can automatically detect when online advertisements are being
manipulated (this is a major problem for Internet advertising)
– See link to Stanford class by Andrei Broder on class Web site
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Networks of Instant Messagers
Leskovec and Horvitz, 2007
• Network Data
– 240 IM users over 1 month
– 1 billion conversations per day
– 1.3 billion edges in the graph
Linking Demograpics with IM Usage
from Leskovec and Horvitz, 2007
Login and Logout Durations for Instant Messenger Sessions
from Leskovec and Horvitz, 2007
Regularities such as power-laws are
abundant in Web data
Highly non-Gaussian
Aggregate behavior – very predictable
Individual behavior – much less so
Query Data
(source: Dan Russell, Google)
Research Question:
Predict the age and
gender of an individual
given their query history
More difficult:
Predict how many
people are using one
account, and their ages
and genders
Eye-Tracking: The Golden Triangle for Search
from Hotchkiss, Alston, Edwards, 2005; EnquiroResearch
Research Question:
If a user clicks on the 5th
item on the list should we
treat items 1 to 4 as
“negatives” for learning?
Number of Terms per Query
(from Xie and O Halloran, 2002)
- Observed query Length Distributions (bars)
- Poisson Model
(lines)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Power-law Characteristics of Common Queries
(from Xie and O Halloran, 2002)
Approximately
Power-Law
•
Frequency f(r) of Queries with Rank r
– 110000 queries from Vivisimo
– 1.9 Million queries from Excite
•
Presence of many rare queries makes prediction/ad-matching difficult
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Sample of 17 Million Unique Queries from eBay
Log
Query
Frequency
From N. Parikh and N. Sundaresan
Inferring Semantic Query Relations from Collective User Behavior
Proceedings of CIKM, 2008
Log Rank of Query in Sample
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
From: M. Richardson, Microsoft Research
Learning about the World through Long-Term Query Logs
ACM Transactions on the Web, October 2008
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
CASE STUDY: DETECTING FLU OUTBREAKS FROM
SEARCH ENGINE DATA
J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, L. Brilliant
Detecting influenza epidemics using search engine query data
Nature, Februrary 2009
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Detecting Flu Outbreaks
• Problem:
– Influenza epidemics cause 250k to 500k deaths per year (worldwide)
– New strains can emerge quickly and be very dangerous
– A key factor in reducing risk is quick detection of an outbreak
• How are flu outbreaks currently detected?
– Center for Disease Control (CDC) gathers counts of “influenza-like
illness” (ILI) physician visits
– Surveillance data published nationally and regionally, weekly basis
– Problem: 1 to 2 week reporting lag
• Other approaches
– Monitor over-the-counter flu medication sales
– Monitor calls to health advice lines
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Using Search Queries
• Idea:
– Discover influenza related search queries and use these to predict ILI
counts
– Motivation: should be much faster than 1-2 week lag of CDC data
• Data:
– Look at all search queries in Google from 2003 to 2008
•
•
•
•
Several hundred billion individual searches in the United States
Keep track of only the 50 million most common queries
Keep a weekly count for each query
Also keep counts of each query by geographic region
(requires use of geo-location from IP addresses: >95% accurate)
So counts for 50 million queries x 170 weeks x 9 regions
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Building a Predictive Model
• Target variable to be predicted:
– For each week, for each region
I(t) = percentage physician visits that are ILI (as compiled by CDC)
• Input variable:
Q(t) = sum of top n highest correlated queries
/ total number of queries that week
• “Model learning”:
log( I(t) / [1 – I(t)] ) = a log ( Q(t)/ [1 – Q(t) ] ) + noise
(This is really just calibrating log-odds Q(t) to predict log-odds I(t))
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Addition of terms that generally correlate
with flu season, but not with specific outbreaks,
e.g., “high school basketball”
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Evaluating the Model
• Model fit to weekly data between 2003 and 2007
– 128 weeks
• Predictions made on 42 weeks of unseen test data in 2007-2008
– 9 regions
– Correlations per region of predicted ILI with actual ILI
• Mean = 0.97
• Min = 0.92
• Max = 0.99
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Key point in these
graphs is that the CDC
data was lagging
Google predictions by
1 to 2 weeks
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
MEASURING WEB USER DATA
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
How our Web navigation is recorded…
•
Web logs
– Record activity between client browser and a specific Web server
– Easily available
– Can be augmented with cookies (provide notion of “state”)
•
Search engine records
– Text in queries, which pages were viewed, which snippets were clicked on, etc
•
Client-side browsing records
– Automatically recorded by client-side software
– Harder to obtain, but much more accurate than server-side logs
•
Other sources
– Web site registration, purchases, email, etc
– ISP recording of Web browsing
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Web Server Log Files
• Server Transfer Log:
– transactions between a browser and server are logged
–
–
–
–
IP address, the time of the request
Method of the request (GET, HEAD, POST…)
Status code, a response from the server
Size in byte of the transaction
• Referrer Log:
– where the request originated
• Agent Log:
– browser software making the request (spider)
• Error Log:
– request resulted in errors (404)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
W3C Extended Log File Format
Field
Date
Description
Date
Time
Client IP address
date
time
c-ip
User Name
Servis Name
Server Name
Server IP Address
Server Port
Method
URI Stem
URI Query
Protocol Status
Win32 Status
Bytes Sent
Bytes Received
Time Taken
Protocol Version
Host
cs-username
s-sitename
s-computername
s-ip
s-port
cs-method
cs-uri-stem
cs-uri-query
sc-status
sc-win32-status
sc-bytes
cs-bytes
time-taken
cs-version
cs-host
The date that the activity occurred
The time that the activity occurred
The IP address of the client that accessed your server
The name of the autheticated user who access your server, anonymous
users are represented by The Internet service and instance number that was accessed by a client
The name of the server on which the log entry was generated
The IP address of the server that accessed your server
The port number the client is connected to
The action the client was trying to perform
The resource accessed
The query, if any, the client was trying to perform
The status of the action, in HTTP or FTP terms
The status of the action, in terms used by Microsoft Windows
The number of bytes sent by the server
The number of bytes received by the server
The duration of time, in milliseconds, that the action consumed
The protocol (HTTP, FTP) version used by the client
Display the content of the host header
User Agent
Cookie
Referrer
cs(User Agent)
cs(Cookie)
cs(Referrer)
The browser used on the client
The content of the cookie sent or received, if any
The previous site visited by the user. This site provided a link to the current
site
s = server actions
c = client actions
cs = client-to-server actions
sc = server-to-client actions
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Example of Web Log entries
Apache web log:
205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET
/~sophal/whole5.gif HTTP/1.0" 200 9609
"http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0
(compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET
/~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0
(Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)“
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET
/~tahir/indextop.html HTTP/1.1" 200 3510
"http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible;
MSIE 6.0; Windows NT 5.1)“
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js
HTTP/1.1" 200 14261
"http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1)“
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Routine Server Log Analysis
• Typical statistics/histograms that are computed
–
–
–
–
–
–
Most and least visited web pages
Entry and exit pages
Referrals from other sites or search engines
What are the searched keywords
How many clicks/page views a page received
Error reports, like broken links
• Many software products that produce standard reports of this type
of data
– Very useful for Web site managers
– But does not provide “deep” insights
– e.g., are there clusters/groups of users that use the site in different
ways?
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Visualization of Web Log Data over Time
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Descriptive Summary Statistics
• Histograms, scatter plots, time-series plots
– Very important!
– Helps to understand the big picture
– Provides “marginal” context for any model-building
• models aggregate behavior, not individuals
– Challenging for Web log data
• Examples
– Session lengths (e.g., power laws)
– Click rates as a function of time, content
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Web data measurement issues
• Important to understand how data is collected
• Web data is collected automatically via software logging tools
– Advantage:
• No manual supervision required
– Disadvantage:
• Data can be skewed (e.g. due to the presence of robot traffic)
• Important to identify robots (also known as crawlers, spiders)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
L = number of page requests in a
single session
from visitors to www.ics.uci.edu
over 1 week in November 2002
(robots removed)
0
Empirical Frequency of L
10
-1
10
-2
10
-3
10
-4
10
-5
10
-6
10
0
10
1
10
2
10
Session Length L
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
A time-series plot of ICS Website data
Number of page requests per hour as a function of time from page
requests in the www.ics.uci.edu Web server logs during the first week of
April 2002.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Example: Web Traffic from Commercial Site
(slide from Ronny Kohavi)
Weekends
Sept-11
Note significant drop in
human traffic, not bot
traffic
Internal
Performance bot
Registration
at Search
Engine sites
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Robot / human identification
•
•
Removal of robot data is important preprocessing step before clickstream
analysis
Robot page-requests often identified using a variety of heuristics
– e.g. some robots self-identify themselves in the server logs
• All robots in principle should visit robots.txt on the Web Server
• Also, robots should identify themselves via the User Agent field in page requests
• But other robots actively try to disguise that they are robots
– Patterns of access
•
•
•
•
•
Robots explore the entire website in breadth first fashion
Humans access web-pages more typically in depth-first fashion
Timing between page-requests can be more regular for robots (e.g., every 5 seconds)
Duration of sessions, number of page-requests per day: often unusually large (e.g.,
1000’s of page-requests per day) for robots.
Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002)
provide a detailed description of using classification techniques to learn how
to detect robots
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Fractions of Robot Data
(from Tan and Kumar, 2002)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
From Tan and
Kumar, 2002
Overall
accuracies
of around 90%
were obtained
using decision
tree classifiers,
trained on sessions
of lengths 1, 2, 3, 4,..
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Page requests, caching, and proxy servers
• In theory, requester browser requests a page from a Web server
and the request is processed
• In practice, there are
–
–
–
–
Data Mining Lectures
Other users
Browser caching
Dynamic addressing in local network
Proxy Server caching
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Page requests, caching, and proxy servers
A graphical summary of how page requests from an individual user can be
masked at various stages between the user’s local computer and the Web
server.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Page requests, caching, and proxy servers
• Web server logs are therefore not so ideal in terms of a complete
and faithful representation of individual page views
• There are heuristics to try to infer the true actions of the user: – Path completion (Cooley et al. 1999)
• e.g. If known B -> F and not C -> F, then session ABCF can be interpreted
as ABCBF
• Anderson et al. 2001 for more heuristics
• In general case, it is hard to know what exactly the user viewed
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Identifying individual users from Web server logs
• Useful to associate specific page requests to specific individual users
• IP address most frequently used
• Disadvantages
– One IP address can belong to several users
– Dynamic allocation of IP address
• Better to use cookies (or login ID if available)
– Information in the cookie can be accessed by the Web server to identify
an individual user over time
– Actions by the same user during different sessions can be linked
together
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Identifying individual users from Web server logs
• Commercial websites use cookies extensively
• 97 % of users have cookies enabled permanently on their browsers
(source: Amazon.com, 2003)
• However …
– There are privacy issues – need implicit user cooperation
– Cookies can be deleted / disabled
• Another option is to enforce user registration
– High reliability
– But can discourage potential visitors
– Large portals (such as Yahoo!) have high fraction of logged-in users
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Sessionizing
• Time oriented (robust)
– e.g., by gaps between requests
• not more than 20 minutes between successive requests
• this is a heuristic – but is a standard “rule” used in practice
• Navigation oriented (good for short sessions and when timestamps
unreliable)
– Referrer is previous page in session, or
– Referrer is undefined but request within 10 secs, or
– Link from previous to current page in web site
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Client-side data
• Advantages of collecting data at the client side:
– Direct recording of page requests (eliminates ‘masking’ due to caching)
– Recording of all browser-related actions by a user (including visits to
multiple websites)
– More-reliable identification of individual users (e.g. by login ID for
multiple users on a single computer)
•
Preferred mode of data collection for studies of navigation behavior on the
Web
•
Companies like ComScore and Nielsen use client-side software to track
home computer users
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Client-side data
• Statistics like ‘Time per session’ and ‘Page-view duration’ are more
reliable in client-side data
• Some limitations
– Still some statistics like ‘Page-view duration’ cannot be totally reliable
e.g. user might go to fetch coffee
– Need explicit user cooperation
– Typically recorded on home computers – may not reflect a complete
picture of Web browsing behavior
• Web surfing data can be collected at intermediate points like ISPs,
proxy servers
– Can be used to create user profile and target advertise
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Example of client-side data from Alexa, each “dot” is a URL request in a browser
Time (over a 24-hour period)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
CASE STUDY:
CLUSTERS OF MARKOV CHAINS FOR MODELING
USER NAVIGATION ON A WEBSITE
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov Models for Modeling User Navigation
• General approach is to use a finite-state Markov chain
– Each state can be a specific Web page or a category of Web pages
– If only interested in the order of visits (and not in time), each new
request can be modeled as a transition of states
• Issues
– Self-transition
– Time-independence
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Modeling Web Page Requests with Markov chain
mixtures
•
MSNBC Web logs (circa 2000)
– Order of 2 million individual users per day
– different session lengths per individual
– difficult visualization and clustering problem
•
WebCanvas
– uses mixtures of Markov chains to cluster individuals based on their observed
sequences
– software tool: EM mixture modeling + visualization
•
Next few slides are based on material in:
I. Cadez et al, Model-based clustering and visualization of navigation patterns on a
Web site, Journal of Data Mining and Knowledge Discovery, 2003.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
From Web logs to sequences
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
User 1
User 2
User 3
User 4
User 5
…
Data Mining Lectures
2
3
7
1
5
3
3
7
5
1
…
2
3
7
1
1
2
1
7
1
5
3
1
7
1
3
1
7
5
3
1
1
1
3
1
3
3
7
1
7
5
1
1
1
1
1
1
Analysis of Web User Data
3
3
Padhraic Smyth, UC Irvine
Graphical Model for Markov Chains
A
ci
Pages
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine:
Multiple Users…One Common Markov Chain
A
ci
Pages
Users
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine:
Multiple Users…One Chain per User
A
ci
Pages
Users
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine:
One Chain per Cluster of Users
Cadez, Meek, Heckerman, Smyth, 2003
A
Clusterj
ci
Pages
Users
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine:
Likelihood of a Sequence in a Markov Chain Model
P(sequence) = P0(first state) x Pij ( aij )nij
Number of transitions from i to j
Transition probability from i to j
Maximum likelihood estimation:
- given the nij’s, find the aij’s that maximize P(sequence)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Probability Estimation in Markov Chains
p(state j | state i) = aij = nij / ni
Number of times in state i
Number of transitions from i to j
With smoothing (or priors)
aij = (nij + aij ) / (ni + Sm aim )
where aij / Sm aim is the prior probability of transitioning from i to j
and where Sm aim is the strength (equivalent sample size) of the prior
Could set aij = aj (same priors for transitioning out of each state) or
aij = a (same priors for transitioning in and out of each state)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Probability Estimation with Multiple Sequences
T sequences
Assume sequences are independent
nijt = number of times going from i to j in sequence t, t = 1,… T
p(state j | state i) = aij = St nijt / St nit
Number of transitions from i to j in sequence t
Number of times in state i in sequence t
With smoothing:
aij = (St nijt + aij ) / (St nit + Sm aim )
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Clusters of Markov Chains
• Assume our data is generated by K different Markov chains
– Each chain has its own parameters A, P0
– This is a a mixture of Markov chains
• T sequences – but we don’t know which sequence came from which
cluster
• Chicken-and-egg problem
– If we knew which cluster each sequence belonged to, we could group
sequences by cluster, and estimation of cluster parameters is easy
– If we knew the parameters for each Markov chain, we could figure out
(by Bayes rule) the most likely cluster for each sequence
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Clusters of Probabilistic State Machines
Cluster 1
A
B
Cluster 2
C
E
A
B
C
E
Motivation:
approximate the heterogeneity of Web surfing behavior
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine:
EM Algorithm for Markov Clusters
•
EM = expectation-maximization
•
E-step
– For each sequence, and given current parameter estimates for each cluster,
estimate p(cluster k | sequence), k = 1, … K
•
M-step
– Given p(cluster k | sequence), estimate Ak and Pk0 for each cluster
Algorithm:
•
•
•
Start with an initial random guess at parameters or p(cluster k | sequence)
Iteration = pair of EM steps
Halt iterations when parameters or p(cluster k | sequence) are not changing
Guaranteed to converge to a local maximum of the likelihood (under general conditions)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
E Step of EM Algorithm
T sequences
P(cluster k | sequence t)
proportional to P(sequence t | cluster k) P(cluster k)
Number of times going
from state i to state j in sequence t
P(sequence t | cluster k)
= pkt =
Pk0(first state) x Pij ( akij )nijt
Current parameter estimates for cluster k
Compute these “membership” probabilities for each sequence and each cluster
k. Yields a T x K matrix of membership probabilities
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
M Step of EM Algorithm
For each transition probability parameter in each cluster k=1,..K
akij = ( St nijt pkt ) / ( St nit pkt )
Transition probability i->j
for cluster k
Transitions from i to j in sequence t
are “fractionally weighted”
by pkt, probability that sequence t
came from cluster k
Number of times in state i in sequence t
are “fractionally weighted”
by pkt, probability that sequence t
came from cluster k
Compute Pk0(first state) in a similar fashion
Also estimate P(cluster k) = (1/T)
Data Mining Lectures
St
pkt
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Time Complexity
• E-Step
– T sequences, average length L
– K clusters
– Compute T x K matrix
• Each entry takes O(L) time to compute
– O(TKL) overall
• M-step
– For each of M2 transition probabilities
• Sum over each sequence T
– O(T M2 )
– other parameters take less time
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Experimental Methodology
•
Model Training:
– fit 2 types of models
• mixtures of histograms (multinomials)
• mixtures of finite state machines
– Train on a full day’s worth of MSNBC Web data
•
Model Evaluation:
– “one-step-ahead” prediction on unseen test data
• Test sequences from a different day of Web logs
– compute log P(user’s next click | previous clicks, model)
– Using equation on the previous slide
–
logP score:
• Rewarded if next click was given high P by the model
• Punished if next click was given low P by the model
– negative average of logP scores ~ “predictive entropy”
• Has a natural interpretation
• Lower bounded by 0 bits (perfect prediction)
• Upper bounded by log M bits, where M is the number of categories
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Predictive Entropy Out-of-Sample
4
Negative log-likelihood [bits/token]
3.8
3.6
3.4
3.2
3
2.8
2.6
Mixtures of Multinomials
2.4
Mixtures of SFSMs
2.2
2
Data Mining Lectures
20
40
60
80
100
120
140
160
180
Number
of mixture components [K]Padhraic Smyth, UC Irvine
Analysis of Web User Data
200
log count(R)
RUN LENGTH DISTRIBUTIONS WITHIN MARKOV CLUSTERS
10
5
5
0
0
log count(R)
0
10
5
10
15
20
5
0
0
10
10
15
20
5
0
0
log count(R)
10
log count(R)
10
5
10
15
20
10
15
20
Cluster 2: Category 7
-2
5
10
Cluster 3: Category 1
-2
5
10
-2
Cluster 4: Category 1
0
0
10
5
10
Cluster 5: Category 12
5
0
0
0
10
15
R = Run Length
Data Mining Lectures
20
0
1
2
3
R = Run Length
Analysis of Web User Data
4
5
5
20
30
40
10
15
20
Cluster 4: Category 3
0
5
5
0
10
5
0
10
Cluster 3: Category 13
10
0
Cluster 5: Category 9
40
0
0
40
30
2
0
30
0
4
5
20
20
0
5
10
10
Cluster 2: Category 8
4
5
0
0
2
10
Cluster 4: Category 2
5
0
5
0
0
10
Cluster 3: Category 12
Cluster 1: Category 8
4
2
0
5
5
Cluster 1: Category 14
10
Cluster 2: Category 1
0
log count(R)
10
Cluster 1: Category 13
5
10
Cluster 5: Category 6
0
1
2
3
R = Run Length
Padhraic Smyth, UC Irvine
4
Timing Results
2500
2000
N=150,000
Time [sec]
1500
N=110,000
1000
N = 70,000
500
0
-500
0
20
40
60
80
100
120
140
160
180
200
Number of mixture components [K]
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
WebCanvas
•
Software tool for Web log visualization
–
–
–
–
•
uses Markov mixtures to cluster data for display
extensively used within Microsoft
also applied to non-Web data (e.g., how users navigate in Word, etc)
Algorithm and visualization are in SQLServer (the “sequence mining” tool)
Model-based visualization
– random sample of actual sequences
– interactive tiled windows displayed for visualization
– more effective than
• planar graphs
• traffic-flow movie in Microsoft Site Server v3.0
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
HICSS Keynote Talk, Jan 2008
© Padhraic Smyth, UC Irvine:
Insights from WebCanvas for MSNBC data
• From msnbc.com site adminstrators….
– significant heterogeneity of behavior
– relatively focused activity of many users
• typically only 1 or 2 categories of pages
– many individuals not entering via main page
– detected problems with the weather page
– missing transitions (e.g., tech <=> business)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Possible Extensions
• Adding time-dependence
– adding time-between clicks, time of day effects
• Uncategorized Web pages
– coupling page content with sequence models
• Modeling “switching” behaviors
– allowing users to switch between behaviors
– Could use a topic-style model: users = mixtures of behaviors
– e.g., Girolami M & Kaban A., Sequential Activity Profiling: Latent Dirichlet
Allocation of Markov Chains, Journal of Data Mining and Knowledge
Discovery, Vol 10, 175-196.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Related Work
• Mixtures of Markov chains
– special case: Poulsen (1990)
– general case: Ridgeway (1997), Smyth (1997)
• Clustering of Web page sequences
– non-probabilistic approaches (Fu et al, 1999)
• Markov models for prediction
– Anderson et al (IJCAI, 2001):
• mixtures of Markov outperform other sequential models for page-request
prediction
– Sen and Hansen 2003
– Zukerman et al. 1999
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Further Reading
• Modeling the Internet and the Web, P. Baldi, P. Frasconi, P. Smyth,
Wiley, 2003.
• ACM Transactions on Internet Technology (ACM TOIT)
• Annual WebKDD workshops at the ACM SIGKDD conferences.
• Papers on user-navigation models
– Selective Markov models for predicting Web page accesses, M.
Deshpande, G. Karypis, ACM Transactions on Internet Technology, May
2004.
– Model-based clustering and visualization of navigation patterns on a
Web site, Cadez et al, Journal of Data Mining and Knowledge Discovery,
2003.
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
BACKUP SLIDES
(CAN BE IGNORED)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Prediction with Markov mixtures
P(st+1 | s[1,t] ) =
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Prediction with Markov mixtures
P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] )
= S P(st+1 | k , s[1,t] ) P(k | s[1,t] )
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Prediction with Markov mixtures
P(st+1 | s[1,t] ) = S P(st+1 , k | s[1,t] )
= S P(st+1 | k , s[1,t] ) P(k | s[1,t] )
= S P(st+1 | k , st ) P(k | s[1,t] )
Prediction of
kth component
Membership, based
on sequence history
=> Predictions are a convex combination of
K different component transition matrices,
with weights based on sequence history
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Analysis of Search Engine Query Logs
# of Queries
Source
Time Period
Lau & Horvitz
4690
Excite
Sep 1997
Silverstein et al
1 Billion
AltaVista
6 weeks in Aug & Sep
1998
Spink et al
(series of studies)
1 Million for each time
period
Excite
Sep 1997
Dec 1999
May 2001
Xie & O’Hallaron
110,000
Vivisimo
35 days Jan & Feb 2001
1.9 Million
Excite
8 hrs in a day, Dec 1999
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Main Results
• Average number of terms in a query ranges from a low of 2.2 to a
high of 2.6
• The most common number of terms in a query was 2
• The majority of users don’t refine their query
– The number of users who viewed only a single page increased from 29%
(1997) to 51% (2001) (Excite)
– 85% of users viewed only first page of search results (AltaVista)
• 45% (2001) of queries are about Commerce, Travel, Economy, People
(was 20% in 1997)
• All four studies produced a generally consistent set of findings about
user behavior in a search engine context
– most users view relatively few pages per query
– most users don’t use advanced search features
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
•
For simplicity, consider order-dependent, time-independent finite-state
Markov chain with M states
•
Let s be a sequence of observed states of length L. e.g. s =
ABBCAABBCCBBAA with three states A, B and C. st is state at position t
(1<=t<=L). In general,
L
P( s )  P( s1 ) P( st | st 1 ,..., s1 )
t 2
•
first-order Markov assumption
•
This provides a simple generative model toL produce sequential data
P( st | st 1 ,..., s1 )  P( st | st 1 )
P ( s )  P ( s1 ) P ( st | st 1 )
t 2
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Markov models for page prediction
•
•
First-order Markov model assumes that the next state is based only on the
current state
Limitations
– Doesn’t consider ‘long-term memory’
•
We can try to capture more memory with kth-order Markov chain
P(st | st 1 ,.., s1 )  P(st | st 1 ,.., st k )
•
Limitations
– Inordinate amount of training data O(Mk+1)
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine
Mixtures of Markov Chains
•
Cadez et al. (2003) and Sen and Hansen (2003) replace the first-order
Markov chain:
P( st | st 1 ,..., s1 )  P( st | st 1 )
with a mixture of first-order Markov chains
K
P( st | st 1 ,..., s1 )   P( st | st 1 , c  k )P(c  k )
k 1
where c is a discrete-value hidden variable taking K values Sk P(c = k) = 1
and
P(st | st-1, c = k) is the transition matrix for the kth mixture component
•
One interpretation of this is user behavior consists of K different navigation
behaviors described by the K Markov chains
Data Mining Lectures
Analysis of Web User Data
Padhraic Smyth, UC Irvine