Web usage mining

Web Usage Mining
(Clickstream Analysis)
Mark Levene
(Follow the links to learn more!)
Reminder - W3C Extended Log File Format
Field              Appears As        Description
Date               date              The date that the activity occurred
Time               time              The time that the activity occurred
Client IP Address  c-ip              The IP address of the client that accessed your server
User Name          cs-username       The name of the authenticated user who accessed your server; anonymous users are represented by a hyphen (-)
Service Name       s-sitename        The Internet service and instance number that was accessed by a client
Server Name        s-computername    The name of the server on which the log entry was generated
Server IP Address  s-ip              The IP address of the server on which the log entry was generated
Server Port        s-port            The port number the client is connected to
Method             cs-method         The action the client was trying to perform
URI Stem           cs-uri-stem       The resource accessed
URI Query          cs-uri-query      The query, if any, the client was trying to perform
Protocol Status    sc-status         The status of the action, in HTTP or FTP terms
Win32 Status       sc-win32-status   The status of the action, in terms used by Microsoft Windows
Bytes Sent         sc-bytes          The number of bytes sent by the server
Bytes Received     cs-bytes          The number of bytes received by the server
Time Taken         time-taken        The duration of time, in milliseconds, that the action consumed
Protocol Version   cs-version        The protocol (HTTP, FTP) version used by the client
Host               cs-host           The content of the host header
User Agent         cs(User-Agent)    The browser used on the client
Cookie             cs(Cookie)        The content of the cookie sent or received, if any
Referrer           cs(Referer)       The previous site visited by the user; this site provided a link to the current site

Prefixes:
s = server actions
c = client actions
cs = client-to-server actions
sc = server-to-client actions
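To show how these field names are used in practice, here is a minimal parsing sketch. It assumes the log file carries a `#Fields:` directive naming the columns and that values are separated by whitespace; the function name `parse_w3c_log` and the sample entries (addresses, paths, sizes) are made up for illustration.

```python
def parse_w3c_log(lines):
    """Yield one dict per log entry, keyed by the names in the #Fields: directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines start with '#'; only '#Fields:' matters for parsing here.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed entries rather than guess at their layout
        yield dict(zip(fields, values))


# Hypothetical sample log, not taken from a real server.
sample = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status sc-bytes time-taken",
    "2004-01-15 10:02:11 192.0.2.7 GET /index.html 200 5120 31",
    "2004-01-15 10:02:40 192.0.2.7 GET /papers/list.html 404 312 4",
]
entries = list(parse_w3c_log(sample))
```

Each entry is then a plain dictionary, so `entries[0]["cs-uri-stem"]` gives the requested resource and `entries[1]["sc-status"]` gives the protocol status.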
Analog – Web Log File Analyser
• Gives basic statistics such as
– number of hits
– average hits per time period
– what are the popular pages in your site
– who is visiting your site
– what keywords are users searching for to get to you
– what is being downloaded
• Log data does not disclose the visitor’s identity
• What do Analog’s reports mean?
• Report for www.dcs.bbk.ac.uk/~mark
Applications of Usage Mining
• Pre-fetching and caching web pages
• eCommerce and clickstream analysis
• Web site reorganisation
• Personalisation
• Recommendation of links and products
Identification of User
• By IP address
– Not so reliable as IP can be dynamic
– Different users may use same IP
• Through cookies
– Reliable but user may remove cookies
– Security and privacy issues
• Through login
– Users have to register
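One way to combine the three options is to fall back from the most to the least reliable identifier present in a log entry. The sketch below is an assumption, not a method stated on the slide: the preference order (login, then cookie, then IP address), the helper name `user_key`, and the convention that a hyphen marks an empty field are all illustrative choices.

```python
def user_key(entry):
    """Pick the most reliable identifier available in a parsed log entry.

    Assumed preference order: login name, then cookie, then client IP address.
    A hyphen is treated as "no value recorded".
    """
    username = entry.get("cs-username", "-")
    if username != "-":
        return ("login", username)
    cookie = entry.get("cs(Cookie)", "-")
    if cookie != "-":
        return ("cookie", cookie)
    return ("ip", entry.get("c-ip", "-"))


# Hypothetical entries for illustration.
registered = {"c-ip": "192.0.2.7", "cs-username": "alice", "cs(Cookie)": "sid=42"}
returning = {"c-ip": "192.0.2.7", "cs-username": "-", "cs(Cookie)": "sid=42"}
anonymous = {"c-ip": "192.0.2.7", "cs-username": "-", "cs(Cookie)": "-"}
keys = [user_key(e) for e in (registered, returning, anonymous)]
```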
Sessionising
• Time oriented (robust)
– By total duration of session
• not more than 30 minutes
– By page stay times (good for short sessions)
• not more than 10 minutes per page
• Navigation oriented (good for short sessions and
when timestamps unreliable)
– Referrer is previous page in session, or
– Referrer is undefined but request within 10 secs, or
– Link from previous to current page in web site
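The time-oriented heuristics can be sketched as follows. This is a minimal illustration that applies both limits together (30 minutes per session, 10 minutes per page), which is a design choice rather than something the slide prescribes; the function name `sessionise` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

MAX_SESSION = timedelta(minutes=30)    # limit on total session duration
MAX_PAGE_STAY = timedelta(minutes=10)  # limit on time spent on one page


def sessionise(requests):
    """Split one user's time-ordered (timestamp, page) requests into sessions."""
    sessions = []
    current = []
    for ts, page in requests:
        if current:
            too_long = ts - current[0][0] > MAX_SESSION   # measured from session start
            idle = ts - current[-1][0] > MAX_PAGE_STAY    # measured from previous request
            if too_long or idle:
                sessions.append(current)
                current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions


# Hypothetical requests from a single user.
day = datetime(2004, 1, 15)
requests = [
    (day.replace(hour=10, minute=0), "A1"),
    (day.replace(hour=10, minute=4), "A2"),
    (day.replace(hour=10, minute=9), "A3"),
    (day.replace(hour=10, minute=25), "A4"),  # 16-minute gap starts a new session
    (day.replace(hour=10, minute=28), "A6"),
]
session_pages = [[page for _, page in s] for s in sessionise(requests)]
```

Here the 16-minute gap before A4 exceeds the page-stay limit, so the requests split into the trails A1 > A2 > A3 and A4 > A6.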
Mining Navigation Patterns
• Each session induces a user trail through the
site
• A trail is a sequence of web pages followed by a
user during a session, ordered by time of
access.
• A pattern in this context is a frequent trail.
• Co-occurrence of web pages is important, e.g.
shopping-basket and checkout.
• Use a Markov chain model.
Trails inferred from Log data
(Each session results in a trail)
ID  Trail
1   A1 > A2 > A3
2   A1 > A2 > A3
3   A1 > A2 > A3 > A4
4   A5 > A2 > A4
5   A5 > A2 > A4 > A6
6   A5 > A2 > A3 > A6
Construct Markov Chain from Data
• Add a unique start state.
– the start state has a transition to all visited
web pages in the site.
• Add a unique final state.
– the last page in each trail has a transition to
the final state.
• The transition probabilities are obtained
from counting click-throughs.
• The Markov chain built is called absorbing
since we always end up in the final state.
The Markov Chain from the Data
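The chain for the six example trails can be computed directly from the click-through counts. The sketch below is one possible reading of the construction: it assumes that the start state's transition to a page is proportional to how often that page was visited, and it names the final state `F`; both choices, and the function name `build_chain`, are assumptions for illustration.

```python
from collections import Counter, defaultdict

# The six example trails inferred from the log data.
TRAILS = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]
FINAL = "F"  # assumed label for the unique final state


def build_chain(trails):
    """Return (initial, transition) probabilities estimated from the trails."""
    visits = Counter()            # how often each page was visited
    clicks = defaultdict(Counter) # click-through counts, including exits to FINAL
    for trail in trails:
        visits.update(trail)
        for src, dst in zip(trail, trail[1:] + [FINAL]):
            clicks[src][dst] += 1
    total = sum(visits.values())
    # Assumption: start -> page probability is the page's share of all visits.
    initial = {page: n / total for page, n in visits.items()}
    transition = {
        src: {dst: n / visits[src] for dst, n in dsts.items()}
        for src, dsts in clicks.items()
    }
    return initial, transition


initial, transition = build_chain(TRAILS)
```

For these trails, A2 is followed by A3 in 4 of its 6 visits and by A4 in the other 2, so `transition["A2"]` holds 4/6 for A3 and 2/6 for A4, while A3 leads to the final state half of the time.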
Support and Confidence
• Support s in [0,1) – accept only trails
whose initial probability is above s.
– Setting support to be above the average clickthrough is reasonable.
• Confidence c in [0,1) – accept only trails
whose probability is above c.
– The probability of a trail is obtained by
multiplying the transition probabilities of the
links in the trail.
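As a worked example of the trail probability, the sketch below multiplies transition probabilities counted by hand from the six example trails (for instance, A2 is followed by A3 in 4 of its 6 visits). The dictionary `P` and the function name `trail_probability` are illustrative, and exact fractions are used only to keep the arithmetic visible.

```python
from fractions import Fraction
from math import prod

# Transition probabilities counted by hand from the six example trails.
P = {
    ("A1", "A2"): Fraction(3, 3),
    ("A2", "A3"): Fraction(4, 6),
    ("A2", "A4"): Fraction(2, 6),
    ("A3", "A4"): Fraction(1, 4),
}


def trail_probability(trail):
    """Multiply the transition probabilities of consecutive links in the trail."""
    return prod(P[link] for link in zip(trail, trail[1:]))


p_short = trail_probability(["A1", "A2", "A3"])       # 1 * 2/3
p_long = trail_probability(["A1", "A2", "A3", "A4"])  # 1 * 2/3 * 1/4
```

With confidence c = 0.3, the trail A1 > A2 > A3 (probability 2/3) is accepted, while its extension to A4 (probability 1/6) is not.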
Mining Frequent Trails
• Find all trails whose initial probability is
higher than s, and whose trail probability is
above c.
• Use depth-first search on the Markov
chain to compute the trails.
• The average time needed to find the
frequent trails is proportional to the
number of web pages in the site.
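The search can be sketched as a depth-first traversal that keeps extending a trail while its probability stays above c. This is a simplified illustration under stated assumptions: initial probabilities are taken as each page's share of all visits, only page-to-page links are explored (the final state is left out), a page is not revisited within a trail, and only trails that cannot be extended further are reported. The function name `mine_frequent_trails` is hypothetical.

```python
from collections import Counter, defaultdict

TRAILS = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]


def mine_frequent_trails(trails, support, confidence):
    """Return {trail: probability} for maximal trails above both thresholds."""
    visits, clicks = Counter(), defaultdict(Counter)
    for trail in trails:
        visits.update(trail)
        for src, dst in zip(trail, trail[1:]):
            clicks[src][dst] += 1
    total = sum(visits.values())
    found = {}

    def dfs(path, prob):
        extended = False
        for dst, n in clicks.get(path[-1], {}).items():
            p = prob * n / visits[path[-1]]
            if p > confidence and dst not in path:  # assumption: no repeated pages
                extended = True
                dfs(path + [dst], p)
        if not extended and len(path) > 1:
            found[" > ".join(path)] = round(prob, 2)

    for page, n in visits.items():
        if n / total > support:  # initial probability must exceed the support
            dfs([page], 1.0)
    return found


frequent_03 = mine_frequent_trails(TRAILS, support=0.1, confidence=0.3)
frequent_05 = mine_frequent_trails(TRAILS, support=0.1, confidence=0.5)
```

Under these assumptions, `frequent_03` contains seven trails and `frequent_05` contains three, matching the two Frequent Trails tables for the example data.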
Frequent Trails
Support = 0.1 and Confidence = 0.3
Trail           Probability
A1 > A2 > A3    0.67
A5 > A2 > A3    0.67
A2 > A3         0.67
A1 > A2 > A4    0.33
A5 > A2 > A4    0.33
A2 > A4         0.33
A4 > A6         0.33
Frequent Trails
Support = 0.1 and Confidence = 0.5
Trail           Probability
A1 > A2 > A3    0.67
A5 > A2 > A3    0.67
A2 > A3         0.67
Content Mining
• Incorporate the categories that users are
navigating through so we may better
understand their activities.
– E.g. what type of book is the user interested
in; this may be used for recommendation.
• Classify users according to behaviour.
– Is the user’s intent to browse, search or buy?
• Cluster users with common interests.
Pre-fetching and Caching Pages
• Learn access patterns to predict future
accesses.
• Pre-fetch predicted pages to reduce
latency.
• Can use Markov model and base the
prediction on history of access.
• Also cache results of popular search
engine queries.
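A first-order Markov predictor for pre-fetching can be sketched as below. It is a minimal illustration, assuming the prediction uses only the current page and ignores sessions that end there; the function names and the `min_prob` threshold are hypothetical choices, not part of the slides.

```python
from collections import Counter, defaultdict

TRAILS = [
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3"],
    ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"],
    ["A5", "A2", "A4", "A6"],
    ["A5", "A2", "A3", "A6"],
]


def train(trails):
    """Count click-throughs between consecutive pages in past trails."""
    clicks = defaultdict(Counter)
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            clicks[src][dst] += 1
    return clicks


def predict_next(clicks, current_page, min_prob=0.5):
    """Return the most likely next page, or None if no page is likely enough."""
    followers = clicks.get(current_page)
    if not followers:
        return None
    page, n = followers.most_common(1)[0]
    return page if n / sum(followers.values()) >= min_prob else None


clicks = train(TRAILS)
to_prefetch = predict_next(clicks, "A2")
```

For the example trails, a user on A2 moved to A3 in 4 of 6 cases, so A3 is the page worth pre-fetching; raising `min_prob` trades fewer wasted fetches for fewer latency savings.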
eCommerce Clickstream Analysis
• What is the user’s intention: browse,
search or buy?
• Measure time spent on site - site
stickiness
• Repeat visits – it has been shown that
repeat visitors spend less time on the site;
can be explained by learning.
• Measure visit-to-purchase conversion
ratio, and predict purchase likelihood.
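The visit-to-purchase conversion ratio is simply the share of sessions that end in a purchase. A minimal sketch, assuming each session is a list of page paths and that a purchase is recognised by a confirmation page; the path `/checkout/complete` and the sample sessions are made up.

```python
def conversion_ratio(sessions, purchase_page="/checkout/complete"):
    """Fraction of sessions that reach the (assumed) purchase confirmation page."""
    if not sessions:
        return 0.0
    purchases = sum(1 for pages in sessions if purchase_page in pages)
    return purchases / len(sessions)


# Hypothetical sessions for illustration.
sessions = [
    ["/home", "/books/42", "/cart", "/checkout/complete"],
    ["/home", "/books/42"],
    ["/home", "/search", "/books/7", "/cart"],
    ["/home"],
]
ratio = conversion_ratio(sessions)
```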
Supplementary Analyses to
Improve eCommerce Web Sites
• Detecting visits from crawlers as opposed to human
visitors.
• Form error analysis, e.g. login errors, mandatory fields
not filled, incorrect format.
• When and why do people exit the site, e.g. visitor puts
item in cart but exits before reaching the checkout.
• Analysis of local search engine logs – correlate with site
behaviour.
• Product recommendations based on association rules
(people who bought x also bought y).
• Geographic analysis – where are the customers?
• Demographic analysis – who are the customers?
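The association-rule idea (people who bought x also bought y) can be sketched with pairwise rules only. This is a simplified illustration, not a full frequent-itemset algorithm: support is the share of baskets containing both items, confidence is the share of baskets with x that also contain y, and the thresholds, function name and baskets are hypothetical.

```python
from collections import Counter
from itertools import permutations


def association_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Return (x, y, support, confidence) for rules 'bought x -> also bought y'."""
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = set(basket)
        item_count.update(items)
        pair_count.update(permutations(items, 2))  # ordered pairs (x, y)
    rules = []
    for (x, y), both in pair_count.items():
        support = both / n
        confidence = both / item_count[x]
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical purchase baskets.
baskets = [
    ["book", "bookmark"],
    ["book", "bookmark", "lamp"],
    ["book", "lamp"],
    ["book"],
    ["lamp"],
]
rules = association_rules(baskets)
```

In this toy data everyone who bought a bookmark also bought a book, so "bookmark -> book" passes with confidence 1.0, whereas "book -> bookmark" has confidence 0.5 and is dropped.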
Adaptive web sites
• Modify the web site according to user
access.
– Automatic synthesis of index pages (hubs that
contain links on a specific topic)
– Based on a clustering algorithm that uses the
co-occurrence frequencies of pages from the
log data.
– Finds a concept that best describes each
cluster.
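A rough sketch of the clustering step is given below. It swaps in a deliberately simple technique, connected components of a thresholded co-occurrence graph, which may differ from the algorithm the slide has in mind, and it does not attempt the concept-finding step. The function name, the `min_count` threshold and the sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_clusters(sessions, min_count=2):
    """Group pages linked by co-occurrence counts of at least min_count."""
    cooccur = Counter()
    for pages in sessions:
        cooccur.update(combinations(sorted(set(pages)), 2))
    # Undirected graph with an edge for each sufficiently frequent pair.
    graph = {}
    for (a, b), n in cooccur.items():
        if n >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    # Connected components of that graph serve as the clusters.
    clusters, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            page = stack.pop()
            if page not in component:
                component.add(page)
                stack.extend(graph[page] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical sessions for illustration.
sessions = [
    ["/jazz/a", "/jazz/b", "/home"],
    ["/jazz/a", "/jazz/b"],
    ["/rock/x", "/rock/y"],
    ["/rock/x", "/rock/y", "/home"],
    ["/home", "/about"],
]
clusters = cooccurrence_clusters(sessions)
```

Each resulting cluster is a candidate index page: here the two jazz pages and the two rock pages are grouped, while `/home` and `/about` never co-occur often enough with anything to join a cluster.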