Log Mining (for fun and profit)


Log Mining
CSE 454
Eytan Adar
November 28, 2007
So far…
• Building massive services…
– Crawling data
– Processing data (mining, machine learning, extraction, etc.)
– Indexing data
– Serving data
• Now what?
Behavior
• Hopefully at this point you actually have
users
• Users interact, use, and add content
• As an (information) side-effect
– Leave traces behind
• We would like to make use of this
– Understand our demographics
– Improve the service
Logging Web Activity (Review)
• Most servers support “common logfile format” or “extended logfile
format”
18.1.13.12 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
• Apache lets you customize format
• Every HTTP event is recorded
– Page requested
– Remote host
– Browser type
– Referring page
– Time of day
• Cookies
• Other instrumented information can be passed in URLs (e.g. URL rewriting)
Simple Stats
• Building a basic analytics site
– Use over time
– How many users
– Where they are
– Where they come from
– What they look at
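A basic analytics page mostly comes down to counting. The sketch below assumes the log has already been parsed into tuples of (remote host, date, page, referrer); the sample `hits` and the `simple_stats` helper are hypothetical. Mapping hosts to locations ("where they are") would need a geolocation lookup and is left out.

```python
from collections import Counter

# Hypothetical, already-parsed hits: (remote host, date, page, referrer)
hits = [
    ("18.1.13.12", "2007-11-27", "/index.html", "http://www.google.com/"),
    ("18.1.13.12", "2007-11-27", "/about.html", "/index.html"),
    ("24.6.0.9", "2007-11-27", "/index.html", "http://www.google.com/"),
    ("24.6.0.9", "2007-11-28", "/index.html", "-"),
]

def simple_stats(hits, top_n=3):
    """Count hits per day, distinct hosts, top pages, and top referrers."""
    return {
        "hits_per_day": Counter(day for _, day, _, _ in hits),
        # Distinct remote hosts is only a rough proxy for "users".
        "unique_hosts": len({host for host, _, _, _ in hits}),
        "top_pages": Counter(page for _, _, page, _ in hits).most_common(top_n),
        # "-" marks a request with no referrer, so it is skipped.
        "top_referrers": Counter(
            ref for _, _, _, ref in hits if ref != "-"
        ).most_common(top_n),
    }

stats = simple_stats(hits)
```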
Leveraging the Data
• More advanced modeling of behavior
• Improve user interface (designer perspective)
– What are people looking at?
– What/where are they clicking on?
– Where do they enter a site? Leave?
– Repeated behaviors?
– Is today different than yesterday? Did my redesign have an impact?
– How are my ads doing? Where should I put them?
User Tracing
• Trace a user through a website
• Commercial vendors
– SPSS
– SAS
– WebTrends
– ClickTracks
– (see http://www.kdnuggets.com/software/web-mining.html for a
ton more)
Clickdensity Maps
User Tracing
• Trace users through a website
VisitorVille
User Tracing
• Trace a user through a website
WebQuilt, Berkeley (Proxy based solution)
User Tracing
• Trace a user through a website
Ed Chi, Xerox PARC
User Tracing
• Tracking users is tricky
• Why?
• What’s a user?
– Proxies, multiple accounts, cookies lost, robots
• What’s a session?
– Back button, caching
– The “bathroom” problem
– Are they doing something new?
– Entering and leaving
• De-heading
• Are they really done
• Bookmarks
User Tracing
• Tricks of the trade
– Cookies help
– Force cache flushing
– Javascript (“bugs”)
– Time based session delimiters
• Fixed (30 minutes)
• Adaptive (Calculate based on inter-arrival times)
– Referral logs
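The fixed time-based delimiter can be sketched in a few lines of Python. This assumes the request times for a single user have already been collected (for example by cookie); `split_sessions` is an illustrative name, and the 30-minute default is the fixed value from the list above.

```python
from datetime import datetime, timedelta

def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's request times into sessions split at gaps > timeout."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            # Close enough to the previous request: same session.
            sessions[-1].append(ts)
        else:
            # First request, or the gap was too long: start a new session.
            sessions.append([ts])
    return sessions

base = datetime(2007, 11, 28, 9, 0)
visits = [
    base,
    base + timedelta(minutes=5),
    base + timedelta(minutes=50),
    base + timedelta(minutes=55),
]
sessions = split_sessions(visits)
```

An adaptive variant would pass in a `timeout` calculated from the observed inter-arrival times instead of using the fixed default.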
Leveraging the Data
• More advanced modeling of behavior
• Improve user interface (designer
perspective)
• Automatically modify the service
– Guidance (good next place to go…)
– Personalization (you would like…)
– Better index (try this query…)
– Security
General Techniques
• Association rules
– If (1.html & 2.html → 3.html)
– Standard ML algorithms
• Repeated Patterns
– 1.html → 2.html → 3.html is a common path
– Statistics, motifs, and “sequence” alignment
• Clustering
– Digraph, users to pages
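Two of these techniques are easy to try on sessionized page views. The sketch below counts contiguous page sequences (repeated patterns) and measures how often a rule such as "1.html & 2.html → 3.html" holds. The function names and sample sessions are illustrative assumptions; it is plain counting, not a full association-rule miner.

```python
from collections import Counter

def common_paths(sessions, length=3):
    """Count every contiguous run of `length` pages across all sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

def rule_confidence(sessions, antecedent, consequent):
    """Of the sessions that contain all antecedent pages, the fraction
    that also contain the consequent page."""
    antecedent = set(antecedent)
    matching = [s for s in sessions if antecedent <= set(s)]
    if not matching:
        return 0.0
    return sum(consequent in s for s in matching) / len(matching)

# Hypothetical sessions, each an ordered list of pages viewed.
sessions = [
    ["1.html", "2.html", "3.html"],
    ["1.html", "2.html", "3.html", "4.html"],
    ["2.html", "1.html", "5.html"],
    ["1.html", "4.html"],
]
paths = common_paths(sessions)
confidence = rule_confidence(sessions, ["1.html", "2.html"], "3.html")
```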
PageRank + Behavior
• Implicit links, find where people go
P1
P2
P3
P4
• Calculate ranking based not only on real
links but also implicit ones
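One way to fold implicit links into the ranking is to add them as weighted edges before running PageRank. This is a sketch under assumptions: the weight of 0.5 per observed transition, the sample graph, and the names `merge_links` and `pagerank` are all illustrative, not a method taken from the slides.

```python
def merge_links(real_links, implicit_counts, implicit_weight=0.5):
    """Combine real hyperlinks (weight 1.0) with observed user transitions."""
    merged = {}
    for page, targets in real_links.items():
        for target in targets:
            merged.setdefault(page, {})[target] = 1.0
    for (source, target), count in implicit_counts.items():
        edges = merged.setdefault(source, {})
        edges[target] = edges.get(target, 0.0) + implicit_weight * count
    return merged

def pagerank(out_links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: {target: weight}}."""
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = out_links.get(page, {})
            total = sum(targets.values())
            if total == 0:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target, weight in targets.items():
                    new_rank[target] += damping * rank[page] * weight / total
        rank = new_rank
    return rank

# Hypothetical site: no page links to P4, but users often go from P1 to P4.
real_links = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"], "P4": ["P1"]}
implicit_counts = {("P1", "P4"): 5}

links_only = pagerank(merge_links(real_links, {}))
with_behavior = pagerank(merge_links(real_links, implicit_counts))
```

With the implicit edge added, P4 picks up rank that the real links alone would never give it.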
Guiding Many Users
• Suggesting where to go based on previous
trails/footprints
WebWatcher
Guiding Users
• Suggesting where to go based on previous
trails/footprints
• Do things dynamically
– (if a.html + b.html suggest c.html)
• MINPATH (Anderson et al.)
– Mobile users don’t want to “surf”
– Learn the paths
– Suggest shortcut links
• Personalized site maps (Toolan/Kushmerick)
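The dynamic rule "if a.html + b.html suggest c.html" can be learned from earlier trails by counting what followed each pair of pages. This is a minimal sketch with made-up names and sample sessions; it is not MINPATH or the WebWatcher approach, only the simplest version of the idea.

```python
from collections import Counter, defaultdict

def build_trail_model(sessions):
    """Map each (previous page, current page) pair to counts of the next page."""
    model = defaultdict(Counter)
    for pages in sessions:
        for a, b, c in zip(pages, pages[1:], pages[2:]):
            model[(a, b)][c] += 1
    return model

def suggest_next(model, previous_page, current_page, k=1):
    """Return up to k pages that most often followed this pair of pages."""
    followers = model.get((previous_page, current_page))
    if not followers:
        return []
    return [page for page, _ in followers.most_common(k)]

# Hypothetical trails left by earlier visitors.
sessions = [
    ["a.html", "b.html", "c.html"],
    ["a.html", "b.html", "c.html"],
    ["a.html", "b.html", "d.html"],
    ["b.html", "a.html", "c.html"],
]
model = build_trail_model(sessions)
suggestion = suggest_next(model, "a.html", "b.html")
```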
Query Logs
• Specific (important) type of web log analysis
• Users presented with SERPs (Search Engine Result Pages)
• Behavior logged as:
[user info] date “query” #results result-clicked clickthrough
Same analysis issues
• What’s a user?
– Somewhat easier (lots of instrumentation)
• What’s a session?
– Users are hitting the back button a lot
– Users also re-search a lot
• Maybe even 40%
• But, we also want search sessions…
– “Seattle Basketball” → “Sonics” → “Seattle SuperSonics”
New Analysis Issues
• Query refinement – tracking sessions is
harder
Type                  Example
Capitalization        Air France → air france
Word order            New York Department of State → Department of State New York
Stop words            Atlas of Missouri → Missouri Atlas
Word swaps            American Embassy → American Consulate
Abbreviations         British Airways → BA
Misspellings          Yahoo → Yahho
Extra words/phrases   Six Flags → Six Flags New York
Reformulations        United Nations Secretary General → Kofi Annan
Synonyms              Practical Jokes → Pranks
Generating Sessions
• Some of it is easy…
– Normalize
– Re-order
– Drop stopwords
• Hard part for us is the same thing that’s
hard for users
– Spelling mistakes, better queries, etc.
– Advantage of scale…
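The easy steps (normalize, re-order, drop stopwords) can be combined into one canonical form, so that trivially different queries compare as equal. A minimal sketch: the stopword list is a tiny illustrative assumption, and `normalize_query` is a made-up name.

```python
# Illustrative stopword list; a real one would be much longer.
STOPWORDS = {"of", "the", "a", "an", "in", "for", "and"}

def normalize_query(query):
    """Lowercase, drop stopwords, and sort the remaining words."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(words))

canonical = normalize_query("Atlas of Missouri")
```

Spelling mistakes and reformulations still normalize to different strings, which is why they need the log-based techniques that follow.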
Spelling Mistakes
• People frequently make the same typos
– Yahho
• Many correct themselves
– Yahoo
• Task: find pairs of queries that always come one after the other
– Yahho → Yahoo
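Finding such pairs is a counting job over search sessions. The sketch below keeps a pair when the second query follows the first often enough, both in absolute count and as a fraction of all the times the first query was followed by anything. The thresholds, the sample sessions, and the name `rewrite_pairs` are assumptions for illustration; real logs rarely show "always", so a fraction is used instead.

```python
from collections import Counter

def rewrite_pairs(sessions, min_count=2, min_fraction=0.6):
    """Return {(first, second): fraction} for queries usually followed by
    the same next query."""
    pair_counts = Counter()
    first_counts = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second:  # ignore exact repeats of the same query
                pair_counts[(first, second)] += 1
                first_counts[first] += 1
    result = {}
    for (first, second), count in pair_counts.items():
        fraction = count / first_counts[first]
        if count >= min_count and fraction >= min_fraction:
            result[(first, second)] = fraction
    return result

# Hypothetical search sessions: ordered lists of normalized queries.
sessions = [
    ["yahho", "yahoo"],
    ["yahho", "yahoo", "yahoo mail"],
    ["yahho", "yahoo"],
    ["yahoo", "weather"],
    ["yahho", "google"],
]
pairs = rewrite_pairs(sessions)
```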
Synonyms/Reformulations
[Figure: queries Q1 (UN Sec. General) and Q2 (Kofi) connected to the results R1 (http://foo.bar...), R2 (http://a.b.c..), R3 (http://1.2.3...), and R4 (http://6.7.8...)]
Synonyms/Reformulations
• Combine ideas → Suggest queries
[Figure: queries Q1, Q2, Q3, Q4]
Improving Query Results
• What’s a good result?
• Heuristic of the last click
– Users fail in lots of ways
– Usually succeed in one
– Find the most popular, last clicked result in a
session
• Problem: most popular last click is almost
always result #1
– Can “test the waters” by occasionally
swapping results
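The last-click heuristic is simple to compute once search sessions exist. This sketch assumes each session is a (query, ordered list of clicked results) pair; the result labels and the name `best_by_last_click` are hypothetical. It does nothing about the position bias noted above.

```python
from collections import Counter, defaultdict

def best_by_last_click(sessions):
    """For each query, return the result most often clicked last."""
    last_clicks = defaultdict(Counter)
    for query, clicks in sessions:
        if clicks:  # sessions with no clicks tell us nothing here
            last_clicks[query][clicks[-1]] += 1
    return {
        query: counts.most_common(1)[0][0]
        for query, counts in last_clicks.items()
    }

# Hypothetical sessions: (query, results clicked in order).
sessions = [
    ("q1", ["R1", "R3"]),
    ("q1", ["R3"]),
    ("q1", ["R1"]),
    ("q1", []),
    ("q1", ["R2", "R3"]),
]
best = best_by_last_click(sessions)
```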
Automatic Improvements
• The DirectHit algorithm
[Figure: percentages shown for query Q1: R1 (http://foo.bar...) 85%, R2 (http://a.b.c..) 5%, R3 (http://1.2.3...) 2%, R4 (http://6.7.8...) 1%]
Automatic Improvements
• The DirectHit algorithm
[Figure: percentages shown for query Q1: R1 (http://foo.bar...) 2%, R2 (http://a.b.c..) 5%, R3 (http://1.2.3...) 85%, R4 (http://6.7.8...) 1%; labeled “Move to top”]
Security/Spam Issues
• Issues in taking into account what users do?
– Robots
– Malicious users
• These are aggressively removed
– Too many queries, too quickly
• Personalization helps
– Limits impact to one person/small group
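The "too many queries, too quickly" test can be sketched as a sliding-window rate check. The thresholds (10 queries in 60 seconds) and the name `is_too_fast` are assumptions for illustration; real filters would tune these and combine them with other signals.

```python
from collections import deque

def is_too_fast(timestamps, max_queries=10, window_seconds=60):
    """True if any span of window_seconds holds more than max_queries queries.

    timestamps are query times in seconds for a single user.
    """
    window = deque()
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop queries that have fallen out of the window ending at ts.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_queries:
            return True
    return False

robot_like = is_too_fast(range(20))           # one query per second
human_like = is_too_fast([0, 100, 200, 300])  # one query every 100 seconds
```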
Automatic Improvements
• Personalization
[Figure: User 1, User 2, and User 3 each send Q1 to the Search Engine, which returns Results]
Automatic Improvements
• Personalization
[Figure: User 1, User 2, and User 3 each send Q1 to the Search Engine; each user has a User Model and gets their own Results]
Personalization
• Learning the user model
– User tells us what they’re interested in
(categories of pages or specific pages)
– We infer what they’re interested in
• Pages with Apple AND Farm
• Pages with Apple AND Computer
– User model “boosts” certain word scores
• Remember TFIDF?
• Other things to put in the model?
• Optimization: group users
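The "boost" idea can be sketched as a per-user multiplier applied to each term's score before summing. The page scores stand in for TF-IDF values and are made up, as are the boost factors and function names; this is one possible reading of the slide, not a specific system's method.

```python
def personalized_score(term_scores, user_boosts):
    """Sum per-term scores (e.g. TF-IDF), each multiplied by the user's boost."""
    return sum(
        score * user_boosts.get(term, 1.0)
        for term, score in term_scores.items()
    )

def rank_pages(pages, user_boosts):
    """Order page names by personalized score, best first."""
    return sorted(
        pages,
        key=lambda name: personalized_score(pages[name], user_boosts),
        reverse=True,
    )

# Hypothetical per-term scores for two pages matching the query "apple".
pages = {
    "orchard.html": {"apple": 0.4, "farm": 0.5},
    "mac.html": {"apple": 0.4, "computer": 0.6},
}
farmer_ranking = rank_pages(pages, {"farm": 2.0})
geek_ranking = rank_pages(pages, {"computer": 2.0})
```

Grouping users would mean sharing one boost table across a cluster of similar users instead of keeping one per person.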
Questions?