Log Mining (for fun and profit)
Log Mining
CSE 454
Eytan Adar
November 28, 2007
So far….
• Building massive services…
– Crawling data
– Processing data (mining, machine learning, extraction, etc.)
– Indexing data
– Serving data
• Now what?
Behavior
• Hopefully at this point you actually have users
• Users interact, use, and add content
• As an (information) side-effect
– Leave traces behind
• We would like to make use of this
– Understand our demographics
– Improve the service
Logging Web Activity (Review)
• Most servers support “common logfile format” or “extended logfile format”
18.1.13.12 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
• Apache lets you customize format
• Every HTTP event is recorded
– Page requested
– Remote host
– Browser type
– Referring page
– Time of day
• Cookies
• Other instrumented information can be passed in URLs (e.g. URL rewriting)
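As a minimal sketch of reading such a log, the following parses the sample common-logfile-format line shown above. The field names (host, ident, user, and so on) and the function name are illustrative choices, not part of any server's API, and the pattern does not cover the extended format or custom Apache formats.

```python
import re
from datetime import datetime

# Common logfile format: host ident user [time] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    # Split the request line into method, path, and protocol.
    method, _, rest = fields["request"].partition(" ")
    path, _, protocol = rest.rpartition(" ")
    fields["method"], fields["path"], fields["protocol"] = method, path, protocol
    return fields

record = parse_clf_line(
    '18.1.13.12 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(record["host"], record["path"], record["status"], record["size"])
```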
Simple Stats
• Building a basic analytics site
– Use over time
– How many users
– Where they are
– Where they come from
– What they look at
Leveraging the Data
• More advanced modeling of behavior
• Improve user interface (designer perspective)
– What are people looking at?
– What/where are they clicking on?
– Where do they enter a site? Leave?
– Repeated behaviors?
– Is today different than yesterday? Did my redesign have an impact?
– How are my ads doing? Where should I put them?
User Tracing
• Trace a user through a website
• Commercial vendors
– SPSS
– SAS
– WebTrends
– ClickTracks
– (see http://www.kdnuggets.com/software/web-mining.html for a ton more)
Clickdensity Maps
User Tracing
• Trace users through a website
VisitorVille
User Tracing
• Trace a user through a website
WebQuilt, Berkeley (Proxy based solution)
User Tracing
• Trace a user through a website
Ed Chi, Xerox PARC
User Tracing
• Tracking users is tricky
• Why?
• What’s a user?
– Proxies, multiple accounts, cookies lost, robots
• What’s a session?
– Back button, caching
– The “bathroom” problem
– Are they doing something new?
– Entering and leaving
• De-heading
• Are they really done?
• Bookmarks
User Tracing
• Tricks of the trade
– Cookies help
– Force cache flushing
– Javascript (“bugs”)
– Time based session delimiters
• Fixed (30 minutes)
• Adaptive (Calculate based on inter-arrival times)
– Referral logs
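The fixed time-based delimiter can be sketched as follows. This is an illustrative version only: it assumes events are (user, timestamp in seconds, page) tuples and that the user identifier is already reliable, which the previous slide points out is not a given. The function and variable names are hypothetical.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # the fixed 30-minute delimiter

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Split (user, timestamp, page) events into per-user sessions.

    A new session starts whenever the gap since the same user's
    previous request exceeds `gap` seconds.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[user].append((timestamp, page))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        for (prev_time, _), (time, page) in zip(hits, hits[1:]):
            if time - prev_time > gap:
                sessions.append((user, current))
                current = []
            current.append(page)
        sessions.append((user, current))
    return sessions

events = [
    ("u1", 0, "a.html"), ("u1", 60, "b.html"),
    ("u1", 4000, "c.html"),  # more than 30 minutes after b.html
    ("u2", 10, "a.html"),
]
sessions = sessionize(events)
print(sessions)
```

An adaptive delimiter would replace the constant with a threshold computed from each user's observed inter-arrival times.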
Leveraging the Data
• More advanced modeling of behavior
• Improve user interface (designer perspective)
• Automatically modify the service
– Guidance (good next place to go…)
– Personalization (you would like…)
– Better index (try this query…)
– Security
General Techniques
• Association rules
– If (1.html & 2.html → 3.html)
– Standard ML algorithms
• Repeated Patterns
– 1.html → 2.html → 3.html is a common path
– Statistics, motifs, and “sequence” alignment
• Clustering
– Digraph, users to pages
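A small sketch of the first two techniques, assuming each session is already an ordered list of page names. It measures support and confidence for one candidate rule and counts fixed-length paths; it is not a full rule-mining algorithm, and the function names are made up for illustration.

```python
from collections import Counter

def rule_stats(sessions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent = set(antecedent)
    matching = [s for s in sessions if antecedent <= set(s)]
    both = [s for s in matching if consequent in s]
    support = len(both) / len(sessions) if sessions else 0.0
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence

def common_paths(sessions, length=3):
    """Count every contiguous run of `length` pages across all sessions."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts

sessions = [
    ["1.html", "2.html", "3.html"],
    ["1.html", "2.html", "3.html"],
    ["1.html", "2.html"],
    ["2.html", "4.html"],
]
support, confidence = rule_stats(sessions, {"1.html", "2.html"}, "3.html")
top_path = common_paths(sessions).most_common(1)
print(support, confidence, top_path)
```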
PageRank + Behavior
• Implicit links, find where people go
(Figure: pages P1, P2, P3, and P4)
• Calculate ranking based not only on real
links but also implicit ones
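One way to read this idea, sketched under assumptions: treat observed user transitions as extra weighted edges and run ordinary power-iteration PageRank over the combined graph. The link structure, the transition counts, and the `implicit_weight` parameter below are all hypothetical; the slides do not say how real and implicit links are weighted.

```python
def combine(real_links, implicit_links, implicit_weight=0.5):
    """Merge hyperlinks (weight 1) with observed transitions (weighted counts)."""
    edges = {link: 1.0 for link in real_links}
    for link, count in implicit_links.items():
        edges[link] = edges.get(link, 0.0) + implicit_weight * count
    return edges

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over weighted edges {(src, dst): weight}."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), weight in edges.items():
        out_weight[src] += weight
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing weight is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for (src, dst), weight in edges.items():
            if out_weight[src] > 0:
                new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank

real_links = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P4", "P1")]
implicit_links = {("P1", "P4"): 4}  # users often went from P1 to P4

links_only = pagerank(combine(real_links, {}))
with_behavior = pagerank(combine(real_links, implicit_links))
print(links_only["P4"], with_behavior["P4"])
```

In this toy graph P4 has no incoming hyperlinks, so its rank rises only once the observed P1 to P4 transitions are added.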
Guiding Many Users
• Suggesting where to go based on previous
trails/footprints
WebWatcher
Guiding Users
• Suggesting where to go based on previous
trails/footprints
• Do things dynamically
– (if a.html + b.html → suggest c.html)
• MINPATH (Anderson et al.)
– Mobile users don’t want to “surf”
– Learn the paths
– Suggest shortcut links
• Personalized site maps (Toolan/Kushmerick)
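The dynamic suggestion rule can be sketched as a lookup of what most often followed the user's last two pages in earlier trails. This is a simple frequency model for illustration, not MINPATH or the personalized site map work; the function names and the trails are hypothetical.

```python
from collections import Counter, defaultdict

def build_suggester(trails, context=2):
    """Learn which page most often follows each run of `context` pages."""
    followers = defaultdict(Counter)
    for trail in trails:
        for i in range(len(trail) - context):
            followers[tuple(trail[i:i + context])][trail[i + context]] += 1

    def suggest(recent_pages):
        # Look up the user's most recent pages; None if never seen before.
        counts = followers.get(tuple(recent_pages[-context:]))
        return counts.most_common(1)[0][0] if counts else None

    return suggest

trails = [
    ["a.html", "b.html", "c.html"],
    ["a.html", "b.html", "c.html"],
    ["a.html", "b.html", "d.html"],
]
suggest = build_suggester(trails)
print(suggest(["a.html", "b.html"]))
```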
Query Logs
• Specific (important) type of web log analysis
• Users presented with SERPs (Search Engine Result Pages)
• Behavior logged as:
[user info] date “query” #results result-clicked clickthrough
Same analysis issues
• What’s a user?
– Somewhat easier (lots of instrumentation)
• What’s a session?
– Users are hitting the back button a lot
– Users also re-search a lot
• Maybe even 40%
• But, we also want search sessions…
– “Seattle Basketball” → “Sonics” → “Seattle SuperSonics”
New Analysis Issues
• Query refinement – tracking sessions is harder

Type                  Example
Capitalization        Air France → air france
Word order            New York Department of State → Department of State New York
Stop words            Atlas of Missouri → Missouri Atlas
Word swaps            American Embassy → American Consulate
Abbreviations         British Airways → BA
Misspellings          Yahoo → Yahho
Extra Words/Phrases   Six Flags → Six Flags New York
Reformulations        United Nations Secretary General → Kofi Annan
Synonyms              Practical Jokes → Pranks
Generating Sessions
• Some of it is easy…
– Normalize
– Re-order
– Drop stopwords
• Hard part for us is the same thing that’s hard for users
– Spelling mistakes, better queries, etc.
– Advantage of scale…
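The three easy steps can be sketched as a single canonical form, so that queries differing only in capitalization, word order, or stop words compare equal. The stop word list here is a tiny illustrative one, not a standard list.

```python
# A deliberately small, illustrative stop word list.
STOPWORDS = frozenset({"a", "an", "and", "of", "the", "in", "for", "to"})

def normalize_query(query):
    """Lowercase, drop stop words, and sort the remaining words."""
    words = [w for w in query.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(words))

print(normalize_query("Atlas of Missouri"))
print(normalize_query("Missouri Atlas"))
```

Two consecutive queries with the same normalized form can then be treated as the same query when building a search session.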
Spelling Mistakes
• People frequently make the same typos
– Yahho
• Many correct themselves
– Yahoo
• Task: find pairs of queries that always come one after the other
– Yahho → Yahoo
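A sketch of that task, assuming sessions are ordered lists of query strings. It counts consecutive query pairs and keeps the frequent ones whose strings are similar. The similarity filter (Python's difflib ratio) and both thresholds are added assumptions for illustration; the slides only describe counting the pairs.

```python
from collections import Counter
from difflib import SequenceMatcher

def likely_typo_fixes(sessions, min_count=2, min_similarity=0.7):
    """Return {(earlier, later): count} for frequent, similar query pairs."""
    pairs = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second:
                pairs[(first, second)] += 1
    return {
        pair: count
        for pair, count in pairs.items()
        if count >= min_count
        and SequenceMatcher(None, pair[0], pair[1]).ratio() >= min_similarity
    }

sessions = [
    ["yahho", "yahoo"],
    ["yahho", "yahoo", "yahoo mail"],
    ["seattle basketball", "sonics"],
    ["seattle basketball", "sonics"],
]
fixes = likely_typo_fixes(sessions)
print(fixes)
```

Without the similarity filter, the frequent pair "seattle basketball" followed by "sonics" would also be returned, which is a reformulation rather than a spelling fix.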
Synonyms/Reformulations
(Figure: queries Q1 “UN Sec. General” and Q2 “Kofi” connected to results R1 http://foo.bar..., R2 http://a.b.c.., R3 http://1.2.3..., R4 http://6.7.8...)
Synonyms/Reformulations
• Combine ideas → Suggest queries
(Figure: queries Q1, Q2, Q3, Q4)
Improving Query Results
• What’s a good result?
• Heuristic of the last click
– Users fail in lots of ways
– Usually succeed in one
– Find the most popular, last clicked result in a session
• Problem: most popular last click is almost always result #1
– Can “test the waters” by occasionally swapping results
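The last-click heuristic can be sketched as follows, assuming each search session has been reduced to a query plus the ordered list of results the user clicked. The record layout and the result names are hypothetical.

```python
from collections import Counter, defaultdict

def best_result_by_last_click(sessions):
    """For each query, return the result most often clicked last."""
    last_clicks = defaultdict(Counter)
    for query, clicks in sessions:
        if clicks:  # sessions with no clicks carry no signal here
            last_clicks[query][clicks[-1]] += 1
    return {
        query: counts.most_common(1)[0][0]
        for query, counts in last_clicks.items()
    }

sessions = [
    ("q1", ["r1", "r3"]),  # tried r1, ended on r3
    ("q1", ["r3"]),
    ("q1", ["r1"]),
    ("q1", []),
]
best = best_result_by_last_click(sessions)
print(best)
```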
Automatic Improvements
• The DirectHit algorithm
(Figure: share of clicks for query Q1)
R1 http://foo.bar... 85%
R2 http://a.b.c.. 5%
R3 http://1.2.3... 2%
R4 http://6.7.8... 1%
Automatic Improvements
• The DirectHit algorithm
(Figure: share of clicks for query Q1)
R1 http://foo.bar... 2%
R2 http://a.b.c.. 5%
R3 http://1.2.3... 85% (move to top)
R4 http://6.7.8... 1%
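A rough sketch of the reordering step pictured on these slides: compute each result's share of clicks for a query and sort by it. This only illustrates the idea of moving a heavily clicked result to the top; it is not a description of how DirectHit actually worked, and the click counts are invented.

```python
from collections import Counter

def click_shares(clicks):
    """Turn a list of clicked results into {result: fraction of clicks}."""
    counts = Counter(clicks)
    total = sum(counts.values())
    return {result: n / total for result, n in counts.items()}

def rerank_by_clicks(results, shares):
    """Order results by click share, keeping the old order on ties."""
    return sorted(results, key=lambda r: shares.get(r, 0.0), reverse=True)

results = ["R1", "R2", "R3", "R4"]
clicks = ["R3"] * 85 + ["R2"] * 5 + ["R1"] * 2 + ["R4"] * 1
reranked = rerank_by_clicks(results, click_shares(clicks))
print(reranked)
```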
Security/Spam Issues
• Issues in taking into account what users do?
– Robots
– Malicious users
• These are aggressively removed
– Too many queries, too quickly
• Personalization helps
– Limits impact to one person/small group
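The "too many queries, too quickly" test can be sketched as a sliding-window rate check. The thresholds below (more than 10 queries within 60 seconds) are arbitrary illustrative values, and the (client, timestamp) event layout is an assumption.

```python
from collections import defaultdict, deque

def flag_fast_clients(events, max_queries=10, window_seconds=60):
    """Return clients that exceed max_queries within any sliding window."""
    recent = defaultdict(deque)
    flagged = set()
    for client, timestamp in sorted(events, key=lambda e: e[1]):
        window = recent[client]
        window.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_queries:
            flagged.add(client)
    return flagged

events = [("bot", t) for t in range(20)]           # one query per second
events += [("human", t * 100) for t in range(3)]   # one query every 100 s
flagged = flag_fast_clients(events)
print(flagged)
```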
Automatic Improvements
• Personalization
(Figure: Users 1, 2, and 3 each send Q1 to the Search Engine and receive Results)
Automatic Improvements
• Personalization
(Figure: Users 1, 2, and 3 each send Q1 to the Search Engine; each user has a User Model and receives their own Results)
Personalization
• Learning the user model
– User tells us what they’re interested in (categories of pages or specific pages)
– We infer what they’re interested in
• Pages with Apple AND Farm
• Pages with Apple AND Computer
– User model “boosts” certain word scores
• Remember TFIDF?
• Other things to put in the model?
• Optimization: group users
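A sketch of a user model that boosts word scores on top of TF-IDF. It assumes the model is simply a dictionary of per-word boosts added to the query term weights; the scoring formula, the documents, and the boost values are illustrative assumptions, not the method of any particular search engine.

```python
import math
from collections import Counter

def score_documents(query, docs, user_boosts=None):
    """Score docs by TF-IDF of query words plus boosted user-model words."""
    user_boosts = user_boosts or {}
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n_docs = len(tokenized)
    doc_freq = Counter(t for words in tokenized.values() for t in set(words))

    # Query words get weight 1; the user model adds its boosts on top.
    weights = {term: 1.0 for term in query.lower().split()}
    for term, boost in user_boosts.items():
        weights[term] = weights.get(term, 0.0) + boost

    scores = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        scores[name] = sum(
            weight * tf[term] * math.log(n_docs / doc_freq[term])
            for term, weight in weights.items()
            if tf[term]
        )
    return scores

docs = {
    "farm_page": "apple farm orchard",
    "computer_page": "apple computer laptop",
    "other_page": "weather report",
}
plain = score_documents("apple", docs)
for_techie = score_documents("apple", docs, {"computer": 2.0})
print(plain, for_techie)
```

With no user model the two "apple" pages tie; a model that boosts "computer" breaks the tie in favor of the computer page, and one that boosts "farm" would do the opposite.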
Questions?