Web Analytics - School of Information


WIRED - Web Analytics Week
• Web Logs overview
• Web Analytics
- Understanding Queries
- Tracking Users
• Web Log Reliability
• Web Log Data Mining & KDD
Web Analytics
• Evaluation of Web Information Retrieval (& Web
Information Seeking)
• What can we learn?
- IR systems use
- Web server administration
• Who are the users?
- Types of users
- User situations
• How does it affect or help IR?
Web Server Overview
• Any application that can serve files using the HTTP protocol
- Text, HTML, XHTML, XML…
- Graphics
- CGI, applets, servlets
- Other media & MIME types
• e.g., Apache or MS IIS, which serve primarily Web pages
• Servers create ASCII text log files showing:
- Date, time, bytes transferred, (cache status)
- Status/error codes, user IP address, (domain name)
- Request method, URI, miscellaneous comments
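A minimal sketch in Python of reading such a log and tallying status codes and bytes transferred; the filename and the common-log-format layout are assumptions, since servers vary:

    import re
    from collections import Counter

    # NCSA common log format: host ident authuser [date] "request" status bytes
    CLF = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    )

    statuses, total_bytes = Counter(), 0
    with open("access.log") as log:            # hypothetical log file
        for line in log:
            m = CLF.match(line)
            if not m:
                continue                       # skip corrupt lines
            statuses[m.group("status")] += 1
            if m.group("bytes") != "-":
                total_bytes += int(m.group("bytes"))

    print(statuses.most_common(5), total_bytes)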
Web Log Overview
• Access Log
- Logs information such as page served or time served
• Referer Log
- Logs the name of the server and page that link to the currently served page
- Not always present
- Can be from any Web site
• Agent Log
- Logs browser type and operating system
• Mozilla
• Windows
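A rough sketch of summarizing an agent log; the filename and the substring tests are assumptions, and real user-agent strings need more careful parsing:

    from collections import Counter

    browsers = Counter()
    with open("agent.log") as log:             # hypothetical agent log
        for ua in log:
            # crude family detection; real user-agent parsing is messier
            if "MSIE" in ua:
                browsers["Internet Explorer"] += 1
            elif "Mozilla" in ua:
                browsers["Mozilla-compatible"] += 1
            else:
                browsers["other"] += 1

    print(browsers)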
What can we learn from Web logs?
• Every time a Web browser requests a file, it
gets logged
- Where the user came from
- What kind of browser was used to access the server
- Referring URL
• Every time a page gets served, it gets logged
- Request time, serve time, bytes transferred, URI,
status code
Web Log Analysis in Action
• UT Web log reports
(Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00).
Successful requests: 39,826,634 (39,596,364)
Average successful requests per day: 5,690,083 (5,656,623)
Successful requests for pages: 4,189,081 (4,154,717)
Average successful requests for pages per day: 598,499 (593,530)
Failed requests: 442,129 (439,467)
Redirected requests: 1,101,849 (1,093,606)
Distinct files requested: 479,022 (473,341)
Corrupt logfile lines: 427
Data transferred: 278.504 Gbytes (276.650 Gbytes)
Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)
Problems with Web Servers
• Actual user or intent not known
• Paths difficult to determine
• Infrequent access challenging to uncover
• No state information
• Server hits not representative
- Counters inaccurate
• DoS attacks, floods, and bandwidth limits can stop “intended” usage
• Robots, etc.
• ISP proxy servers
“5.3 Unsound inferences from data that is logged”
Haigh & Megarity, 1998.
Web Server Configuration
• Unique file & directory names = “at a glance analysis”
• Hierarchical directory structure
• Redirect CGI to find referrer
• Use a database
- store web content
- record usage data with context of content logged
• Create state information with programming
- Servlets, ActiveX, JavaScript
- Custom server or log format
• Log rollover, report frequency, special case testing
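One hedged sketch of the “redirect CGI” idea: log the Referer header, then bounce the visitor to the real target. The script name, log path, and query-string convention are all assumptions:

    #!/usr/bin/env python3
    # go.py - hypothetical redirect CGI: record the referrer, then redirect.
    import os
    import time

    target = os.environ.get("QUERY_STRING") or "/"    # e.g. go.py?/10/3/a3-160-e.html
    referer = os.environ.get("HTTP_REFERER", "-")
    with open("/var/log/redirects.log", "a") as log:  # assumed log location
        log.write(f"{time.strftime('%d/%b/%Y:%H:%M:%S')}\t{referer}\t{target}\n")

    # CGI response: status and Location header, then a blank line.
    print(f"Status: 302 Found\nLocation: {target}\n")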
Log File Format
• Extended Log File Format - W3C Working
Draft WD-logfile-960323
192.117.240.3 - - [24/Jul/1998:00:00:04 -0400]
"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 "http://www.amicus.nlcbnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503"
"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
• Every server generates slightly different logs
- Versions & operating system issues
- Admin tweaks to log formats
• Extended Log Format most common
- WWW Consortium standard (≈ Apache’s combined format)
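A sketch of parsing the combined-style example line above with a single regular expression; the pattern is an assumption and will need adjusting per server, since formats vary:

    import re

    COMBINED = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
        r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
    )

    line = ('192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] '
            '"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
            '"http://www.amicus.nlcbnc.ca/wbin/resanet/itemdisp/'
            'l=0/d=1/r=1/e=0/h=10/i=11683503" '
            '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"')

    m = COMBINED.match(line)
    print(m.group("status"), m.group("referer"), m.group("agent"))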
Let’s Look at some logs
• http://www.ischool.utexas.edu/analogmonthly.html
• http://www.ischool.utexas.edu/analogweekly.html
Log Analysis Tools
• Analog
• Webalizer
• Sawmill
• WebTrends
• AWStats
• WWWStat
• GetStats
• Perl scripts
• Data mining & business intelligence tools
WebTrends
• A whole industry of analytics
• Most popular commercial application
Measuring Web Site Usage
• Now that the Web is a primary source,
understanding its use is critical
• Few external cues that the Web site is being used
• What - pages and their content/subject
• How - browsers
• Who - userid or IP
• When - trends, daily, weekly, yearly
• Where - the user is and what page they came
from
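A small sketch of the “when” question: bucket timestamps by weekday. The toy timestamps reuse the [day/month/year:time zone] style shown earlier:

    from collections import Counter
    from datetime import datetime

    stamps = ["24/Jul/1998:00:00:04 -0400",   # toy timestamps in the log's style
              "25/Jul/1998:13:45:10 -0400"]

    by_day = Counter(
        datetime.strptime(s, "%d/%b/%Y:%H:%M:%S %z").strftime("%A") for s in stamps
    )
    print(by_day)   # Counter({'Friday': 1, 'Saturday': 1})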
What can’t you measure?
• Who the user is
- Not always identifiable
- Whether the user’s needs have changed
• If they’re using the information
- Browsing vs. Reading vs. Acting on the
information
• Changes to site and how they affect each user
• Pages not used at all - and why
Analysis of a Very Large Search Log
• What kinds of patterns can we find?
• Request = query and results page
• 280 GB – Six Weeks of Web Queries
- Almost 1 billion search requests, 850M valid, 575M queries
- 285 Million User Sessions (cookie issues)
- Large volume, less trendy
- Why are unique queries important?
• Web Users:
- Use Short Queries in short sessions - 63.7% one request
- Mostly Look at the First Ten Results only
- Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
• Silverstein, Henzinger, Marais, Moricz (1998)
Analysis of a Very Large Search Log
• 2.35 average terms per query
- 0 terms = 20.6% (?)
- 1 term = 25.8%
- 2 terms = 26.0%
- 0–2 terms combined = 72.4%
• Operators Per Query
- 0 = 79.6%
• Terms Predictable
• Only the first set of results viewed = 85%
• Some (Single Term Phrase) Query Correlation
- Augmentation
- Taxonomy Input
- Robots vs. Humans
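A sketch of how such a terms-per-query distribution is tallied; the toy query list is invented for illustration:

    from collections import Counter

    queries = ["", "web analytics", "log", "search engine query log"]  # toy data

    dist = Counter(len(q.split()) for q in queries)
    total = sum(dist.values())
    for n_terms in sorted(dist):
        print(f"{n_terms} terms: {dist[n_terms] / total:.1%}")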
Web Analytics and IR?
• Knowing access patterns of users
• Lists of search terms
- Numbers of words
- Words, concepts to add (synonyms)
- Types of queries
• Success of searching a site
- Was a result link clicked on?
- How many pages per user after a search?
• Is a new or better search interface needed?
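A hedged sketch of the “was a result link clicked?” question over a toy stream of (user, action) events; the event model is an assumption, not the format of any real search log:

    # Count how often a click immediately follows a search, per user.
    events = [
        ("u1", "search"), ("u1", "click"), ("u2", "search"), ("u2", "search"),
    ]

    searches = clicks_after = 0
    last = {}
    for user, action in events:
        if action == "search":
            searches += 1
        elif action == "click" and last.get(user) == "search":
            clicks_after += 1
        last[user] = action

    print(f"click-through after search: {clicks_after}/{searches}")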
Real Life Information Retrieval
• 51K Queries from Excite (1997)
• Average search terms per query = 2.21
• Number of Terms
- 1 term = 31%, 2 terms = 31%, 3 terms = 18% (80% combined)
• Logic & Modifiers (by User)
- Infrequent
- AND, “+”, “-”
• Logic & Modifiers (by Query)
- 6% of Users
- Less Than 10% of Queries
- Lots of Mistakes
• Uniqueness of Queries
- 35% successive
- 22% modified
- 43% identical
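A sketch of the identical/modified/new breakdown: compare each query in a session to its predecessor, calling it “modified” when the two share a term. The toy session and the overlap test are assumptions:

    session = ["web logs", "web logs", "web log analysis", "kdd"]

    counts = {"identical": 0, "modified": 0, "new": 0}
    for prev, cur in zip(session, session[1:]):
        if cur == prev:
            counts["identical"] += 1
        elif set(cur.split()) & set(prev.split()):
            counts["modified"] += 1     # shares at least one term
        else:
            counts["new"] += 1
    print(counts)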
Real Life Information Retrieval
• Queries per user = 2.8
• Sessions
- Flawed Analysis (User ID)
- Some Revisits to Query (Result Page Revisits)
• Page Views
- Accurate, but not by User
• Use of Relevance Feedback (more like this)
- Not Used Much (~11%)
• Terms Used Typical & frequent
• Mistakes
- Typos
- Misspellings
- Bad (Advanced) Query Formulation
• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)
KDD for Extracting Knowledge
• Knowledge extraction, information discovery, information
extraction, data archeology, data pattern processing, OLAP, HV
statistical analysis
• Sounds as if “knowledge” is there to be
found.
• User and usage context help find the
knowledge
• Hypothesis before analysis
• Why KDD, why now?
- Data storage, analysis costs
- Visualization
KDD Process
[Figure: KDD process diagram]
• Database for structured data and queries
- How the data is structured, algorithms for queries
- How results can be understood and visualized
- Iterative & Interactive, hypothesis driven &
hypothesis generating
KDD Efforts
• Data Cleaning
• Formulating the Questions
• “Finding useful features to represent the data” (p. 30)
• Models:
- Classification to fit data into pre-defined classes
- Regression to fit predictions & values
- Clustering to group sets found in data
- Summarization to briefly describe data
- Dependency discovery of variable relationships
- Sequence analysis for time or interaction patterns
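As one concrete instance of the clustering model, a sketch that groups users by page-visit counts with scikit-learn’s k-means; the feature matrix is invented for illustration (rows = users, columns = visit counts for three page areas):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[9, 0, 1], [8, 1, 0], [0, 7, 2], [1, 9, 1]])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # e.g. [0 0 1 1] — two usage clusters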
Data Prep for Mining the WWW
• Processing the data before mining
• WEBMINER system - site topology
- Cleaning
- User identification
- Session identification (episodes)
- Path completion
[Figure: WEBMINER data preparation steps]
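A sketch of the session-identification step: split each IP’s requests into sessions whenever more than 30 minutes elapse between hits. The timeout and the toy data are assumptions (30 minutes is a common heuristic, not necessarily WEBMINER’s own):

    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)
    hits = [  # (ip, timestamp) pairs, pre-sorted by time; toy data
        ("192.117.240.3", "1998-07-24 00:00:04"),
        ("192.117.240.3", "1998-07-24 00:05:10"),
        ("192.117.240.3", "1998-07-24 02:00:00"),
    ]

    sessions, last_seen = {}, {}
    for ip, ts in hits:
        t = datetime.fromisoformat(ts)
        if ip not in last_seen or t - last_seen[ip] > TIMEOUT:
            sessions.setdefault(ip, []).append([])   # start a new session
        sessions[ip][-1].append(ts)
        last_seen[ip] = t

    print([len(s) for s in sessions["192.117.240.3"]])   # [2, 1]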
Web Usage Mining
• VL Verification
• Data Mining to Discover Patterns of Use
- Pre-Processing
- Pattern Discovery
- Pattern Analysis
• Site Analysis, Not User Analysis
• Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000)
Web Usage Discovery
- Content
• Text
• Graphics
• Features
- Structure
• Content Organization
• Templates and Tags
- Usage
• Patterns
• Page References
• Dates and Times
- User Profile
• Demographics
• Customer Information
Web Usage Collection
• Types of Data
- Web Servers
- Proxies
- Web Clients
• Data Abstractions
- Sessions
- Episodes
- Clickstreams
- Page Views
• The Tools for Web Use Verification
Web Usage Preprocessing
• Usage Preprocessing
- Understanding the Web Use Activities of the Site
- Extract from Logs
• Content Preprocessing
- Converting Content Into Formats for Processing
- Understanding Content (Working with Dev Team)
• Structure Preprocessing
- Mining Links and Navigation from Site
- Understanding Page Content and Link Structures
Web Usage Pattern Discovery
• Clustering for Similarities
- Pages
- Users
- Links
• Classification
- Mapping Data to Pre-defined Classes
- Rule Discovery
- Computation Intensive
- Many Paths to Similar Answers
• Pattern Detection
- Ordering By Time
- Predicting Use With Time
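A sketch of time-ordered pattern detection at its simplest: count page-to-page transitions within sessions, a first-order view of navigation. The sessions are toy data:

    from collections import Counter

    sessions = [["/", "/courses", "/courses/ir"], ["/", "/people"]]

    transitions = Counter()
    for pages in sessions:
        transitions.update(zip(pages, pages[1:]))

    for (src, dst), n in transitions.most_common(3):
        print(f"{src} -> {dst}: {n}")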
Web Usage Mining as Evaluation?
• Mining Goals
- Improved Design
- Improved Delivery
- Improved Content
• Personalization (XMod Data)
• System Improvement (Tech Data)
• Site Modification (IA Data)
• Business Intelligence (Market Data)
• Usage Characterization (User Behavior Data)
Web Analytics Wrap-up
• What can we learn about users?
• What can we learn about services?
• How can we help users improve their use?
• How can IR models benefit from this analysis?
• What kind of improvements in Web IR systems and their interfaces can be taken from this?