by User - School of Information


Web Servers & Log Analysis
• What can we learn from looking at Web server logs?
- What server resources were requested
- When the files were requested
- Who requested them (where IP address = who)
- How they requested them (browser types & OS)
• Some assumptions
- A request for a resource means the user did receive it
- A resource is viewable & understandable to each user
- Users are identified within a loose set of parameters
• How does knowing request patterns affect or help IA?
Types of Web Server Logs
• Proxy-based
- Web access servers to control access or cache
popular files
• Client-based
- Local cache files
- Browser History file(s)
• Network-based
- Routers, firewalls & access points
• Server-based
- Web servers to serve content
Using Web Servers
• The Apache Software Foundation
• Microsoft Internet Information Server
(Services)
• These applications “Serve”
- Text - HTML, XML, plain text
- Graphics - jpeg, gif, png
- CGI, servlets, XMLHttpRequest & other logic
- Other MIME types such as movies & sound
• Most servers can log these files
- Daily, weekly or monthly
- Cannot always log CGI or related logic
(specifically or “out of the box”)
How Servers Work
• Hypertext Transfer Protocol - HTTP
1. A file is requested by the browser
2. The request is transferred via the network
3. The server receives the request (& logs it)
4. The server provides the file (& logs it)
5. The browser displays the file
• Almost all Web servers work this way
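The five steps above can be sketched entirely in Python's standard library: a tiny server that records each request to an in-memory "access log," and a client standing in for the browser. Everything here (the page body, the log format) is invented for illustration; real servers log far more detail.

```python
# A minimal sketch of the five-step request cycle, using only the
# standard library. The in-memory access_log list stands in for a
# real server's access log file.
import http.server
import threading
import urllib.request

access_log = []  # stand-in for the server's access log

class LoggingHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Step 3: the server receives the request (& logs it)
        access_log.append(f'"GET {self.path}" from {self.client_address[0]}')
        # Step 4: the server provides the file
        body = b"<html>hello</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence the handler's default stderr logging

server = http.server.HTTPServer(("127.0.0.1", 0), LoggingHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Steps 1-2: the "browser" requests a file; the request crosses the network
url = f"http://127.0.0.1:{server.server_port}/index.html"
with urllib.request.urlopen(url) as resp:
    page = resp.read()  # Step 5: the browser would now display the file

server.shutdown()
```

After the request completes, `access_log` holds one entry for `/index.html`, mirroring how the server-side log captures every request it serves.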
Types of Server Logs
• Access Log
- Logs information such as page served or time
served
• Referer Log
- Logs the name of the server and page that links to the currently served page
- Not always present
- Can be from any Web site
• Agent Log
- Logs browser type and operating system
• Mozilla
• Windows
Log File Format
• Extended Log File Format - W3C Working
Draft WD-logfile-960323
• Key advantage:
- computer storage cost decreases while paper cost rises
• Every server generates slightly different logs
Extended Log File Formats
• WWW Consortium Standards
• Will automatically record much of what is
programmatically done now.
- faster
- more accurate
- standard baselines for comparison
- graphics standards
What is a log file?
• A delimited, text file with information about
what the server is doing
- IP Address or Domain name
- Date/Time
- Method used & Page Requested
- Protocol, Response Code & Bytes Returned
- Referring Page (sometimes)
- UserAgent & Operating System
p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500]
"GET /images/sanchez.jpg HTTP/1.1" 200 "http://www.ischool.utexas.edu/research/" "Mozilla/4.0
(compatible; MSIE 6.0; Windows XP)"
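The fields listed above line up with Apache's "combined" log format, so each line can be pulled apart with a single regular expression. A minimal sketch follows; the log line in it is invented (note it includes a byte count, which the sample above omits), and real lines vary slightly between servers.

```python
# A sketch of extracting the fields from one Apache "combined"-format
# log line with a regular expression. The sample line is hypothetical.
import re

LINE = ('192.0.2.10 - - [01/Sep/2004:08:17:21 -0500] '
        '"GET /images/sanchez.jpg HTTP/1.1" 200 5120 '
        '"http://www.ischool.utexas.edu/research/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"')

PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '                 # IP address or domain name
    r'\[(?P<time>[^\]]+)\] '                  # date/time with UTC offset
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '    # response code & bytes returned
    r'"(?P<referer>[^"]*)" '                  # referring page (sometimes "-")
    r'"(?P<agent>[^"]*)"')                    # user agent & operating system

hit = PATTERN.match(LINE).groupdict()
print(hit["host"], hit["path"], hit["status"])
```

Because the format is delimited text, this kind of one-expression parse is all that most log-analysis tools do at their core, repeated once per line.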
In search of Reliable Data
• Not as Foolproof as Paper
- With a book, you can see when someone is reading a page
- You can know the page is turned
- You can know the book is checked out
• No State Information
- The same person or another person could be reading page 1 and then page 2
- You really can’t tell how many users you have
• Server Hits not perfectly Representative
- Counters inaccurate
- Caching & Robots can influence + & -
• Floods/Bandwidth can Stop “intended” usage
What is a “hit”?
• Technically, a hit is simply any file requested
from the server
- That is logged
- That represents (usually) part of a request to “see”
a whole Web page
• Hits combine to represent a “page view”
• Page views combine to represent an
“episode” or “session”
- An episode is one activity or question a user performs or requests on a Web site
- A session is a series of episodes that embodies all the interactions a user undertakes using a Web site (per time, based on averages around 30 min.)
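The roughly-30-minute cutoff is what most tools use to split one visitor's page views into sessions. A minimal sketch of that grouping, assuming the page-view times for a single visitor have already been extracted and sorted (here given as minutes since midnight for brevity; real code would parse log timestamps):

```python
# Group one visitor's page views into sessions using an inactivity
# cutoff. Any gap longer than SESSION_GAP starts a new session.
SESSION_GAP = 30  # minutes of inactivity that ends a session

def sessionize(view_times):
    """Split a sorted list of page-view times into sessions."""
    sessions = []
    for t in view_times:
        if sessions and t - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(t)   # close enough: same session
        else:
            sessions.append([t])     # gap too large: a new session begins
    return sessions

views = [600, 605, 612, 700, 702, 800]   # 10:00, 10:05, 10:12, 11:40, ...
print(sessionize(views))
# → [[600, 605, 612], [700, 702], [800]]
```

Note this inherits the assumptions listed earlier: it only works if the "same visitor" has already been identified, which IP addresses alone do only loosely.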
Making Servers More Reliable
• Keep system setups simple
- unique file and directory names
- clear, consistent structure
• Configure CMS for logging/serving
• Use an FTP server for file transfer
- frees up logs and server!
• Judicious use of links
• Wise MIME types
- some hard/impossible to log
Clever Web Server Setup
• Redirect CGI to find referrer
• Use a database
- store web content
- record usage data
• Create state information with programming
- NSAPI
- ActiveX
• Have contact information
• Have purpose statements
Managing Log Files
• Backup
• Store Results or Logs?
• Beginning New Logs
• Posting Results
Log Analysis Tools
• Analog
• Webalizer
• Sawmill
• WebTrends
• AWStats
• WWWStat
• GetStats
• Perl Scripts
• Data Mining & Business Intelligence tools
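At their simplest, all of these tools boil down to counting things in the access log. A sketch of the "most popular files" report that a short Perl script, Analog, or Webalizer would produce, written here in Python over a few invented log lines:

```python
# Count requests per file path across access-log lines and print the
# most-requested files first. The log lines are hypothetical samples
# in Common Log Format.
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Sep/2004:08:17:21 -0500] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [01/Sep/2004:08:18:02 -0500] "GET /research/ HTTP/1.1" 200 2048',
    '10.0.0.1 - - [01/Sep/2004:08:19:40 -0500] "GET /index.html HTTP/1.1" 200 1024',
]

# The requested path is the second token inside the quoted request field.
hits = Counter(line.split('"')[1].split()[1] for line in log_lines)

for path, count in hits.most_common():
    print(f"{count:4d}  {path}")
```

The commercial and data-mining tools listed above add richer parsing, session reconstruction, and reporting on top, but this count-and-rank loop is the common core.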
WebTrends
• A whole industry of analytics
• Most popular commercial application
Log Analysis Cumulative Sample
• Program started at Tue-03-Dec-2005 01:20 local time.
• Analysed requests from Thu-28-Jul-2004 20:31 to Mon-02-Dec-1996 23:59 (858.1 days).
• Total successful requests: 4 282 156 (88 952)
• Average successful requests per day: 4 990 (12 707)
• Total successful requests for pages: 1 058 526 (17 492)
• Total failed requests: 88 633 (1 649)
• Total redirected requests: 14 457 (197)
• Number of distinct files requested: 9 638 (2 268)
• Number of distinct hosts served: 311 878 (11 284)
• Number of new hosts served in last 7 days: 7 020
• Corrupt logfile lines: 262
• Unwanted logfile entries: 976
• Total data transferred: 23 953 Mbytes (510 619 kbytes)
• Average data transferred per day: 28 582 kbytes (72 946
kbytes)
How about the iSchool Web site?
• Our server log files are collected regularly
- Daily
- Weekly
- Monthly
- Even yearly
• What does a quick look tell us?
- How well is the server working?
• Uptime, server errors, logging errors
- How popular is our site?
• Number of hits, popular files
- Who is visiting the site?
• Countries, types of companies
- What searches led people here?
UT & its Web server logs
• UT Web log reports
(Figures in parentheses refer to the 7 days to 28-Mar-2004 03:00).
Successful requests: 39,826,634 (39,596,364)
Average successful requests per day: 5,690,083 (5,656,623)
Successful requests for pages: 4,189,081 (4,154,717)
Average successful requests for pages per day: 598,499 (593,530)
Failed requests: 442,129 (439,467)
Redirected requests: 1,101,849 (1,093,606)
Distinct files requested: 479,022 (473,341)
Corrupt logfile lines: 427
Data transferred: 278.504 Gbytes (276.650 Gbytes)
Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)
Neat Analysis Tricks
• use a search engine to find references
- “link:www.ischool.utexas.edu/~donturn”
• this is where using unique names pays off
- use many engines
• update times different
• blocking mechanisms are different
• use Web searches (or Yahoo, Bloglines…)
- look for references
- look for IP addresses of users
Neat Tricks, cont.
• Walking up the Links
- follow URLs upward
• Reverse Sort
- look for relations
• Use your own robot to index
- Test
Web Surveys, an alternative
• Surveys actually ask users what they did,
what they sought & if it helped
• GVU, Nielsen and GNN
- Qualitative questions
• phone
• web forms
- Self-selected sample problems
• random selection
• oversample
Analysis of a Very Large Search Log
• What kinds of patterns can we find?
• Request = query and results page
• 280 GB – Six Weeks of Web Queries
- Almost 1 Billion Search Requests, 850K valid, 575K queries
- 285 Million User Sessions (cookie issues)
- Large volume, less trendy
Why are unique queries important?
• Web Users:
- Use Short Queries in short sessions - 63.7% one request
- Mostly Look at the First Ten Results only
- Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
• Silverstein, Henzinger, Marais, Moricz (1998)
Analysis of a Very Large Search Log
• 2.35 Average Terms Per Query
- 0 terms = 20.6% (?)
- 1 term = 25.8%
- 2 terms = 26.0%
- 0–2 terms combined = 72.4%
• Operators Per Query
- 0 = 79.6%
• Terms Predictable
• First Set of Results Viewed Only = 85%
• Some (Single Term Phrase) Query Correlation
- Augmentation
- Taxonomy Input
- Robots vs. Humans
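The term-count distribution reported above can be reproduced on any query log by bucketing queries by their number of terms. A sketch, over a handful of made-up queries (the empty strings stand in for the study's puzzling zero-term requests):

```python
# Bucket queries by term count and report each bucket's share,
# mirroring the terms-per-query breakdown from the search-log study.
from collections import Counter

queries = ["", "austin", "ut austin", "school of information",
           "web server logs", "log", "", "apache log format"]

dist = Counter(len(q.split()) for q in queries)
total = len(queries)

for n_terms in sorted(dist):
    share = 100 * dist[n_terms] / total
    print(f"{n_terms} terms: {share:.1f}%")
```

Whitespace splitting is a simplification; a real analysis would also have to decide how to count operators, quoted phrases, and punctuation before tallying terms.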
Real Life Information Retrieval
• 51K Queries from Excite (1997)
• Search Terms = 2.21
• Number of Terms
- 1 term = 31%, 2 terms = 31%, 3 terms = 18% (80% combined)
• Logic & Modifiers (by User)
- Infrequent
- AND, “+”, “-”
• Logic & Modifiers (by Query)
- 6% of Users
- Less Than 10% of Queries
- Lots of Mistakes
• Uniqueness of Queries
- 35% successive
- 22% modified
- 43% identical
Real Life Information Retrieval
• Queries per user 2.8
• Sessions
- Flawed Analysis (User ID)
- Some Revisits to Query (Result Page Revisits)
• Page Views
- Accurate, but not by User
• Use of Relevance Feedback (more like this)
- Not Used Much (~11%)
• Terms Used Typical & frequent
• Mistakes
- Typos
- Misspellings
- Bad (Advanced) Query Formulation
• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)
Downie & Web Usage
• Server logs are like library usage
• User-based analyses
- who
- where
- what
• File-based analyses
- amount
• Request analyses
- conform (loosely) to Zipf’s Law
• Byte-based analyses
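The loose conformance to Zipf's law can be eyeballed from a popular-files report: under Zipf, rank × frequency stays roughly constant down the ranking. A sketch with invented hit counts:

```python
# Rank files by hit count and check the Zipf rule of thumb:
# rank * frequency should stay roughly constant. Counts are invented.
hit_counts = {"/index.html": 1000, "/about.html": 480,
              "/courses.html": 340, "/people.html": 260,
              "/contact.html": 190}

ranked = sorted(hit_counts.values(), reverse=True)
products = [rank * freq for rank, freq in enumerate(ranked, start=1)]

print(products)  # roughly constant if the distribution is Zipf-like
```

Real request distributions only follow the law loosely, as the slide says; a long tail of rarely requested files typically drifts away from a constant product.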
Web use analysis & IA?
• Another tool to help understand how people use your Web-provided resources
• With a small amount of setup, you can learn a great deal
• Server use can be integrated into site usage for users
- Lists of popular pages & more interlinking of pages
- Adding search terms that found the page to related pages
- Adjusting metadata to reflect searches that find pages
- Adding pages to the site index or site map
• First-cut usability information
- Pages 1 & 2 were accessed, but not 3 - Why?
- Navigation usage, link ordering and design understanding
- Knowing what browsers & OS helps tailor design and media
types
BREAK!
• No Presentation this week
- Next week: Asset management, content
management & version control
• Break up media development work
• Examine current pages, style sheets &
designs
• Set up next set of pair & individual
deliverables
Media Development work
• We need to find & create graphics for the new
site
• Content about:
- Austin
- UT
- iSchool
- People at the iSchool
- Students at work in the iSchool (classes, labs)
• Screen grab from videos
• Search the Web for copyright free images
• Take our own pictures
Current Pages & Designs
• First version of main iSchool page template
and CSS complete
• Secondary page template & CSS complete
- Some secondary pages already built
• Index page template set
• Site map page initially set
- Big Map
- Main pages map
Next steps
• In class
- Test & evaluate current CSS and templates
- Improvise secondary home page based on initial design
- Examine new Alumni section
- Examine new Course Listing page
• For homework
- Complete secondary page migration to new design
- Rotate design work
• Alumni
• Site Map
• Home page design ideas
- Picture/Media creation work