IUG2009AndrewWongx - HKUST Institutional Repository

Presented by
Andrew Wong
9th Annual IUG meeting at HKU Library
8 December 2009
• Definitions
• Motivations
• Architecture of Logs Miner
• Logs Miner User Interface
• Logs Miner reports
• Benefits
• Future development

Web data mining
“application of data mining methodologies, techniques, and models to variety of data forms, structures, and usage patterns that comprise the World Wide Web” (Markov, Z. & Larose, D. T. 2007)
Three scopes of Web data mining:
• Web content mining
• Web structure mining
• Web log mining

Web log mining
• Discovers user access patterns from Web usage logs
• Also called Web usage mining
• Three processing stages:
1. Pre-processing
2. Pattern discovery
3. Pattern analysis

• Identify and classify different groups of patrons
• Understand the search patterns of different groups of patrons
• Adapt web user interfaces to suit users' needs
• Provide statistical data for collection management

• Web logs provide a large amount of information on user actions
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feedid=10486796160015392754)"
lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz444.ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
Field | Value
Remote host field | lbz000.ust.hk
Date/Time field | [16/Nov/2009:12:03:26 +0800]
HTTP request field | "GET /catalog/ HTTP/1.1"
Status code field | 200
Transfer volume (bytes) field | 20283
User agent field | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
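The field breakdown above can be reproduced with a short parser. This is a minimal sketch in Python, not part of Logs Miner itself (which is built on AWStats and Perl); the regular expression, the function name `parse_combined`, and the dictionary keys are illustrative assumptions.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status bytes "referrer" "agent"
COMBINED_RE = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def parse_combined(line):
    """Split one log line into named fields; return None if it does not match."""
    match = COMBINED_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the bytes column means no body was sent (e.g. a 304 response).
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = (
    'lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" '
    '200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) '
    'Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"'
)
record = parse_combined(sample)
print(record["host"], record["status"], record["bytes"])
```

Lines that do not fit the assumed layout return `None`, so a caller can count or log them instead of failing.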
Common Log Format: usually used by Apache web server logs and Apache Tomcat logs
e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
Microsoft IIS Log Format
e.g. ILLiad, Class Registration Form
2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0
Include:
• Remote host field
• Date field
• Time field
• HTTP request field
• Status code field
• Transfer Volume (Bytes)
• Referrer field
• User agent field
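IIS logs in the W3C extended format normally carry a `#Fields:` directive that names the columns of each data line. The slide does not show that directive, so the field list in the sketch below is a guess that merely fits the twelve columns of the sample line; in particular, the meaning of the last three numbers (`401 1891 0`) is an assumption. `parse_w3c` is a hypothetical helper name.

```python
def parse_w3c(lines):
    """Yield one dict per data line, keyed by the names in the #Fields directive."""
    fields = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):  # skip malformed lines
                yield dict(zip(fields, values))

# ASSUMPTION: this directive is not in the slides; it is one plausible layout.
sample_log = [
    "#Fields: date time cs-method cs-uri-stem cs-uri-query c-ip cs-version "
    "cs(User-Agent) cs(Referer) sc-status sc-bytes time-taken",
    "2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 "
    "Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0",
]
rows = list(parse_w3c(sample_log))
print(rows[0]["c-ip"], rows[0]["cs-uri-stem"], rows[0]["sc-status"])
```

Reading the column names from the log itself, rather than hard-coding positions, is what lets one parser cope with servers that are configured to log different fields.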
Microsoft Streaming Server
e.g. Streaming video
143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0
Fields only for streaming server:
• Video codec
• Audio codec
• Duration
• Client’s player

Tools used to analyze web access logs
• AccessWatch v1.33
• Analog 6.0
• Pwebstats
• RefStats 1.2
• INNOPAC Millennium Web Report – Search Statistics
Others:
• AWStats
• Sawmill Analytics
• Webalizer

• Create a portal for storing and analyzing all the different web access logs
• Provide an interface for querying web access logs to generate dynamic statistical reports

• Ability to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C), and streaming server log files
• Can analyze non-standard log formats
• Works from the command line and from a browser as a CGI
• Build a web interface to query the data (Logs Miner)
• Pre-process the raw log data, running large-scale queries in a cron job

• Unlimited log file size
• Reports the number of unique visitors and visits
• Provides plug-ins to extend functionality
• Open source

• Web log files: the raw data must contain web log components such as client IP address, status code, HTTP request field, ...
• Any OS platform that supports Perl

• PC-level workstations
• CentOS release 5.4
• Apache web server 2.0
• Perl v5.8.8
• AWStats 6.9

[Architecture diagram. Raw logs (Library web server, INNOPAC, SmartCAT, Institutional Repository, Digital Archives, ...) go through preprocessing into AWStats. The Logs Miner UI sits on AWStats and offers AWStats reports, access statistics, and customized reports for pattern discovery and pattern analysis.]

• A portal for mining web access log data and retrieving information about the usage of multiple web applications.
• Built on top of AWStats, an open-source log analyzer.
• Currently set up to analyze more than 20 library servers and applications, including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.

URL: https://lbnx16.ust.hk/mining
Includes 20+ applications
Provides three types of report
Filters by URL or Host
Generates yearly or monthly reports
Query box supporting regular expressions

Tips for constructing a query string

• AWStats reports
• Access statistics (filtered by URL / Host)
• Customized reports

Reports the number of:
- unique visitors
- visits
These numbers exclude visits from robots.

Created by plugin: geoip

Work in progress: HKUST's iPhone application for receiving Library information and searching SmartCAT

Query box supporting regular expressions

Database title: Cambridge Journals Online
URL: http://library.ust.hk/cgi/db/cambridge.pl?subscribedTo
Server name: library.ust.hk (Library web server)
Parameters: /cgi/db/cambridge.pl?subscribedTo
Include pattern: cgi\/db\/cambridge\.pl.+
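To see what this include pattern selects, it can be tried against a few request paths. The sketch below is in Python rather than the Perl that AWStats runs on, and it assumes the pattern is applied as an unanchored search; the request list, including the `jstor.pl` path, is hypothetical.

```python
import re

# Include pattern from the slide; "\/" is simply an escaped "/".
INCLUDE = re.compile(r"cgi\/db\/cambridge\.pl.+")

def count_matches(urls, pattern=INCLUDE):
    """Count the requested URLs that the include pattern selects."""
    return sum(1 for url in urls if pattern.search(url))

# Hypothetical request paths for illustration only.
requests = [
    "/cgi/db/cambridge.pl?subscribedTo",
    "/cgi/db/cambridge.pl",
    "/cgi/db/jstor.pl?subscribedTo",
    "/catalog/",
]
hits = count_matches(requests)
print(hits)
```

Because the pattern ends in `.+`, a bare request for `/cgi/db/cambridge.pl` with nothing after it is not counted; only requests that carry a query string or other trailing text match.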
Document: Long, Jiafu 2005, Autoinhibition of X11/Mint scaffold proteins revealed by the closed ……
URL: http://repository.ust.hk/dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Server name: repository.ust.hk (HKUST Institutional Repository)
Parameters: /dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Include pattern: \/1783\.1\/2496\/1\/nsmb958\.pdf

Number of accesses to the Library web page from Library public workstations
Library web page URL: http://library.ust.hk/
Server name: library.ust.hk (Library web server)
Client naming convention:
• OPAC workstation (lbb[nnn].ust.hk)
• IC workstation (lbc[nnn].ust.hk)
• Computer Lab (lba[nnn].ust.hk)
Include pattern: lb(a|b|c)[\d]+\.ust\.hk
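The naming convention can also be used to break accesses down by workstation type. The sketch below uses a stricter, anchored variant of the include pattern with every dot escaped, which is an assumption about what the pattern is meant to match; the host list and the helper `classify` are hypothetical.

```python
import re
from collections import Counter

# Anchored variant of the slide's pattern, capturing the type letter.
WORKSTATION = re.compile(r"^lb([abc])\d+\.ust\.hk$")
KINDS = {"a": "Computer Lab", "b": "OPAC workstation", "c": "IC workstation"}

def classify(host):
    """Return the workstation type for a public workstation host, else None."""
    match = WORKSTATION.match(host)
    return KINDS[match.group(1)] if match else None

# Hypothetical remote hosts taken from a log's first column.
hosts = ["lbb012.ust.hk", "lbc101.ust.hk", "lba005.ust.hk",
         "lbz000.ust.hk", "lbb012.ust.hk"]
counts = Counter(kind for kind in map(classify, hosts) if kind)
print(counts)
```

Staff workstations such as `lbz000.ust.hk` fall outside the `a|b|c` alternation and are ignored.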
Number of accesses to the Digital Archives from the HKUST campus, excluding HKUST Library staff
Digital university archives URL: http://archives.ust.hk/
Server name: archives.ust.hk (Digital Archives)
Client naming convention: Library staff workstation (lbz[nnn].ust.hk)
Include pattern: ^.+\.ust\.hk$
Exclude pattern: lbz.+\.ust\.hk
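Combining an include and an exclude pattern amounts to keeping the hosts that match the first and not the second. A minimal sketch, assuming both patterns are applied to the remote host field and that the exclude pattern is meant to end in `\.ust\.hk`; the sample hosts and the helper `keep` are hypothetical.

```python
import re

INCLUDE = re.compile(r"^.+\.ust\.hk$")   # any campus host
EXCLUDE = re.compile(r"lbz.+\.ust\.hk")  # Library staff workstations

def keep(host):
    """True when the host is on campus and is not a Library staff workstation."""
    return bool(INCLUDE.search(host)) and not EXCLUDE.search(host)

# Hypothetical remote hosts for illustration only.
hosts = ["lbc101.ust.hk", "lbz000.ust.hk", "pc1.cse.ust.hk", "crawler.example.com"]
campus_non_staff = [host for host in hosts if keep(host)]
print(campus_non_staff)
```

Off-campus hosts fail the include test, and staff workstations pass it but are then removed by the exclude test.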
• A virtual visit is defined as a user’s request on the library’s website in order to use one of the services provided by the library.
• One Key Performance Indicator: virtual visits per capita
• Includes the main web applications:
- Library web server
- INNOPAC
- SmartCAT (next-generation catalog)
- HKUST Institutional Repository
- Digital Archives
- HKUST ILLiad

Reports the number of:
• Visits
- a visit begins when a unique IP address accesses a page, and it continues while that address requests other pages with less than an hour between consecutive requests

[Diagram: a sequence of requests, each made within an hour of the previous one, counts as a single visit]
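The one-hour rule can be written as a small counting routine. This is a sketch of the definition as stated on the slides, not AWStats' actual implementation, whose details may differ; it does no robot filtering, and the IP addresses and times are hypothetical.

```python
from datetime import datetime, timedelta

def count_visits(requests, gap=timedelta(hours=1)):
    """Return (unique_visitors, visits) for an iterable of (ip, timestamp) pairs."""
    last_seen = {}  # ip -> time of that IP's most recent request
    visits = 0
    for ip, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(ip)
        if previous is None or when - previous >= gap:
            visits += 1  # first request, or an hour or more since the last one
        last_seen[ip] = when
    return len(last_seen), visits

# Hypothetical requests for illustration only.
day = datetime(2009, 11, 16)
requests = [
    ("10.0.0.1", day.replace(hour=12, minute=0)),
    ("10.0.0.1", day.replace(hour=12, minute=30)),   # same visit (30 min gap)
    ("10.0.0.1", day.replace(hour=13, minute=45)),   # new visit (75 min gap)
    ("10.0.0.2", day.replace(hour=12, minute=10)),
]
result = count_visits(requests)
print(result)
```

Here two unique visitors produce three visits, because the first address returns after more than an hour of inactivity.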
Application | Unique visitors | Visits | Pages | Visits/visitor | Pages/visit
Library web server | 413,324 | 1,018,811 | 6,078,913 | 2.46 | 5.96
IR | 94,596 | 133,458 | 632,256 | 1.41 | 4.73
Digital Archives | 1,497 | 3,511 | 90,489 | 2.34 | 25.77
E-Journal | 21,833 | 42,768 | 376,473 | 1.95 | 8.80
E-theses | 25,848 | 34,956 | 116,664 | 1.35 | 3.33
HKUST ILLiad | 8,039 | 18,548 | 138,109 | 2.30 | 7.44
SmartCat | 4,202 | 9,398 | 288,787 | 2.23 | 30.72
Streaming Videos | 778 | 1,233 | 4,073 | 1.58 | 3.30
Total | 570,117 | 1,262,683 | 7,725,764 | 2.21 | 6.11
Virtual visits in 2009: 1,262,683
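The totals and ratios in the table can be cross-checked with a few lines of arithmetic. The sketch below takes the Library web server page count as 6,078,913, the value that makes the page column add up to the stated total, and it assumes the ratios were cut to two decimals rather than rounded, which is what the printed figures suggest.

```python
# (unique visitors, visits, pages) per application, from the table
ROWS = {
    "Library web server": (413_324, 1_018_811, 6_078_913),
    "IR": (94_596, 133_458, 632_256),
    "Digital Archives": (1_497, 3_511, 90_489),
    "E-Journal": (21_833, 42_768, 376_473),
    "E-theses": (25_848, 34_956, 116_664),
    "HKUST ILLiad": (8_039, 18_548, 138_109),
    "SmartCat": (4_202, 9_398, 288_787),
    "Streaming Videos": (778, 1_233, 4_073),
}

def truncated_ratio(numerator, denominator):
    """Ratio cut (not rounded) to two decimals, using integer arithmetic."""
    return (numerator * 100 // denominator) / 100

# Column sums across all applications.
totals = tuple(sum(column) for column in zip(*ROWS.values()))
visitors, visits, pages = totals
print(totals)
print(truncated_ratio(visits, visitors), truncated_ratio(pages, visits))
```

The visits total is the figure reported as virtual visits in 2009; dividing it by the user population would give the virtual-visits-per-capita indicator, but that population is not stated in the slides.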
• Built-in customized reports provide a full picture of page visit figures for similar pages
From HKUST Library Web Server (http://library.ust.hk)
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides
SubSet:
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides
HKUST library web sitemap
Add more customized report templates:
• E-Journal list
• Library Forms
• ……

• Central place for storing, processing and analyzing web log data
• Combines usage data from different server logs
• Statistical reports can be generated dynamically
• Flexible querying interface enabling users to construct their own statistical reports in real time

• From web access logs, an individual client’s actions can be tracked
• Protected by firewall, file permissions, and user authentication
• The Logs Miner User Interface can only be accessed from the library network
IMPORTANT: As data retrieved in your searches or reports may contain usage patterns of our users, please be careful not to re-distribute such information outside of the HKUST Library.

• Include more web applications, such as the HKUST PowerSearch server (federated search across the Library’s subscription resources)
• Create more customized report templates, such as an E-journal list

Han, J., & Kamber, M. 2006. Data mining: Concepts and techniques (2nd ed.). Amsterdam: Morgan Kaufmann.
Liu, H., & Keselj, V. 2007. Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users' future requests. Data & Knowledge Engineering, 61(2): 304.
Markov, Z., & Larose, D. T. 2007. Data mining the web: Uncovering patterns in web content, structure, and usage. Hoboken, N.J.: Wiley-Interscience.