Discovering Web Access Patterns
and Trends by Applying OLAP
and Data Mining Technology on
Web logs
Data Engineering Lab
성유진
Abstract
Web server log file analysis
• server performance improvement
• system performance improvement
• customer targeting in electronic commerce

Problems and difficulties
• processing large raw log data is not easy
• data reduction
• size and time
• current web log mining tools
• slow, inflexible, difficult to maintain
• frequency counts alone are not enough

WebLogMiner
• Virtual University / data mining WebLogMiner
• OLAP and data mining techniques
• multi-dimensional data cube
• scalability, interactivity, variety, flexibility
Design of a Web log Miner
Web server log file information
• domain name of the request / user name / date and time of the
request / the method of the request (GET, POST) / the name of the file
requested / the result of the request (success, failure, error, etc.) / size
of the data sent back / the URL of the referring page / identification of
the client agent
• Example
210.114.3.64 - - [01/Jul/1998:17:34:05 0900]
"GET /~yjsung/sign.html HTTP/1.1" 200 740
210.114.3.64 - - [01/Jul/1998:17:38:44 -0900]
"POST /cgi-bin/yjsung/sign HTTP/1.1" 200 352
POST: used when the browser submits a filled-in form to the server
GET: used when requesting data from the server
• Cache information
• frequent backtracking and reloading may indicate a deficient design
– client-side log
• Access count
• not always a measure of interestingness
– e.g., a page that must be passed through to reach a particular document
• Time and Date
• evaluate user interest by time spent
• Domain name
• the sequence of requests can predict the next request → improve traffic
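The log fields listed above can be pulled out of each entry with a regular expression. The following is a minimal sketch, not the tool's actual parser: the pattern, the `parse_entry` helper, and the field names are assumptions made for illustration, and the sample line is the first example entry with the spaces in the request restored.

```python
import re
from datetime import datetime

# Assumed pattern for the fields visible in the example entries:
# host, identity, user, [time], "method path protocol", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_entry(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # Parse only the date and time; the time zone offset is left as text.
    stamp = entry["time"].split()[0]
    entry["timestamp"] = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    return entry

sample = ('210.114.3.64 - - [01/Jul/1998:17:34:05 0900] '
          '"GET /~yjsung/sign.html HTTP/1.1" 200 740')
entry = parse_entry(sample)
```

Lines that do not match return None, so malformed entries can be counted or dropped during filtering rather than stopping the run.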
WebLogMiner 4 Stages
1. Filtering the data, creating a relational DB
2. Data cube construction
3. OLAP is used
4. Data mining techniques are used
1. DATABASE CONSTRUCTION FROM SERVER LOG FILES
Data Cleansing and Transformation
• entries for page graphics (and sound and video) are usually filtered out, but are preserved here
• two types
• without knowledge about the site
– (e.g., transforming the time into day, month, year is possible without server information)
• with knowledge about the site:
– associating a server request with the intended action requires the site structure
• relational database
• cleaned data and new implicit data are added
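The site-independent transformation (splitting the time into year, month, day) and the load into a relational database could look roughly like this. It is a sketch under assumptions: the `requests` table, its columns, and the `transform` helper are hypothetical, and SQLite stands in for whatever database the original system used.

```python
import sqlite3
from datetime import datetime

# Hypothetical parsed entries: (host, timestamp text, method, path, status, size).
raw_entries = [
    ("210.114.3.64", "01/Jul/1998:17:34:05", "GET", "/~yjsung/sign.html", 200, 740),
    ("210.114.3.64", "01/Jul/1998:17:38:44", "POST", "/cgi-bin/yjsung/sign", 200, 352),
]

def transform(entry):
    """Add implicit time fields that need no knowledge of the site."""
    host, stamp, method, path, status, size = entry
    ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    return (host, ts.year, ts.month, ts.day, ts.hour, method, path, status, size)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (host TEXT, year INTEGER, month INTEGER, day INTEGER, "
    "hour INTEGER, method TEXT, path TEXT, status INTEGER, size INTEGER)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [transform(e) for e in raw_entries],
)
conn.commit()

rows = conn.execute(
    "SELECT year, month, day, hour, method FROM requests ORDER BY rowid"
).fetchall()
```

Fields that depend on the site structure, such as the intended user action, would be added in a second pass once that structure is known.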
2. MULTI-DIMENSIONAL WEB LOG DATA CUBE CONSTRUCTION AND MANIPULATION
Data Cube
• the GROUP BY operator in SQL is used to compute
aggregates over a set of attributes
sum of sales by P, C: for each product, give a breakdown of how much of it
was sold to each customer
• CUBE is the n-dimensional generalization of GROUP BY
• gives remarkable flexibility to manipulate and view the
data
• allows OLAP operations such as drill-down, roll-up,
slice and dice
• Attributes
- URL
- domain name
- size of resource
- time
- ...
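To make the "CUBE generalizes group-by" point concrete, the sketch below computes the sum of a measure for every subset of the dimensions, using the product/customer sales example from the slide. The `cube` function, the `ALL` placeholder, and the sample rows are assumptions for illustration; a real system would build the cube inside the database or OLAP engine.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # placeholder for a dimension that has been aggregated away

def cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (2^n group-bys)."""
    totals = defaultdict(int)
    for r in range(len(dimensions) + 1):
        for kept in combinations(dimensions, r):
            for row in rows:
                key = tuple(row[d] if d in kept else ALL for d in dimensions)
                totals[key] += row[measure]
    return dict(totals)

# Hypothetical sales rows: product P, customer C, amount sold.
sales = [
    {"product": "P1", "customer": "C1", "amount": 10},
    {"product": "P1", "customer": "C2", "amount": 5},
    {"product": "P2", "customer": "C1", "amount": 7},
]
result = cube(sales, ["product", "customer"], "amount")
```

In this layout, roll-up corresponds to reading a cell such as `("P1", ALL)`, and drill-down to reading the finer cells `("P1", "C1")` and `("P1", "C2")` beneath it.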
3. DATA MINING ON WEB LOG DATA CUBE AND WEB LOG DATABASE
Data Characterization
• find rules that summarize a user-defined data set
☞ the traffic on a web server for a given type of media at a
particular time of day

Class comparison
• discover discriminant rules
☞ compare requests from two different web browsers

Association
• discover patterns in which accesses to different resources
consistently occur together
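As a rough illustration of association on access data, the sketch below counts how often pairs of resources are requested in the same session and keeps the pairs above a support threshold. This is plain pair counting, not a full association-rule algorithm, and the sessions, paths, and `frequent_pairs` helper are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions: the set of resources each visitor requested.
sessions = [
    {"/index.html", "/courses.html", "/login.cgi"},
    {"/index.html", "/courses.html"},
    {"/index.html", "/help.html"},
    {"/courses.html", "/login.cgi"},
]

def frequent_pairs(sessions, min_support):
    """Return resource pairs whose share of sessions is at least min_support."""
    counts = Counter()
    for resources in sessions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(resources), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}

pairs = frequent_pairs(sessions, min_support=0.5)
```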

Prediction
☞ access to a new resource on a given day can be
predicted based on accesses to similar old resources on

Classification
• can be used to develop a better understanding of each
class in the web log database, and perhaps to restructure a
web site or customize answers to requests based on
classes of requests

Time-series analysis
• to analyze data along time sequences to discover time-related interesting patterns …
☞ disclose patterns and trends that can guide the improvement of
the web server's services

Focus will be on time-series analysis because web
log records are highly time-related
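A very small example of time-related analysis is to bucket requests by hour of day and smooth the resulting series to see the trend. The sketch below assumes timestamps have already been extracted from the cleaned log; the sample data and the `hourly_counts` and `moving_average` helpers are hypothetical, and a trailing moving average is just one simple smoothing choice.

```python
from collections import Counter
from datetime import datetime

# Hypothetical request timestamps taken from cleaned log entries.
timestamps = [
    datetime(1998, 7, 1, 9, 5),
    datetime(1998, 7, 1, 9, 40),
    datetime(1998, 7, 1, 10, 15),
    datetime(1998, 7, 1, 17, 34),
    datetime(1998, 7, 1, 17, 38),
    datetime(1998, 7, 1, 17, 59),
]

def hourly_counts(stamps):
    """Count requests per hour of day (0-23), including empty hours."""
    counts = Counter(ts.hour for ts in stamps)
    return [counts.get(hour, 0) for hour in range(24)]

def moving_average(series, window):
    """Smooth a series with a simple trailing moving average."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

series = hourly_counts(timestamps)
smoothed = moving_average(series, window=3)
peak_hour = max(range(24), key=lambda h: series[h])
```

The same bucketing works for days, weeks, or months, which matches the time dimension hierarchy of the data cube.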
Experiments with the web log miner


Virtual-U: six different major components
Goal: understand the usage and user behavior
patterns

Data Cleaning and transformations
• all entries were mapped one-to-one into a
relational database
• the fields site and user action were added
• Problems
– extraneous information => define those entries and
eliminate them
– multiple server requests generated by the same user action
– the same server request generated by multiple user actions
Multi-dimensional data cube construction and manipulation
• summarization (group-bys on different dimensions)
• request / domain / event / session / bandwidth /
error / referring organization / browser summary
Examples
Fig. 2) OLAP analysis of Web log
Fig. 3) Typical event sequence and user behavior pattern analysis
Fig. 4) Web traffic analysis of Web log
Fig. 6) Event trees of month one to four
Discussion and Conclusion
WebLogMiner
• OLAP and data mining techniques
• multi-dimensional data cube
• major strengths
• scalability, interactivity, variety, flexibility

Problems with current log files
• web servers should collect more information
• a new structure is needed ==> would simplify
pre-processing