Path completion


Design and Implementation of
a Web Log Preprocessing
System Supporting Path
Completion
Batchimeg AI lab.
2005.04.19
Outline
• Introduction
• Background
• Related work
• Proposed System
• Experiment and Result
• Conclusion and Future work
Introduction
Web Log Mining Process
Visitors use the Web site (viewing news, e-mail, download, shopping, auction), and the Web server saves the Web log data.
Logged data:
- IP
- OS, Agent
- Time
- URL
- Refer page
- Date
- Cookie
- Method
- Status
- UserID
- bytes
- …
Process: Web log data → Preprocessing → DB → Data Analysis (Pattern Discovery → Pattern Analysis)
My research area: Web log preprocessing
Pattern Analysis:
• Visualization tools
• Knowledge Query
• Intelligent Agents
Background (1/4)

Log format:
210.126.19.93 - - [23/Jan/2005:13:37:12 -0800]
"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705
"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1)" …
– Client IP - 210.126.19.93
– Date - 23/Jan/2005
– Access time - 13:37:12
– Method - GET (requests a page); other methods include POST (sends data to the server) and HEAD (requests headers only)
– Protocol - HTTP/1.1
– Status code - 200 (success); other codes include 301 (redirect), 401 and 500 (errors)
– Size of file - 2705 bytes
– Agent type - Mozilla/4.0
– Operating system - Windows NT
The log contains 285,014 records.
The refer page and the requested URL of a record give one navigation step:
http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225 →
→ http://www.olloo.mn/modules.php?name=News&file=friend&op=FriendSend&sid=8225
After viewing the news article, the visitor (210.126.19.93) sent it to a friend.
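The field breakdown above can be reproduced with a small parser. The following is a minimal sketch, assuming the log uses the Apache combined log format seen in the sample record; the names `LOG_PATTERN` and `parse_line` are illustrative and not part of the original system.

```python
import re

# Assumed layout: Apache "combined" log format, as in the sample record.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<date>[^:]+):(?P<time>\S+) (?P<zone>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log record as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705 '
    '"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
record = parse_line(sample)
print(record["ip"], record["date"], record["time"], record["method"], record["status"])
```

Records that do not follow the assumed layout return None, so they can be counted or inspected separately instead of silently corrupting later steps.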
Background (2/4) - User Identification, Session Identification
User identification determines which user made each access to the Web site.
User IP + Browser (UserID + IP + OS, or cookie) => identifies the users
Session identification finds each user's access pattern and frequent paths.
Preprocessing steps: Log → Cleaning → User Identification → Session Identification → Path Completion → Formatting
User Identification (by IP and browser):
IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0(Windows NT) | A,B,C,D,F,A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/4.0(Windows NT) | N,O
Session Identification:
IP | Browser | Visited pages
202.131.3.100 | Mozilla/5.0(Windows NT) | A,B,C,D,F
202.131.3.100 | Mozilla/5.0(Windows NT) | A,L
202.131.3.100 | Mozilla/4.0 (Win2000) | A,B,G,L
210.126.19.93 | Mozilla/5.0(Windows NT) | N,O
Background (3/4) Server Log and Caching
Missed Page Views at Server
If a client had to request every Web page from the server, browsing would be slower.
The solution to this problem is caching.
Clients and proxy servers save local copies of pages, which are reused for "back" and "forward" navigation.
[Figure: the client requests P3, P4, P3 and P6 in turn. The server sends the pages it receives requests for; the repeated request for P3 is answered from the cache and is never logged by the server.]
Background (4/4) - Path completion
Because of caching, not all requested pages are recorded in the Web log.
Preprocessing steps: Log → Cleaning → User Identification → Session Identification → Path Completion → Formatting
[Figure: topological structure of an example site with pages A.html to Q.html]
Sessions before and after path completion:
Before | After
A,B,C,D,F | A,B,C,D,C,B,F
A,L | A,L
A,B,G,I | A,B,A,G,I
N,O | N,O
Related work
Related works | Using topological structure | Removing images | Removing robot text | User/Session identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X
O – used, X – not used
Proposed System (1/7) (preprocessing)
• Web site's topological structure (find the hyperlink relations between Web pages): constructed from the Web log data on the server
• Data cleaning (eliminate irrelevant information)
• User identification and session identification (identify each user, find each user's access pattern)
• Path completion
• User grouping: after session identification and path completion → user grouping → user identification
• Result
Why preprocessing?
• Preprocessing can take up to 60-80% of the time spent analyzing the data.
• An incomplete preprocessing task can easily result in invalid patterns and wrong conclusions.
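The data cleaning step (eliminating irrelevant information) could be sketched as a simple URL filter. This is a minimal sketch under stated assumptions: the suffix list in `IMAGE_SUFFIXES`, the reading of "robot text" as requests for `/robots.txt`, and the sample URLs other than the first are illustrative and do not come from the original system.

```python
from urllib.parse import urlsplit

# Assumption: these suffixes stand for "image files"; adjust them to the site.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")


def is_relevant(url):
    """Keep page requests; drop image files and robots.txt requests."""
    path = urlsplit(url).path.lower()
    if path.endswith(IMAGE_SUFFIXES):
        return False
    if path == "/robots.txt":
        return False
    return True


def clean(urls):
    """Return only the URLs that should be analyzed."""
    return [url for url in urls if is_relevant(url)]


requests = [
    "/modules.php?name=News&file=article&catid=25&sid=8225",
    "/images/logo.gif",   # hypothetical image request
    "/robots.txt",        # hypothetical crawler request
    "/index.html",        # hypothetical page request
]
cleaned = clean(requests)
print(cleaned)
```

Filtering on the URL path rather than the full URL keeps query strings such as `?name=News` from hiding a file suffix.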
Proposed System (2/7)
Build the site's topological structure
• Helps with data preprocessing and analysis:
- user identification
- path completion
Goal of the proposed system
• Discover similar user groups, relevant page groups and frequently accessed paths
Proposed System (3/7)
Algorithm of Topological Structure (flow chart: Make Topological Structure)
- Begin
- While not at the end of the log file:
  - If the record contains "http" data, enter the URL into URL_Queue
  - If there is another record, continue with it
- While URL_Queue is not empty:
  - Get the head and define its depth
  - Add the link to Topo_Str_DB
- End
Proposed System (4/7) - Make the topological structure
Topological Structure
- input: URL → path and link
- output: complete sitemap (tree) with link, path, depth and referrers queue

Sitemap (depth. page):
0. Index.html (A)
  1. L.html (referrer)
    2. Sport/Team/football.html
    2. Sport/News/Mongolia.html
  1. Sport.html
    2. Sport/Team/
      3. Sport/Team/football.html
    2. Sport/Advice/
  ...

Links:
olloo.mn/L.html
olloo.mn/L.html → Sport/Team/football.html
olloo.mn/L.html → Sport/News/Mongolia.html
olloo.mn/Sport.html
olloo.mn/Sport.html → /Team/football.html
olloo.mn/Sport.html → /Advice/
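The construction of the sitemap from logged links can be illustrated with a breadth-first traversal. The following is a minimal sketch, assuming each log record has already been reduced to a (refer page, requested page) pair; `build_topology` is an illustrative name, and the in-memory dictionaries stand in for URL_Queue and Topo_Str_DB.

```python
from collections import deque


def build_topology(pairs, root):
    """Build the link map and the depth of each page from (referrer, url) pairs."""
    links = {}
    for referrer, url in pairs:
        if referrer and referrer != url:
            links.setdefault(referrer, set()).add(url)

    # Breadth-first traversal: a page gets the depth of its shortest path from root.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for child in sorted(links.get(page, ())):
            if child not in depth:
                depth[child] = depth[page] + 1
                queue.append(child)
    return links, depth


# Link pairs taken from the sitemap example on this slide.
pairs = [
    ("Index.html", "L.html"),
    ("Index.html", "Sport.html"),
    ("L.html", "Sport/Team/football.html"),
    ("L.html", "Sport/News/Mongolia.html"),
    ("Sport.html", "Sport/Team/"),
    ("Sport.html", "Sport/Advice/"),
    ("Sport/Team/", "Sport/Team/football.html"),
]
links, depth = build_topology(pairs, "Index.html")
for page in sorted(depth, key=lambda p: (depth[p], p)):
    print(depth[page], page)
```

In this sketch a page reachable by several routes, such as Sport/Team/football.html, keeps the smallest depth; the slide lists it once per route instead.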
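Proposed System (5/7) - User Identification

Flow chart of User Identification algorithm
- Begin
- While not at the end of the log DB:
  - If the IP is not in IPSet: save the IP, Agent and OS, assign the record to the User Set and increase the user counter
  - Else if the current IP's Agent and OS are the same: the record belongs to the already identified user
  - Else: save the Agent and OS, assign the record to the User Set and increase the user counter
  - If there are other records, continue
- End
The resulting User Set is used for the similar user group.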
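The identification rule (same IP with the same Agent and OS is one user) can be sketched with a dictionary keyed on those three fields. This is a minimal sketch, not the original implementation; `identify_users` is an illustrative name, and the sample records reuse the IP and browser values from the Background slides.

```python
def identify_users(records):
    """Give each distinct (ip, agent, os) combination its own user id."""
    user_ids = {}   # (ip, agent, os) -> user id; plays the role of the User Set
    labelled = []   # user id assigned to each record, in log order
    for ip, agent, os_name in records:
        key = (ip, agent, os_name)
        if key not in user_ids:
            # New IP, or a known IP with a different Agent/OS: count a new user.
            user_ids[key] = len(user_ids) + 1
        labelled.append(user_ids[key])
    return user_ids, labelled


records = [
    ("202.131.3.100", "Mozilla/5.0", "Windows NT"),
    ("202.131.3.100", "Mozilla/4.0", "Win2000"),
    ("210.126.19.93", "Mozilla/4.0", "Windows NT"),
    ("202.131.3.100", "Mozilla/5.0", "Windows NT"),
]
user_ids, labelled = identify_users(records)
print(len(user_ids), labelled)
```

The two records from 202.131.3.100 with different browsers are counted as different users, which matches the user identification table in the Background section.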
Proposed System (6/7) - Session identification

Flow chart of Session Identification algorithm
- Begin
- While not at the end of the log DB:
  - IP not in User Set? Yes: start a new session
  - Time taken > 25.5? Yes: start a new session
  - Refer page empty? Yes: start a new session
  - Otherwise: append the page to the session
  - If there are other records, continue
- Go to path completion
- End
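The session rules can be sketched for the hits of a single identified user. This is a minimal sketch under stated assumptions: the 25.5 threshold is read as minutes (the slide gives no unit), an empty refer page is taken to start a new session, and `split_sessions` and the timestamps are illustrative.

```python
from datetime import datetime, timedelta

# Assumption: the slide's threshold "25.5" is measured in minutes.
SESSION_TIMEOUT = timedelta(minutes=25.5)


def split_sessions(hits, timeout=SESSION_TIMEOUT):
    """Split one user's time-ordered (timestamp, page, referrer) hits into sessions."""
    sessions = []
    last_time = None
    for timestamp, page, referrer in hits:
        starts_new = (
            last_time is None                   # first hit of this user
            or timestamp - last_time > timeout  # too long since the previous hit
            or not referrer                     # empty refer page
        )
        if starts_new:
            sessions.append([])
        sessions[-1].append(page)
        last_time = timestamp
    return sessions


start = datetime(2005, 1, 23, 13, 0, 0)  # hypothetical start time
hits = [
    (start, "A", ""),
    (start + timedelta(minutes=1), "B", "A"),
    (start + timedelta(minutes=2), "C", "B"),
    (start + timedelta(minutes=3), "D", "C"),
    (start + timedelta(minutes=4), "F", "B"),
    (start + timedelta(minutes=40), "A", ""),
    (start + timedelta(minutes=41), "L", "A"),
]
sessions = split_sessions(hits)
print(sessions)
```

With these hits the visited pages A,B,C,D,F,A,L are split into A,B,C,D,F and A,L, as in the session identification example from the Background section.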
Proposed System (7/7) - Path completion

Flow chart of Path completion algorithm
- Begin
- While not at the end of the session set:
  - Does a page in the session contain (link to) the next page in that session?
    - Yes: check the next page
    - No: search for that page in the site map and complete the path
- End
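One way to complete a path is to walk back through the pages already visited until one of them links to the next requested page, and to insert those backtracked pages as cached "back" views. The following is a minimal sketch of that idea, not the original implementation; `complete_path` is an illustrative name, and the link map is an assumption chosen to be consistent with the before/after sessions in the Background section, since the edges of the example site are not given in the text.

```python
def complete_path(session, links):
    """Insert the missing 'back' page views into one session using the site map."""
    if not session:
        return []
    completed = [session[0]]
    trail = [session[0]]  # forward path from the entry page to the current page
    for next_page in session[1:]:
        if next_page not in links.get(trail[-1], ()):
            # Current page does not link to next_page: search earlier pages.
            for i in range(len(trail) - 2, -1, -1):
                if next_page in links.get(trail[i], ()):
                    # Pages revisited through the cache on the way back.
                    completed.extend(reversed(trail[i:-1]))
                    del trail[i + 1:]
                    break
        completed.append(next_page)
        trail.append(next_page)
    return completed


# Assumed hyperlinks of the example site (page -> pages it links to).
links = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
    "N": {"O"},
}
for session in (["A", "B", "C", "D", "F"], ["A", "L"], ["A", "B", "G", "I"], ["N", "O"]):
    print(",".join(session), "->", ",".join(complete_path(session, links)))
```

If no earlier page links to the next page, the sketch leaves the session unchanged at that step rather than guessing a route.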
Experiment (1/4)

www.olloo.mn Raw log data
URLs in Web server log
Experiment (2/4)
Topological Structure
Experiment (3/4)
Cleaning result
[Chart: data cleaning, log size (K) before and after cleaning, y-axis 0 to 60000, for the logs dated 2005.01.03, 2005.01.10, 2005.01.17, 2005.01.31, 2005.02.19, 2005.02.26, 2005.03.14, 2003.03.31, 2003.04.05]
Experiment (4/4)
Result
User group
Path completion
These results help to discover similar user groups, relevant page groups and frequently accessed paths in Web usage mining (WUM).
Interface of Path Completion
Preprocessing System (PCPS)

Start the new project.
Interface of Path Completion
Preprocessing System (PCPS)

Giving the project name and folder
Interface of Path Completion
Preprocessing System (PCPS)

Add the log file to project
Interface of Path Completion
Preprocessing System (PCPS)

Choose the log file to add
Interface of Path Completion
Preprocessing System (PCPS)

Asking whether to remove the image files
Files that should be analyzed …
Files that should be cleaned …
Interface of Path Completion
Preprocessing System (PCPS)

Cleaned log and information
The pages and files to be analyzed
Interface of Path Completion
Preprocessing System (PCPS)

Topological Structure
Interface of Path Completion
Preprocessing System (PCPS)
Browser
Interface of Path Completion
Preprocessing System (PCPS)

System
Comparing other preprocessing approaches to the Proposed System
Related works | Creation of topological structure | Using topological structure | Removing images | Removing robot text | User/Session identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O
O – used, X – not used
Conclusion
Approach | Identified number of accesses | Identified number of users | Identified number of sessions
Without path completion | 18019 | 2812 | 10407
Proposed System | 18019 | 3061 | 11019
• My work focuses on the preprocessing stage of Web log mining and improves pattern discovery.
3061 – 2812 = 249 users are neglected by the approach without path completion.
• This paper presented a new approach and practicable algorithms.
• This approach can achieve better precision than some existing approaches.
Reference
[1] R. Cooley, B. Mobasher, and J. Srivastava (Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA). Web mining: Information and Pattern Discovery on the World Wide Web. 1998.
[2] C. Shahabi and F.B. Kashani. A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking. 2001.
[3] M.S. Chen, J.S. Park, P.S. Yu. Data mining for path traversal patterns in a Web environment. 1996.
[4] H. Mannila, H. Toivonen. Discovering generalized episodes using minimal occurrence. 1996.
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. 1996.
[6] J. Pitkow. In search of reliable usage data on the WWW. 1997.
[7] J. Pitkow, P. Pirolli and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web. 1996.
[8] S. Elo-Dean and M. Viveros. Data mining the IBM official 1996 Olympics Web site.
[9] Open Market Inc. Open Market Web reporter. http://www.openmarket.com, 1996.
[10] net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.
[11] Doru Tanasa, Brigitte Trousse. Advanced data preprocessing for intersites Web Usage Mining. 2004.
[12] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.