Evolving dynamic web pages using web mining

Kartik Menon
Smart Engineering Systems Laboratory
Engineering Management Department
University of Missouri-Rolla
Overview
• Goal
• Web Mining
• General Principle behind web mining
• Web Data
• Web Access Pattern Clustering
• Evolving web pages using cluster information
• Clustering Techniques
• Fuzzy C means
• Experimental Set-up
• Results
• Conclusion and Future work
• Questions
Goal
Cluster similar web access
traversal patterns and train the
system to understand the needs
and demands of different users
accessing the website and use
this information to evolve web
pages.
Web Mining
• Web Mining
Learning about different users
accessing a web page.
• The needs and requirements of the
user
• Web Access Traversal Patterns
• Links which are more popular than
others
• For example www.yahoo.com
» Emails
» Search engine
» News
» Greeting cards
General Principle behind
web mining
• Gather web data from web server
log files
• Cluster web traversal patterns
• Evolve web pages
Web Data
• What information is important for
mining?
– Links traversed (URLs requested)
– Documents downloaded
– Time spent on the web page as
compared to the total time spent
– Web Traffic
– GET or POST messages
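A minimal sketch of how the fields listed above might be pulled out of a server log. It assumes the log is in Common Log Format, which the slides do not state; the sample line and the names `LOG_PATTERN` and `parse_log_line` are hypothetical and exist only for illustration.

```python
import re
from datetime import datetime

# ASSUMPTION: Common Log Format
#   host ident user [time] "METHOD url PROTOCOL" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_log_line(line):
    """Return the mining-relevant fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Timestamps let us estimate time spent on a page from consecutive requests
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields


# Hypothetical log line, not taken from the campus.umr.edu logs
sample = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /students HTTP/1.0" 200 2326'
record = parse_log_line(sample)
print(record["host"], record["method"], record["url"])
```

Grouping the parsed records by host and ordering them by time would give one traversal pattern per user session, which is the input the clustering step needs.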
Web Access Pattern
Clustering
• Find users with similar web access patterns
• Grouping and separating users
• Concise representation of a system's behavior
• Generalize about user needs and interests
Evolving Web Pages
using cluster information
• The cluster information can be used
– To know about users
– Modify the web page
– Web personalization
– Evolving Web pages
Clustering Techniques
• Neural Nets
– Kohonen’s Self Organizing Maps (SOMs)
• Statistical
– K-Means
• Fuzzy Logic
– Fuzzy C Means
– Fuzzy ISODATA
Fuzzy C Means
• Is a data clustering technique where each data point
belongs to a cluster to some degree that is specified
by a membership function
• If
– X is a set of n data sample vectors
– U is a partition of X in c parts
– V are cluster centers
– d^2 is an inner product induced norm
– u is the grade of membership of xk to the cluster i, between 0 and 1
– m is a parameter to increase or decrease the fuzziness
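Fuzzy C Means (contd)
• Objective function:
$$J_m(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} (u_{ik})^m \, d^2(x_k, v_i)$$
• Cluster centers:
$$v_i = \frac{\sum_{k=1}^{n} (u_{ik})^m \, x_k}{\sum_{k=1}^{n} (u_{ik})^m}$$
• Grades of membership:
$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d_{ik}}{d_{jk}} \right)^{2/(m-1)}}$$
• Distance:
$$d_{ik}^2 = \| x_k - v_i \|^2$$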
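The fuzzy c-means iteration can be sketched in plain Python as follows. This is an illustrative sketch, not the code used in the experiment: it assumes scalar (1-D) data points, a random initial partition, and a simple convergence tolerance, none of which the slides specify. The function name `fuzzy_c_means` and its parameters are hypothetical; `u[i][k]` is the grade of membership of point k in cluster i, and `m` is the fuzziness parameter.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iterations=100, tol=1e-6, seed=0):
    """Cluster scalar data points; return (centers, u) with u[i][k] in [0, 1]."""
    rng = random.Random(seed)
    n = len(points)
    # ASSUMPTION: random initial memberships, normalized so each point's grades sum to 1
    u = [[rng.random() for _ in range(n)] for _ in range(c)]
    for k in range(n):
        total = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= total

    centers = [0.0] * c
    exponent = 2.0 / (m - 1.0)
    for _ in range(iterations):
        # Center update: mean of the points weighted by (u_ik)^m
        for i in range(c):
            weights = [u[i][k] ** m for k in range(n)]
            centers[i] = sum(w * x for w, x in zip(weights, points)) / sum(weights)

        # Membership update from the ratios of distances to all centers
        max_change = 0.0
        for k in range(n):
            dists = [abs(points[k] - centers[i]) for i in range(c)]
            zero_hits = [i for i in range(c) if dists[i] == 0.0]
            if zero_hits:
                # The point sits exactly on a center: give it full membership there
                new = [1.0 / len(zero_hits) if i in zero_hits else 0.0 for i in range(c)]
            else:
                new = [
                    1.0 / sum((dists[i] / dists[j]) ** exponent for j in range(c))
                    for i in range(c)
                ]
            for i in range(c):
                max_change = max(max_change, abs(new[i] - u[i][k]))
                u[i][k] = new[i]

        # Stop once the partition no longer changes noticeably
        if max_change < tol:
            break
    return centers, u


# Made-up data with two obvious groups, for illustration only
data = [0.10, 0.12, 0.11, 0.90, 0.92, 0.91]
centers, u = fuzzy_c_means(data, c=2)
print(sorted(centers))
```

Larger values of `m` make the partition fuzzier, and values close to 1 push the memberships toward a hard assignment, as in K-Means.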
Experimental Set-up
• Target the website http://campus.umr.edu.
• Mine the web log files for web data.
• The main problem is to convert the web sites
accessed into numeric values.
• Identify all the URLs that can be reached from this
web page
• Number these URLs from 1 to N, where N is the number
of URLs which can be accessed
• Assign fuzzy weights (w(j)) to each URL that can be
accessed
• A Boolean variable s(j) is defined, which is set to 1 if
the jth URL is accessed by the user; otherwise s(j) is set to
null (0).
Experimental Set-up
(contd.)
• Define the data point x as the number
corresponding to the weights w(j) of all the sites accessed
(s(j) = 1) by the user in that particular user session.
• Apply fuzzy c-means, calculating the Euclidean
distance between each data sample and each cluster center
as dij = |xj - ci|, where xj is the data point and ci is the
center of cluster i.
Fuzzy weights assigned to the URLs:
http://campus.umr.edu (0)
  /students (0.1)
    /registrar (0.11)
      /registrar/star (0.111)
      /registrar/courseinfo (0.112)
    www.umr.edu/~career (0.120)
      /fairs (0.121)
      /jobtrack/* (0.122)
    /departments (0.13)
      /academic.html#art_science (0.131)
      /academic.html#engineering (0.132)
  /staff (0.2)
  /parents (0.3)
  /faculty (0.4)
  /community (0.5)
IP Address | URLs accessed by the user
131.151.9.999 | http://campus.umr.edu, /students, /departments, /departments/academic.html#arts_science
181.147.7.970 | http://campus.umr.edu, /students, /registrar, /registrar/star
181.147.7.972 | http://campus.umr.edu, /students, http://web.umr.edu/~career, /jobtrak/*
181.148.7.979 | http://campus.umr.edu, /students, http://web.umr.edu/~career, /fairs
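The conversion of a user session into a numeric data point might look like the sketch below. The exact formula for x is not recoverable from the slides, so the sum of w(j) * s(j) used here is an assumption, and `encode_session` is a hypothetical helper name. The weights are a subset of those shown in the URL listing.

```python
# Fuzzy weights w(j) for a subset of the URLs in the listing
WEIGHTS = {
    "http://campus.umr.edu": 0.0,
    "/students": 0.1,
    "/registrar": 0.11,
    "/registrar/star": 0.111,
    "/departments": 0.13,
}


def encode_session(accessed_urls, weights=WEIGHTS):
    """Return (s, x): the indicator list s(j) and the numeric data point x."""
    urls = list(weights)  # URLs numbered 1..N in the order they are listed
    accessed = set(accessed_urls)
    # s(j) = 1 if the jth URL was accessed in this session, else 0
    s = [1 if url in accessed else 0 for url in urls]
    # ASSUMPTION: x is the sum of the weights of the accessed URLs
    x = sum(weights[url] * flag for url, flag in zip(urls, s))
    return s, x


# Session of the second user in the table
s, x = encode_session(
    ["http://campus.umr.edu", "/students", "/registrar", "/registrar/star"]
)
print(s, x)
```

Each session then becomes a single scalar, so the distance |xj - ci| used by fuzzy c-means is just the absolute difference between a session value and a cluster center.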
Results: For 2 and 3 clusters
[Plots of the clustering results for 2 and 3 clusters]
Web Page Evolution
• Use the clustered information as
an input to modify the web page, so that
users with similar access patterns get the
same web page, distinct from the one shown to other users
• Adjust the placement of links
• Remove certain links (if possible)
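One way the placement of links could be adjusted from the cluster information is sketched below. The slides do not describe a concrete method, so this is only an assumed approach: links are ranked by their popularity in each cluster, weighted by the user's grade of membership in that cluster. The function `rank_links` and the click counts are hypothetical.

```python
def rank_links(memberships, cluster_link_counts):
    """Order links by membership-weighted popularity across clusters."""
    scores = {}
    for grade, counts in zip(memberships, cluster_link_counts):
        for link, count in counts.items():
            scores[link] = scores.get(link, 0.0) + grade * count
    # Highest score first; ties are broken alphabetically for a stable layout
    return sorted(scores, key=lambda link: (-scores[link], link))


# Hypothetical click counts per cluster, for illustration only
cluster_link_counts = [
    {"/registrar": 40, "/departments": 10, "/fairs": 2},
    {"/registrar": 5, "/departments": 8, "/fairs": 30},
]

# A user who belongs mostly to the first cluster
ordered = rank_links([0.9, 0.1], cluster_link_counts)
print(ordered)
```

Links at the end of the ranking are candidates for removal or for moving to a less prominent position on the page.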
Conclusions
• Fuzzy c-means is an easy way of
clustering similar web access patterns
for different user sessions.
• The use of Euclidean distance was very helpful to
learn more about these web access patterns.
• The experiment provided straightforward results and plots
which were highly interpretable.
• We observe that fuzzy c-means provided stable
results for the different data sets we used.
Future Work
• Use other clustering algorithms
and compare
• Developing self evolving web sites - sites that
improve themselves by learning from user access
patterns
• The results obtained with the fuzzy clustering
algorithms could be used to make recommendations to the
webmaster of http://campus.umr.edu
• Increase the popularity of the web page by
tailoring it more to the needs of the users
accessing it
Questions ???