
Generating Intelligent Links to Web
Pages by Mining Access Patterns of
Individuals and the Community
Benjamin Lambert
Omid Fatemieh
CS598CXZ
Spring 2005
Outline
• Motivation
• The main idea of the project
• Accomplished tasks
• Remaining tasks
• Discussion
The Problem
• The problem we would like to solve is:
– How can we best assist a person browsing the Web by providing links to the pages they are looking for?
• There are many reasons we might want to do this (e.g., pages hidden in a large Web site, broken links, seminar announcements).
Previous Work
• This problem has been studied extensively, and many approaches have been tried.
• The two main ways of solving it are:
– Modeling user behavior (Markov models, HMMs, etc.)
– Data mining for common browsing patterns
• Despite all this prior work, many other techniques have not yet been tried.
Markov Model Approaches
• These primarily model a user's behavior to predict which link on the current page the user will click next.
• This is not useful unless there are many links on a page (e.g. www.perl.com).
Data-Mining Approaches
• These are better able to find pages that are several links away from the current page.
– If we see a sequence of requests for pages A, B, C, D, E occurring frequently, we may consider adding a shortcut from A to E.
New Ideas for Solving This Problem
• Using recent activity to make recommendations.
• Using the contents of Web pages to make recommendations.
• Combining data mining and user modeling approaches.
• Using a machine learning approach.
Data
• Data: Web server logs
– CS department Web logs from Dec 6, 2004, to Feb 28, 2005
(thanks to Chuck Thompson)
– NASA Kennedy Space Center logs, collected over July and August 1995 (available freely online)
• The logs are long lists of Web page requests; each request is represented by:
– The requester's IP address
– The time and date of the request
– The page requested
– Etc.
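
As a rough illustration of what parsing these logs involves, here is a minimal Python sketch (the project itself used Perl scripts; the Common Log Format layout below is an assumption, though both the CS and NASA logs are close to it):

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
# (assumed layout; both the CS and NASA logs are close to this).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)(?: [^"]*)?" (?P<status>\d{3}) \S+'
)

def parse_request(line):
    """Return (host, time, path, status) for one log line, or None."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    return (m.group('host'), m.group('time'),
            m.group('path'), int(m.group('status')))
```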
Data Cleaning
• First, for privacy reasons, the data had to be "sanitized": the actual IP addresses were removed before we could access it.
• Requests for .gif, .jpg, .css, etc. files should be discarded.
– Looking only at the extension of the requested file is not enough; e.g. "GET /research/areas.php?area=proglang HTTP/1.1" has no extension.
• Requests from crawlers should be removed (e.g. hosts that fetch robots.txt).
• Unsuccessful GETs (keep code 200 only, not 404).
• Refreshes (consecutive requests for the same page).
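
These cleaning steps could be applied with filters along the following lines (a Python sketch; the crawler host list and the per-host "last page" state are illustrative assumptions):

```python
# Extensions of static resources to discard (per the cleaning rules above).
STATIC_EXTENSIONS = ('.gif', '.jpg', '.jpeg', '.png', '.css', '.js', '.ico')

def keep_request(host, path, status, last_path_by_host, crawler_hosts=frozenset()):
    """Return True if a request should survive cleaning."""
    # Strip the query string first, so that a request like
    # /research/areas.php?area=proglang is judged by ".php".
    base = path.split('?', 1)[0].lower()
    if base.endswith(STATIC_EXTENSIONS):
        return False                      # static image/style files
    if status != 200:
        return False                      # keep successful GETs only
    if host in crawler_hosts:
        return False                      # known crawlers
    if last_path_by_host.get(host) == path:
        return False                      # refresh: same page twice in a row
    last_path_by_host[host] = path
    return True
```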
Recommendations by a First Order Markov Model
• We wrote Perl scripts to parse and store the cleaned data.
• We implemented a recommender using a simple first order Markov model.
– This provides the user with links to the pages most frequently clicked next from the current page.
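
A minimal Python sketch of such a recommender (the actual implementation was in Perl; how sessions are segmented out of the logs is assumed here):

```python
from collections import defaultdict, Counter

class FirstOrderMarkovRecommender:
    """Recommend the pages most often visited next from the current page."""

    def __init__(self):
        # transitions[current_page][next_page] = observed click count
        self.transitions = defaultdict(Counter)

    def train(self, sessions):
        """sessions: iterable of lists of page paths, one list per visit."""
        for session in sessions:
            for current_page, next_page in zip(session, session[1:]):
                self.transitions[current_page][next_page] += 1

    def recommend(self, current_page, k=1):
        """Return up to k most frequently followed pages."""
        counts = self.transitions.get(current_page)
        if not counts:
            return []
        return [page for page, _ in counts.most_common(k)]
```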
Results for First Order Markov Model
• Evaluation was performed on the existing logs.
• If the next click in a browsing session is the recommended page, it is a hit; otherwise it is a miss.
• Hit ratio when only one page is recommended:
– CS logs:
• Number of testing records: approx. 500,000
• Hit ratio: 18.7%
– NASA logs:
• Number of testing records: approx. 2 million for one month
• Hit ratio: 30%
• Other researchers have performed their evaluations similarly; in some cases, a hit is counted when any recommended page is browsed to.
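
The hit ratio above can be computed by replaying held-out sessions against the model (a sketch using the recommender from the previous slide; the train/test split is assumed):

```python
def hit_ratio(model, test_sessions, k=1):
    """Fraction of clicks whose actual next page is among the model's
    top-k recommendations (a hit, as defined above)."""
    hits = total = 0
    for session in test_sessions:
        for current_page, next_page in zip(session, session[1:]):
            total += 1
            if next_page in model.recommend(current_page, k):
                hits += 1
    return hits / total if total else 0.0
```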
Using Recent Activity
• Suppose there is an important event
somewhere in the Siebel Center at 4pm.
– Many people might go to
http://www.cs.uiuc.edu to find the location
between 3:45 and 4:05!
– It would be good to automatically discover
this and generate the link for users
Dynamic Markov Model
• To model such recent browsing activity, we
need a more sophisticated model that more
heavily weights recent browsing activity.
• To do this, we implemented an “online”
recommending model using “dynamic first
order Markov Models”
• We set a threshold t
– Only the requests within the past t minutes affect
the model
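
One way to realize such a sliding window (a sketch; feeding the model a time-stamped stream of transitions is an assumption about the setup):

```python
from collections import defaultdict, Counter, deque

class DynamicMarkovRecommender:
    """First order Markov model in which only transitions observed in
    the past t minutes influence recommendations."""

    def __init__(self, t_minutes):
        self.window = t_minutes * 60       # threshold t, in seconds
        self.recent = deque()              # (timestamp, current, next)
        self.transitions = defaultdict(Counter)

    def observe(self, timestamp, current_page, next_page):
        self.recent.append((timestamp, current_page, next_page))
        self.transitions[current_page][next_page] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Forget transitions older than the threshold t.
        while self.recent and now - self.recent[0][0] > self.window:
            _, cur, nxt = self.recent.popleft()
            self.transitions[cur][nxt] -= 1
            if self.transitions[cur][nxt] == 0:
                del self.transitions[cur][nxt]

    def recommend(self, current_page, k=1):
        counts = self.transitions.get(current_page)
        if not counts:
            return []
        return [page for page, _ in counts.most_common(k)]
```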
Dynamic Markov Model Results
• This is too simplistic to work.
• Most successful recommendations are for major browsing patterns that do not change over time:
– /info/prospective.php -> /graduate/admissions.php
• Accuracy decreases as t decreases.
• We would need to recognize that the user is looking for ephemeral pages.
Using the Web Page Contents (To Do)
• Can we use the content of the previously browsed pages to recommend some links to the user?
– E.g., if the last 10 pages the user has browsed contain the word IR, recommend Prof. Zhai's web page.
• Perhaps we can use a machine learning algorithm to cast this as a multi-class classification problem, as sketched below.
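
A hedged sketch of what that classifier could look like, here with scikit-learn (purely illustrative: the library, feature choice, and training-data layout are all assumptions, not part of the project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_content_recommender(histories, next_pages):
    """histories: concatenated text of a user's recently browsed pages;
    next_pages: the page visited next, used as the class label."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(histories, next_pages)
    return model

# model.predict(["... information retrieval IR ..."]) would then return
# a recommended page path as the predicted class.
```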
Hybrid approaches (To Do)
• How to combine user-modeling with
pattern mining?
• How to best combine individual user
patterns (personalizations) with
collective patterns (recommender
systems)?
Other Things To Do
• Incorporate pattern mining
• Experimentally evaluate new models
and combinations
• Actual Implementation (CGI scripts and
cookies)
• Higher order Markov models (see the sketch after this list)
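
For the higher order Markov models item, a second order variant would condition on the last two pages rather than one (a sketch; that this is the intended generalization is an assumption):

```python
from collections import defaultdict, Counter

class SecondOrderMarkovRecommender:
    """Condition the next-page prediction on the last two pages visited."""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def train(self, sessions):
        for s in sessions:
            for prev_page, cur_page, next_page in zip(s, s[1:], s[2:]):
                self.transitions[(prev_page, cur_page)][next_page] += 1

    def recommend(self, prev_page, cur_page, k=1):
        counts = self.transitions.get((prev_page, cur_page))
        return [p for p, _ in counts.most_common(k)] if counts else []
```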
Other Paradigms for Making Recommendations (Future Work)
• Recommendations as:
– An AI planning problem?
– An optimization problem?
– Others?
Discussion
• Ideas about the model?
• Other paradigms to consider?
• How can we incorporate content?
• Suggestions?
Thank You.