Towards Adaptive Websites: A Conceptual Framework and Case

Download Report

Transcript Towards Adaptive Websites: A Conceptual Framework and Case

Towards Adaptive Websites:
A Conceptual Framework and Case
Study
Mike Perkowitz
Oren Etzioni
Presented By
Ben Childs
Sergey Petrov
James Hunter
Issam Souilah
Department of Computer Science, November 2005
Agenda
•
Introduction
–
–
–
–
–
•
The PageGather Algorithm
–
–
–
–
•
Description of The Algorithm
Experimental Method
Time Complexity
Comparison with Related Algorithms
The IndexFinder Algorithm
–
–
•
•
•
•
•
Paper Outline
Motivation
What are Adaptive Websites?
Approaches to Adaptation
The Index Page Synthesis Use Case
Conceptual Cluster Mining
Experiments
Implementations
Conclusions
Related Work
Summary
Resources
Paper Outline
• Published 1999 (latest version 2001)
• Explores adaptive web sites
• Describes the design space of adaptive web
sites
• Considers a case study: index page synthesis
• Presents two algorithms:
– PageGather: a statistical cluster mining algorithm
– IndexFinder: a conceptual cluster mining algorithm
Motivation
• Designing a complex web site so it readily
yields its information is tricky, because:
1. Different visitors have distinct goals
2. Same user may seek different information at different times
3. Many sites outgrow their original design, accumulating links and
pages in unlikely places
4. A site may be designed for a particular use, but may be used in
unanticipated ways in practice
• Too often, web sites are fossils cast in HTML,
while web navigation is dynamic, timedependent, and idiosyncratic
What Are Adaptive Websites?
• Adaptive websites are sites that
automatically improve their organization and
presentation by learning from visitor access
patterns
• They mine the data buried in web server logs
to produce more easily navigable websites
• To demonstrate the feasibility of adaptive
websites, the index page synthesis use case
is considered
Approaches to Adaptation
• Aim is to make a website “better”, so we need a clear
quality measure
• Quality measure as a function of variables:
– How often users find what they are looking for
– How many clicks users have to make to get to their goal
– How much time users spend reading link text and scrolling
through pages
• Two approaches to adaptation:
– Content-based : organizes and presents pages based on their
content.
– Access-based : uses the way past visitors have interacted with the
site to guide how information is structured.
• Content-based and access-based adaptations are
complementary and may be used together
The Index Page Synthesis Case Study
(1)
• Page synthesis is the automatic creation of
web pages
• An index page is a page consisting of links to
a set of pages that cover a particular topic
• Index page synthesis problem: given a web
site and a visitor access log, create new index
pages containing collections of links to
related but currently unlinked pages
The Index Page Synthesis Case Study (2)
The Index Page Synthesis Problem:
1. What are the contents (i.e. hyperlinks) of the index
page?
2. How are the hyperlinks on the page ordered?
3. How are the hyperlinks labeled?
4. What is the title of the page? Does it correspond to
a coherent concept?
5. Is it appropriate to add the page to the site? If so,
where?
Solutions
• 2 Algorithms have been suggested
by the authors of the paper
• PageGather
• IndexFinder
The PageGather Algorithm
1. The PageGather algorithm is a statistical
cluster mining algorithm
2. Clustering algorithms take a collection of
objects as their input and produce a
partition of the collection
3. Cluster mining is a variation on traditional
clustering that may place a single object in
multiple overlapping clusters
4. PageGather uses cluster mining to find
collections of related pages at a website
Description of PageGather
1.
2.
3.
4.
5.
6.
Process the access log into visits
Compute the co-occurrence frequencies between
pages and create a similarity matrix
Create the graph corresponding to the matrix, and
find maximal cliques (or connected components) in
the graph
Rank the clusters found, and choose which to
output
Eliminate overlap among the clusters
Present it to the webmaster for evaluation
Experimental Method
• Experiments draw on data collected
from three distinct collections of web
pages
• The effectiveness of index page
synthesis is based on three factors:
1. Impact: How many people use the new pages and how often
2. Benefit: How much effort is saved by those who visit the pages
3. Recall: How much information sought by the user was actually
found
Time Complexity
• What is the running time of PageGather?
• Let L be the number of page views in the log and N
the number of pages at the site
• Step (1) requires O(L log L) time: page views must be
sorted by origin and time
• Step (2) requires O(L + N2) time: must process the
log and create a matrix of size O(N2)
• In step (3) we may find either connected components
(linear in the size of the graph) or cliques
(exponential in general, but since size of discovered
clusters is bound to k, this step is a polynomial of
degree k)
Comparison with related
algorithms
• PageGather significantly outperforms
other statistical clustering algorithms,
but is not as well as human-authored
clusters
The IndexFinder Algorithm
• PageGather relies on a statistical approach to
discovering candidate link sets; its candidates
do not correspond precisely to intuitive
concepts, whereas human-authored index
pages do
• An algorithm that finds only candidate link
sets that are conceptually coherent is desired
• IndexFinder is a key extension to PageGather
that guarantees that only sets corresponding
to topics are generated
IndexFinder - Problem Definition
• Given:
– A data collection D (e.g., a set of pages at a web site)
– A pairwise similarity measure m defined over D (e.g., page
co-occurrence frequencies derived from access logs)
– A conceptual language L for describing elements of D (e.g.,
conjunctions of descriptive features)
– A description in L of each object in D
• Output all subsets c of D such that:
– c is highly cohesive with respect to m (e.g., the average
pairwise similarity of the objects in c exceeds some
threshold)
– c corresponds to a concept expressible in L
IndexFinder – Previous work
• Relevant previous work of three types:
– Statistical approaches (e.g., PageGather): useful for finding
cohesive sets in large collections of data, but make no
attempt to ensure that their results correspond to an intuitive
concept
– Conceptual clustering algorithms (e.g., Fisher’s COBWEB
[4]): partition a data collection into clusters of similar objects.
Moreover, objects are described in a conceptual descriptive
language
– Concept learning algorithms: aim to find a conceptual
description of a set of objects from a data collection (note
that data needs to be classified in advance)
Experiments
• Experiments show that IndexFinder
outperforms both PageGather and COBWEB
and is close to the performance of the
human-authored index pages
Implementations
• And More:
– Use both user’s path and model to guess what pages they
are interested in seeing e.g., AVANTI Project [1]
– Automatic user categorization
– Hybrid approach
– Footprints [2] uses the metaphor of travellers creating
footpaths in the grass over time
– Using meta-information e.g., XML, Apple’s Meta-Content
Format, STRUDEL [3]
– Client-side customization
Conclusions (1)
• PageGather and IndexFinder outperform traditional
methods including: the Apriori data mining algorithm,
standard clustering algorithms and the COBWEB
conceptual clustering algorithm
• PageGather and IndexFinder are instances of novel,
domain-independent approaches to unsupervised
data mining
• Extensions and applications to these approaches
outside the domain of adaptive websites can be
found
Conclusions (2)
• Future work may focus on the automatic
placement of new index pages at the website
• Automatically suggesting names for the new
pages, and deciding where in the site they
should be located
• Index page synthesis itself is a step towards
the long-term goal of change in view:
adaptive websites that automatically suggest
re-organisations of their contents based on
visitor access patterns
Related Work
• By the authors:
– Mainly updates to the original paper (most recent
one in 2001)
• By others:
– Adaplix [5] : A system that extends HTML by
introducing conditional statements and an
inductive logic programming component to learn
the user's browsing preferences
– WebWatcher [6]: A “tour guide” of the web. It
accompanies the user from page to page,
highlighting hyperlinks that it believes will be of
interest
Summary
• We have covered:
– Adaptive Websites
– The Index Page Synthesis Use Case
– The PageGather Algorithm
– The IndexFinder Algorithm
– Implementations
– Related Work
Any Questions?
Resources
•
•
[1] J. Fink, A Kobsa, and A. Nill.
User-oriented Adaptivity and Adaptability in the AVANTI Project.
In Designing for the Web: Empirical Studies, Microsoft Usability Group, Redmond (WA).,
1996.
[2] A. Wexelblat and P. Maes.
Footprints: History-rich web browsing.
In Proc. Conf. Computer-Assisted Information Retrieval (RIAO), pages 75-84, 1997.
•
[3] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu.
System Demonstration - Strudel: A Web-site Management System.
In ACM SIGMOD Conference on Management of Data, 1997.
•
[4] D. Fisher. Knowledge Acquisition Via Incremental Conceptual Clustering. Machine
Learning, 2:139-172, 1987
•
[5] Nico Jacobs. Adaplix: Towards Adaptive Websites. In P. De Bra and L.
•
Hardman, editors, Proceedings van de Informatiewetenschap'99 Conferentie,
pages 22--28. Eindhoven University of Technology, November 1999
[6] URL : http://www.cs.cmu.edu/~webwatcher, accessed on 22 November 2005