Transcript powerpoint

Knowledge discovery
on the Web
Georgios Paliouras
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
N.C.S.R. “Demokritos”
[email protected]
http://www.iit.demokritos.gr/~paliourg
1
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
2
Contents
Introduction
Information overload
The need for intelligent information access
Knowledge discovery approaches
Knowledge discovery from text & links
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
3
WWW: the new face of the Net
Once upon a time, the Internet was a forum for
exchanging information. Then …
…came
the Web.
The Web introduced
new capabilities …
…and attracted many more people …
…increasing
commercial interest …
…and turning the Net into a real forum …
© Georgios Paliouras (March 2003)
4
Information overload
…as more people
started using it ...
…attracting even
more people ...
…increasing
…the quantity
the of
quantity
information
of online
on the
information
Web increased...
further...
…and leading to the overload
of information for the users ...
© Georgios Paliouras (March 2003)
5
WWW: an expanding forum
The Web is large and volatile:
More than 600.000.000 users online
More than 800.000 sign up every day
More than 9.000.000 Web sites
More than 300.000.000.000 pages online
Less than 50% of Web sites will be there next year
… leading to the abundance problem:
“99% of online information is of no
interest to 99% of the people”
© Georgios Paliouras (March 2003)
6
Information access services
A number of services aim to help the user gain
access to online information and products ...
… but can they really cope?
© Georgios Paliouras (March 2003)
7
New requirements
Current indexing does not allow for wide coverage:
Less than 5% of the Web covered by search engines.
What I want is hardly ever ranked high enough.
Product information in catalogues is often biased
towards specific suppliers and outdated.
Product descriptions are incomplete and insufficient
for comparison purposes.
‘E’ in ‘E-commerce’ stands for ‘English’: More than
70% of the Web is English.
… and many more problems lead to the conclusion ...
… that more intelligent solutions are needed!
© Georgios Paliouras (March 2003)
8
A new generation of services
Some have already made their way to the market…
… many more are being developed as I speak …
© Georgios Paliouras (March 2003)
9
Approaches to Web mining
Primary data (Web content):
Mainly text,
with some multimedia content (increasing)
and mark-up commands including hyperlinks.
Underlying databases (not directly accessible).
 Knowledge discovery from text and links
Pattern discovery in unstructured textual data.
Pattern discovery in the Web graph / hypertext.
© Georgios Paliouras (March 2003)
10
Approaches to Web mining
Secondary data (Web usage):
Access logs collected by servers,
potentially using cookies,
and a variety of navigational information collected
by Web clients (mainly JavaScript agents).
 Knowledge Discovery from usage data
Discovery of interesting usage patterns, mainly
from server logs.
Web personalization & Web intelligence.
© Georgios Paliouras (March 2003)
11
Contents
Introduction
Knowledge discovery from text & links
Introduction
Information filtering and retrieval
Information extraction
Ontology learning
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
12
Information access
Goals:
Organize documents into categories.
Assign new documents to the categories.
Retrieve information that matches a user query.
Long history of manually-constructed and
statistical document category models
Dominating statistical idea:
TFIDF=term frequency * inverse document frequency
Problems on the Web:
Huge scale and high volatility demand automation.
© Georgios Paliouras (March 2003)
13
Text mining
Knowledge (pattern) discovery in textual data.
Clarifying common misconceptions:
Text mining is NOT about assigning documents to
thematic categories, but about learning document
classifiers.
Text mining is NOT about extracting information
from text, but about learning information extraction
patterns.
Difficulty: unstructured format of textual data.
© Georgios Paliouras (March 2003)
14
Approaches to text mining
Combination of language engineering (LE),
machine learning (ML) and statistical methods:
LE
MLStats
MLStats
LE
© Georgios Paliouras (March 2003)
15
Hyperlink information is useful
Information access can be improved by
identifying: authoritative pages (authorities)
and resource index pages (hubs).
Linked pages often contain complementary
information (e.g. product offers).
Thematically related pages are often linked,
either directly or indirectly.
© Georgios Paliouras (March 2003)
16
Contents
Introduction
Knowledge discovery from text & links
Introduction
Information filtering and retrieval
Information extraction
Ontology learning
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
17
Document category modelling
Training documents (pre-classified)
Stopword removal (and, the, etc.)
Stemming (‘played’  ‘play’)
Pre-processing
Bag-of-words coding
Dimensionality reduction Statistical selection/combination of
characteristic terms (MI, PCA)
Machine Learning
Supervised classifier learning
Category models (classifiers)
© Georgios Paliouras (March 2003)
18
Document category modelling
Machine Learning methods used:
Memory-based learning (k-nearest neighbor).
Decision-tree and decision-rule induction.
Probabilistic learning (naive Bayes classifiers).
Support vector machines.
Boosting (combined usually with decision trees).
Maximum entropy modeling.
Neural networks (multi-layered perceptrons).
Problems: high dimensionality (sparseness), large
training sets (scale), overlapping categories.
© Georgios Paliouras (March 2003)
19
Document category modelling
Example: Filtering spam email.
Task: classify incoming email as spam
and legitimate (2 document categories).
Simple blacklist and keyword-based
methods have failed.
More intelligent, adaptive approaches
are needed (e.g. naive Bayesian
category modeling).
© Georgios Paliouras (March 2003)
20
Document category modelling
Step 1 (linguistic pre-processing): Tokenization,
removal of stopwords, stemming/lemmatization.
Step 2 (vector representation): bag-of-words or
n-gram modeling (n=2,3).
Step 3 (feature selection): information gain
evaluation.
Step 4 (machine learning): Bayesian modeling,
using word/n-gram frequency.
© Georgios Paliouras (March 2003)
21
Link structure analysis
Improve information retrieval by scoring Web
pages according to their importance in the
Web or a thematic sub-domain of it.
Nodes with large fan-in (authorities) provide
high quality information.
Nodes with large fan-out (hubs) are good
starting points.
© Georgios Paliouras (March 2003)
22
Link structure analysis
The HITS algorithm [Kleinberg, ACM Journal 1999]:
Given a set of Web pages, e.g. as generated by a query,
expand the base set by including pages that are linked to by
the ones in the initial set or link to them,
assign a hub and an authority weight to each page,
initialised to 1,
update the authority weight of page p according to the hub
weights of the pages that link to it:
a p  q|q p hq
update the hub weight of page p according to the authority
weights of the pages that it links to:
h p   q| p  q a q
repeat the weight update for a given number of times,
return a list of the pages ranked by their weights.
© Georgios Paliouras (March 2003)
23
Link structure analysis
Interesting issues:
Does the social network hypothesis hold, i.e.,
“authorities are highly cited”? This may be
unrealistic in competitive commercial domains.
What happens if link structure adapts to the
method, e.g. unrelated pages link to each other to
increase their rating?
What about interesting new pages? How will
people get to them?
© Georgios Paliouras (March 2003)
24
Focused crawling & spidering
Crawling/Spidering: Automatic navigation
through the Web by robots with the aim of
indexing the Web.
Crawling v. Spidering (subjective): inter-site v.
intra-site navigation.
Focused crawling/spidering: Efficient,
thematic indexing of relevant Web pages, e.g.
maintenance of a thematic portal.
Underlying assumption similar to HITS:
thematically similar pages are linked.
© Georgios Paliouras (March 2003)
25
Focused crawling & spidering
Focused crawling [Chakrabarti et al., WWW 1999]:
Given an initial set of Web pages about a topic, e.g. as
found in a Web directory,
use document category modelling to build a topic classifier,
extract the hyperlinks within the initial set of pages and add
them to a queue of pages to be visited,
retrieve pages from the queue,
use the classifier to assess the relevance of retrieved pages,
use a variant of HITS to assign a hub score to pages and the
hyperlinks in the queue,
re-sort the links in the queue according to their hub score,
continue the retrieval of new pages, periodically updating
the score of hyperlinks in the queue.
© Georgios Paliouras (March 2003)
26
Focused crawling & spidering
Domain-specific spidering:
Goal: retrieve interesting pages, without traversing
the whole site.
Differences from crawling:
The site is much more restricted in size and thematic
diversity than the whole of the Web.
Social network analysis is less relevant within a site (no
hubs and authorities).
Requirement: link scoring using local features, e.g.
the anchor text and the textual context.
© Georgios Paliouras (March 2003)
27
Focused crawling & spidering
Domain-specific spidering [Rennie & McCallum, ICML 1999]:
TRAINING:
Given a set of Web sites, within which interesting pages have been
manually identified,
use a simplified version of Q-learning (reinforcement learning) to
learn a good navigation policy:
Assign Q values to hyperlinks, according to their distance from
interesting pages.
Learn a model to predict the Q value of a link using local features.
USAGE:
Start at the home page of a site.
Extract all links, predict their Q values and add them to a queue.
Visit the first page in the queue and repeat the process.
© Georgios Paliouras (March 2003)
28
Focused crawling & spidering
Interesting issues:
Does the social network hypothesis hold, i.e.,
“related pages link to each other”? This may be
unrealistic in competitive commercial domains.
How dependant is focused crawling on the initial
set of Web pages?
What happens with multi-theme hubs?
What about interesting new pages? Will they be
discovered?
What are good descriptive features for link scoring
in domain-specific spidering? Can the same
features help to improve focused crawling?
© Georgios Paliouras (March 2003)
29
Contents
Introduction
Knowledge discovery from text & links
Introduction
Information filtering and retrieval
Information extraction
Ontology learning
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
30
Information extraction
Goals:
Identify interesting “events” in unstructured text.
Extract information related to the events and store
it in structured templates.
Typical application:
Information extraction from newsfeeds.
Difficulties:
Deals with unstructured or semi-structured text.
Identification of entities and relations.
Usually requires some understanding of the text.
© Georgios Paliouras (March 2003)
31
A typical extraction system
Morphology
Syntax
Semantics
Discourse
Unstructured text and database
schema (event templates)
Lemmatization (‘said’  ‘say’),
Sentence and word separation.
Part-of-speech tagging, etc.
Shallow syntactic parsing.
Named-entity recognition.
Co-reference resolution.
Sense disambiguation.
Pattern matching.
Structured data (filled templates)
© Georgios Paliouras (March 2003)
32
Information extraction
Long history before the birth of the Web.
One of the hardest Language Engineering tasks.
ARPA Message Understanding Conferences have
pushed the field forward.
Information overload on the Web has increased the
need for IE systems.
IE is achievable for very narrow domains
(e.g. mergers and acquisitions).
Manual construction of IE systems is time-consuming.
Machine learning can help solve this problem.
© Georgios Paliouras (March 2003)
33
Customization of IE systems
Learning can be used to customize individual
modules of an IE system:
Sentence splitting (rule learning).
Part-of-speech tagging (Brill tagger).
Named-entity recognition (HMMs).
Co-reference resolution (rule learning).
Word sense disambiguation (rule learning).
Learning speeds up the customization to new
domains and new languages!
© Georgios Paliouras (March 2003)
34
Extraction pattern discovery
Morphology
Syntax
Semantics
Pattern Discovery
Unstructured text and database
schema (event templates)
Lemmatization (‘said’  ‘say),
Sentence and word separation.
Part-of-speech tagging, etc.
Shallow syntactic parsing.
Named-entity recognition.
Co-reference resolution.
Sense disambiguation.
IE pattern discovery.
IE patterns
© Georgios Paliouras (March 2003)
35
Extraction pattern discovery
Machine Learning methods used:
Decision-rule learning.
Grammar induction.
Co-training (semi-supervised learning).
Clustering.
Difficulties:
Difficult to produce hand-tagged training data.
Extensive use of LE.
Lack of sufficient background knowledge.
© Georgios Paliouras (March 2003)
36
Wrappers/fact extraction
Simplified information extraction:
Extract interesting facts from Web documents.
Assumes structure in the documents (usually
dynamically generated from databases).
Reduced demand for pre-processing and LE.
Typical application:
Product comparison services (price, availability, …).
Difficulties:
Semi-structured data.
Different underlying database schemata and
presentation formats.
© Georgios Paliouras (March 2003)
37
Wrappers/fact extraction
Example:
<HTML><TITLE> Some Country Codes </TITLE>
<BODY><B> Some Country Codes </B> <P>
<B> Congo </B> <I> 242 </I>
<B> Egypt </B> <I> 20 </I>
<B> Greece </B> <I> 30 </I>
<B> Spain </B> <I> 34 </I>
<HR> <B> End </B> </BODY> </HTML>
Wrapper (page P)
Skip past first occurrence of <P> in P
While (next <B> is before next <HR> in P)
For each <l, r>  { (<B>, </B>}) , (<I>, </I>) }
Extract the text between l and r
return <country, code > extracted pairs
© Georgios Paliouras (March 2003)
Country
Code
Congo
242
Egypt
20
Greece
30
Spain
34
38
Wrapper induction
Training documents (semistructured)
Data pre-processing
Abstraction of mark-up structure
(often omitted)
Database schema (interesting facts)
Machine Learning
Structural/sequence learning
Fact extraction patterns (wrapper)
© Georgios Paliouras (March 2003)
39
Wrapper induction
STALKER [Muslea et. al., AAAI 1998]
Features:
- Predefined hierarchical structure
of the page (EC)
- Single-slot extraction patterns
- Landmark-based extraction patterns
Sample Page:
Name: Taco Bell <br> <p>
-LA: 400 Pico; (213) 323-5545,
(800) 222-1111.
211 Flower; (213) 424-7645. <p>
-Venice: 20 Vernon; (310) 888-1010. <p>
<hr>
© Georgios Paliouras (March 2003)
Embedded Catalog Tree (EC):
Doc ::= Restaurant LIST(City)
City ::= CityName LIST(Loc)
Loc ::= Number Street LIST(Phone)
Phone ::= AreaCode PhoneNumber
Extraction patterns:
Restaurant: *’Name:’(*)’<br>’
LIST(City) : *’<p>’(*)’<hr>’
City (iteration): *’-‘(*)’<p>’
CityName: *(*)’:’
LIST(Loc): …… etc.
40
Wrapper induction
Hidden Markov Models [McCallum et. al., ICML 2000]
Each document is modelled by a stochastic process
S1
S2
S3
S4
B
P
O1
O2
T
O3
S
O4
HMM states are labelled:
B = “background”, P = “prefix”, T = “target”, S = “suffix”
A label sequence associated with each training sequence
O = {The software company ITC acquired the …} (Observed)
L={ B
B
P
T
S
B…} (Hidden)
Compute P({S1, ..ST} | {O1, .. OT}). (Viterbi algorithm)
© Georgios Paliouras (March 2003)
41
Wrapper induction
A separate HMM trained for each field
HMM structure fixed
e.g. 1 Background state, 2 Prefix, 2 Suffix, 2
Target
Baum-Welch not needed to train the HMM
Parameter values acquired by relative
frequencies:
c(i  j )
aij =
 sS c(i  s)
Transition Probabilities
© Georgios Paliouras (March 2003)
c( j  k )
bj(k) =
vV c( j  v)
Emission Probabilities
42
Wrapper induction
Hard issues:
How much LE? Traditional IE is a source of ideas.
How domain-specific?
How unstructured are the data/texts?
Generalization to unseen sites?
Multiple descriptions per page?
Descriptions that span more than one pages?
How much labelled data? Can we learn the template?
Could active learning help?
© Georgios Paliouras (March 2003)
43
Contents
Introduction
Knowledge discovery from text & links
Introduction
Information filtering and retrieval
Information extraction
Ontology learning
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
44
Ontology learning
Training documents (unclassified)
Stopword removal (and, the, etc.)
Stemming (‘played’  ‘play’)
Syntactic/Semantic analysis
Pre-processing
Bag-of-words coding
Dimensionality reduction Hand-made thesauri (Wordnet)
Term co-occurrence (LSI)
Machine Learning Unsupervised learning (clustering
and association discovery)
Ontologies
© Georgios Paliouras (March 2003)
45
Ontology learning
Hierarchical clustering is most suitable:
Agglomerative clustering
Conceptual clustering (COBWEB)
Model-based clustering (EM-type: MCLUST)
… but flat clustering can also be adapted:
K-means and its variants
Bayesian clustering (Autoclass)
Neural networks (self-organizing maps)
Association discovery (e.g. Apriori) for nontaxonomic relations.
© Georgios Paliouras (March 2003)
46
Ontology learning
Example: Acquisition of an ontology for tourist
information. [based on Maedche & Staab, ECAI 2000]
© Georgios Paliouras (March 2003)
47
Ontology learning
Source data: Web pages of tourist sites.
Background knowledge: generic and domain-specific
ontologies.
Target users: Tourist directories, large travel agencies.
Goals:
Identify types of page (e.g. room descriptions) and
terms/entities inside pages (e.g. hotel addresses).
Identify taxonomic relations between concepts (e.g.
accommodation – hotel).
Identify non-taxonomic relations between concepts (e.g.
accommodation – area).
© Georgios Paliouras (March 2003)
48
Ontology learning
Heavy linguistic pre-processing:
Syntactic analysis,
e.g. verb subcategorization frames:
verb(arrive) -> prep(at), dir_obj(Torino).
Semantic analysis,
e.g. named entity recognition:
‘Via Lagrange’ -> Street name
e.g. special dependency relations:
‘Hotel Concord in Torino’
© Georgios Paliouras (March 2003)
49
Ontology learning
Various types of clustering methods have been
used for document clustering.
The ASIUM algorithm [Nédellec, LLL 1999] creates a
taxonomy from subcategorization frames,
using variabilization, predicate invention, and
conceptual clustering.
In [Maedche & Staab, ECAI 2000] non-taxonomic
relations are discovered using association rule
mining on special binary relations of concepts.
© Georgios Paliouras (March 2003)
50
Ontology learning
Hard issues:
How much LE?
How domain-specific?
Reliance on existing ontologies.
Which non-taxonomic relations?
What is an instance (document, paragraph, entity)?
Labelling of new concepts and relations.
Unlikely to have labelled data.
Evaluation.
© Georgios Paliouras (March 2003)
51
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Personalization on the Web
Data collection and preparation issues
Personalized assistants
Discovering generic user models
Sequential pattern discovery
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
52
Web personalization
Basic principles
[Schwartz, Webonomics, 1997]:
“The Web is ultimately a personal medium
in which every user's experience is
different than any other's.”
“The quantity of people visiting your site is
less important than the quality of their
experience.”
© Georgios Paliouras (March 2003)
53
Personalized information access
sources
personalization
server
© Georgios Paliouras (March 2003)
receivers
54
Personalization v. intelligence
Better service for the user:
Reduction of the information overload.
More accurate information retrieval and extraction.
Recommendation and guidance.
Customer relationship management:
Customer segmentation and targeted advertisement.
Customer attraction and retention.
Service improvement (site structure and content).
© Georgios Paliouras (March 2003)
55
User modelling
Basic elements:
Constructing models that can be used to adapt the
system to the user’s requirements.
Different types of requirement: interests (sports
and finance news), knowledge level (novice expert), preferences (no-frame GUI), etc.
Different types of model: personal – generic.
Knowledge discovery facilitates the
acquisition of user models from data.
© Georgios Paliouras (March 2003)
56
User Models
User model (type A):
[PERSONAL]
User x -> sports, stock market
User model (type B):
[PERSONAL]
User x, Age 26, Male -> sports, stock market
User community:
[GENERIC]
Users {x,y,z} -> sports, stock market
User stereotype:
[GENERIC]
Users {x,y,z}, Age [20..30], Male -> sports, stock
market
© Georgios Paliouras (March 2003)
57
Learning user models
Community 2
Community 1
User 1
User 2
User 3
User 4
User 5
User
communities
User
models
Observation of the users interacting with the system.
© Georgios Paliouras (March 2003)
58
Knowledge discovery process
Data collection
Data pre-processing
Pattern discovery
Collection of usage data by
the server and the client.
Data cleaning, user identification,
session identification
Construction of user models
Knowledge post-processing Report generation, visualization,
personalization module.
© Georgios Paliouras (March 2003)
59
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Personalization on the Web
Data collection and preparation issues
Personalized assistants
Discovering generic user models
Sequential pattern discovery
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
60
Usage data sources
Server-side data: access logs in Common or
Extended Log Format, user queries, cookies.
Client-side data: Java and Javascript agents.
Intermediary data: proxy logs, packet sniffers.
Registration forms: personal information and
preferences supplied by the user.
Demographic information: provided by census
databases.
© Georgios Paliouras (March 2003)
61
Pre-processing usage data
Cleaning:
Log entries that correspond to error responses.
Trails of robots.
Pages that have not been requested explicitly by the user
(mainly image files, loaded automatically). Should be
domain-specific.
User identification:
Identification by log-in.
Cookies and Javascript.
Extended Log Format (browser and OS version).
Bookmark user-specific URL.
Various other heuristics.
© Georgios Paliouras (March 2003)
62
Pre-processing usage data
User session/Transaction identification in log files:
Time-based methods, e.g. 30 min silence interval. Problems
with cache. Partial solutions: special HTTP headers, Java
agents.
Context-based methods: e.g. separate pages into
navigational and content and impose heuristics on the type
of page that a user session may consist of.
User sessions can be subdivided into smaller transaction
sequences, e.g. by identifying a “backward reference” in the
sequence of requests.
Encoding of training data:
Bag-of-pages representation of sessions/transactions.
Transition-based representation of sessions/transactions.
Manually determined features of interest.
© Georgios Paliouras (March 2003)
63
Collection and preparation
Problems:
Privacy and security issues:
The user must be aware of the data collected.
Cookies and client-side agents are often disabled.
Caching on the client or an intermediate proxy causes
data loss on the server side.
Registration forms are a nuisance and they are not
reliable sources.
Client-side agents increase response time.
User and session identification from server logs is hard.
Different data required for different user models.
© Georgios Paliouras (March 2003)
64
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Personalization on the Web
Data collection and preparation issues
Personalized assistants
Discovering generic user models
Sequential pattern discovery
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
65
Personalized assistants
Construction of a separate model for each user
and use of this model to:
help focus on interesting Web sites (personalized
crawling),
modify the structure and content of a site,
adapt the Web interface.
Various methods developed outside the Web
are applicable here, e.g. student modelling.
Collection of sufficient usage data is difficult.
User identification essential; most often log-in.
© Georgios Paliouras (March 2003)
66
Personalized assistants
Personalized crawling [Liebermann et al., ACM Comm., 2000]:
The system knows the user (log-in).
It uses heuristics to extract “important” terms from the Web pages
that the user visits and add them to thematic profiles.
Each time the user views a page, the system:
searches the Web for related pages,
filters them according to the relevant thematic profile,
and constructs a list of recommended links for the user.
The Letizia version of the system searches the Web locally,
following outgoing links from the current page.
The Powerscout version uses a search engine to explore the Web.
© Georgios Paliouras (March 2003)
67
Personalized assistants
Adapting site structure [Schwarzkopf, UM 2001]:
The system knows the user (log-in).
Session identification is done by means of bookmarking a
personal URL.
Bayesian networks define taxonomic relations between
topics covered by a Web site.
A different network is maintained for each user and the
probabilities map the user’s interests.
Dynamic hyperlinks and recommendations are derived by
the user’s model and included in the site’s home page.
© Georgios Paliouras (March 2003)
68
Personalized assistants
Adaptive Web interfaces [Jörding, UM 1999]:
The TELLIM system collects user information, (e.g. a
selection of a link) using a Java applet .
User information is used as training data in order to create
generic models reflecting the users’ interest in different
products.
The system creates short-term personal models using the
generic models and the current user’s behavior.
Web pages containing more detailed information about these
products, together with multimedia content and VRML
presentations are created dynamically and presented to the
users.
© Georgios Paliouras (March 2003)
69
Personalized assistants
Problems:
Users suspicious of the agent mishandling
their personal data. Especially for server-side
systems.
Limited amount of training data per user.
Usually require heuristics or the user’s
involvement, in order to accelerate learning.
© Georgios Paliouras (March 2003)
70
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Personalization on the Web
Data collection and preparation issues
Personalized assistants
Discovering generic user models
Sequential pattern discovery
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
71
Generic user models
Stereotypes: Models that represent a type
of user, associating personal characteristics
with parameters of the system,
e.g. Male users of age 20-30 are interested in sports
and politics.
Communities: Models that represent a
group of users with common preferences,
e.g. Users that are interested in sports and politics.
© Georgios Paliouras (March 2003)
72
Learning user stereotypes
Constructing stereotypes from personal user
models:
Given a set of models for the users who have
chosen each stereotype,
use supervised learning (e.g. decision trees) to
construct a model that distinguishes each
stereotype from all others.
Each user may belong in more than one
stereotype.
User sessions may be used instead of
personal user models.
© Georgios Paliouras (March 2003)
73
Collaborative filtering
Information filtering according to the
choices of similar users.
Avoids semantic content analysis.
Cold-start problem with new users.
Approaches:
memory-based learning,
model-based clustering,
item-based recommendation.
© Georgios Paliouras (March 2003)
74
Memory-based learning
Nearest-neighbour approach:
Construct a model for each user. Often use explicit
user ratings for each item.
Index the user in the space of system parameters,
e.g. item ratings.
For each new user,
index the user in the same space, and
find the k closest neighbours.
Simple metrics to measure the similarity between users,
e.g. Pearson correlation.
Recommend the items that the new user has not
seen and are popular among the neighbours.
© Georgios Paliouras (March 2003)
75
Memory-based learning
Sports news
1
0
© Georgios Paliouras (March 2003)
Finance news
1
76
Model-based clustering
Clustering users into communities.
Methods used:
Conceptual clustering (COBWEB).
Graph-based clustering (Cluster mining).
Statistical clustering (Autoclass).
Neural Networks (Self-Organising Maps).
Model-based clustering (EM-type).
BIRCH.
Community models: cluster descriptions.
© Georgios Paliouras (March 2003)
77
Model-based clustering
Clique-based clustering:
Construct a model for each user.
Calculate the similarity between users, in terms of
the common items in their models.
Construct a graph of users, where the edges are
labeled by user similarity.
Remove edges according to a similarity threshold.
Identify cliques in the graph.
Important: each user may belong in more
than one community.
© Georgios Paliouras (March 2003)
78
Model-based clustering
0,9
0,9
0,9
0,8
0,4
0,1
0,1
0,5
0,5
© Georgios Paliouras (March 2003)
79
Item-based recommendation
Focus on item usage in the profiles, instead
of the users themselves.
Practically useful in e-commerce, e.g. crosssell recommendations.
Simple modification to the clique-based
clustering method: graph of items instead of
graph of users.
Related to frequent itemset discovery in
association rule mining.
© Georgios Paliouras (March 2003)
80
Item-based recommendation
Sports
0,9
Politics
0,9
0,9
0,8
0,4
Finance
0,1
0,1
World
0,5
0,5
© Georgios Paliouras (March 2003)
81
Generic user models
Issues:
Acquisition of personal information for
stereotypes.
Cold-start for collaborative filtering.
Scalability of memory-based collaborative filtering.
Cluster analysis for model-based methods.
Evaluation measures.
Higher-order representation of user sessions.
Community-specific item-based filtering.
© Georgios Paliouras (March 2003)
82
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Personalization on the Web
Data collection and preparation issues
Personalized assistants
Discovering generic user models
Sequential pattern discovery
Knowledge discovery in action
Important open issues
© Georgios Paliouras (March 2003)
83
Sequential pattern discovery
Identifying navigational patterns, rather than
“bag-of-page” models.
Methods:
Clustering transitions between pages.
First-order Markov models.
Probabilistic grammar induction.
Association-rule sequence mining.
Path traversal through graphs.
Personal and community navigation models.
© Georgios Paliouras (March 2003)
84
Sequential pattern discovery
Clique-based transition clustering; small modification of
the model-based item clustering approach: an item is a
transition between pages.
Sports
->Politics
0,9
0,9
0,9
0,8
0,4
Sports
->Finance
© Georgios Paliouras (March 2003)
Finance
->Politics
0,1
0,1
0,5
0,5
Finance
->Sports
85
Sequential pattern discovery
Probabilistic grammar induction [Borges and
Levene, WebKDD 1999]:
Represent each session by a string of symbols.
Construct a probabilistic automaton, where:
The probability of each state corresponds to the frequency of
the corresponding page in the sessions.
The edge probabilities correspond to the transition frequencies.
Generate exhaustively all trails that can be generated by the
probabilistic grammar, which have a trail probability above a
heuristic threshold. Equivalent to ordered frequent itemsets.
Higher-order Markovian modelling is also possible.
© Georgios Paliouras (March 2003)
86
Sequential pattern discovery
Hard issues:
Sequential modelling is more prone to
errors in the data collection phase (e.g.
broken sequences due to caches).
Higher-order modelling is usually
computationally intractable.
Evaluation measures.
© Georgios Paliouras (March 2003)
87
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Knowledge discovery in action
Software and Knowledge Engineering Lab
The CROSSMARC project
Important open issues
© Georgios Paliouras (March 2003)
88
SKEL background & interests
Our background:
Language Engineering (Information Extraction).
User Modeling (for IE and IR systems).
Image Analysis.
Machine Learning (neural, statistical, symbolic).
Research statement:
Reducing the information overload, by facilitating
personalized access to information on the Web.
More at: http://www.iit.demokritos.gr/skel/
© Georgios Paliouras (March 2003)
89
Research @ SKEL
Text Categorization and Filtering.
Information Extraction.
Natural Language Generation.
Ontology engineering.
Knowledge Discovery
Grammar Induction
User Modelling / Personalization
Web usage mining / Web Intelligence
Multimedia Knowledge Fusion.
© Georgios Paliouras (March 2003)
90
Technology @ SKEL
The Ellogon text engineering environment
(http://www.iit.demokritos.gr/skel/Ellogon/).
The PServer generic personalization server.
The KOINOTITES Web usage mining environment.
The Filterix Web proxy filter for obscene content
(http://www.iit.demokritos.gr/skel/i-config/).
The Filtron personalized spam filter.
(http://www.iit.demokritos.gr/skel/i-config/).
Focused crawling and spidering tools.
Multilingual information extraction systems.
Multilingual natural language generation systems.
© Georgios Paliouras (March 2003)
91
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Knowledge discovery in action
Software and Knowledge Engineering Lab
The CROSSMARC project
Important open issues
© Georgios Paliouras (March 2003)
92
The CROSSMARC project
FP5 IST project (3/2001 – 9/2003)
CROSSMARC: CROSS-lingual MultiAgent Retail Comparison
Objective: A prototype multi-agent
system for cross-lingual retrieval and
extraction of information from product
description Web pages.
© Georgios Paliouras (March 2003)
93
The CROSSMARC consortium
Partner
Type
Country
NCSR “Demokritos”
C
EL
VeltiNet A.E.
P
EL
University of Edinburgh
P
UK
Universita di Roma Tor Vergata
P
I
Lingway
P
F
© Georgios Paliouras (March 2003)
94
The basic features of CROSSMARC
Innovative information retrieval and
extraction technologies for the Web.
Cross-lingual, multi-agent open
architecture.
Customisability to new domains.
Personalized interface.
© Georgios Paliouras (March 2003)
95
CROSSMARC: Overview
focused
crawling
interesting
Web sites
domainspecific
spidering
interesting
pages
domain
ontology
fact
extraction
annotated
pages
Retrieval
named-entity
recognition
and matching
Extraction
XML records
Personalized
interface
database
user
profiles
© Georgios Paliouras (March 2003)
96
Contents
Introduction
Knowledge discovery from text & links
Knowledge discovery from usage data
Knowledge discovery in action
Important open issues
Multimedia content
Large scale knowledge discovery
Discovery in graphs
User privacy issues
© Georgios Paliouras (March 2003)
97
Multimedia Web data
© Georgios Paliouras (March 2003)
98
Scaling up to unexplored sizes
Until recently machine learning research has
considered 10,000 examples a large dataset.
On the Web 10,000,000-record databases are
not rare.
Knowledge discovery algorithms should
operate under space and time constraints.
The Web is naturally dynamic.
Knowledge discovery algorithms should allow
incremental refinement of extracted models.
© Georgios Paliouras (March 2003)
99
Discovery in the Web graph
The Web is a graph and Web sites are
subgraphs.
Many resources on the Web have a graphical
or hierarchical structure, e.g. Web directories.
Knowledge discovery algorithms should be
aware of the graphical structure.
Discovery algorithms for structured and
graphical data can lead to new interesting
applications.
© Georgios Paliouras (March 2003)
100
Respecting the user’s privacy
Data collection and use should be transparent
to the user.
Careless use of personal data will (at the
best) scare users off.
Navigational data is personal, when
associated with an individual.
“Unobtrusive personalization” should be
exercised with cautiousness.
Technology can help safeguarding privacy.
© Georgios Paliouras (March 2003)
101
References
J. Borges and M. Levene, Data mining of user navigation patterns. Proceedings of Workshop on Web Usage Analysis
and User Profiling (WEBKDD), in conjunction with ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. San Diego, CA., pp. 31-36.
S. Chakrabarti, M. H. van den Berg, B. E. Dom, Focused Crawling: a new approach to topic-specific Web resource
discovery, Proceedings of the Eighth International World Wide Web Conference (WWW), Toronto, Canada, May 1999.
T. Jörding, T, A Temporary User Modeling Approach for Adaptive Shopping on the We`, In Proceedings of the 2nd
Workshop on Adaptive Systems and User Modeling on the WWW, UM'99, Banff, Canada, 1999.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, v. 46, 1999.
H. Lieberman, C. Fry and L. Weitzman. Exploring the Web with Reconnaissance Agents, Communications of the ACM,
August 2001, pp. 69-75.
A. Maedche, S. Staab. Discovering Conceptual Relations from Text. In: W.Horn (ed.): ECAI 2000. Proceedings of the
14th European Conference on Artificial Intelligence (ECAI), Berlin, August 21-25, 2000.
A. McCallum, D. Freitag and F. Pereira, Maximum Entropy Markov Models for Information Extraction and
Segmentation, Proceedings of the International Conference on Machine Learning (ICML), Stanford, CA, 2000, pp.
591-598.
I. Muslea , S. Minton and C. Knoblock , STALKER: Learning extraction rules for semistructured Web-based
information sources. Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, Wisconsin,
1998.
C. Nédellec, Corpus-based learning of semantic relations by the ILP system, Asium, Learning Language in Logic,
Cussens J. and Dzeroski S. (Eds.), Springer Verlag, September 2000.
J. Rennie and A. McCallum. Efficient Web Spidering with Reinforcement Learning. Proceedings of the International
Conference on Machine Learning (ICML), 1999.
E. I. Schwartz. Webonomics. New York: Broadway books, 1997.
E. Schwarzkopf, An adaptive Web site for the UM2001 conference. Proceedings of the Workshop on Machine Learning
for User Modeling, in conjunction with the International Conference on User modelling (UM), pp 77-86, 2001.
© Georgios Paliouras (March 2003)
102
Knowledge discovery
on the Web
Georgios Paliouras
Software and Knowledge Engineering Laboratory
Institute of Informatics and Telecommunications
N.C.S.R. “Demokritos”
[email protected]
http://www.iit.demokritos.gr/~paliourg
103