Web content mining

Transcript Web content mining

NCSR “Demokritos”
Institute of Informatics & Telecommunications
Knowledge discovery on the Web
Georgios Paliouras
Email: [email protected]
WWW: http://www.iit.demokritos.gr/~paliourg
HERMES seminar, Februrary 17, 2001
A Short Story About the Web:
where we are and how we
reached here…
© Georgios Paliouras (February 2001)
2
WWW: the new face of the Net
Once upon a time, the Internet was a forum for
exchanging information. Then …
…came
the Web.
The Web introduced
new capabilities …
…and attracted many more people …
…increasing
commercial interest …
…and turning the Net into a real forum …
© Georgios Paliouras (February 2001)
3
Information sources
Providing information is still the main service …
Commercial
Non-Commercial
CNN
Reuters
CORDIS
Times
Yahoo
MLNET
© Georgios Paliouras (February 2001)
NCSTRL
Library
4
Information overload
…as more people
started using it ...
…attracting even
more people ...
…increasing
…the quantity
the of
quantity
information
of online
on the
information
Web increased...
further...
…and leading to the overload
of information for the users ...
© Georgios Paliouras (February 2001)
5
WWW: an expanding forum

The Web is large and expanding:





100.000 people sign up every day
about 300.000.000 people online
1.000.000 new pages added every day
600 GB of pages change every month
… leading to the abundance problem:
“99% of online information is of no
interest to 99% of the people”
© Georgios Paliouras (February 2001)
6
Information access services
A number of services aim to help the user gain
access to online information and products ...
… but can they really cope?
© Georgios Paliouras (February 2001)
7
New requirements






Manual indexing does not allow for wide coverage:
18% of the Web covered by the largest engines.
What I want is hardly ever ranked high enough.
Product information is often biased towards specific
suppliers and outdated.
Product descriptions are incomplete and insufficient
for comparison purposes.
‘E’ in ‘E-commerce’ stands for ‘English’.
… and many more problems lead to the conclusion ...
… that more intelligent solutions are needed!
© Georgios Paliouras (February 2001)
8
A new generation of services
Some have already made their way to the market…
… many more are being developed as I speak …
© Georgios Paliouras (February 2001)
9
Data mining
application
loop
Problem
understanding
Data selection and
pre-processing
technology
loop
Machine
Learning
Knowledge
Post-processing
and evaluation
Deployment
© Georgios Paliouras (February 2001)
10
Data on the Web

Primary data (Web content):





Mainly text,
with some multimedia content
and mark-up commands.
Underlying databases (not directly accessible).
Secondary data (Web usage):


Access logs collected by a Web server
and a variety of navigational information collected
by Web clients.
© Georgios Paliouras (February 2001)
11
Approaches to Web mining

Web content mining



Web structure mining



Pattern discovery in Web content data.
Mainly mining unstructured textual data.
Pattern discovery in the Web graph.
The graph is defined by the hyperlinks.
Web usage mining


Discovery of interesting usage patterns.
Mainly in server logs.
© Georgios Paliouras (February 2001)
12
Web Content Mining
© Georgios Paliouras (February 2001)
13
Web content mining tasks

Information Access
Document category modelling.
 Construction of thematic hierarchies.


Fact Extraction
Extraction of product information, presented in
different formats.


Information Extraction
Structured “event” summaries from large textual
corpora.

© Georgios Paliouras (February 2001)
14
Text mining


Knowledge (pattern) discovery in textual data.
Clarifying common misconceptions:



Text mining is NOT about assigning documents to
thematic categories, but about learning document
classifiers.
Text mining is NOT about extracting information
from text, but about learning information extraction
patterns.
Difficulty: unstructured format of textual data.
© Georgios Paliouras (February 2001)
15
Underlying technology
Combination of language engineering (LE),
machine learning (ML) and statistical methods:
LE
MLStats
MLStats
LE
© Georgios Paliouras (February 2001)
16
Information access

Goals:





Organize documents into categories.
Assign new documents to the categories.
Retrieve documents that match a user query.
Long history, of manually-constructed and
statistical document category models.
Dominating statistical idea:
TFIDF=term frequency * inverse document frequency
© Georgios Paliouras (February 2001)
17
Information access on the Web

Problems with traditional IA methods:



Large volume of data: document category models
are hard to construct and maintain manually.
Highly dynamic environment: thematic hierarchies
change continuously.
Web content mining solutions:


Automated construction and maintenance of
category models.
Automated construction of ontologies (thematic
hierarchies).
© Georgios Paliouras (February 2001)
18
Document category modeling
Training documents (pre-classified)
Pre-processing
Feature selection
Machine Learning
Stopword removal (and, the, etc.)
Stemming (‘played’  ‘play’)
Bag-of-words coding
Statistical selection of characteristic
terms (mutual information)
Supervised classifier learning
Category models (classifiers)
© Georgios Paliouras (February 2001)
19
Document category modeling

Machine Learning methods used:








Memory-based learning (k-nearest neighbor).
Decision-tree and decision-rule induction.
Bayesian learning (naïve Bayes classifiers).
Support vector machines.
Boosting (combined usually with decision trees).
Maximum entropy modeling.
Neural networks (multi-layered perceptrons).
Problems: high dimensionality, large training
sets, overlapping categories.
© Georgios Paliouras (February 2001)
20
Ontology/Taxonomy construction
Training documents (unclassified)
Stopword removal (and, the, etc.)
Stemming (‘played’  ‘play’)
Pre-processing
Bag-of-words coding
Feature compression Hand-made thesauri (Wordnet)
Term co-occurrence (LSI)
Machine Learning Unsupervised learning (clustering)
Thematic hierarchies, including
category models (classifiers)
© Georgios Paliouras (February 2001)
21
Ontology/Taxonomy construction

Hierarchical clustering is most suitable:




… but flat clustering can also be adapted:




Agglomerative clustering
Conceptual clustering (COBWEB)
Model-based clustering (EM-type: MCLUST)
K-means and its variants
Bayesian clustering (Autoclass)
Neural networks (self-organizing maps)
Feature compression makes the difference!
© Georgios Paliouras (February 2001)
22
Fact extraction

Goal:
Extract interesting facts from Web documents.

Typical application:
Product comparison services (price, availability, …).

Difficulties:
Semi-structured data.
 Different database schemata and presentation
formats.


Common approach:

Manual construction of wrappers.
© Georgios Paliouras (February 2001)
23
Wrapper induction
Training documents (semistructured)
Data pre-processing
Abstraction of mark-up structure
Database schema (interesting facts)
Machine Learning
Structural/sequence learning
Fact extraction patterns (wrapper)
© Georgios Paliouras (February 2001)
24
Wrapper induction

Machine Learning methods used:






Grammar (Finite State Automata) induction.
Hidden Markov Models.
Programming by demonstration.
Inductive logic programming.
Proprietary algorithms (HLRT, Shopbot learner).
Simple heuristics can make the learning task
much easier!
© Georgios Paliouras (February 2001)
25
Schema extraction
Training documents (semistructured)
Data pre-processing
Machine Learning
Declarative, relational, graphical
data representation (OEM, XML, …)
Structural/relational learning
Abstract database schemata and
mappings (DataGuides)
© Georgios Paliouras (February 2001)
26
Schema extraction

Machine Learning methods used:





Inductive logic programming.
Association rule induction.
Clustering.
Many proprietary methods!
There is a need for efficient learning from
structured and graphical data!
© Georgios Paliouras (February 2001)
27
Information extraction

Goals:



Identify interesting “events” in unstructured text.
Extract information related to the events and store
it in structured templates.
Typical application:
Information extraction from newsfeeds.

Difficulties:



Deals with unstructured text.
Extracts complex events, rather than simple facts.
Usually requires deep understanding of the text.
© Georgios Paliouras (February 2001)
28
A typical extraction system
Morphology
Syntax
Semantics
Discourse
Unstructured text and database
schema (event templates)
Lemmatization (‘said’  ‘say’),
Sentence and word separation.
Part-of-speech tagging, etc.
Shallow syntactic parsing.
Named-entity recognition.
Co-reference resolution.
Sense disambiguation.
IE pattern matching.
Structured data (filled templates)
© Georgios Paliouras (February 2001)
29
Information extraction







Long history before the birth of the Web.
One of the hardest Language Engineering tasks.
ARPA Message Understanding Conferences have
pushed the field forward.
Information overload on the Web has increased the
need for IE systems.
IE is achievable for very narrow domains
(e.g. mergers and acquisitions).
Manual construction of IE systems is time-consuming.
Machine learning can help solve this problem.
© Georgios Paliouras (February 2001)
30
Extraction pattern discovery
Morphology
Syntax
Semantics
Pattern Discovery
Unstructured text and database
schema (event templates)
Lemmatization (‘said’  ‘say),
Sentence and word separation.
Part-of-speech tagging, etc.
Shallow syntactic parsing.
Named-entity recognition.
Co-reference resolution.
Sense disambiguation.
IE pattern discovery.
IE patterns
© Georgios Paliouras (February 2001)
31
Extraction pattern discovery

Machine Learning methods used:






Decision-rule learning.
Grammar induction.
Co-training (semi-supervised learning).
Clustering.
Linguistic resources are also needed.
Difficulties:


Difficult to produce hand-tagged training data.
Lack of sufficient background knowledge.
© Georgios Paliouras (February 2001)
32
Customization of IE systems

Learning can be used to customize individual
modules of an IE system:






Sentence splitting (rule learning).
Part-of-speech tagging (Brill tagger).
Named-entity recognition (HMMs).
Co-reference resolution (rule learning).
Word sense disambiguation (rule learning).
Learning speeds up the customization to new
domains and new languages!
© Georgios Paliouras (February 2001)
33
Multimedia content





Most of the Web’s content is still text.
The quantity of multimedia content (image,
sound, video) is increasing.
Some sites are dominated by non-textual
content (digital museums, maps).
Some textual content is presented in images
(e.g. advertisement banners).
Web content mining should include
multimedia content.
© Georgios Paliouras (February 2001)
34
Mining multimedia data




Multimedia information retrieval is really
about image retrieval, using overly simplistic
methods (e.g. color histogram matching).
Multimedia information extraction has not
been used on the Web.
Learning can assist in constructing complex
multimedia IR and IE systems.
Learning can also be used to combine data in
different modalities.
© Georgios Paliouras (February 2001)
35
Open issues…







Scalable learning methods.
Dimensionality reduction.
Semi-supervised learning.
Learning from structured and graphical data.
Word/Phrase sense disambiguation.
Customization of IE systems.
Multimedia content mining.
© Georgios Paliouras (February 2001)
36
Web Structure Mining
© Georgios Paliouras (February 2001)
37
Hyperlink information is useful

Information retrieval can be improved by:





Identifying authoritative pages.
Identifying resource index pages.
Summarizing common references.
Linked pages often contain complementary
information (e.g. product offers).
Structural analysis of a Web site facilitates its
improvement.
© Georgios Paliouras (February 2001)
38
Improved information retrieval

Social network analysis:



Nodes with large fan-in (authorities) provide high
quality information.
Nodes with large fan-out to authorities (hubs) are
good starting points.
Disconnected subgraphs correspond to different
social (e.g. research) communities.
© Georgios Paliouras (February 2001)
39
Multi-page fact extraction



Interesting facts often span more than one
pages, usually hyperlinked.
Fact extraction wrappers need to include
patterns for processing hyperlinked pages.
Structure-aware wrapper induction discovers
patterns that relate pages.
© Georgios Paliouras (February 2001)
40
Web site design and optimization




A Web site can be represented as a graph.
The graph of a site contains different types of
links: crosswise, upward, downward, outward.
Substructures of a Web graph can reveal
problematic areas: missing links, unwanted
connections and loops, etc.
Combination with usage data is interesting.
© Georgios Paliouras (February 2001)
41
Open issues…





Web structure mining is an emerging
research field.
The work so far focuses on small subgraphs
(a node and its adjacent neighbors).
Structure-aware Web (content and usage)
mining.
Hyperlink classification.
Handling of dynamically generated pages.
© Georgios Paliouras (February 2001)
42
Web Usage Mining
© Georgios Paliouras (February 2001)
43
“The Quantity of People Visiting Your
Site Is Less Important Than the
Quality of Their Experience”
Evan I. Schwartz,
Webonomics, Broadway Books, 1997
© Georgios Paliouras (February 2001)
44
Personalized information access
sources
server
receivers
© Georgios Paliouras (February 2001)
45
Motivation for personalization

Better service for the user:



Reduction of the information overload.
More accurate information retrieval and extraction.
Customer relationship management:



Customer segmentation and targeted
advertisement.
Customer attraction and retention strategy.
Service improvement (site structure and content).
© Georgios Paliouras (February 2001)
46
Underlying technology

User modeling:




Constructing models that can be used to adapt the
system to the user’s requirements.
Different types of requirement: interests (sports
and finance news), knowledge level (novice expert), preferences (no-frame GUI), etc.
Different types of model: personal – generic.
Machine learning and statistics facilitate the
acquisition of user models.
© Georgios Paliouras (February 2001)
47
User Models

User model (type A):

User model (type B):
[PERSONAL]
User x -> sports, stock market
[PERSONAL]
User x, Age 26, Male -> sports, stock market

User community:
[GENERIC]
Users {x,y,z} -> sports, stock market

User stereotype:
[GENERIC]
Users {x,y,z}, Age [20..30], Male -> sports, stock market
© Georgios Paliouras (February 2001)
48
Learning user models
Community 2
Community 1
User 1
User 2
User 3
User 4
User 5
User
communities
User
models
Observation of the users interacting with the system.
© Georgios Paliouras (February 2001)
49
Web usage mining process
Data collection
Data pre-processing
Pattern discovery
Collection of usage data by
the server and the client.
Data cleaning, user identification,
session identification
Construction of user models
Knowledge post-processing Report generation, visualization,
personalization module.
© Georgios Paliouras (February 2001)
50
Usage data sources




Server-side data: access logs in Common or
Extended Log Format, user queries, cookie
tracks, packet sniffers.
Client-side data: Java and Javascript agents.
Registration forms: personal information
supplied by the user.
Demographic information: provided by
census databases.
© Georgios Paliouras (February 2001)
51
Problems in data collection

Privacy and security issues:





The user must be aware of the data collected.
Cookies and client-side agents are often disabled.
Caching on the client or an intermediate
proxy causes data loss on the server side.
Registration forms are a nuisance and they
are not reliable sources.
Client-side agents increase response time.
© Georgios Paliouras (February 2001)
52
Pre-processing usage data



Cleaning: removing pages that have not been
requested explicitly by the user (mainly
multimedia files, loaded automatically).
Should be domain-specific.
User identification: difficult when server log
data are used (only IP information available).
User-session/transaction identification:
difficult when the same IP is used by many
users. Upper limit on transition time (30 min).
© Georgios Paliouras (February 2001)
53
Aggregate data mining

Discovery of interesting aggregate patterns:





Classification of customers (loyal – casual).
Market basket analysis.
Time-series analysis.
Supplementary sales data are needed.
Methods:



Supervised learning.
Association rules.
Statistical and visual analysis.
© Georgios Paliouras (February 2001)
54
Constructing personal models

Applications:





Personal Web browsing assistant.
Personalized site structure.
Personalized interface.
Requires user registration and often user
feedback.
Methods: Various types of supervised
learning.
© Georgios Paliouras (February 2001)
55
Collaborative filtering




Information filtering according to the choices
of similar users.
Avoids semantic content analysis.
Cold-start problem with new users.
Methods:


Primarily memory-based learning, (e.g. k-nn).
No user models constructed.
Recently model-based clustering.
© Georgios Paliouras (February 2001)
56
Collaborative filtering
Sports news
1
0
Finance news
© Georgios Paliouras (February 2001)
1
57
Community models
Clustering users into communities.
 Methods used:








Conceptual clustering (COBWEB).
Graph-based clustering (Cluster mining).
Statistical clustering (Autoclass).
Neural Networks (Self-organising Maps).
Model-based clustering (EM-type).
BIRCH.
Community models: cluster descriptions.
© Georgios Paliouras (February 2001)
58
Community models
0,9
0,9
0,9
0,8
0,4
0,1
0,1
0,5
0,5
© Georgios Paliouras (February 2001)
59
Navigational pattern discovery


Identifying navigational patterns, rather than
“bag-of-page” models.
Methods:






Clustering transitions between pages.
First-order Markov models.
Probabilistic grammar induction.
Association-rule sequence mining.
Path traversal through graphs.
Personal and community navigation models.
© Georgios Paliouras (February 2001)
60
Knowledge post-processing




Identifying interesting patterns: heuristic
measures of interestingness.
Report generation: basic statistics, group
statistics, path analysis, etc.
Personalization: separate on-line module,
using the models generated by data mining.
Site optimization: adaptive Web sites,
introducing index pages.
© Georgios Paliouras (February 2001)
61
Open issues…






Technical problems in data collection and preprocessing.
Privacy and security issues.
Sequential and graphical data mining.
Efficient mining for large datasets.
Combination with content and structure data.
Further incorporation of user modeling ideas.
© Georgios Paliouras (February 2001)
62
Web mining in action:
projects in I.I.&T., NCSR
“Demokritos”
© Georgios Paliouras (February 2001)
63
Background & interests

Our background:





Language Engineering (Information Extraction).
User Modeling (for IE and IR systems).
Image Analysis.
Machine Learning (neural, statistical, symbolic).
Research statement:
Reducing the information overload, by facilitating
personalized access to information on the Web.
© Georgios Paliouras (February 2001)
64
ECRAN

Objective:
Customization of information extraction modules to
new domains and languages.

Partners:
Thomson-CSF (FR), University of Ancona (IT),
University of Rome "Tor Vergata" (IT), Smart
Information Services GmbH (DE), NCSR
"Demokritos" (GR), University of Friburg (CH),
University of Sheffield (UK)
 Ended: February 1999.
© Georgios Paliouras (February 2001)
65
Web mining in ECRAN

Methods for customizing an IE system:




Supervised learning of named-entity recognizers.
Supervised learning for word-sense
disambiguation.
Unsupervised learning of extraction patterns.
Personalizing an IE system:


Adaptive personal user models.
Supervised learning of user stereotypes.
© Georgios Paliouras (February 2001)
66
UMIE
© Georgios Paliouras (February 2001)
67
MITOS

Objectives:
Personalized Information Retrieval and Extraction
from Greek financial news articles.
 Discovery of interesting patterns, combining
published events and stock exchange data.


Partners:
NCSR "Demokritos", Athens University of
Economics and Business, University of Peireas,
University of Patras, Knowledge S.A., SENA,
KAPA-TEL S.A.
 Ending: April 2001
© Georgios Paliouras (February 2001)
68
Web mining in MITOS




Supervised learning for document
classification, used in Information Retrieval.
Supervised learning to construct an IE system
for Greek: sentence splitting, part-of-speech
tagging, named-entity recognition, extraction
pattern discovery.
Supervised learning of personal user models
and unsupervised learning of communities.
Pattern discovery in data extracted from text
and stock exchange data.
© Georgios Paliouras (February 2001)
69
Personalization in MITOS
Clustering
Collaborative
recommendation
Personal models
© Georgios Paliouras (February 2001)
Communities
70
M-PIRO

Objective:
Personalized descriptions of digital objects
(primarily museum exhibits).

Partners:
Edinburgh University (UK), System Simulation Ltd
(UK), NCSR "Demokritos” (GR), University of
Athens (GR), Foundation of the Hellenic World
(GR), Institute of Research in Science and
Technology (IT).
 Ending: February 2003
© Georgios Paliouras (February 2001)
71
Personalization in M-PIRO



Knowledge-level modeling and adaptation of
object descriptions.
Modeling of short-term and long-term visit
history.
Community modeling for collaborative
recommendation and adaptive object
indexing.
© Georgios Paliouras (February 2001)
72
M-PIRO
Dynamic text generation
The temple of Athena is
approximately 3 centuries
older than the Stadium, the
previous building that you
visited. It was built during
the early 5th century BC, on the peninsula south of
the Port Theatre. Prehistoric ruins were also found
in the surrounding area. You will probably find
them interesting.
Image property of the Foundation of Hellenic World.
© Georgios Paliouras (February 2001)
73
CROSSMARC

Objective:
Personalized, cross-lingual fact extraction for retail
product comparison.

Partners:
NCSR “Demokritos” (GR), VeltiNet AE (GR),
Univ. of Edinburgh (UK), Univ. Roma, Tor Vergata
(IT), Informatique CDC (FR), Internet Commerce
Network (FR)

Starting: March 2001
© Georgios Paliouras (February 2001)
74
Web mining in CROSSMARC




Wrapper induction and schema extraction.
Customization of information extraction
modules to new domains and languages.
Personalization through personal and
community models.
Web usage mining for Customer Relationship
Management.
© Georgios Paliouras (February 2001)
75
Conclusions:
where do we go from here?
© Georgios Paliouras (February 2001)
76
A paradox of the new era
High commercial demand for
research products!
Solutions need to be simple and
efficient!
© Georgios Paliouras (February 2001)
77
Really useful Web mining
Content mining
Authoritative
filtering
Collaborative
filtering
Really useful
Web mining!
Structure mining
Usage mining
Site optimization
© Georgios Paliouras (February 2001)
78
Mining multimedia Web data
© Georgios Paliouras (February 2001)
79
Scaling up to unexplored sizes





Until recently machine learning research has
considered 10,000 examples a large dataset.
On the Web 10,000,000-record databases
are not rare.
Web mining algorithms should operate under
space and time constraints.
The Web is naturally dynamic.
Web mining algorithms should allow
incremental refinement of extracted models.
© Georgios Paliouras (February 2001)
80
Mining the Web graph



The Web is a graph and Web sites are
subgraphs.
Web mining algorithms should be aware of
the graphical structure.
Data mining algorithms for structured and
graphical data can lead to new Web mining
applications.
© Georgios Paliouras (February 2001)
81
Respecting the user’s privacy





Data collection and use should be
transparent to the user.
Careless use of personal data will (at the
best) scare users off.
Navigational data is personal, when
associated with an individual.
“Unobtrusive personalization” should be
exercised with cautiousness.
Technology can help safeguarding privacy.
© Georgios Paliouras (February 2001)
82

Web content mining

Transcript Web content mining

Directory