Web title extraction

Web Content Mining
Najlaa Gali and Pasi Fränti
17.3.2016
Web Mining
Goals:
• Automatic extraction of useful information from web
• Challenging task: heterogeneity and lack of structure
Three types:
• Web usage mining:
Discovery of user patterns from web usage logs
• Web structure mining:
Discovery from the structure of links.
• Web content mining:
Discovery from content: text and images.
Applications 1
• To gather, categorize, organize and provide the best possible
information available on the WWW to the user requesting the
information.
• Keyword (or term) based association analysis.
• Topic classification.
• Similarity detection
– Cluster pages by a common author
– Cluster pages containing information from common source
Applications 2
• Sequence analysis: predicting a recurring event, discovering trends.
• Event detection and tracking.
• Help to understand users' behavior.
• Anomaly detection: find information that violates usual patterns.
• Discovery of frequent phrases.
• Text segmentation.
• Produce higher-quality information for the user by examining images, content, formats and web structure in response to the requests made (improving the quality of search results).
• Businesses can use this text mining to improve the marketing of their sites as well as of the products they offer.
Content of Web Page
Hypertext Markup Language (HTML, XHTML)
Logo image
Navigation bar
Title
Keywords
Text
Images
Components of Web page
• A web page is created from Hypertext Markup Language (HTML), Cascading Style Sheets (CSS) and JavaScript (JS).
• HTML: describes the structure of a web page.
• CSS: defines the look and layout of text and other material.
• Scripts such as JavaScript: affect the behavior of HTML web pages.
HTML
• A standard markup language used to create web pages.
• Its elements consist of tags enclosed in angle brackets (like <html>).
• The elements form the building blocks of all websites.
• HTML allows images and objects to be embedded and can be used to create interactive forms.
• Can embed scripts and style sheets.
<!DOCTYPE html>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p>Welcome to DAA++</p>
</body>
</html>
Example of HTML elements
• Title: <title>The Title</title>
• Heading: <hx>Heading level</hx>, x= 1…6
• Paragraph: <p>Paragraph</p>
• Link: <a href="https://www.daa.com/">A link to DAA++!</a>
• Line break: <p>This <br> is a paragraph <br> with <br> line breaks</p>
• Image: <img src="/documents/PasiFranti.jpg" style="width: 80px; height: 104px;">
CSS
• Designed primarily to enable the separation of document content
from document presentation, including aspects such as the layout,
colors, and fonts.
• Often used to set the visual style of web pages and user interfaces
written in HTML and XHTML.
• Enables multiple HTML pages to share formatting by specifying the relevant CSS in a separate .css file, which reduces the complexity and repetition in the structural content.
How to style using CSS
• HTML presentational attributes (without CSS)
<h1><font color="red"> DAA++</font></h1>
• CSS style properties
<h1 style="color:red"> DAA++</h1>
• Internal styling
<html>
<head>
<style>
#xyz { color: red }
</style>
</head>
<body>
<p id="xyz" > Hello DAA++</p>
</body>
</html>
• Link to external style sheet
<link href="path/to/file.css" rel="stylesheet">
JavaScript
• One of the three essential technologies (HTML, CSS, JS) of WWW content production.
• Adds client-side behavior to HTML pages, such as animation of page elements, interactive content (games), and playing audio and video.
• Validates input values of a web form to make sure that they are acceptable before being submitted to the server.
<script>
document.body.appendChild(document.createTextNode('Hello World!'));
var h1 = document.getElementById('header'); // holds a reference to the <h1> tag
h1 = document.getElementsByTagName('h1')[0]; // accessing the same <h1>
</script>
<noscript>Your browser either does not support JavaScript, or has it turned off.</noscript>
DOM
DOM Concept
• DOM makes all components of a web page accessible
– HTML elements
– their attributes
– text
• They can be created, modified and removed with JavaScript.
[Figure: DOM tree with root html; nodes shown: head, title, meta, body, h1, p, ul, li, a]
DOM objects
• DOM components are accessible as objects or collections of objects
• DOM components form a tree of nodes
– relationship parent node – children nodes
• Attributes of elements are accessible as text
• Browsers can show DOM visually as an expandable tree
Example of DOM
HTML:
<div>
<h3>INTERSPORT DW SPORTS</h3>
<p>
<span>Unit 49a The Circus, Cabot Circus </span><br>
<span>BS1 3BD</span>
<span>BRISTOL, United Kingdom</span>
</p>
</div>
Rendered text:
INTERSPORT DW SPORTS
Unit 49a The Circus, Cabot Circus
BS1 3BD BRISTOL, United Kingdom
[Figure: DOM tree of the example: div has the children h3 (text "INTERSPORT DW SPORTS") and p; p has three span children with the texts "Unit 49a…", "BS1…" and "BRISTOL…"]
Text nodes in DOM
• Text node
– Can only be a leaf in the DOM tree
– Its nodeValue property holds the text
– innerHTML of the parent element can be used to access the text
<p> This is text
<a href="/path/page.html">link in it</a>.
</p>
[Figure: DOM tree of the example: p has a text node "This is text" and an a element whose text node is "link in it"]
Goal
Summary Extraction
• Title: “Rosso restaurant”, “City pharmacy”
• Keywords: “restaurant, food, lunch, dinner”
• Representative image
• Short description: “Mon-Fri: 16.00-22.00, Sat: 12.00-22.00, tel. 013 227 874”
Web Page Title
Where the title can be found (share of web pages):
• Title tag (91 %)
• Logo image (89 %)
• Web page body (93 %)
<title>Wentworth House Hotel Bath Hotels - Cheap Hotels in Bath, Somerset, UK</title>
Title and Meta Tags
• The obvious source for titling
• But they also include additional information
– <title> Piato Restaurant – 123 Blues Point Road, McMahons Point, Sydney | Visit Piato and experience the life & flavour of Europe. North Sydney Functions. North Sydney Restaurants.</title>
– <title> Joensuu Keskusta | Intersport - Sport to the people </title>
• Segmentation is needed! The second title splits into:
– Joensuu Keskusta
– Intersport
– Sport to the people
Work flow
Input URL: https://www.jdwetherspoon.com/pubs/all-pubs/england/london/the-coronet-holloway
Extracted title: The Coronet
Extract Content of Title and Meta tags
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
<meta name="keywords" content="The Coronet" />
Segmentation by delimiters
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<title>SIGNORELLI GASTRONOMIA - Pyrmont Italian Restaurant - EAT • DRINK
• SHOP • COOKItalian Restaurant Pyrmont Sydney – Signorelli Gastronomia</title>
<title>Neutral Bay Club | Tennis, Bowls, Bistro & Functions | Sydney</title>
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
Pre-defined delimiter patterns:
space – space
space : space
: space
space >
?,
space /
space / space
, space
space :
space «
-,
-|
space . space
space space |
space »
space ::
space <
Candidate Segments
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
<meta name="keywords" content="The Coronet" />
Candidates
• The Coronet
• Holloway
• Our Pubs
• J D Wetherspoon
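The segmentation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DELIMITERS` list is a small subset chosen for the example rather than the full pre-defined pattern list, and the function names `split_title` and `candidate_segments` are hypothetical.

```python
import re

# Illustrative subset of delimiter patterns (assumption); the pre-defined list is longer.
DELIMITERS = [" | ", " - ", " – ", " :: ", " : ", ": ", " / ", " » ", " « ", ", "]


def split_title(text):
    """Split a title or meta string into segments at the delimiter patterns."""
    # Longer patterns go first so that " :: " is not consumed by ": ".
    ordered = sorted(DELIMITERS, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    parts = re.split(pattern, text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]


def candidate_segments(*sources):
    """Collect unique segments from several strings, keeping first-seen order."""
    seen, result = set(), []
    for source in sources:
        for segment in split_title(source):
            key = segment.lower()
            if key not in seen:
                seen.add(key)
                result.append(segment)
    return result


title = "The Coronet, Holloway | Our Pubs | J D Wetherspoon"
keywords = "The Coronet"
print(candidate_segments(title, keywords))
```

For the Coronet example this yields the four candidates listed above; the meta keyword adds nothing new because it duplicates the first title segment.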
Scoring-Position in Title and Meta Tags
• A segment that appears first or last in either the title or the meta tag gets 0.1; any other segment gets 0.0.
<title>The Coronet, Holloway | Our Pubs | J D Wetherspoon</title>
<meta name="keywords" content="The Coronet" />
Candidates          Score
The Coronet         0.1
Holloway            0.0
Our Pubs            0.0
J D Wetherspoon     0.1
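The position rule can be written as a short function. This is a minimal sketch, assuming the title and meta content have already been split into segment lists; the name `position_score` and the case-insensitive comparison are assumptions made for illustration.

```python
def position_score(candidate, segment_lists, bonus=0.1):
    """Return `bonus` if the candidate is the first or last segment of any list."""
    target = candidate.lower()
    for segments in segment_lists:
        if not segments:
            continue
        # Only the two end positions of each tag earn the bonus.
        ends = {segments[0].lower(), segments[-1].lower()}
        if target in ends:
            return bonus
    return 0.0


title_segments = ["The Coronet", "Holloway", "Our Pubs", "J D Wetherspoon"]
meta_segments = ["The Coronet"]
for candidate in title_segments:
    print(candidate, position_score(candidate, [title_segments, meta_segments]))
```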
Popularity Among Header Tags
<h1 class="banner-inner__title">The Coronet</h1>
<h2 class="venue-finder__title-text">Find a pub or hotel</h2>
<h2 class="venue-finder__title-text">Our Pubs</h2>
<h2 class="venue-finder__title-text" ng-hide="isPubName">Check out your nearest pub or hotel</h2>
<h3 class="feature-panel__title">Discover our food menu</h3>
<h3 class="feature-panel__title">Our drinks selection</h3>
<h4 class="tab__title">Nearby J D Wetherspoons</h4>
Score = frequency × weight
Candidates          Frequency × Weight
The Coronet         1 × 6 = 6
Holloway            0
Our Pubs            1 × 5 = 5
J D Wetherspoon     1 × 3 = 3
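A sketch of the header popularity score follows. The slides show weight 6 for h1, 5 for h2 and 3 for h4; extending this to `7 - level` for all six levels is an assumption, as is the case-insensitive substring match (which is what lets "J D Wetherspoon" match "Nearby J D Wetherspoons"). The function names are hypothetical.

```python
def header_weight(level):
    """Weight of heading level 1..6, assuming h1 -> 6, h2 -> 5, ..., h6 -> 1."""
    return 7 - level


def header_popularity(candidate, headers):
    """Sum the weights of all headings whose text contains the candidate."""
    target = candidate.lower()
    return sum(header_weight(level) for level, text in headers if target in text.lower())


# (heading level, heading text) pairs taken from the example page.
headers = [
    (1, "The Coronet"),
    (2, "Find a pub or hotel"),
    (2, "Our Pubs"),
    (2, "Check out your nearest pub or hotel"),
    (3, "Discover our food menu"),
    (3, "Our drinks selection"),
    (4, "Nearby J D Wetherspoons"),
]
for candidate in ["The Coronet", "Holloway", "Our Pubs", "J D Wetherspoon"]:
    print(candidate, header_popularity(candidate, headers))
```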
Scoring-Position in Web Link
https://www.jdwetherspoon.com/pubs/all-pubs/england/london/the-coronet-holloway
URL part     Value                           Weight
Domain       www.jdwetherspoon.com           1
Path         pubs/all-pubs/england/london    1.5
File name    the-coronet-holloway            3
Score = weight × Dice similarity measure
Candidates          Score
The Coronet         3 × 0.70 = 2.10
Holloway            3 × 0.58 = 1.74
Our Pubs            1.5 × 0.00 = 0.00
J D Wetherspoon     1 × 1.00 = 1.00
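The slides do not say how the Dice similarity is tokenized. The sketch below assumes a character-bigram Dice coefficient over lowercased alphanumeric characters, which is one common choice, so its values may differ slightly from the figures on the slide. The names `bigrams`, `dice` and `link_score` are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Multiset of character bigrams, ignoring case, spaces and punctuation."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(cleaned[i:i + 2] for i in range(len(cleaned) - 1))


def dice(a, b):
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    shared = sum((x & y).values())
    return 2.0 * shared / total


def link_score(candidate, url_part, weight):
    """Weight of the URL part times the similarity between candidate and that part."""
    return weight * dice(candidate, url_part)


print(round(dice("Holloway", "the-coronet-holloway"), 2))
print(round(link_score("J D Wetherspoon", "jdwetherspoon", 1), 2))
```

Stripping "www" and the top-level domain before comparing, as done for the second call, is also an assumption of this example.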
Rank Segments
Candidates         Position in tag   Popularity among header tags   Position in web link
The Coronet        0.1               6                              2.10
Holloway           0.0               0                              1.74
Our Pubs           0.0               5                              0.00
J D Wetherspoon    0.1               3                              1.00
Normalizing
Candidates         Position in tag   Popularity among header tags   Position in web link   Total
The Coronet        0.1               1.00                           1.00                   2.10
J D Wetherspoon    0.1               0.50                           0.48                   1.08
Holloway           0.0               0.00                           0.83                   0.83
Our Pubs           0.0               0.83                           0.00                   0.83
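The normalized values in the table are consistent with dividing the header and web-link scores by their column maximum and leaving the position score as it is. A minimal sketch under that assumption (the helper name `normalize_by_max` is hypothetical):

```python
def normalize_by_max(values):
    """Scale values so that the largest becomes 1.0 (all zeros stay zero)."""
    top = max(values)
    return [v / top if top else 0.0 for v in values]


candidates = ["The Coronet", "Holloway", "Our Pubs", "J D Wetherspoon"]
position = [0.1, 0.0, 0.0, 0.1]     # position in title/meta tag
popularity = [6, 0, 5, 3]           # popularity among header tags
link = [2.10, 1.74, 0.00, 1.00]     # position in web link

popularity_n = normalize_by_max(popularity)
link_n = normalize_by_max(link)
totals = [p + h + l for p, h, l in zip(position, popularity_n, link_n)]

# Rank candidates by total score, best first.
ranking = sorted(zip(candidates, totals), key=lambda item: item[1], reverse=True)
for name, total in ranking:
    print(f"{name}: {total:.2f}")
```

"Holloway" and "Our Pubs" both round to 0.83, so their relative order depends on digits beyond the second decimal.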
Impact of criteria
• Criterion 1 has the lowest impact (0.65)
− Generic words such as "home" and "welcome" are often placed at the beginning of the title.
− The slogan, address or general information about the web page is often placed at the end of the title.
• Criterion 2 has slightly higher impact (0.68)
− Heading tags are not always used, and even when they exist, the correct title is not always in them.
• Criterion 3 has the highest individual impact (0.84); the difference to criteria 1 and 2 is statistically significant.
Criteria                          Average similarity
(1) Position in tag               0.65
(2) Popularity among hx tags      0.68
(3) Position in web link          0.84
1+2                               0.70
1+3                               0.85
2+3                               0.82
1+2+3                             0.84
Web Page Body
Content of text nodes
N-grams (n=1…6)
Filter by part-of-speech (POS) patterns
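The n-gram step can be illustrated as follows. This is a sketch that assumes simple whitespace tokenization and word-level n-grams with n from 1 to 6; the function name `word_ngrams` is hypothetical, and the POS filtering step is not shown.

```python
def word_ngrams(text, max_n=6):
    """Return all word n-grams of the text for n = 1..max_n."""
    words = text.split()
    grams = []
    for n in range(1, max_n + 1):
        # Slide a window of n words over the text; empty when n > len(words).
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams


print(word_ngrams("Aqua Dining offers"))
```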
Construct DOM tree
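[Figure: DOM tree of the example page: body contains nested div, h1, h2, h3, h5, a and p elements with text nodes such as "Navigation", "Aqua Dining", "Sydney Waterfront Restaurant |…", "Feeling social?..", "facebook" and "Aqua Dining offers a…"]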
Extract text nodes
Navigation
Feeling Social? Find us on
Facebook
Sydney Waterfront Restaurant Restaurant Milsons Point
Aqua Dining offers a quintessential Sydney dining experience with
unrivalled harbour views that sweep from Luna Park to the world
famous Sydney Harbour Bridge and the Sydney Opera House.
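Collecting the text nodes can be sketched with Python's standard `html.parser` module. Note the assumptions: this parser streams through the markup rather than building a full DOM tree, skipping `script`, `style` and `noscript` content is a choice made for this example, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects the visible text pieces of an HTML document in document order."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore whitespace-only nodes.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.texts.append(text)


def extract_text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


page = ('<body><h1>Aqua Dining</h1><script>var x = 1;</script>'
        '<p>Feeling social? <a href="#">Facebook</a></p></body>')
print(extract_text_nodes(page))
```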
Apply POS tagging
NNP=Proper noun, singular
NNPS=Proper noun, plural
NN=Noun, singular or mass
VBG=Verb, gerund
VB=Verb, base form
PRP=Personal pronoun
DT=Determiner
CC=Coordinating conjunction
JJ=Adjective
[Figure: the extracted text with a POS tag above each word, e.g. Navigation/NNP; Feeling/VBG Social/NNP Find/VB us/PRP on/IN; Facebook/NNP; Aqua/NNP Dining/NNP offers/VBZ a/DT quintessential/JJ Sydney/NNP dining/NN experience/NN]
Extract potential phrases
[Figure: the same POS-tagged text as on the previous slide, from which phrases matching the POS patterns are extracted]
Feature extraction
• Similarity with the link of the web page
• Appearance in title tag
• Appearance in meta tag
• Popularity on the web page (frequency)
• Appearance in heading (h1, h2…h6) tags
• Capitalization
• Capitalization frequency
• Independent appearance
• Phrase length
Similarity with web link
https://www.aquadining.com.au/
Phrase          Similarity with URL
Social          0
Aqua Dining     0.6
Sydney          0
harbor views    0
Milsons Point   0
Remaining columns, filled in on the following slides: title tag, meta tag, headers, popularity, capital, capitalized freq., indep. app., length in char.
Appearance in Title tag
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
Phrase          Similarity with URL   Title tag
Social          0                     0
Aqua Dining     0.6                   1
Sydney          0                     1
harbor views    0                     0
Milsons Point   0                     1
Appearance in Meta title tag
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant Milsons
Point - Aqua Dining" />
Phrase          Similarity with URL   Title tag   Meta tag
Social          0                     0           0
Aqua Dining     0.6                   1           1
Sydney          0                     1           1
harbor views    0                     0           0
Milsons Point   0                     1           1
Appearance in Header tags
<h1 class="site-title">Aqua Dining Sydney Restaurant</h1> Weight= 6
<h2>Aqua Dining</h2> Weight= 5
Phrase          Similarity with URL   Title tag   Meta tag   Headers
Social          0                     0           0          0
Aqua Dining     0.6                   1           1          11
Sydney          0                     1           1          6
harbor views    0                     0           0          0
Milsons Point   0                     1           1          0
Popularity on the web page
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant milsons
point - Aqua Dining" />
<h1 class="site-title">Aqua Dining Sydney Restaurant</h1>
<h2>Aqua Dining</h2>
Phrase          Similarity with URL   Title tag   Meta tag   Headers   Popularity
Social          0                     0           0          0         0
Aqua Dining     0.6                   1           1          11        4
Sydney          0                     1           1          6         3
harbor views    0                     0           0          0         0
Milsons Point   0                     1           1          0         2
Capitalization
Phrase          Similarity with URL   Title tag   Meta tag   Headers   Popularity   Capital
Social          0                     0           0          0         0            1
Aqua Dining     0.6                   1           1          11        4            1
Sydney          0                     1           1          6         3            1
harbor views    0                     0           0          0         0            0
Milsons Point   0                     1           1          0         2            1
Capitalization frequency
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant milsons
point - Aqua Dining" />
<h1 class="site-title">aqua Dining Sydney Restaurant</h1>
<h2>aqua Dining</h2>
Phrase          Similarity with URL   Title tag   Meta tag   Headers   Popularity   Capital   Capitalized freq.
Social          0                     0           0          0         0            1         0
Aqua Dining     0.6                   1           1          11        4            1         2
Sydney          0                     1           1          6         3            1         3
harbor views    0                     0           0          0         0            0         0
Milsons Point   0                     1           1          0         2            1         1
Independent appearance
<title>Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining</title>
<meta property="og:title" content="Sydney Waterfront Restaurant | Restaurant
milsons point - Aqua Dining" />
<h1 class="site-title">aqua Dining Sydney Restaurant</h1>
<h2>aqua Dining</h2>
Phrase          Similarity with URL   Title tag   Meta tag   Headers   Popularity   Capital   Capitalized freq.   Indep. app.
Social          0                     0           0          0         0            1         0                   0
Aqua Dining     0.6                   1           1          11        4            1         2                   1
Sydney          0                     1           1          6         3            1         3                   0
harbor views    0                     0           0          0         0            0         0                   0
Milsons Point   0                     1           1          0         2            1         1                   0
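Four of these features (popularity, capitalization, capitalization frequency and independent appearance) can be computed from the text of the title, meta and heading elements. The sketch below is one possible reading of the example values, not a definitive implementation: popularity counts elements containing the phrase regardless of case, capitalization frequency counts elements containing it with the same capitalization, and independent appearance means an element consists of the phrase alone. The function name `phrase_features` is hypothetical.

```python
def phrase_features(phrase, element_texts):
    """Compute four simple features of a candidate phrase (assumed definitions)."""
    lowered = phrase.lower()
    capital = phrase[:1].isupper()
    return {
        # Number of elements that contain the phrase, ignoring case.
        "popularity": sum(lowered in text.lower() for text in element_texts),
        # 1 if the phrase starts with a capital letter.
        "capital": int(capital),
        # Number of elements that contain the phrase exactly as capitalized.
        "capitalized_freq": sum(phrase in text for text in element_texts) if capital else 0,
        # 1 if some element consists of the phrase alone, ignoring case.
        "independent": int(any(text.strip().lower() == lowered for text in element_texts)),
    }


elements = [
    "Sydney Waterfront Restaurant | Restaurant Milsons Point - Aqua Dining",  # <title>
    "Sydney Waterfront Restaurant | Restaurant milsons point - Aqua Dining",  # og:title
    "aqua Dining Sydney Restaurant",                                          # <h1>
    "aqua Dining",                                                            # <h2>
]
for phrase in ["Social", "Aqua Dining", "Sydney", "harbor views", "Milsons Point"]:
    print(phrase, phrase_features(phrase, elements))
```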
Length
Phrase          Similarity with URL   Title tag   Meta tag   Headers   Popularity   Capital   Capitalized freq.   Exact match   Length in char
Social          0                     0           0          0         0            1         0                   0             5
Aqua Dining     0.6                   1           1          11        4            1         2                   1             10
Sydney          0                     1           1          6         3            1         3                   0             6
harbor views    0                     0           0          0         0            0         0                   0             11
Milsons Point   0                     1           1          0         2            1         1                   0             12
Normalization
Normalized values are given in parentheses.
Phrase          Similarity with URL   Title tag   Meta tag   Headers    Popularity   Capital   Capitalized freq.   Exact match   Length in char
Social          0                     0           0          0          0            1         0                   0             5 (0.4)
Aqua Dining     0.6                   1           1          11 (1)     4 (1)        1         2 (0.6)             1             10 (0.96)
Sydney          0                     1           1          6 (0.5)    3 (0.75)     1         3 (1)               0             6 (0.5)
harbor views    0                     0           0          0          0            0         0                   0             11 (0.88)
Milsons Point   0                     1           1          0          2 (0.5)      1         1 (0.3)             0             12 (0.8)
Classifiers
• Naive Bayes
• Support Vector Machines (SVM)
• Clustering
• K-nearest neighbors (k-NN)
Results with Titler corpus
Extracted titles
                                    Rouge-1
Method                              Precision   Recall   F-score   Jaccard   Dice
Baseline (Title Tag)                0.89        0.41     0.52      0.50      0.58
TitleFinder (Moham. et al. 2012)    0.43        0.61     0.47      0.43      0.50
Styling (Changuel et al. 2009)      0.36        0.41     0.35      0.38      0.43
TTA (Gali and Fränti 2016)          0.73        0.80     0.75      0.75      0.78
Titler (BAYES) (New)                0.46        0.40     0.42      0.64      0.70
Titler (CLUS) (New)                 0.87        0.81     0.82      0.82      0.86
Titler (KNN) (New)                  0.87        0.82     0.83      0.84      0.87
Titler (SVM) (New)                  0.88        0.84     0.85      0.85      0.88
Results with Mopsi Services
Annotated titles
                                    Rouge-1
Method                              Precision   Recall   F-score   Jaccard   Dice
Baseline (Title Tag)                0.71        0.33     0.41      0.44      0.54
TitleFinder (Moham. et al. 2012)    0.35        0.47     0.37      0.37      0.43
Styling (Changuel et al. 2009)      0.14        0.21     0.15      0.22      0.28
TTA (Gali and Fränti 2016)          0.52        0.59     0.52      0.54      0.62
Titler (KNN) (New)                  0.59        0.56     0.55      0.59      0.66
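For readers unfamiliar with the measures in these tables, the sketch below shows one common way to compute them for a single extracted title against its reference title. The slides do not state the tokenization used, so whitespace-separated lowercase words are an assumption here (Jaccard and Dice could equally be computed on characters), and all function names are hypothetical.

```python
from collections import Counter


def tokens(text):
    """Lowercased whitespace tokens (assumed tokenization)."""
    return text.lower().split()


def rouge1(extracted, reference):
    """Unigram-overlap precision, recall and F-score."""
    ext, ref = Counter(tokens(extracted)), Counter(tokens(reference))
    overlap = sum((ext & ref).values())
    precision = overlap / sum(ext.values()) if ext else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


def jaccard(extracted, reference):
    """Size of the shared word set divided by the size of the combined word set."""
    a, b = set(tokens(extracted)), set(tokens(reference))
    return len(a & b) / len(a | b) if a | b else 0.0


def dice_words(extracted, reference):
    """Twice the shared word count divided by the sum of both word-set sizes."""
    a, b = set(tokens(extracted)), set(tokens(reference))
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


print(rouge1("The Coronet Holloway", "The Coronet"))
print(jaccard("The Coronet Holloway", "The Coronet"))
print(dice_words("The Coronet Holloway", "The Coronet"))
```

A corpus-level figure such as those in the tables would then be an average of these per-page values, which is again an assumption about the evaluation setup.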
Logo Image
• ~89 % of web pages have their title within a logo image
• Needs to detect logo image
• Apply OCR
• Challenging !!!