Transcript Slides

Site-Level Web Template Extraction Based on Hyperlink Analysis
Information Retrieval
Web Mining
Content Extraction
Template Extraction
Block Detection
Template Extraction
Why is Template Extraction useful?
• Human reading. Measurements indicate that about 40-50% of the components of a webpage can be considered irrelevant.
• Enhancing indexers and text analyzers, which perform better when they process only relevant information.
• Extracting the main content of a webpage so that it can be displayed suitably on a small device such as a PDA or a mobile phone.
• Extracting the relevant content to make the webpage more accessible to visually impaired or blind users.
What is a webpage?
Three different interpretations
1. Rendered View
Visual features classification…
2. HTML Code
CETR, Content Code Vector…
3. DOM Tree
Site Style Tree, CNR…
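To make the DOM tree interpretation concrete, here is a minimal sketch of a chars-nodes ratio over a simplified DOM tree. It assumes the ratio of a node is the number of text characters in its subtree divided by the number of nodes in that subtree. The `Node`, `TreeBuilder` and `cnr` names are hypothetical, and the parser assumes well-formed HTML.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

class Node:
    def __init__(self, tag):
        self.tag, self.children, self.chars = tag, [], 0

class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; assumes well-formed, properly nested HTML."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text is counted on the element that directly contains it.
        self.stack[-1].chars += len(data.strip())

def cnr(node):
    """Return (chars, nodes, ratio) for the subtree rooted at node."""
    chars, nodes = node.chars, 1
    for child in node.children:
        c, n, _ = cnr(child)
        chars, nodes = chars + c, nodes + n
    return chars, nodes, chars / nodes

builder = TreeBuilder()
builder.feed("<body><ul><li><a>Home</a></li><li><a>News</a></li></ul>"
             "<p>A long paragraph with the actual content.</p></body>")
body = builder.root.children[0]
menu, article = body.children
print(cnr(menu)[2], cnr(article)[2])
```

Under this assumption the menu subtree (many nodes, little text) scores low and the paragraph scores high, which is the kind of contrast a DOM-based content extractor can exploit.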
HTML Code approach
CETR
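A minimal sketch of the per-line ratio that a tag-ratio approach such as CETR works with. It assumes the ratio of a line is the number of non-tag characters divided by the number of tags on that line, with the divisor taken as 1 when the line has no tags. The function name `tag_ratios` and the regex-based tag matching are illustrative assumptions, not the published implementation.

```python
import re

TAG = re.compile(r"<[^>]*>")  # naive tag matcher; assumes tags do not span lines

def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per non-empty line of HTML."""
    ratios = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tags = len(TAG.findall(line))           # number of tags on the line
        text = len(TAG.sub("", line).strip())   # characters left once tags are removed
        ratios.append(text / max(tags, 1))
    return ratios

page = """<div><a href="/">Home</a><a href="/news">News</a></div>
<p>This paragraph carries the main content of the page.</p>"""
ratios = tag_ratios(page)
print(ratios)
```

Lines dominated by markup (menus, link lists) get low ratios, while lines of running text get high ones; a threshold or clustering step over these values would then separate content from non-content.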
Template Extraction
Top-Down Exact Mapping
Template Extraction
Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology.
   Select those nodes that belong to the menu.
   Use a complete subdigraph.
2. The template is the intersection between the initial webpage and all DOM trees in the subdigraph.
   The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
Both steps can be done at a cost linear in the size of the DOM trees.
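The intersection step can be sketched as follows. This is a simplified illustration, not the authors' implementation: DOM nodes are modelled as `(label, children)` pairs, labels stand in for tag plus attributes, and children are compared position by position. The names `node` and `top_down_intersection` are hypothetical.

```python
# A DOM node is modelled as (label, children).
def node(label, *children):
    return (label, list(children))

def top_down_intersection(trees):
    """Keep the nodes of the first tree that map to an equal node in every tree.

    Top-down: a node can only be kept if its parent was kept.
    Exact: mapped nodes must carry the same label.
    Simplifying assumption: the i-th child is compared with the i-th child.
    """
    first = trees[0]
    if any(t[0] != first[0] for t in trees):
        return None  # roots differ, so nothing below them can be mapped
    kept = []
    for children in zip(*(t[1] for t in trees)):
        common = top_down_intersection(list(children))
        if common is not None:
            kept.append(common)
    return (first[0], kept)

# Two hypothetical pages that share a menu and footer but differ in content.
home = node("body",
            node("nav", node("a:Home"), node("a:News")),
            node("div", node("p:Welcome")),
            node("footer"))
news = node("body",
            node("nav", node("a:Home"), node("a:News")),
            node("div", node("p:Headlines")),
            node("footer"))

template = top_down_intersection([home, news])
print(template)
```

Each node is visited at most once per tree, which is consistent with the linear cost stated above. The page-specific paragraphs are dropped, while the shared `nav`, `div` and `footer` structure survives as the template.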
Template Extraction
Hyperlink Analysis
DEMO
Summary
Main Ideas
1) Use densitometric features (TR) to analyse HTML code
2) Use Chars Nodes Ratio (CNR) to analyse DOM trees
3) Use Top-Down Exact Mappings (TDEM) to isolate the template of webpages
4) Use the menus of a website to extract the template
5) Use a complete subdigraph to identify the main menu
6) Use folder information inside URLs to direct the search
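Idea 5 can be illustrated with a small greedy sketch that grows a set of mutually linked pages (a complete subdigraph) starting from the initial page. The greedy strategy, the function name `complete_subdigraph`, the `size` parameter and the example site are all assumptions made for illustration; the presented method may select pages differently.

```python
def complete_subdigraph(links, start, size):
    """Greedily grow a set of mutually linked pages containing `start`.

    `links` maps each URL to the set of URLs it links to. A candidate joins
    the set only if it links to, and is linked from, every page already in it.
    """
    chosen = [start]
    for candidate in sorted(links.get(start, ())):
        if len(chosen) == size:
            break
        if candidate in chosen:
            continue
        mutual = all(candidate in links.get(p, ()) and p in links.get(candidate, ())
                     for p in chosen)
        if mutual:
            chosen.append(candidate)
    return chosen

# Hypothetical site: the three menu pages link to each other,
# while the article page is not linked from "/about".
site = {
    "/": {"/news", "/about", "/news/item1"},
    "/news": {"/", "/about", "/news/item1"},
    "/about": {"/", "/news"},
    "/news/item1": {"/"},
}
print(complete_subdigraph(site, "/", 3))
```

Pages reachable through the main menu tend to link to one another because they all carry that menu, so a set of mutually linked pages is a reasonable proxy for pages that share the template.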
Thank you