Web Data Extraction

Transcript Web Data Extraction

Web Data Extraction
Aki Hecht
Seminar in Databases (236826)
January 2009
Agenda





Introduction
Building Tag Trees
Mining Data Regions
Partial Tree Alignment
Extraction Given Multiple Pages
Introduction


Enormous amount of data is stored in open databases.
Most databases retrieve web pages with structured
data objects.



Usually “Deep Web” pages
Non trivial task to crawl those pages
The data is important and useful for many applications:


Price comparison engines
Collecting individuals information
The goal
Given a HTML page containing multiple data records –
insert the data into a table.


No assumptions allowed on the amount of data
records in the page nor on their structure/content.
The extraction should be done automatically

Human intervention can help in getting more accurate
results, but the cost is too high.
Example 1
Example 2
More than one
data region!
General idea

Given a Web page:


Build the HTML tag tree
Mine data regions



Identify data records from each data region
Learn the structure of a general data record


Mining data records directly is hard
A data record can contain optional fields
Extract the data
Agenda





Introduction
Building Tag Trees
Mining Data Regions
Partial Tree Alignment
Extraction Given Multiple Pages
Building a tag tree

Most HTML tags work in pairs. Within each
corresponding tag-pair, there can be other pairs of tags,
resulting in a nested structure.



Some tags do not require closing tags (e.g., <li>, <hr> and
<p>) although they have closing tags.
Additional closing tags need to be inserted to ensure all tags
are balanced.
Building a tag tree from a page using its HTML code is
thus natural.
An example
The tag tree
Building trees using visual cues



The HTML code can contain errors.
Browsers are sophisticated enough to display pages
with HTML errors.
We can build the tag tree using the browser’s
mechanism.


Each HTML element is rendered as a rectangle.
Containments of rectangles representing nesting.
An example
Agenda





Introduction
Building Tag Trees
Mining Data Regions
Partial Tree Alignment
Extraction Given Multiple Pages
Tree Edit Distance


Tree edit distance between two trees A and B is the cost
associated with the minimum set of operations needed to
transform A into B.
The set of operations used to define tree edit distance
includes three operations:
node removal
 node insertion
 node replacement
A cost is assigned to each of the operations.

Finding Tree Edit Distance

Tree edit distance is very similar to string edit distance.


Can be found in the same way
Done by finding the minimal cost mapping between the two
trees.
Finding Tree Edit Distance

cont.
The algorithm for finding the minimal cost
mapping is identical for both trees and strings.

Based on dynamic programming
Mining Data Regions

Definition: A generalized node of length r consists of r
(r  1) nodes in the tag tree with the following two
properties:



the nodes all have the same parent.
the nodes are adjacent.
Definition: A data region is a collection of two or more
generalized nodes with the following properties:




the generalized nodes all have the same parent.
the generalized nodes all have the same length.
the generalized nodes are all adjacent.
the similarity between adjacent generalized nodes is greater
than a fixed threshold.
An Example
1
2
5
6
7
3
8
9
4
10
11
Region 1
The regions were found using
tree edit distance.
For example, nodes 5 and 6
are similar (low cost mapping),
they also suit the above
definition and therefore they
define a data region
12
Region 2
13
14 15
16
17
Region 3
18
19
Agenda





Introduction
Building Tag Trees
Mining Data Regions
Partial Tree Alignment
Extraction Given Multiple Pages
Partial Tree Alignment

For each data region we have found we need to
understand the structure of the data records in the
region.


Not all data records contain the same fields (optional fields are
possible)
We will use (partial) tree alignment to gather the
structure.
The algorithm

Choose a seed tree:


A seed tree, denoted by Ts, is picked with the maximum number of
data items.
Tree matching:

For each unmatched tree Ti (i ≠ s),



match Ts and Ti.
Each pair of matched nodes are linked (aligned).
For each unmatched node nj in Ti do
 expand Ts by inserting nj into Ts if a position for insertion can be
uniquely determined in Ts.
The expanded seed tree Ts is then used in subsequent matching.
Partial Tree Alignment of two trees
Ts
a
Ti
p
b
e
p
c
b
e
d
Insertion is possible
New part of Ts
a
p
b
c
d
e
Ts
Insertion is not
possible
a
Ti
p
b
e
a
p
x
e
Full algorithm
A complete example
Ts = T1 p
T2
… xd b
Ts
p
p
bg n c
T3
k
p
b c d h k
No node inserted
… x b d
p
New Ts
…
x b
c d
h
k
T2 is matched again
T2
b n
p
p
c k g
… x b n c d h k g
Output data table
T1
…
x
b
…
1
1
T2
1
T3
1
n
c
d
h
k
g
1
1
1
1
1
1
1
1
Different data records contain different fields!
1
Agenda





Introduction
Building Tag Trees
Mining Data Regions
Partial Tree Alignment
Extraction Given Multiple Pages
Extraction given multiple pages

The described technique is good for a single list page.




It can clearly be used for multiple list pages.
Templates from all input pages may be found separately and
merged to produce a single refined pattern.
Extraction results will get more accurate.
In many applications, one needs to extract the data
from the detail pages as they contain more information
on the object.
Detail pages – an example
More data in
the detail pages
A list page
Extraction from detail pages

For extraction, we can treat each detail page as a data
record, then extract using partial tree alignment.

For instance, to apply the algorithm, we simply create a rooted
tree as follows:
 create an artificial root node, and
 make the tag tree of each page as a child sub-tree of the
artificial root node.
An example
r
…
We already know how to extract data from a data region
Difficulty with detail pages

Although a detail page focuses on a single object, the
page may contain a large amount of “noise”, at the top,
on the left and right and at the bottom.


Mostly in commercial websites
Since we treat each page as a data record, the algorithm will
also extract the “noise”.
An example (a lot of noise)
The solution


To start, a sample page is taken as the wrapper.
The wrapper is then refined by solving mismatches
between the wrapper and each sample page, which
generalizes the wrapper.

A mismatch occurs when some token in the sample does not
match the grammar of the wrapper.
Wrapper generalization

Different types of mismatches:


Text string mismatches: indicate data fields (or items).
Tag mismatches: indicate list of repeated patterns or optional
elements.

Find the last token of the mismatch position and identify some
candidate repeated patterns from the wrapper and sample by
searching forward.
An example
Summary

Automatic extraction of data from a web page requires
understanding of the data records’ structure.


First step is finding the data records in the page.
Second step is merging the different structures and build a
generic template for a data record.

Partial tree alignment is one method for building the template.
Summary
cont.
Automatic extraction
 Advantages:


It is scalable to a huge number of sites due to the automatic
process.
Disadvantages:


It may extract a large amount of unwanted data because the
system does not know what is interesting to the user. Domain
heuristics or manual filtering may be needed to remove
unwanted data.
Extracted data from multiple sites need integration, i.e., their
schemas need to be matched.
Thank you!
Question?
Bibliography



Y. Zhai, B. Liu “Web data extraction based on partial tree alignment”.
International World Wide Web Conference (2005)
Y. zhai, B. Liu "Structured data extraction from the web based on partial
tree alignment," IEEE Transactions on Knowledge and Data Engineering
(2006)
DC Reis, PB Golgher, AS Silva, AF Laender “Automatic web news extraction
using tree edit distance” Proceedings of the 13th international conference
on World Wide Web Conference (2004)

Web Data Extraction

Transcript Web Data Extraction

Directory