Web Information Extraction


Web Information Extraction
3rd Oct 2007
Information Extraction
(Slides based on those by
Ray Mooney, Craig Knoblock, Dan Weld, Perry,
Subbarao Kambhampati, Bing Liu)
Introduction

 The Web is perhaps the single largest data source in the world.
 Much of Web (content) mining is about
  Data/information extraction from semi-structured objects and free text, and
  Integration of the extracted data/information
 Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.
 This talk gives an overview.
Introduction
 Web mining aims to develop new techniques to extract useful knowledge from the Web.
 The Web offers unprecedented opportunities and challenges to NLP:
  Huge amount of information accessible
  Wide and diverse coverage
  Information of all types – structured tables, text, multimedia data …
  Semi-structured (HTML)
  Linked
  Redundant
  Noisy: main content, advertisements, navigation panels, copyright notices, …
Surface Web and Deep Web
 Surface web: pages that can be browsed using a web browser
 Deep web: databases accessible through parameterized query interfaces
 Services
 Dynamic
 Virtual society: interactions among people, organizations, and systems
Information Extraction (IE)
 Identify specific pieces of information (data) in an unstructured or semi-structured textual document.
 Transform unstructured information in a corpus of documents or web pages into a structured database.
 Applied to different types of text:
 Newspaper articles
 Web pages
 Scientific articles
 Newsgroup messages
 Classified ads
 Medical notes
Information Extraction vs. NLP?
 Information extraction attempts to find some of the structure and meaning in (hopefully template-driven) web pages.
 As IE becomes more ambitious and text becomes more free-form, IE ultimately approaches full NLP.
 The Web does give one particular boost to NLP:
  Massive corpora
Sample Job Posting
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <[email protected]>
SOFTWARE PROGRAMMER
Position available for Software Programmer experienced in generating software for PC-Based Voice Mail systems. Experienced in C Programming. Must be familiar with
communicating with and controlling voice cards; preferable Dialogic, however, experience
with others such as Rhetorix and Natural Microsystems is okay. Prefer 5 years or more
experience with PC Based Voice Mail, but will consider as little as 2 years. Need to find a
Senior level person who can come on board and pick up code with very little training.
Present Operating System is DOS. May go to OS-2 or UNIX in future.
Please reply to:
Kim Anderson
AdNET
(901) 458-2888 fax
[email protected]
Extracted Job Template
computer_science_job
id: [email protected]
title: SOFTWARE PROGRAMMER
salary:
company:
recruiter:
state: TN
city:
country: US
language: C
platform: PC \ DOS \ OS-2 \ UNIX
application:
area: Voice Mail
req_years_experience: 2
desired_years_experience: 5
req_degree:
desired_degree:
post_date: 17 Nov 1996
Amazon Book Description
….
</td></tr>
</table>
<b class="sans">The Age of Spiritual Machines : When Computers Exceed Human Intelligence</b><br>
<font face=verdana,arial,helvetica size=-1>
by <a href="/exec/obidos/search-handle-url/index=books&field-author=
Kurzweil%2C%20Ray/002-6235079-4593641">
Ray Kurzweil</a><br>
</font>
<br>
<a href="http://images.amazon.com/images/P/0140282025.01.LZZZZZZZ.jpg">
<img src="http://images.amazon.com/images/P/0140282025.01.MZZZZZZZ.gif" width=90
height=140 align=left border=0></a>
<font face=verdana,arial,helvetica size=-1>
<span class="small">
<span class="small">
<b>List Price:</b> <span class=listprice>$14.95</span><br>
<b>Our Price: <font color=#990000>$11.96</font></b><br>
<b>You Save:</b> <font color=#990000><b>$2.99 </b>
(20%)</font><br>
</span>
<p> <br>…
<p>
<br>
Extracted Book Template
Title: The Age of Spiritual Machines :
When Computers Exceed Human Intelligence
Author: Ray Kurzweil
List-Price: $14.95
Price: $11.96
:
:
Product information / comparison shopping, etc.
 Need to learn to extract info from online vendors
 Can exploit uniformity of layout, and (partial) knowledge of domain, by querying with known products
 Early example: Jango Shopbot (Etzioni and Weld)
  Gives convenient aggregation of online content
 Bug: originally not popular with vendors
  Make personal agents rather than web services?
  This seems to have changed (e.g., Froogle)

[Screenshot of a vendor page, annotated: commercial information, a book (not a toy), title, price]
Information Extraction
 Information extraction systems
  Find and understand the limited relevant parts of texts: clear, factual information (who did what to whom when?)
  Produce a structured representation of the relevant information: relations (in the DB sense)
  Combine knowledge about language and a domain
  Automatically extract the desired information
 E.g.
  Gathering earnings, profits, board members, etc. from company reports
  Learning drug–gene product interactions from medical research literature
  “Smart Tags” (Microsoft) inside documents
Using information extraction to
populate knowledge bases
http://protege.stanford.edu/
Named Entity Extraction

The task: find and classify names in text, for example:
The European Commission [ORG] said on Thursday it disagreed with German [MISC] advice.
Only France [LOC] and Britain [LOC] backed Fischler [PER] 's proposal.
“What we have to be extremely careful of is how other countries are going to take Germany 's lead”, Welsh National Farmers ' Union [ORG] ( NFU [ORG] ) chairman John Lloyd Jones [PER] said on BBC [ORG] radio.

The purpose:
 … a lot of information is really associations between named entities.
 … for question answering, answers are usually named entities.
 … the same techniques apply to other slot-filling classifications.
CoNLL (2003) Named Entity
Recognition task
Task: Predict semantic label of each word in text
Word        POS   Chunk   Entity
Foreign     NNP   I-NP    ORG
Ministry    NNP   I-NP    ORG
spokesman   NN    I-NP    O
Shen        NNP   I-NP    PER
Guofang     NNP   I-NP    PER
told        VBD   I-VP    O
Reuters     NNP   I-NP    ORG
:           :     :       :

Standard evaluation is per entity, not per token.
Precision/Recall/F1 for IE
 Recall and precision are straightforward for tasks like IR and text categorization, where there is only one grain size (documents).
 The measures behave a bit oddly for IE/NER when there are boundary errors (which are common):
  First Bank of Chicago announced earnings …
  A boundary error here counts as both a false positive and a false negative
  Selecting nothing would have scored better
 Some other systems (e.g., the MUC scorer) give partial credit (according to complex rules).
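As an illustration of entity-level scoring (the spans below are hypothetical, not from the slides), here is a minimal Python sketch in which a boundary error counts as both a false positive and a false negative:

def entity_prf1(gold, predicted):
    # Entity-level scoring: an extraction is correct only if its exact
    # span and type match a gold entity (no partial credit).
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Boundary error: "Bank of Chicago" is a false positive, and the gold
# entity "First Bank of Chicago" is a false negative.
gold = [("First Bank of Chicago", "ORG")]
pred = [("Bank of Chicago", "ORG")]
print(entity_prf1(gold, pred))   # (0.0, 0.0, 0.0)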
Template Types
 Slots in a template are typically filled by a substring from the document.
 Some slots may have a fixed set of pre-specified possible fillers that may not occur in the text itself.
  Terrorist act: threatened, attempted, accomplished
  Job type: clerical, service, custodial, etc.
  Company type: SEC code
 Some slots may allow multiple fillers.
  Programming language
 Some domains may allow multiple extracted templates per document.
  Multiple apartment listings in one ad
Task: Wrapper Induction
 Sometimes, the relations to extract are structural:
  Web pages generated by a database
  Tables, lists, etc.
 Can’t computers automatically learn the patterns a human wrapper-writer would use?
  Learning wrappers is wrapper induction.
 Wrapper induction usually targets regular relations which can be expressed by the structure of the document:
  e.g., the item in bold in the 3rd column of the table is the price
 Wrapper induction techniques can also learn rules like:
  If there is a page about a research project X and there is a link near the word ‘people’ to a page that is about a person Y, then Y is a member of the project X.
  [e.g., Tom Mitchell’s Web->KB project]
Web Extraction
 Many web pages are generated automatically from an underlying database.
 Therefore, the HTML structure of pages is fairly specific and regular (semi-structured).
 However, output is intended for human consumption, not machine interpretation.
 An IE system for such generated pages allows the web site to be viewed as a structured database.
 An extractor for a semi-structured web site is sometimes referred to as a wrapper.
 The process of extracting from such pages is sometimes referred to as screen scraping.
Web Extraction using DOM Trees
 Web extraction may be aided by first parsing web pages into DOM trees.
 Extraction patterns can then be specified as paths from the root of the DOM tree to the node containing the text to extract.
 May still need regex patterns to identify the proper portion of the final CharacterData node.
Sample DOM Tree Extraction
[DOM tree of the Amazon page: HTML → HEADER / BODY; under BODY, B → “Age of Spiritual Machines” (CharacterData) and FONT → A → “Ray Kurzweil” (CharacterData)]

Title:  HTML → BODY → B → CharacterData
Author: HTML → BODY → FONT → A → CharacterData
Wrappers: Simple Extraction Patterns
 Specify an item to extract for a slot using a regular expression pattern.
  Price pattern: “\b\$\d+(\.\d{2})?\b”
 May require a preceding (pre-filler) pattern to identify the proper context.
  Amazon list price:
   Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
   Filler pattern: “\$\d+(\.\d{2})?\b”
 May require a succeeding (post-filler) pattern to identify the end of the filler.
  Amazon list price:
   Pre-filler pattern: “<b>List Price:</b> <span class=listprice>”
   Filler pattern: “.+”
   Post-filler pattern: “</span>”
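A minimal sketch of the pre-filler / filler / post-filler idea in Python (the page fragment is shortened from the Amazon example above; the variable names are ours):

import re

# Pre-filler anchors the context, the named filler group captures the slot
# value, and the post-filler marks where the value ends.
PRE_FILLER  = r"<b>List Price:</b>\s*<span class=listprice>"
FILLER      = r"(?P<price>\$\d+(?:\.\d{2})?)"
POST_FILLER = r"</span>"
pattern = re.compile(PRE_FILLER + FILLER + POST_FILLER)

page = "<b>List Price:</b> <span class=listprice>$14.95</span>"
print(pattern.search(page).group("price"))   # $14.95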
Simple Template Extraction
 Extract slots in order, starting the search for the filler of the (n+1)th slot where the filler for the nth slot ended. Assumes slots always appear in a fixed order:
  Title
  Author
  List price
  …
 Alternatively, make patterns specific enough to identify each filler, always starting from the beginning of the document.
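A rough sketch of the fixed-order strategy (the slot patterns and helper below are illustrative assumptions, not from the slides); each search resumes where the previous filler ended:

import re

SLOTS = [
    ("title",      re.compile(r'<b class="sans">(.*?)</b>', re.S)),
    ("author",     re.compile(r'by <a[^>]*>\s*(.*?)</a>', re.S)),
    ("list_price", re.compile(r'<span class=listprice>(\$[\d.]+)</span>')),
]

def extract_in_order(page):
    # Search slots in a fixed order; each search starts where the previous
    # filler ended, so the individual patterns can stay simple.
    record, pos = {}, 0
    for slot, compiled in SLOTS:
        m = compiled.search(page, pos)
        if not m:
            break            # assumes slots always appear in this order
        record[slot] = m.group(1).strip()
        pos = m.end()
    return record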
Pre-Specified Filler Extraction
 If a slot has a fixed set of pre-specified possible fillers, text categorization can be used to fill the slot.
  Job category
  Company type
 Treat each of the possible values of the slot as a category, and classify the entire document to determine the correct filler.
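A hedged sketch of filling a fixed-vocabulary slot by whole-document classification, here using scikit-learn (the tiny training set is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Treat each possible slot value as a category and classify the document.
train_docs = [
    "experienced C programmer for voice mail systems",
    "data entry and filing, answering phones",
    "janitorial duties, floor care, light maintenance",
]
train_labels = ["programming", "clerical", "custodial"]

job_category = make_pipeline(CountVectorizer(), MultinomialNB())
job_category.fit(train_docs, train_labels)

posting = "Senior software programmer, C on DOS, voice cards"
print(job_category.predict([posting])[0])   # likely "programming"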
Wrapper induction
Highly regular source documents → Relatively simple extraction patterns → Efficient learning algorithm
Learning for IE
 Writing accurate patterns for each slot for each domain (e.g., each web site) requires laborious software engineering.
 The alternative is to use machine learning:
  Build a training set of documents paired with human-produced filled extraction templates.
  Learn extraction patterns for each slot using an appropriate machine learning algorithm.
Wrapper induction:
Delimiter-based extraction
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
 Use <B>, </B>, <I>, </I> for extraction
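A minimal sketch of applying this LR wrapper, with l1 = <B>, r1 = </B>, l2 = <I>, r2 = </I> (the extraction loop below is our own illustration):

def lr_extract(page, delimiters):
    # An LR wrapper is just the delimiter strings (l1, r1, ..., lK, rK);
    # extraction repeatedly scans for each l_k ... r_k pair in order.
    records, pos = [], 0
    while True:
        record = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return records
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return records
            record.append(page[start:end])
            pos = end + len(right)
        records.append(tuple(record))

page = ("<HTML><TITLE>Some Country Codes</TITLE>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<B>Belize</B> <I>501</I><BR>"
        "<B>Spain</B> <I>34</I><BR>"
        "</BODY></HTML>")

print(lr_extract(page, [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Egypt', '20'), ('Belize', '501'), ('Spain', '34')]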
Learning LR wrappers
[Figure: a set of labeled country-code pages is fed to the learner, which outputs a wrapper, i.e. the delimiter strings l1, r1, …, lK, rK]

Example: find the 4 strings <B>, </B>, <I>, </I> as ⟨ l1, r1, l2, r2 ⟩
LR: Finding r1
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
LR: Finding l1, l2 and r2
<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>
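One way to picture the search for r1: a valid r1 must be a prefix of the text that follows every labeled country name (and must not occur inside any country name). A rough sketch of computing the candidates under those assumptions (this is not Kushmerick's exact learning algorithm):

def common_prefix(strings):
    # Longest string that every following-context starts with; any prefix of
    # it (e.g. "</B>") is a candidate right delimiter r1.
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix

# Text immediately following each labeled country name in the training page.
contexts_after_country = [
    "</B> <I>242</I><BR><B>Egypt",
    "</B> <I>20</I><BR><B>Belize",
    "</B> <I>501</I><BR><B>Spain",
    "</B> <I>34</I><BR></BODY>",
]
print(repr(common_prefix(contexts_after_country)))   # '</B> <I>'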
A problem with LR wrappers
Distracting text in head and tail
<HTML><TITLE>Some Country Codes</TITLE>
<BODY><B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B></BODY></HTML>
One (of many) solutions: HLRT
Ignore page’s head and tail
<HTML><TITLE>Some Country Codes</TITLE>
<BODY><B>Some Country Codes</B><P>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
<HR><B>End</B></BODY></HTML>
(head / body / tail regions marked on the page)

 Head-Left-Right-Tail (HLRT) wrappers
More sophisticated wrappers
 LR and HLRT wrappers are extremely simple
  Though applicable to many tabular patterns
 Recent wrapper induction research has explored more expressive wrapper classes [Muslea et al., Agents-98; Hsu et al., JIS-98; Kushmerick, AAAI-1999; Cohen, AAAI-1999; Minton et al., AAAI-2000]
 Disjunctive delimiters
 Multiple attribute orderings
 Missing attributes
 Multiple-valued attributes
 Hierarchically nested data
 Wrapper verification and maintenance
Boosted wrapper induction
 Wrapper induction is only ideal for rigidly-structured machine-generated HTML…
 … or is it?!
 Can we use simple patterns to extract from natural language documents?
… Name: Dr. Jeffrey D. Hermes …
… Who: Professor Manfred Paul …
... will be given by Dr. R. J. Pangborn …
… Ms. Scott will be speaking …
… Karen Shriver, Dept. of ...
… Maria Klawe, University of ...
BWI: The basic idea
 Learn “wrapper-like” patterns for texts
  pattern = exact token sequence
 Learn many such “weak” patterns
 Combine them with boosting to build a “strong” ensemble pattern
  Boosting is a popular machine learning method in which many weak learners are combined
 Demo: http://www.smi.ucd.ie/bwi
 Caveat: not all natural text is sufficiently regular for exact string matching to work well!
Natural Language Processing-based Information Extraction
 If extracting from automatically generated web pages, simple regex patterns usually work.
 If extracting from more natural, unstructured, human-written text, some NLP may help.
  Part-of-speech (POS) tagging: mark each word as a noun, verb, preposition, etc.
  Syntactic parsing: identify phrases such as NP, VP, PP
  Semantic word categories (e.g., from WordNet): KILL: kill, murder, assassinate, strangle, suffocate
 Extraction patterns can use POS or phrase tags.
  Crime victim:
   Pre-filler: [POS: V, Hypernym: KILL]
   Filler: [Phrase: NP]
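A rough sketch of applying such a pattern with NLTK (this assumes NLTK's tokenizer, tagger, and WordNet data are installed, and it approximates the NP filler crudely as a run of determiner/adjective/noun tags):

import nltk
from nltk.stem import WordNetLemmatizer

# Semantic word class from the slide.
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}
lemmatize = WordNetLemmatizer().lemmatize

def crime_victims(sentence):
    # Pre-filler: a verb whose lemma is in the KILL class.
    # Filler: the following noun phrase (approximated as DT/JJ/NN* tokens).
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    victims = []
    for i, (word, pos) in enumerate(tagged):
        if pos.startswith("VB") and lemmatize(word.lower(), "v") in KILL:
            np = []
            for w, p in tagged[i + 1:]:
                if p.startswith(("DT", "JJ", "NN")):
                    np.append(w)
                elif np:
                    break
            if np:
                victims.append(" ".join(np))
    return victims

print(crime_victims("The gunman murdered a local shopkeeper on Tuesday."))
# ['a local shopkeeper'] with NLTK's default tagger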
Stalker: A wrapper induction system
(Muslea et al. Agents-99)
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We want to extract the area code.
 Start rules:
  R1: SkipTo(()
  R2: SkipTo(-<b>)
 End rules:
  R3: SkipTo())
  R4: SkipTo(</b>)
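A minimal sketch of how a chain of SkipTo landmarks locates the start of the area code (the rule interpreter below is our own simplification; Stalker's end rules actually scan from the other end, and its rules may also use wildcards such as Punctuation or HtmlTag):

def skip_to(page, landmarks, pos=0):
    # Apply a SkipTo(...) chain: after finding each landmark, continue from
    # its end; return the position after the last landmark, or None.
    for landmark in landmarks:
        i = page.find(landmark, pos)
        if i == -1:
            return None
        pos = i + len(landmark)
    return pos

E2 = "90 Colfax, <b>Palms</b>, Phone (800) 508-1570"
E4 = "403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008"

for example in (E2, E4):
    start = skip_to(example, ["("])       # start rule R1: SkipTo(()
    end = example.find(")", start)        # end of the area code
    print(example[start:end])             # 800, then 310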
Learning extraction rules
 Stalker uses sequential covering to learn extraction rules for each target item.
  In each iteration, it learns a perfect rule that covers as many positive items as possible without covering any negative items.
  Once a positive item is covered by a rule, the whole example is removed.
  The algorithm ends when all the positive items are covered. The result is an ordered list of all learned rules.
Rule induction through an example
Training examples:
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

We learn the start rule for the area code.
 Assume the algorithm starts with E2. It creates three initial candidate rules: one with the first prefix symbol and two with wildcards:
  R1: SkipTo(()
  R2: SkipTo(Punctuation)
  R3: SkipTo(Anything)
 R1 is perfect: it covers two positive examples but no negative example.
Rule induction (cont …)
E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515
E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570
E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293
E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008

 R1 covers E2 and E4, which are removed. E1 and E3 need additional rules.
 Three candidates are created:
  R4: SkipTo(<b>)
  R5: SkipTo(HtmlTag)
  R6: SkipTo(Anything)
 None is good. Refinement is needed.
 Stalker chooses R4 to refine, i.e., to add additional symbols to specialize it.
 It will find R7: SkipTo(-<b>), which is perfect.
Limitations of Supervised Learning
 Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites.
 Wrapper maintenance is very costly if Web sites change frequently:
  It is necessary to detect when a wrapper stops working properly.
  Any change may make existing extraction rules invalid.
  Re-learning is needed, and most likely manual re-labeling as well.
Road map
 Structured data extraction
  Wrapper induction
  Automatic extraction
 Information integration
 Summary
The RoadRunner System
(Crescenzi et al. VLDB-01)
 Given a set of positive examples (multiple sample pages), each containing one or more data records,
  from these pages, generate a wrapper as a union-free regular expression (i.e., no disjunction).
 The approach:
  To start, a sample page is taken as the wrapper.
  The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.

Compare with wrapper induction
 No manual labeling, but needs a set of positive pages of the same template,
  which is not necessary for a page with multiple data records.
 Produces a wrapper for whole pages, not for data records.
  A Web page can have many pieces of irrelevant information.
 Issues of automatic extraction:
  Hard to handle disjunctions
  Hard to generate attribute names for the extracted data
  Extracted data from multiple sites need integration, manual or automatic
The DEPTA system (Zhai & Liu WWW-05)
[Figure: a product listing page containing two data regions; each data region consists of data records, one per product.]

Align and extract data items (e.g., from data region 1):

image1 | EN7410 17-inch LCD Monitor                 | Black/Dark charcoal | $299.99 | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
image2 | 17-inch LCD Monitor                        |                     | $249.99 | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
image3 | AL1714 17-inch LCD Monitor, Black          |                     | $269.99 | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
image4 | SyncMaster 712n 17-inch LCD Monitor, Black |                     | Was: $369.99, $299.99, Save $70 after $70 mail-in rebate(s) | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
1. Mining Data Records
(Liu et al, KDD-03; Zhai and Liu, WWW-05)
 Given a single page with multiple data records (a list page), it extracts the data records.
 The algorithm is based on
  two observations about data records in a Web page, and
  a string matching algorithm (tree matching works too).
 It considers both
  contiguous and
  non-contiguous data records.
The Approach
Given a page, three steps:
 Building the HTML tag tree
  Erroneous tags, unbalanced tags, etc.
  Some problems are hard to fix
 Mining data regions
  String matching or tree matching
 Identifying data records
Rendering (or visual) information is very useful in the whole process.
Building tree based on visual cues
[Figure: an HTML fragment — a <table> with two <tr> rows, each containing <td> cells — where each element's rendered bounding box (left, right, top, bottom) is known; containment of the boxes determines the nesting, giving the tag tree: table → tr, tr; each tr → its td children.]
Mining Data Regions
[Figure: a tag tree with nodes numbered 1–20; three data regions (Region 1, Region 2, Region 3) are found among groups of sibling subtrees.]
Identify Data Records
 A generalized node may not be a data record.
 Extra mechanisms are needed to identify true atomic objects (see the papers).
 Some highlights: the method handles both
  contiguous and
  non-contiguous data records.

[Figure: contiguous layout — each record's name and description appear together (Name 1 / Description of object 1, Name 2 / Description of object 2, …) — versus non-contiguous layout — the names appear in one row and the corresponding descriptions in the next (Name 1, Name 2 / Description of object 1, Description of object 2, …).]
2. Extract Data from Data Records
 Once a list of data records is identified, we can align and extract the data items in them.
 Approaches (align multiple data records):
  Multiple string alignment
   Many ambiguities due to pervasive use of table-related tags
  Multiple tree alignment (partial tree alignment)
   Together with visual information, this is effective
 Most multiple alignment methods work like hierarchical clustering
  Not effective, and very expensive
Tree Matching (tree edit distance)
 Intuitively, in the mapping:
  each node can appear no more than once,
  the order between sibling nodes is preserved, and
  the hierarchical relation between nodes is also preserved.
[Figure: an example mapping between two trees A and B (node labels p, b, e, a, c, d, h) illustrating these constraints.]
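These constraints correspond to restricted (top-down, order-preserving) tree matching. As an illustration, a sketch of the classic Simple Tree Matching recursion, which this line of work builds on; trees are represented here as (tag, children) tuples:

def simple_tree_matching(a, b):
    # Number of matched node pairs between two ordered, labeled trees.
    # Roots with different tags cannot be matched and contribute 0.
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    m, n = len(kids_a), len(kids_b)
    # Dynamic program over the two ordered child sequences (order preserved).
    W = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            W[i][j] = max(W[i - 1][j], W[i][j - 1],
                          W[i - 1][j - 1]
                          + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]))
    return 1 + W[m][n]

A = ("p", [("b", []), ("d", []), ("e", [])])
B = ("p", [("b", []), ("c", []), ("d", [])])
print(simple_tree_matching(A, B))   # 3: the two roots plus b and d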
The Partial Tree Alignment approach
 Choose a seed tree: the seed tree, denoted Ts, is picked as the tree with the maximum number of data items.
 Tree matching: for each unmatched tree Ti (i ≠ s),
  match Ts and Ti;
  each pair of matched nodes is linked (aligned);
  for each unmatched node nj in Ti, expand Ts by inserting nj into Ts if a position for the insertion can be uniquely determined in Ts.
 The expanded seed tree Ts is then used in subsequent matching.
Illustration of partial tree alignment
[Figure: aligning a tree Ti against the seed tree Ts. An unmatched node whose insertion position in Ts can be uniquely determined is inserted and becomes a new part of Ts; an unmatched node whose position is ambiguous cannot be inserted.]
A complete example
[Figure: Ts = T1. Matching T2 against Ts inserts no node, so T2 is set aside. Matching T3 against Ts inserts c, h, and k, giving a new Ts. T2 is then matched again against the expanded Ts, and all items align, yielding the item sequence … x b n c d h k g.]
Output Data Table
[Table: one row per tree (T1, T2, T3) and one column per aligned data item (…, x, b, n, c, d, h, k, g); a 1 marks the items that each tree contains.]
 DEPTA does not work with nested data records.
 NET (Liu & Zhai, WISE-05) extracts data from both flat and nested data records.
Some other systems and techniques
 IEPAD (Chang & Lui, WWW-01), DeLa (Wang & Lochovsky, WWW-03)
  These systems treat a page as a long string and find repeated substring patterns.
  They often produce multiple patterns (rules); it is hard to decide which is correct.
 EXALG (Arasu & Garcia-Molina, SIGMOD-03), (Lerman et al., SIGMOD-04)
  Require multiple pages to find patterns,
  which is not necessary for pages with multiple records.
 (Zhao et al., WWW-04)
  Extracts data records in one area of a page.
Limitations and issues
 Not applicable to a page with only a single data record
 Does not generate attribute names for the extracted data (yet!)
 Extracted data from multiple sites need integration.
  This is possible in each specific application domain, e.g., products sold online:
   we need “product name”, “image”, and “price”;
   identifying only these three fields may not be too hard.
  Similarly for job postings, publications, etc.
Road map
 Structured data extraction
  Wrapper induction
  Automatic extraction
 Information integration
 Summary
Web query interface integration
 Many integration tasks:
  Integrating Web query interfaces (search forms)
  Integrating extracted data
  Integrating textual information
  Integrating ontologies (taxonomies)
  …
 We only introduce integration of query interfaces.
  Many web sites provide forms to query the deep web
  Applications: meta-search and meta-query
Global Query Interface
[Figure: search forms from united.com, airtravel.com, delta.com, and hotwire.com merged into one global query interface.]
Synonym Discovery (He and Chang, KDD-04)
 Discover synonym attributes, e.g., Author – Writer, Subject – Category

S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding

Holistic Model Discovery: match all schemas together, grouping {author, writer, name} and {subject, category}
vs.
Pairwise Attribute Correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category
Schema matching as correlation mining

Across many sources:
 Synonym attributes are negatively correlated
  synonym attributes are semantic alternatives,
  and thus rarely co-occur in query interfaces
 Grouping attributes are positively correlated
  grouping attributes semantically complement each other,
  and thus often co-occur in query interfaces

1. Positive correlation mining yields potential groups, e.g., {Last Name, First Name}
2. Negative correlation mining yields potential matchings, e.g., Author = {Last Name, First Name}
3. Matching selection as model construction, e.g., Author (any) = {Last Name, First Name}; Subject = Category; Format = Binding
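A rough sketch of the underlying co-occurrence statistics (toy interfaces loosely modeled on the S1/S2/S3 example, extended so the counts are meaningful; the paper uses a more careful correlation measure than the plain counts below):

from itertools import combinations
from collections import Counter

# Each deep-web query interface is represented by the set of attribute
# names appearing on its search form.
interfaces = [
    {"author", "title", "subject", "isbn"},
    {"writer", "title", "category", "format"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
    {"name", "title", "keyword", "binding"},
]

occur = Counter(a for form in interfaces for a in form)
cooccur = Counter(frozenset(p) for form in interfaces
                  for p in combinations(form, 2))

# Synonym candidates: attributes that are individually common across
# sources but never appear together on the same form.
for a, b in combinations(sorted(occur), 2):
    if occur[a] >= 2 and occur[b] >= 2 and not cooccur[frozenset((a, b))]:
        print(a, "<->", b)
# author <-> writer
# category <-> subject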
A clustering approach to schema matching (Wu et al. SIGMOD-04)
 1:1 mappings by clustering
 Bridging effect
  “a2” and “c2” might not look similar themselves, but they might both be similar to “b3”
 1:m mappings
  Aggregate and is-a types
 User interaction helps in:
  learning of matching thresholds
  resolution of uncertain mappings
Find 1:1 Mappings via Clustering
[Figure: example interfaces, the initial similarity matrix, and the matrix after one merge.]
 Similarity functions:
  linguistic similarity
  domain similarity
…, final clusters: {{a1, b1, c1}, {b2, c2}, {a2}, {b3}}
Find 1:m Complex Mappings
Aggregate type – the contents of the fields on the many side are each part of the content of the field on the one side
Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
Complex Mappings (Cont’d)
Is-a type – the contents of the fields on the many side are the sum/union of the content of the field on the one side
Commonalities – (1) field proximity, (2) parent label similarity, and (3) value characteristics
Instance-based matching via query probing (Wang et al. VLDB-04)
 Both query interfaces and returned results (called instances) are considered in matching. It assumes that
  a global schema (GS) is given, and
  a set of instances is also given.
 It uses each instance value V in the GS to probe the underlying database and obtain the count of how often V appears in the returned results.
  These counts are used to help matching.
Query interface and result page
[Figure: a book search interface and a result page, annotated with data attributes such as Title, Author, Publisher, Publish Date, ISBN, Format, …]
Road map
 Structured data extraction
  Wrapper induction
  Automatic extraction
 Information integration
 Summary
Summary
 This talk gave an overview of two topics:
  Structured data extraction
  Information integration
 Some technologies are ready for industrial exploitation, e.g., data extraction.
 Simple integration is doable; complex integration still needs further research.