Transcript ppt

Automatic Schema Matching
Seminar on Databases and the Internet
Yaron Naveh
January 2006
Articles
A survey of approaches to automatic schema
matching
Rahm & Bernstein (2001)
Discovering Complex Matchings across Web
Query Interfaces: A Correlation Mining
Approach
He, Chen-Chuan Chang & Han (2004)
Automatic Schema Matching,
SDBI, 2006
2
Contents




Problem Definition
Applications
Classic Approaches
Correlation Mining Approach
Automatic Schema Matching,
SDBI, 2006
3
Match Definition
Authors
Authors
ID
AID
Name
AName
NumOfBooks
ANumOfBooks
A match is a mapping between elements of two
schemas that correspond semantically to each other
Automatic Schema Matching,
SDBI, 2006
4
Match Properties
Authors
Authors
(1:1)
ID
ID
Name
NumOfBooks
FName
?
(1:n)
LName
? YearOfBirth
• (n:m) matching also possible
Automatic Schema Matching,
SDBI, 2006
5
Match Properties (cont’d)
Authors
Authors
ID
ID
Name
Name
Salary ($)
Salary (NIS)
• Salary(NIS) = Salary($) * 4.55
• We will not find the function, just the attributes
Automatic Schema Matching,
SDBI, 2006
6
Match Properties (cont’d)
Employees
Employees
EmpName
EmpName
DeptID
Join
DeptName
Departments
DeptID
DeptName
One relation is mapped to two others
Automatic Schema Matching,
SDBI, 2006
7
Match Properties (cont’d)
Lessons
Lessons
Teacher
Teacher
StartTime
EndTime
?
?
Time
• Too hard for PC!
• PC should only suggest mappings to the user
Automatic Schema Matching,
SDBI, 2006
8
Match Properties (cont’d)
So maybe it can all be done manually?
Field1
Field1
Field2
Field2
Field3
Field3
Field4
Field4
Field5
Field5
Field6
Field6
Field7
Field7
Field8
Field8
Field9
Field9
field10
field10
An automated tool can be helpful here…
Automatic Schema Matching,
SDBI, 2006
9
Match Generalization
We have defined a match for the relational model.
There are other interesting models:
…
Books
<author>
Authors
<id>1</id>
ID
<name>Calvino</name>
</author>
Authors
…
Automatic Schema Matching,
SDBI, 2006
Name
10
Match Generalization (cont’d)


Define a Schema to be a set of elements
connected by some structure
Use the natural correspondence:
• nodes and edges in graphs
• elements, subelements, and IDREFs in XML
…
Automatic Schema Matching,
SDBI, 2006
11
Contents




Problem Definition
Applications
Classic Approaches
Correlation Mining Approach
Automatic Schema Matching,
SDBI, 2006
12
Data Migration
Migrate data from old DB to new DB
Old Forum
New Forum
Date
Time
From
Writer
Message
Message
IsVisible
Special case: Data warehouse
Automatic Schema Matching,
SDBI, 2006
ResponseTo
13
E-Commerce
Map between different message formats
Book Store
<book>
<name>The Invisible Cities</name>
<price>50</price>
</book>
General Store
<product>
<name>book</name>
<price>50</price>
</product>
Automatic Schema Matching,
SDBI, 2006
14
Global Query Interface
You want to build a Meta-Querier. However…
GOOGLE
<input name=search>
MSN
<input name=q>
<select name=type>
Yahoo
<input name=qry>
<input name=type>
Automatic Schema Matching,
SDBI, 2006
15
Global Query Interface (cont’d)
Solution: Reduce the html form to its “schema”
GOOGLE
Yahoo
MSN
Search
Qry
q
Type
Type
Automatic Schema Matching,
SDBI, 2006
16
Semantic Query Processing
Keywords search scenario
Authors
Find: Author + Ram + Oren
Id
Author
Name
Ram
Oren
SELECT * WHERE Id=‘Ram Oren’
?
How does this
differ from
previous
examples?
SELECT * WHERE Name=‘Ram Oren’ ?
Automatic Schema Matching,
SDBI, 2006
17
Contents




Problem Definition
Applications
Classic Approaches
Correlation Mining Approach
Automatic Schema Matching,
SDBI, 2006
18
Matchers



There are a few algorithms to map attributes of 2
schemas
Define such an algorithm as a matcher
Define a hybrid matcher as a matcher that combines
results from other matchers
Automatic Schema Matching,
SDBI, 2006
19
Schema-based Vs. Instance-based
Two ways to perform a match:
• Use schema data (field name, type, constraints…)
• Use data from the table
Automatic Schema Matching,
SDBI, 2006
20
Instance-based
Two options for using data from the table:
• Build a schema from instance data, then use schema
matchers
• Use the data directly. Example:
Books
BookID
1
TotPages TotPrice
500
50
2
3
400
450
What is
TotalP?
Books
BookID TotalP
1
60
40
90
Automatic Schema Matching,
SDBI, 2006
21
Instance-based (cont’d)
When will we use/not use instance based matchers?
• Useful when no schema data is available
• Not useful when no instance data is available…
Automatic Schema Matching,
SDBI, 2006
22
Schema-Based
What useful data is there in the schema?
• Element’s name
• Description
• Data Type
• Relationships
• Constraints
Automatic Schema Matching,
SDBI, 2006
23
Schema-Based: Name Matching
Map elements with similar names:
• String equality
• Common substrings (Birthday --> DayOfBirth)
• Canonical names (CName --> Customer Name)
• Synonyms (Car --> Automobile)
• Hypernyms (Book is-a Publication)
• Soundex (ShipTo --> Ship2)
• User provided (Issue --> Bug)
Automatic Schema Matching,
SDBI, 2006
24
Schema-Based: Description
Map elements based on description
Schema A
empn //employee name
Schema B
name //name of employee
Automatic Schema Matching,
SDBI, 2006
25
Schema-Based: Constraint Based
Map elements based on Constraints:
• Data Types
• Unique, Primary, Foreign
Permissions
Employees
Employees
Payments
ID
Name
Name
ID
PLevel
PID
PID
Sum
?
Automatic Schema Matching,
SDBI, 2006
26
Reuse Previous Matching
Schema A
Schema B
Schema C
Name
AName
Author
Salary
Income
Money
• Get mapping AC From mappings AB and BC
• A partial reuse is also possible (e.g. on some of the
attributes)
• Be aware of the domain: salary and income are not
always the same!
Automatic Schema Matching,
SDBI, 2006
27
Complexity
• We must compare every subgroup of attributes in
schema A to every subgroup in schema B
• Exponential in the number of attributes
• However, we can assume the number of attributes is
blocked…
• Also check (n:m) matching only for n,m<C for some C
Automatic Schema Matching,
SDBI, 2006
28
Contents




Problem Definition
Applications
Classic Approaches
Correlation Mining Approach
Automatic Schema Matching,
SDBI, 2006
29
Data Mining
Data Mining is the process of discovering patterns in data,
usually stored in a Database.
Sells
TransID
Item
1
Book
1
Pencil
2
Book
2
Soap
3
Book
3
Soap
Which items are likely to co-appear?
Automatic Schema Matching,
SDBI, 2006
30
Data Mining (cont’d)
Sells
TransID
Item
1
Book
1
Pencil
2
Book
2
Soap
3
Book
3
Soap
Support of an itemset: the fraction of
transactions that contain all items in the
itemset.
What is the support for {Book}? 1
And for {Book, Soap}? 0.666
The A-Priori property: the support for any
subset of an itemset is bigger than the
support for the itemset
Automatic Schema Matching,
SDBI, 2006
31
Data Mining (cont’d)
Algorithm to find frequent itemsets:
Sells
TransID
Item
1
Book
1
Pencil
2
Book
2
Soap
3
Book
3
Soap
1. Define a threshold minSupport for
“frequent” itemsets
2. Calculate support for all itemsets of
size (1)
3. Calculate support for itemsets of size
Why can
2,3,4…
we stop?
4. For each size k save the frequent
itemsets
5. Stop when there are no frequent
itemsets in size K.
Automatic Schema Matching,
SDBI, 2006
32
Data Mining (cont’d)
Sells
Example:
TransID
Item
1.
Set minSupport = 0.5
1
Book
2.
1
Pencil
S({Book})=1, S({Pencil})=0.33,
S({Soap})=0.666
2
Book
3.
S({Book, Soap})=0.666
2
Soap
3
Book
4.
S({Book, Soap, Pencil})=0
3
Soap
Where is
{Soap, Pencil}?
Automatic Schema Matching,
SDBI, 2006
33
Back to Schema Matching…
Authors
Id
Salary
Name
Year
Id
Author
Id
Id
Id
AuthorFirst
First
FirstName
AuthorLast
Last
LastName
What is the
YearBirth
Income
difference from
the
Goal: Map {Name} to {Author}, {Salary} supermarket
to {Income}…
example?
Idea:{Name} and {Author} are unlikely to appear together
Solution: go to the supermarket, but instead of food buy
attributes!
Automatic Schema Matching,
SDBI, 2006
34
The Algorithm
Input: set of m schemas
Id
Salary
Name
Year
Id
Author
Id
Id
Id
AuthorFirst
First
FirstName
AuthorLast
Last
LastName
YearBirth
Income
Output: set of n-ary mappings
{Name}:{Author}:{AuthorFirst, AuthorLast}:{First,Last}…
{Salary}:{Income}
{Year}:{YearBirth}
Automatic Schema Matching,
SDBI, 2006
35
Algorithm
Naive Algorithm
1. Make a list L of all attributes from all schemas
L = {Name, Salary, FirstName, LastName,
Author, First, Last…}
2. For each pair of attributes, calculate their
support (how often they appear together)
S(Name, Salary) = 0.4
S(First, Last) = 0.95
S(Last, Name) = 0.1
Automatic Schema Matching,
SDBI, 2006
36
Algorithm (Cont’d)
3. Choose groups with low support
S(Name, Salary) = 0.4
S(First, Last) = 0.95
S(Last, Name) = 0.1
4. Using the A-Priory property calculate support
for groups of sizes 3,4,5…
S(Name, LastName, Salary) = 0
S(First, Last, Salary) = 0.1
5. Return all groups with low support
Automatic Schema Matching,
SDBI, 2006
37
Algorithm (Cont’d)
The algorithm is naive.
suppose we have negative correlation for this:
{name, author}
Then we also have negative correlation for this:
{name, author, salary}
{name, author, yearOfBirth}
Actually for any attribute X we have:
{name, author, X}
Automatic Schema Matching,
SDBI, 2006
38
Improvement
Improvement: Define the support (s) of an
itemset {a,b,c…} to be
MAX { s(a,b), s(b,c), s(a,c) … }
Example:
s(name, author)=0.1
s(name, salary)=0.5
What is the
logic behind
s(salary, author)=0.6
this?
s(name,author,salary)=MAX (0.1,0.5,0.6)=0.6
Now the support can go up so checking it is not trivial
Automatic Schema Matching,
SDBI, 2006
39
Generalizing the algorithm
Now the algorithm finds all groups of attributes
(a,b,c…) s.t. none of the pairs appears together.
Hopefully these are attributes with the same semantic:
{name, author}
{salary, payments}
…
But what about this?
({first,last}, {name})
Currently we find only (1:1) matching
For (n:m) we need to preprocess…
Automatic Schema Matching,
SDBI, 2006
40
Preprocess
Pre-Process for the algorithm:
1. Make a list L of all attributes from all schemas
L = {Name, Salary, FirstName, LastName,
Author, First, Last…}
2. Run the normal A-Priori algorithm (find all
attributes that DO appear together)
S(first, last)=0.9
S(firstName,lastName)=0.85
Automatic Schema Matching,
SDBI, 2006
41
Preprocess
3. For each schema S in the input:
For each frequent attributes group A:
S
Id
First
If A intersects with S than add new
attribute “A” to S
S’
A
Id
{First,Last}
Last
Last
4. Run the previous algorithm on S1’,
S2’… to find negative correlation
Now we can find groups like:
({first,last}, {name})
First
Automatic Schema Matching,
SDBI, 2006
First, Last
42
Still Not Perfect…
Suppose we found these mappings:
{first,last}:{name}:{author}
{first, yearOfBirth}:{birthDate}
{yearOfBirth, monthOfBirth}:{birthDate}
There is a contradiction!
Automatic Schema Matching,
SDBI, 2006
43
Solution
Solution: rank the mappings according to the support
of the lowest pair in each mapping
1. {first,last}:{name}:{author}
2. {first, yearOfBirth}:{birthDate}
3. {yearOfBirth, monthOfBirth}:{birthDate}
Add the top rank to the results
1. {first,last}:{name}:{author}
Delete contradictions to this rank:
2. {first, yearOfBirth}:{birthDate} X
Process next mapping
3. {yearOfBirth, monthOfBirth}:{birthDate}
Automatic Schema Matching,
SDBI, 2006
44
Attributes with the same name
Step 1 of the algorithm (reminder):
Make a list S of all attributes from all schemas
S = {Name, Salary, FirstName, LastName,
Author, First, Last…}
This means that two attributes with the same
name are always considered the same.
Payment (longint)
?
Payment (datetime)
Solution: add the type to the name
Id
Id_Int
First
First_String
Last
Last_String
Automatic Schema Matching,
SDBI, 2006
45
Correlation Measure
Id
Salary
Name
Id
Author
Year
Id
Id
Id
AuthorFirst
First
FirstName
AuthorLast
Last
LastName
YearBirth
Income
The rare attribute problem:
s(Income, Id)=0.2
So Income=Id?
Automatic Schema Matching,
SDBI, 2006
46
Correlation Measure (cont’d)
Id
Salary
Name
Year
Id
Author
Id
Id
Id
AuthorFirst
First
FirstName
AuthorLast
Last
LastName
YearBirth
Income
The sparseness problem:
s(Salary, Income)=0
If Salary=Income than what is their equivalence in the
other tables?
Automatic Schema Matching,
SDBI, 2006
47
Correlation Measure (cont’d)
Support of an itemset: the fraction of transactions that
contain all items in the itemset.
There are other ways to calculate support:
Let A,B be two attributes. Define
A
^A
B
f11
f10
f1+
^B
f01
f00
f0+
f+1
f+0
f++
f11: the number of schemas where
both A,B appears
f10: number of schemas where only
A appears
…
f1+: f11+f10
Automatic Schema Matching,
SDBI, 2006
48
Correlation Measure (cont’d)
B
^B
A
f11
f01
f+1
^A
f10
f00
f+0
f1+
f0+
f++
We used:
support=f11/f++
Lift:
f00f11/f10f11
H-measure
f01f10/f+1f1+
Every measure fits a different situation
For example, in the matching problem we want
to “punish” attributes that co-appear
Automatic Schema Matching,
SDBI, 2006
Id
Salary
Name
Year
49
Applications
This approach can only be used when we have many
schemas
• Data Migration?
• Web query interfaces. Example:
El-Al.Com
Arkia.Com
American Airlines.Com
•Adult
•Adult
•Passengers
•Child
•Child
•To
•Infant
•Destination
Is it possible to use the algorithm for migration by
running it on many Automatic
random
schemas?
Schema
Matching,
SDBI, 2006
50
Complexity
The A-Priory algorithm is O(2^n)
Usually there are only few correlations, so in step (k+1) we
consider just a few from the groups of size k
Automatic Schema Matching,
SDBI, 2006
51
Automatic Schema Matching,
SDBI, 2006
52