Extracting Schema from Semistructured Data

Transcript Extracting Schema from Semistructured Data

Extracting Schema from
Semistructured Data
Nestorov, Abiteboul, and Motwani at
Stanford
Perspective
• This paper is new work.
• More than the details look at the issues:
–
–
–
–
What are their goals?
What does this contribute?
Do they attain their goals?
Why do we need this?
2
Sample Database
7
1
Hours
Name Entree
11
2
Name
Manager
3
8
4
24 “Burger “Fries”
Company
King”
Name
5
“The Keg”
Manager
Entree
9
“Steak”
10
“Jim”
Phone
6
“AA+ 543-7798
Management”
Schema = Types
3
Where does semistructured data
come from?
•
•
•
•
Document collections
Biological data
HTML
Bibtex, etc.
4
Who needs structure?
• For the user
– To know what queries are possible
– Browsing the database
– Type checking
• Storage
– Data layout to facilitate querying
• E.g. place similar objects on same page
– Indexes
5
Who Needs Structure?(2)
• Query optimization
– All the relational query optimization tricks
• Maintaining statistics per data type
– Cardinality, # of pages, Index cardinality, etc.
• Estimating the cost/size of result of query plans
– Efficient processing of path expressions
• Other?
6
Their Goals
Approximate typing (schema extraction) of
semistructured data.
Example (little lie) Typing Program:
Restaurant(X) :-
Link(X,A,B,C) & Name-atom(A) &
Entrée-atom(B) & Manager-atom(C)
7
Outline of the Algorithm
Given a database:
1. Find the perfect typing program.
– This typing might be too large so we:
2. Coalesce similar types into k types.
3. Assign a type to objects in database.
4. Deduce meaningful names for the types.
8
Typing
7
The two base relations:
Name
Manager
Entree
- link(FromObj, ToObj, Label)
8
- atomic(Obj, Value)
“The Keg”
9
“Steak”
10
“Jim”
These are the only two EDB’s of the typing program.
Restaurant(X) :-
link(X,A,Name) & atomic(A, Ap) &
link(X,B,Entrée) & atomic(B, Bp) &
link(X,C,Manager) & atomic(C,Cp)
9
Typing 2
Restaurant(X) :-
link(X,A,Name) & atomic(A, Ap) &
link(X,B,Entrée) & atomic(B, Bp) &
link(X,C,Manager) & atomic(C,Cp)
EDB:
link(7, 8, Name)
atomic(8, “The Keg”)
IDB: (intensional relations)
defined by the typing program
Extension of an IDB:
Restaurant(1)
10
Restriction on Types
Arbitrary type programs are not allowed.
Rules typei(X) can only be built from the following:
1. link(Y, X, c) & typej(Y)
2. link(X, Y, c) & typej(Y)
3. link(X, Y, c) & atomic(Y, Z)
cj
cj
c0
Types can only express local characteristics.
X
The collection of typed links is a set.
(2 entrées = 1 entrée)
11
Semantics of Type Program
The greatest fixpoint of a datalog program on
a database defines the semantics of the
typing.
Fixpoint = Extensions of IDB’s + EDB’s
– Least fixpoint
• start with model of only EDB’s
• at each step union into the model anything new.
12
Greatest Fixpoint
1. Start with a model of EDB’s and all possible extensions.
2. At each step, remove any extensions not derived by applying
the rules.
Least fixpoint doesn’t work:
person(X) :- link(X, Y, is-manager-of) & firm(Y) &
link(X, Yp, name) & atomic(Yp, Z)
firm(X) :- link(X, Y, is-managed-by) & person(Y) &
link(X, Yp, name) & atomic(Yp, Z)
13
Imperfect Types
Defect: a measure of how well an
object fits a given type.
7
Name
= Excess + deficit
type1 =
manager0
+
4
“The Keg”
Manager
Entree
5
“Steak”
6
“Jim”
name0 + entree0
11
Defect is 2 for assigning 11
to type1.
Name
8
“McD”
# seats
Entree
9
10
“biscuit”
53
14
Imperfect Types(2)
• Excess: # of EDB’s not
used to validate any
object’s type.
• Deficit: Minimum # of
ground facts that need to
be added to make all
type derivations
possible.
7
Name
4
“The Keg”
Manager
Entree
5
“Steak”
6
“Jim”
11
Name
8
“McD”
# seats
Entree
9
10
“biscuit”
53
15
Perfect Typing Program (Stage 1)
Gore.
16
Multiple Roles
O1
Name
O3
O2
Team
Scholes
Team
Name
Country Country
Movie France Name
Man Utd
Cantona
Rocky
Horror
Bleu
Star Trek
Country
Movie
Movie
Binoche
England
How hard is it to choose to types for the cover?
How do you quantify atomization?
17
Clustering (Stage 2)
Define a distance function between two types:
First approximation is difference between the bodies of
their rule definitions.
t1 :- a0, b2
t3 :- b2, b1, b3
t2 :- a0, b1
d(t1, t2) = 2
18
A Better Function
Include some measure of the weight of a type(# of
objects of that type):
t2 ~> t1
 ( w1, w2)
Some desirable properties:
• increasing in d = coalesce similar types
• decreasing in w1 = compensate for ‘expected noise’
• increasing in w2 = maintain types with large extents
Choosing what to coalesce is hard!
19
Recasting (Stage 3)
Assign each object to types within the k types formed
from stage 2.
(optional) choose a better value of k an rerun step 2.
20
Results
• Heavy use of synthetic data.
– Create a type definition and generate instances
that are peturbed randomly in some way.
• What do the graphs show?
– Are the data sets realistic?
21
Conclusions
• Paper problems:
– The algorithm isn’t completely explained.
– Many comments are not elaborated.
• But, it’s an important problem and good
first approach.
22

Extracting Schema from Semistructured Data

Transcript Extracting Schema from Semistructured Data

Directory