soumen_cse.iitb.ernet.in

Download Report

Transcript soumen_cse.iitb.ernet.in

Part One
XML and Databases
Soumen Chakrabarti
CSE, IIT Bombay
Form and content
• The Web today
– HTML generated by hand, wysisyg editors,
‘webified’ databases
– HTML specifies rendering for human reading
– Screen scraping required to consolidate data
• The Web in the future
– Common interchange format (XML)
– Concentrate on content, not form
– Represent data class broader than relations
Role of databases
• Contribute
– Data storage and indexing
– Query processing and optimization
– Views, transformations, integration
• Adopt
– Search modalities
– Content-based approximate search
– Linguistic analysis
Features of semi-structured data
•
•
•
•
•
No explicit schema, or volatile schema
Schema size comparable to data size
Structure changes without notice
Heterogeneous, deeply nested, irregular
Has nature of documents rather than tables
Semi-structured data model example
Bib
&o1
complex object
paper
paper
book
references
&o12
&o24
references
author
title
year
&o29
references
author
http
page
author
title publisher
title
author
author
author
&o43
&25
&96
1997
last
firstname
lastname
atomic object
firstname
lastname
&243
“Serge”
“Abiteboul”
“Victor”
Object Exchange Model (OEM)
first
&206
“Vianu”
122
133
Syntax
{ paper: { author: “Abiteboul”,
author: { firstname: “Victor”,
lastname: “Vianu”},
title: “Regular path queries …”,
page: { first: 122, last: 133 }
}
}
Some observations
•
•
•
•
Missing or additional attributes
Multiple attributes
Different types in different objects
Heterogeneous collections
Object ID’s and references
<person id=“o555”>
o456
<name>Jane </name>
children
children
</person>
mother
<person id=“o456”>
o555
o123
<name>Mary</name>
<children idref=“o123 o555”/>
</person>
<person id=“o123” mother=“o456”>
<name>John</name>
</person>
Names and acronyms
• OEM (Object Exchange Model): a semistructured data model from Stanford, 1995
• Lore: a system for storing data adhering to
the OEM
• Lorel: a query language for Lore
• XML (eXtensible Markup Language): a
simplification of SGML and a
generalization of HTML
• XML-QL: Query language for XML
Lorel query examples
select Bib.paper.title
from Bib.paper
where Bib.paper.year >1995
Alternative
select X.title
from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = “Ullman”
and Y.reference+ X
Transitive closure
Navigating partially
known structures
XML-QL query examples
where <book language=“french”>
<publisher><name>Morgan Kaufmann</name>
</publisher>
<author> $a </author>
</book> in “www.a.b.c/bib.xml”
construct $a
where <book language = $l> <author> $a </>
</> in “www.a.b.c/bib.xml”
construct <result><author>$a</><lang>$l</></>
XML storage in ternary relation
Ref
S o u rc e
&o1
&
&
&
&
&
paper
&o2
title
&o3
author
author
&o4
“The Calculus” “…”
year
&o5
“…”
&o6
“1986”
• Too many joins
• Label name storage redundant
o1
o2
o2
o2
o2
Val
N ode
&
&
&
&
o3
o4
o5
o6
L abel
D est
paper
title
a u th o r
a u th o r
year
&
&
&
&
&
o2
o3
o4
o5
o6
V a lu e
T h e C a lc u lu s
…
…
1986
Storage optimization through mining
paper
paper paper
Paper1
paper
year
author
title
author
authortitle authortitleauthor title
fn
ln fn
ln
fn
ln
fn
fn 1
ln 1
fn 2
ln 2
title
year
X
X
X
X
X
X
X
-
X
-
X
X
X
X
-
ln
Paper2
• Inline common cases
• Tolerate a few nulls
a u th o r
X
title
X
Schema extraction
• Schema: a template for type/semantics
specification
• Conformance
– Does that data conform to a given schema ?
• Classification
– If so, which objects belong to what
classes/types?
• Applications
– Storage and query optimization
Graph simulation
Given two edge-labeled graphs G1 and G2, a
simulation is a relation R between nodes
such that if (x1, x2) is in R, and (x1, a, y1)
is in G1, then there exists (x2, a, y2) in G2
(same label) such that (y1,y2) is in R
x1
G1
R
x2
a
a
y1
R
y2
G2
Upper and lower bound schema
• Lower bound schema
– Conformance: find simulation R from S to D
– Classification: check if (c,x) in R
– Used in storage optimization
• Upper bound schema (data guides)
– Conformance: find simulation R from D to S
– Classification: check if (x,c) in R
– Used in path index generation and query
optimization
Sample data
&r
employee
employee employee
employee employee
manages
employee
employee
manages
manages
manages
manages
&p1
&p2
managedby
&p3
company worksfor worksfor
&p4
&p5
&p6
&p7
managedby
managedby
managedby
worksfor
employee
managedby
worksfor worksfor
worksfor
worksfor
worksfor
&c
&p8
Lower bound schema
Root
&r
employee
company
employee
Bosses
&p1,&p4,&p6
worksfor
Company
&c
manages
managedby
worksfor
Regulars
&p2,&p3,&p5,&p7,&p8
Storage using lower bound schema
Lower-bound schema
person
Root
company
works-for
managed-by
Company
Employee
c.e.o.
name
address
name
string
Employee
o id
…
…
nam e
…
…
m a n a g e d -b y
…
…
w o rk s-fo r
…
…
Store rest in
overflow graph
Upper bound schema (DataGuides)
Root
&r
company
managedby
employee
Employees
&p1,&p1,&p3,P4
&p5,&p6,&p7,&p8
manages
worksfor
Bosses
&p1,&p4,&p6
worksfor
Company
&c
manages
managedby
worksfor
Regulars
&p2,&p3,&p5,&p7,&p8
Query optimization issues
Select x from A.B x where exists y in x.C: y=5
A
A
D
A
D
B
D
D
B
B
B
B
B
B
B
C
C
C
C
C
C
C
C
C
5
5
5
4
4
5
4
4
5
B
What makes the problem difficult
•
•
•
•
Selectivity estimation
Index selection
Access cost models
Clustering choices
Part Two
Information Retrieval and
Databases
Soumen Chakrabarti
CSE, IIT Bombay
Information retrieval (IR)
• Search
– ‘Inverted’ index
– Boolean match
– Relevance ranking
cat
dog
D5: 3, 37, 50
D7: 9, 20
D7: 7, 90, 400
D20: 22, 533
• Classification
– Learn topics from examples
• Clustering
– Discover topics from a document collection
• Never done inside a relational database
Current style of loose integration
• RDBMS provides hooks
• Declare some columns as textual with
keyword index
• Inserts, updates, and deletes trigger external
program, e.g., Verity search engine
• Search engine maintains separate indices
• Simple query rewriting to combine
relational and text-match where-clauses
Reasons
• Space
– BLOB vs. pure relational representation
– Average English word is only 5 bytes
• Time
– Most text engines are resigned to flexible (i.e.,
no) model for data consistency
– Much faster read-only access than relational
database lookups
New features desired
• Operations that are more complex than
keyword search can benefit from tighter
coupling with RDBMS
• Approximate search is essential (Anand
Rajaraman, Amazon.com, SIGMOD 99)
– Misspelling book title, author name common
– Variant of OEM edge label (author/writer/poet)
• Similarity extends to structure as well
(‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’)
Case study: generalized ‘like’
• SQL has limited string matching constructs
– like ‘%x’, ‘x%’, ‘%x%’
– x must be exact match
• Need more lenient match
– Applications: LDAP, IR
• String edit distance is not suitable
– “Given query, order strings in database in
increasing order of edit distance and pick top 5”
Sliding-window matching
nas
asc
sce cen ent pas sca cal
nascent
pascal
ras
rascal
• Given a query, scan to get a set of 3-grams
• Similarity of string in database to query =
number of shared 3-grams
Issues
•
•
•
•
•
Minimally disruptive architecture
Low storage overheads
Fast query processing
Good selectivity estimates
Combining with other predicates for
ranking
• Efficiently handling updates