Transcript of Keynote PPT

Database Decay
and
What To Do About It
by
Michael Stonebraker
(With Dong Deng and Michael Brodie)
My Thesis
• Traditional database design is all wrong
  • Gold-standard techniques are not used “in the wild”, and for good reasons
  • Contributes to database decay
• DBMS application development is all wrong
  • Contributes to database decay
• There are kluges to slow down decay
  • Which I will talk about
• Best idea is a new paradigm for app development
  • Which I will talk about
Context
• Operational databases
• But “rot” is upstream from analytic databases
  • And “flows downstream”
Traditional Database Design
Wisdom
•Use a modeling tool (e.g. an Entity-Relationship diagram tool)
•To construct/modify an E-R diagram
•When satisfied, push a button
•Tool spits out a relational schema in 3rd
normal form
•Code in ODBC/JDBC against this schema
Example
[E-R diagram] Supplier (sno, sregion) –1– Supply (qty) –N– Part (pno, psize, pcolor)

Two entities (Supplier and Part, with attributes)
One relationship (Supply, with attributes, which is 1 – N)
Lots of Elaboration
•Weak entities
•Inheritance
•…
•But the basics are all we need
Algorithm to Produce
3rd Normal Form
•Relation for each entity with attributes
•For each 1 - N relationship, add the key
of the 1 side to the table for the N side
with its attributes
•For each M – N relationship, add a table
for the relationship with both keys and
the attributes
For Our Data
Supplier (sno, sregion)
Part (pno, psize, pcolor, sno, qty)
•Application groups write applications for
this relational schema, typically using
ODBC/JDBC as the interface
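To make the 1 – N design concrete, here is a minimal sketch in Python using the standard-library sqlite3 module (an assumption for illustration; the slides do not name a DBMS, and the sample data is invented):

```python
import sqlite3

# In-memory database; table and column names follow the slides.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sregion TEXT);
-- 1-N relationship: the key of the "1" side (sno) and the
-- relationship attribute (qty) migrate into the "N" side (Part).
CREATE TABLE Part (
    pno INTEGER PRIMARY KEY,
    psize INTEGER, pcolor TEXT,
    sno INTEGER REFERENCES Supplier(sno),
    qty INTEGER
);
""")
conn.execute("INSERT INTO Supplier VALUES (1, 'Northeast')")
conn.execute("INSERT INTO Part VALUES (10, 4, 'red', 1, 500)")
row = conn.execute(
    "SELECT sregion, qty FROM Supplier, Part WHERE Supplier.sno = Part.sno"
).fetchone()
print(row)  # ('Northeast', 500)
```

Applications then code ODBC/JDBC-style calls directly against these two tables.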
Large Applications
• Usually designed/coded by separate application groups, reporting to different managers
  • Parts controlled by engineering
  • Supply controlled by procurement
  • Supplier controlled by finance
• In aggregate have 3 applications
• Often on different development schedules
and budgets
Now Suppose….
• Management decides to allow multiple suppliers for each part, as long as they are in different regions
  • Perhaps to get better terms
  • Or …
• Such changes happen frequently “in the wild”
  • Often once a quarter or more
New Diagram
[E-R diagram] Supplier (sno, sregion) –M– Supply (qty) –N– Part (pno, psize, pcolor)

Two entities (Supplier and Part, with attributes)
One relationship (Supply, with attributes, which is now M – N)
New Tables
Supplier (sno, sregion)
Part (pno, psize, pcolor)
Supply (sno, pno, qty)
Supplier table is unchanged
“Old Part” table is a join of Part and Supply
(definable as a view based on “new stuff”)
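A minimal sketch of that view definition, again in SQLite (the view name OldPart and the sample data are invented here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- New (M-N) schema
CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sregion TEXT);
CREATE TABLE Part (pno INTEGER PRIMARY KEY, psize INTEGER, pcolor TEXT);
CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER);
-- "Old Part" reconstructed as a view: a join of Part and Supply
CREATE VIEW OldPart AS
    SELECT p.pno, p.psize, p.pcolor, s.sno, s.qty
    FROM Part p JOIN Supply s ON p.pno = s.pno;
""")
conn.execute("INSERT INTO Supplier VALUES (1, 'Northeast')")
conn.execute("INSERT INTO Part VALUES (10, 4, 'red')")
conn.execute("INSERT INTO Supply VALUES (1, 10, 500)")
old_rows = conn.execute("SELECT * FROM OldPart").fetchall()
print(old_rows)  # [(10, 4, 'red', 1, 500)]
```

Reads against the old schema keep working; the trouble starts with writes, as the next slides show.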
Summary of The Traditional
Wisdom
•Convert “old” database to “new”
•Define “old” as views
•Applications (in theory) continue to run
•Database stays in 3NF (defined as
goodness by research community)
A Dirty Little Secret
•Based on a survey of about 20 DBAs at 3 large companies in the Boston area…
  • None use this methodology, or
  • Use it for “green field” design and then abandon it
•I.e. the “gold standard” is not used “in
the wild”
•So why not??????
• Rest of the talk is a mix of speculation
and ideas
Problem #1: View Update Issues
•Application: change the qty for part XXX to 500
  • An update to “old Part”
•But “old Part” is a view which is a join
  • Updates to such views are disallowed in most RDBMSs
  • Either because of semantic ambiguity
  • Or vendor laziness
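The failure is easy to reproduce. In SQLite, for instance, any UPDATE against a join view is rejected outright (a sketch with invented names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Part (pno INTEGER PRIMARY KEY, psize INTEGER, pcolor TEXT);
CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER);
CREATE VIEW OldPart AS
    SELECT p.pno, p.psize, p.pcolor, s.sno, s.qty
    FROM Part p JOIN Supply s ON p.pno = s.pno;
""")
try:
    # The application's old update, now aimed at a join view
    conn.execute("UPDATE OldPart SET qty = 500 WHERE pno = 10")
    failed = False
except sqlite3.OperationalError as e:
    failed = True
    print(e)  # SQLite refuses to modify a view
assert failed
```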
Problem #2: Application
Semantics May Change
•Delete the fact that supplier XXX supplies part YYY
  • In the “old” schema this deletes part YYY
  • Or turns fields in the Part table to “null”
•In the new world, where there are multiple suppliers of part YYY, it presumably only deletes a Supply tuple
Problem #3: Other Applications
May Fail
•And they are in different departments
•With different budgets
Net Result
•Substantial risk!
  • Applications all over the enterprise must be found and corrected
  • Often with no budget for this activity
Less Risky Solution -- Kluge
•Leave the old schema as is:
  Supplier (sno, sregion)
  Part (pno, psize, pcolor, sno, qty)
•Replicate Part info when multiple suppliers supply the same part
  • Result does not conform to any E-R diagram!
  • No longer 3NF
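A sketch of the kluge: the old two-table schema is kept, so a part with two suppliers is simply stored twice (sample data invented here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Old two-table schema kept as is; pno can no longer be the primary key.
conn.executescript("""
CREATE TABLE Supplier (sno INTEGER PRIMARY KEY, sregion TEXT);
CREATE TABLE Part (pno INTEGER, psize INTEGER, pcolor TEXT,
                   sno INTEGER, qty INTEGER);
""")
# Part 10 from two suppliers: psize/pcolor are replicated (no longer 3NF).
conn.execute("INSERT INTO Part VALUES (10, 4, 'red', 1, 500)")
conn.execute("INSERT INTO Part VALUES (10, 4, 'red', 2, 300)")
rows = conn.execute("SELECT sno, qty FROM Part WHERE pno = 10").fetchall()
print(rows)  # [(1, 500), (2, 300)]
```

The replication buys compatibility at the price of redundancy the applications must now keep consistent.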
Our Two Examples Do the Right Thing
• No views to deal with
• Deleting a tuple in the Part table “does the
right thing”
• Reasonable chance that other applications will
continue to run correctly
The Net-Net
(Our Conjecture)
• DBAs want to lower risk
• Therefore DBAs try not to change the schema
• Our change does not conform to any ER
schema
• ER diagram diverges from reality, if it ever existed
• Hence, DBAs deal with table schemas, not ER
schemas
• Table schemas become increasingly distant from 3NF
• We term this process “database decay”
• Which ultimately means that a total rewrite is the
only way forward
So What to Do?
• Higher-level interfaces
• Kluges
  • Defensive schemas
  • Defensive applications
• A new development paradigm
Higher Level Interface
• ORMs
  • Object model
• E-R frameworks
  • E-R model
• Solve neither of our problems!
• And encapsulate the SQL, so it is inaccessible
Defensive Schemas
• Use the 3-table version, even though initially the 2-table one will work
  • Anticipate changes and “build them in”
  • And code against the larger schema initially
• Leave some “dead fields” in the initial schema
  • For later use
Defensive Schemas
• Think “tall and skinny”
• Single table
  • Data (entity-name, attribute-name, value)
• Or lots of binary tables
  • Attribute (entity-key, value)
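A sketch of the single-table variant, with invented sample data; every fact becomes one (entity-name, attribute-name, value) row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One "tall and skinny" table holds everything.
conn.execute(
    "CREATE TABLE Data (entity_name TEXT, attribute_name TEXT, value TEXT)")
rows = [
    ("supplier:1", "sregion", "Northeast"),
    ("part:10", "psize", "4"),
    ("part:10", "pcolor", "red"),
]
conn.executemany("INSERT INTO Data VALUES (?, ?, ?)", rows)
# Adding a brand-new attribute later needs no schema change at all:
conn.execute("INSERT INTO Data VALUES ('part:10', 'pweight', '2.5')")
attrs = dict(conn.execute(
    "SELECT attribute_name, value FROM Data WHERE entity_name = 'part:10'"))
print(attrs)
```

Note that everything is stored as text and the “schema” lives only in application conventions, which is exactly where the performance and integrity costs on the next slide come from.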
Tall and Skinny
• Generally horrible performance
  • But survives lots of changes
• And integrity control must be built into application logic
  • Which may need to be changed
Defensive Applications
• Never use
Select *
From XXX
• If a field you don’t use is dropped, then
this code dies
• Put in the target list only the fields you actually need
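For example (a sketch with invented names), naming exactly the field you need keeps working even as unrelated columns come and go:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Part (pno INTEGER, psize INTEGER, pcolor TEXT)")
conn.execute("INSERT INTO Part VALUES (10, 4, 'red')")

# Fragile: positional access after SELECT * silently breaks if a
# column is added, dropped, or reordered:
#   row = conn.execute("SELECT * FROM Part").fetchone(); color = row[2]

# Defensive: name the one field this code actually needs.
color, = conn.execute("SELECT pcolor FROM Part WHERE pno = 10").fetchone()
print(color)  # red
```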
Defensive Applications
• Use
Select Sum ( …)
From XXX
• Even if you know there is only one result
• Later on there may be more than one,
and this code will continue to work
• Additional examples in the paper
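A sketch of the same idiom (invented names): the aggregate returns one row whether the part has one supplier or several, so the code survives the move from 1 – N to M – N:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER)")
conn.execute("INSERT INTO Supply VALUES (1, 10, 500)")

def total_qty(pno):
    # SUM yields a single row whether one or many rows match,
    # so callers never depend on there being exactly one supplier.
    (total,) = conn.execute(
        "SELECT SUM(qty) FROM Supply WHERE pno = ?", (pno,)).fetchone()
    return total

assert total_qty(10) == 500   # one supplier: same answer as before
conn.execute("INSERT INTO Supply VALUES (2, 10, 300)")
assert total_qty(10) == 800   # second supplier: code still correct
```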
Best Idea – Change the Paradigm
• ODBC is coded all over the enterprise
  • Where it is unmanageable
• Application programmers must all understand the schema
  • Which is a big challenge
  • For large applications with 10’s to 100’s of tables
• And they must understand defensive programming
  • Again a big challenge
Old Way
[Diagram: App 1, App 2, and App 3 each issue ODBC/JDBC calls directly against the database]

New Way

[Diagram: App 1, App 2, and App 3 send messages to DBA-controlled middleware, which talks to the database]
Processing Model
•Application programmers cannot write SQL!
•Instead, they negotiate with the DBA
  • The programmer proposes a task
  • If the DBA agrees, the DBA writes the SQL
  • And defines a messaging interface for the application programmer
Processing Model
• Think web services
• Think stored procedures
• Think execute task (parameters)
• We will focus on a SQL interface – since
we want to figure out what will continue
to run when the schema changes
DBA Effectively Has
•A database of
  • (Task, SQL) pairs
•There are lots of details (multiple commands, code in between the SQL commands, transactions, …), which we ignore
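A minimal sketch of such a (Task, SQL) database and its messaging interface (all names invented here; a real system would sit in middleware, not in the application process):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Supply (sno INTEGER, pno INTEGER, qty INTEGER);
INSERT INTO Supply VALUES (1, 10, 500);
""")

# The DBA's database of (task, SQL) pairs; applications never see the SQL.
TASKS = {
    "get_qty": "SELECT qty FROM Supply WHERE sno = :sno AND pno = :pno",
    "set_qty": "UPDATE Supply SET qty = :qty WHERE sno = :sno AND pno = :pno",
}

def execute_task(name, **params):
    """The messaging interface: a task name plus named parameters."""
    return conn.execute(TASKS[name], params).fetchall()

execute_task("set_qty", sno=1, pno=10, qty=800)
print(execute_task("get_qty", sno=1, pno=10))  # [(800,)]
```

When the schema changes, only the SQL inside TASKS must change; the application-facing messages stay fixed.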
And….
•The DBA constructs the initial relational schema
  • Perhaps iteratively
•However, think tables, not some other notation
  • Remember, the schema may decay!
•App programmers implement business logic, but don’t need to know the schema
•DBA converts messages into SQL
Can We Help?
•With the initial schema
•With the replacement SQL
  • We will start here
Suppose….
•Schema is considered a collection of
tables with primary-key/foreign-key
relationships
• Think of this as a graph
• Connected by PK/FK “join paths”
• At scale, 10’s to 100’s of tables
In General
•The parameters of each message specify
locations in this graph
•And we require the “shape of the stored
procedure”
• 1 input or N inputs
• 1 output or N outputs
Example
•Task: Order Part XXX from Supplier YYY
(generates an order of size qty) -- shape
is one part and one supplier
•Graph: Supplier – Supply – Part
Example
•Straightforward to (mostly automatically)
generate the SQL from the message
•Because there is only a single join path
(Join path: Supplier – Supply – Part)
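A hypothetical sketch of that generation step (names invented here; this is not the prototype's actual algorithm): given the single PK/FK join path, assembling the SQL is mechanical.

```python
# Schema-graph edges along the one join path: (table, table, join condition).
PATH = [
    ("Supplier", "Supply", "Supplier.sno = Supply.sno"),
    ("Supply", "Part", "Supply.pno = Part.pno"),
]

def sql_for(path, target, filters):
    """Build a SELECT over the tables along one join path."""
    tables = [path[0][0]] + [b for _, b, _ in path]
    conds = [c for _, _, c in path] + filters
    return (f"SELECT {target} FROM {', '.join(tables)} "
            f"WHERE {' AND '.join(conds)}")

# The message's parameters name one supplier and one part (the "shape"):
print(sql_for(PATH, "qty", ["Supplier.sno = :sno", "Part.pno = :pno"]))
```

Because the path is unique, no human judgment is needed to pick the tables or the join predicates.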
When Business Conditions
Change…
•Change the schema
• It is the “right” thing to do!
•And map the SQL
Example
•Task: Order Part XXX from Supplier YYY
(generates an order of qty)
•Graph: Supplier – Supply – Part
For Each SQL Command:
•Have the old path
•And the new path
•And the old SQL
For Each SQL Command:
•Can classify the command as:
• Can be run unchanged (path and shape
are the same)
• Path has changed, but there is a unique
replacement with the same shape (new
SQL can be generated automatically)
• There is no new path or multiple new
paths or shape has changed (human
intervention required)
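The three cases can be sketched as a tiny classifier (a hypothetical illustration with invented names, not the prototype's code): given a command's old join path and the candidate replacement paths found in the new schema graph, pick the case.

```python
def classify(old_path, candidate_paths, same_shape=True):
    """Classify one SQL command after a schema change."""
    if candidate_paths == [old_path] and same_shape:
        return "run unchanged"
    if len(candidate_paths) == 1 and same_shape:
        return "generate new SQL automatically"
    return "human intervention required"

old = ("Supplier", "Part")                # direct path in the old schema
new = [("Supplier", "Supply", "Part")]    # unique path in the new schema
print(classify(old, new))                 # generate new SQL automatically
```

In our running example the path changed but is unique and the shape is the same, so the middle case applies and the SQL can be rewritten without a human.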
Possible Strategies
•Run the tool in advance of committing
the schema change to see how extensive
the changes are
•Use the tool at run time to fix the SQL at 1st access
  • Effectively “schema at 1st access”
•Or anything in between
We have a Prototype
•Would love to find an “in the wild” situation to try it on
  • Especially one with a lot of tables!
•This is a plea for a use case!!!!