Data Independence

CS 540
Database Management Systems
Lecture 2: Data Independence
Contributions of the paper
• Concepts of data independence and
declarative queries.
– Advocates more high level and natural modeling
– Argues for declarative languages
• Definition of relational model
– Data structure based on relations
– A set of operations (algebra) to manipulate data.
• Formal notions
– Expressive power, redundancy, and consistency
The key idea
“Future users of large data banks must be
protected from having to know how the data is
organized in the machine (the internal
representation)”
• It started relational databases.
Data model
• An abstraction of data and how it is being used.
• Elements of a data model
– Structural part: mathematical structures to describe
data.
– Operations: a set of operations used to query and/or
manipulate the data
• A way of thinking about information
• A successful paradigm in data management.
The state of the world before relational
model
• Network/ hierarchical DBMS, 1960’s
– IDS network DBMS: Bachman at GE, 1961
– IMS hierarchical DBMS: IBM in 1968
• CODASYL approach to data management, 1960’s
– CODASYL: Conference on Data Systems Languages, set up by US DOD to
standardize software applications
– COBOL (Common Business-Oriented Language) defined by CODASYL
• ruled the business data processing world
• DBTG (Database Task Group, under CODASYL), 1971
– closely aligned with COBOL
– DBTG Report would standardize network model
(Bachman got Turing award in 1973 for the network model)
Network model: DBTG report
• Network DB
– A collection of records
• record = collection of fields
• similar to an entity in ER model
– Records connected by binary, many-to-one links
• similar to binary relationships in ER
• simulate one-to-one, many-to-many by many-to-one.
Network model: implementation
• Student record linked to enrollment records
[Figure: student record (Johnson, …) linked to enrollment records (CS001, A+) and (CS308, B)]
• A lot of linkage pointers
– ring-structured pointers implement many-to-one links
• Data manipulation is thus navigational
DBTG query example
SQL: select name from student where dept = 'EECS'
DBTG:
  student.dept = "EECS";
  find any student using dept;
  while DB-status = 0 do
  begin
    get student;
    print (student.name);
    find duplicate student using dept;
  end
DBTG query example: predicates
SQL: select name from student where dept = 'EECS'
and gender = 'Male'
DBTG:
  student.dept = "EECS";
  find any student using dept;
  while DB-status = 0 do
  begin
    get student;
    if student.gender = "Male" then
      print (student.name);
    find duplicate student using dept;
  end
DBTG query example: navigation
SQL: select E.grade from student S, enrollment E
where S.name = 'Johnson' and E.id = S.id
DBTG:
  student.name = "Johnson";
  find any student using name;
  find first enrollment within StudentEnroll;
  while DB-status = 0 do
  begin
    get enrollment;
    print (enrollment.grade);
    find next enrollment within StudentEnroll;
  end
What’s wrong?
• Mix presentation and data access
• As a result,
– Programming is difficult and complex
– Application can become incorrect once there’s a change
in data representation
• Just like programming in assembly languages (as
opposed to high-level programming languages)
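The contrast between navigational and declarative access can be sketched in Python with the standard sqlite3 module. This is only an illustration: the sample rows, the table layout, and the function names are assumptions, and the loop imitates the shape of the DBTG programs above rather than any real DBTG interface.

```python
import sqlite3

# Hypothetical sample data, made up for illustration.
STUDENTS = [
    {"name": "Johnson", "dept": "EECS"},
    {"name": "Lee", "dept": "Math"},
    {"name": "Patel", "dept": "EECS"},
]

def names_navigational(records, dept):
    """Navigational style: the program walks the stored records itself,
    so it is tied to how the records are stored and linked."""
    result = []
    position = 0                       # plays the role of a "current record" pointer
    while position < len(records):     # analogous to: while DB-status = 0
        record = records[position]     # analogous to: get student
        if record["dept"] == dept:
            result.append(record["name"])
        position += 1                  # analogous to: find the next record
    return result

def names_declarative(conn, dept):
    """Declarative style: state what is wanted; the DBMS chooses how to fetch it."""
    rows = conn.execute(
        "SELECT name FROM student WHERE dept = ? ORDER BY name", (dept,)
    )
    return [name for (name,) in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO student VALUES (:name, :dept)", STUDENTS)

print(names_navigational(STUDENTS, "EECS"))
print(names_declarative(conn, "EECS"))
```

Both functions return the same names here, but only the first one has to change if the storage layout changes.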
Data dependence
• Ordering preferences
– Applications may rely on a particular ordering of the
stored data
– example?
• Indexing dependence
– Applications may rely on the availability of certain
indices, but indices are semantically redundant and only
necessary for “optimization”.
– example?
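One possible example of ordering dependence, sketched in Python: an application that uses binary search silently assumes the records are stored in sorted order. The names and the function are hypothetical and chosen only for illustration.

```python
import bisect

def find_by_name_assuming_sorted(names, target):
    """Binary search: correct only if `names` is stored in sorted order."""
    i = bisect.bisect_left(names, target)
    return i < len(names) and names[i] == target

stored_sorted = ["Adams", "Johnson", "Smith"]        # original storage order
stored_reorganized = ["Smith", "Adams", "Johnson"]   # same data, different order

found_before = find_by_name_assuming_sorted(stored_sorted, "Smith")
found_after = find_by_name_assuming_sorted(stored_reorganized, "Smith")

print(found_before)   # the record is found while the data stays sorted
print(found_after)    # the same record is missed after reorganization
```

The data is unchanged between the two calls; only the stored order differs, and that is enough to make the application give a wrong answer.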
Data dependence
• Access path dependence
– Applications would hard code access paths to data, so would
rely on the continued existence of the used access paths
Levels of abstraction in DBMS
• Physical implementation
– storage structure, indexing, access method
• Logical data model
– conceptual data structure and manipulation
• Views
– different portions of databases
• Who should see each level?
Relational model of data
• Relations
– given sets S1, S2, …, Sn (not necessarily distinct)
– relation R is a subset of the Cartesian product S1 x S2 x … x Sn
– Sj is jth domain of R, n is degree of R
• Relations as tables
– each row represents an n-tuple of R
– ordering of rows is immaterial
– all rows are distinct
– ordering of columns is significant
– label each column with the name of the corresponding domain
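The definition of a relation as a subset of a Cartesian product can be sketched with Python sets of tuples. The two domains and their values are assumptions made up for this illustration.

```python
from itertools import product

# Hypothetical domains S1 and S2.
titles = {"MySQL", "Cell biology"}
categories = {"computer", "biology"}

# S1 x S2: every possible (title, category) pair.
cartesian = set(product(titles, categories))

# A relation R of degree 2 is any subset of S1 x S2.
book_category = {("MySQL", "computer"), ("Cell biology", "biology")}

is_relation = book_category <= cartesian       # subset test
degree = len(next(iter(book_category)))        # number of columns in each tuple

# A set keeps no row order and no duplicate rows, matching the table properties.
deduplicated = {("MySQL", "computer"), ("MySQL", "computer")}

print(is_relation, degree, len(deduplicated))
```

Tuples are positional, so the order of columns still matters in this representation, as in the definition above.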
Relation: example
Relation name: Book
Attribute names: Title, Price, Category, Year
Tuples:
  Title          Price    Category  Year
  MySQL          $102.1   computer  2001
  Cell biology   $201.69  biology   1954
  French cinema  $53.99   art       2002
  NBA History    $63.65   sport     2010
Data manipulation
• Relational algebra
– operations
• Relational calculus
– semantics in terms of logic
• Essential beauty of relational model:
– Query is data and data is query!
Named Relations
Expressible Relations
Operations on relations
• Usual set operations
– since relations are sets of homogeneous tuples
Operations on relations: deriving
relations
• Permutation
– interchange the columns of an n-ary relation
• Projection
– select columns and remove any duplication in the rows
• Join
– selectively combining tuples in two relations
– as a “class” of new relations that losslessly take some columns from
either source relation
• Composition
– join two relations and remove join columns
• Restriction
– filter one relation with another
• You can combine them to write queries.
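The derived operations listed above can be sketched over Python sets of tuples. This is a simplified reading, not a faithful transcription of the paper's definitions: join and composition are shown only for binary relations sharing one column, restriction is approximated as "keep the rows of one relation whose chosen columns appear in another", and all data and function names are hypothetical.

```python
def permute(relation, order):
    """Permutation: reorder the columns of every tuple."""
    return {tuple(row[i] for i in order) for row in relation}

def project(relation, columns):
    """Projection: keep the chosen columns; the set removes duplicate rows."""
    return {tuple(row[i] for i in columns) for row in relation}

def join(r, s):
    """Join binary relations r(a, b) and s(b, c) on the shared column b."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}

def compose(r, s):
    """Composition: join, then drop the join column."""
    return project(join(r, s), [0, 2])

def restrict(r, s, r_cols, s_cols):
    """Restriction (simplified): keep rows of r whose r_cols values occur in s."""
    allowed = project(s, s_cols)
    return {row for row in r if tuple(row[i] for i in r_cols) in allowed}

# Hypothetical data: which supplier supplies which part, and which part a project uses.
supplies = {(1, "bolt"), (2, "bolt"), (2, "nut")}
used_in = {("bolt", "A"), ("nut", "B")}

print(sorted(project(supplies, [0])))
print(sorted(join(supplies, used_in)))
print(sorted(compose(supplies, used_in)))
print(sorted(restrict(supplies, {("nut", "B")}, [1], [0])))
```

A query such as "which suppliers supply a part used in project A" is then a combination of these functions, with no reference to how the tuples are stored.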
Algebra: questions
• What is missing in this set of operators?
• Is it minimal?
• How is it different from “current” algebra?
The relational algebra (now)
• Basic operations:
– Selection (σ): selects a subset of rows from a relation.
– Projection (π): deletes unwanted columns from a relation.
– Cross-product (×): allows us to combine two relations.
– Set-difference (−): tuples in relation 1, but not in relation 2.
– Union (∪): tuples in relation 1 or in relation 2.
– Renaming (ρ)
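As a rough sketch of why this small set of basic operations is enough for many queries, the Python below builds intersection and an equality join out of selection, projection, cross-product, and set-difference. Columns are addressed by position, so renaming is not needed here; the data and function names are assumptions for illustration.

```python
def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return {row for row in relation if predicate(row)}

def project(relation, columns):
    """Projection: keep only the listed column positions."""
    return {tuple(row[i] for i in columns) for row in relation}

def cross(r, s):
    """Cross-product: every row of r concatenated with every row of s."""
    return {a + b for a in r for b in s}

def difference(r, s):
    return r - s

def union(r, s):
    return r | s

# Derived operations, written only in terms of the basic ones.
def intersection(r, s):
    return difference(r, difference(r, s))

def equijoin(r, s, r_col, s_col, r_arity):
    """Join = selection over a cross-product; r_arity is the column count of r."""
    return select(cross(r, s), lambda row: row[r_col] == row[r_arity + s_col])

# Hypothetical data.
employee = {("Ann", "Sales"), ("Bob", "Research")}   # (employee, department)
management = {("Carol", "Ann"), ("Dave", "Bob")}     # (manager, employee)

joined = equijoin(employee, management, 0, 1, 2)
employee_manager = project(joined, [0, 2])
print(sorted(employee_manager))
```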


Redundancy and consistency
• Redundancy
– redundant if something can be “derived” from others
– foundation: what operations allowed in “derivation”
• Consistency
– data snapshot must satisfy some constraints
• Started research in schema design
– e.g., normal forms; normalization
The advantages of relational model
• Simplicity!
• Mathematically complete data model
• Declarative query languages
– queries can be automatically compiled, executed,
and optimized without resorting to low-level
programming
Does relational model provide data
independence?
• Ordering independence?
• Index independence?
• Access path independence?
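Index independence can be checked with a small experiment using Python's standard sqlite3 module: the same query text is run before an index exists, after it is created, and after it is dropped. The table, rows, and index name are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Johnson", "EECS"), ("Lee", "Math"), ("Patel", "EECS")],
)

QUERY = "SELECT name FROM student WHERE dept = 'EECS'"

def run(connection):
    # sorted() also removes any reliance on the order rows are returned in
    return sorted(name for (name,) in connection.execute(QUERY))

before = run(conn)
conn.execute("CREATE INDEX student_dept ON student (dept)")
with_index = run(conn)
conn.execute("DROP INDEX student_dept")
after_drop = run(conn)

print(before, with_index, after_drop)
```

The query never mentions the index, so creating or dropping it can change performance but not the answer.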
Unexpected benefits
• Client-server architecture
– SQL request/response enables high-level, compact
exchange between clients and server
– clients: input and output, application logic
– server: data processing
• Parallel processing: relations in and out
– pipeline: piping the output of one op into the next
– partition: N op-clones, each processes 1/N input
• Graphical user interfaces
– relations fit the spreadsheet (table) metaphor
The rise of relational model
• Codd’s paper in 1970
– resistance even within IBM
– Too mathematical, no system (students raised the
same questions!)
• First implementations, 1973
– System R at IBM San Jose Lab
– INGRES at UC Berkeley
• The “Great Debate” in 1975 SIGMOD conf.
The great debate (SIGMOD 1975)
• COBOL/CODASYL camp on the relational model
– too mathematical (to understand)
• Relational camp on COBOL/CODASYL
– too complicated (to program)
Relational model/system impact
• Codd’s paper published in 1970
• First implementations, 1973-
– System R at IBM San Jose Lab, 1974-1978
– INGRES at UC Berkeley, 1973-1977
• System R influence:
– IBM DB2
– Oracle: started from published spec. of System R
• INGRES:
– a team member later founded Sybase
– evolved into Microsoft SQL Server, after Microsoft bought code from
Sybase
What has changed over the years?
• Rows may not be distinct now
– set versus bag semantics
– SQL: “select distinct” to eliminate the duplicates
• Non-simple domains: i.e. complex objects
– originally: only built-in data types were allowed
– new: object-relational DB, multimedia DB
• Generations of relations: temporal aspect
– temporal databases
• e.g.: query GPA at the end of year 2000
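The set-versus-bag point above can be seen directly in SQL. A minimal sketch with Python's standard sqlite3 module, using a made-up enrollment table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (student TEXT, course TEXT)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?)",
    [("Johnson", "CS001"), ("Johnson", "CS308"), ("Smith", "CS001")],
)

# Bag semantics: projecting onto `student` keeps one row per enrollment.
bag = [s for (s,) in conn.execute(
    "SELECT student FROM enrollment ORDER BY student")]

# Set semantics, as in the original relational model, needs an explicit DISTINCT.
as_set = [s for (s,) in conn.execute(
    "SELECT DISTINCT student FROM enrollment ORDER BY student")]

print(bag)
print(as_set)
```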
Problems with relational model
• Data is often hierarchical/complex in nature
– normalization is unnatural decomposition of data
for storage, to be assembled by joins at query time.
• Other data models provide more natural
representations in many domains.
Network/hierarchical models making a
comeback!
• A great deal of graph data sets
– Web is a huge network database!
• XML is both navigational and hierarchical
<student>
  <name>John Smith</name>
  <dept>CS</dept>
  <enrollments>
    <enrollment>
      <course>CS311</course>
      <grade>A+</grade>
    </enrollment>
    … … <!-- more enrollments -->
  </enrollments>
</student>
Domain specific data models
• Scientific data is better captured by arrays.
• We create new data models for certain domains
– preserve the data independence principle.
Does relational model provide access
path independence?
• A schema for an HR database:
Employee(employee, department)
Management(manager, employee)
– Find the manager of each employee.
• Another schema for the HR database:
EmployeeManagement(employee, manager, department)
– Find the manager of each employee.
– different query!
• Relational model is not fully access path independent.
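The two HR schemas can be tried side by side with Python's standard sqlite3 module. The table and column names come from the slide; the sample rows are assumptions. The same question needs a different query text under each schema, which is the sense in which the application still depends on the logical design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (employee TEXT, department TEXT);
CREATE TABLE Management (manager TEXT, employee TEXT);
CREATE TABLE EmployeeManagement (employee TEXT, manager TEXT, department TEXT);

-- Hypothetical sample rows, the same facts stored under both schemas.
INSERT INTO Employee VALUES ('Ann', 'Sales'), ('Bob', 'Sales');
INSERT INTO Management VALUES ('Carol', 'Ann'), ('Carol', 'Bob');
INSERT INTO EmployeeManagement VALUES
    ('Ann', 'Carol', 'Sales'), ('Bob', 'Carol', 'Sales');
""")

# "Find the manager of each employee" under the first schema.
QUERY_SCHEMA_1 = "SELECT employee, manager FROM Management ORDER BY employee"
# The same question under the second schema names a different relation.
QUERY_SCHEMA_2 = "SELECT employee, manager FROM EmployeeManagement ORDER BY employee"

answer_1 = conn.execute(QUERY_SCHEMA_1).fetchall()
answer_2 = conn.execute(QUERY_SCHEMA_2).fetchall()
print(answer_1 == answer_2, QUERY_SCHEMA_1 != QUERY_SCHEMA_2)
```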
The access path independence of relational model
versus previous models
• Which one has more variations for the same data?
Solution: Universal Relation
• View database as a single universal relation!
– It is the natural join of all relations in the database.
– It contains all attributes.
HR database:
Employee(employee, department)
Management(manager, employee)
Universal relation:
EmployeeManagement(employee, manager, department)
• Write your queries over the universal relation.
• Problems?
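One way to experiment with the idea is to define the universal relation as a view over the natural join of the base relations, sketched here with Python's standard sqlite3 module. The view name and the sample rows are assumptions; Carol is deliberately given no manager so that one possible problem becomes visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (employee TEXT, department TEXT);
CREATE TABLE Management (manager TEXT, employee TEXT);

-- Hypothetical rows; Carol has no manager recorded.
INSERT INTO Employee VALUES ('Ann', 'Sales'), ('Bob', 'Sales'), ('Carol', 'Sales');
INSERT INTO Management VALUES ('Carol', 'Ann'), ('Carol', 'Bob');

-- The universal relation as the natural join of all base relations.
CREATE VIEW UniversalRelation AS
    SELECT employee, manager, department
    FROM Employee NATURAL JOIN Management;
""")

# Queries are written against the single universal relation.
managers = conn.execute(
    "SELECT employee, manager FROM UniversalRelation ORDER BY employee"
).fetchall()

employees_stored = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
employees_visible = conn.execute(
    "SELECT COUNT(DISTINCT employee) FROM UniversalRelation"
).fetchone()[0]

print(managers)
print(employees_stored, employees_visible)
```

In this sketch Carol disappears from the universal relation because she has no matching Management row, which hints at one kind of problem the slide asks about.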
Universal Relation Assumptions
• Unique attribute names
– Each attribute in the database has a unique name.
• Relationship uniqueness
– e.g., "manages" versus "manages the manager of"
– Focus on the most basic relationship
• Query answering
– Pre-compute the universal relation and run the
query.
– Translate the query to a join of base relations
Universal Relation War!
• Controversial!
• Some successes.
• Still an influential idea
Questions to think about
• How to manage current network data without
losing the benefit of data independence?
• Is there a trade-off between data dependence
and say, efficiency?
• How to combine intuitive nature of network
model and benefits of relational models?
Carry away messages
• Raise important research questions
– See deficiencies in the current state of the world (data
dependency)
– Propose a change to the world that would address some
of the deficiencies (declarative queries)
• Leverage principled/mathematical tools (relational
algebra).
What is next?
• How to carry out and present your project?
• Overview of some sample projects.