Transcript MIS 533
Introduction to Databases
Data Organisation
Definition
Data modelling
SQL
DBMS functions
Basics of data Organisation:
DATA HIERARCHY (four categories)
• Fields = represent a single data item
• Records = made up of a related set of
fields describing one instance of an entity
• File / Table = a set of related records - as
many as instances (occurrence) in the set
Example of data structure
Fields
Records
Name
First name Telephone
Sampras
Healy
Clinton
Henry
Pete
Margaret
Bill
Thierry
File / Table
45 25 65 65
25 58 96 63
12 25 28 89
25 78 85 85
+ Other files =>complete data
Structure = DB
Database: Definition.
"A collection of interrelated data stored together
with controlled redundancy, to serve one or more
applications in an optimal fashion; the data is
stored so that it is independent of the application
programs which use it; a common and controlled
approach is used in adding new data and in
modifying existing data within the database."
Definition - closer look
• A collection of interrelated data stored together
• with controlled redundancy
• to serve one or more applications in an optimal
fashion
• the data is stored so that it is independent of
the application programs which use it
• a common and controlled approach is used in
adding new data and in modifying existing data
within the database.
Advantages of Databases:
• data are independent from applications stored centrally
• data repository accessible to any new
program
• data are not duplicated in different locations
• programmers do not have to write extensive
descriptions of the files
• These save enough money and time to offset
the extra costs of setting and maintaining
DBs
Disadvantages of DBs:
• Data are more accessible so more easily abused
• Large DBs require expensive hardware and
software
• specialised / scarce personnel is required to
develop and maintain large DBs
• People / business units may object to “their” data
being widely available in a DB
Characteristics of DBs…
• High concurrency (high performance under load)
• Multi-user (read does not interfere with write)
• Data consistency – changes to data don’t affect running
queries + no phantom data changes
• High degree of recoverability (pull the plug test)
ACID test
• Atomicity
• Consistency
• Isolation
• Durability
All or nothing
Preserve consistency of database
Transactions are independent
Once committed data is preserved
DataBase Management
System (DBMS):
• program that makes it possible to:
– create
– use
– maintain
a database
• It provides an interface / translation
mechanism between the logical
organisation of the data stored in the DB
and the physical organisation of the data
Using a database:
Two main functions of the DBMS :
• Query language - for people who are not
programmer (greatest advantage of DB)
• Data manipulation language - for programmers
who want to modify the links between data
elements within the DB
• Also, Host Language - the language used by
programmers to develop the rest of the application
- eg: Visual Basic for Applications (VBA) / Oracle
developer 2000
Different types of DBs:
• creating the DB = specifying the links
between data items
• different types of relationships can be
specified - ie different logical views
• they correspond to three main types of
DBMSs:
–
–
–
–
Hierarchical DBs
Network DBs
Relational DBs
Object Oriented DBs
Hierarchical DBs:
• data item are related as “Parent” and
“Child” in a tree-like structure
• “parent” means data item is higher in the
tree than “child” and connected to it
• one “parent” can have more than one
“child”, but one “child” can only have one
“parent”
• most common platform = IBM’s
Information Management System (IMS)
Example
Customers
Payments
Orders
Currency
Items
Unit of
packaging
Very fast retrieval
Substitution
Product
Undesirable side effects:
• Insertion of record:
– dependent record cannot be added without a parent
– eg: units of packaging cannot be added without linkage
to an existing item
• Deletion of record:
– deletion of a parent deletes all children
– deleting an existing item will delete its replacement
items
• Impossible to have two parents = trouble
Network DBs:
• same as parent and children in Hierarchical
DB, but children can have more than one
parent
• It is also possible to link items upwards to
other items parents
• practically, it means that the DBMS is more
flexible for data retrieval
Example
Suppliers
Customers
Payments
Orders
Currency
Items
Unit of
packaging
Substitution
Product
Relational DBs:
• Data items stored in tables
• Specific fields in tables related to other field
in other tables (joint)
• infinite number of possible viewpoints on
the data (queries)
• Highly flexible DB but overly slow for
complex searches
• Oracle, SyBase, Ingres, Access, Paradox
for Windows...
Describing relationships
• Attempt at modelling the business elements
(entities) and their relationships (links)
• Can be based on users’ descriptions of the
business processes
• Specifies dependencies between the data
items
• Coded in an Entity-Relationship Diagram
(ERD)
Types of Relationships
• one-to-one: one instance of one data item
corresponds to one instance of another
• one-to-many: one instance to many
instances
• many-to-many: many instance
correspond to many instances
• Also some relationships may be:
– compulsory
– optional
Example
• Student registering system
• What are the entities?
• What type of relationship do they have?
• Draw the diagram
Entity Relationship Diagram
Next step - creating the data
structure
• Few rules - a lot of experience
• Can get quite complex (paramount for the
speed of the DB)
• Tables must be normalised - ie redundancy
is limited to the strict minimum by an
algorithm
Data Structure Diagrams
• Describe the underlying structure of the DB: the
complete logical structure
• Data items are stored in tables linked by pointers
– attribute pointers: data fields in one table that will link it to
another (common information)
– logical pointers: specific links that exist between tables
• Tables have a key
• If an attribute seems to belong to a relationship rather
than an attribute, it may mean an associative entity must
be added
ORDER
Customer
order number
Item description
Item Price
Quantity ordered
Customer number
Item number
Customer number
Customer name
Customer address
Customer balance
Customer special rate
1
2
3
4
Item
Item number
Item description
Item cost
Quantity on hand
* compulsory attributes
0 optional attributes
Definitions
• Entity
• Attributes
• Instance(s)
• Domain
• Key (candidate primary and foreign)
Definitions
• Relationship
• Ordinality
• Cardinality
• Associative Entity
Some test questions
• Is it a bird is it a plane?
• Is it an entity or an attribute?
Normalisation
• Process of simplifying the relationships
amongst data items as much as possible (see
example provided - handout)
• Through an iterative process, structure of
data is refined to 1NF, 2NF, 3NF etc.
• Reasons for normalisation:
– to simplify retrieval (speed of response)
– to simplify maintenance (updates, deletion,
insertions)
– to reduce the need to restructure the data for each
new application
First Normal Form
• design record structure so that each record
looks the same (same length, no repeating
groups)
• repetition within a record means one relation
was missed = create new relation
• elements of repeating groups are stored as a
separate entity, in a separate table
• normalised records have a fixed length and
expanded primary key
Second Normal Form
• Record must be in first normal form first
• each item in the record must be fully
dependent on the key for identification
• Functional dependency means a data
item’s value is uniquely associated with
another’s
• only on-to-one relationship between
elements in the same file
• otherwise split into more tables
Third normal form
• to remove transitive dependencies
• when one item is dependent on an item
which is dependent from the key in the file
• relationship is split to avoid data being lost
inadvertently
• this will give greater flexibility for the design
of the application + eliminate deletion
problems
• in practice, 3 NF not used all the time speed of retrieval can be affected
Beyond data modeling
• Model must be normalised – purpose ?
• Outcome is a set of tables = logical design
• Then, design can be warped until it meets the
realistic constraints of the system
• Eg: what business problem are we trying to
solve? – see handout
[riccardi p. 113, 127]
Realistic constraints
• Users cannot cope with too many tables
• Too much development required in hiding
complex data structure
• Too much administration
• Optimisation is impossible with too many tables
• Actually: RDBs can be quite slow!
Key practical questions
• What are the most important tasks that the DB
MUST accomplish efficiently?
• How must the DB be rigged physically to
address these?
• What coding practices will keep the coding clean
and simple?
• What additional demands arise from the need
for resilience and security?
Analysis - Three Levels of Schema
External Schema 1 External Schema 2 External Schema …
Tables
Disk
Array
Logical Schema
Internal Schema
4 way trade-off
Security
Ease of use
Performance
Clarity of code
Key decisions
• Oracle offers many different ways to do things
– Indexes
– Backups…
• Good analysis is not only about knowing these =>
understanding whether they are appropriate
• Failure to think it through => unworkable model
• Particularly, predicting performance must be done
properly
– Ok on the technical side, tricky on the business side
Design optimisation
• Sources of problems:
– Network traffic
– Excess CPU usage
• But physical I/O is greatest threat (different from physical
I/O)
• Disks still the slowest in the loop
• Solution: minimise or re-schedule access
• Also try to minimise the impact of Q4 (e.g. mirroring,
internal consistency checks…)
Creating links between the tables
• use common fields to join tables / queries
• very easy when data is properly
normalised
• Gives total flexibility in terms of data
retrieval
• Main strength of RDBs (SQL)
Structured Query Language
• used for defining and manipulating data in
Relational DBs
• aimed at:
–
–
–
–
–
–
reducing training costs
increasing productivity
improve application portability
increase application longevity
reduce dependency on single vendors
enable cross systems communication
• In practice, SQLs can be a bit different
Querying RDBs with SQL
• use a form of pseudo english to retrieve
data in a view (which looks like a table)
• syntax is based on a number of “clauses”
• Select: specifies what data elements will
be included in the view
• From: lists the tables involved
• Where: specifies conditions to filter the
data
– specific values sought
– links between tables
Example with one table
• find the name and address of customer
number 1217
Example with a range
• find the items which are priced between £50 and
£15000
Example with two tables
• find the rep name of all customers
Example with two tables
• same for customer Robson only
Use of a Search Condition - nested
queries
• find the name and address of the customer who
ordered order # 110
Additional syntax
• Add computation in the “select” statement:
– select SUM(price)
– select AVG(price), MAX, MIN, COUNT
• Simplify comparisons with a BETWEEN
clause and LIKE clause (with *, ?)
• Add sorting instruction after the where clause
– ORDER BY name (alphabetical)
– ORDER BY price (ascending)
• find contents (item# and description) of order
110:
• find the average price of the cars for sale
• find the average price of all orders taken so
far by customer “Jones”
• find how much cash customer “Barry” has
generated in total
find the average price of all orders taken so far
Oracle Demo Set Sales Order
Processing
CUSTOMER TABLE
SALES_ORDER TABLE
PRODUCT TABLE
ITEM TABLE
PRICE TABLE
DEPARTMENT TABLE
EMPLOYEE TABLE
LOCATION TABLE
JOB TABLE
Oracle Demo Set Employee Data