Introduction to Databases


Introduction to Databases
• Data Organisation
• Definition
• Data Modelling
• SQL
• DBMS Functions
Basics of Data Organisation:
DATA HIERARCHY (four categories)
• Field = a single data item
• Record = a related set of fields describing one instance of an entity
• Table = a set of related records
• Database = a collection of related tables
Example of Data Structure (columns are fields; each row is a record)

Surname   First name   Telephone
Smith     Martin       11 25 65 65
Healy     Andrew       11 58 96 63
White     Peter        11 25 28 89
Marsh     Lorene       11 78 85 85
Table + other tables => complete data structure = DB
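The same hierarchy expressed as a minimal SQL sketch (the Person table name and column sizes are assumptions for illustration): each column is a field, each row inserted is a record, and the table sits alongside the other tables of the database.

CREATE TABLE Person (
    Surname    VARCHAR(32),
    First_name VARCHAR(32),
    Telephone  VARCHAR(16)   -- stored as text to keep the spacing shown above
);
INSERT INTO Person VALUES ('Smith', 'Martin', '11 25 65 65');  -- one record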
Database: Definition
"A collection of interrelated data stored together with controlled redundancy, to serve one or more applications in an optimal fashion; the data is stored so that it is independent of the application programs which use it; a common and controlled approach is used in adding new data and in modifying existing data within the database."
Definition - Examined
• A collection of interrelated data stored together
• with controlled redundancy
• to serve one or more applications in an optimal fashion
• the data is stored so that it is independent of the application programs which use it
• a common and controlled approach is used in adding new data and in modifying existing data within the database.
Advantages of Databases:
• data are independent from applications and stored centrally
• data repository accessible to any new program
• data are not duplicated in different locations
• programmers do not have to write extensive descriptions to build up DBs
• physical and logical protection is centralised
Disadvantages of DBs:
• Centralisation can be a weakness
• Large DBs require expensive hardware and software
• Specialised / scarce personnel are required to develop and maintain large DBs
• Standardisation of data on a central repository has implications for the format in which it is stored
Characteristics of DBs…
• High concurrency (high performance under load)
• Multi-user (read does not interfere with write)
• Data consistency – changes to data don’t affect running queries + no phantom data changes
• High degree of recoverability (“pull the plug” test)
ACID test
• Atomicity – all or nothing (see the sketch below)
• Consistency – preserve consistency of database
• Isolation – transactions are independent
• Durability – once committed, data is preserved
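A minimal SQL sketch of atomicity and durability, assuming a hypothetical Account table with account_id and balance columns: the two updates form a single transaction, so either both persist or neither does.

START TRANSACTION;
UPDATE Account SET balance = balance - 100 WHERE account_id = 1;  -- debit
UPDATE Account SET balance = balance + 100 WHERE account_id = 2;  -- credit
COMMIT;  -- once committed, both changes are durable; ROLLBACK here would undo both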
DataBase Management System (DBMS):
• A program that makes it possible to:
– create
– use (insert / update / delete data)
– maintain
a database
• It provides an interface mechanism between the logical organisation of the data stored in the DB and the physical organisation of the data
SQL Examples
SELECT store_name FROM Store_Information
Source: http://www.1keydata.com/sql/sqlselect.html
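A couple of further variations on the same query; the Sales column is an assumption about Store_Information, following the example schema used on the source site.

SELECT store_name FROM Store_Information WHERE Sales > 1000;               -- filter rows
SELECT store_name, SUM(Sales) FROM Store_Information GROUP BY store_name;  -- aggregate per store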
Interacting with the DB:
Structured Query Language (SQL)
• Data Manipulation Language (DML) – for programmers who want to query and modify the data itself; the data model in which the data is stored is changed with the Data Definition Language (DDL)
• Host Language – the language used by programmers to embed SQL statements into their application, e.g. Java
import java.sql.*;

public class CreateCoffees {
    public static void main(String args[]) {
        String url = "jdbc:mysql://localhost/cerami";
        Connection con;
        String createString;
        // SQL DDL statement to be sent to the database
        createString = "create table COFFEES " +
                       "(COF_NAME VARCHAR(32), " +
                       "SUP_ID INTEGER, " +
                       "PRICE FLOAT, " +
                       "SALES INTEGER, " +
                       "TOTAL INTEGER)";
        Statement stmt;
        try {
            // Load the MySQL JDBC driver
            Class.forName("com.mysql.jdbc.Driver");
        }
        catch (java.lang.ClassNotFoundException e) {
            System.err.print("ClassNotFoundException: ");
            System.err.println(e.getMessage());
        }
        try {
            // Open a connection, execute the statement, then release resources
            con = DriverManager.getConnection(url);
            stmt = con.createStatement();
            stmt.executeUpdate(createString);
            stmt.close();
            con.close();
        }
        catch (SQLException ex) {
            System.err.println("SQLException: " + ex.getMessage());
        }
    }
}
Relational DBs:
• Data items stored in tables
• Specific fields in a table are related to fields in associated tables (joins – see the example below)
• Infinite number of possible viewpoints on the data (queries)
• Highly flexible DB, but can be overly slow for complex searches
• Oracle, Sybase, Ingres, Access, Paradox for Windows...
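A sketch of a join, assuming hypothetical Customer and Customer_Order tables that share a Customer_number field:

SELECT c.Customer_name, o.Quantity_ordered
FROM Customer c
JOIN Customer_Order o ON o.Customer_number = c.Customer_number;  -- link via the common field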
Describing Relationships
• Attempt at modelling the business elements (entities) and their relationships (links)
• Can be based on users’ descriptions of the business processes
• Specifies dependencies between the data items
• Coded in an Entity-Relationship Diagram (ERD)
Types of Relationships
• one-to-one: one instance of one data item corresponds to one instance of another
• one-to-many: one instance corresponds to many instances
• many-to-many: many instances correspond to many instances (see the SQL sketch below)
• Also, some relationships may be:
– compulsory
– optional
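A sketch of how these map onto SQL, with hypothetical tables: a plain foreign key expresses one-to-many, while a junction table expresses many-to-many.

CREATE TABLE Tutor_Group (Group_id INTEGER PRIMARY KEY);
CREATE TABLE Course (Course_id INTEGER PRIMARY KEY);

-- one-to-many: each student belongs to exactly one tutor group
CREATE TABLE Student (
    Student_id INTEGER PRIMARY KEY,
    Group_id   INTEGER REFERENCES Tutor_Group(Group_id)
);

-- many-to-many: students take many courses, courses have many students
CREATE TABLE Enrolment (
    Student_id INTEGER REFERENCES Student(Student_id),
    Course_id  INTEGER REFERENCES Course(Course_id),
    PRIMARY KEY (Student_id, Course_id)  -- junction table
);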
Examples (diagrams omitted; source: www.smartdraw.com)
Entity Relationship Diagram (Generic) (diagram omitted; source: www.smartdraw.com)
ERD Example – Banking (diagram omitted; entities: Bank, Customer, Transaction, Deposit, Withdrawal, Loan)
Next Step – Creating the Data Structure
• Few rules - a lot of experience
• Can get quite complex (paramount for the speed of the DB)
• Tables must be normalised – i.e. redundancy is limited to the strict minimum
• In practice, normalisation is not always the best
Data Structure Diagrams
• Describe the underlying structure of the DB: the complete logical structure. A DSD provides a graphical notation which documents entities, relationships, and their constraints
• Data items are stored in tables linked by pointers
– attribute pointers: data fields in one table that will link it to another (common information)
– logical pointers: specific links that exist between tables
• Tables have a key
• Is it an attribute or an entity?
ORDER:
  Customer order number
  Item description
  Item price
  Quantity ordered
  Customer number
  Item number

Customer:
  Customer number
  Customer name
  Customer address
  Customer balance
  Customer special rate

Item:
  Item number
  Item description
  Item cost
  Quantity on hand
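One possible SQL rendering of this structure (data types and key choices are assumptions): Customer number and Item number act as the attribute pointers linking ORDER to the other two tables.

CREATE TABLE Customer (
    Customer_number       INTEGER PRIMARY KEY,
    Customer_name         VARCHAR(64),
    Customer_address      VARCHAR(128),
    Customer_balance      DECIMAL(10,2),
    Customer_special_rate DECIMAL(4,2)
);
CREATE TABLE Item (
    Item_number      INTEGER PRIMARY KEY,
    Item_description VARCHAR(64),
    Item_cost        DECIMAL(10,2),
    Quantity_on_hand INTEGER
);
CREATE TABLE Customer_Order (  -- ORDER is a reserved word in SQL, hence the rename
    Customer_order_number INTEGER PRIMARY KEY,
    Item_description      VARCHAR(64),
    Item_price            DECIMAL(10,2),
    Quantity_ordered      INTEGER,
    Customer_number       INTEGER REFERENCES Customer(Customer_number),
    Item_number           INTEGER REFERENCES Item(Item_number)
);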
Normalisation
• Process of simplifying the relationships amongst data items as much as possible (see handout)
• Through an iterative process, structure of data is refined to 1NF, 2NF, 3NF etc.
• Reasons for normalisation:
– to simplify retrieval (speed of response)
– to simplify maintenance (updates, deletions, insertions)
– to reduce the need to restructure the data for each new application
First Normal Form
• Design record structure so that each record looks the same (no repeating groups)
• Repetition within a record means one relation was missed = create new relation
• Elements of repeating groups are stored as a separate entity, in a separate table (see the sketch below)
• Normalised records have a fixed length and an expanded primary key
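A sketch with an assumed order record that carries a repeating group of item lines: 1NF moves the repeating group into its own table.

-- Unnormalised: repeating group inside one record
-- Customer_Order(Order_number, Customer_number, Item1, Qty1, Item2, Qty2, ...)

-- 1NF: the repeating group becomes a separate table, one row per item line
CREATE TABLE Order_Line (
    Order_number INTEGER,
    Item_number  INTEGER,
    Quantity     INTEGER,
    PRIMARY KEY (Order_number, Item_number)  -- expanded primary key
);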
Second Normal Form
• Record must be in first normal form
• Each item in the record must be fully dependent on the key for identification
• Functional dependency means a data item’s value is uniquely associated with another’s
• Only one-to-one relationships between elements in the same file (see the sketch below)
• Otherwise split into more tables
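Continuing the hypothetical Order_Line sketch: if the table also carried Item_description, that column would depend on Item_number alone (only part of the key), so 2NF moves it out.

-- Not in 2NF: Item_description depends on part of the key only
-- Order_Line(Order_number, Item_number, Quantity, Item_description)

-- In 2NF: the partially dependent item gets its own table
CREATE TABLE Item (
    Item_number      INTEGER PRIMARY KEY,
    Item_description VARCHAR(64)
);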
Third Normal Form
• To remove transitive dependencies
• When one item is dependent on an item which is dependent on the key in the file (see the sketch below)
• Relationship is split to avoid data being lost inadvertently
• This will give greater flexibility for the design of the application + eliminate deletion problems
• In practice, 3NF is not used all the time – speed of retrieval can be affected
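An illustration of a transitive dependency, with hypothetical supplier columns: Supplier_city depends on Supplier_number, which in turn depends on the key Item_number, so 3NF splits the relation.

-- Not in 3NF: Supplier_city depends on the key only via Supplier_number
-- Item_Supply(Item_number, Supplier_number, Supplier_city)

-- In 3NF: the transitively dependent column moves with its determinant
CREATE TABLE Supplier (
    Supplier_number INTEGER PRIMARY KEY,
    Supplier_city   VARCHAR(64)
);
CREATE TABLE Item_Supply (
    Item_number     INTEGER PRIMARY KEY,
    Supplier_number INTEGER REFERENCES Supplier(Supplier_number)
);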
Beyond Data Modeling
• Model must be normalised
– Optimised model
– “no surprise” model
– resilience
• Outcome is a set of tables = logical design
• Then, design can be warped until it meets the realistic constraints of the system
• E.g.: what business problem are we trying to solve?
Realistic Constraints
• Users cannot cope with too many tables
• Too much development required in hiding complex data structure
• Too much administration
• Optimisation is impossible with too many tables
• Actually: RDBs can be quite slow!
Key Practical Questions
• What are the most important tasks that the DB MUST accomplish efficiently?
• How must the DB be configured physically to address the business needs?
• What coding practices will keep the code clean and simple?
• What additional demands arise from the need for resilience and security?
Analysis - Three Levels of Schema
(Diagram omitted.)
• External schemas: External Schema 1, External Schema 2, …
• Logical schema: tables
• Internal schema: disk array
4 Way Trade-off
• Security
• Performance
• Ease of use
• Clarity of code
Key Decisions
• Oracle offers many different ways to do things
– Indexes (see the sketch below)
– Backups…
• Good analysis is not only about knowing these => understanding whether they are appropriate
• Failure to think it through => unworkable model
• Particularly, predicting performance must be done properly
– Ok on the technical side, tricky on the business side
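For instance, a minimal index sketch (table and column names carried over from the earlier hypothetical examples): the index speeds lookups on Customer_name but costs storage and slows writes, so whether it is appropriate depends on the workload.

CREATE INDEX idx_customer_name ON Customer (Customer_name);
-- queries filtering on Customer_name can now avoid a full table scan
SELECT * FROM Customer WHERE Customer_name = 'Smith';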
Design Optimisation
• Sources of problems:
– Network traffic
– Excess CPU usage
• But physical I/O is the greatest threat (different from logical I/O)
• Disks are still the slowest link in the loop
• Solution: minimise or re-schedule access
• Try to minimise the impact (e.g. mirroring, internal consistency checks…)
Using Scenarios for Analysis
• Define standard situations for DB use
• Analyse their specific requirements
• Understand the implications for DB design
• Compare and contrast new problems with old ones
Categories of Critical Operations
• Manual transaction processing: complex data entry (DE) by a small number of operators
• Automatic transaction processing: large number of concurrent users performing simple DE
• High batch throughput: automatic batch input into the DB of a very large number of complex transactions
• Data warehousing: large volumes of new data thrown on top at fixed intervals + intensive querying
Manual Transaction Processing
• Insurance telemarketing broker
– Data entry
– Retrieving reference info
– Calculations
• On-line human-computer interaction!!
– Instant validation (field by field)
– Drop-down lists (DE accelerators)
• Quick response time
• Critical issue = user-friendly front end, but minimise traffic between interface and back end!
Automatic Transaction Processing
• Large number of users performing simple tasks
• Real-time credit card system (e.g. authorisation) or check-out (EPOS)
• Human interaction at its most simple – e.g. typing a code or swiping a card
• Minimum validation, no complex feedback…
• Large numbers mean potential problems are:
– Connection opening / closing rate
– Contention between concurrent users
– SQL engine + data consistency costs
• Design with multiple servers
Automatic Transaction Processing (cont.)
• Another example: on-line shopping
• What specific problems would arise from shopping-cart type applications?
• How do you handle lost customers?
High Batch Throughput
• E.g. mobile phone network operator
• Real time + huge volume of simultaneous complex transactions
– Number checks
– Account info
– Price info
– Pattern checks
• Large processing capacity required + need to tackle all transactions together in batches
– DB query may not be the only solution (or the quickest)
– Move customer account to cache
– Copy updated figures for accounts to a log and update accounts in slack periods (2.5GB an hour!)
– Indexing or partitioning for quicker access
“Data Warehouse”
• Huge store of data
• Large volume added every day
– 99% new data, 1% corrections to existing data
• Substantial analysis required prior to development:
– What to include
– How to aggregate and organise it
– Where data comes from
• Real Oracle territory because the schedule is lax – i.e. not a real-time application
• Key issues:
– Getting partitioning right
– Deciding how many summary levels
Partitioning
• Oldest trick in the book to speed up retrieval
– Smaller bunch of data
– Well labelled so it can be easily found
– Smaller index
• Data manipulation – maintenance, copying and protection far easier
• Break down big table into small ones (see the sketch below)
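A sketch in MySQL-style syntax (the call-record table and its columns are assumptions): range partitioning breaks one big table into smaller per-year pieces, each with its own smaller index, that can be scanned, copied, and maintained separately.

CREATE TABLE Call_Record (
    Call_id       BIGINT,
    Call_date     DATE,
    Duration_secs INTEGER
)
PARTITION BY RANGE (YEAR(Call_date)) (
    PARTITION p2003 VALUES LESS THAN (2004),  -- "well labelled" smaller bunches of data
    PARTITION p2004 VALUES LESS THAN (2005),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);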
Internet Databases
• Characteristics
– Many concurrent sessions
– Reduced interaction between front end and back end
– Internet = extra response time (2 secs!)
• In practice, many sites are quite slow
• Key issues
– “thin client”
– Reduced dialogue
– Management of sessions (e.g. cookies) to avoid multiple restarts
Conclusion: Key issues
• At one end: very large numbers of small transactions
• Threat of network or process contention
• At the other end: small number of processes with complex data crunching and time constraints
• Design of DB and application must reflect these constraints