DATA WAREHOUSING AND DATA MINING
Download
Report
Transcript DATA WAREHOUSING AND DATA MINING
Introduction to
Relational Databases
S. Sudarshan
Comp. Sci. and Engg. Dept.
I.I.T., Bombay
[email protected]
http://www.cse.iitb.ernet.in/~sudarsha
1
OVERVIEW
Files and Index Structures
Relational Databases and SQL
Relational Database Features
System Architectures
CPU
Memory
Computer System
Disk
2
Storing Data on Disks
Byte: unit of information
one character
File:
sequence of bytes
File system:
stores multiple files
organizes files into directories/folders
/users/sudarsha/Talks/rbi-dbms.ppt
/users/sudarsha/Talks/rbi-wh.ppt
3
File Structure
Typical database information
large number of small pieces of data -records
records stored within large files
Sequential files
sorted on a key (e.g., account number)
Index needed for efficient access
e.g. find information of account 2345
similar to library card catalogs
4
Relational Model: Tabular
Data
Account
Number
23456
23457
23458
Account
Name
Address
H. Mehta
Bill Gates
S.Sudarshan
Date
1/4/99
1/4/99
Transactions 4/4/99
4/5/99
4/5/99
Balance
Madhuli, Mumbai
40,000,000
Microsoft Corp
9999,999,999
C-148, IIT Powai
1,000
Acct Num
23458
23456
23456
23456
23458
Debit
Credit
Balance
2000
3000
2000
3000
9000 12000
200000 212000
4000
7000
5
Tabular Data - more examples
Emp-num
Name
Address
Vajpayee
Bill Gates
S.Sudarshan
1
2
3
Employee
Date
1/4/99
1/4/99
4/4/99
4/5/99
4/5/99
Salary
PMO, Delhi
Microsoft Corp
C-148, IIT Powai
10,000
9999,999
10,000
Overdue-Loan
Loan-Num State
123
456
998
765
666
TN
MH
MH
GJ
AP
Balance
Category Repaid
3000
3000
12000
212000
7000
Farm
Smallbiz
Coop
Bigbiz
Govtbiz
0
100
50
25
10 6
Querying/Updating the
Database using SQL
SQL Query:
select name, balance
from account
where acct-num = 23456
Result is:
H. Mehta
40,000,000
insert into transactions
values (3/5/99, 23456, 5000, -, 14000)
7
More Queries
select acct-num, avg(balance)
from transactions
where date between 1/4/99 and 30/4/99
groupby acct-num
23456
23457
23458
41,000,000
99,000,000,000
1,000
8
More Queries: Joins
select name, balance
from account, transactions
where acct.acct-no = transactions.acct-no
H. Mehta
41,000,000
Bill Gates
99,000,000,000
Sudarshan
1,000
9
Terminology
Relation = Table
e.g. Account
Tuple = Record = Row of table
e.g. record for Sudarshan’s account
Attribute = Column of table
e.g. balance and account-number attributes
of Account
Primary key: uniquely identifies row
e.g. account number
10
Schema Design
Schema = set of tables and their columns
Bad schema design
may make it impossible to store some
records
result in duplication of data
e.g. address of customer stored with each deposit
account
Good schema design
via Entity-relationship model, and
via normalization
11
Relational Database
Features
12
Relational Databases
Provide:
Tabular Data model: simple, yet powerful
A Standard easy-to-use query Language: SQL
Mature Products with Reliable, Fault-Tolerant
Operations available
Good Performance
High number of transactions per second
Parallel operation for scalability (handle growth)
Distributed and Replicated Data Bases
Interoperation, High availability
13
In Contrast: Traditional
File Processing
COBOL:
Common Business Oriented language
Files contain sequence of records,
e.g. Record per account
Complex program for each task
e.g. withdrawal, deposit, average balance, …
File structures often very complex
motivated by efficiency, but become hard to understand
14
SQL
SQL is Intergalactic Dataspeak
Data Definition Language (DDL)
Define table column types and names
Integrity Constraint
Data Manipulation Language
query the data
update the data
Security & Access Control
Views, Permissions
Data Base Technologies
15
SQL Standards
Open Data Base Connectivity (ODBC):
Standard for client server interconnectivity,
using C language
JDBC: Like ODBC, for Java language
SQL Standards: 86, 89, 92, SQL-3 Draft
16
Transactions
Transactions:
e.g. Debit/credit
Problems:
Failures (e.g., power, disk storage)
Concurrent transactions
Solution:
Support for ACID properties
17
ACID Properties
Atomicity: Transaction appears to either run
completely or not at all -- no partial state
Consistency: Integrity checks (e.g., balance >=
0)
Isolation: Locks on data so that transactions do
not step on each others toes
Durability: Data/updates are never lost
18
Integrity Checks
Data value checks
balance >= 0
Foreign key constraints
If account number appears in Transactions
relation, it must also appear in Accounts
relation
Transactions.acct-no is a foreign key referencing
Accounts.acct-no
19
Authorization Mechanisms
Privileges
e.g., read table, update table, insert row in table, delete row
from table, privilege to grant privileges
Each user given specific set of privileges he/she
needs
Roles
privileges given to roles
(e.g., teller, manager)
users authorized to play roles
20
User Interfaces to
Databases
User Interfaces
Forms & Menus
Reports
Graphical user interfaces
Lots of Tools Available
Native to Product (eg Oracle, Microsoft)
also Independent Vendors (Powerbuilder)
no standards
21
Database Systems
22
Storage Devices
Main memory
volatile, lost on power failure
expensive and relatively small
Hard disk
non-volatile, reasonably fast access
relatively cheap, and large
main storage system for databases
Mean time to Failure: ~5 years
23
RAID Systems
Goal: improve storage reliability
Data stored on multiple disks
if one disk fails, data still available on others
Essential for safety of data
Hardware RAID
expensive, very high availability
for 24x7 applications (24 hrs X 7 days/wk)
Software RAID
cheaper, use if some downtime is allowable
24
DATABASE SERVERS
Major players
Oracle, IBM DB2, Microsoft SQL Server,
Informix, Sybase, Ingress
Wide range of performance, features, and
price
25
Database Application
Classes
OLTP: Online Transaction Processing
supports many small transactions
Decision Support
Summaries/aggregates
OLAP: Online Analytical Processing
26
Database System
Architectures
27
Database Architectures
Centralized
Dumb terminals connected to single server
Client Server
Smarter client machines connect to server
Main work still done at server
Parallel Servers
Work divided between multiple CPUs
Distributed
Multiple independent databases in cooperation
28
CLIENT - SERVER
TERMINOLOGY
Service: Provided by the Server
Each Client Is a Consumer
Shared Resources : Managed by Server
Client : Initiator of a Request
29
Client vs Server Functionality
Server Functions
Wait for Requests and process them when
they come
Handle Concurrent Transactions
Authentication, Authorization
Audit trails
Client Functions
Provide User Interface
Support Graphics, Multimedia
30
TWO TIER PARTITIONING
Distributed
Presentation
Remote
Presentation
Distributed
Application Logic
Remote Data
Management
Data
Management
Data
Management
Data
Management
Data
Management
Application
Logic
Application
Logic
Application
Logic
Distributed Data
Management
Data
Management
Data
Management
Presentation
Logic
Presentation
Logic
Presentation
Logic
Application
Logic
Application
Logic
Application
Logic
Presentation
Logic
Presentation
Logic
Presentation
Logic
31
Three Tier Applications
Tier One : Client
e.g., web browser
Tier Two : Application Server
e.g. enhanced web server
Tier Three: Database Server
32
Further Reading
Silberschatz, Korth and Sudarshan,
Database System Concepts, McGraw-Hill,
1997
33
Relational Databases
Motivation:
simplify storage structures
easy to use language for queries/updates
efficiency is job of system
automatic optimization
Legacy systems
Systems built using COBOL and older data models
Still in wide use, but declining usage
34