DATA WAREHOUSING AND DATA MINING

Download Report

Transcript DATA WAREHOUSING AND DATA MINING

Introduction to
Relational Databases
S. Sudarshan
Comp. Sci. and Engg. Dept.
I.I.T., Bombay
[email protected]
http://www.cse.iitb.ernet.in/~sudarsha
1
OVERVIEW
Files and Index Structures
Relational Databases and SQL
Relational Database Features
System Architectures
CPU
Memory
Computer System
Disk
2
Storing Data on Disks
Byte: unit of information
one character
File:
sequence of bytes
File system:
stores multiple files
 organizes files into directories/folders
/users/sudarsha/Talks/rbi-dbms.ppt
/users/sudarsha/Talks/rbi-wh.ppt
3
File Structure
Typical database information
 large number of small pieces of data -records
 records stored within large files
Sequential files
sorted on a key (e.g., account number)
Index needed for efficient access
e.g. find information of account 2345
similar to library card catalogs
4
Relational Model: Tabular
Data
Account
Number
23456
23457
23458
Account
Name
Address
H. Mehta
Bill Gates
S.Sudarshan
Date
1/4/99
1/4/99
Transactions 4/4/99
4/5/99
4/5/99
Balance
Madhuli, Mumbai
40,000,000
Microsoft Corp
9999,999,999
C-148, IIT Powai
1,000
Acct Num
23458
23456
23456
23456
23458
Debit
Credit
Balance
2000
3000
2000
3000
9000 12000
200000 212000
4000
7000
5
Tabular Data - more examples
Emp-num
Name
Address
Vajpayee
Bill Gates
S.Sudarshan
1
2
3
Employee
Date
1/4/99
1/4/99
4/4/99
4/5/99
4/5/99
Salary
PMO, Delhi
Microsoft Corp
C-148, IIT Powai
10,000
9999,999
10,000
Overdue-Loan
Loan-Num State
123
456
998
765
666
TN
MH
MH
GJ
AP
Balance
Category Repaid
3000
3000
12000
212000
7000
Farm
Smallbiz
Coop
Bigbiz
Govtbiz
0
100
50
25
10 6
Querying/Updating the
Database using SQL
SQL Query:
select name, balance
from account
where acct-num = 23456
Result is:
H. Mehta
40,000,000
insert into transactions
values (3/5/99, 23456, 5000, -, 14000)
7
More Queries
select acct-num, avg(balance)
from transactions
where date between 1/4/99 and 30/4/99
groupby acct-num
23456
23457
23458
41,000,000
99,000,000,000
1,000
8
More Queries: Joins
select name, balance
from account, transactions
where acct.acct-no = transactions.acct-no
H. Mehta
41,000,000
Bill Gates
99,000,000,000
Sudarshan
1,000
9
Terminology
Relation = Table
e.g. Account
Tuple = Record = Row of table
e.g. record for Sudarshan’s account
Attribute = Column of table
e.g. balance and account-number attributes
of Account
Primary key: uniquely identifies row
e.g. account number
10
Schema Design
Schema = set of tables and their columns
Bad schema design
may make it impossible to store some
records
result in duplication of data
e.g. address of customer stored with each deposit
account
Good schema design
via Entity-relationship model, and
via normalization
11
Relational Database
Features
12
Relational Databases
Provide:
Tabular Data model: simple, yet powerful
A Standard easy-to-use query Language: SQL
Mature Products with Reliable, Fault-Tolerant
Operations available
Good Performance
High number of transactions per second
Parallel operation for scalability (handle growth)
Distributed and Replicated Data Bases
Interoperation, High availability
13
In Contrast: Traditional
File Processing
COBOL:
Common Business Oriented language
Files contain sequence of records,
e.g. Record per account
Complex program for each task
e.g. withdrawal, deposit, average balance, …
File structures often very complex
motivated by efficiency, but become hard to understand
14
SQL
SQL is Intergalactic Dataspeak
Data Definition Language (DDL)
Define table column types and names
Integrity Constraint
Data Manipulation Language
query the data
update the data
Security & Access Control
Views, Permissions
Data Base Technologies
15
SQL Standards
Open Data Base Connectivity (ODBC):
Standard for client server interconnectivity,
using C language
JDBC: Like ODBC, for Java language
SQL Standards: 86, 89, 92, SQL-3 Draft
16
Transactions
Transactions:
 e.g. Debit/credit
Problems:
Failures (e.g., power, disk storage)
Concurrent transactions
Solution:
Support for ACID properties
17
ACID Properties
Atomicity: Transaction appears to either run
completely or not at all -- no partial state
Consistency: Integrity checks (e.g., balance >=
0)
Isolation: Locks on data so that transactions do
not step on each others toes
Durability: Data/updates are never lost
18
Integrity Checks
Data value checks
balance >= 0
Foreign key constraints
If account number appears in Transactions
relation, it must also appear in Accounts
relation
Transactions.acct-no is a foreign key referencing
Accounts.acct-no
19
Authorization Mechanisms
Privileges
e.g., read table, update table, insert row in table, delete row
from table, privilege to grant privileges
Each user given specific set of privileges he/she
needs
Roles
privileges given to roles
(e.g., teller, manager)

users authorized to play roles
20
User Interfaces to
Databases
User Interfaces
Forms & Menus
Reports
Graphical user interfaces
Lots of Tools Available
Native to Product (eg Oracle, Microsoft)
also Independent Vendors (Powerbuilder)
no standards
21
Database Systems
22
Storage Devices
Main memory
volatile, lost on power failure
expensive and relatively small
Hard disk
non-volatile, reasonably fast access
relatively cheap, and large
main storage system for databases
Mean time to Failure: ~5 years
23
RAID Systems
Goal: improve storage reliability
Data stored on multiple disks
 if one disk fails, data still available on others
Essential for safety of data
Hardware RAID
expensive, very high availability
for 24x7 applications (24 hrs X 7 days/wk)
Software RAID
cheaper, use if some downtime is allowable
24
DATABASE SERVERS
Major players
Oracle, IBM DB2, Microsoft SQL Server,
Informix, Sybase, Ingress
Wide range of performance, features, and
price
25
Database Application
Classes
OLTP: Online Transaction Processing
supports many small transactions
Decision Support
Summaries/aggregates
OLAP: Online Analytical Processing
26
Database System
Architectures
27
Database Architectures
Centralized
Dumb terminals connected to single server
Client Server
Smarter client machines connect to server
Main work still done at server
Parallel Servers
Work divided between multiple CPUs
Distributed
Multiple independent databases in cooperation
28
CLIENT - SERVER
TERMINOLOGY
Service: Provided by the Server
Each Client Is a Consumer
Shared Resources : Managed by Server
Client : Initiator of a Request
29
Client vs Server Functionality
Server Functions
Wait for Requests and process them when
they come
Handle Concurrent Transactions
Authentication, Authorization
Audit trails
Client Functions
Provide User Interface
Support Graphics, Multimedia
30
TWO TIER PARTITIONING
Distributed
Presentation
Remote
Presentation
Distributed
Application Logic
Remote Data
Management
Data
Management
Data
Management
Data
Management
Data
Management
Application
Logic
Application
Logic
Application
Logic
Distributed Data
Management
Data
Management
Data
Management
Presentation
Logic
Presentation
Logic
Presentation
Logic
Application
Logic
Application
Logic
Application
Logic
Presentation
Logic
Presentation
Logic
Presentation
Logic
31
Three Tier Applications
Tier One : Client
e.g., web browser
Tier Two : Application Server
e.g. enhanced web server
Tier Three: Database Server
32
Further Reading
Silberschatz, Korth and Sudarshan,
Database System Concepts, McGraw-Hill,
1997
33
Relational Databases
Motivation:
simplify storage structures
easy to use language for queries/updates
efficiency is job of system
automatic optimization
Legacy systems
Systems built using COBOL and older data models
Still in wide use, but declining usage
34