Itec 3220 - Computer Science


ITEC 3220A
Using and Designing Database Systems
Instructor: Gordon Turpin
Course Website: www.cse.yorku.ca/~gordon/itec3220S07
Office: CSEB3020
Chapter 13
The Data Warehouse
Transaction Processing Versus
Decision Support
• Transaction processing allows organizations
to conduct daily business in an efficient
manner
– Operational database
• Decision support helps management provide
medium-term and long-term direction for an
organization
Decision Support System (DSS)
Components
Operational vs. Decision Support
Data
• Operational data
– Relational, normalized database
– Optimized to support transactions
– Real-time updates
• DSS
– Snapshot of operational data
– Summarized
– Large amounts of data
• Data analyst viewpoint
– Timespan
– Granularity
– Dimensionality
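As a rough illustration of the contrast, the sketch below builds a summarized, coarser-granularity snapshot of operational data of the kind a DSS would query. Table and column names are invented, and the date function is PostgreSQL-style.

  -- Minimal sketch: derive a DSS snapshot by summarizing operational rows
  -- to a coarser granularity (month and region instead of individual sales).
  CREATE TABLE dss_sales_summary AS
  SELECT region,
         product_id,
         DATE_TRUNC('month', sale_date) AS sale_month,      -- coarser timespan
         SUM(sale_amount)               AS total_sales,     -- summarized measure
         COUNT(*)                       AS num_transactions
  FROM   operational_sales                                  -- normalized, real-time table
  GROUP  BY region, product_id, DATE_TRUNC('month', sale_date);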
The DSS Database Requirements
• Database schema
– Support complex (non-normalized) data
– Extract multidimensional time slices
• Data extraction and filtering
• End-user analytical interface
• Database size
– Very large databases (VLDBs)
– Contains redundant and duplicated data
Data Warehouse
• Integrated
– Centralized
– Holds data retrieved from entire organization
• Subject-Oriented
– Optimized to give answers to diverse questions
– Used by all functional areas
• Time Variant
– Flow of data through time
– Projected data
• Non-Volatile
– Data never removed
– Always growing
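A minimal sketch of the time-variant, non-volatile idea, using invented table names: each warehouse load appends a dated snapshot rather than updating rows in place, so history only grows.

  -- Illustrative only: append a time-stamped snapshot on each load;
  -- existing warehouse rows are never updated or removed.
  INSERT INTO dw_account_balance (account_id, balance, snapshot_date)
  SELECT account_id, balance, CURRENT_DATE
  FROM   operational_account;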
Data Marts
• Single-subject data warehouse subset
• Provides decision support to a small group
• Can serve as a test vehicle for exploring the potential benefits of data warehouses
• Addresses local or departmental problems
Data Warehouse Versus Data Mart
Star Schema
• Data-modeling technique
• Maps multidimensional decision support into
relational database
• Yields a model for multidimensional data analysis while preserving the relational structure of the operational DB
• Four Components:
– Facts
– Dimensions
– Attributes
– Attribute hierarchies
Simple Star Schema
Slice and Dice View of Sales
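The figure itself is not reproduced in this transcript, but a slice-and-dice query over a star schema can be sketched as below; the table and column names are assumed rather than taken from the slide. Fixing one dimension value slices the data, and grouping by the remaining dimensions dices it.

  -- Slice: restrict the time dimension to one quarter.
  -- Dice: break the sliced data down by location and product.
  SELECT l.region,
         p.product_name,
         SUM(f.sales_amount) AS total_sales
  FROM   fact_sales f
         JOIN dim_location l ON f.location_id = l.location_id
         JOIN dim_product  p ON f.product_id  = p.product_id
         JOIN dim_time     t ON f.time_id     = t.time_id
  WHERE  t.year = 2007 AND t.quarter = 2
  GROUP  BY l.region, p.product_name;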
Star Schema Representation
• Facts and dimensions represented by
physical tables in data warehouse DB
• Fact table related to each dimension table
(M:1)
• Fact and dimension tables related by
foreign keys
• Subject to the primary/foreign key
constraints
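A minimal DDL sketch of this layout with invented names: each dimension table has its own primary key, the fact table references every dimension through a foreign key (the M:1 relationships), and the fact table's primary key is typically composed of those foreign keys.

  -- Hypothetical star schema: one fact table and two of its dimension tables.
  CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    day      INTEGER,
    month    INTEGER,
    quarter  INTEGER,
    year     INTEGER              -- attribute hierarchy: day -> month -> quarter -> year
  );

  CREATE TABLE dim_location (
    location_id INTEGER PRIMARY KEY,
    city        VARCHAR(40),
    region      VARCHAR(40)       -- attribute hierarchy: city -> region
  );

  CREATE TABLE fact_sales (
    time_id      INTEGER REFERENCES dim_time (time_id),
    location_id  INTEGER REFERENCES dim_location (location_id),
    sales_amount DECIMAL(12,2),   -- fact (measure)
    units_sold   INTEGER,         -- fact (measure)
    PRIMARY KEY (time_id, location_id)
  );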
Star Schema for Sales
Example
A Canadian financial organization is interested in building a data warehouse to analyze customers’ credit payments by time, by the location where the payments were made, by customer, and by type of credit card. A customer may use a credit card to make payments in different locations across the country and abroad. If a payment is made abroad, it can be based on the local (domestic) currency and then converted into Canadian dollars using the currency rate.
• Time is described by Time_ID, day, month, quarter, and year.
• Location is described by Location_ID, the name of the organization billing the customer, the city and country where the organization is located, and the domestic currency.
• A credit card is described by credit card number, type of credit account, and the customer’s credit rate. The customer’s rate depends on the type of the credit account.
• A customer is described by ID, name, address, and phone.
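One possible star schema for this exercise, sketched as DDL; the names and data types below are one reasonable reading of the description, not an official solution. The payment fact is linked to the time, location, credit card, and customer dimensions, and carries the payment amounts as measures.

  CREATE TABLE dim_time (
    time_id  INTEGER PRIMARY KEY,
    day      INTEGER,
    month    INTEGER,
    quarter  INTEGER,
    year     INTEGER
  );

  CREATE TABLE dim_location (
    location_id       INTEGER PRIMARY KEY,
    organization_name VARCHAR(60),      -- organization billing the customer
    city              VARCHAR(40),
    country           VARCHAR(40),
    domestic_currency CHAR(3)
  );

  CREATE TABLE dim_credit_card (
    card_number  CHAR(16) PRIMARY KEY,
    account_type VARCHAR(20),
    credit_rate  DECIMAL(5,2)           -- determined by account_type
  );

  CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(60),
    address     VARCHAR(120),
    phone       VARCHAR(20)
  );

  CREATE TABLE fact_payment (
    time_id         INTEGER  REFERENCES dim_time (time_id),
    location_id     INTEGER  REFERENCES dim_location (location_id),
    card_number     CHAR(16) REFERENCES dim_credit_card (card_number),
    customer_id     INTEGER  REFERENCES dim_customer (customer_id),
    amount_domestic DECIMAL(12,2),      -- measure: amount in the location's currency
    amount_cad      DECIMAL(12,2),      -- measure: amount converted to Canadian dollars
    currency_rate   DECIMAL(10,6),
    PRIMARY KEY (time_id, location_id, card_number, customer_id)
  );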
Performance-Improving Techniques
for Star Schema
• Normalization of dimensional tables
• Multiple fact tables representing
different aggregation levels
• Denormalization of the fact tables
• Table partitioning and replication
Normalization Example
• Normalize the star schema that you developed for the Canadian financial organization example above into 3NF.
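One way the normalization could go, assuming a star schema along the lines of the DDL sketch above: the example states that the credit rate depends on the type of the credit account, which is a transitive dependency inside the credit card dimension, so that pair moves to its own table. The city-to-country (and domestic currency) dependency in the location dimension can be split out the same way.

  -- Hypothetical 3NF ("snowflaked") version of the credit card dimension;
  -- this replaces the unnormalized dim_credit_card from the sketch above.
  CREATE TABLE dim_account_type (
    account_type VARCHAR(20) PRIMARY KEY,
    credit_rate  DECIMAL(5,2)                 -- determined only by account_type
  );

  CREATE TABLE dim_credit_card (
    card_number  CHAR(16) PRIMARY KEY,
    account_type VARCHAR(20) REFERENCES dim_account_type (account_type)
  );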
More Example
A supermarket chain is interested in building a data warehouse to analyze the sales of different products in different supermarkets at different times using different payment methods.
– Each supermarket is described by location_ID, city, country, and domestic currency.
– Time is measured by time_ID, day, month, quarter, and year.
– Each product is described by product_ID, product_name, and vendor.
– Payment method is described by payment_ID and payment_type.
Design a star schema for this problem and then normalize it into 3NF.
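A sketch of one possible answer with assumed names and measures (quantity and sales amount are not named in the exercise, so they are guesses). For the 3NF step, the same snowflaking idea as before applies: for instance, if city determines country and domestic currency in the supermarket dimension, those attributes would move to a separate city table.

  CREATE TABLE dim_supermarket (
    location_id       INTEGER PRIMARY KEY,
    city              VARCHAR(40),
    country           VARCHAR(40),
    domestic_currency CHAR(3)
  );

  CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name VARCHAR(60),
    vendor       VARCHAR(60)
  );

  CREATE TABLE dim_payment (
    payment_id   INTEGER PRIMARY KEY,
    payment_type VARCHAR(20)
  );

  -- dim_time (time_id, day, month, quarter, year) as in the earlier sketches.

  CREATE TABLE fact_sale (
    location_id  INTEGER REFERENCES dim_supermarket (location_id),
    time_id      INTEGER,              -- would reference dim_time (time_id)
    product_id   INTEGER REFERENCES dim_product (product_id),
    payment_id   INTEGER REFERENCES dim_payment (payment_id),
    quantity     INTEGER,              -- assumed measure
    sales_amount DECIMAL(12,2),        -- assumed measure
    PRIMARY KEY (location_id, time_id, product_id, payment_id)
  );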
Data Warehouse
Implementation Road Map
Chapter 12
Distributed Database
Management Systems
The Evolution of Distributed Database
Management Systems
• Distributed database management
system (DDBMS)
– Governs storage and processing of logically
related data over interconnected computer
systems in which both data and processing
functions are distributed among several
sites
Distributed Database
Environment
Database Systems: Levels of Data
and Process Distribution
Single-Site Processing, Single-Site Data (SPSD)
• All processing is done on single CPU or host computer
(mainframe, midrange, or PC)
• All data are stored on host computer’s local disk
• Processing cannot be done on end user’s side of the
system
• Typical of most mainframe and midrange computer
DBMSs
• DBMS is located on the host computer, which is
accessed by dumb terminals connected to it
• Also typical of the first generation of single-user
microcomputer databases
Single-Site Processing, Single-Site
Data (Centralized)
Multiple-Site Processing, Single-Site Data (MPSD)
• Multiple processes run on different computers
sharing a single data repository
• MPSD scenario requires a network file server
running conventional applications that are
accessed through a LAN
• Many multi-user accounting applications,
running under a personal computer network, fit
such a description
Multiple-Site Processing, Single-Site Data
Multiple-Site Processing,
Multiple-Site Data (MPMD)
• Fully distributed database management system
with support for multiple data processors and
transaction processors at multiple sites
• Classified as either homogeneous or
heterogeneous
• Homogeneous DDBMSs
– Integrate only one type of centralized DBMS over a
network
Multiple-Site Processing,
Multiple-Site Data (MPMD) (Cont’d)
• Heterogeneous DDBMSs
– Integrate different types of centralized
DBMSs over a network
• Fully heterogeneous DDBMS
– Support different DBMSs that may even
support different data models (relational,
hierarchical, or network) running under
different computer systems, such as
mainframes and microcomputers
Distributed Database Design
• Data fragmentation:
– How to partition the database into fragments
• Data replication:
– Which fragments to replicate
• Data allocation:
– Where to locate those fragments and replicas
Data Fragmentation
• Breaks single object into two or more
segments or fragments
• Each fragment can be stored at any site over a
computer network
• Information about data fragmentation is stored in the distributed data catalog (DDC), from which it is accessed by the transaction processor (TP) to process user requests
Data Fragmentation Strategies
• Horizontal fragmentation:
– Division of a relation into subsets (fragments) of
tuples (rows)
• Vertical fragmentation:
– Division of a relation into attribute (column)
subsets
• Mixed fragmentation:
– Combination of horizontal and vertical strategies
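A minimal sketch of the three strategies on an assumed CUSTOMER table (all names invented): horizontal fragments keep subsets of rows, vertical fragments keep subsets of columns (each retaining the primary key so the original relation can be reconstructed), and a mixed fragment applies both. CREATE TABLE ... AS SELECT is used for brevity and is supported by most, but not all, dialects.

  -- Horizontal fragmentation: rows split by region.
  CREATE TABLE customer_east AS
    SELECT * FROM customer WHERE region = 'EAST';
  CREATE TABLE customer_west AS
    SELECT * FROM customer WHERE region = 'WEST';

  -- Vertical fragmentation: columns split; each fragment keeps the key (cus_id).
  CREATE TABLE customer_contact AS
    SELECT cus_id, cus_name, phone, address FROM customer;
  CREATE TABLE customer_account AS
    SELECT cus_id, credit_limit, balance FROM customer;

  -- Mixed fragmentation: a column subset of a row subset.
  CREATE TABLE customer_east_account AS
    SELECT cus_id, credit_limit, balance FROM customer WHERE region = 'EAST';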
Data Replication
• Storage of data copies at multiple sites served
by a computer network
• Fragment copies can be stored at several sites
to serve specific information requirements
– Can enhance data availability and response time
– Can help to reduce communication and total query
costs
Replication Scenarios
• Fully replicated database:
– Stores multiple copies of each database fragment at
multiple sites
– Can be impractical due to amount of overhead
• Partially replicated database:
– Stores multiple copies of some database fragments at
multiple sites
– Most DDBMSs are able to handle the partially
replicated database well
• Unreplicated database:
– Stores each database fragment at a single site
– No duplicate database fragments
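As a rough illustration of partial replication: one fragment stored at a remote site can be copied locally, for example with an Oracle-style materialized view over a database link. The names (including the toronto_site link) are invented, and production DDBMSs would normally use the DBMS's own replication machinery rather than hand-written DDL like this.

  -- Partial replication sketch (Oracle-style, assumed objects): keep a local,
  -- periodically refreshed copy of one fragment that lives at a remote site.
  CREATE MATERIALIZED VIEW customer_east_replica
    REFRESH COMPLETE ON DEMAND
    AS SELECT * FROM customer_east@toronto_site;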
Data Allocation
• Deciding where to locate data
• Allocation strategies:
– Centralized data allocation
• Entire database is stored at one site
– Partitioned data allocation
• Database is divided into several disjointed parts
(fragments) and stored at several sites
– Replicated data allocation
• Copies of one or more database fragments are stored at
several sites
• Data distribution over a computer network is
achieved through data partition, data
replication, or a combination of both
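Tying allocation back to the distributed data catalog (DDC) mentioned under data fragmentation: conceptually, the catalog records which fragments and replicas live at which sites so the TP can route requests. The sketch below is purely illustrative; real DDBMSs maintain this catalog internally, and every name here is invented.

  -- Illustrative catalog of a partitioned allocation with one replicated fragment.
  CREATE TABLE ddc_allocation (
    fragment_name VARCHAR(30),
    site_name     VARCHAR(30),
    is_replica    CHAR(1),
    PRIMARY KEY (fragment_name, site_name)
  );

  INSERT INTO ddc_allocation VALUES ('CUSTOMER_EAST', 'TORONTO_SITE',   'N');
  INSERT INTO ddc_allocation VALUES ('CUSTOMER_WEST', 'VANCOUVER_SITE', 'N');
  INSERT INTO ddc_allocation VALUES ('CUSTOMER_WEST', 'TORONTO_SITE',   'Y');   -- replica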