Data warehouse
Download
Report
Transcript Data warehouse
Decision Support and
Date Warehouse
Jingyi Lu
Outline
Decision Support System
OLAP vs. OLTP
What is Date Warehouse?
Dimensional Modeling
Extract, Transform, and Load (ETL)
Decision Support System
Information technology to help the
knowledge worker(executive, manager,
analyst) make faster and better decisions.
–
What were the sales volumes by region and
product category for the last year?
–
Which orders should we fill to maximize
revenues?
–
Will a 10% discount increase sales volume
sufficiently?
Decision Support Systems
Created to facilitate the decision making
process
So much information that it is difficult to
extract it all from a traditional database
Need for a more comprehensive data
storage facility
Data Warehouse
Decision Support Systems
Extract Information from data to use as the
basis for decision making
Used at all levels of the Organization
Tailored to specific business areas
Ad Hoc queries to retrieve and display
information
Combines historical operation data with
business activities
Decision Support Systems
OLAP vs. OLTP
OLTP (On-line Transaction Processing): is characterized by a
large number of short on-line transactions .-----> Operational
database
OLAP (On-line Analytical Processing):is characterized by
relatively low volume of transactions. Queries are often very
complex and involve aggregations.------> Data Warehouse
OLAP vs. OLTP
OLTP
OLAP
users
clerk, IT professional
knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
usage
access
complex query
What is a Data Warehouse
The repository for the DSS is the DATA
WAREHOUSE
Definition: Integrated, Subject-Oriented,
Time-Variant, Nonvolatile database that
provides support for decision making.
Integrated
The data warehouse is a centralized,
consolidated database that integrated data
derived from the entire organization
Multiple Sources
Diverse Sources
Diverse Formats
Subject-Oriented
Data is arranged and optimized to provide
answer to questions from diverse functional
areas
Data is organized and summarized by topic
Sales / Marketing / Finance / Distribution /
Etc.
Time-Variant
The Data Warehouse represents the flow of
data through time
Can contain projected data from statistical
models
Data is periodically uploaded then timedependent data is recomputed
Nonvolatile
Once data is entered it is NEVER removed
Represents the company’s entire history
Near term history is continually added to it
Always growing
Must support terabyte databases and
multiprocessors
Read-Only database for data analysis and
query processing
Dimensional Modeling
Dimension
dimension is a data element that categorizes each item in a
data set into non-overlapping regions
Facts
a value or measurement, which represents a fact about the
managed entity or system.
typically numeric values that can be aggregated
Dimensional Modeling
Database is a set of facts (points) in a
multidimensional space
Fact tables
contains business facts or measures and foreign keys
which refer to primary keys in the dimension tables
Dimension tables
Each dimension table has a set of attributes
e.g., Day, Month, Year of Date
Attributes of a dimension may be related by partial
order
Hierarchy: e.g., Day > Month > Year
Example of Star Schema
Example of Snowflake Schema
ETL
ETL
Extraction Transformation Loading – ETL
To get data out of the source and load it into the data
warehouse – simply a process of copying data from one
database to other
Data is extracted from an OLTP database, transformed to
match the data warehouse schema and loaded into the
data warehouse database
Many data warehouses also incorporate data from nonOLTP systems such as text files, legacy systems, and
spreadsheets; such data also requires extraction,
transformation, and loading
When defining ETL for a data warehouse, it is important
to think of ETL as a process, not a physical
implementation
ETL
ETL is often a complex combination of process and
technology that consumes a significant portion of the data
warehouse development efforts and requires the skills of
business analysts, database designers, and application
developers
It is not a one time event as new data is added to the Data
Warehouse periodically – monthly, daily, hourly
Because ETL is an integral, ongoing, and recurring part of
a data warehouse
Automated
Well documented
Easily changeable
ETL Staging Database
ETL operations should be performed on a
relational database server separate from the
source databases and the data warehouse
database
Creates a logical and physical separation
between the source systems and the data
warehouse
Minimizes the impact of the intense periodic ETL
activity on source and data warehouse
databases