Data Warehousing

Download Report

Transcript Data Warehousing

Data Warehousing
Lecture-4
Introduction and Background
1
Introduction and Background
2
How is it Different?
• Starts with a 6x12 availability requirement ... but
7x24 usually becomes the goal.
 Decision makers typically don’t work 24 hrs a day and 7
days a week. An ATM system does.
 Once decision makers start using the DWH, and start
reaping the benefits, they start liking it…
 Start using the DWH more often, till want it available
100% of the time.
3
How is it Different?
• Starts with a 6x12 availability requirement ... but
7x24 usually becomes the goal.
 For business across the globe, 50% of the world may be
sleeping at any one time, but the businesses are up 100%
of the time.
 100% availability not a trivial task, need to take into
account loading strategies, refresh rates etc.
4
How is it Different?
• Does not follows the traditional development
model
Requirements

Program
Classical SDLC

 Requirements gathering
 Analysis
 Design
 Programming
 Testing
 Integration
 Implementation
5
How is it Different?
• Does not follows the traditional development
model
DWH
Program

Requirements
DWH SDLC (CLDS)
 Implement warehouse
 Integrate data
 Test for biasness
 Program w.r.t data
 Design DSS system
 Analyze results
 Understand requirement
6
Data Warehouse Vs. OLTP
OLTP (On Line Transaction Processing)
Select tx_date, balance from tx_table
Where account_ID = 23876;
7
Data Warehouse Vs. OLTP
DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between (30 and 40) and
Education = ‘graduate’ and
CustID.customer_table =
Customer_ID.tx_table;
8
Data Warehouse Vs. OLTP
OLTP
DWH
Primary key used
Primary key NOT used
No concept of Primary Index Primary index used
Few rows returned
Many rows returned
May use a single table
Uses multiple tables
High selectivity of query
Low selectivity of query
Indexing on primary key
(unique)
Indexing on primary index
(non-unique)
9
Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)
Data Warehouse
OLTP
Scope
* Application –Neutral
* Single source of “truth”
* Evolves over time
* How to improve business
* Application specific
* Multiple databases with repetition
* Off the shelf application
* Runs the business
Data
Perspective
* Historical, detailed data
* Some summary
* Lightly denormalized
* Operational data
* No summary
* Fully normalized
Queries
* Hardly uses PK
* Number of results
returned in thousands
* Based on PK
* Number of results returned in
hundreds
Time factor
* Minutes to hours
* Typical availability 6x12
* Sub seconds to seconds
* Typical availability 24x7
10
Comparison of Response Times
• On-line analytical processing (OLAP) queries must be
executed in a small number of seconds.
– Often requires denormalization and/or sampling.
• Complex query scripts and large list selections can
generally be executed in a small number of minutes.
• Sophisticated clustering algorithms (e.g., data mining)
can generally be executed in a small number of hours
(even for hundreds of thousands of customers).
11
Putting the pieces together
Data
(Tier 0)
Data Warehouse Server
(Tier 1)

Semistructured
Sources
www data











IT
Users
Archived
data
Extract
Transform
Load
(ETL)
Clients
(Tier 3)
MOLAP
Query/Reporting

Meta
Data
Data
Warehouse
Operational
Data Bases
Data sources
OLAP Servers
(Tier 2)
Analysis


Business
Users
Data Mining
ROLAP

Data Marts



Tools
Business Users
12