Data Warehousing
Download
Report
Transcript Data Warehousing
Data Warehousing
Lecture-4
Introduction and Background
1
Introduction and Background
2
How is it Different?
• Starts with a 6x12 availability requirement ... but
7x24 usually becomes the goal.
Decision makers typically don’t work 24 hrs a day and 7
days a week. An ATM system does.
Once decision makers start using the DWH, and start
reaping the benefits, they start liking it…
Start using the DWH more often, till want it available
100% of the time.
3
How is it Different?
• Starts with a 6x12 availability requirement ... but
7x24 usually becomes the goal.
For business across the globe, 50% of the world may be
sleeping at any one time, but the businesses are up 100%
of the time.
100% availability not a trivial task, need to take into
account loading strategies, refresh rates etc.
4
How is it Different?
• Does not follows the traditional development
model
Requirements
Program
Classical SDLC
Requirements gathering
Analysis
Design
Programming
Testing
Integration
Implementation
5
How is it Different?
• Does not follows the traditional development
model
DWH
Program
Requirements
DWH SDLC (CLDS)
Implement warehouse
Integrate data
Test for biasness
Program w.r.t data
Design DSS system
Analyze results
Understand requirement
6
Data Warehouse Vs. OLTP
OLTP (On Line Transaction Processing)
Select tx_date, balance from tx_table
Where account_ID = 23876;
7
Data Warehouse Vs. OLTP
DWH
Select balance, age, sal, gender from
customer_table, tx_table
Where age between (30 and 40) and
Education = ‘graduate’ and
CustID.customer_table =
Customer_ID.tx_table;
8
Data Warehouse Vs. OLTP
OLTP
DWH
Primary key used
Primary key NOT used
No concept of Primary Index Primary index used
Few rows returned
Many rows returned
May use a single table
Uses multiple tables
High selectivity of query
Low selectivity of query
Indexing on primary key
(unique)
Indexing on primary index
(non-unique)
9
Data Warehouse Vs. OLTP
OLTP: OnLine Transaction Processing (MIS or Database System)
Data Warehouse
OLTP
Scope
* Application –Neutral
* Single source of “truth”
* Evolves over time
* How to improve business
* Application specific
* Multiple databases with repetition
* Off the shelf application
* Runs the business
Data
Perspective
* Historical, detailed data
* Some summary
* Lightly denormalized
* Operational data
* No summary
* Fully normalized
Queries
* Hardly uses PK
* Number of results
returned in thousands
* Based on PK
* Number of results returned in
hundreds
Time factor
* Minutes to hours
* Typical availability 6x12
* Sub seconds to seconds
* Typical availability 24x7
10
Comparison of Response Times
• On-line analytical processing (OLAP) queries must be
executed in a small number of seconds.
– Often requires denormalization and/or sampling.
• Complex query scripts and large list selections can
generally be executed in a small number of minutes.
• Sophisticated clustering algorithms (e.g., data mining)
can generally be executed in a small number of hours
(even for hundreds of thousands of customers).
11
Putting the pieces together
Data
(Tier 0)
Data Warehouse Server
(Tier 1)
Semistructured
Sources
www data
IT
Users
Archived
data
Extract
Transform
Load
(ETL)
Clients
(Tier 3)
MOLAP
Query/Reporting
Meta
Data
Data
Warehouse
Operational
Data Bases
Data sources
OLAP Servers
(Tier 2)
Analysis
Business
Users
Data Mining
ROLAP
Data Marts
Tools
Business Users
12