Transcript Lecture_DW1

Data Warehousing
“An Introduction”
Dr. Akhtar Ali
School of Computing, Engineering and
Information Sciences
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
1
Lecture Outline

New Trends for data/information management
 Background
 Two Approaches

Data Warehousing (DW)
 Definitions and History

DW Architectures
 Strategies for building data warehouses

Problems and Issues
 Maintenance and Performance

DW Support in database management systems
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
2
1: New Trends for
data/information management

Secondary storage is becoming more and more affordable.
 So enterprises keep more and more data
 Data replication is becoming widespread to avoid single point
of failure

What to do with large volumes of data ?
 Decision makers want to get more of data
 Decision support systems (DSSs)
» Have long execution time
» Are CPU-intensive
» Involve Statistical Analysis/Analytical queries

Transaction-oriented databases are not suitable for DSSs.
 Transactional data usually change rapidly
 Database and application servers are already at peak loads
 Transactional data is usually normalized while DSSs require
summarised and highly aggregated data – and possibly denormalized
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
3
Data Management
Past, Present and Future

Past
 File Processing (e.g. COBOL)
 Network and Hierarchical Databases

Present
 Relational, Object-Relational and Object-Oriented Databases
 Fragmentation of Information Systems
» Subject/User/Application-Driven Transaction Processing Systems
» Stand-alone systems e.g.




Manufacturing (Inventory Control)
Finance (Payroll, Stock Management)
Sales Administration (Planning, Suppliers, Daily Sales)
Future
 Integration of Data and Applications
 Data Exchange, Interoperability and Homogeneity in the
presence of Heterogeneity.
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
4
Surviving in the Information
Jungle



Different interfaces and protocols
Different data models and representations
Duplicate and Inconsistent Information
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
5
Solution
Integrated Information Store

Integration Systems
 Collect and combine information from multiple sources
 Provide integrated view and uniform user interface
 Support sharing of data and processing capabilities
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
6
Two Approaches
1: On-Demand/Query-Driven

On-Demand (Lazy) Data Integration is a kind of Virtual Data
Warehouse
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
7
Disadvantages of On-Demand
Approach

Poor response time due to delay in query processing
 Slow or unavailable data sources
 Time consuming and complex filtering and integration



Inefficient and potentially expensive for frequent queries
Wrappers compete on resources with local applications at
data sources
There are only few notable systems based on this approach
e.g.
 TAMBIS: Transparent Access to Multiple Bio-informatics
Information Systems
 SRS: Sequence Retrieval System
 OPM (Object Protocol Model) based multi-database tools and
query language (OPM-QL)
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
8
Two Approaches
2: Data Warehousing


In advance/
Eager data
integration
Integrated
data is
persistently
stored in a
database –
data
warehouse
for direct
querying and
analysis
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
9
Advantages of Data Warehousing
Approach

High performance query processing
 Though the information returned may not be most up-to-date

Does not interfere with local data processing at sources
 Analytical Querying/Statistical Analysis or On-Line Analytical
Processing (OLAP) at warehouse
 On-Line Transaction Processing (OLTP) at data sources

Data Persistently Stored at Warehouse
 Data at the warehouse can be further re-structured,
aggregated, summarized and modified if necessary.
 A DW may store historical/archive data.

Data warehousing approach has been widely used e.g.





The Maryland ADMS Project
Supporting Data Integration and Warehousing Using H2O
The Stanford Data Warehousing Project
GIMS: Genome Information Management System
Marks & Spencer Data Warehouse
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
10
Trade-off between Query-Driven
and
Data Warehousing Approaches

Query-driven approach is still better for:
 Rapidly changing information/data sources;
 Accessing very large amounts of data from many sources;
 Clients with unpredictable and dynamic requirements

Data Warehousing is more suitable when:
 Data sources on which a data warehouse is based are not
frequently changing;
 Data up-to-dateness is not crucially important;
 Querying and Analysis is complex;
 Data needs to be highly summarized and aggregated;
 Fast access to integrated and derived data is vital; and
 Keeping data warehouse consistent with the underlying data
sources is efficient and does not compromise on expected
performance.
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
11
What is a Data Warehouse?
(a practitioner’s viewpoint)

“A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it in
a business context” – Barry Devlin, IBM
Consultant

“A data warehouse is a database of data gathered
from many systems and intended to support
management reporting and decision making” –
Michael Corey et al, CTO of OneWarranty.com
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
12
What is a Data Warehouse?
(classical viewpoint)
Subject
Oriented
Integrated
Data
Warehouse
Non Volatile
Time Variant
According to W. H. Inmon
(Building a Data
Warehouse, 1992)
“A DW is a subjectoriented, integrated,
time-varying, non-volatile
collection of data that is
used primarily in
organizational decision
making.”
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
13
In a Nutshell, a DW is

A persistent collection of diverse data
 Generally speaking, an efficient solution to data
integration
 A single repository of information

Subject-Oriented
 Organized by subject (not by application)
 Used for analysis, reporting, data mining, etc.


Structured and optimized differently from transactionoriented databases
User interface aimed at executive – decision makers
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
14
Data Warehouse History
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
15
Standard DB v. DW
Standard Database
Data Warehouse

Mix of updates and querying

Many small-medium
transactions



MBs to GBs in size
Most Current snapshot

Heavily indexed


Raw Data
Thousands of users (e.g.
clerical to mid-level-mangers)






Mostly reads (infrequent
updates, append-only – very
rarely data is deleted)
Queries are complex and longrunning
GBs to TBs in size
Not the most current
snapshot/Historical
Lots of scans (as data is
readily accessible)
Summarized/Aggregated
Hundreds of users (e.g.
decision-makers, analysts)
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
16
Architectures (I)
Simple

Metadata and raw data of a traditional OLTP system is present, as is an
additional type of data, summary data. Summaries are very valuable in
data warehouses because they pre-compute long operations in advance.
For example, a typical data warehouse query is to retrieve something like
December sales.
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
17
Architectures (II)
With Staging Area
We need to clean and process operational data before putting it into the
warehouse. We can do this programmatically, although most data
warehouses use a staging area instead. A staging area simplifies building
summaries and general warehouse management.
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
18
Architectures (III)
With Staging Area + Data Marts
This is a customized warehouse architecture for different groups within
an organization. By adding data marts, which are systems designed for a
particular line of business, we can build a more customized DW.
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
19
Problems and Issues

Warehouse Maintenance
 Data sources (DSs) on which a DW is based may change over
time.
 Changes at DSs may require changes at a DW.
 How often to propagate changes to a DW?
» At night, weekly/fortnightly/monthly, immediately, etc.
 How to propagate changes to a DW?
» Completely re-build all affected tables at the DW (easy but
inefficient)
» Apply changes to affected tables incrementally (efficient but
difficult)

Performance
 How to assess if a DW is performing well?
 How to improve performance?

Miscellaneous Issues
 Data Quality Assurance (How good is data in a DW?)
 How to cope with data warehouse evolution?
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
20
Data Systems Supporting DW








Oracle 8i, 9i
IBM DB2
Sybase
RedBrick Data Warehouse/Informix
MS SQL Server
Tandem (HP)
Teradata
MicroStrategy
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
21
Bibliography



Advanced Topics in Database Systems by Sharma
Chakravarthy, 2001, University of Texas at Arlington, USA.
Oracle9i Data Warehousing Guide Release 2 (9.2), 2002.
Oracle 8i Data Warehousing by Michael Corey, Michael
Abbey, Ian Abramson, Ben Taub, Oracle Press, 2001.
Database Administration (CG168) – Lecture 10a: Introduction to Data Warehousing
22