Data Warehousing – CG124
Download
Report
Transcript Data Warehousing – CG124
Fundamentals of
Data Warehousing
Dr. Akhtar Ali
School of Computing, Engineering and
Information Sciences
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
1
Lecture Outline
1.
Inmon’s Four Characteristics of a DW
Subject-Oriented, Integrated, Time Variant and Non-Volatile
2.
3.
Some Useful Definitions
Issues in Data Warehousing
DW Design
Extraction
Integration
Creation and Loading
DW Maintenance
Querying and Optimization
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
2
1.1 Subject-Oriented
Data is categorized and stored by business subject rather than
by application.
Equity
Plans
Savings
Shares
Loans
Operational Systems
Customer
Financial
Information
Data Warehouse
Subject Area
Insurance
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
3
1.2 Integrated
Data on a given subject is defined and stored once.
Savings
Application
No
Application
Flavor
Current
Accounts
Application
Loans
Application
Subject = Customer
Operational Environment
Data Warehouse
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
4
1.3 Time Variant
Data is stored as a series of snapshots, each representing a
period of time.
Time
Data
01/03 Data for January
02/03 Data for February
03/03 Data for March
Data
Warehouse
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
5
1.4 Non-Volatile
Typically data in the data warehouse is not directly updated or deleted.
Load
Operational Databases
INSERT
Warehouse Database
Read (e.g. SELECT)
Read
UPDATE
DELETE
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
6
2.1 What is an Operational Data
Store?
1.
An operational data store (ODS) is the point of integration
for operational/transaction-oriented systems.
For example Banks typically have several independent systems set
up to support different financial products e.g. loans, checking
accounts, savings accounts etc.
The advent of ATMs helped push many banks to create an ODS to
integrate current balances and recent transactional data from these
separate accounts under one customer number.
Such ODSs are normally kept separate from a DW.
2.
An ODS may be seen as a lowest layer of a DW for lowermanagement to access detailed as well as integrated data.
This means that an ODS may be seen as a front edge of a DW.
Such ODSs are normally kept as part of a DW.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
7
2.2 What is a Data Mart?
1.
A data mart (DMT) is a logical subset of a complete DW.
A DMT is a complete “pie-wedge” of the overall DW pie.
A DW is a made up of the union of all its DMTs.
Some people take the definition literally. They create several
independent DMTs to meet the needs of several departments.
Will everyone be happy? Well, maybe. There may be serious issues of
integrating these DMTs together.
2.
A DMT is an extension of a DW.
Data is integrated as it enters the DW. DMTs then derive data from the
central source, the DW.
Each department gets its own DMT.
Each department determines which of the data warehouse contents are
of interest .
These subject areas are then replicated into the smaller and local DMT
so that users can get to the data they want with less interference from
other departments.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
8
2.3 What is On-Line Analytical
Processing (OLAP) ?
OLAP is complementary to data warehousing.
OLAP embodies general activities of querying and presenting text and
number data from DWs.
OLAP is based on dimensional modelling as opposed to entityrelationship (ER) modelling.
A dimensional model may contain the same information as an ER model
but packages data in a symmetric form.
A dimensional model is geared towards user understandability and high
performance query processing.
ROLAP (Relational OLAP)
A set of user interfaces and applications that give a relational database
(RDB) a dimensional flavour.
MOLAP (Multi-dimensional OLAP)
A set of user interfaces, applications and proprietary database technologies
that have a strong dimensional flavour.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
9
2.4 What is Data Mining (DM) ?
DM is often defined as “finding hidden information” in a database.
Alternatively, DM is “exploratory data analysis, data driven discovery, and
deductive learning”.
Data mining software is a class of tools that apply artificial intelligence
techniques to the analysis of data.
Given access to data, DM tools dig through the data looking for patterns and
discovering relationships that the user might have never suspected.
DM tools work against an operational database or a DW.
Since data in a DW is usually integrated and summarized it may be more
efficient to use it for DM.
But a DM tool may find more useful information from an operational database
(compared to a DW) as a DW usually hosts data to support anticipated DSS and
may miss out data useful for DM.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
10
3.0 Generic Data Warehouse
Architecture
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
11
3.1 Warehouse Design
Influenced by both maintenance and querying
Many trade-offs
Space vs. update time vs. query performance
Logical model of data
ER vs. Dimensional, Relational vs. OR vs. OO, ROLAP vs.
MOLAP
Identify sources of data
Identify warehouse data – what to materialize?
Which summary tables?
Which fact/dimensional tables?
Which indices?
Choose software and hardware
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
12
3.2 Data Extraction
Selecting relevant data from data sources (DSs) and moving
it into DW.
DS types
Database (e.g. relational), flat file, WWW, XML, COBOL, etc
How to obtain the data?
Using data replication servers/tools
Dump file or Export tools
ODBC/JDBC/CORBA/RMI/COM and DCOM
Third party Wrappers/Middleware/Agents
Other activities: data transformation, change detection
(monitoring), cleansing etc.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
13
3.2.1 Monitors
Detecting changes (of interest to a DW) in data sources and
propagate to DW.
How?
Triggers
Replication servers/tools
Log Sniffer
Compare query results
Compare snapshots/dumps/exported data
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
14
3.2.1 Data Cleansing
Finds and removes duplicate tuples
For example Judie Harris Morris vs. Judie H. Morris
Detect inconsistent or wrong data
Attribute values that do not match (because of wrong data types or
violating certain constraints e.g. for Gender attribute a value of ‘N’
meaning neither may be rejected).
Unreadable or incomplete data
Notify DSs of errors found during the cleansing process
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
15
3.3 Data Integration
Receive data (changes) from multiple
wrappers/monitors/data cleansers and integrate into DW.
Often Rule-based
Actions
Resolve inconsistencies
Eliminate duplicates
Integrate into DW
Summarize data
Fetch more data from DSs
Notify users that DW is now up-to-date
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
16
3.4 .1 Warehouse Loading
Includes all of the previous processes
Similar to loading/populating a database but complex due to
heterogeneity of data and dealing with multiple DSs,
possibly remote and external.
Building indices
Checking integrity constraints, etc.
Issue: huge volumes of data but small time window to
complete the process.
Computation of additional data
Auxiliary data to facilitate DW maintenance and support/speed up
querying and analysis.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
17
3.4.2 Warehouse Creation
A DW can be seen as a collection of materialized views
(MVs) over DSs.
Contains a copy of data (collected from DSs) tailored to end-users.
Steps
Create DW schema (e.g. creating fact and dimensional tables,
defining MVs)
Load warehouse
Start monitoring for changes at DSs
Update/Maintain DW as needed.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
18
3.5 DW Maintenance
DSs on which a DW is based may change over time.
Changes at DSs may require changes at a DW.
How often to propagate changes to a DW?
At night, weekly/fortnightly/monthly, immediately, etc.
Off-line or on-line
Most current vendor products take a warehouse off-line during
maintenance
How to propagate changes to a DW?
Completely re-build all affected tables at the DW (easy but
inefficient)
Apply changes to affected tables incrementally (efficient but
difficult)
Read my paper about MOVIE or wait until we discuss this
topic in detail.
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
19
3.6 Querying and Optimization
Queries are long-running and complex
Multiple/Nested joins and aggregation
Usually “touch all tuples” kind of queries
Query language or analysis tools must support multidimensional operations
Pivot, Slice/Dice, Rollup, Percentile, etc
Standard SQL does not provide adequate operations
Solution: pre-compute partial answers and reuse
Drawback: it may increase DW maintenance
Emergence of warehouse management systems (WHMS e.g.
ADMS, WHIPS)
Database Administration (CG168) – Lecture 10b: Fundamentals of Data Warehousing
20