Data Warehouse
Download
Report
Transcript Data Warehouse
Data Warehouses
Kathy S. Schwaig
Outline
Data Explosion
Data Warehouses
Multi-dimensional databases
Portions of this presentation are adapted from @ J. Han,
Simon Fraser University, Canada, 2000
Now that we have gathered so much
data, What do we do with it?
“ I never waste memory on things that can easily be
stored and retrieved from elsewhere.”
- Albert Einstein
Data Explosion Problem
Automated data collection tools and mature
database technology lead to tremendous amounts
of data stored in databases.
We are drowning in data, but starving for
knowledge!
What is a Data Warehouse?
An integrated and consistent
store of subject-oriented data,
structured for query and
retrieval in order to support
management decision making.
A data warehouse is where the information
systems department puts data to be turned
into information.
One cannot just dump masses of data into a
disk drive and expect it to be usable.
Goal of Data Warehousing
Resolve enormous data access difficulties:
Unavailable data hidden in transaction systems
Delays as underpowered systems try to perform
huge, complex queries
Complex, user-hostile interfaces
Difficulties in discovering patterns in large
amounts of data
Competition for computer resources between
transaction systems and decision support systems
On-line Transaction Processing (OLTP)
Traditional database management systems (DBMS) used
for on-line transaction processing (OLTP).
Order entry: update status field of order 445522
Banking: transfer $100 from account 55779 to account
99321
Characteristics:
detailed up-to-date data
structured, repetitive tasks
short transactions
read and/or update a few records
An OLTP example
You call retailer Land’s End, where you have done business
before. The exchange might be:
“Hi, this is Mr. Smith. I’d like to place an order”
“Your phone number, please?”
“555-555-1212”
(Pulls up your file) “Yes, Mr. Smith. What can I help you with?
“I’d like to order merchandise number 2222”
“I see you were a little late last year in getting your Christmas
presents ordered. Would you like some suggestions to get the process
started earlier?”
“Sure, Why not?”
“Last year, you bought your Aunt Jennifer a scarf. We have a lovely
pair of gloves to match --they are on special for only $19. Should I
add those to your order?”
“Uh...sure.”
“And would you like the card to say the same as last year?”
“Yes, please.”
Decision Support versus
Transaction Processing
Characteristics and usage patterns of
operational systems (transaction processing
systems) used to automate business processes
and those of a Decision Support System are
fundamentally different but linked. Why?
What is a Data Warehouse?
Facility for integrating data
Organizes and stores data for analytical
processing from historical perspective
Maintained separately from
organization’s operational database
Data Warehouse Architecture
DSS
Server
metadata
other
sources
Extract
Transform
Load
Refresh
Serve
Data
Warehouse
Analysis
Query
Reports
Data mining
Operational
DBs
Tools
Data Sources
Data Marts
Characteristics of a Data
Warehouse
Subject-oriented
Integrated
Non-volatile
Time-varying
1. Subject Oriented
Oriented to the major subject areas of the
corporation
E.g. insurance company: customer, product,
transaction, policy, claim, account
Operational database and applications
may be organized differently
E.g. based on type of insurance's: auto, life,
medical, fire.
2. Integrated
Inconsistencies in encoding and naming
conventions exist among data sources.
Why?
Data converted
3. Non-Volatile
Operational data regularly accessed and
manipulated a record at a time. Update
performed in operational environment.
Warehouse data loaded and accessed.
Update of data does not occur in the data
warehouse environment.
4. Time Variant...
A data warehouse is a “time-variant”
collection of data, meaning time is
a variable in accessing the data.
Time Variant
Time horizon for data longer than that of operational
systems.
Operational database contains current value data.
Data warehouse data is a sophisticated series of
snapshots.
The key structure of operational data may or may not
contain some element of time. The key structure of the
data warehouse always contains some element of time.
Data Mart
A data mart is a smaller version of a data
warehouse, typically containing data related
to a single functional area of the firm or
having limited scope in some other way.
It can be a useful first step to a full-scale data
warehouse.
Data: The Critical Issue
Users need to gather, analyze, report on business
information to help organizations gain competitive
advantage.
Most companies have a wealth of legacy data.
Worthless if:
existence unknown
cannot be found
cannot be understood
incorrect
Data Transformation
Simple transformation -- e.g. change data
type of field from integer to character
Cleansing & scrubbing -- consistent format,
valid values
Integration -- data from multiple sources
and map field by field into data warehouse.
Aggregation / summarization
Sample Operations
Roll up -- summarize data
total sales volume last year by product category by
region
Roll down, drill down, drill through -- go from
higher level summary to lower level summary or
detailed data
For a particular product category, find the detailed
sales data for each salesperson by date
Slice and dice
Sales of beverages in the West over the last 6
months
Why Multi-Dimensional Databases?
No single "best" data structure for all applications
within an enterprise. Need good conceptual fit
with the way end-users visualize business data
Most business people already think about their
businesses in multidimensional terms
Managers tend to ask questions about product
sales in different markets over specific time
periods
Adapted from Arun Rai 1999
What is a Multi-Dimensional
Database?
A multidimensional database (MDD) is a
computer software system designed for the
efficient and convenient storage and retrieval
of large volumes of data that are:
(1) intimately related
(2) stored, viewed and analyzed from different
perspectives. Perspectives called
dimensions.
Contrasting Relational and Multi-Dimensional
Models: An Example
SALES VOLUMES FOR GLEASON DEALERSHIP
MODEL
MINI VAN
MINI VAN
MINI VAN
SPORTS COUPE
SPORTS COUPE
SPORTS COUPE
SEDAN
SEDAN
SEDAN
COLOR
BLUE
RED
WHITE
BLUE
RED
WHITE
BLUE
RED
WHITE
SALES VOLUME
6
5
4
3
5
5
4
3
2
Sales Volumes
M
O
D
E
L
Mini Van
6
5
4
Coupe
3
5
5
Sedan
4
3
2
Blue
Red
White
COLOR
Mutlidimensional
Representation
Sales Volumes
M
O
D
E
L
Mini Van
Coupe
Carr
Gleason
Clyde
Sedan
Blue
Red
White
COLOR
DEALERSHIP
View Data – An Example
Sales Volumes
M
O
D
E
L
DEALERSHIP
COLOR
•Assume that each dimension has 10 positions, as shown in the cube above
•How many records would be there in a relational table?
•Implications for viewing data from an end-user standpoint?
Data Warehousing and
The World Wide Web
•Access and transfer large numbers of data relatively
easily and economically
• Integration of external data into data warehouse
•Issues of data integrity, accuracy, quality
•Quality rating versus price
Applications
•Data Mining
•Data Visualization
(Coming Next)
Summary
•Data versus Information
• Data Warehouse Architecture
•Characteristics
•Applications
Appendix: Operational Data Store and Data Warehouse
Characteristic
Characteristic
Operational Data Store
How is it built?
One application or subject area
at a time.
Typically multiple subject
areas at a time.
User requirements
Well defined prior to logical
design.
Often vague and conflicting.
Area of support
Day-to-day business operations.
Decision support for
managerial activities.
Type of access
Relatively small number of
records retrieved via a single
query.
Tuned for frequent access to
small amounts of data.
Large data sets scanned to
retrieve results from either
single or multiple queries.
Tuned for infrequent access to
large amounts of data.
Frequency of
access
Volume of data
Data Warehouse
Similar to typical daily volume of Much larger than typical daily
operational transactions.
transaction volume.
Appendix: Operational Data Store and Data Warehouse
Characteristic (cont’d)
Characteristic
Operational Data Store
Data Warehouse
Retention period Retained as necessary to meet Retention period is indeterminate
daily operating requirements. and must support historical
reporting, comparison, and analysis
Currency of data Up-to-the-minuet; real time.
Typically represents a static point in
time.
Availability of
data
High and immediate
availability may be required.
Immediate availability is less
critical.
Typical unit of
analysis
Small, manageable,
transaction-level units.
Large, unpredictable,variable
units.
Design focus
High-performance, limited
flexibility.
High flexibility, high-performance.
Appendix: Characteristics of a Data Warehouse
Subject orientation.
Integrated.
Nonvolatile.
Time variant.
Summarized.
Large volume.
Not normalized.
Metadata.
Data sources.
Data are organized based on how the
users to them.
All inconsistencies regarding naming
convention and value representations are
removed.
Data are stored in read-only format and do not
change over time.
Data are not current but normally time-series.
Operational data are mapped into a decision-usable
format.
Time-series data sets are normally quite large.
DW data can be, and often are, redundant.
Data about data are stored.
Data come from internal and external unintegrated
operational systems