Data Warehouse

Download Report

Transcript Data Warehouse

Data Warehouses
Kathy S. Schwaig
Outline
• Data Explosion
• Data Warehouses
• Multi-dimensional databases
Portions of this presentation are adapted from @ J. Han,
Simon Fraser University, Canada, 2000
Now that we have gathered so
much data, What do we do with it?
“ I never waste memory on things that can easily be
stored and retrieved from elsewhere.”
- Albert Einstein
Data Explosion Problem
• Automated data collection tools and mature
database technology lead to tremendous amounts
of data stored in databases.
• We are drowning in data, but starving for
knowledge!
What is a Data Warehouse?
An integrated and consistent store
of subject-oriented data,
structured for query and
retrieval in order to support
management decision making.
A data warehouse is where the information
systems department puts data to be turned
into information.
One cannot just dump masses of data into a
disk drive and expect it to be usable.
Goal of Data Warehousing
• Resolve enormous data access difficulties:
– Unavailable data hidden in transaction
systems
– Delays as underpowered systems try to
perform huge, complex queries
– Complex, user-hostile interfaces
– Difficulties in discovering patterns in large
amounts of data
– Competition for computer resources
between transaction systems and decision
support systems
On-line Transaction Processing (OLTP)
• Traditional database management systems (DBMS) used
for on-line transaction processing (OLTP).
• Order entry: update status field of order 445522
• Banking: transfer $100 from account 55779 to account
99321
• Characteristics:
– detailed up-to-date data
– structured, repetitive tasks
– short transactions
– read and/or update a few records
An OLTP example
• You call retailer Land’s End, where you have done business before.
The exchange might be:
–
–
–
–
–
–
–
–
–
–
–
“Hi, this is Mr. Smith. I’d like to place an order”
“Your phone number, please?”
“555-555-1212”
(Pulls up your file) “Yes, Mr. Smith. What can I help you with?
“I’d like to order merchandise number 2222”
“I see you were a little late last year in getting your Christmas presents ordered.
Would you like some suggestions to get the process started earlier?”
“Sure, Why not?”
“Last year, you bought your Aunt Jennifer a scarf. We have a lovely pair of gloves to
match --they are on special for only $19. Should I add those to your order?”
“Uh...sure.”
“And would you like the card to say the same as last year?”
“Yes, please.”
Decision Support versus Transaction
Processing
Characteristics and usage patterns of operational
systems (transaction processing systems) used to
automate business processes and those of a
Decision Support System are fundamentally
different but linked. Why?
What is a Data Warehouse?
• Facility for integrating data
• Organizes and stores data for analytical
processing from historical perspective
• Maintained separately from
organization’s operational database
Data Warehouse Architecture
DSS
Server
metadata
other
sources
Extract
Transform
Load
Refresh
Serve
Data
Warehouse
Analysis
Query
Reports
Data mining
Operational
DBs
Tools
Data Sources
Data Marts
Characteristics of a Data
Warehouse
•
•
•
•
Subject-oriented
Integrated
Non-volatile
Time-varying
1. Subject Oriented
• Oriented to the major subject areas of the
corporation
– E.g. insurance company: customer, product,
transaction, policy, claim, account
• Operational database and applications may
be organized differently
– E.g. based on type of insurance's: auto, life,
medical, fire.
2. Integrated
• Inconsistencies in encoding and naming
conventions exist among data sources. Why?
• Data converted
3. Non-Volatile
• Operational data regularly accessed and
manipulated a record at a time. Update
performed in operational environment.
• Warehouse data loaded and accessed.
• Update of data does not occur in the data
warehouse environment.
4. Time Variant...
A data warehouse is a “time-variant”
collection of data, meaning time is a
variable in accessing the data.
Time Variant
• Time horizon for data longer than that of
operational systems.
• Operational database contains current value data.
• Data warehouse data is a sophisticated series of
snapshots.
• The key structure of operational data may or may
not contain some element of time. The key
structure of the data warehouse always contains
some element of time.
Data Mart
A data mart is a smaller version of a data
warehouse, typically containing data related to a
single functional area of the firm or having
limited scope in some other way.
It can be a useful first step to a full-scale data
warehouse.
Data: The Critical Issue
• Users need to gather, analyze, report on business
information to help organizations gain competitive
advantage.
• Most companies have a wealth of legacy data.
Worthless if:
–
–
–
–
existence unknown
cannot be found
cannot be understood
incorrect
Data Transformation
• Simple transformation -- e.g. change data type
of field from integer to character
• Cleansing & scrubbing -- consistent format,
valid values
• Integration -- data from multiple sources and
map field by field into data warehouse.
• Aggregation / summarization
Sample Operations
• Roll up -- summarize data
– total sales volume last year by product category by region
• Roll down, drill down, drill through -- go from
higher level summary to lower level summary or
detailed data
– For a particular product category, find the detailed sales data
for each salesperson by date
• Slice and dice
– Sales of beverages in the West over the last 6 months
Why Multi-Dimensional Databases?
No single "best" data structure for all applications
within an enterprise. Need good conceptual fit with
the way end-users visualize business data
– Most business people already think about their
businesses in multidimensional terms
– Managers tend to ask questions about product
sales in different markets over specific time
periods
Adapted from Arun Rai 1999
What is a Multi-Dimensional Database?
A multidimensional database (MDD) is a
computer software system designed for the
efficient and convenient storage and retrieval
of large volumes of data that are:
(1) intimately related
(2) stored, viewed and analyzed from different
perspectives. Perspectives called
dimensions.
Contrasting Relational and MultiDimensional Models: An Example
SALES VOLUMES FOR GLEASON DEALERSHIP
MODEL
MINI VAN
MINI VAN
MINI VAN
SPORTS COUPE
SPORTS COUPE
SPORTS COUPE
SEDAN
SEDAN
SEDAN
COLOR
BLUE
RED
WHITE
BLUE
RED
WHITE
BLUE
RED
WHITE
SALES VOLUME
6
5
4
3
5
5
4
3
2
Sales Volumes
M
O
D
E
L
Mini Van
6
5
4
Coupe
3
5
5
Sedan
4
3
2
Blue
Red
White
COLOR
Mutlidimensional Representation
Sales Volumes
M
O
D
E
L
Mini Van
Coupe
Carr
Gleason
Cly de
Sedan
Blue
Red
White
COLOR
DEALERSHIP
View Data – An Example
Sales Volumes
M
O
D
E
L
DEALERSHIP
COLOR
•Assume that each dimension has 10 positions, as shown in the cube above
•How many records would be there in a relational table?
•Implications for viewing data from an end-user standpoint?
Data Warehousing and
The World Wide Web
•Access and transfer large numbers of data relatively
easily and economically
• Integration of external data into data warehouse
•Issues of data integrity, accuracy, quality
•Quality rating versus price
Applications
•Data Mining
•Data Visualization
(Coming Next)
Summary
•Data versus Information
• Data Warehouse Architecture
•Characteristics
•Applications
Appendix: Operational Data Store and Data Warehouse
Characteristic
Characteristic
Operational Data Store
How is it built?
One application or subject area
at a time.
Typically multiple subject
areas at a time.
User requirements
Well defined prior to logical
design.
Often vague and conflicting.
Area of support
Day-to-day business operations.
Decision support for
managerial activities.
Type of access
Relatively small number of
records retrieved via a single
query.
Tuned for frequent access to
small amounts of data.
Large data sets scanned to
retrieve results from either
single or multiple queries.
Tuned for infrequent access to
large amounts of data.
Frequency of
access
Volume of data
Data Warehouse
Similar to typical daily volume of Much larger than typical daily
operational transactions.
transaction volume.
Appendix: Operational Data Store and Data Warehouse
Characteristic (cont’d)
Characteristic
Operational Data Store
Data Warehouse
Retention period Retained as necessary to meet Retention period is indeterminate
daily operating requirements. and must support historical
reporting, comparison, and analysis
Currency of data Up-to-the-minuet; real time.
Typically represents a static point in
time.
Availability of
data
High and immediate
availability may be required.
Immediate availability is less
critical.
Typical unit of
analysis
Small, manageable,
transaction-level units.
Large, unpredictable,variable
units.
Design focus
High-performance, limited
flexibility.
High flexibility, high-performance.
Appendix: Characteristics of a Data Warehouse
• Subject orientation.
• Integrated.
• Nonvolatile.
• Time variant.
• Summarized.
•
•
•
•
Large volume.
Not normalized.
Metadata.
Data sources.
Data are organized based on how the
users to them.
All inconsistencies regarding naming
convention and value representations are
removed.
Data are stored in read-only format and do not
change over time.
Data are not current but normally time-series.
Operational data are mapped into a decision-usable
format.
Time-series data sets are normally quite large.
DW data can be, and often are, redundant.
Data about data are stored.
Data come from internal and external unintegrated
operational systems