Problems with Today`s Information Environment

Download Report

Transcript Problems with Today`s Information Environment

ITCS 6163 Data Warehousing
Xintao Wu
History
60s
C. Bachman GE network data model
Late 60s
IBM IMS hierarchical data model
70
E.Codd relational model
80s
SQL IBM R transaction J. Gray
Late 80s-90s DB2, Oracle, informix, sybase
90sDW, internet
Turing award and Turing test?
Evolution of Database Technology
(See Fig. 1.1)
1960s:

Data collection, database creation, IMS and network DBMS
1970s:

Relational data model, relational DBMS implementation
1980s:

RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial, scientific,
engineering, etc.)
1990s—2000s:

Data mining and data warehousing, multimedia databases, and
Web databases
Can You Easily Answer
These Questions?
What is the correlation
between expenditures
and collection of
delinquent taxes?
What is the impact on
revenues and expenditures
of changing the operating
hours of the Dept. of Motor
Vehicles?
What are Personnel
Services costs across
all departments for
all funding sources?
What are the effects
of outsourcing
specific services?
What is the economic
impact of the small
business initiative in our
district?
Overview: Data Warehousing and
OLAP Technology for Data Mining
What is a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
What is a Warehouse?
Collection of diverse data







subject oriented
aimed at executive, decision maker
often a copy of operational data
with value-added data (e.g., summaries, history)
integrated
time-varying
non-volatile
more
What is a Warehouse?
Collection of tools





gathering data
cleansing, integrating, ...
querying, reporting, analysis
data mining
monitoring, administering warehouse
What is a Warehouse?
Defined in many different ways, but not
rigorously.


A decision support database that is maintained separately from
the organization’s operational database
Support information processing by providing a solid platform of
consolidated, historical data for analysis.
“A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of management’s
decision-making process.”—W. H. Inmon
Data warehousing:

The process of constructing and using data warehouses
Data Warehouse—Subject-Oriented
Organized around major subjects, such as
customer, product, sales.
Focusing on the modeling and analysis of data
for decision makers, not on daily operations or
transaction processing.
Provide a simple and concise view around
particular subject issues by excluding data that
are not useful in the decision support process.
Data Warehouse—Integrated
Constructed by integrating multiple,
heterogeneous data sources

relational databases, flat files, on-line transaction
records
Data cleaning and data integration techniques
are applied.

Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
 E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is
converted.
Data Warehouse—Time Variant
The time horizon for the data warehouse is
significantly longer than that of operational
systems.


Operational database: current value data.
Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse


Contains an element of time, explicitly or implicitly
But the key of operational data may or may not
contain “time element”.
Data Warehouse—Non-Volatile
A physically separate store of data transformed
from the operational environment.
Operational update of data does not occur in
the data warehouse environment.

Does not require transaction processing, recovery,
and concurrency control mechanisms

Requires only two operations in data accessing:
 initial loading of data and access of data.
Warehouse is specialized DB
Standard DB
Mostly updates
Many small
transactions
Mb-Tb of data
Current snapshot
Raw data
Clerical users
Warehouse
Mostly reads
Queries are long,
complex
Gb-Tb of data
History
Summarized,
consolidated data
Decision-makers,
analysts as users
Data Warehouse vs. Heterogeneous
DBMS
Traditional heterogeneous DB integration:

Build wrappers/mediators on top of heterogeneous databases

Query driven approach
 When a query is posed to a client site, a meta-dictionary is
used to translate the query into queries appropriate for
individual heterogeneous sites involved, and the results are
integrated into a global answer set
 Complex information filtering, compete for resources
Data warehouse: update-driven, high
performance

Information from heterogeneous sources is integrated in advance
and stored in warehouses for direct query and analysis
Data Warehouse vs. Operational
DBMS
OLTP (on-line transaction processing)


Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking,
manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)

Major task of data warehouse system

Data analysis and decision making
Distinct features (OLTP vs. OLAP):

User and system orientation: customer vs. market

Data contents: current, detailed vs. historical, consolidated

Database design: ER + application vs. star + subject

View: current, local vs. evolutionary, integrated

Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
OLTP
OLAP
users
clerk, IT professional
knowledge worker
function
day to day operations
decision support
DB design
application-oriented
subject-oriented
data
current, up-to-date
detailed, flat relational
isolated
repetitive
historical,
summarized, multidimensional
integrated, consolidated
ad-hoc
lots of scans
unit of work
read/write
index/hash on prim. key
short, simple transaction
# records accessed
tens
millions
#users
thousands
hundreds
DB size
100MB-GB
100GB-TB
metric
transaction throughput
query throughput, response
usage
access
complex query
Overview: Data Warehousing and
OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Why Separate Data Warehouse?
High performance for both systems


DBMS— tuned for OLTP: access methods, indexing,
concurrency control, recovery
Warehouse—tuned for OLAP: complex OLAP queries,
multidimensional view, consolidation.
Different functions and different data:



missing data: Decision support requires historical data
which operational DBs do not typically maintain
data consolidation: DS requires consolidation
(aggregation, summarization) of data from
heterogeneous sources
data quality: different sources typically use inconsistent
data representations, codes and formats which have to
be reconciled
Warehouse Architecture
Client
Client
Query & Analysis
Metadata
Warehouse
Integration
Source
Source
Source
Why a Warehouse?
Two Approaches:


Query-Driven (Eager)
Warehouse (Lazy)
?
Source
Source
Query-Driven Approach
Client
Client
Mediator
Wrapper
Source
Wrapper
Wrapper
Source
Source
Advantages of Warehousing
High query performance
Queries not visible outside warehouse
Local processing at sources unaffected
Can operate when sources unavailable
Can query data not stored in a DBMS
Extra information at warehouse


Modify, summarize (store aggregates)
Add historical information
Advantages of Query-Driven
No need to copy data


less storage
no need to purchase data
More up-to-date data
Query needs can be unknown
Only query interface needed at sources
Overview: Data Warehousing and
OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Modeling OLTP Systems
Goal -- Update as many transactions as possible
in the shortest period of time
Approach
Model to 3rd Normal Form (3NF)
 Minimize redundancy to optimize update

Result
Create many (hundreds) of tables
 Difficult for business users to understand and use
 Retrieval requires many JOINs = lousy performance

Modeling the Data Warehouse
Tuning the relational model

Denormalize
– Reduces the number of tables
– Improves usability
– Improves performance

Add aggregate data (typically separate tables)
– Improves performance
– Degrades usability
Modeling the Data Warehouse
“Entity relation data models are a disaster for
querying because they cannot be understood
by users and they cannot be navigated
usefully by DBMS software. Entity relation
models cannot be used as the basis for
enterprise data warehouses.”
Ralph Kimball, The Data Warehouse Toolkit,
1996, John Wiley & Sons, Inc., ISBN 0-471-15337-0
From Tables and Spreadsheets
to Data Cubes
A data warehouse is based on a multidimensional data
model which views data in the form of a data cube
A data cube, such as sales, allows data to be modeled
and viewed in multiple dimensions


Dimension tables, such as item (item_name, brand, type), or
time(day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to
each of the related dimension tables
In data warehousing literature, an n-D base cube is
called a base cuboid. The top most 0-D cuboid, which
holds the highest-level of summarization, is called the
apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
all
time
time,item
0-D(apex) cuboid
item
time,location
location
item,location
time,supplier
time,item,location
supplier
1-D cuboids
location,supplier
2-D cuboids
item,supplier
time,location,supplier
3-D cuboids
time,item,supplier
item,location,supplier
4-D(base) cuboid
time, item, location, supplier
Conceptual Modeling of Data
Warehouses
Modeling data warehouses: dimensions &
measures

Star schema: A fact table in the middle connected to a set of
dimension tables

Snowflake schema: A refinement of star schema where some
dimensional hierarchy is normalized into a set of smaller
dimension tables, forming a shape similar to snowflake

Fact constellations: Multiple fact tables share dimension tables,
viewed as a collection of stars, therefore called galaxy schema or
fact constellation
Example of Star Schema
time
item
time_key
day
day_of_the_week
month
quarter
year
Sales Fact Table
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_type
location
location_key
street
city
province_or_street
country
Example of Snowflake
Schema
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
branch_key
branch
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
item_key
item_name
brand
type
supplier_key
supplier
supplier_key
supplier_type
location
location_key
street
city_key
city
city_key
city
province_or_street
country
Example of Fact Constellation
time
time_key
day
day_of_the_week
month
quarter
year
item
Sales Fact Table
time_key
item_key
item_name
brand
type
supplier_type
item_key
location_key
branch_key
branch_name
branch_type
units_sold
dollars_sold
avg_sales
Measures
time_key
item_key
shipper_key
from_location
branch_key
branch
Shipping Fact Table
location
to_location
location_key
street
city
province_or_street
country
dollars_cost
units_shipped
shipper
shipper_key
shipper_name
location_key
shipper_type
Multidimensional Data
Sales volume as a function of product,
month, and region Dimensions: Product, Location, Time
Hierarchical summarization paths
Industry Region
Year
Product
Category Country Quarter
Product
City
Office
Month
Month Week
Day
A Sample Data Cube
2Qtr
3Qtr
4Qtr
sum
U.S.A
Canada
Mexico
sum
Country
TV
PC
VCR
sum
1Qtr
Date
Total annual sales
of TV in U.S.A.
Cuboids Corresponding to the Cube
all
0-D(apex) cuboid
product
product,date
date
country
product,country
1-D cuboids
date, country
2-D cuboids
3-D(base) cuboid
product, date, country
Typical OLAP Operations
Roll up (drill-up): summarize data

by climbing up hierarchy or by dimension reduction
Drill down (roll down): reverse of roll-up

from higher level summary to lower level summary or detailed
data, or introducing new dimensions
Slice and dice:

project and select
Pivot (rotate):

reorient the cube, visualization, 3D to series of 2D planes.
Other operations


drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its backend relational tables (using SQL)
Relational Operators
Select
Project
Join
Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
ans
date
1
2
sum
81
48
Another Example
• Add up amounts by day, product
• In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date, prodId
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
rollup
drill-down
sale
prodId
p1
p2
p1
date
1
1
2
amt
62
19
48
Aggregates
Operators: sum, count, max, min,
median, avg
Type



Distributive
Algebraic
holistic
“Having” clause
Using dimension hierarchy


average by region (within store)
maximum by month (within date)
Cube Aggregation
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
Example: computing sums
...
c3
50
8
c2
4
8
rollup
drill-down
c3
50
p1
c1
67
c2
12
c3
50
129
p1
p2
c1
110
19
Aggregation Using Hierarchies
day 2
day 1
p1
p2 c1
p1
12
p2
11
c1
44
c2
4
c2
c3
c3
50
customer
region
8
country
p1
p2
region A region B
56
54
11
8
(customer c1 in Region A;
customers c2, c3 in Region B)
Pivoting
Fact table view:
sale
prodId storeId
p1
c1
p2
c1
p1
c3
p2
c2
p1
c1
p1
c2
Multi-dimensional cube:
date
1
1
1
1
2
2
amt
12
11
50
8
44
4
day 2
day 1
p1
p2 c1
p1
12
p2
11
p1
p2
c1
56
11
c1
44
c2
4
c2
c3
c3
50
8
c2
4
8
c3
50
Overview: Data Warehousing and
OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehousing to data mining
Design of a Data Warehouse: A
Business Analysis Framework
Four views regarding the design of a data
warehouse

Top-down view
 allows selection of the relevant information necessary for the
data warehouse

Data source view
 exposes the information being captured, stored, and
managed by operational systems

Data warehouse view
 consists of fact tables and dimension tables

Business query view
 sees the perspectives of data in the warehouse from the view
of end-user
Data Warehouse Design Process
Top-down, bottom-up approaches or a combination of
both


Top-down: Starts with overall design and planning (mature)
Bottom-up: Starts with experiments and prototypes (rapid)
From software engineering point of view


Waterfall: structured and systematic analysis at each step before
proceeding to the next
Spiral: rapid generation of increasingly functional systems, short
turn around time, quick turn around
Typical data warehouse design process




Choose a business process to model, e.g., orders, invoices, etc.
Choose the grain (atomic level of data) of the business process
Choose the dimensions that will apply to each fact table record
Choose the measure that will populate each fact table record
Multi-Tiered Architecture
other
Metadata
sources
Operational
DBs
Extract
Transform
Load
Refresh
Monitor
&
Integrator
Data
Warehouse
OLAP Server
Serve
Analysis
Query
Reports
Data mining
Data Marts
Data Sources
Data Storage
OLAP Engine Front-End Tools
Three Data Warehouse Models
Enterprise warehouse

collects all of the information about subjects spanning
the entire organization
Data Mart

a subset of corporate-wide data that is of value to a
specific groups of users. Its scope is confined to
specific, selected groups, such as marketing data mart
 Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse


A set of views over operational databases
Only some of the possible summary views may be
materialized
What is a Data Mart ?
A data mart is a small-scale data warehouse that is
focused on a single department or single subject area to
provide a subset of data warehouse data to address
specific reporting and analysis requirements.



Smaller warehouses
Spans part of organization
Finance
Do not require enterprise-wide consensus
 but long term integration problems?
HR
Budget
Purchasing
Asset
Mgmt.
Info.
Tech.
Data Warehouse Development: A
Recommended Approach
Multi-Tier Data
Warehouse
Distributed
Data Marts
Data
Mart
Data
Mart
Model refinement
Enterprise
Data
Warehouse
Model refinement
Define a high-level corporate data model
OLAP Server Architectures
Relational OLAP (ROLAP)




ROLAP - provides a Multi-dimensional view of a relational DB (e.g.
MicroStrategy)
Use relational or extended-relational DBMS to store and manage warehouse data
and OLAP middle ware to support missing pieces
Include optimization of DBMS backend, implementation of aggregation
navigation logic, and additional tools and services
greater scalability
Multidimensional OLAP (MOLAP)


Array-based multidimensional storage engine (sparse matrix techniques)
fast indexing to pre-computed summarized data
Hybrid OLAP (HOLAP)

User flexibility, e.g., low level: relational, high-level: array
Specialized SQL servers

specialized support for SQL queries over star/snowflake schemas
MOLAP Databases





Data is stored using a proprietary
format(MOLAP)
Accessible only through the DB vendor’s tools
Suitable only for summarized data
Data may be summarized in advance or real-time
Examples:



PowerPlay
Holos
Essbase
RDBMS: Indexing Strategies

Select columns to be indexed:


Choose combinations of columns most often used to
constrain queries (“where …” clause)
Queries must use constraining columns in the same
order as the columns in the index

Unique more efficient than non-unique.

More indexes means faster query performance,
but also longer transformation/load times.

Types of Indexes:


B-tree -- many possible values (e.g., invoice number)
Bitmap -- few possible values (e.g., M/F, S/M/D/W)
MOLAP versus ROLAP
MOLAP
ROLAP
Multidimensional OLAP
Relational OLAP
Data stored in multidimensional cube
Transformation required
Data retrieved directly
from cube for analysis
Faster analytical
processing
Cube size limitations
Data stored in relational
database as virtual cube
No transformation
needed
Data retrieved via SQL
from database for
analysis
Slower analytical
processing
No size limitations
Data Warehouse Usage
Three kinds of data warehouse applications

Information processing
 supports querying, basic statistical analysis, and reporting
using crosstabs, tables, charts and graphs

Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations, slice-dice, drilling, pivoting

Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models,
performing classification and prediction, and presenting the
mining results using visualization tools.
Differences among the three tasks
Data Mining & Forecasting
Mining the Warehouse




Choose data population
Select mining technique
Segment data into
groups
Identify data patterns
Forecasting Data




Select trend data
Choose forecast model
Run forecast
Display predictions
Accessing & Analyzing Data
Query & Reporting … retrieving data
directly from the warehouse and preparing it
for presentation
Online Analytical Processing (OLAP) …
analyzing aggregated data from a variety of
perspectives
Data Mining & Forecasting … analyzing
and predicting data using mathematical
models
Query & Reporting
Query the Data ...


Select & filter data
Retrieve results
Report the Results ...


Sort & group data
Format & present
data
Save or Export Data



Save queries &
reports
Export to other tools
Publish HTML pages
Query & Reporting Tools
Cognos
Impromptu
Business Objects
Crystal Info
BrioQuery
IQ
GQL
SAS
Online Analytical Processing
Slice and Dice ...



Select dimensions
Choose measures
Filter by dimensions
Drill Down ...


Drill down
hierarchies
Drill through to
details
Present the Results


Present as
spreadsheet
Display graphically
OLAP Tools
Cognos PowerPlay
Business Analyzer
Holos
BrioAnalyzer
Microstrategy
Oracle Express
SAS
Arbor Essbase
Data Mining & Forecasting
Mining the Warehouse




Choose data population
Select mining technique
Segment data into
groups
Identify data patterns
Forecasting Data




Select trend data
Choose forecast model
Run forecast
Display predictions
Data Mining
sales
records:
tran1
tran2
tran3
tran4
tran5
tran6
cust33
cust45
cust12
cust40
cust12
cust12
p2,
p5,
p1,
p5,
p2,
p9
p5, p8
p8, p11
p9
p8, p11
p9
• Trend: Products p5, p8 often bough together
• Trend: Customer 12 likes product p9
Mining and Forecasting Tools
Scenario
4Thought
Business Miner
Clementine
Darwin
Holos
SAS
Data Warehouse Back-End Tools and
Utilities
Data extraction:
 get data from multiple, heterogeneous, and external
sources
Data cleaning:
 detect errors in the data and rectify them when
possible
Data transformation:
 convert data from legacy or host format to warehouse
format
Load:
 sort, summarize, consolidate, compute views, check
integrity, and build indicies and partitions
Refresh
 propagate the updates from the data sources to the
warehouse
Data Cleaning
• Of primordial interest in the warehouse
•
•
•
•
creation
One of the biggest problems
Difficult to achieve
Probability of one or many of the
sources containing “dirty data” is high.
Lots of manual intervention
Data Cleaning Problems
Data quality problems
Single Source
Schema level
Instance level
(poor schema design) (data entry errors)
.Uniqueness
.Misspellings
Multi-source
Schema level Instance level
(heterogeneity) (overlapping
contradicting
data)
Naming conflicts Inconsistent
aggregation
Multisource problems
All the previous problems +
Schema differences (translation and
integration)

E.g.: EmpID, CID, Sex= M/F, Sex=0/1
Instance level conflicts



Duplicate records, contradicting records
Different measures ($, Euros)
Different aggregation levels (weeks, quarters)
Overview: Data Warehousing and
OLAP Technology for Data Mining
What a data warehouse?
Why a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
From data warehouse to data mining
Data Mining: A KDD Process
Pattern Evaluation

Data mining: the core of
knowledge discovery
Data Mining
process.
Task-relevant Data
Data Warehouse
Data Cleaning
Data Integration
Databases
Selection
Steps of a KDD Process
Learning the application domain:
 relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
 Find useful features, dimensionality/variable reduction,
invariant representation.
Choosing functions of data mining

summarization, classification, regression, association,
clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
 visualization, transformation, removing redundant patterns,
etc.
Use of discovered knowledge
Data Mining and Business
Intelligence
Increasing potential
to support
business decisions
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
End User
Business
Analyst
Data
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP
DBA
From OLAP to On Line Analytical Mining
(OLAM)
Why online analytical mining?




High quality of data in data warehouses
 DW contains integrated, consistent, cleaned data
Available information processing structure surrounding data
warehouses
 ODBC, OLEDB, Web accessing, service facilities, reporting
and OLAP tools
OLAP-based exploratory data analysis
 mining with drilling, dicing, pivoting, etc.
On-line selection of data mining functions
 integration and swapping of multiple mining functions,
algorithms, and tasks.
Architecture of OLAM
An OLAM Architecture
Mining query
Mining result
Layer4
User Interface
User GUI API
OLAM
Engine
OLAP
Engine
Layer3
OLAP/OLAM
Data Cube API
Layer2
MDDB
MDDB
Meta Data
Filtering&Integration
Database API
Filtering
Layer1
Data cleaning
Databases
Data
Data integration Warehouse
Data
Repository
Summary
Data warehouse

A subject-oriented, integrated, time-variant, and nonvolatile
collection of data in support of management’s decision-making
process
A multi-dimensional model of a data warehouse

Star schema, snowflake schema, fact constellations

A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and
pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
From OLAP to OLAM