On Line Analytical - Computer Science and Engineering

From Transaction Processing to Support for Decision Making
CIS 671
Decision Support Systems
Computerized Information Systems
• Used to “run the business”.
• OSU Examples
– Personnel & Payroll (ARMS)
– Course Offerings
– Students, including course enrollments and grades
• (estimated $30M to replace)
– Inventory
• Transaction Processing
1st Generation DBMS
• Designed for Transaction Processing
– Hierarchical – IBM – IMS
– Network
• Management Information Systems
– Added later
– Mostly standard summary reports
• Produced on a regular basis
Relational DBMS
• Codd – particularly designed for “ad hoc” queries
• First uses for Transaction Processing
• Transaction Data now available on-line
– Use it to help Decision Making
– Ad Hoc
Decision Support Systems (DSS)
• Use a comprehensive view of all aspects of the business.
– Different business units
– Historical data
– Summary information
• Classes of analysis tools:
– Complex “traditional” SQL queries
– Many “group-by” and “aggregation” queries (On Line Analytical Processing)
– Exploratory data analysis - Data Mining
Data Warehousing
• Properties
– Consolidated data from many sources
– Spanning long time periods
– Augmented with summary information
• Size: several gigabytes to terabytes
Data Warehouse Creation
• Integrate schemas from different groups
– Semantic mismatches
• Different currencies
• Different names for same attributes
• Different structures for similar tables
Data Warehouse Creation, cont.
• Extract data from different operational databases and other external sources
– Clean data - correct errors, fill in missing data
– Transform data to match integrated schema
– Load data into warehouse
– Refresh data in a timely fashion
– Purge very old data
– Create metadata repository
• May be so large that it is in a separate database
Data Warehouse - Provide Variety of Analytical Tools
– Complex “traditional” SQL queries
– OLAP query engine
– Data mining algorithm
– Information visualization tools
– Statistical packages
– Report generators
Data Mart
• Departmental subset of a data warehouse
• Top-down approach
– Derive from the organization’s data warehouse
– May be too hard to do all at once
• Bottom-up approach
– Initially create departmental data marts
– Integrate data marts into organizational data warehouse
– If not done carefully, may be hard to integrate
OLTP vs. Data Warehouse DBs
(from Toby J. Teorey, Database Modeling & Design, Morgan Kaufmann, 1999, p. 212)
OLTP | Data Warehouse
Transaction oriented | Subject oriented
Thousands of users | Few users ( 100)
Small (MB to several GB) | Large (hundreds of GB to several TB)
Current data | Historical data
Normalized data (many tables, few columns per table) | Denormalized data (few tables, many columns per table)
Continuous update | Batch updates
Simple to complex queries | Usually very complex queries
Complex “traditional” SQL queries
• Relational DBMS optimized for decision support
– in contrast to a DBMS optimized for transaction processing
• Example:
– Teradata machine from NCR
On Line Analytical Processing (OLAP)
Multidimensional Databases (MDD)
Example from Finkelstein [Fink95]:
SALES_INFO
Branch | ProdID | Date | Sales | Returns
BOS | 1 | 1/2/98 | $1,000.00 | 4
NY | 1 | 1/2/98 | $1,222.00 | 2
CMH | 2 | 1/3/98 | $555.00 | 1
SF | 2 | 1/3/98 | $1,777.00 | 9
BRANCH_INFO
Branch | Region
BOS | A
NY | A
CMH | B
SF | C
PROD_INFO
ProdID | Description | Category
1 | Widget | I
2 | Super Widget | II
REGION_INFO
Region | Territory
A | East
B | East
C | West
• Note that Branch, ProdID, Date → Sales, Returns
• Note the multidimensionality of the SALES_INFO table.
Dimension Hierarchies
(levels listed from most general to most specific)
LOCATION: Territory, Region, Branch
TIME: Year, Quarter, Month or Week, Date
PRODUCT: Category, ProdID
Possible queries:
1. How did product Widget sell in the last
month, and how does this figure compare
with sales over the last five years? How
about by branch, region and territory?
2. Did this product sell better in different
regions, and are there any regional trends?
3. Were there more returns of Widgets over
the last year? Were these returns caused by
defects? Were they manufactured in any
particular plants?
Additional Possible query:
4. Do commissions and pricing affect how salespersons sell the product? Do particular salespersons do a better job of selling the product?
Note that a "multidimensional" spreadsheet would be useful.
Codd called this type of problem On Line Analytical Processing (OLAP), in contrast to On Line Transaction Processing (OLTP).
Codd's rules for OLAP: [Codd93]
1. Multi-Dimensional Concept View
The user should be able to see the data as being
multidimensional insofar as it should be easy to
'pivot' or 'slice and dice'. (See later.)
2. Transparency
The OLAP functionality should be provided behind
the user's existing software without adversely
affecting the functionality of the 'host'.
3. Accessibility
OLAP should allow the user to access diverse data
stores but see the data within a common 'schema'
provided by the OLAP tool.
OLAP Rules, cont.
4. Consistent Reporting Performance
There should not be significant degradation in
performance with large numbers of dimensions or
large quantities of data.
5. Client-Server Architecture
Since much of the data is on mainframes, and the
users work on PCs, the OLAP tool must be able to
bring the two together!
6. Generic Dimensionality
Data dimensions must all be treated equally.
Functions available for one dimension must be
available for others.
OLAP Rules, cont.
7. Dynamic Sparse Matrix Handling
The OLAP tool should be able to work out for
itself the most efficient way to store sparse
matrix data.
8. Multi User Support
This is self-evident.
9. Unrestricted Cross-Dimensional Operations
e.g., individual office overheads are allocated
according to total corporate overheads divided
in proportion to individual office sales.
OLAP Rules, cont.
10. Intuitive Data Manipulation
Navigation should be done by operations on
individual cells rather than menus.
11. Flexible Reporting
Row and column headings must be capable of more
than one dimension each, and of displaying subsets of
any dimension.
12. Unlimited Dimensions and Aggregation Levels
At least 15 dimensions may be required, and within
each there may be many hierarchical levels.
“Pivoting”
Cross Tabulation: Sales by Date and Region
Date | Region A | Region B | Region C | Total
1/2/98 | $2,222 | $0 | $0 | $2,222
1/3/98 | $0 | $555 | $1,777 | $2,332
Total | $2,222 | $555 | $1,777 | $4,554
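The cross tabulation can be reproduced from the SALES_INFO and BRANCH_INFO rows of the Finkelstein example. The following is a minimal Python sketch, not the slides' method; the function name `cross_tab` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# SALES_INFO rows: (branch, prod_id, date, sales, returns)
SALES_INFO = [
    ("BOS", 1, "1/2/98", 1000.00, 4),
    ("NY", 1, "1/2/98", 1222.00, 2),
    ("CMH", 2, "1/3/98", 555.00, 1),
    ("SF", 2, "1/3/98", 1777.00, 9),
]
# BRANCH_INFO: branch -> region
BRANCH_INFO = {"BOS": "A", "NY": "A", "CMH": "B", "SF": "C"}


def cross_tab(rows, row_key, col_key, measure):
    """Sum `measure` into {row: {col: total}}, including 'Total' margins."""
    table = defaultdict(lambda: defaultdict(float))
    for r in rows:
        value = measure(r)
        for rk in (row_key(r), "Total"):
            for ck in (col_key(r), "Total"):
                table[rk][ck] += value
    return table


# Sales by Date (rows) and Region (columns).
by_date_region = cross_tab(
    SALES_INFO,
    row_key=lambda r: r[2],
    col_key=lambda r: BRANCH_INFO[r[0]],
    measure=lambda r: r[3],
)
```

Pivoting amounts to choosing different `row_key` and `col_key` functions over the same rows, for example product by region instead of date by region.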
“Drill Down” (narrower category)
Replace Region by Branch.
Date | Region A | Region B | Region C | Total
1/2/98 | $2,222 | $0 | $0 | $2,222
1/3/98 | $0 | $555 | $1,777 | $2,332
Total | $2,222 | $555 | $1,777 | $4,554
“Rollup” (more general category)
Replace Region by Territory.
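Drill-down and rollup can be viewed as re-aggregating the same sales rows at a different level of the LOCATION hierarchy. A small sketch under that reading, using the example tables; the helper name `sales_by` is hypothetical.

```python
from collections import defaultdict

# SALES_INFO rows: (branch, prod_id, date, sales, returns)
SALES_INFO = [
    ("BOS", 1, "1/2/98", 1000.00, 4),
    ("NY", 1, "1/2/98", 1222.00, 2),
    ("CMH", 2, "1/3/98", 555.00, 1),
    ("SF", 2, "1/3/98", 1777.00, 9),
]
BRANCH_INFO = {"BOS": "A", "NY": "A", "CMH": "B", "SF": "C"}   # branch -> region
REGION_INFO = {"A": "East", "B": "East", "C": "West"}          # region -> territory


def sales_by(level):
    """Total sales per (date, location) at the chosen LOCATION level."""
    to_level = {
        "branch": lambda b: b,
        "region": lambda b: BRANCH_INFO[b],
        "territory": lambda b: REGION_INFO[BRANCH_INFO[b]],
    }[level]
    totals = defaultdict(float)
    for branch, _prod, date, sales, _returns in SALES_INFO:
        totals[(date, to_level(branch))] += sales
    return dict(totals)


drill_down = sales_by("branch")     # Region replaced by Branch
roll_up = sales_by("territory")     # Region replaced by Territory
```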
OLAP Questions
1. Query language - how to say what's
wanted.
2. Processing language - how to specify
calculations: ratios, variances, . . . .
3. Data visualization - how to see the data.
4. Performance - time to process the query
(5 second rule).
OLAP References
[Codd93] E. F. Codd, S. B. Codd, and C. T. Salley, "Providing OLAP to User Analysts: An IT Mandate," Codd & Date Inc., 1993.
[Fink95] Richard Finkelstein, "MDD: Database Reaches the Next Dimension," DATABASE Programming and Design, 8(4), April 1995.
Exploratory Data Analysis
Data Mining
• Find interesting trends or patterns in large
data sets.
• Statistics - Exploratory Data Analysis
• Artificial Intelligence - Knowledge
Discovery and Machine Learning
• Much larger data sets
Mining for Association Rules
• Classic example
• Market basket analysis
– Record each customer transaction at a grocery
store.
– Try to identify sets of items purchased together.
TransID | Item
111 | coke
111 | chips
111 | dip
112 | coke
112 | chips
112 | veggies
113 | coke
113 | beef
113 | chicken
114 | chips
114 | beef
115 | chips
115 | chicken
Association Rule: {coke} → {chips}
People who buy coke usually buy chips.
Measures for Association Rule {LHS} → {RHS}
• Support: % of transactions containing all items in LHS and RHS. (2/5 = 40%)
• Confidence: of the transactions containing the LHS items, the % that also contain the RHS. (2/3 = 67%)
Want both to be “reasonably” large.
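The two measures can be checked against the transaction table with a few lines of Python. This is an illustrative sketch only; the function names `baskets`, `support` and `confidence` are assumptions, not part of any particular data mining package.

```python
from collections import defaultdict

# (TransID, Item) pairs from the market basket table.
PURCHASES = [
    (111, "coke"), (111, "chips"), (111, "dip"),
    (112, "coke"), (112, "chips"), (112, "veggies"),
    (113, "coke"), (113, "beef"), (113, "chicken"),
    (114, "chips"), (114, "beef"),
    (115, "chips"), (115, "chicken"),
]


def baskets(purchases):
    """Group items into one set per transaction."""
    out = defaultdict(set)
    for trans_id, item in purchases:
        out[trans_id].add(item)
    return out


def support(all_baskets, items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for basket in all_baskets.values() if items <= basket)
    return hits / len(all_baskets)


def confidence(all_baskets, lhs, rhs):
    """Of the transactions containing `lhs`, the fraction also containing `rhs`."""
    return support(all_baskets, lhs | rhs) / support(all_baskets, lhs)


b = baskets(PURCHASES)
rule_support = support(b, {"coke", "chips"})          # 2 of 5 transactions
rule_confidence = confidence(b, {"coke"}, {"chips"})  # 2 of the 3 coke transactions
```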
On-Line Analytical Processing (OLAP)
Part II:
CIS 671
Elmasri & Navathe §26.1
Multi-dimensional View of Data
• Fact tables (also called cubes)
– Dimension attributes
– Dependent attributes (functions of the
dimension attributes)
• Dimension Tables, potentially one for each
dimension
OLAP Operations
• Roll-up – increase the level of aggregation
• Drill-down - decrease the level of aggregation
• Slice-and-dice - selection and projection, i.e., reduce dimensionality of the data
• Pivot – re-orient the dimensional view
Implementation Approaches
• Relational OLAP (ROLAP) Servers
– Data stored in a relational system
– SQL extended
• To allow easy OLAP query expression
• To provide efficient OLAP query execution.
• Multidimensional OLAP (MOLAP)
– Systems directly store multidimensional data in special data structures
– OLAP operations implemented directly on these data structures.
• Hybrid OLAP (HOLAP)
– Combines ROLAP and MOLAP.
– Detail records (largest volume) in relational database.
– Aggregations in a separate, but “connected”, MOLAP store.
Example of a Star Schema
Sales (Fact) table (OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice)
Dimension tables:
Order (OrderNo, OrderDate)
Customer (CustomerNo, CustomerName, CustomerAddress, City)
Salesperson (SalespersonID, SalespersonName, City, Quota)
Product (ProdNo, ProdName, ProdDescr, Category, CategoryDescr, UnitPrice, QOH)
Date (DateKey, Date, Month, Year)
City (CityName, State, Region)
Snowflake Schema
Sales (Fact) table (OrderNo, SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalPrice)
Dimension tables:
Order (OrderNo, OrderDate)
Customer (CustomerNo, CustomerName, CustomerAddress, City)
Salesperson (SalespersonID, SalespersonName, City, Quota)
Product (ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH)
Category (CategoryName, CategoryDescr)
Date (DateKey, Date, Month, Year)
Month (Month, Year)
Year (no attributes listed)
City (CityName, State, Region)
State (State, Region)
Region (no attributes listed)
Data Cubes
• Precompute all possible aggregations.
• Required extra storage is tolerable.
• Little penalty to keep aggregate up-to-date
if data does not change.
• Normally some aggregation of raw data is
done before it is entered into the data cube.
Data Cube with Orders Accumulated
Sales table (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
Customer (CustomerNo, CustomerName, CustomerAddress, City)
Salesperson (SalespersonID, SalespersonName, City, Quota)
Product (ProdNo, ProdName, ProdDescr, Category, UnitPrice, QOH)
Category (CategoryName, CategoryDescr)
Date (DateKey, Date, Month)
Month (Month, Year)
Year (no attributes listed)
City (CityName, State)
State (State, Region)
Region (no attributes listed)
Note that average for any aggregate can be calculated from TotalValue and Quantity.
Sample of Aggregates in the CUBE
Sales (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
(22, 11, 100, 2, 'Columbus', 3, 300)
CUBE(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
(22, 11, 100, 2, 'Columbus', 3, 300)
(22, *, 100, 2, 'Columbus', 6, 2222)
(22, *, *, 2, 'Columbus', 25, 33000)
(*, *, *, 2, 'Columbus', 75, 90000)
(*, *, *, *, 'Columbus', 200, 503444)
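One way to picture how CUBE(Sales) is populated: every fact row contributes to each tuple obtained by replacing any subset of its dimension values with *. The sketch below assumes that reading; the fact rows after the first are hypothetical, and `cube` is an illustrative name rather than a real DBMS operator.

```python
from collections import defaultdict
from itertools import product

DIMS = ("SalespersonID", "CustomerNo", "ProdNo", "DateKey", "CityName")

# Fact rows: five dimension values, then Quantity and TotalValue.
# Only the first row appears on the slide; the others are made up.
SALES = [
    (22, 11, 100, 2, "Columbus", 3, 300),
    (22, 12, 100, 2, "Columbus", 3, 1922),
    (23, 11, 101, 3, "Columbus", 1, 50),
]


def cube(rows, n_dims):
    """Precompute (Quantity, TotalValue) sums for every combination of '*'."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        dims, (qty, value) = row[:n_dims], row[n_dims:]
        # Each mask chooses which dimensions are replaced by '*'.
        for mask in product((False, True), repeat=n_dims):
            key = tuple("*" if starred else d for d, starred in zip(dims, mask))
            totals[key][0] += qty
            totals[key][1] += value
    return {key: tuple(sums) for key, sums in totals.items()}


CUBE_SALES = cube(SALES, len(DIMS))
```

With five dimensions each fact row feeds 2^5 = 32 cube tuples, which is the storage cost the later slides on database explosion discuss.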
How to answer query given the relation CUBE(Sales)
Choose tuples in CUBE(Sales) with the following properties:
1. Query specifies value v for attribute a
→ tuple t has v in its component for a.
2. Query groups by attribute a
→ tuple t has any non-* value in its component for a.
3. Query neither groups by attribute a nor specifies a value for a
→ tuple t has * value in its component for a.
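The three selection rules translate directly into a filter over cube tuples. A minimal sketch, assuming a small hand-written list of cube tuples (hypothetical values) and illustrative names `matches` and `answer`:

```python
DIMS = ("SalespersonID", "CustomerNo", "ProdNo", "DateKey", "CityName")

# A few CUBE(Sales) tuples: five dimensions, then Quantity and TotalValue.
# The values are hypothetical.
CUBE_SALES = [
    (22, 11, 100, 2, "Columbus", 3, 300),
    (22, 11, "*", "*", "*", 3, 300),
    (22, 12, "*", "*", "*", 3, 1922),
    (22, "*", "*", "*", "*", 6, 2222),
    (23, 11, "*", "*", "*", 1, 50),
]


def matches(t, specified, grouped):
    """Apply rules 1-3 to one cube tuple."""
    for i, attr in enumerate(DIMS):
        if attr in specified:            # rule 1: value must equal v
            if t[i] != specified[attr]:
                return False
        elif attr in grouped:            # rule 2: any non-* value
            if t[i] == "*":
                return False
        elif t[i] != "*":                # rule 3: must be *
            return False
    return True


def answer(cube, specified, grouped):
    return [t for t in cube if matches(t, specified, grouped)]


# Average value per unit, by customer, for salesperson 22: TotalValue / Quantity.
rows = answer(CUBE_SALES, {"SalespersonID": 22}, {"CustomerNo"})
avg_by_customer = {t[1]: t[6] / t[5] for t in rows}
```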
How to answer query given the relation CUBE(Sales)
Cube(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
(22, 11, 100, 2, 'Columbus', 3, 300)
(22, *, 100, 2, 'Columbus', 6, 2222)
(22, *, *, 2, 'Columbus', 25, 33000)
(*, *, *, 2, 'Columbus', 75, 90000)
(*, *, *, *, 'Columbus', 200, 503444)
select CustomerNo, avg(Price)
from Sales
where SalespersonID = 22
group by CustomerNo
Result(c, v/n), computed from the Cube(Sales) tuples of the form
(SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
(22, c, *, *, *, n, v)
Cube Implementation by
Materialized Views
• Dimensions may have hierarchies.
– Product, Category
– City, State, Region
Example: Materialized Views
Cube(Sales) (SalespersonID, CustomerNo, ProdNo, DateKey, CityName, Quantity, TotalValue)
City (CityName, State, Region)
Date (DateKey, Date, Month)
insert into SalesV1
select SalespersonID, CustomerNo, Month, State,
sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
from Sales join City on Sales.CityName = City.CityName
join Date on Sales.DateKey = Date.DateKey
group by SalespersonID, CustomerNo, Month, State;
insert into SalesV2
select SalespersonID, CustomerNo, Month, Region,
sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
from Sales join City on Sales.CityName = City.CityName
join Date on Sales.DateKey = Date.DateKey
group by SalespersonID, CustomerNo, Month, Region;
Example: Query 1
select SalespersonID, sum(TotalValue)
from Sales
group by SalespersonID;
Answer by
select SalespersonID, sum(TotalValue)
from SalesV1
group by SalespersonID;
or by
select SalespersonID, sum(TotalValue)
from SalesV2
group by SalespersonID;
Example: Query 2
select SalespersonID, State, sum(TotalValue)
from Sales
group by SalespersonID, State;
Answer only by
select SalespersonID, State, sum(TotalValue)
from SalesV1
group by SalespersonID, State;
Example: Query 3
select SalespersonID, State, Date, sum(TotalValue)
from Sales
group by SalespersonID, State, Date;
Cannot be answered by either SalesV1 or SalesV2.
Thus must use Sales itself.
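The claim that a coarser query can be answered from a materialized view can be checked on a toy database. The sketch below uses Python's sqlite3 module with hypothetical rows; SQLite has no materialized views, so `create table ... as select` stands in for one, and the "Date" table is quoted only as a precaution because the name resembles a type or function name.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table City (CityName text primary key, State text, Region text);
create table "Date" (DateKey integer primary key, Month text);
create table Sales (SalespersonID int, CustomerNo int, ProdNo int,
                    DateKey int, CityName text, Quantity int, TotalValue int);
-- Hypothetical sample data.
insert into City values ('Columbus', 'OH', 'Midwest'),
                        ('Dayton', 'OH', 'Midwest'),
                        ('Boston', 'MA', 'Northeast');
insert into "Date" values (1, '2000-01'), (2, '2000-02');
insert into Sales values (22, 11, 100, 1, 'Columbus', 3, 300),
                         (22, 12, 100, 2, 'Dayton', 1, 120),
                         (23, 11, 101, 2, 'Boston', 2, 500);
-- Stand-in for the materialized view SalesV1.
create table SalesV1 as
  select SalespersonID, CustomerNo, Month, State,
         sum(Quantity) as Quantity, sum(TotalValue) as TotalValue
  from Sales join City on Sales.CityName = City.CityName
             join "Date" on Sales.DateKey = "Date".DateKey
  group by SalespersonID, CustomerNo, Month, State;
""")

# Query 1 against the base table and against the view.
q1 = "select SalespersonID, sum(TotalValue) from {} group by SalespersonID order by SalespersonID"
q1_base = con.execute(q1.format("Sales")).fetchall()
q1_view = con.execute(q1.format("SalesV1")).fetchall()

# Query 2: the base table needs the join to City, the view already has State.
q2_base = con.execute(
    "select SalespersonID, State, sum(TotalValue) "
    "from Sales join City on Sales.CityName = City.CityName "
    "group by SalespersonID, State order by SalespersonID, State").fetchall()
q2_view = con.execute(
    "select SalespersonID, State, sum(TotalValue) from SalesV1 "
    "group by SalespersonID, State order by SalespersonID, State").fetchall()
```

Query 3 groups by individual dates, which SalesV1 has already collapsed into months, so no query over the view could recover it.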
Lattice of Views
[Figure: two dimension lattices. Time levels: All, Years, Quarters, Months, Weeks, Days. Location levels: All, Region, State, City.]
Lattice of Materialized Views and Queries
[Figure: lattice relating the queries Q1, Q2 and Q3 to the views SalesV1 and SalesV2 and the base table Sales.]
OLAP Example
Garcia-Molina, Ullman & Widom, Database System Implementation, Prentice Hall, 2000
Automobile Sales Company: analyze sales of cars
Fact Table
Sales(serialNo, date, dealer, price)
Dimension Tables
Autos(serialNo, model, color)
Dealers(name, city, state)
Days(day, week, month, year), e.g. (5, 27, 7, 2000)
(Time Dimension Table, probably not stored)
Assume a particular car model, say 'Gobi', is not selling as well as anticipated.
How to analyze?
Maybe it's the color. Slice for 'Gobi'. Dice for color.
select color, sum(price)
from Sales natural join Autos
where model = 'Gobi'
group by color;
Doesn't show anything interesting.
Gobi analysis, continuing
What about time?
Drill down for month.
select color, month, sum(price)
from Sales natural join Autos
join Days on date = day
where model = 'Gobi'
group by color, month;
Suppose we discover red Gobis have not sold well recently.
Gobi analysis, continuing
Are red Gobis selling poorly for all dealers or just some?
Drill down for dealer.
select dealer, month, sum(price)
from Sales natural join Autos
join Days on date = day
where model = 'Gobi' and color = 'red'
group by dealer, month;
Discover there are too few sales to show anything interesting.
Gobi analysis, continuing
Rollup time from month to year and slice for last two years.
select dealer, year, sum(price)
from Sales natural join Autos
join Days on date = day
where model = 'Gobi' and color = 'red'
and (year = 1999 or year = 2000)
group by dealer, year;
Does show variation. Now understand the problem better.
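The first and last steps of the Gobi analysis can be tried on a toy database. This sketch uses Python's sqlite3 module; all rows are hypothetical, and `day` is assumed to hold the same date string as `Sales.date` so that the join in the slides works.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table Sales (serialNo int, date text, dealer text, price int);
create table Autos (serialNo int, model text, color text);
create table Days  (day text, week int, month int, year int);
-- Hypothetical sample data.
insert into Days values ('1999-03-01', 9, 3, 1999), ('2000-07-05', 27, 7, 2000);
insert into Autos values (1, 'Gobi', 'red'), (2, 'Gobi', 'red'),
                         (3, 'Gobi', 'blue'), (4, 'Sahara', 'red'),
                         (5, 'Gobi', 'red');
insert into Sales values (1, '1999-03-01', 'Smith', 20000),
                         (2, '2000-07-05', 'Smith', 18000),
                         (3, '2000-07-05', 'Jones', 21000),
                         (4, '1999-03-01', 'Jones', 30000),
                         (5, '2000-07-05', 'Jones', 19000);
""")

# Slice for 'Gobi', dice by color.
by_color = con.execute("""
    select color, sum(price)
    from Sales natural join Autos
    where model = 'Gobi'
    group by color order by color
""").fetchall()

# Red Gobis by dealer, time rolled up to year, sliced to 1999 and 2000.
by_dealer_year = con.execute("""
    select dealer, year, sum(price)
    from Sales natural join Autos
         join Days on date = day
    where model = 'Gobi' and color = 'red'
      and (year = 1999 or year = 2000)
    group by dealer, year order by dealer, year
""").fetchall()
```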
Administration
• Lab assignments and HWs posted on the web.
• Clarifications/Questions?
• Please use appropriate online submit command
• Teams of 2 allowed but make contribution of each team member explicit especially in the lab assignment.
• Extra Credit assignment in lab.
• Bring questions to class on Thursday
Color codes: meaning → tuple representation (time in quarters, product, country, Tsales)
time, product, country are dimension attributes; Tsales is total sales.
• White squares (basic fact table) → (q, p, c, sales)
• Green squares: total annual sales grouped by product and country → (*, p, c, Tsales)
• Dark green squares: total annual sales grouped by product → (*, p, *, Tsales)
• Orange squares: total annual sales grouped by quarter and country → (q, *, c, Tsales)
• Dark orange squares: total annual sales grouped by quarter → (q, *, *, Tsales)
• Grey: total annual sales grouped by country → (*, *, c, Tsales)
• Other pair (quarter and product) not shown (need to pivot) → (q, p, *, Tsales)
• Dark blue (all sales) → (*, *, *, sales)
• Size of white cube = Q × P × C; size of colored cube = (Q+1) × (P+1) × (C+1). Why? Think of * as another category along each dimension.
• Size of colored cube with hierarchy → even larger!
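The size comparison is easy to check numerically. In this small sketch the dimension sizes (4 quarters, 10 products, 5 countries) are made-up values and the function names are illustrative.

```python
from math import prod


def base_cells(sizes):
    """Cells in the basic fact cube: Q x P x C."""
    return prod(sizes)


def cube_cells(sizes):
    """Cells once '*' is added as one extra value along every dimension."""
    return prod(size + 1 for size in sizes)


sizes = (4, 10, 5)                 # hypothetical Q, P, C
white = base_cells(sizes)          # 4 * 10 * 5
colored = cube_cells(sizes)        # 5 * 11 * 6
aggregate_cells = colored - white  # cells holding precomputed totals
```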
Aggregation causes Database Explosion in Large Multi-dimensional Applications as the Number of Dimensions Increases
Based on Nigel Pendse, “Database Explosion”, www.olapreport.com/DatabaseExplosion.htm
Factors not causing data explosion
• Poor handling of data sparsity.
– No more than factor of 4 vs. factors of 10s or 100s
• Type of database technology.
– Although optimized storage technology will be significantly
better.
• Lack of data compression.
– Compression is helpful, but explosion still occurs.
• Software errors
– Again, a different problem.
Multi-dimensional Database (MDB) can save significant space
• Keys, indexes & dimensional structures.
– Not required or take far less space.
• Sparsity better suppressed.
• Data compressed.
• Example:
– 6-dimensional (including measures) banking cube
– 13 million row fact table
– Relational fact table incl. indexes, but not aggregates: 5188 Mb
– MOLAP cube including aggregations: 336 Mb
– Well under 10% the space.
– Much faster query processing.
Why is there a data explosion even without sparsity?
• Take two dimensional example
• n: data from original source.
• m: data aggregations precalculated.
• p: on-the-fly results, not stored.
[Figure: nested squares with sides n, n+m and n+m+p]
Cell counts: n^2, (n+m)^2, (n+m+p)^2
Simplifying to n=m=p: 1n^2, 4n^2, 9n^2
In 3 dimensions this becomes: 1n^3, 8n^3, 27n^3
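The same counts for any number of dimensions follow from raising the side lengths to the power d. A short sketch; `cell_counts` is an illustrative name.

```python
def cell_counts(n, m, p, dims):
    """Cells in (source only, source + precalculated, everything) for `dims` dimensions."""
    return (n ** dims, (n + m) ** dims, (n + m + p) ** dims)


# With n = m = p = 1 the results are the multipliers of n^dims.
two_d = cell_counts(1, 1, 1, 2)
three_d = cell_counts(1, 1, 1, 3)
six_d = cell_counts(1, 1, 1, 6)
```

Even with dense data, the stored portion (n+m)^d grows as 2^d times the source data when m = n, so each added dimension doubles it.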
When Data is Sparse it’s much worse.
• One-dimensional data.
• Simple hierarchy. Black - actual data, red - nulls.
• Detailed level: 8 of 25 or 32%.
• Aggregated levels: 5 of 6 or 83%.
• Growth factor: 1.625 (13 cells based on 8 input cells)
Two dimensions: The problem gets worse
• Potential input cells (detail data): 25*25 = 625
• Potential aggregated cells (aggregated data): 6*6 + 6*25 + 6*25 = 336
• More than 1 derived cell for every 2 possible input cells.
• In 6 dimensions, could have 2 or 3 derived cells per 1 input cell.
What about higher dimensions?
• One percent density, 6 of 625 input cells.
• Yields 29 computed cells.
• I.e., 35 total cells, only 6 input.
• Growth factor: 5.83.
• Growth factor per dimension: sqrt(5.83) = 2.4.
– Called compound growth factor (CGF).
• CGF is typically in the range 1.5 to 2.5.
• CGF increases as sparsity increases.
• With large dimensions, will often be more consolidation.
– (Many thousands of products imply more levels of grouping.)
• With CGF of 2.0, extra dimension with no increase in input data, will double size of fully computed database.
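The growth-factor arithmetic can be reproduced as follows. This is a sketch under the definitions given in the bullets; the function names and the 1000-cell input used in the projection are assumptions for illustration.

```python
def growth_factor(input_cells, computed_cells):
    """Total cells (input plus computed) per input cell."""
    return (input_cells + computed_cells) / input_cells


def compound_growth_factor(gf, dims):
    """Per-dimension growth: the dims-th root of the overall growth factor."""
    return gf ** (1 / dims)


def projected_cells(input_cells, cgf, dims):
    """Size of the fully computed database for a given CGF."""
    return input_cells * cgf ** dims


gf_1d = growth_factor(8, 5)            # one-dimensional example: 13 / 8
gf_2d = growth_factor(6, 29)           # two-dimensional example: 35 / 6
cgf_2d = compound_growth_factor(gf_2d, 2)

# With a CGF of 2.0, going from 6 to 7 dimensions doubles the total.
six = projected_cells(1000, 2.0, 6)
seven = projected_cells(1000, 2.0, 7)
```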
So what is the problem?
• Disk space increases.
• Can software handle this much data?
• Time to load and update database increases.
– Could take days to load the database.
What to do?
• Avoid fully pre-calculating any multidimensional object with more than 5 sparse
dimensions.
• Reduce sparsity of individual data objects:
– Use good application design.
What to pre-calculate?
• Data that is slow to calculate at run-time because it depends
on many other cells or complex formulae.
• Data that is frequently viewed.
• Data that is the basis of many other calculations.
• Note: If too much is precalculated, performance may decrease
because cache will not include as much useful data.