Transcript Bird

D
A
T
A
B
A
S
E
Data Warehouse
Predefined
reports
Interactive
data analysis
Operations
data
Daily data
transfer
OLTP Database
3NF tables
Data warehouse
Star configuration
Flat files
1
D
A
T
A
B
A
S
E
Data Warehouse Goals
 Existing databases optimized for Online Transaction
Processing (OLTP)
 Online Analytical Processing (OLAP) requires fast retrievals,
and only bulk writes.
 Different goals require different storage, so build separate dta
warehouse to use for queries.
 Extraction, Transformation, Transportation (ETT)
 Data analysis
 Ad hoc queries
 Statistical analysis
 Data mining (specialized automated tools)
2
D
A
T
A
B
A
S
E
Extraction, Transformation, and
Transportation (ETT)
Customers
Convert Client
to Customer
Apply standard
product numbers
Convert
currencies
Fix region codes
Transaction data
from diverse
systems.
Data warehouse:
All data must be
consistent.
3
D
A
T
A
B
A
S
E
OLTP v. OLAP
Category
Data storage
Indexes
Joins
Duplicated data
Updates
Queries
OLTP
3NF tables
Few
Many
Normalized,
limited duplication
Constant, small data
Specific
OLAP
Multidimensional cubes
Many
Minimal
Denormalized DBMS
Overnight, bulk
Ad hoc
4
D
A
T
A
B
A
S
E
Multidimensional Cube
Pet Store
Item Sales
Amount = Quantity*Sale Price
Customer
Location
5
D
A
T
A
B
A
S
E
Sales Date: Time Hierarchy
Year
Levels
Quarter
Roll-up
To get higher-level totals
Month
Week
Drill-down
To get lower-level details
Day
6
D
A
T
A
B
A
S
E
Star Design
Dimension Tables
Products
Sales Date
Fact Table
Sales
Quantity
Amount=SalePrice*Quantity
Customer
Location
7
D
A
T
A
B
A
S
E
City
Snowflake Design
Merchandise
Sale
ItemID
Description
QuantityOnHand
ListPrice
Category
SaleID
SaleDate
EmployeeID
CustomerID
SalesTax
OLAPItems
SaleID
ItemID
Quantity
SalePrice
Amount
CityID
ZipCode
City
State
Customer
CustomerID
Phone
FirstName
LastName
Address
ZipCode
CityID
Dimension tables can join to
other dimension tables.
8
D
A
T
A
B
A
S
E
OLAP Computation Issues
Quantity
3
2
5
Price
5.00
4.00
9.00
Quantity*Price
15.00
8.00
45.00 or 23.00
Compute Quantity*Price in base query, then add to get $23.00
If you use Calculated Measure in the Cube, it will add first and
multiply second to get $45.00, which is wrong.
9
D
A
T
A
B
A
S
E
OLAP Data Browsing
10
D
A
T
A
B
A
S
E
Microsoft Excel Pivot Table
11
D
A
T
A
B
A
S
E
Excel Pivot Table Reports
Quarter
Month
Quarter 1 Quarter 2 Quarter 3 Quarter 4 Grand Total
LastName EmployeeIDData
Carpenter
8 Sum of Animal
1,668.91
Sum of Merchandise
324.90
Eaton
6 Sum of Animal
522.37
Sum of Merchandise
30.60
Farris
7 Sum of Animal
5,043.36
Sum of Merchandise
826.92
Gibson
2 Sum of Animal
4,983.51
Sum of Merchandise
668.25
Hopkins
4 Sum of Animal
3,747.96
Sum of Merchandise
476.91
James
5 Sum of Animal
3,282.77
Sum of Merchandise
505.89
O'Connor
9 Sum of Animal
2,643.69
Sum of Merchandise
263.70
Reasoner
3 Sum of Animal
4,577.43
Sum of Merchandise
762.30
Reeves
1 Sum of Animal
1,120.93
Sum of Merchandise
263.88
Shields
10 Sum of Animal
1,008.76
Sum of Merchandise
62.10
Total Sum of Animal
28,599.69
Total Sum of Merchandise
4,185.45
606.97
78.30
426.39
99.00
341.85
54.90
1,059.70
188.10
1,549.83
238.50
1,194.88
252.90
2,373.08
693.45
180.91
83.70
625.74
89.10
372.65
121.50
437.88
99.00
510.12
55.80
589.68
116.80
7,591.11
1,624.05
162.15
22.50
2,840.72
569.50
7.20
128.70
562.50
107.10
796.47
306.00
2,556.10
450.90
128.41
7.20
150.11
99.00
2,500.24
396.90
6,701.03
1,495.80
2,709.47
630.90
1,426.72
192.60
6,899.53
1,321.02
9,089.44
1,357.65
5,443.90
858.51
6,243.84
1,397.34
3,334.72
403.20
8,293.09
1,365.10
1,120.93
263.88
1,170.91
84.60
45,732.55
7,874.80
Can place data in rows or columns.
By grouping months, can instantly get quarterly or monthly totals.
12
D
A
T
A
B
A
S
E
Category Month Amount
OLAP in SQL 99
GROUP BY two columns
Gives you totals for each
month within each category.
You do not get superaggregate totals for the
category, or the month, or the
overall total.
Bird
1
$135.00
Bird
2
$45.00
Bird
3
$202.50
Bird
6
$67.50
Bird
7
$90.00
Bird
9
$67.50
Cat
1
$396.00
Cat
2
$113.85
Cat
3
$443.70
Cat
4
$2.25
SELECT Category, Month(SaleDate) AS Month,
Sum(Quantity*SalePrice) AS Amount
FROM Sale INNER JOIN (Merchandise INNER JOIN SaleItem
ON Merchandise.ItemID = SaleItem.ItemID)
ON Sale.SaleID = SaleItem.SaleID
GROUP BY Category, Month(SaleDate);
13
D
A
T
A
B
A
S
E
SQL ROLLUP
SELECT Category, Month…, Sum …
FROM …
GROUP BY ROLLUP (Category, Month...)
Category
Month
Amount
Bird
Bird
…
Bird
Cat
Cat
…
Cat
…
(null)
1
2
135.00
45.00
(null)
1
2
607.50
396.00
113.85
(null)
1293.30
(null)
8451.79
14
D
A
T
A
B
A
S
E
Missing Values Cause Problems
If there are missing values in the groups, it can be difficult
to identify the super-aggregate rows.
Category
Month
Amount
Bird
Bird
…
Bird
Bird
Cat
Cat
…
Cat
…
(null)
1
2
135.00
45.00
(null)
(null)
1
2
32.00
607.50
396.00
113.85
(null)
1293.30
(null)
8451.79
Missing date
Super-aggregate
15
D
A
T
A
B
A
S
E
GROUPING Function
SELECT Category, Month…, Sum …,
GROUPING (Category) AS Gc,
GROUPING (Month) AS Gm
FROM …
GROUP BY ROLLUP (Category, Month...)
Category Month Amount
Bird
1
135.00
Bird
2
45.00
…
Bird
(null)
32.00
Bird
(null)
607.50
Cat
1
396.00
Cat
2
113.85
…
Cat
(null)
1293.30
…
(null) (null)
8451.79
Gc
0
0
Gm
0
0
0
1
0
0
0
0
0
0
1
0
1
1
16
D
A
T
A
B
A
S
E
CUBE Option
SELECT Category, Month, Sum, GROUPING (Category) AS Gc,
GROUPING (Month) AS Gm
FROM …
GROUP BY CUBE (Category, Month...)
Category Month
Bird
1
Bird
2
…
Bird
(null)
Bird
(null)
Cat
1
Cat
2
…
Cat
(null)
(null)
1
(null)
2
(null)
3
…
(null) (null)
Amount
135.00
45.00
Gc
0
0
Gm
0
0
32.00
607.50
396.00
113.85
0
1
0
0
0
0
0
0
1293.30
1358.8
1508.94
2362.68
1
0
0
0
0
1
1
1
8451.79
1
1
17
D
A
T
A
B
A
S
E
GROUPING SETS: Hiding Details
SELECT Category, Month, Sum
FROM …
GROUP BY GROUPING SETS
( ROLLUP (Category),
ROLLUP (Month),
()
)
Category Month
Bird
(null)
Cat
(null)
…
(null)
1
(null)
2
(null)
3
…
(null) (null)
Amount
607.50
1293.30
1358.8
1508.94
2362.68
8451.79
18
D
A
T
A
B
A
S
E
SQL OLAP Analytical Functions
VAR_POP
VAR_SAMP
STDDEV_POP
STDEV_SAMP
COVAR_POP
COVAR_SAMP
CORR
REGR_R2
REGR_SLOPE
REGR_INTERCEPT
variance
standard deviation
covariance
correlation
regression r-square
regression data (many)
19
D
A
T
A
B
A
S
E
SQL RANK Functions
SELECT Employee, SalesValue
RANK() OVER (ORDER BY SalesValue DESC) AS rank
DENSE_RANK() OVER (ORDER BY SalesValue DESC) AS dense
FROM Sales
ORDER BY SalesValue DESC, Employee;
Employee
SalesValue
rank
dense
Jones
18,000
1
1
Smith
16,000
2
2
Black
16,000
2
2
White
14,000
4
3
DENSE_RANK
does not skip
numbers
20
D
A
T
A
B
A
S
E
SQL OLAP Windows
SELECT Category, SaleMonth, MonthAmount,
AVG(MonthAmount)
OVER (PARTITION BY Category
ORDER BY SaleMonth ASC ROWS 2 PRECEDING)
AS MA
FROM qryOLAPSQL99
ORDER BY SaleMonth ASC;
Category
Bird
Bird
Bird
Bird
…
Cat
Cat
Cat
Cat
…
SaleMonth
200101
200102
200103
200104
MonthAmount
1500.00
1700.00
2000.00
2500.00
200101
200102
200103
200104
4000.00
5000.00
6000.00
7000.00
MA
1600.00
1850.00
4500.00
5500.00
21
D
A
T
A
B
A
S
E
Ranges: OVER
SELECT SaleDate, Value
SUM(Value) OVER (ORDER BY SaleDate) AS running_sum,
SUM(Value) OVER (ORDER BY SaleDate RANGE
BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW) AS running_sum2,
SUM (Value) OVER (ORDER BY SaleDate RANGE
BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS remaining_sum;
FROM …
Sum1 computes total from beginning through current row.
Sum2 does the same thing, but more explicitly lists the rows.
Sum3 computes total from current row through end of query.
22
D
A
T
A
B
A
S
E
LAG and LEAD Functions
LAG or LEAD: (Column, # rows, default)
SELECT SaleDate, Value,
LAG (Value 1,0) OVER (ORDER BY SaleDate) AS prior_day
LEAD (Value 1, 0) OVER (ORDER BY SaleDate) AS next_day
FROM …
ORDER BY SaleDate
SaleDate
1/1/2003
1/2/2003
1/3/2003
…
1/31/2003
Value prior_day
1000
0
1500
1000
2000
1500
3500
3200
next_day
1500
2000
2300
Prior is 0 from
default value
0
Not part of standard yet? But are in SQL Server and Oracle.
23
D
A
T
A
B
A
S
E
Data Mining
 Goal: To discover unknown relationships in the data
that can be used to make better decisions.
Transactions and operations
Reports
Specific ad hoc questions
Queries
Aggregate, compare, drill down
OLAP
Databases
Unknown relationships
Data Mining
24
D
A
T
A
B
A
S
E
Exploratory Analysis
 Data Mining usually works autonomously.
 Supervised/directed
 Unsupervised
 Often called a bottom-up approach that scans the data to
find relationships
 Some statistical routines, but they are not sufficient
 Statistics relies on averages
 Sometimes the important data lies in more detailed pairs
25
D
A
T
A
B
A
S
E
Common Techniques
 Classification/Prediction/Regression
 Association Rules/Market Basket Analysis
 Clustering
 Data points
 Hierarchies
 Neural Networks
 Deviation Detection
 Sequential Analysis
 Time series events
 Websites
 Textual Analysis
 Spatial/Geographic Analysis
26
D
A
T
A
B
A
S
E
Classification Examples
 Examples
 Which borrowers/loans are most likely to be successful?
 Which customers are most likely to want a new item?
 Which companies are likely to file bankruptcy?
 Which workers are likely to quit in the next six months?
 Which startup companies are likely to succeed?
 Which tax returns are fraudulent?
27
D
A
T
A
B
A
S
E
Classification Process
 Clearly identify the outcome/dependent variable.
 Identify potential variables that might affect the outcome.
 Supervised (modeler chooses)
 Unsupervised (system scans all/most)
 Use sample data to test and validate the model.
 System creates weights that link independent variables to
outcome.
Income Married Credit History Job Stability Success
50000
Yes
Good
Good
Yes
25000
Yes
Bad
Bad
No
75000
No
Good
Good
No
28
D
A
T
A
B
A
S
E
Classification Techniques
 Regression
 Bayesian Networks
 Decision Trees (hierarchical)
 Neural Networks
 Genetic Algorithms
 Complications
 Some methods require categorical data
 Data size is still a problem
29
D
A
T
A
B
A
S
E
Association/Market Basket
 Examples
 What items are customers likely to buy together?
 What Web pages are closely related?
 Others?
 Classic (early) example:
 Analysis of convenience store data showed customers often buy
diapers and beer together.
 Importance: Consider putting the two together to increase crossselling.
30
D
A
T
A
B
A
S
E
Association Details (two items)
 Rule evaluation (A implies B)
 Support for the rule is measured by the percentage of all
transactions containing both items: P(A ∩ B)
 Confidence of the rule is measured by the transactions with A that
also contain B: P(B | A)
 Lift is the potential gain attributed to the rule—the effect compared
to other baskets without the effect. If it is greater than 1, the effect
is positive:
 P(A ∩ B) / ( P(A) P(B) )
 P(B|A)/P(B)
 Example: Diapers implies Beer
 Support: P(D ∩ B) = .6
 Confidence: P(B|D) = .857
 Lift: P(B|D) / P(B) = 1.714
P(D) = .7
P(B) = .5
= P(D ∩ B)/P(D) = .6/.7
= .857 / .5
31
D
A
T
A
B
A
S
E
Association Challenges
 If an item is rarely purchased, any other item bought with it
seems important. So combine items into categories.
Item
Freq.
Item
Freq.
1 “ nails
2%
Hardware
15%
2” nails
1%
Dim. Lumber
20%
3” nails
1%
Plywood
15%
4” nails
2%
Finish lumber 15%
Lumber
50%
 Some relationships are obvious.
 Burger and fries.
 Some relationships are meaningless.
 Hardware store found that toilet rings sell well only when a new
store first opens. But what does it mean?
32
D
A
T
A
B
A
S
E
Cluster Analysis
 Examples
 Are there groups of customers? (If so, we can cross-sell.)
 Do the locations for our stores have elements in common? (So we
can search for similar clusters for new locations.)
 Do our employees (by department?) have common characteristics?
(So we can hire similar, or dissimilar, people.)
 Problem: Many dimensions and large datasets
Large
intercluster
distance
Small
intracluster
distance
33
D
A
T
A
B
A
S
E
Geographic/Location
 Examples
 Customer location and sales comparisons
 Factory sites and cost
 Environmental effects
 Challenge: Map data, multiple overlays
34