Data Warehouse Tutorial

Download Report

Transcript Data Warehouse Tutorial

Data Warehouse [ Example ]
J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann, 2001, ISBN 1558604898
Data Mining: Concepts and Techniques
1
OLTP
• it design for optimal transaction
Data Mining: Concepts and Techniques
2
OLAP
• It design to give overview analysis
of what happened!
• It is uses to built report answer the following :
– Q1: who the supervisor that gave most discount?
– Q2:in which Zip code did product a sell the most?
To answer the questions, OLAP Cube are created.
Data Mining: Concepts and Techniques
3
Example
• Assume we have made a record of the weather
conditions during a two-week period, along with the
decisions of a tennis player whether or not to play tennis
on each particular day.
• We have values of four independent variables (outlook,
temperature, humidity, windy) and one dependent
variable (play)
Consider our data stored in a relational table as follows:
Data Mining: Concepts and Techniques
4
Example (cont.)
Day
outlook
Temperature
humidity
windy
Play
1
Sunny
85
85
false
No
2
Sunny
80
90
true
no
3
overcast
83
86
false
Yes
4
Rainy
70
96
false
Yes
5
Rainy
68
80
false
Yes
6
Rainy
65
70
true
No
7
overcast
64
65
true
yes
8
Sunny
72
95
false
no
9
Sunny
69
70
false
yes
10
Rainy
75
80
false
yes
11
Sunny
75
70
true
yes
12
overcast
72
90
true
yes
13
overcast
81
75
false
yes
14
Rainy
71
91
true
no
Data Mining: Concepts and Techniques
5
Example (cont.)
• By querying a DBMS containing the above table we may
answer questions like:
 What was the temperature in the sunny days?
{85, 80, 72, 69, 75}
 Which days the humidity was less than 75?
{6, 7, 9, 11}
 Which days the temperature was greater than 70?
{1, 2, 3, 8, 10, 11, 12, 13, 14}
 Which days the temperature was greater than 70 and the humidity
was less than 75? The intersection of the above two:
{11}
Data Mining: Concepts and Techniques
6
Example (cont.)
• OLAP: Using OLAP we can create a Multidimensional
Model of our data (Data Cube).
For example using the dimensions: time, outlook and
play we can create the following model.
Data Mining: Concepts and Techniques
7
Example (cont.)
Yes/No
sunny
rainy
overcast
Week 1
0/2
2/1
2/0
Week 2
2/1
1/1
2/0
Obviously here time represents the days grouped in weeks (week 1 - days 1, 2, 3, 4,
5, 6, 7; week 2 - days 8, 9, 10, 11, 12, 13, 14) over the vertical axis. The outlook is
shown along the horizontal axis and the third dimension play is shown in each
individual cell as a pair of values corresponding to the two values along this
dimension - yes / no.
Thus in the upper left corner of the cube we have the total over all weeks and all
outlook values.
Data Mining: Concepts and Techniques
8
Example (cont.)
• By apply "Drill-down" to our data cube over the time
dimension.
• This assumes the existence of a concept hierarchy
for this attribute. We can show this as a horizontal
tree as follows:
Data Mining: Concepts and Techniques
9
Example (cont.)
Time
week1
Data Mining: Concepts and Techniques
week2
day1
day1
day2
day2
day3
day3
day4
day4
day5
day5
day6
day6
day7
day7
10
Example (cont.)
• The drill-down operation is based on climbing down
the concept hierarchy, so that we get the following
data cube:
Yes/ No
1
2
3
4
5
6
7
8
9
10
11
12
13
Data Mining: Concepts and Techniques
sunny
rainy
overcast
0/1
0/1
0/0
0/0
0/0
0/0
0/0
0/1
1/0
0/0
1/0
0/0
0/0
0/0
0/0
0/0
1/0
1/0
0/1
0/0
0/0
0/0
1/0
0/0
0/0
0/0
0/0
0/0
1/0
0/0
0/0
0/0
1/0
0/0
0/0
0/0
0/0
1/0
1/0
11
Multidimensional data model
By using same example and change some values:
• play has just two values - yes and no, it can replace them by 1
and 0
 This will allows us to add up values and thus get the total number of
days when tennis was played and at the same time the number of
days tennis was not played
• Rename the day attribute into time, which is more general
and will allow us to use other time units (e.g. weeks).
Thus we get the following relational table:
Data Mining: Concepts and Techniques
12
Multidimensional data model (cont.)
time
outlook
temperature
humidity
windy
play
1
sunny
85
85
false
0
2
sunny
80
90
true
0
3
overcast
83
86
false
1
4
rainy
70
96
false
1
5
rainy
68
80
false
1
6
rainy
65
70
true
0
7
overcast
64
65
true
1
8
sunny
72
95
false
0
9
sunny
69
70
false
1
10
rainy
75
80
false
1
11
sunny
75
70
true
1
12
overcast
72
90
true
1
13
overcast
81
75
false
1
14
rainy
71
91
true
0
Data Mining: Concepts and Techniques
13
Concept hierarchies
1- attributes day, temperature and humidity we can
group values in subsets and name these subsets as
following :
Day:
all
______|_________
|
week 1
_____|_____
| | | | | | |
1 2 3 4 5 6 7
Data Mining: Concepts and Techniques
|
week 2
_______|_______
| | | | | | |
8 9 10 11 12 13 14
14
Concept hierarchies (cont.)
Temperature:
all
____________|_____________
|
|
|
hot
mild
cool
_ |___
__|____
___|____
| | | | | | | |
| | | |
80 81 83 85 70 71 72 75
64 65 68 69
Data Mining: Concepts and Techniques
15
Concept hierarchies (cont.)
Humidity:
all
___|___________
|
|
high
normal
______|_______
___|____
| | | | | |
| | | |
85 86 90 91 95 96
65 70 75 80
Data Mining: Concepts and Techniques
16
Concept hierarchies (cont.)
• We may also extend the sets of numbers or replace
them with intervals, which will make the hierarchy
complete (covering all possible values). For example,
humidity may look like this:
all
____|____
|
|
high
normal
|
|
[85,96]
[65,84]
Data Mining: Concepts and Techniques
17
Concept hierarchies (cont.)
2- For the nominal (non numeric) attributes outlook
and windy we define one-level hierarchies, as their
values cannot be ordered or grouped.
outlook:
all
_______|________
|
|
|
sunny rainy
overcast
Data Mining: Concepts and Techniques
18
Concept hierarchies (cont.)
windy:
all
___|____
|
|
true
false
Data Mining: Concepts and Techniques
19
Data cube
• The number of dimensions define the total number of data
cubes that can be created.
number of elements is 2N elements; N is an number attributes
Data Mining: Concepts and Techniques
20
Data cube (cont.)
To create a data cube we have to:
1- Select dimensions, that is select a subset of attributes.
 For example, select time and temperature. Thus we will create a twodimensional data cube.
2- Select levels in the concept hierarchies.
 For example, let us select weeks for time and degrees for
temperature.
3- Select a measure to populate the cube. This is the attribute
whose values will be aggregated across the dimensions
(obviously it has to be numeric).
 For example, Let us select play.
Data Mining: Concepts and Techniques
21
Data cube (cont.)
• By placing the time values in the rows and the
temperature values in the columns we get the
following cube:
64
65
68
69
70
71
72
75
80
81
83
85
Week1
1
0
1
0
1
0
0
0
0
0
1
0
week2
0
0
0
1
0
0
1
2
0
1
0
0
The numbers in the internal cells are obtained by adding up the values of the play
attribute, where the time and the temperature attribute are equal to the values in
the corresponding row and column
• For example the value 2 (row 2, column 8) means that tennis was played two days
during week 2 when the temperature was 75.
Data Mining: Concepts and Techniques
22
OLAP operations
Rollup:
• assume we want to change the level that we selected
for the temperature hierarchy to the intermediate
level (hot, mild, cool).
• Roll up produces the following cube:
cool
mild
hot
week 1
2
1
1
week 2
1
3
1
Data Mining: Concepts and Techniques
23
OLAP operations (cont.)
Drill-down
• the drill down of the pervious data cube over the
time dimension produces the following:
Data Mining: Concepts and Techniques
24
OLAP operations (cont.)
cool
mild
hot
day 1
0
0
0
day 2
0
0
0
day 3
0
0
1
day 4
0
1
0
day 5
1
0
0
day 6
0
0
0
day 7
1
0
0
day 8
0
0
0
day 9
1
0
0
day 10
0
1
0
day 11
0
1
0
day 12
0
1
0
day 13
0
0
1
day 14
0
0
0
Data Mining: Concepts and Techniques
25
Lattice of cubes, slice and dice operations
Lattice : there are five dimension: Time, outlook,
temperature, humidity, windy.
Data Mining: Concepts and Techniques
26
Lattice of cubes, slice and dice operations
(cont.)
 0-D (apex) cuboids : { all}
 1-D cuboids:{ Time}, {Outlook}, {Temperature}, {Humidity}, { Windy}
 2-D cuboids:
{
{Time, Outlook}, {Time, Temperature}, { Time, Humidity}, {Time,
Windy},
{Outlook, Temperature}, {Outlook, Humidity}, {Outlook, Windy},
{Temperature, Humidity}, { Temperature, Windy},
{Humidity, Windy}
}
Data Mining: Concepts and Techniques
27
Lattice of cubes, slice and dice operations
(cont.)
 3- D Cuboids : {
{ Time, Outlook, Temperature}, {Time, Outlook, Humidity},
{Time, Outlook, Windy},
{Time, Temperature, Humidity}, {Time, Temperature, Windy},
{Time, Humidity, Windy}
{Outlook, Temperature, Humidity}, { Outlook, Temperature,
Windy},
{Outlook, Humidity, Windy}
{Temperature, Humidity, Windy}
}
Data Mining: Concepts and Techniques
28
Lattice of cubes, slice and dice operations
(cont.)
 4-D cuboids: {
{ Time, Outlook, Temperature, Humidity},
{Time, Outlook, Temperature, Windy},
{Time, Outlook, Humidity, Windy},
{Time, Temperature, Humidity, Windy}
{Outlook, Temperature, Humidity, Windy}
}
 5- D cuboids { Time, Outlook, Temperature, Humidity, Windy}
Data Mining: Concepts and Techniques
29
Lattice of cubes, slice and dice operations
(cont.)
• There are two other OLAP operations that are related
to the selection of a cube - slice and dice.
Slice : performs a selection on one dimension of the
given cube, thus resulting in a subcube.
 For example, if we make the selection (temperature=cool) we will
reduce the dimensions of the cube from two to one, resulting in
just a single column from the pervious tables. So, the result will be
as following:
Data Mining: Concepts and Techniques
30
Lattice of cubes, slice and dice operations
(cont.)
Cool
Data Mining: Concepts and Techniques
day 1
0
day 2
0
day 3
0
day 4
0
day 5
1
day 6
0
day 7
1
day 8
0
day 9
1
day 10
0
day 11
0
day 12
0
day 13
0
day 14
0
31
Lattice of cubes, slice and dice operations
(cont.)
• The dice operation works similarly and performs a
selection on two or more dimensions.
 For example, applying the selection (time = day 3 OR time
= day 4) AND (temperature = cool OR temperature = hot)
to the original cube we get the following subcube (still
two-dimensional):
Cool
Hot
day 3
0
1
day 4
0
0
Data Mining: Concepts and Techniques
32
The End
Data Mining: Concepts and Techniques
33