Reducing the Response Time for Data Warehouse Queries Using
Download
Report
Transcript Reducing the Response Time for Data Warehouse Queries Using
Reducing the Response Time for Data
Warehouse Queries Using Rough Set
Theory
By
Mahmoud Mohamed Al-Bouraie
Yasser Fouad Mahmoud Hassan
Wesam Fathy Jasser
Outline
•
•
•
•
•
•
•
•
Aim
Pre-grouping Transformation
Hierarchical pre-grouping
Attribute Selection
A Heuristic Algorithm for Attribute Selection
Worked Example
Applications
Conclusion
Aim
• To reach to the optimization case.
• This case is happened when the response time for
processing a query become small as possible using
– Pre-grouping transformation and
– The size of any database become small as possible
(using rough set theory).
Pre-grouping Transformation (1)
• It dependents on some
concepts like star
schema structure as
shown in the figure it
contents of fact table
surrounded by
dimension tables. The
relationships between
them is 1: N.
Pre-grouping Transformation (2)
•Another concept that that this transformation dependents
on is hierarchical clustering of data.
•It is based on the idea that the hierarchy of one dimension
are encoded into hierarchical surrogates used in the fact
table.
•There is a compact representation of the hierarchy path
of a dimension member making it possible to use
hierarchy on the fact table without requiring residual joins.
The figure shows an example
Hierarchical pre-grouping
• We assume that the DBMS has information about
hierarchical relationships of the dimension attributes.
• We group on the highest hierarchy level to reduces
the number of resulting groups.
• The groups of the pre-grouping operation are joined
with the dimension tables, in order to get the values
for the grouping attributes.
Attribute Selection
• Depending on rough set theory, a database always
contains a lot of attributes that are redundant.
• To eliminate these redundant attributes we use
attribute selection that used to find an optimal subset
of attributes in a database according to some criterion,
so that a classifier with the highest possible accuracy
can be induced by learning algorithm using
information about data available only from the subset
of attributes .
A Heuristic Algorithm
for Attribute Selection
• Let R be a set of the selected attributes, P be the set of
unselected condition attributes, U be the set of all
objects, X be the set of contradictory objects, Va denotes
the attribute a values and EXPECT be the threshold of
accuracy.
• In the initial state, R = CORE(C),
P C CORE (C ), X U POS R (D)
k = 0.
Attribute Selection using RSH (1)
• Step 1. If k >= EXPECT, finish, otherwise calculate
the dependency degree, k,
| POS R ( D) |
k
.
|U |
• Step 2. For each p in P, calculate
v p | POS ( R{ p}) ( D) |
m p max_ size ( POS ( R{ p}) ( D) /( R { p} D))
• where max_size denotes the cardinality of the
maximal subset.
Attribute Selection using RSH (2)
• Step 3. Choose the best attribute p with the
largest v p m p ,and let
R R { p}
P P { p}.
• Step 4. Remove all consistent instances u in POS R (D)
from X.
• Step 5. Go back to Step 1.
Worked Example of Attribute Selection
U
u1
u2
u3
u4
u5
u6
u7
a
1
1
1
1
2
2
2
b c d e
0
0
2
2
1
1
1
2
2
0
2
0
1
2
1
0
0
1
0
0
1
1
1
2
0
2
2
1
Condition Attributes:
a: Va = {1, 2}
b: Vb = {0, 1, 2}
c: Vc = {0, 1, 2}
d: Vd = {0, 1}
Decision Attribute:
e: Ve = {0, 1, 2}
• After deleting all consistent objects we have:
T
U
u1
u2
u3
u4
u5
u6
u7
a
1
1
1
1
2
2
2
T’
R={b}
b c d
e
U’
b
e
0
0
2
2
1
1
1
1
1
2
0
2
2
1
u1
u2
u3
u4
u5
u6
u7
0
1
0
1
2
2
2
0
1
2
1
2
1
1
2
2
0
2
0
1
2
1
0
0
1
0
0
1
b0 e1
The instances containing b0 will not be considered.
U/{a,b}
1. Selecting {a}
R = {a,b}
U’
u3
u4
u5
u6
u7
a
1
1
2
2
2
b
e
2
2
1
1
1
2
0
2
2
1
a1b2 e2
a1b2 e0
u3
u5
u6
u4
u7
a 2b1 e2
a 2b1 e1
U/{e}
u3,u5,u6
POS
{a ,b}
X U /{e}
(X )
u4
u7
Also, we select {c} and {d}. Then finally we
found:
POS{b ,d } ({u3, u5, u 6}) /{b, d } {{u3},{u5, u 6}}
max_ size ( POS{b ,d } ({u3, u5, u 6}) /{b, d }) 2
Result: Subset of attributes= {b, d}
Application and Result
• We built a simple database depending on Heart diseases dataset using
Excel file. The attributes information and their types will be as following:
• Attribute Information:
• 1) age
2) sex
3) chest pain type (4 values)
• 4) resting blood pressure
5) serum cholesterol in mg/dl
• 6) fasting blood sugar > 120 mg/dl
• 7) resting electrocardiograph results (values 0, 1, and 2)
• 8) maximum heart rate achieved
9) exercise induced angina
• 10) old peak = ST depression induced by exercise relative to rest
• 11) the slope of the peak exercise ST segment
• 12) number of major vessels (0-3) colored by fluoroscopy
• 13) thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
• Attributes types
• Real: 1,4,5,8,10,12 Ordered: 11 Binary: 2,6,9 Nominal: 7,3,13
• Using Rosetta software which is used for analyze the data; Then the
database reduced. After that we generate the rules on the best reduct.
Finally we filtered the rules using Quality filtering loop. The result for our
experiment like this in Table .
Conclusion
• The star query: the most common type of query
in data warehouse,
• One of the most promising techniques for
efficiently evaluating such queries is the use of
fact table organizations that store data clustered
according to the dimension hierarchies.
– A special hierarchical encoding is imposed on star
joins are transformed to multidimensional range
queries on the underlying multidimensional structures.
The conventional star query evaluation plan changes
radically and new processing steps are required.