DOLAP 2004
A New OLAP Aggregation
Based on the AHC Technique
R. Ben Messaoud, O. Boussaid, S. Rabaséda
Laboratoire ERIC – Université de Lyon 2
5, avenue Pierre-Mendès–France
69676, Bron Cedex – France
http://eric.univ-lyon2.fr
Complex data
Definition:
Data are considered complex if they are …
Multi-format: information can be carried by different kinds of data (numeric, symbolic, texts, images, sounds, videos …)
Multi-structure: structured, unstructured or semi-structured (relational databases, XML documents …)
Multi-source: data come from different sources (distributed databases, web …)
Multi-modal: the same information can be described differently (data in different languages …)
Multi-version: data are updated through time (temporal databases, periodical inventory …)
November 13, 2004
Ben Messaoud et al.
General context
Huge volumes of complex data
Warehousing complex data: OLAP facts as complex objects
Analyze complex data
Current OLAP tools aren't suited to process complex data
Data mining is able to process complex data like images, texts, videos …
Coupling OLAP and data mining: analyze complex data on-line
New operator OpAC: Operator of Aggregation by Clustering (AHC)
[Diagram: complex data, MDBMS, OLAP, data mining, OpAC]
Outline
0. Complex data and general context
1. Related work: Coupling OLAP and data mining
2. Objectives of the proposed operator
3. Formalization of the operator
4. Implementation and demonstration
5. Conclusion and future work
Related work
Three approaches for coupling OLAP and data mining
First approach: extending the query languages of decision support systems
Second approach: adapting the multidimensional environment to classical data mining techniques
Third approach: adapting data mining methods to multidimensional data
[Diagram: data mining, OLAP, DBMS]
Related work
These works proved that:
Associating data mining with OLAP is a promising way to support rich analysis tasks
Data mining is able to extend the analysis power of OLAP
Use data mining to enhance OLAP tools in order to process complex data
OpAC: a new OLAP operator based on a data mining technique
[Diagram: data mining, OpAC, OLAP]
Objectives
Classic OLAP aggregation Vs OpAC aggregation
Classic OLAP:
Summarizes numerical data into a smaller number of values
Computes additive measures (Sum, Average, Max, Min …)
Example: Sales cube

                     Sales    Count
- Washington         $2520    120
    Bellingham       $700     32
    Bremerton        $400     20
    Olympia          $850     44
    Redmond          $250     9
    Seattle          $320     15
- California         $2410    129
    Berkeley         $820     41
    Beverly Hills    $910     50
    Los Angeles      $680     38
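The roll-up in the Sales cube example can be sketched in a few lines of Python. This is only an illustration of classic additive aggregation, not the code of any OLAP tool; the `facts` dictionary and the `roll_up` function are names introduced here for the example, and the figures are the city-level values from the slide.

```python
# City-level facts from the Sales cube example: (state, city) -> (sales in $, count).
facts = {
    ("Washington", "Bellingham"): (700, 32),
    ("Washington", "Bremerton"): (400, 20),
    ("Washington", "Olympia"): (850, 44),
    ("Washington", "Redmond"): (250, 9),
    ("Washington", "Seattle"): (320, 15),
    ("California", "Berkeley"): (820, 41),
    ("California", "Beverly Hills"): (910, 50),
    ("California", "Los Angeles"): (680, 38),
}


def roll_up(facts):
    """Aggregate city-level facts to the state level by summing each measure."""
    totals = {}
    for (state, _city), (sales, count) in facts.items():
        prev_sales, prev_count = totals.get(state, (0, 0))
        totals[state] = (prev_sales + sales, prev_count + count)
    return totals


state_totals = roll_up(facts)
print(state_totals)
```

Summing works here because sales and counts are additive measures; the same function has no obvious meaning if the cells hold images or texts, which is the gap OpAC targets.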
Objectives
Classic OLAP aggregation Vs OpAC aggregation
OpAC aggregation:
What about aggregating complex objects?
How to aggregate images, texts or videos with classic OLAP tools?
Complex objects are not additive OLAP measures …
Example: Images cube
[Table: images (Orange coral; Nebraska, USA; Toco toucan; Maldives) with measures Size and ASM: 3560px / 0.016; 2340px / 0.021; 4434px / 0.014; 3260px / 0.012; aggregate of the images: ?]
Objectives
How to aggregate complex objects?
Using a data mining technique: AHC (Agglomerative Hierarchical Clustering)
The AHC aggregates data
The hierarchical aspect of the AHC
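To make the idea of aggregation by AHC concrete, here is a minimal pure-Python sketch. It is not the OpAC implementation: the Euclidean distance and the average-linkage criterion are assumptions chosen for the example, and the function names and toy points are hypothetical. The function returns every level of the hierarchy, from one cluster per individual down to a single cluster.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters (assumed criterion)."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def ahc(points):
    """Return the list of partitions produced by successive merges.

    Each cluster is a tuple of point indices; levels[m] is the partition
    obtained after m merges.
    """
    clusters = [(i,) for i in range(len(points))]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Pick the two closest clusters under the linkage criterion.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
        levels.append(list(clusters))
    return levels


# Hypothetical individuals described by two numeric variables.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
levels = ahc(toy_points)
for level in levels:
    print(level)
```

Each entry of `levels` is one candidate aggregation of the individuals, which is what makes the hierarchy comparable to the levels of an OLAP dimension.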
Objectives
[Figure: images described by L1Normalized values crossed with the modalities (Very high, High, Medium, Low, Very low) of descriptor dimensions such as Homogeneity, for example "L1Normalized for high homogeneity" and "L1Normalized for low entropy"]
Formalization
$D_i$: the $i$-th dimension of a data cube $C$
$h_{ij}$: the $j$-th hierarchical level of the dimension $D_i$
$g_{ijt}$: the $t$-th modality of $h_{ij}$
The set of individuals:
$\Omega = \{ g_{ijt} \mid g_{ijt} \in h_{ij} \}$
The set of variables:
The dimension retained for individuals cannot generate variables
Only one hierarchical level of a dimension is allowed to generate variables
$\Sigma = \{ X \mid X(g_{ijt}) = \text{measure of } g_{srv} \text{ crossed with } g_{ijt} \}$
where $g_{srv} \in h_{sr}$, $s \neq i$ and $r$ is unique for each $s$
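The construction of individuals and variables can be sketched as a small table-building step. The cell layout, the dimension names, the measure values and the `build_table` function below are all hypothetical, and summing the measure over matching cells is an assumption; the sketch only mirrors the two rules stated above (the dimension of the individuals generates no variable, and each other dimension contributes the modalities of a single level).

```python
from collections import defaultdict

# Hypothetical cube cells: one modality per dimension plus one measure value.
cells = [
    {"image": "img1", "entropy": "High", "homogeneity": "Low", "L1Normalized": 0.4},
    {"image": "img2", "entropy": "Low", "homogeneity": "High", "L1Normalized": 0.7},
    {"image": "img3", "entropy": "High", "homogeneity": "High", "L1Normalized": 0.5},
]


def build_table(cells, individual_dim, variable_dims, measure):
    """Build the individuals x variables table used as clustering input.

    Rows are the modalities of individual_dim. Columns are (dimension, modality)
    pairs from variable_dims. A cell holds the measure summed over the cube
    cells that cross the row modality with the column modality.
    """
    if individual_dim in variable_dims:
        raise ValueError("the dimension retained for individuals cannot generate variables")
    columns = sorted({(d, c[d]) for c in cells for d in variable_dims})
    table = defaultdict(lambda: dict.fromkeys(columns, 0.0))
    for c in cells:
        for d in variable_dims:
            table[c[individual_dim]][(d, c[d])] += c[measure]
    return columns, dict(table)


columns, table = build_table(cells, "image", ["entropy", "homogeneity"], "L1Normalized")
for individual, row in table.items():
    print(individual, [row[col] for col in columns])
```

Each row of `table` is then a numeric vector describing one individual, which is the form a clustering method expects.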
Formalization
Evaluation tools
Minimize the intra-cluster distances
Maximize the inter-cluster distances
Inter and intra-cluster inertia
$A_1, A_2, \dots, A_k$ is a partition of the set of individuals $\Omega$
$P(A_i)$ is the weight of $A_i$
$G(A_i)$ is the gravity center of $A_i$
$I_{intra}(k) = \sum_{i=1}^{k} I(A_i)$
$I_{inter}(k) = \sum_{i=1}^{k} P(A_i)\, d(G(A_i), G(\Omega))$
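A short sketch may help to see how the two inertia indicators behave. Several choices below are assumptions not stated on the slide: every individual has the same weight $1/n$, $P(A_i)$ is the share of individuals in $A_i$, $d$ is the squared Euclidean distance, and $I(A_i)$ is the weighted sum of distances from the members of $A_i$ to $G(A_i)$. The function names and the toy partition are hypothetical.

```python
def centroid(points):
    """Gravity center of a list of points (tuples of equal length)."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance (assumed choice for d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inertias(partition):
    """Return (intra, inter) inertia of a partition given as a list of clusters."""
    all_points = [p for cluster in partition for p in cluster]
    n = len(all_points)
    g_all = centroid(all_points)
    intra = 0.0
    inter = 0.0
    for cluster in partition:
        g = centroid(cluster)
        # I(A_i): distances of the members to their own gravity center, weight 1/n each.
        intra += sum(sq_dist(p, g) for p in cluster) / n
        # P(A_i) * d(G(A_i), G(Omega)), with P(A_i) = |A_i| / n.
        inter += len(cluster) / n * sq_dist(g, g_all)
    return intra, inter


toy_partition = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]
intra, inter = inertias(toy_partition)
print(intra, inter)
```

Under these assumptions the sum of the two values equals the total inertia of the data, so a partition that lowers the intra-cluster inertia raises the inter-cluster inertia by the same amount.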
Formalization
[Figure: inter-cluster and intra-cluster inertia curves]
Individuals: modalities from the dimension of images
Variables:
L1Normalized values of images for all possible modalities of the entropy dimension
L1Normalized values of images for all possible modalities of the homogeneity dimension
[Diagram: modalities of the Homogeneity dimension: Very high, High, Medium, Low, Very low]
Formalization
Results:
Exploits the cube's facts describing images to construct groups of similar complex objects
Highlights significant groups of objects by a clustering technique
Clusters (aggregates) are defined from both dimensions and measures of a data cube
Implementation of a prototype
Implementation
Prototype:
Data loading module:
  Connects to a data cube on Analysis Services of MS SQL Server
  Uses MDX queries to import information about the cube's structure
  Extracts data selected by the user
Parameter setting interface:
  Helps the user extract individuals and variables from the cube
  Selects modalities and measures
  Defines the clustering problem
Clustering module:
  Allows the definition of clustering parameters such as the dissimilarity metric and the aggregation criterion
  Constructs the AHC
  Plots the results of the AHC on a dendrogram
Implementation
Images dataset:
3000 images collected from the web:
  Semantic annotation: description, subject and theme
  Descriptors of texture like:
    ENT: Entropy
    CON: Contrast
    L1Normalized: Medium Color Characteristic
    …
  Three color channels: RGB
Implementation
Demonstration:
Conclusion
OpAC is a possible way to perform on-line analysis over complex data
OpAC aggregates complex objects
Aggregates (clusters) are defined from both dimensions and measures of a data cube
Prototype available at: http://bdd.univ-lyon2.fr/?page=logiciel&id=5
Future work
The current evaluation tool may have some limitations
-> Use other indicators to evaluate the quality of partitions
-> Help the user find the best number of clusters
Exploit the aggregates generated by OpAC in order to reorganize the cube's dimensions
-> Get a new cube with remarkable regions
Use other data mining techniques to enhance the power of OLAP with explanation and prediction capabilities
The End