Optimization and operations research techniques
for mining evolving and temporal data
Ph.D. student: TA Minh Thuy, USTH 2010
Thesis director: Prof. LE Thi Hoai An
Co-director: Dr. Lydia Boudjeloud-Assala
LITA, EA3097 - UFR MIM
University Paul Verlaine - Metz - France
About me
• Objective:
  • Develop new models
  • Develop new optimization methods
Problems: unsupervised classification and variable selection
for mining evolving and temporal data (data streams).
Start date: 1 Dec 2010
Team: Algorithms and Optimization
Category: Information Technology
Fields of research: Data Mining, Data Streams, Clustering,
Classification, Feature Selection
Context
• For many recent applications, the concept of a data stream is more
appropriate than that of a data set.
• The volume of such data is so large that it may be impossible to
store the data on disk. Furthermore, even when the data can be
stored, the volume of incoming data may be so large that it
may be impossible to process any particular record more than
once.
• Data in streams exhibit temporal correlations. Such correlations
can help detect important characteristics of the data's evolution
and can be used to develop efficient mining algorithms.
Context
• The stream model is motivated by emerging applications involving
massive data sets.
  • Examples: telephone records, customer click streams,
multimedia data, financial transactions, ...
• In these cases, the data evolve continuously.
  • Examples: the dynamism of the services (content, structure,
promotions, ...), changes in user behavior or client interest,
dependence on time (time of day, day of the week, ...), or
dependence on events (summer vacations, New Year, ...).
• Therefore, data streams pose special challenges for data
mining algorithms. Mining algorithms must be designed to account
effectively for changes in the underlying structure of the data stream.
Problems:
• Problem 1: Clustering data streams.
  • Existing methods for mining data streams focus on the whole
period of data.
  • Consequently, only the behaviors that predominate over the entire
analysis period are detected. Behaviors occurring in short periods of
time are not detected.
• Model for the data stream clustering problem: fixed windows
  • Divide the analyzed time period into more significant
sub-periods, with the aim of detecting the evolution of old patterns or the
emergence of new ones, which would not have been revealed
by a global analysis over the whole time period.
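The fixed-window idea can be sketched as follows. This is only an illustration, not the method of the thesis: the helper names fixed_windows and kmeans_1d, the scalar records, and the use of plain k-means are all assumptions made for this example.

```python
import random


def fixed_windows(stream, size):
    """Yield consecutive, non-overlapping windows of `size` records."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == size:
            yield window
            window = []
    if window:  # last, possibly shorter, window
        yield window


def kmeans_1d(values, k, iters=20, seed=0):
    """Plain Lloyd k-means on scalar values (hypothetical stand-in
    for the clustering step); returns the sorted cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sorted(centers)


# Toy stream: the second sub-period contains a behavior (around 9)
# that a single clustering of the whole period could blur.
stream = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 1.1, 0.8, 9.0, 9.2, 1.0, 9.1]
for index, window in enumerate(fixed_windows(stream, size=6)):
    print(index, kmeans_1d(window, k=2))
```

Clustering each sub-period separately makes it possible to compare the resulting centers from one window to the next.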
Problems:
• Problem 2: Detecting changes in data streams.
  • In a data stream, the data patterns may evolve over time.
How do the data change over time?
  - Disappearance of a behavior cluster
  - Appearance of a behavior cluster
  - Splitting of a behavior cluster
  - Merging of two or more behavior clusters
  - No change
• Model for the data stream change detection problem: sliding windows
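The five kinds of change listed for Problem 2 can be illustrated with a rough sketch that compares the cluster centers of two consecutive sliding windows. Everything here is an assumption for illustration: the function names, the scalar centers, and the matching radius eps are hypothetical, and the thesis may define the changes differently.

```python
from collections import deque


def sliding_windows(stream, size, step=1):
    """Yield overlapping windows of `size` records, advancing by `step`."""
    window = deque(maxlen=size)
    for i, record in enumerate(stream, start=1):
        window.append(record)
        if len(window) == size and (i - size) % step == 0:
            yield list(window)


def detect_changes(old_centers, new_centers, eps):
    """Label changes between two windows by matching scalar cluster
    centers that lie within `eps` of each other (hypothetical rule)."""
    events = []
    for old in old_centers:
        matches = [new for new in new_centers if abs(new - old) <= eps]
        if not matches:
            events.append(("disappearance", old))
        elif len(matches) >= 2:
            events.append(("split", old))
    for new in new_centers:
        matches = [old for old in old_centers if abs(new - old) <= eps]
        if not matches:
            events.append(("appearance", new))
        elif len(matches) >= 2:
            events.append(("merge", new))
    return events or [("no change", None)]


print(detect_changes([1.0, 5.0], [1.1, 9.0], eps=0.5))
print(detect_changes([5.0], [4.0, 6.0], eps=1.5))
```

The centers passed to detect_changes would come from clustering each window; the choice of eps controls how far a center may drift before it counts as a different cluster.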
Problems:
• Problem 3: Feature selection based clustering.
  • An object can be represented by variables of different types
(quantitative, qualitative or structured). The nature of the variables
influences the definition of similarity between objects,
so the choice of variables is very important.
  • The task is to select the relevant variables and
eliminate the redundant ones.
• Applications include:
  • medical diagnosis (cancer risk assessment, detection of cardiac
arrhythmia, …)
  • text categorization (classifying email as spam or not,
classifying web pages, …)
  • pattern recognition (face recognition, handwritten digits, ...)
  • ….
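To make "eliminating redundant variables" in Problem 3 concrete, here is a generic correlation filter for quantitative variables. It is not the feature selection method of the thesis; the function names, the threshold of 0.95, and the toy columns are assumptions chosen for this sketch.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(columns, threshold=0.95):
    """Keep a variable only if its absolute correlation with every
    variable already kept is below `threshold` (hypothetical rule)."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # same information as height_cm
    "weight_kg": [70, 55, 80, 60],
}
print(drop_redundant(columns))
```

Qualitative or structured variables would need a different similarity measure, which is part of what makes the problem hard.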
Methodology
• We apply mathematical techniques, in particular optimization
techniques, to data mining problems. Many real-world optimization
problems are nonconvex.
• To solve nonconvex optimization problems, we study
DC (Difference of Convex functions) programming and DCA
(DC Algorithms).
• DC programming and DCA were introduced in 1985
by Pham Dinh Tao and have been developed by Le Thi Hoai An and Pham
Dinh Tao since 1994; they are now classic and increasingly
popular.
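As a minimal illustration of the DCA scheme (not taken from the thesis): write the objective as f(x) = g(x) - h(x) with g and h convex; at each iteration take y_k in the subdifferential of h at x_k, then let x_{k+1} minimize the convex function g(x) - y_k * x. The toy objective f(x) = x^4 - 2x^2, its split into g(x) = x^4 and h(x) = 2x^2, and the name dca_quartic are assumptions chosen for this sketch.

```python
import math


def dca_quartic(x0, max_iter=100, tol=1e-10):
    """DCA on the toy DC function f(x) = x**4 - 2*x**2,
    with g(x) = x**4 and h(x) = 2*x**2 (both convex)."""
    x = x0
    for _ in range(max_iter):
        # Step 1: y_k is the gradient of h at x_k, h'(x) = 4x.
        y = 4.0 * x
        # Step 2: minimize g(x) - y*x. Setting 4*x**3 = y gives the
        # real cube root of y/4 (sign handled explicitly).
        x_new = math.copysign(abs(y / 4.0) ** (1.0 / 3.0), y)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


print(dca_quartic(2.0))   # approaches the critical point x = 1
print(dca_quartic(-0.5))  # approaches the critical point x = -1
```

Each iteration solves a convex subproblem, and the objective value does not increase from one iterate to the next. Starting exactly at x = 0 the iteration stays there, since 0 is also a critical point of f.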
Results:
• TA Minh Thuy, LE-THI Hoai An, Lydia Boudjeloud-Assala:
Clustering Data Stream Based on Sub-Windows: A DC
Programming Approach. 15th Austrian-French-German
Conference on Optimization, international conference AFG11,
Toulouse, France, 19-23 September 2011, pp. 135-136.