A data stream - ComplexWorld


Ernestina Menasalvas Ruiz
Pedro Sousa
GOAL
• Extract knowledge from aviation data sources to obtain patterns that help detect incidents
• Learn behaviour models
What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
KDD process
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
CRISP-DM (www.crispdm.org)
[Figure: CRISP-DM cycle with the phases Business Understanding, Data Understanding, Data Preparation, Model and Evaluate; other labels in the diagram: ARSS, …., fleet]
Challenges
• Data integration
• Aircraft information
• Context: sensors, space weather, location, weather
• Operations: pre-flight, departure, climb, enroute, arrival, taxiing, post-flight
• Aviation safety reports
• Dynamic and complex data:
– theoretical and practical aspects of the algorithms have to be analyzed to discover the most appropriate techniques:
• trend analysis, association of events, data stream methods, context integration, resource awareness
GOAL (cont)
• Apply algorithms to mine the various data sources for information
– to identify patterns:
• atypical flights
• anomalous cockpit procedures
• groups of safety reports
• BUT:
– KDD is a process
• Static vs dynamic
KDD process
Approx. 80% of the effort
Data Exploration and transformation
• Explore the data to better understand its characteristics.
– Helps to select the right tool for preprocessing or analysis
– Makes use of humans’ abilities to recognize patterns
– Integrates the semantics of the data
– Clustering and anomaly detection will be used as exploratory techniques
• Transform data prior to mining so that useful patterns can be extracted
Data Mining Tasks
• Prediction (Supervised learning)
– Use historical information to learn a model that can help to predict unknown or future values of some variable.
– Basis for forecasting
• Classification
• Regression
• Deviation Detection
• Description (Unsupervised)
– Find patterns that describe the data
– Clustering
– Association Rule Discovery
– Sequential Pattern Discovery
Classification
• Given a collection of records in which the class is known:
– Find a model able to describe the class given the values of the rest of the attributes.
• Measurements have to be used to validate the model and determine the accuracy of prediction
– Train and test
• Techniques
– Decision tree induction
• C4.5, ID3
• Very efficient in terms of execution time
• Very intuitive results
– Neural networks
• The result is a neural network: a black box
• Robust
• Not intuitive
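As an illustration of how ID3-style tree induction chooses an attribute to split on, the sketch below computes information gain on a few toy records. The records and the attribute names (`phase`, `weather`, `incident`) are hypothetical and are not taken from any aviation dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(records, attribute, target):
    """Entropy reduction obtained by splitting `records` on `attribute`."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder


# Hypothetical toy records; attribute names are illustrative only.
flights = [
    {"phase": "climb", "weather": "storm", "incident": "yes"},
    {"phase": "climb", "weather": "clear", "incident": "no"},
    {"phase": "enroute", "weather": "storm", "incident": "yes"},
    {"phase": "enroute", "weather": "clear", "incident": "no"},
]

gains = {a: information_gain(flights, a, "incident") for a in ("phase", "weather")}
# ID3 splits first on the attribute with the highest gain.
best_attribute = max(gains, key=gains.get)
```

A full tree learner would apply the same choice recursively to each subset; validating on held-out records (train and test) is left out of the sketch.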
Clustering
• Given a set of (unclassified) records, group the records in such a way that:
– records in one cluster are more similar to one another.
– records in separate clusters are less similar to one another.
• Similarity measures have to be defined:
– Special attention to how distance is defined and interpreted
• Approaches
– Partitioning algorithms: they first build different partitions and then evaluate these partitions:
• K-means
– Hierarchical: they build a hierarchical decomposition
– Density based: density functions are used
– Kohonen networks [Kohonen ‘95]
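A minimal K-means sketch, assuming Euclidean distance as the similarity measure and random initial centroids. The sample points are invented for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # assignments are stable: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Two invented, well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = k_means(points, k=2)
```

The result depends on the distance function, so attributes on different scales usually need normalizing before clustering.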
Association Rule Discovery
• Given a set of records described by a set of attributes:
– Find associations in values of attributes
– Once associations are discovered, rules can be obtained
– Confidence vs support
– Apriori algorithm
At1=1 and At3=1 and At4=1

At1  At2  At3  At4  At5  At6  At7
 0    1    0    1    1    0    0
 0    0    0    0    1    0    0
 1    0    1    1    0    0    0
 1    0    1    1    0    1    1
 0    0    0    0    0    0    1
 0    1    0    1    1    0    1
 1    1    1    1    1    0    0
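The seven records of the At1..At7 example can be used to illustrate support, confidence and the level-wise Apriori search. The sketch treats each record as the set of attributes whose value is 1; the minimum support of 3/7 is an assumption chosen for the example.

```python
from itertools import combinations

ATTRS = ["At1", "At2", "At3", "At4", "At5", "At6", "At7"]
ROWS = [
    (0, 1, 0, 1, 1, 0, 0),
    (0, 0, 0, 0, 1, 0, 0),
    (1, 0, 1, 1, 0, 0, 0),
    (1, 0, 1, 1, 0, 1, 1),
    (0, 0, 0, 0, 0, 0, 1),
    (0, 1, 0, 1, 1, 0, 1),
    (1, 1, 1, 1, 1, 0, 0),
]
# Each record becomes the set of attributes whose value is 1.
transactions = [frozenset(a for a, v in zip(ATTRS, row) if v) for row in ROWS]


def support(itemset):
    """Fraction of records that contain every attribute in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    return support(antecedent | consequent) / support(antecedent)


def apriori(min_support):
    """Level-wise search: size-k candidates are built only from frequent
    (k-1)-itemsets, and pruned if any (k-1)-subset is not frequent."""
    frequent = {}
    level = [frozenset([a]) for a in ATTRS]
    while level:
        kept = {s: support(s) for s in level if support(s) >= min_support}
        frequent.update(kept)
        candidates = {a | b for a, b in combinations(kept, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if all(c - {x} in kept for x in c)]
    return frequent


frequent_itemsets = apriori(min_support=3 / 7)
rule_itemset = frozenset({"At1", "At3", "At4"})
rule_confidence = confidence(frozenset({"At1"}), frozenset({"At3", "At4"}))
```

Under these assumptions the itemset {At1, At3, At4} appears in 3 of the 7 records, and every record with At1=1 also has At3=1 and At4=1.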
Challenges of the algorithms
• Algorithms to find anomalies in large datasets must be:
– fast
– scalable
– accurate
• Algorithms have to be able to deal with:
– continuous sequences, representing sensor data such as airspeed and altitude
– discrete sequences, such as sequences of pilot switch presses.
Data streams vs static data
Data streams
A data stream:
- is potentially unbounded in size
- needs to be analyzed over time
- arrives at a very high rate
- and its underlying model evolves over time
Challenges for algorithms:
- Processing data in a single pass.
- Generating models in an incremental way.
- Ability to detect model changes over time.
- Limited usage of memory and computing time.
- Possibility of automating the evaluation process.
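A small sketch of single-pass, constant-memory processing: Welford's incremental update of mean and variance, which sees each value once and stores only three numbers. The readings are made-up values standing in for a sensor stream.

```python
class RunningStats:
    """Incremental mean and sample variance (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def readings():
    """Stand-in for an unbounded stream; values are invented."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for value in readings():
    stats.update(value)  # each value is seen once and then discarded
```

Memory use stays fixed however long the stream runs, which is the property the challenges above ask for.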
[Aggarwal et al.] “Data Streams: Models and Algorithms”. Advances in Database Systems, Springer, 2007
[Aguilar-Ruiz, Gama] “Data Streams”. Journal of Universal Computer Science, 2005
[Barbará] “Requirements for clustering data streams”. SIGKDD’02.
Goal
• New challenges introduced by evolving data like:
– resource aware learning,
– change detection,
– novelty detection
– important application areas where data evolution must be taken into account
– how learning under constraints (time, storage capacity and other resources) is affected by data evolution
– how context can help the learning process
Change and concept drift
Concept drift: the underlying concept may shift unexpectedly from time to time.
• Changes appear:
– Adversary actions
– Varying personal interest
– Changing population
– Complex environment
[Figure: four plots of the mean over time, illustrating sudden drift, gradual drift, incremental drift and reoccurring contexts]
[Joao Gama 2010]
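One common way to detect a sudden shift in the mean of a stream is the Page-Hinkley test. The sketch below is a minimal version for upward shifts. The parameter values (`delta`, `threshold`) and the synthetic stream are assumptions chosen for illustration, not settings from the presentation.

```python
class PageHinkley:
    """Page-Hinkley test for an increase in the mean of a stream."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level (lambda)
        self.n = 0
        self.mean = 0.0
        self.cumulative = 0.0       # running sum of (x - mean - delta)
        self.minimum = 0.0          # smallest cumulative value seen so far

    def update(self, x):
        """Consume one value; return True when a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        return self.cumulative - self.minimum > self.threshold


# Synthetic stream with a sudden drift: the mean jumps from 0 to 1 at t = 50.
stream = [0.0] * 50 + [1.0] * 50
detector = PageHinkley()
alarm_at = next((t for t, x in enumerate(stream) if detector.update(x)), None)
```

A larger `threshold` gives fewer false alarms but a longer detection delay. Gradual and incremental drift are typically detected later than sudden drift.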
Required features
• Examples have to be processed as they arrive
• Each example should be processed:
– In small constant time
– Using a fixed amount of main memory
– In a single scan of the data
– Without revisiting old records (or with reduced revisiting).
• Produce models equivalent to the one that would be obtained by a batch data-mining algorithm
• Detect and react to concept drift
[Joao Gama 2010]
Recurrent concepts
• Many learning algorithms deal with concept drift
– Based on: time windows, ensembles, drift detection.
– FLORA, SEA, DWM, DMM, ...
• What about recurrent concepts?
– A particular type of concept drift.
– With forgetting mechanisms, past data and models are discarded.
– However, it is common for concepts to reappear.
Context and data stream
Context
• Context representation:
• Context similarity:
numeric:
nominal:
Context integration
• We want to integrate context information with previously learned models.
• freqC is the most frequent context in a sequence of context states {C1, C2, ... Cn}
• Concept history with associated context: h(Mk|Ci)
• Estimate that Mk represents the current underlying concept given the current context.
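A sketch of the two quantities named above. `most_frequent_context` follows the stated definition of freqC. The `ConceptHistory` class is only an assumption about how h(Mk|Ci) could be estimated (as a relative frequency of past model use per context), since the original formula is not given here. Context states are represented as hypothetical tuples of nominal values.

```python
from collections import Counter, defaultdict


def most_frequent_context(context_states):
    """freqC: the context state occurring most often in the sequence."""
    return Counter(context_states).most_common(1)[0][0]


class ConceptHistory:
    """Assumed estimate of h(Mk|Ci): how often model Mk was the one in use
    when context Ci was observed, relative to all models seen under Ci."""

    def __init__(self):
        self._counts = defaultdict(Counter)  # context -> Counter of model ids

    def record(self, model_id, context):
        self._counts[context][model_id] += 1

    def h(self, model_id, context):
        seen = self._counts.get(context, Counter())
        total = sum(seen.values())
        return seen[model_id] / total if total else 0.0


# Hypothetical context states (weather, time of day).
storm_night = ("storm", "night")
clear_day = ("clear", "day")
freq_c = most_frequent_context([storm_night, clear_day, storm_night])

history = ConceptHistory()
history.record("M1", storm_night)
history.record("M1", storm_night)
history.record("M2", storm_night)
estimate = history.h("M1", freq_c)
```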
Model Storage
• Model storage for a model Mk:
– the period k where the model was used.
– using NB requires storing the CV
– the frequent context freqC for period k.
– accuracy of the model when it was in use.
• Represented as the tuple:
Model Retrieval
• Model retrieval for a model Mk:
– using a sample Sn of recent records,
– compute the MSE for Mk
– get the freqC for Sn
– use history h(Mk|freqC)
• The utility is defined based on model accuracy (the higher the better) and on how similar the stored context is to the current one (minimum distance).
• Retrieve the model with highest utility as:
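The utility formula itself is not reproduced here, so the sketch below only illustrates the idea under loudly stated assumptions: utility is taken as a negated weighted sum of the model's MSE on the recent sample and a nominal context distance, and the history term h(Mk|freqC) is left out. `StoredModel`, the weights and the distance function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StoredModel:
    name: str
    predict: Callable[[float], float]  # stand-in for a stored learner
    freq_context: Tuple[str, ...]      # freqC recorded while the model was in use


def mse(model, sample):
    """Mean squared error of the model on (x, y) pairs from the recent sample Sn."""
    return sum((model.predict(x) - y) ** 2 for x, y in sample) / len(sample)


def context_distance(c1, c2):
    """Assumed nominal distance: fraction of context attributes that differ."""
    return sum(a != b for a, b in zip(c1, c2)) / len(c1)


def utility(model, sample, current_context, w_error=0.5, w_context=0.5):
    # Assumed form: lower error and smaller context distance give higher utility.
    penalty = w_error * mse(model, sample)
    penalty += w_context * context_distance(model.freq_context, current_context)
    return -penalty


def retrieve(models, sample, current_context):
    """Return the stored model with the highest utility."""
    return max(models, key=lambda m: utility(m, sample, current_context))


# Hypothetical stored models and recent sample.
models = [
    StoredModel("M1", lambda x: x, ("storm", "night")),
    StoredModel("M2", lambda x: 0.0, ("clear", "day")),
]
recent_sample = [(0.2, 0.2), (0.8, 0.8)]
best = retrieve(models, recent_sample, current_context=("storm", "night"))
```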
CALDS: learning process
• Incrementally learn the underlying concept
• When a warning is signaled:
– Prepare a new base learner for the possible new concept
– Anticipate the drift
• When drift is detected:
– Store the current model
– Reuse a previously learned model when the underlying concept is recurrent.
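The text does not say which detector raises the warning and drift signals. As one possibility, the sketch below assumes a DDM-style detector that monitors the learner's error rate and compares it against its best level so far, using 2 and 3 standard deviations as the warning and drift levels. The error stream is synthetic.

```python
from math import sqrt


class DriftDetector:
    """DDM-style monitor of a learner's error rate (assumed detector)."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.n = 0
        self.errors = 0
        self.best = float("inf")  # lowest p + s observed so far
        self.p_min = 0.0
        self.s_min = 0.0

    def update(self, is_error):
        """Consume one prediction outcome; return 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n        # running error rate
        s = sqrt(p * (1 - p) / self.n)  # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.best:
            self.best, self.p_min, self.s_min = p + s, p, s
        if p + s > self.p_min + 3 * self.s_min:
            return "drift"    # store the current model, reuse or learn another
        if p + s > self.p_min + 2 * self.s_min:
            return "warning"  # start preparing a new base learner
        return "stable"


# Synthetic outcomes: 10% errors for 100 examples, then every prediction is wrong.
error_stream = [i % 10 == 9 for i in range(100)] + [True] * 50
detector = DriftDetector()
signals = [detector.update(e) for e in error_stream]
first_warning = signals.index("warning")
first_drift = signals.index("drift")
```

In a full learner the detector would be reset after a drift, and the examples seen since the warning would be used to train the new base learner.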
Improvements integrating context
Overall accuracy: 72.5%; 69.6%; 62.2%
SOME PREVIOUS EXPERIENCE
Other current applications
• ESA- European Space Agency
– Event Reporting Tool for non-manned satellite passes (Cryosat monitoring)
current applications
• ESA- European Space Agency / Galileo Industries
– Galileo - Ground Control Segment Central Monitoring & Control Facility
Some current applications
• Portuguese Navy
– Singrar – Integrated System for Ship Repair and Resource allocation
The process
[Figure: process diagram with the labels Integrated Risk, Input, Application, Plans Activation / Maintenance, Drillings, Training]
Space Weather
Why – Space Weather?
• To protect systems and people that might be at risk from space weather effects, we need to understand the causes of space weather.
Space Weather Decision Support System
• SWDSS: third project financed by the European Space Agency (ESA) on space weather (SW)
• The main objective of SWDSS is to develop software capable of storing, manipulating and reacting to adverse space weather situations in spacecraft:
– Providing tools for analyzing the collected data;
– Supplying reporting facilities for systems management;
– Supplying a knowledge discovery tool for nowcast, forecast and data mining.
Data sources and providers
• Mission’s telemetry (payload and/or housekeeping) data and processed data
• Mission’s auxiliary data, e.g. orbital coordinates, apogee and perigee crossings, station coverage and hand-over, events, 3D models, metadata
• Data available from other sources, e.g. NOAA, SIDC, SWENET, National Agencies
• Data from ground-based measurements
Satellite Monitoring
Conclusion
• Huge amount of aviation data
1. Integrate data (micro and macro level)
2. Enrich data with semantics
3. Map data to techniques to discover patterns (static and streams):
   1. Anomalies
   2. Predictive
   3. Sequences
   4. Context influence
• Data mining in other similar domains has obtained results
• Next step: data mining for aviation safety