Anomaly Detection Using Data Mining Techniques

Margaret H. Dunham, Yu Meng, Donya Quick, Jie Huang, Charlie Isaksson
CSE Department
Southern Methodist University
Dallas, Texas 75275
[email protected]
This material is based upon work supported by the National Science Foundation under
Grant No. IIS-0208741
Objectives/Outline
Develop modeling techniques which can “learn/forget” past behavior of spatiotemporal stream events. Apply to prediction of anomalous events.

 Introduction
 EMM Overview
 EMM Applications to Anomaly Detection
 EMM Applications to Nuclear Testing
 Future Work
Outline
 Introduction
 Motivation
 What is an anomaly?
 Spatiotemporal Data
 Modeling Spatiotemporal Data
 EMM Overview
 EMM Applications to Anomaly Detection
 EMM Applications to Nuclear Testing
 Future Work
Motivation
A growing number of applications generate streams
of data.
 Computer network monitoring data
 Call detail records in telecommunications
 Highway transportation traffic data
 Online web purchase log records
 Sensor network data
 Stock exchange, transactions in retail chains, ATM
operations in banks, credit card transactions.
Data mining techniques play a key role in modeling
and analyzing this data.
What is an Anomaly?
 Event that is unusual
 Event that doesn’t occur frequently
 Predefined event
 What is unusual?
 What is deviation?
What is an Anomaly in Stream Data?
 Rare - Anomalous – Surprising
 Out of the ordinary
 Not outlier detection
 No knowledge of data distribution
 Data is not static
 Must take temporal and spatial values into account
 May be interested in sequence of events
 Ex: Snow in upstate New York is not an anomaly
 Snow in upstate New York in June is rare
 Rare events may change over time
Statistical View of Anomaly
 Outlier
 Data item that is outside the normal
distribution of the data
 Identify by Box Plot
Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
Statistical View of Anomaly
 Identify by looking at distribution
 THIS DOES NOT WORK with stream data
Image from www.wikipedia.org, Normal distribution.
Data Mining View of Anomaly
 Classification Problem
   Build classifier from training data
   Problem is that training data shows what is NOT an anomaly
   Thus an anomaly is anything that is not viewed as normal by the classification technique
   MUST build dynamic classifier
 Identify anomalous behavior
   Signatures of what anomalous behavior looks like
   Input data is identified as an anomaly if it is similar enough to one of these signatures
 Mixed – Classification and Signature
Visualizing Anomalies
Temporal Heat Map (THM) is a visualization technique
for streaming data derived from multiple sensors.
 Two dimensional structure similar to an infinite table.
 Each row of the table is associated with one sensor
value.
 Each column of the table is associated with a point in
time.
 Each cell within the THM is a color representation of
the sensor value
 Colors normalized (in our examples)
   0 – White
   0.5 – Blue
   1.0 – Red
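A minimal sketch of building a THM, assuming NumPy/matplotlib and made-up sensor readings; a standard colormap stands in for the white/blue/red convention above.

```python
# Minimal Temporal Heat Map sketch (illustrative data, not from the talk).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
readings = rng.random((4, 60))               # 4 hypothetical sensors, 60 time steps

# Normalize each sensor's row to [0, 1] so colors are comparable across rows.
lo = readings.min(axis=1, keepdims=True)
hi = readings.max(axis=1, keepdims=True)
normalized = (readings - lo) / (hi - lo)

fig, ax = plt.subplots(figsize=(8, 2))
ax.imshow(normalized, aspect="auto", cmap="coolwarm")   # rows = sensors, columns = time
ax.set_xlabel("time")
ax.set_ylabel("sensor")
plt.show()
```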
THM of VoIP Data
VoIP traffic data was provided by Cisco Systems and
represents logged VoIP traffic in their Richardson, Texas
facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17
11:29:11 2003.
Spatiotemporal Stream Data
 Records may arrive at a rapid rate
 High volume (possibly infinite) of continuous data
 Concept drifts: Data distribution changes on the fly
 Data does not necessarily fit any distribution pattern
 Multidimensional
 Temporal
 Spatial
 Data are collected in discrete time intervals
 Data are in structured format, <a1, a2, …>
 Data hold an approximation of the Markov property.
Spatiotemporal Environment
 Events arriving in a stream
 At any time, t, we can view the state
of the problem as represented by a
vector of n numeric values:
Vt = <S1t, S2t, ..., Snt>
        V1     V2    ...    Vq
S1      S11    S12   ...    S1q
S2      S21    S22   ...    S2q
...     ...    ...   ...    ...
Sn      Sn1    Sn2   ...    Snq
(time increases from V1 to Vq)
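A small sketch, with made-up values, of how such a stream of state vectors might be held in memory:

```python
import numpy as np

# Each row is one state vector Vt = <S1t, ..., Snt>; values are illustrative.
stream = np.array([
    [14.0,  8.0, 2.0],    # V1
    [17.0, 10.0, 2.0],    # V2
    [18.0, 10.0, 3.0],    # V3
])

v2 = stream[1]                # the full state vector at time t = 2
s1_over_time = stream[:, 0]   # sensor S1 across all time points
```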
Data Stream Modeling
 Single pass: Each record is examined at most once
 Bounded storage: Limited memory for storing synopsis
 Real-time: Per record processing time must be low
 Summarization (synopsis) of data
 Use data NOT SAMPLE
 Temporal and Spatial
 Dynamic
 Continuous (infinite stream)
 Learn
 Forget
 Sublinear growth rate - Clustering
MM
A first order Markov Chain is a finite or countably infinite
sequence of events {E1, E2, … } over discrete time
points, where Pij = P(Ej | Ei), and at any time the future
behavior of the process is based solely on the current
state
A Markov Model (MM) is a graph with m vertices or states,
S, and directed arcs, A, such that:
 S = {N1, N2, …, Nm}, and
 A = {Lij | i ∈ 1, 2, …, m; j ∈ 1, 2, …, m}, where each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni).
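A minimal sketch of this definition, assuming the arc labels Pij = P(Nj | Ni) are estimated from transition counts; the count matrix is illustrative, not from the talk's data.

```python
import numpy as np

# counts[i, j] = number of observed transitions Ni -> Nj (illustrative values).
counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 2.0, 0.0]])

row_totals = counts.sum(axis=1, keepdims=True)
P = np.divide(counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0)

print(P[0])   # estimated transition probabilities out of state N1
```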
Problem with Markov Chains
 The required structure of the MC may not be certain
at the model construction time.
 As the real world being modeled by the MC
changes, so should the structure of the MC.
 Not scalable – grows linearly with the number of events.
 Our solution:
 Extensible Markov Model (EMM)
 Cluster real world events
 Allow Markov chain to grow and shrink
dynamically
Outline
Introduction
EMM Overview
EMM Applications to Anomaly Detection
EMM Applications to Nuclear Testing
Future Work
Extensible Markov Model (EMM)
 Time Varying Discrete First Order Markov Model
 Nodes are clusters of real world states.
 Learning continues during application phase.
 Learning:
 Transition probabilities between nodes
 Node labels (centroid/medoid of cluster)
 Nodes are added and removed as data arrives
Related Work
 Splitting Nodes in HMMs
   Create new states by splitting an existing state
   M.J. Black and Y. Yacoob, “Recognizing facial expressions in image sequences using local parameterized models of image motion,” Int. Journal of Computer Vision, 25(1), 1997, 23-48.
 Dynamic Markov Modeling
   States and transitions are cloned
   G. V. Cormack and R. N. S. Horspool, “Data compression using dynamic Markov Modeling,” The Computer Journal, Vol. 30, No. 6, 1987.
 Augmented Markov Model (AMM)
   Creates new states if the input data has never been seen in the model, and transition probabilities are adjusted
   Dani Goldberg and Maja J Mataric, “Coordinating mobile robot group behavior using a model of interaction dynamics,” Proceedings of the Third International Conference on Autonomous Agents (Agents ’99), Seattle, Washington.
EMM vs AMM
Our proposed EMM model is similar to AMM, but is more
flexible:
 EMM continues to learn during the application phase.
 State matching is determined using a clustering
technique.
 EMM not only allows the creation of new nodes, but
deletion (or merging) of existing nodes. This allows the
EMM model to “forget” old information which may not be
relevant in the future. It also allows the EMM to adapt to
any main memory constraints for large scale datasets.
 EMM performs one scan of data and therefore is suitable
for online data processing.
EMM
Extensible Markov Model (EMM): at any time t,
EMM consists of an MM and algorithms to modify
it, where algorithms include:
 EMMSim, which defines a technique for matching
between input data at time t + 1 and existing states
in the MM at time t.
 EMMIncrement algorithm, which updates MM at
time t + 1 given the MM at time t and classification
measure result at time t + 1.
Additional algorithms may be added to modify the
model or for applications.
EMMSim
 Find closest node to incoming event.
 If none “close” create new node
 Labeling of cluster is centroid/medoid of
members in cluster
 Problem
   Nearest Neighbor: O(n)
   BIRCH: O(lg n), but requires a second phase to recluster the initial clustering
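A minimal sketch of the matching step, assuming Euclidean distance and a fixed threshold; the talk's experiments also use Jaccard, Dice, cosine, and overlap similarity, and an indexed structure such as BIRCH rather than the O(n) scan shown here.

```python
import numpy as np

def match_or_create(event, centroids, threshold):
    """Return the index of the closest existing node, or None if no node is
    within `threshold` of the incoming event (i.e., a new node should be created)."""
    if not centroids:
        return None
    dists = [np.linalg.norm(np.asarray(event) - np.asarray(c)) for c in centroids]
    best = int(np.argmin(dists))
    return best if dists[best] <= threshold else None
```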
EMMIncrement
(Figure: EMM before and after processing input vectors such as <18,10,3,3,1,0,0>; the matched or newly created node among N1, N2, N3 has its count incremented, and the transition probabilities on the arcs, e.g. 1/3 and 2/3, are updated accordingly.)
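A sketch of the bookkeeping behind EMMIncrement, under the simplifying assumption that the model is stored as node and link occurrence counts from which transition probabilities are derived; the names and structure here are illustrative, not the authors' implementation.

```python
from collections import defaultdict

class EMMSketch:
    """Count-based EMM skeleton: CN_i per node, CL_ij per directed link."""
    def __init__(self):
        self.node_counts = defaultdict(int)       # CN_i
        self.link_counts = defaultdict(int)       # CL_ij, keyed by (i, j)
        self.current = None                       # current node id

    def increment(self, node_id):
        """Record arrival at `node_id` (matched or newly created by EMMSim)."""
        self.node_counts[node_id] += 1
        if self.current is not None:
            self.link_counts[(self.current, node_id)] += 1
        self.current = node_id

    def transition_prob(self, i, j):
        """Estimate Pij = P(Nj | Ni) from the stored counts."""
        out_total = sum(c for (a, _), c in self.link_counts.items() if a == i)
        return self.link_counts.get((i, j), 0) / out_total if out_total else 0.0
```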
EMMDecrement
(Figure: EMM before and after deleting node N2; the arcs into and out of N2 are removed and the transition probabilities among the remaining nodes N1, N3, N5, N6 are recomputed.)
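One plausible way to realize EMMDecrement on the count-based sketch above (again an assumption, not the authors' exact scheme): deleting a node drops its occurrence count and every link into or out of it, and the surviving counts define the new probabilities.

```python
def emm_decrement(emm, node_id):
    """Forget `node_id`: remove its count and all incident links."""
    emm.node_counts.pop(node_id, None)
    for key in [k for k in emm.link_counts if node_id in k]:
        del emm.link_counts[key]
    if emm.current == node_id:
        emm.current = None
```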
EMM Advantages
 Dynamic
 Adaptable
 Use of clustering
 Learns rare event
 Scalable:
   Growth of EMM is not linear on size of data
   Hierarchical feature of EMM
   Creation/evaluation quasi-real time
   Distributed / Hierarchical extensions
Growth of EMM
(Plot: number of states in the model versus number of input data records (total 1574) for the Serwent data, at similarity thresholds 0.994 through 0.998.)
EMM Performance – Growth Rate
Number of EMM states by data set, similarity measure, and threshold:

Data      Similarity   0.99   0.992   0.994   0.996   0.998
Serwent   Jaccard       156     190     268     389     667
          Dice           72      92     123     191     389
          Cosine         11      14      19      31      61
          Overlap         2       2       3       3       4
Ouse      Jaccard        56      66      81     105     162
          Dice           40      43      52      66     105
          Cosine          6       8      10      13      24
          Overlap         1       1       1       1       1
EMM Performance – Growth Rate
Minnesota Traffic Data
Outline
Introduction
EMM Overview
EMM Applications to Anomaly Detection
EMM Applications to Nuclear Testing
Future Work
Datasets/Anomalies
 MnDot – Minnesota Department of Transportation
 Automobile Accident
 Ouse and Serwent – River flow data from England
 Flood
 Drought
 KDD Cup’99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
   Intrusion
 Cisco VoIP – VoIP traffic data obtained at Cisco
 Unusual Phone Call
Rare Event Detection
Detected unusual weekend traffic pattern (figure contrasts weekday and weekend traffic).
Minnesota DOT Traffic Data
Our Approach to Detect Anomalies
 By learning what is normal, the model can
predict what is not
 Normal is based on likelihood of occurrence
 Use EMM to build model of behavior
 We view a rare event as:
 Unusual event
 Transition between event states which does not frequently occur.
 Base rare event detection on determining
events or transitions between events that do
not frequently occur.
 Continue learning
EMMRare
 EMMRare algorithm indicates if the current
input event is rare. Using a threshold
occurrence percentage, the input event is
determined to be rare if either of the following
occurs:
 The frequency of the node at time t+1 is
below this threshold
 The updated transition probability of the MC
transition from node at time t to the node at
t+1 is below the threshold
Determining Rare
 Occurrence Frequency (OFc) of a node Nc:
   OFc = CNc / Σi CNi
 Normalized Transition Probability (NTPmn), from one state, Nm, to another, Nn:
   NTPmn = CLmn / Σi CNi
Here CNi is the occurrence count of node Ni and CLmn is the count of transitions from Nm to Nn.
EMMRare
Given:
• Rule #1: CNi <= thCN
• Rule #2: CLij <= thCL
• Rule #3: OFc <= thOF
• Rule #4: NTPmn <= thNTP
Input: Gt: EMM at time t; i: current state at time t; R = {R1, R2, …, RN}: a set of rules
Output: At: Boolean alarm at time t
Algorithm: At = 1 if some rule Ri evaluates to True, and At = 0 if every Ri is False
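A sketch of this rule check on the count-based EMM sketch from earlier, assuming an alarm is raised as soon as any of Rules #1-#4 fires (matching the "either of the following occurs" wording above); the threshold parameters are illustrative.

```python
def emm_rare(emm, prev_node, curr_node, th_cn, th_cl, th_of, th_ntp):
    """Return True (alarm) if the current event looks rare under Rules #1-#4."""
    total = sum(emm.node_counts.values())
    cn = emm.node_counts.get(curr_node, 0)                 # CN_i
    cl = emm.link_counts.get((prev_node, curr_node), 0)    # CL_ij
    of = cn / total if total else 0.0                      # OF_c
    ntp = cl / total if total else 0.0                     # NTP_mn
    return cn <= th_cn or cl <= th_cl or of <= th_of or ntp <= th_ntp
```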
Rare Event in Cisco Data
Risk assessment
 Problem: Mitigate false alarm rate while maintaining a
high detection rate.
 Methodology:
 Historical feedback can be used as a free resource to filter out anomalies that are likely safe
 Combine the anomaly detection model with user feedback
 Risk level index
 Evaluation metrics: detection rate, false alarm rate (see the sketch after this list)
 Detection rate
   Detection rate = TP / (TP + FN)
 False alarm rate
   False alarm rate = FP / (TP + FP)
 Operational Curve
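A small sketch of the two metrics as written above (with the standard TP + FN denominator for detection rate); the counts would come from comparing raised alarms against labeled ground truth.

```python
def detection_rate(tp, fn):
    """Fraction of true anomalies that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def false_alarm_rate(tp, fp):
    """Fraction of raised alarms that turned out to be false."""
    return fp / (tp + fp) if (tp + fp) else 0.0
```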
Reducing False Alarms
•Calculate Risk using historical feedback
•Historical Feedback:
•Count of true alarms:
Detection Rate Experiments
(Figure: detection rate of the anomaly detection and risk assessment models versus (a) the Euclidean threshold for clustering (th), (b) the risk assessment weight factor (alpha), (c) the EMM state cardinality threshold (thNode), and (d) the EMM transition cardinality threshold (thLink).)
False Alarm Rate
(Figure: false alarm rate of the anomaly detection and risk assessment models versus (a) the Euclidean threshold for clustering (th), (b) the risk assessment weight factor (alpha), (c) the EMM state cardinality threshold (thNode), and (d) the EMM transition cardinality threshold (thLink).)
Outline
Introduction
EMM Overview
EMM Applications to Anomaly Detection
EMM Applications to Nuclear Testing
Future Work
Research Objectives
 Apply proven spatiotemporal modeling technique to seismic
data
 Construct EMM to model sensor data
 Local EMM at location or area
 Hierarchical EMM to summarize lower level models
 Represent all data in one vector of values
 EMM learns normal behavior
 Develop new similarity metrics to include all sensor data types
(Fusion)
 Apply anomaly detection algorithms
Input Data Representation
• Vector of sensor values (numeric) at precise time points or aggregated over time intervals.
• Need not come from same sensor types.
• Similarity/distance between vectors used to determine creation of new nodes in EMM.
EMM with Seismic Data
 Input – Wave arrivals (all or one per sensor)
 Identify states and changes of states in seismic
data
 The waveform would first have to be converted into a series of vectors representing the activity at various points in time.
 Initial Testing with RDG data
 Use amplitude, period, and wave type
New Distance Measure
 Data = <amplitude, period, wave type>
 Different wave type = 100% difference
 For events of same wave type:
 50% weight given to the difference in amplitude.
 50% weight given to the difference in period.
 If the distance is greater than the threshold, a state
change is required.
 amplitude = | amplitude_new – amplitude_average | / amplitude_average
 period = | period_new – period_average | / period_average
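A minimal sketch of this distance measure; the tuple layout and the use of running averages per node are assumptions for illustration.

```python
def seismic_distance(new, avg):
    """Distance between a new reading and the node's running average.
    Each argument is an (amplitude, period, wave_type) tuple."""
    if new[2] != avg[2]:
        return 1.0                                  # different wave type = 100% difference
    d_amplitude = abs(new[0] - avg[0]) / avg[0]     # relative amplitude difference
    d_period = abs(new[1] - avg[1]) / avg[1]        # relative period difference
    return 0.5 * d_amplitude + 0.5 * d_period

# A state change is required when the distance exceeds the clustering threshold.
```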
EMM with Seismic Data
States 1, 2, and 3 correspond to Noise, Wave A, and Wave B, respectively.
Preliminary Testing
 RDG data February 1, 1981 – 6
earthquakes
 Find transition times close to known
earthquakes
 9 total nodes
 652 total transitions
 Found all quakes
Outline
Introduction
EMM Overview
EMM Applications to Anomaly Detection
EMM Applications to Nuclear Testing
Future Work
Ongoing/Future Work
 Extend to Emerging Patterns
 Extend to Hierarchical/Distributed
 Yu Su
 Test with more data – KDD Cup
 Compare to other approaches
 Charlie Isaksson
 Apply to nuclear testing
Hierarchical EMM
(Figure: a hierarchy of EMMs in which several local EMMs feed regional EMMs, which in turn feed a single summary EMM.)
References
 Zhigang Li and Margaret H. Dunham, “STIFF: A Forecasting Framework for Spatio-Temporal Data,” Proceedings of the First International Workshop on Knowledge Discovery in Multimedia and Complex Data, May 2002, pp 1-9.
 Zhigang Li, Liangang Liu, and Margaret H. Dunham, “Considering Correlation Between Variables to Improve Spatiotemporal Forecasting,” Proceedings of the PAKDD Conference, May 2003, pp 519-531.
 Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
 Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal Environment,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
 Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
 Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
 Margaret H. Dunham and Vijay Kumar, “Stream Hierarchy Data Mining for Sensor Data,” Innovations and Real-Time Applications of Distributed Sensor Networks (DSN) Symposium, November 26, 2007, Shreveport, Louisiana.