Data Mining in Cyber Threat Analysis
Download
Report
Transcript Data Mining in Cyber Threat Analysis
Data Mining for Cyber Threat Analysis
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar
Project Participants:
A. Lazarevic, V. Kumar, J. Srivastava
H. Ramanni, L. Ertoz, M. Joshi, E. Eilertson, S. Ketkar
Mining Large Data Sets - Motivation
Examples:
Computational simulations
Information Assurance &
Network intrusion
Sensor networks
Computational Simulations
Homeland Defense
There is often information
“hidden” in the data that is
not readily evident.
Sensor Networks
Human analysts may take
weeks to discover useful
information.
Much of the data is never
analyzed at all.
Network Intrusion Detection
Data Mining for Homeland Defense
“Data mining and data
warehousing are part of a much
larger FBI plan to … discover
patterns and relationships that
indicate criminal activity” (network
intrusions, cyber attacks,
terroristic calls, …)” Federal
Computer Week, June 3, 2002
FBI Director Robert Mueller: “New
information technology is critical
top conducting business in a
different way, critical to analyzing
and sharing information on a real
time basis”
Vision Statement
to Guide Research in
Question & Answering (Q&A) and Text Summarization
by
John D. Prange1, Eduard Hovy2, and Steve Maiorano3
1. INTRODUCTION
Recent developments in natural language processing R&D have made it clear that
formerly independent technologies can be harnessed together to an increasing degre
order to form sophisticated and powerful information delivery vehicles. Information
retrieval engines, text summarizers, question answering systems, and language
translators provide complementary functionalities which can be combined to serve a
variety of users, ranging from the casual user asking questions of the web (such as a
schoolchild doing an assignment) to a sophisticated knowledge worker earning a livin
(such as an intelligence analyst investigating terrorism acts).
A particularly useful complementarity exists between text summarization and question
answering systems. From the viewpoint of summarization, question answering is one
way to provide the focus for query-oriented summarization. From the viewpoint of
question answering, summarization is a way of extracting and fusing just the relevant
information from a heap of text in answer to a specific non-factoid question. Howeve
both question answering and summarization include aspects that are unrelated to the
other. Sometimes, the answer to a question simply cannot be summarized: either it i
brief factoid (the capital of Switzerland is Berne) or the answer is complete in itself (g
me the text of the Pledge of Allegiance). Likewise, generic (author’s point of view
summaries) do not involve a question; they reflect the text as it stands, without input f
the system user.
This document describes a vision of ways in which Question Answering and
Summarization technology can be combined to form truly useful information delivery
tools. It outlines tools at several increasingly sophisticated stages. This vision, and t
staging, can be used to inform R&D in question answering and text summarization. T
purpose of this document is to provide a background against which NLP research
sponsored by DRAPA, ARDA, and other agencies can be conceived and guided. An
important aspect of this purpose is the development of appropriate evaluation tests a
measures for text summarization and question answering, so as to most usefully focu
research without over-constraining it.
1
Technical Director, Advanced Research and Development Activity (ARDA), R&E Building STE 6530, 9800
Savage Road, Fort Meade, MD 20755-6530. [email protected]
2
Natural Language Group, University of Southern California-Information Sciences Institute, 4676 Admiralty Wa
Marina del Rey, CA 90292-6695. [email protected]
3
Advanced Analytic Tools, LF-7, Washington, DC 20505. [email protected]
Homeland Defense: Key issues
Information fusion from diverse data sources including
intelligence, agencies, law enforcement, profile …
Data mining on this information base to uncover latent
models and patterns
Visualization and display tools for understanding the
relationships between persons, events and patterns of
behavior
S
h
e
l
v
e
s
Dor
ChairBasebordHat ChCahirair Plant
Window
Event
recog
Assoc
analysis
Threat
Visualizer
Threat
predictor
C
r
e
d
e
n
z
a
Information
Fusion
Cultural Intelligence Law enforcement
Data
Data
Data
Information Assurance: Introduction
As the cost of the information processing and Internet
accessibility falls, more and more organizations are
becoming vulnerable to potential cyber threats
“unlawful attacks and threats of attack against computers,
networks, and the information stored therein when done to
intimidate or coerce a government or its people“ – D. Denning
Incidents Reported to CERT/CC
60000
50000
40000
30000
20000
10000
0
90
91
92
93
94
95
96
97
98
99
00
01
Information Assurance: Intrusion Detection
Intrusion Detection: Detecting a set of actions
that compromise the integrity, confidentiality, or
availability of information resources.
Viruses and Internet worms
Theft of classified information from DOD computers
Problem of identifying individuals
who are using computers without authorization
who have legitimate access but are abusing their privileges
Intrusion Detection System (IDS)
combination of software and hardware that attempts to perform
intrusion detection
raise the alarm when
possible intrusion happens
Data Mining on Intelligence Databases
Purpose:
Develop methods to identify potential threats
Mine intelligence database
Example: Forecasting Militarized Interstate Disputes (FORMIDs).
Data: social, political, economic, geographical information for pairs of
countries
ratio of military capability
democracy index
level of trade
distance
Predict: the likelihood of militarized interstate disputes (MIDs).
Overall Objective: predict likely instabilities involving pairs of countries.
Collaborators: Sean O’Brien, Center of Army Analysis (CAA),
Kosmo Tatalias (NCS).
Data Mining in Commercial Word
Given its success in commercial applications, data mining holds great
promise for analyzing large data sets.
Employed
No
NO
Yes
# of years
2
<2
4
# of years
in school
NO
Yes
>4
Married
YES
Classification / Predictive Modeling {Direct Marketing,
Fraud Detection}
TID
Items
1
2
3
4
5
Bread, Milk
Beer, Diaper, Bread, Eggs
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Bread, Diaper, Milk
Clustering (Market segmentation)
{Diaper , Milk } {Beer }
Association Patterns
Marketing / Sales Promotions
Key Technical Challenges
Large data size
Gigabytes of data are common
High dimensionality
Thousands of dimensions are possible
Spatial/temporal nature of the data
Data points that are close in time and space
are highly related
Skewed class distribution
“Mining needle in a haystack.
So much hay and so little time”
Interesting events are very rare looking for the “needle in a haystack”
Data Fusion & Data Preprocessing
Data from multiple sources
Data of multiple types (text, images, voice, … )
Data cleaning – missing value imputation, scaling, mismatch handling
Intrusion Detection Research at AHPCRC
Misuse Detection - Predictive models
Mining needle in a haystack – models must be able to handle skewed class
distribution, i.e., class of interest is much smaller than other classes.
Learning from data streams - intrusions are sequences of events
Anomaly and Outlier Detection
Able to detect novel attacks through outlier detection schemes
Detect deviations from “normal” behavior as anomalies
Construction of useful features that may be used in data mining
Modifying signature based intrusion detection (SNORT) systems to
incorporate anomaly detection algorithms
Summer Institute Projects
Implementing Anomaly/Outlier detection algorithms
Investigating algorithms for classification of rare classes
Visualizing tool for monitoring network traffic and suspicious network behavior
Learning from Rare Class
Key Issue: Examples from rare classes get
missed in standard data mining analysis
Over-sampling the small class or under-sampling
the large class
PNrule and related work [Joshi, Agarwal, Kumar, SIAM
2001, SIGMOD 2001]
RareBoost [Joshi, Agarwal, Kumar, ICDM 2001, KDD 2002]
SMOTEBoost [Lazarevic, et al, in review]
Classification based on association - add frequent items
as “meta-features” to original data set
PN-rule Learning and Related Algorithms*
P-phase:
cover most of the positive examples with high support
seek good recall
N-phase:
remove FP from examples covered in P-phase
N-rules give high accuracy and significant support
C
NC
C
NC
Existing techniques can possibly
PNrule can learn strong signatures
learn erroneous small signatures for
for presence of NC in N-phase
absence of C
* - SIAM 2001, SIGMOD 2001, ICDM 2001, KDD 2002
SMOTE and SMOTEBoost
SMOTE (Synthetic Minority Oversampling Technique)
generates artificial examples from minority (rare)
class along the boundary line segment
Generalization of over-sampling technique
Combination of SMOTE and boosting further improves
the prediction performance or rare classes
SMOTE and SMOTEBoost Results
Experimental Results on modified KDDCup 1999 data set
Final values for recall, precision and F-value for U2R class on KDDCup-99 intrusion data set
Method
Standard RIPPER
SMOTE Nu2r=100, Nr2l=100
+
Nu2r=300, Nr2l=100
RIPPER Nu2r=500, Nr2l=100
Recall Precision F-value
57.35
61.76
80.15
75.74
84.78
86.60
85.16
88.03
Method
Recall Precision F-value
68.42
Standard Boosting
80.147 90.083
72.1
N =100, Nr2l=100 87.5
87.5
SMOTE- u2r
82.58 Boost Nu2r=300, Nr2l=100 88.24 90.91
81.42
Nu2r=500, Nr2l=100 87.5
92.97
84.83
87.5
89.55
90.15
Final values for recall, precision and F-value for R2L class on KDDCup-99 intrusion data set
Method
Standard RIPPER
SMOTE Nu2r=100, Nr2l=100
+
Nu2r=300, Nr2l=100
RIPPER Nu2r=500, Nr2l=100
Recall Precision F-value
75.98
87.94
77.45
71.75
96.72
94.47
92.03
89.38
Method
85.11
Standard Boosting
86.90
N =100, Nr2l=100
SMOTE- u2r
84.11
Nu2r=300, Nr2l=100
Boost
79.06
Nu2r=500, Nr2l=100
Recall Precision F-value
95.46
97.02
97.07
97.38
96.83
96.54
95.25
95.84
96.14
96.78
96.15
96.72
Classification Based on Associations
Current approaches use confidence-like measures
to select the best rules to be added as features into
the classifiers.
This may work well only if each class is well-represented in
the data set.
For the rare class problems, some of the high recall
itemsets could be potentially useful, as long as their
precision is not too low.
Our approach:
Apply frequent itemset generation algorithm to each class.
Select itemsets to be added as features based on precision,
recall and F-Measure.
Apply classification algorithm, i.e., RIPPER, to the new data
set.
Experimental Results (on modified KDD Cup 1999 data)
RIPPER with high Precision rules
Original RIPPER
Precision
dos
u2r
r2l
probe
normal
99.50%
84.78%
96.72%
97.22%
95.52%
Recall
F-Measure
99.16%
57.35%
75.98%
90.27%
99.30%
99.33%
68.42%
85.11%
93.62%
97.37%
RIPPER with high Recall rules
Precision
dos
u2r
r2l
probe
normal
99.51%
90.09%
92.90%
96.96%
96.29%
Recall
98.95%
73.53%
75.88%
96.69%
98.88%
Precision
dos
u2r
r2l
probe
normal
99.53%
88.57%
96.54%
97.40%
96.80%
Recall
99.65%
68.38%
78.91%
94.89%
99.25%
F-Measure
99.59%
77.18%
86.84%
96.13%
98.01%
RIPPER with high F-measure rules
F-Measure
99.23%
80.97%
83.53%
96.83%
97.57%
Precision
dos
u2r
r2l
probe
normal
99.48%
94.17%
96.14%
98.25%
97.18%
Recall
99.79%
83.09%
84.31%
92.11%
99.26%
F-Measure
99.63%
88.28%
89.84%
95.08%
98.21%
For rare classes, rules ordered according to F-Measure produce the best results.
Anomaly and Outlier
Detection
Anomalous activities
Missed
attacks
False
alarm
Normal profile
Main Assumptions
All anomalous activities need closer inspection
Determine “normal activity profile” and flag an alarm when the
state differs from the “normal profile”
Expert analyst examines suspicious activity to make final
determination whether activity is indeed an intrusion
Drawbacks
Possible large number of false alarms and not recognizing
attacks
Supervised (with access to normal data) vs.
Unsupervised (with NO access to normal data)
determining “normal behavior”
Outlier Detection
Outlier is defined as a data point which is very different
from the rest of the data (“normal data”) based on
some measure of similarity
Outlier detection approaches:
Statistics based approaches
Distance based techniques
Clustering based approaches
Density based schemes
Distance and density based schemes
Distance based approaches (NN approach) - Outliers
are points that do not have enough neighbors
Density based approach (LOF approach) finds outliers
based on the densities of local neighborhoods
Concept of locality becomes difficult to define due to data
sparsity in high dimensional space
Clustering based approaches define outliers as points
which do not lie in clusters
Implicitly define outliers as background noise
Outlier Detection Results (on DARPA’98 data)
Detection Rate (False alarm rate was fixed to 1%)
NN approach
LOF
Bursty attacks
15/19 (78.9%)
14/19 (73.7%)
Non-bursty attacks 9 / 25 (36.0%) 14 / 25 (56.0%)
1
0.9
The score values assigned to
network connections from
bursty attacks
Connection score
0.8
0.7
0.6
0.5
0.4
LOF approach
0.3
0.2
NN aproach
0.1
0
0
10
20
30
40
50
60
Connection number
70
80
90
100
Modifying SNORT
SNORT contains simple SPADE (Statistical Packet
Anomaly Detection Engine)
SPADE only compares the statistics of packets
Our approach
Integrate our implemented outlier detection schemes into the
SNORT since it is and open source code
Improve the detection of novel intrusions and suspicious
behavior by using sophisticated outlier detection schemes
SNN Clustering – Finding Patterns
in Noisy Data
• Finds clusters of arbitrary
shapes, sizes and densities
• Handles high dimensional data
• Density invariant
• Built-in noise removal
Topics from Los Angeles Times (Jan. 1989)
• 3204 articles, 31472 words (LA Times, January 1989)
• afghanistan embassi guerrilla kabul moscow rebel soviet troop
ussr withdraw
• chancellor chemic export german germani kadafi kohl libya libyan
plant poison weapon west
• ahead ball basket beate brea chri coach coache consecut el final
finish foul fourth free game grab half halftim hill host jef lead
league led left los lost minut miss outscor overal plai player
pointer quarter ralli rank rebound remain roundup scor score
scorer season shot steal straight streak team third throw ti tim trail
victori win won
• ab bengal bowl cincinnati craig dalla denver esiason field football
francisco giant jerri joe miami minnesota montana nfl oppon pass
pittsburgh quarterback rice rush super table taylor terri
touchdown yard
Topics from FBI web site
• uss rocket cabin aircraft fuel hughes twa missile redstone
•
(1999 - Congressional Statement - TWA 800)
• theft art stolen legislation notices recoveries
•
(FBI - Art Theft Program)
• signature ink writings genuine printing symbols
•
(forensic science communications)
• forged memorabilia authentic bullpen
•
(FBI Major Investigation - operation bullpen)
• arabia saudi towers dhahran
•
(June 25 1996 bombing of the Khobar Towers military housing complex in
Dhahran Kingdom of Saudi Arabia REWARD)
• classified philip quietly ashcroft hanssen drop cia affidavit tenet
dedication compromised kgb successor helen volunteered
•
(Agent Robert Philip Hanssen - espionage)
Topics from FBI web site
• afghanistan ramzi bin yousef islamic jihad egyptian bombings
egypt pakistan hamas yemen headquartered usama laden kenya
tanzania nairobi embassies dar salaam rahman mohamed abdel
affiliated camps opposed deserve legat enemies vigilance plots
casualties
• enterprise asian chinese enterprises korean vietnamese italian
cartels heroin cosa nostra sicilian lcn
• firearms firearm bullets ammunition cartridges
• perpetrating bioterrorism responders credible exposed biological
articulated covert hoax wmd assumes
Future Applications
Unclassified telephone calls data to be provided by
INSCOM
Goal: to determine a terrorist in a haystack
Nodes
people
Edges
telephone calls (date / time / duration)
Conclusions
Predictive models specifically designed for rare class
can help in improving the detection of small attack
types
Simple outlier detection approaches appear capable of
detecting anomalies
Clustering based approaches show promise in
identifying novel attack types
Integration data mining techniques into SNORT should
improve the detection rate
Data Mining Process
• Data mining – “non-trivial extraction of implicit, previously
unknown and potentially useful information from data”
Back
Modified KDDCup 1999 Data Set
• KDDCup 1999 data is based on DARPA 1998
data set
• Remove duplicates and merge new train and
test data sets
• Sample 69,980 examples from the merged data
set
– Sample from neptune and normal subclass. Other
subclasses remain intact.
• Divide in equal proportion to training and test
sets
Back
DARPA 1998 Data Set
DARPA 1998 data set (prepared and managed by MIT
Lincoln Lab) includes a wide variety of intrusions
simulated in a military network environment
9 weeks of raw TCP dump data
7 weeks for training (5 million connection records)
2 weeks for training (2 million connection records)
Connections are labeled as normal or attacks (4 main
categories of attacks - 38 attack types)
DOS- Denial Of Service
Probe - e.g. port scanning
U2R - unauthorized access to root privileges,
R2L - unauthorized remote login to machine,
Two types of attacks
Bursty attacks
- involve multiple network connections
Non-bursty attacks - involve single network connections
Back to KDDCup
Back to Experiments
Terrorist Threat Analyzer & Predictor (T-TAP)
Terrorist Threat Analyzer & Predictor
Operational Capability:
Event
recog
Assoc
analysis
Threat
Visualizer
Threat
predictor
S
h
e
l
v
e
s
Dor
ChairBasebordHt ChCahirair Plant
Window
C
r
e
d
e
n
z
a
T-TAP
System
Cultural
Data
Information
Fusion
Intelligence
Data
AHPCRC
University of Minnesota
•Ability to match data from multiple sources, resolving
structural and semantic conflicts.
•Ability to recognize events and behaviors, each of
whose (partial) information is available in different data
streams.
•Ability to identify latent associations between suspects
and their linkages to events/behaviors.
•Capability to predict threatening activities and
behaviors with high probability.
•Ability to visualize interesting and significant events
and behaviors.
Law enforcement
Data
Proposed Technical Approach: New Effort
Rough Order of Magnitude Cost and Schedule:
Key Technologies:
•High dimensional data clustering – METIS.
•Spatio-temporal change point detection.
•Association analysis and belief revision.
•High dimensional classification – SVM, NN, boosting.
Task List:
•T1: Develop data fusion algorithms to match diverse
intelligence, law enforcement, and cultural data.
•T2: Develop event & behavior recognition algorithms
across multiple, multi-media, data streams.
•T3: Association analysis algorithms to determine
hidden connections between suspects, events and
behaviors; develop networks of associations.
•T4: Predictive models of threatening events and
behaviors.
•T5: Interestingness & relevance based visualization
models of significant events and behaviors.
•Tasks 1, 2, 3 will each proceed in parallel for the first 12
months, with version 1 released at the end of 6 months.
At this point tasks 4, 5 will start and proceed in parallel.
Cross feedback across various tasks will lead to final,
refined tools at the end of the 18 month period.
Deliverables:
•Database schema and structures to store terrain info.
•Software implementation of concealed cavities
prediction model.
•User manuals, test reports, database schema
•Quarterly technical and status reports, and final report.
Corporate Information:
•Vipin Kumar (POC)
•Army Research Center, University of Minnesota, 1100
Washington Avenue SE, Minneapolis, MN 55455
•Phone (612)626-8095; E-mail: [email protected]
AHPCRC
University of Minnesota
Distributed Virtual Integrated Threat Analysis Database (DVITAD)
Distributed Virtual Integrated Threat Analysis Database
S
h
e
l
v
e
s
Dor
ChairBasebordHt ChCahirair Plant
Window
C
r
e
d
e
n
z
a
Information integration:
matching, clustering,
profiles, networks, tracks
Database Connectors
Threat Analysis
Tools
Operational Capability:
•Global integrated view across multiple databases:
integrated schema, global dictionary and directory,
common ontology
•Threat analysis object repository
•comprehensive suspect dossier
•temporal activity tracks
•association networks
•Interactive database exploration
•Field and value based querying
•Approximate match based retrieval
Virtual Integrated Database
Proposed Technical Approach: New Effort
Key Technologies:
•Semantic object matching
•Clustering large datasets – METIS
•Latent/Indirect association analyzer
Task List:
•T1: Data normalization through the development of
wrappers/connectors for various databases.
•T2: Integrated schema creation to model information
found in all databases.
•T3: Matching entities (suspects, events, etc.) across
multiple data sources; resolving conflicting attributes
values for entities across databases.
•T4: Clustering of suspect profiles.
•T5: Building networks of hidden associations between
suspects.
•T6: Constructing temporal activity tracks of events and
linkage of suspects to the tracks.
Rough Order of Magnitude Cost and Schedule:
•Tasks 1 and 2 will each proceed in parallel for the first
12 months, with version 1 released at the end of 6
months. At this point tasks 3,4,5,6 will start and proceed
in parallel. Cross feedback across various tasks will
lead to final, refined tools at the end of the 18 month
period.
Deliverables:
•Global schema, dictionary, and directory for the
integrated database.
•Software that realizes the virtual integrated DB view.
•User manuals, test reports, database schema
•Quarterly technical and status reports, and final report.
Corporate Information:
•Vipin Kumar (POC)
•Army Research Center, University of Minnesota, 1100
Washington Avenue SE, Minneapolis, MN 55455
•Phone (612)626-8095; E-mail: [email protected]