A Data-Centric Approach to Insider Attack Detection in Database Systems

Sunu Mathew
Joint work with: Michalis Petropoulos, Hung Q. Ngo, Shambhu Upadhyaya
Computer Science and Engineering, University at Buffalo

13th International Symposium on Recent Advances in Intrusion Detection (RAID 2010)
Ottawa, Canada (September 2010)

Insider Threats against Database Systems

- Insider threats: a hard problem with vast scope
- Insiders operate within existing rules and policies
- High potential for damage:
  - Impersonation
  - Information leak
  - Data harvesting
- Most critical threat → insider access to sensitive information
- Hence, the focus here is an effective solution to insider threats against database systems

Contributions

- Modeling user queries from a data-centric viewpoint (as opposed to prior syntax-centric approaches)
- Extracting feature vectors based on the results of query execution
- Sampling strategies to address performance issues
- A taxonomy for analyzing query anomalies from a data-centric viewpoint

Outline

- Limitations of Syntax-Centric Approaches
- Data-Centric User Profiling
- Taxonomy of Query Anomalies (Data-Centric)
- Experimental Evaluation
- Conclusions and Future Work

Syntax-Centric Modeling

- Models queries using a characteristic or frequency vector built from parsed syntax elements (e.g., [Kamra et al., VLDB J. 2008]); see the sketch below

SELECT p.product_name, p.product_id
FROM PRODUCT p
WHERE p.cost = 100 AND p.weight > 80;

- SQL command: SELECT; select-clause relations: PRODUCT; select-clause attributes: product_name, product_id; where-clause relations: PRODUCT; where-clause attributes: cost, weight

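To make this representation concrete, here is a minimal Python sketch of extracting such a medium-grain summary. The names Quiplet and quiplet_of are mine, not the paper's, and the regex parser is a toy that handles only simple single-table SELECT queries with a WHERE clause, not the full parser of [Kamra et al.]:

import re
from dataclasses import dataclass, field

@dataclass
class Quiplet:
    # Syntax-centric summary: names of the command, relations, and attributes.
    command: str
    select_relations: set = field(default_factory=set)
    select_attributes: set = field(default_factory=set)
    where_relations: set = field(default_factory=set)
    where_attributes: set = field(default_factory=set)

def quiplet_of(sql: str) -> Quiplet:
    # Toy parser: SELECT <attrs> FROM <relation> <alias> WHERE <predicate>;
    m = re.match(r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s+(\w+)\s+WHERE\s+(.+?);?\s*$",
                 sql, re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("only simple single-table queries are supported")
    select_list, relation, alias, predicate = m.groups()

    def attrs(clause):
        # Collect attribute names referenced through the table alias.
        return {a.split(".")[1] for a in re.findall(rf"\b{alias}\.\w+", clause)}

    return Quiplet("SELECT", {relation}, attrs(select_list),
                   {relation}, attrs(predicate))

q = quiplet_of("SELECT p.product_name, p.product_id FROM PRODUCT p "
               "WHERE p.cost = 100 AND p.weight > 80;")
# q.select_attributes == {'product_name', 'product_id'}
# q.where_attributes  == {'cost', 'weight'}
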
Limitations of the Syntax-Centric Approach

- Two similar-looking queries can produce completely different results:

SELECT p.product_name, p.product_id FROM PRODUCT p
WHERE p.cost > 100 AND p.weight = 80;

  - False negative: similar syntax, different results
- Two different-looking queries can produce similar results:

SELECT p.product_name, p.product_id FROM PRODUCT p
WHERE p.cost = 100 AND p.weight > 80
AND p.product_name IS NOT NULL;

  - False positive: different syntax, same results
- Syntax analysis is hard! ([Calvanese et al., PODS '98])

Outline

- Limitations of Syntax-Centric Approaches
- Data-Centric User Profiling
- Taxonomy of Query Anomalies (Data-Centric)
- Experimental Evaluation
- Conclusions and Future Work

An Alternate Viewpoint: Data-Centric

- When are two queries similar or different?
- Model queries based on what data is accessed (i.e., the results of query execution)
- Visualize the database as a single relation (Universal Relation, [Ullman et al., TODS '84])
- Compute a statistical summary (S-Vector) of a query's result tuples
- The S-Vector is a fixed-size feature vector (its size depends on the DB schema, independent of DB size and number of result tuples)
- S-Vectors of past queries are used to profile each user

Statistics Vector for a Sample Database

Database schema and the corresponding S-Vector features:

Relation | Attribute              | S-Vector Features
---------|------------------------|--------------------------------------------
Product  | Product.type (varchar) | Product.type.ncount, Product.type.ndistinct
Product  | Product.cost (numeric) | Product.cost.Min, Product.cost.Max,
         |                        | Product.cost.Mean, Product.cost.StdDev,
         |                        | Product.cost.Median

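A minimal Python sketch of S-Vector extraction under the feature layout above (the helper name s_vector is mine, not the paper's code): character-valued columns contribute count/distinct-count features, numeric columns the five summary statistics.

import statistics

def s_vector(rows, schema):
    # rows: result tuples as dicts; schema: {column: 'varchar' | 'numeric'}.
    # Iterating over the full schema keeps the vector a fixed size:
    # attributes the query never touched simply stay at 0/None.
    sv = {}
    for col, ctype in schema.items():
        vals = [r[col] for r in rows if r.get(col) is not None]
        if ctype == "varchar":
            sv[col + ".ncount"] = len(vals)
            sv[col + ".ndistinct"] = len(set(vals))
        else:
            sv[col + ".Min"] = min(vals) if vals else None
            sv[col + ".Max"] = max(vals) if vals else None
            sv[col + ".Mean"] = statistics.mean(vals) if vals else None
            sv[col + ".StdDev"] = statistics.pstdev(vals) if vals else None
            sv[col + ".Median"] = statistics.median(vals) if vals else None
    return sv

rows = [{"Product.type": "abc", "Product.cost": 100},
        {"Product.type": "xyz", "Product.cost": 250}]
print(s_vector(rows, {"Product.type": "varchar", "Product.cost": "numeric"}))
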
Outline

- Limitations of Syntax-Centric Approaches
- Data-Centric User Profiling
- Data-Centric Analysis of Query Anomalies
- Experimental Evaluation
- Conclusions and Future Work

Analyzing Query Anomalies (1)

- Anomalies are analyzed based on the schema and statistics of the execution result
- Type 1: Different Result Schema / Different Result Tuples (different rows and columns)

Query 1: SELECT p.cost FROM PRODUCT p WHERE p.type = 'abc';
Query 2: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;

- Both syntax-centric and data-centric approaches should be effective
- Typically studied in the context of Role-Based Access Control (RBAC) and masquerade detection

Analyzing Query Anomalies (2)

- Type 2: Similar Result Schema / Different Result Tuples (similar columns, different rows)

Base query: SELECT * FROM PRODUCT p WHERE p.cost = 1000;

- Type 2(a): Distinct syntax

SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.type = 'abc';

- Type 2(b): Similar syntax

SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.cost > 1000;

- Syntax-centric schemes are "blind" to 2(b)

Analyzing Query Anomalies (3)

- Type 3: Similar Result Schema / Similar Result Tuples (similar rows and columns)

Base query: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;

- Type 3(a): Different syntax / similar semantics

SELECT p.type FROM PRODUCT p WHERE p.cost < 1000 AND
p.type IN (SELECT q.type FROM PRODUCT q);

  - Syntax-centric schemes generate false positives here
- Type 3(b): Different syntax / different semantics

SELECT p.type FROM PRODUCT p WHERE true;

  - The attacker gains additional knowledge (detected by syntax-centric, undetected by data-centric)
  - 3(b) is rare: the attacker has to control the result statistics

Data-Centric Classification of Query Anomalies

Anomaly                                     | Cases                                   | Detected by Syntax-Centric? | Detected by Data-Centric?
--------------------------------------------|-----------------------------------------|-----------------------------|--------------------------
Type 1 (Different Schema/Different Results) |                                         | Yes                         | Yes
Type 2 (Similar Schema/Different Results)   | a) Distinct Syntax                      | Yes                         | Yes
                                            | b) Similar Syntax                       | No                          | Yes
Type 3 (Similar Schema/Similar Results)     | a) Different Syntax/Similar Semantics   | False positive              | Yes (allowed)
                                            | b) Different Syntax/Different Semantics | Yes                         | No (rare)

Outline

- Limitations of Syntax-Centric Approaches
- Data-Centric User Profiling
- Data-Centric Analysis of Query Anomalies
- Experimental Evaluation
- Conclusions and Future Work

Experimental Evaluation

- Goals:
  - Type 1 and 2(a) anomalies (detected by both syntax-centric and data-centric): show that data-centric performs at least as well as syntax-centric
  - Type 2(b) anomalies (undetected by syntax-centric, detected by data-centric): show good performance for the data-centric approach
  - Type 3 is outside the scope here (addressed in other research)
- Compare with the syntax-centric approach of [Kamra et al. 2008]

Test Setup

[Diagram of the experimental test setup]

Query Result Sampling: Strategies

- A potential drawback of the data-centric approach: it requires query execution
  - The added database server load may not be significant
  - However, some examination of query results is needed for online detection
- Compromise: use only k tuples of the execution results to approximate the S-Vector (see the sketch below)
  - Top (Initial)-k
  - Random-k
- Suitable for the pipelined query execution model of commercial RDBMSs

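A minimal sketch of the two sampling strategies over a pipelined cursor (an iterator yielding result tuples one at a time). The function names are mine; random_k uses standard reservoir sampling, a reasonable fit for the pipelined model since it draws a uniform sample in one pass without knowing the result size in advance:

import random

def initial_k(cursor, k):
    # Top (Initial)-k: keep the first k tuples the executor pipelines out.
    sample = []
    for row in cursor:
        sample.append(row)
        if len(sample) == k:
            break
    return sample

def random_k(cursor, k, seed=None):
    # Random-k via reservoir sampling (Algorithm R).
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(cursor):
        if i < k:
            sample.append(row)
        else:
            j = rng.randint(0, i)   # uniform in [0, i]
            if j < k:               # row enters the reservoir w.p. k/(i+1)
                sample[j] = row
    return sample

# The approximate S-Vector is then computed from the sample, e.g.
# s_vector(initial_k(cursor, 20), schema) for S-V I-20.
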
Evaluating Type 1 and 2(a) (Different Query Result Schema / Different Syntax)

- Role-based anomaly (masquerade) detection; a sketch of the setup follows below
- Algorithms: Naive Bayes, decision tree, SVM, and Euclidean clustering
- Syntax-centric formats: C (crude) quiplet, M (medium) quiplet, F (fine) quiplet ([Kamra et al., VLDB J. 2008])
- Data-centric formats: S-Vector, plus S-V Initial and S-V Random with 20, 10, and 5 result tuples

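As an illustration of the role-based setup (not the authors' experimental harness), assuming each S-Vector has been flattened to an equal-length numeric list with None entries mapped to 0, one of the classifiers might be applied like this:

from sklearn.tree import DecisionTreeClassifier

def train_role_model(train_vectors, train_roles):
    # train_roles: the role (e.g., 'Chair' or 'Faculty') that issued each query.
    model = DecisionTreeClassifier()
    model.fit(train_vectors, train_roles)
    return model

def is_masquerade(model, query_sv, claimed_role):
    # Flag the query if the classifier attributes it to a different role.
    return model.predict([query_sv])[0] != claimed_role
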
Results: Type 1 and 2(a) (Different Query Result Schema / Different Syntax)

Detection accuracy (%) for the Chair vs. Faculty roles:

Algorithm  | C. quip | M. quip | F. quip | S-V (all) | S-V I-20 | S-V R-20 | S-V I-10 | S-V R-10 | S-V I-5 | S-V R-5
-----------|---------|---------|---------|-----------|----------|----------|----------|----------|---------|--------
N-Bayes    | 81.67   | 85.33   | 75      | 85        | 85       | 82.67    | 78.33    | 77       | 81.67   | 90
Dec-Tree   | 88      | 87.67   | 87.67   | 96.33     | 88.3     | 88.3     | 89       | 88.67    | 88.67   | 88.67
SVM        | 83.3    | 81      | 87.67   | 82.33     | 74.67    | 77       | 71.33    | 75.67    | 68      | 74.33
Clustering | 73.3    | 72      | 65.6    | 92        | 92.6     | 92.33    | 94       | 94       | 92.6    | 93.33

- Similar performance as expected, with high detection rates (close to 100%)
- Clustering performs better with the S-Vector
- Good results with S-V Initial and S-V Random sampling

Evaluating Type 2(b) Anomalies (Different Query Result Stats / Similar Syntax)

- Volume anomalies, sensitive tuples
- Supervised two-class training is not feasible
- Manually created 'attack set' based on the dataset
  - Base query: SELECT * FROM vApplicants WHERE reviewStatusID = 'a' AND reviewStatusID = 'b';
  - Variation: SELECT * FROM vApplicants WHERE reviewStatusID = 'b' OR reviewStatusID = 'b';
- One-class anomaly detection schemes:
  - Cluster-based outlier detection
  - Attrib-Deviation

Evaluating Type 2(b) (Different Query Result Stats / Similar Syntax): Schemes

- Cluster-based outlier detection
  - Queries can be considered as points in some space; construct role (e.g., Faculty) clusters
  - Approximate a cluster by a single point (e.g., its centroid); use Euclidean distance and a threshold for anomaly detection
- Attrib-Deviation
  - Data anomalies → anomalies in S-Vector dimensions
  - Perform anomaly detection on a per-dimension basis (e.g., a 3·StdDev threshold); the number of anomalous dimensions is used to rank test queries
- Both schemes are sketched below

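Minimal sketches of the two schemes on plain numeric S-Vector lists; the threshold calibration and distance details are my assumptions, not the paper's exact procedure:

import math

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_outlier(train, test_sv, threshold):
    # Cluster-based: anomalous if the Euclidean distance from the test
    # query's S-Vector to the role centroid exceeds a threshold
    # (e.g., calibrated on the training distances).
    return math.dist(centroid(train), test_sv) > threshold

def attrib_deviation_score(train, test_sv, k=3.0):
    # Attrib-Deviation: count the dimensions where the test value deviates
    # from the training mean by more than k standard deviations; test
    # queries are then ranked by this count.
    score = 0
    for dim, vals in enumerate(zip(*train)):
        mu = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
        if abs(test_sv[dim] - mu) > k * sd:
            score += 1
    return score
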
Results: Type 2(b) Anomalies (Different Query Result Stats / Similar Syntax)

Detection and false-positive rates (%); C/M/F qlet are syntax-centric, the S-V columns data-centric:

Algorithm        | Rate           | C qlet | M qlet | F qlet | S-V all | S-V I-20 | S-V R-20 | S-V I-10 | S-V R-10 | S-V I-5 | S-V R-5
-----------------|----------------|--------|--------|--------|---------|----------|----------|----------|----------|---------|--------
Cluster-Outlier  | Detection      | 23.5   | 26.4   | 17.6   | 83.8    | 12       | 67.7     | 6.4      | 45.1     | 6.4     | 35.4
Cluster-Outlier  | False Positive | 14.4   | 11.8   | 15.8   | 10.5    | 3.9      | 6.5      | 3.9      | 5.2      | 2.6     | 6.6
Attrib-Deviation | Detection      | 0      | 17.6   | 2.9    | 87      | 87       | 87       | 87       | 87       | 12.9    | 64.5
Attrib-Deviation | False Positive | 0      | 4.8    | 4.8    | 22.6    | 26       | 15       | 23.8     | 15.8     | 23.8    | 20.4

- Cluster-based: data-centric is better; Initial-k performs poorly
- Attrib-Deviation: data-centric is better, and Initial-k is effective!

Initial-k Sampling: Performance

- But why does Initial-k perform well with Attrib-Deviation?
  - A single anomalous attribute → many anomalous S-Vector dimensions
  - Because detection works on a per-dimension basis, the selective ordering of the first k tuples affects each anomalous attribute differently
  - Different statistical measurements may be affected differently (e.g., Max, Median)
  - Performance drops only when the result sample size is too low (e.g., k = 5)
- Initial-k's performance has positive practical implications

Outline

- Limitations of Syntax-Centric Approaches
- Data-Centric User Profiling
- Data-Centric Analysis of Query Anomalies
- Experimental Evaluation
- Conclusions and Future Work

Conclusions and Future Work

- Handle queries other than Select-Project-Join:
  - Union queries:

SELECT g.name, g.gpa FROM GRADS g
UNION
SELECT u.name, u.gpa FROM UGRADS u;

  - Computed attributes (Avg, etc.)
  - GROUP BY attributes
  - ORDER BY may lead to false positives
- Dynamic databases, correlated queries
- Feasibility of a randomized result set
- Combining syntax-centric and data-centric schemes

Questions?
Thank You!