Sunu Mathew - Recent Advances in Intrusion Detection
A Data-Centric Approach to Insider
Attack Detection in Database Systems
Sunu Mathew
Joint Work with: Michalis Petropoulos, Hung Q. Ngo,
Shambhu Upadhyaya
Computer Science and Engineering, Univ. at Buffalo
13th International Symposium on Recent Advances in Intrusion Detection (RAID 2010)
Ottawa, Canada (September 2010)
Insider Threats against Database Systems
Insider Threats – Hard problem with vast scope
Operate within existing rules and policies
High potential for damage
Impersonation
Information Leak
Data Harvesting
Most critical threat: insider access to
sensitive information
Hence, focus on an effective solution to insider
threats against database systems
2
Contributions
Modeling user queries from a data-centric
viewpoint (as opposed to prior syntax-centric approaches)
Extracting feature vectors based on the results
of query execution
Sampling strategies to address performance
issues
Taxonomy for analyzing query anomalies from
a data-centric viewpoint
3
Outline
Limitations of Syntax-Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future Work
4
Syntax-Centric Modeling
Models queries using a characteristic or
frequency vector based on parsed syntax
elements (e.g., [Kamra et al., VLDB J. 2008])
SELECT p.product_name, p.product_id
FROM PRODUCT p
WHERE p.cost = 100 AND p.weight > 80;
SQL command: SELECT
Select clause relations: PRODUCT
Select clause attributes: product_name, product_id
Where clause relations: PRODUCT
Where clause attributes: cost, weight
5
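The example's feature extraction can be sketched as code. Below is a toy extractor (Python; naive regexes, not a real SQL parser, and only for simple single-table SELECT ... FROM ... WHERE queries) that builds a crude-quiplet-style tuple of the command plus counts of clause elements; the exact tuple layout here is an illustrative assumption, not the paper's definition:

```python
import re

def crude_quiplet(query):
    """Toy syntax-centric feature extraction (illustration only,
    not a real SQL parser). Returns roughly the shape of a crude
    (C) quiplet: (command, #select-clause relations,
    #select-clause attributes, #where-clause relations,
    #where-clause attributes)."""
    q = query.strip().rstrip(";")
    command = q.split()[0].upper()
    select_part = re.search(r"SELECT\s+(.*?)\s+FROM", q, re.I | re.S).group(1)
    from_part = re.search(r"FROM\s+(.*?)(?:\s+WHERE|$)", q, re.I | re.S).group(1)
    where_m = re.search(r"WHERE\s+(.*)$", q, re.I | re.S)
    select_attrs = [a.strip() for a in select_part.split(",")]
    relations = [r.strip() for r in from_part.split(",")]
    # attributes compared in the WHERE clause, e.g. "p.cost ="
    where_attrs = re.findall(r"(\w+\.\w+)\s*[=<>]", where_m.group(1)) if where_m else []
    # sketch: both relation counts come from the FROM list
    return (command, len(relations), len(select_attrs), len(relations), len(where_attrs))

print(crude_quiplet(
    "SELECT p.product_name, p.product_id FROM PRODUCT p "
    "WHERE p.cost = 100 AND p.weight > 80;"
))
# ('SELECT', 1, 2, 1, 2)
```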
Limitations of the Syntax-Centric Approach
Two similar looking queries can produce
completely different results
SELECT p.product_name, p.product_id FROM PRODUCT p
WHERE p.cost > 100 and p.weight = 80;
False negative – similar syntax, different results
Two different looking queries can produce
similar results
SELECT p.product_name, p.product_id FROM PRODUCT p
WHERE p.cost = 100 and p.weight > 80
AND p.product_name is not null;
False positive – different syntax, same results
Syntax analysis is hard! ([Calvanese et al., PODS 1998])
6
Outline
Limitations of Syntax-Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future Work
7
Alternate viewpoint – Data-Centric
When are two queries similar/different?
Model based on what data is accessed (i.e., the
results of query execution)
Visualize the database as a single relation
(Universal Relation, [Ullman et al., TODS 1984])
Compute a statistical summary (S-Vector) of a
query's result tuples
Fixed-size feature vector (depends on the DB
schema, independent of DB size and number of
result tuples)
S-Vectors of past queries are used to profile each user
8
Statistics Vector for Sample Database

Database Schema (relation: Product)    S-Vector Features
Product.type (varchar)                 Product.type.ncount, Product.type.ndistinct
Product.cost (numeric)                 Product.cost.Min, Product.cost.Max, Product.cost.Mean,
                                       Product.cost.StdDev, Product.cost.Median
9
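Such a statistics vector can be sketched in a few lines (Python with sqlite3; the toy PRODUCT table and its values are made up to mirror the slide's schema). For brevity this sketch spans only the result's columns, whereas the paper's S-Vector is fixed-size over the whole database schema:

```python
import sqlite3
import statistics

def s_vector(conn, query):
    """Compute S-Vector-style features over a query's result tuples:
    Min/Max/Mean/StdDev/Median for numeric columns, value count and
    distinct count for text columns (sketch, not the paper's code)."""
    cur = conn.execute(query)
    cols = [d[0] for d in cur.description]
    rows = cur.fetchall()
    features = {}
    for i, col in enumerate(cols):
        vals = [r[i] for r in rows if r[i] is not None]
        if vals and all(isinstance(v, (int, float)) for v in vals):
            features[f"{col}.Min"] = min(vals)
            features[f"{col}.Max"] = max(vals)
            features[f"{col}.Mean"] = statistics.mean(vals)
            features[f"{col}.StdDev"] = statistics.pstdev(vals)
            features[f"{col}.Median"] = statistics.median(vals)
        else:
            features[f"{col}.ncount"] = len(vals)
            features[f"{col}.ndistinct"] = len(set(vals))
    return features

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCT (type TEXT, cost NUMERIC)")
conn.executemany("INSERT INTO PRODUCT VALUES (?, ?)",
                 [("abc", 100), ("abc", 200), ("xyz", 400)])
sv = s_vector(conn, "SELECT type, cost FROM PRODUCT WHERE cost < 500")
print(sv["cost.Mean"], sv["type.ndistinct"])
```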
Outline
Limitations of Syntax-Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future Work
10
Analyzing query anomalies -- 1
Based on the schema and statistics of the execution result
Type 1: Different Result Schema/Different
Result Tuples (different rows and columns)
Query 1: SELECT p.cost FROM PRODUCT p
WHERE p.type = 'abc';
Query 2: SELECT p.type FROM PRODUCT p
WHERE p.cost < 1000;
Both syntax-centric and data-centric approaches should be effective
Typically studied in the context of Role-Based Access
Control (RBAC) and masquerade detection
11
Analyzing query anomalies -- 2
Type 2: Similar Result Schema/Different Result
Tuples (similar columns, different rows)
Base Query: SELECT * FROM PRODUCT p
WHERE p.cost = 1000;
Type 2(a): Distinct Syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND
p.type = 'abc';
Type 2(b): Similar Syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND
p.cost > 1000;
Syntax-centric schemes are "blind" to 2(b)
12
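The 2(b) pair can be checked directly: the two queries are nearly identical syntactically, yet their result statistics differ completely. A quick sqlite3 sketch over a toy PRODUCT table (the data values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCT (product_name TEXT, cost NUMERIC, type TEXT)")
conn.executemany("INSERT INTO PRODUCT VALUES (?, ?, ?)",
                 [("widget", 500, "abc"), ("gadget", 1000, "abc"), ("gizmo", 1500, "xyz")])

base = "SELECT * FROM PRODUCT WHERE cost = 1000"
# contradictory predicate: syntactically close to the base query,
# but it can never match a row
similar = "SELECT * FROM PRODUCT WHERE cost < 1000 AND cost > 1000"

print(len(conn.execute(base).fetchall()))     # 1 row
print(len(conn.execute(similar).fetchall()))  # 0 rows (empty result)
```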
Analyzing query anomalies -- 3
Type 3: Similar Result Schema/Similar Result
Tuples (similar rows and columns)
Base Query: SELECT p.type FROM PRODUCT p WHERE p.cost
< 1000;
Type 3(a): Different Syntax/Similar Semantics
SELECT p.type FROM PRODUCT p WHERE p.cost < 1000 AND
p.type IN (SELECT q.type FROM PRODUCT q);
Syntax-centric generates false positives
Type 3(b): Different Syntax/Different Semantics
SELECT p.type FROM PRODUCT p WHERE true;
Attacker gains additional knowledge (detected by
syntax-centric, undetected by data-centric)
3(b) is rare (attacker has to control result statistics)
13
Data-Centric Classification of Query Anomalies

Anomaly Cases                                   Detected by        Detected by
                                                Syntax-Centric?    Data-Centric?
Type 1 (Different Schema/Different Results)     Yes                Yes
Type 2 (Similar Schema/Different Results)
  a) Distinct Syntax                            Yes                Yes
  b) Similar Syntax                             No                 Yes
Type 3 (Similar Schema/Similar Results)
  a) Different Syntax/Similar Semantics         False positive     Yes (allowed)
  b) Different Syntax/Different Semantics       Yes                No (rare)
14
Outline
Limitations of Syntax-Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future Work
15
Experimental Evaluation
Goals:
Type 1 and 2(a) anomalies (detected by both
syntax-centric and data-centric approaches):
Show that data-centric performs at least as well
as syntax-centric
Type 2(b) anomalies (undetected by syntax-centric, detected
by data-centric):
Show good performance for the data-centric
approach
Type 3 is outside the scope (addressed in other
research)
Compare with a syntax-centric scheme ([Kamra et al. 2008])
16
Test Setup
17
Query Result Sampling -- Strategies
A potential drawback of the data-centric approach – it requires
query execution
The extra database server load may not be significant
However, some examination of query results is
needed for online detection
Compromise – use only k tuples of the execution
results to approximate the S-Vector
Top (Initial)-k
Random-k
Suitable for the pipelined query execution
model of commercial RDBMSs
18
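The two sampling strategies might be sketched as follows (Python; `top_k` reads only the first k tuples off the cursor, `random_k` uses one-pass reservoir sampling over the result stream -- both are illustrative assumptions, not the paper's implementation):

```python
import random
import sqlite3

def top_k(cursor, k):
    """Initial-k: take the first k tuples of the result stream.
    Cheap under a pipelined execution model, since the rest of
    the result need not be examined."""
    return cursor.fetchmany(k)

def random_k(cursor, k, seed=None):
    """Random-k: one-pass reservoir sampling over the result
    stream, yielding k tuples uniformly at random."""
    rng = random.Random(seed)
    sample = []
    for n, row in enumerate(cursor):
        if n < k:
            sample.append(row)
        else:
            # keep the new row with probability k/(n+1)
            j = rng.randint(0, n)
            if j < k:
                sample[j] = row
    return sample

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCT (product_id INTEGER)")
conn.executemany("INSERT INTO PRODUCT VALUES (?)", [(i,) for i in range(100)])

first5 = top_k(conn.execute("SELECT product_id FROM PRODUCT"), 5)
rand5 = random_k(conn.execute("SELECT product_id FROM PRODUCT"), 5, seed=0)
print(first5)        # first five tuples in pipeline order
print(len(rand5))    # five tuples sampled from the whole result
```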
Evaluating Type 1, 2(a) (Different Query
Result Schema/Different Syntax)
Role-based anomaly (masquerade) detection
Algorithms: Naïve Bayes, Decision Tree, SVM, and
Euclidean clustering
Syntax-centric formats: C (crude) quiplet, M
(medium) quiplet, F (fine) quiplet ([Kamra et al.,
VLDB J. 2008])
Data-centric formats: S-Vector, S-V Initial, and S-V
Random with 20, 10, and 5 result tuples
19
Results – Type 1, 2(a) (Different Query Result Schema/Different Syntax)

Detection accuracy (%), roles: Chair vs Faculty

Algorithm    C.quip  M.quip  F.quip  S-V(all)  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
N-Bayes      81.67   85.33   75      85        85        82.67     78.33     77        81.67    90
Dec-Tree     88      87.67   87.67   96.33     88.3      88.3      89        88.67     88.67    88.67
SVM          83.3    81      87.67   82.33     74.67     77        71.33     75.67     68       74.33
Clustering   73.3    72      65.6    92        92.6      92.33     94        94        92.6     93.33
Similar performance as expected, with high detection
rates (close to 100%)
Clustering performs better with the S-Vector
Good results with S-V Initial and S-V Random
20
Evaluating Type 2(b) anomalies (Different
Query Result Stats/Similar Syntax)
Volume anomalies, sensitive tuples
Supervised two-class training is not feasible
Manually created 'attack set' based on the dataset
Base query: SELECT * FROM vApplicants WHERE
reviewStatusID = 'a' AND reviewStatusID = 'b';
Variation: SELECT * FROM vApplicants WHERE
reviewStatusID = 'b' OR reviewStatusID = 'b';
One-class anomaly detection:
Cluster-based Outlier Detection
Attrib-Deviation
21
Evaluating Type 2(b) (Different Query
Result Stats/Similar Syntax) -- Schemes
Cluster-based outlier detection
Queries can be considered as points in some
space; construct role (e.g., Faculty) clusters
Approximate a cluster by a single point (e.g., the
centroid of the cluster), and use Euclidean distance
with a threshold for anomaly detection
Attrib-Deviation
Data anomalies → anomalies in S-Vector dimensions
Perform anomaly detection on a per-dimension
basis (e.g., a 3*StdDev threshold); the number of
anomalous dimensions is used to rank test queries
22
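Both one-class schemes can be sketched compactly (Python; S-Vectors reduced to plain lists of floats, and the toy training profile and threshold below are made-up illustrations, not the paper's parameters):

```python
import statistics
from math import dist

def centroid_outlier(train_vectors, test_vector, threshold):
    """Cluster-based outlier test: flag a query whose S-Vector lies
    farther than `threshold` (Euclidean) from the role's centroid."""
    dims = len(train_vectors[0])
    centroid = [statistics.mean(v[d] for v in train_vectors) for d in range(dims)]
    return dist(centroid, test_vector) > threshold

def attrib_deviation(train_vectors, test_vector, n_std=3.0):
    """Attrib-Deviation: count the S-Vector dimensions where the test
    query deviates by more than n_std standard deviations from the
    training mean; the count ranks test queries by anomalousness."""
    anomalous = 0
    for d in range(len(test_vector)):
        col = [v[d] for v in train_vectors]
        mu, sigma = statistics.mean(col), statistics.pstdev(col)
        if abs(test_vector[d] - mu) > n_std * sigma:
            anomalous += 1
    return anomalous

# toy role profile: two S-Vector dimensions (e.g., cost.Mean, cost.Max)
train = [[1.0, 10.0], [1.2, 11.0], [0.8, 9.0]]
print(centroid_outlier(train, [1.0, 100.0], threshold=5.0))  # True
print(attrib_deviation(train, [1.0, 100.0]))                 # 1 anomalous dimension
```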
Results – Type 2(b) anomalies (Different Query Result Stats/Similar Syntax)

Rates (%)
                                  Syntax-Centric          Data-Centric
Algorithm         Rate            C qlet  M qlet  F qlet  S-V all  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
Cluster-Outlier   Detection       23.5    26.4    17.6    83.8     12        67.7      6.4       45.1      6.4      35.4
Cluster-Outlier   False Positive  14.4    11.8    15.8    10.5     3.9       6.5       3.9       5.2       2.6      6.6
Attrib-Deviation  Detection       0       17.6    2.9     87       87        87        87        87        12.9     64.5
Attrib-Deviation  False Positive  0       4.8     4.8     22.6     26        15        23.8      15.8      23.8     20.4
Cluster-based – data-centric better, Initial-k poor
Attrib-Deviation – data-centric better, Initial-k effective!
23
Initial-k Sampling -- Performance
But why does Initial-k perform well with Attrib-Deviation?
A single anomalous attribute → many
anomalous dimensions
On a per-dimension basis, selective ordering
affects anomalous attributes differently
Different statistical measurements may be
affected differently (e.g., Max, Median)
Performance drops only when the result sample
size is too low (e.g., k = 5)
Initial-k's performance has positive practical
implications
24
Outline
Limitations of Syntax-Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future Work
25
Conclusions and Future Work
Queries other than Select-Project-Join
Union Queries
SELECT g.name, g.gpa FROM GRADS g
UNION
SELECT u.name, u.gpa FROM UGRADS u;
Computed Attributes (Avg, etc.)
GROUP-BY attributes
ORDER-BY may lead to false positives
Dynamic databases, correlated queries
Feasibility of randomized result sets
Combining syntax-centric and data-centric schemes
26
Questions?
Thank You!
27