
A Data-Centric Approach to
Insider Attack Detection in
Database Systems
Authors: Sunu Mathew, Michalis Petropoulos, Hung Q. Ngo,
Shambhu Upadhyaya
Recent Advances in Intrusion Detection Symposium
(RAID 2010)
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: [email protected]
2016/4/7
Data Mining & Machine Learning Lab
Insider Threats against Database Systems
Insider Threats – Hard problem with vast scope
Operate within existing rules and policies
High potential for damage
Impersonation
Information Leak
Data Harvesting
Most critical threat → insider access to sensitive information
Hence, focus on effective solution to insider
threats against Database Systems
Contributions
Modeling user queries from a data-centric viewpoint (as opposed to prior syntax-centric approaches)
Extracting feature vectors based on the results
of query execution
Sampling strategies to address performance
issues
Taxonomy for analyzing query anomalies from a
data-centric viewpoint
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future work
Syntax Centric Modeling
Models queries using a characteristic or
frequency vector based on parsing syntax
elements (e.g., [Kamra et al., VLDB J. 2008] )
SELECT p.product_name, p.product_id
FROM PRODUCT p
WHERE p.cost = 100 and p.weight > 80;
▫ SQL command – SELECT; Select Clause Relations – PRODUCT; Select Clause Attributes – product_name, product_id; Where Clause Relations – PRODUCT; Where Clause Attributes – cost, weight
Limitations of Syntax Centric Approach
Two similar looking queries can produce
completely different results
SELECT p.product_name, p.product_id FROM PRODUCT p
WHERE p.cost > 100 and p.weight = 80;
False negative – similar syntax, different results
Two different looking queries can produce
similar results
SELECT p.product_name, p.product_id FROM PRODUCT p
WHERE p.cost = 100 and p.weight > 80
AND p.product_name is not null;
False positive – different syntax, same results
Syntax analysis is hard! ([Calvanese et al., PODS 1998])
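The false-negative case above can be sketched in Python (toy rows and a hand-built medium quiplet, purely illustrative): both queries map to one and the same syntax profile, yet they touch different data.

```python
# Both WHERE clauses from the slide parse to the same medium quiplet:
# (command, select relations, select attributes, where relations, where attributes)
quiplet_a = ("SELECT", {"PRODUCT"}, {"product_name", "product_id"}, {"PRODUCT"}, {"cost", "weight"})
quiplet_b = ("SELECT", {"PRODUCT"}, {"product_name", "product_id"}, {"PRODUCT"}, {"cost", "weight"})

# Hypothetical PRODUCT rows: (product_name, product_id, cost, weight)
rows = [("widget", 1, 100, 90), ("gadget", 2, 150, 80), ("gizmo", 3, 100, 70)]

result_a = [r for r in rows if r[2] == 100 and r[3] > 80]  # cost = 100 AND weight > 80
result_b = [r for r in rows if r[2] > 100 and r[3] == 80]  # cost > 100 AND weight = 80

print(quiplet_a == quiplet_b)  # True: the syntax-centric model sees one profile
print(result_a == result_b)    # False: the queries access different tuples
```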
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Taxonomy of Query Anomalies (Data-Centric)
Experimental Evaluation
Conclusions and Future work
Alternate viewpoint – Data Centric
When are two queries similar/different?
Model based on what data is accessed (i.e.,
results of query execution)
Compute a statistical summary (S-Vector) of a
query’s result tuples
Fixed size feature vector (depends on DB
schema, independent of DB size and number of
result tuples)
S-Vectors for past queries used to profile user
Statistics Vector for a Sample Database

Database Schema                     S-Vector Features
Relation: Product
  Product.type (varchar)            Product.type.ncount
                                    Product.type.ndistinct
  Product.cost (numeric)            Product.cost.Min
                                    Product.cost.Max
                                    Product.cost.Mean
                                    Product.cost.StdDev
                                    Product.cost.Median
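A minimal sketch (Python; the function name and sample rows are illustrative, not the paper's code) of assembling this S-Vector from a query's result tuples:

```python
import statistics

def s_vector(rows):
    """Summarize result tuples (type: str, cost: float) into the fixed-size
    feature vector shown above; its size depends only on the schema."""
    types = [t for t, _ in rows]
    costs = [c for _, c in rows]
    return {
        "Product.type.ncount":    len(types),
        "Product.type.ndistinct": len(set(types)),
        "Product.cost.Min":       min(costs),
        "Product.cost.Max":       max(costs),
        "Product.cost.Mean":      statistics.mean(costs),
        "Product.cost.StdDev":    statistics.pstdev(costs),
        "Product.cost.Median":    statistics.median(costs),
    }

sv = s_vector([("abc", 100.0), ("abc", 200.0), ("xyz", 300.0)])
print(sv["Product.type.ndistinct"], sv["Product.cost.Median"])  # 2 200.0
```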
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future work
Analyzing query anomalies -- 1
Based on schema, statistics of execution result
Type 1: Different Result Schema/Different
Result Tuples (Different Rows and Columns)
Query 1: SELECT p.cost FROM PRODUCT p
WHERE p.type = ’abc’;
Query 2: SELECT p.type FROM PRODUCT p
WHERE p.cost < 1000;
Both syntax-centric/data-centric should be effective
Typically studied in the context of Role Based Access Control
(RBAC) and Masquerade detection
Analyzing query anomalies -- 2
• Type 2: Similar Result Schema/Different Result
Tuples (Similar Columns, Different Rows)
Base Query: SELECT * FROM PRODUCT p
WHERE p.cost = 1000;
Type 2(a): Distinct Syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.type = 'abc';
Type 2(b): Similar Syntax
SELECT * FROM PRODUCT p WHERE p.cost < 1000 AND p.cost >
1000;
Syntax-centric schemes “blind” to 2(b)
Analyzing query anomalies -- 3
• Type 3: Similar Result Schema/Similar Result
Tuples (Similar rows and columns )
Base Query: SELECT p.type FROM PRODUCT p WHERE p.cost < 1000;
Type 3(a): Different Syntax/Similar Semantics
SELECT p.type FROM PRODUCT p WHERE p.cost < 1000 AND p.type IN
(SELECT q.type FROM PRODUCT q);
Syntax-centric generates false positives
Type 3(b): Different Syntax/Different Semantics
SELECT p.type FROM PRODUCT p WHERE true;
Attacker gains additional knowledge (detected by syntax-centric,
undetected by data-centric)
3(b) is rare (attacker has to control result statistics)
Data-Centric Classification of Query Anomalies

Anomaly Types / Cases                          Detected by       Detected by
                                               Syntax-Centric?   Data-Centric?
Type 1 (Different Schema/Different Results)    Yes               Yes
Type 2 (Similar Schema/Different Results)
  a) Distinct Syntax                           Yes               Yes
  b) Similar Syntax                            No                Yes
Type 3 (Similar Schema/Similar Results)
  a) Different Syntax/Similar Semantics        False positive    Yes (allowed)
  b) Different Syntax/Different Semantics      Yes               No (Rare)
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future work
Experimental Evaluation
Goals
Type 1 and 2(a) anomalies (detected by both syntax-centric and data-centric):
Show that data-centric performs at least as well as syntax-centric
Type 2(b) (undetected by syntax-centric, detected by data-centric):
Show good performance for the data-centric approach
Type 3 is outside scope (addressed in other
research)
Compare with syntax centric ([Kamra et al. 2008])
Test Setup
Query Result Sampling -- Strategies
A potential drawback of data-centric – Requires
query execution
Database server load may not be significant
However, some examination of query results is
needed for online detection
Compromise – use only k tuples of the execution results to
approximate the S-Vector
Top (Initial) – k
Random – k
Suitable for the pipelined query execution model of commercial RDBMSs
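Both strategies can be sketched in a single pass over the result stream (Python; names are illustrative). Random-k is done here with reservoir sampling, one possible way to draw a uniform k-sample in one pass that still fits a pipelined executor; the paper's exact sampling mechanism is not specified on this slide.

```python
import random

def top_k(result_iter, k):
    """Initial-k: keep the first k tuples as they stream out of the executor."""
    out = []
    for row in result_iter:
        out.append(row)
        if len(out) == k:
            break
    return out

def random_k(result_iter, k, seed=None):
    """Random-k via reservoir sampling: a uniform k-sample in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(result_iter):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir

print(top_k(iter(range(100)), 5))  # [0, 1, 2, 3, 4]
```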
Evaluating Type 1, 2(a) (Different Query Result
Schema/Different Syntax)
Role based anomaly (masquerade) detection
Algorithms: Naïve Bayes, Decision Tree, SVM and Euclidean
clustering
Syntax-centric formats – C (coarse) quiplet, M (medium) quiplet, F (fine) quiplet ([Kamra et al., VLDB J. 2008])
Data-Centric formats (S-Vector, S-V Initial and S-V Random
with 20, 10 and 5 result tuples)
Results – Type 1, 2(a) (Different Query Result Schema/Different Syntax)

Detection accuracy (%) for Chair vs. Faculty roles:

Algorithm    C.quip  M.quip  F.quip  S-V all  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
N-Bayes      81.67   85.33   75      85       85        82.67     78.33     77        81.67    90
Dec-Tree     88      87.67   87.67   96.33    88.3      88.3      89        88.67     88.67    88.67
SVM          83.3    81      87.67   82.33    74.67     77        71.33     75.67     68       74.33
Clustering   73.3    72      65.6    92       92.6      92.33     94        94        92.6     93.33

Similar performance as expected, with high detection rates
(close to 100%)
Clustering performs better with the S-Vector
Good results with S-V Initial and S-V Random
Evaluating Type 2(b) anomalies (Different Query
Result Stats/ Similar Syntax)
Volume anomalies, Sensitive tuples
Supervised two-class training is not feasible
Manually created `attack set’ based on dataset
Base query: SELECT * FROM vApplicants WHERE
reviewStatusID = 'a' OR reviewStatusID = 'b';
Variation: SELECT * FROM vApplicants WHERE reviewStatusID
= 'b' OR reviewStatusID = 'b';
One-class Anomaly Detection
Cluster-based Outlier Detection
Attrib-Deviation
Evaluating Type 2(b) (Different Query Result
Stats/ Similar Syntax) -- Schemes
Cluster based outlier detection
Queries can be considered as points in some space,
construct role (e.g., Faculty) clusters
Approximate a cluster by a single point (e.g.,
centroid of cluster), use Euclidean distance &
threshold for anomaly detection
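A toy sketch (Python, made-up numbers) of the centroid scheme: approximate the role cluster by its centroid and flag any query whose S-Vector lies beyond a distance threshold.

```python
import math

def centroid(vectors):
    """Component-wise mean of the role's training S-Vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def is_outlier(vec, center, threshold):
    """Euclidean distance from the role centroid vs. a fixed threshold."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, center)))
    return dist > threshold

# Toy S-Vectors for one role (e.g., Faculty), already normalized
profile = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
c = centroid(profile)                  # approximately [1.0, 2.0]
print(is_outlier([1.1, 2.1], c, 0.5))  # False: close to the profile
print(is_outlier([5.0, 9.0], c, 0.5))  # True: anomalous
```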
Attrib-Deviation
Data anomalies → anomalies in S-Vector dimensions
Perform anomaly detection on a per-dimension
basis (e.g., 3*StdDev threshold), number of
anomaly dimensions used to rank test queries
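Attrib-Deviation can be sketched as follows (Python; toy training profile, with the 3*StdDev rule mentioned above):

```python
import statistics

def attrib_deviation_score(train_vectors, test_vector):
    """Count how many S-Vector dimensions of the test query fall outside
    mean +/- 3*StdDev of the training profile; higher score = more anomalous."""
    score = 0
    for dim in range(len(test_vector)):
        vals = [v[dim] for v in train_vectors]
        mu = statistics.mean(vals)
        sigma = statistics.pstdev(vals)
        if abs(test_vector[dim] - mu) > 3 * sigma:
            score += 1
    return score

# Toy 2-dimensional S-Vectors from a user's past queries
train = [[100, 10], [110, 12], [90, 11], [105, 9]]
print(attrib_deviation_score(train, [102, 10]))  # 0: within profile
print(attrib_deviation_score(train, [500, 10]))  # 1: one anomalous dimension
```

Test queries can then be ranked by this score, as on the slide.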
Results – Type 2(b) anomalies (Different Query Result Stats/Similar Syntax)

Detection and false-positive rates (%); C/M/F qlet are syntax-centric, S-V columns are data-centric:

Algorithm                         C qlet  M qlet  F qlet  S-V all  S-V I-20  S-V R-20  S-V I-10  S-V R-10  S-V I-5  S-V R-5
Cluster-Outlier   Detection       23.5    26.4    17.6    83.8     12        67.7      6.4       45.1      6.4      35.4
                  False Positive  14.4    11.8    15.8    10.5     3.9       6.5       3.9       5.2       2.6      6.6
Attrib-Deviation  Detection       0       17.6    2.9     87       87        87        87        87        12.9     64.5
                  False Positive  0       4.8     4.8     22.6     26        15        23.8      15.8      23.8     20.4
• Cluster-based – Data-centric better, Initial-k poor
• Attrib-Deviation – Data-centric better, Initial-k effective!
Initial-k Sampling -- Performance
But why does Initial-k perform well with Attrib-Deviation?
Single anomalous attribute → many anomaly dimensions
Per dimension basis -- selective ordering affects
anomaly attributes differently
Different statistical measurements may be affected
differently (e.g., Max, Median)
Performance drops only when result sample size is too
low (e.g., k = 5)
Positive practical implications of Initial-k
performance
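A quick illustration (Python, toy data) of how selective ordering skews different statistics under Initial-k: on an ascending result stream, the first-k prefix badly underestimates Max while leaving Min exact.

```python
# Result tuples ordered by cost ascending (e.g., an ORDER BY in the query)
costs = sorted([5, 40, 12, 99, 73, 61, 28, 85, 17, 54])

k = 5
initial_k = costs[:k]              # Initial-k sees only the low end of the stream
print(max(initial_k), max(costs))  # 40 99 -- Max is badly underestimated...
print(min(initial_k), min(costs))  # 5 5   -- ...while Min is exact
```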
Outline
Limitations of Syntax Centric Approaches
Data-Centric User Profiling
Data-Centric Analysis of Query Anomalies
Experimental Evaluation
Conclusions and Future Work
Conclusions and Future Work
Queries other than Select-Project-Join
Union Queries
SELECT g.name, g.gpa FROM GRADS g
UNION
SELECT u.name, u.gpa FROM UGRADS u;
Computed Attributes (Avg, etc.)
GROUP-BY attributes
ORDER-BY may lead to false positives
Dynamic databases, correlated queries
Feasibility of randomized result set
Combining Syntax and Data Centric schemes
Questions?
Thank You!