CS548S16_Showcase_Clustering_II

CS 548 - Spring 2016
Clustering Showcase II
Presented by
Alexander W. Witt and Chengle Zhang
Showcasing work by
Alaa El Masri, Harry Weschler, Peter Likarish, Christopher Grayson, Calton Pu, Dalal Al-Arayed, and Brent ByungHoon Kang
A. El Masri, H. Weschler, P. Likarish, C. Grayson, C. Pu, D. Al-Arayed, and B. ByungHoon Kang. “Active authentication using scrolling behaviors,” Information and Communication Systems (ICICS), 2015 6th International Conference on, Amman, 2015, pp. 257-262. doi: 10.1109/IACS.2015.7103185
Presentation Outline
• References
• Abstract
• Introduction
• Terminology
• The Dataset
• Feature Representation
• Classification
• Clustering
• Clustering Approaches
• Classification Results
• Clustering Results
References
1. A. El Masri, H. Weschler, P. Likarish, C. Grayson, C. Pu, D. Al-Arayed, and B. ByungHoon Kang. “Active authentication using scrolling behaviors,” Information and Communication Systems (ICICS), 2015 6th International Conference on, Amman, 2015, pp. 257-262. doi: 10.1109/IACS.2015.7103185
2. P. N. Tan, M. Steinbach, and V. Kumar. “Cluster Analysis: Basic Concepts and Algorithms,” in Introduction to Data Mining, 1st Edition, Boston, MA: Pearson Education, Inc., 2006, pp. 496-515.
Part I
The Domain
Abstract
• Active authentication
• Scrolling behavior biometrics
• Event-driven temporal data
• Monitoring user reading habits
• Classification evaluation
• K-means clustering evaluation
• Contribution: features
• Contribution: use of k-means
Introduction
• Passive authentication does not guarantee the identity of the user throughout a session
• Scrolling is caused by keyboard or mouse events on a traditional computer
• The goal is to investigate the existence of patterns related to how users scroll electronic documents
• There are many factors that influence scrolling behavior
[Image: Example of passive authentication]
Terminology
• Biometric
• Authentication
• Active Authentication
• Reading Session (RS)
• Feature Vector
Part II
Dataset & Feature Representation
The Dataset
• Repurposed a dataset obtained from a previous research project at Georgia Tech
• The dataset was from an experiment aimed at detecting document access activities that indicate an insider threat
• Subjects logged into a web application and read a PDF document in their browser
• 84 subjects, 54 documents, 529 reading sessions
Feature Representation
• Three different feature vectors were derived from the data collected in the dataset
• The first feature vector contains statistical measurements for scrolling behavior
• The second feature vector contains information on the polarity of scrolling and the pauses between scrolls
• The 20 features present in the third feature vector were not mentioned specifically by the authors
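The first two feature vector types can be illustrated with a minimal sketch. The event format (timestamp, scroll offset), the chosen statistics, and the function names are all assumptions made for illustration; the slides do not list the exact features the authors used.

```python
from statistics import mean, pstdev

# Hypothetical reading session: (timestamp in seconds, scroll offset in pixels).
# The raw log layout is an assumption, not taken from the paper.
session = [(0.0, 0), (1.5, 120), (2.0, 300), (6.0, 250), (6.5, 600)]

def scroll_deltas(events):
    """Return (pause, distance) pairs between consecutive scroll events."""
    return [(t1 - t0, p1 - p0) for (t0, p0), (t1, p1) in zip(events, events[1:])]

def statistical_features(events):
    """Illustrative type-1 style vector: summary statistics of scroll distance."""
    dists = [abs(d) for _, d in scroll_deltas(events)]
    return [mean(dists), pstdev(dists), min(dists), max(dists)]

def polarity_and_pauses(events):
    """Illustrative type-2 style sequence: (direction, pause) per scroll.

    Direction is +1 for scrolling down, -1 for scrolling up, 0 for no movement.
    """
    return [((d > 0) - (d < 0), pause) for pause, d in scroll_deltas(events)]

type1_vector = statistical_features(session)
type2_sequence = polarity_and_pauses(session)
```

In this sketch a reading session of any length collapses to a fixed-length statistical vector, while the polarity and pause sequence keeps one entry per scroll.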
Part III
Authentication Methods
Classification
• “Is the current user the authenticated user or not?”
• Binary classification problem with two different classes: Authenticated and Imposters
• Random Forest (RF) with feature vector type 1
• SMOTE and AdaBoost applied independently to RF to decrease classifier bias toward the majority class
• Applied sub-sequencing to reduce classifier bias. The authors built a continuous feature vector of the second feature vector type across all reading sessions for a given user
[Figure: diagram with labels Q1, Q2, Q3 and P1, P2, P3, P4]
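The idea behind SMOTE, as generally described, is to add synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal sketch of that idea only; the function name, parameters, and toy data are hypothetical, and it is not the tooling the authors used.

```python
import math
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        n = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + u * (ni - xi) for xi, ni in zip(x, n)])
    return synthetic

# Toy minority class standing in for "Authenticated" sessions (made-up values).
authenticated = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(authenticated, n_new=4)
```

Each synthetic point lies on the line segment between two real minority samples, which enlarges the minority class without duplicating existing sessions.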
Clustering
• Attention turned from authenticating user identity to narrowing down the possibility that the current reading pattern belongs to a small set of registered users
• User profiles created from reading session data
• Calculating distance to a profile
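A minimal sketch of building a profile and measuring distance to it follows. It assumes a profile is the per-feature mean of a user's reading session vectors together with a per-feature standard error, and that distance is Euclidean; the slides do not state either choice, and all names and values are hypothetical.

```python
import math
from statistics import mean, stdev

def build_profile(session_vectors):
    """Per-feature mean and standard error over one user's reading sessions."""
    columns = list(zip(*session_vectors))
    n = len(session_vectors)
    centre = [mean(col) for col in columns]
    std_err = [stdev(col) / math.sqrt(n) for col in columns]
    return {"centre": centre, "std_err": std_err}

def distance_to_profile(vector, profile):
    """Euclidean distance from a session vector to the profile centre."""
    return math.dist(vector, profile["centre"])

# Two made-up reading sessions for one user, two features each.
user_sessions = [[1.0, 10.0], [3.0, 14.0]]
profile = build_profile(user_sessions)
d = distance_to_profile([2.0, 15.0], profile)
```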
Clustering Approaches
• Approach I: Simple ranking by distance
• Approach II: Filtering possible users by profile standard error
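The two approaches can be sketched as follows. The profile layout, the Euclidean distance, and the rule "every feature within `width` standard errors of the profile centre" are assumptions made for illustration; the slides do not give the authors' exact filtering rule. User names and numbers are invented.

```python
import math

# Hypothetical user profiles: per-feature centre and standard error.
profiles = {
    "alice": {"centre": [2.0, 12.0], "std_err": [1.0, 2.0]},
    "bob": {"centre": [8.0, 30.0], "std_err": [0.5, 1.0]},
    "carol": {"centre": [3.0, 15.0], "std_err": [1.0, 1.0]},
}

def rank_by_distance(vector, profiles):
    """Approach I sketch: order users from nearest to farthest profile."""
    return sorted(profiles, key=lambda u: math.dist(vector, profiles[u]["centre"]))

def filter_by_standard_error(vector, profiles, width=2.0):
    """Approach II sketch: keep users whose profile band contains the vector."""
    possible = []
    for user, p in profiles.items():
        inside = all(
            abs(v - c) <= width * se
            for v, c, se in zip(vector, p["centre"], p["std_err"])
        )
        if inside:
            possible.append(user)
    return possible

observed = [2.5, 14.0]  # feature vector of the current reading session
ranking = rank_by_distance(observed, profiles)
candidates = filter_by_standard_error(observed, profiles)
```

Ranking always returns every user in order, so its quality is judged by how often the true user appears near the top. Filtering instead returns a variable-size set of possible users, so it is judged by both the set size and how often the true user is in it.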
Part IV
Results
Classification Results
• Preprocessing: Eliminated sessions with fewer than 150 observations and users with fewer than 6 reading sessions. This left 29 users.
• RF with 10-fold cross-validation was run on the stratified data set. Best results were obtained with 40 trees and 9 features. F-measure was poor (0.27) for the Authenticated class and excellent (0.99) for the Imposter class. The poor performance was due to class imbalance.
• Applying SMOTE yielded a slight improvement: F-measure of 0.29 for the Authenticated class.
• AdaBoost increased the F-measure for the Authenticated class to 0.35 with DecisionStump and 0.39 with ADTree.
• When sub-sequencing was used for 44 users, the F-measure was 0.50 for the Authenticated class, meaning only neutral prediction capability.
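The gap between the two per-class F-measures is typical of imbalanced data. The sketch below computes the F-measure (harmonic mean of precision and recall) for each class from a confusion matrix; the counts are hypothetical and chosen only to show the effect, not taken from the paper.

```python
def f_measure(tp, fp, fn):
    """F-measure (F1) from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical test set: 10 Authenticated sessions and 990 Imposter sessions.
auth_as_auth = 2    # Authenticated sessions correctly accepted
auth_as_imp = 8     # Authenticated sessions rejected as Imposter
imp_as_auth = 3     # Imposter sessions wrongly accepted
imp_as_imp = 987    # Imposter sessions correctly rejected

# Treat each class in turn as the positive class.
f_authenticated = f_measure(tp=auth_as_auth, fp=imp_as_auth, fn=auth_as_imp)
f_imposter = f_measure(tp=imp_as_imp, fp=auth_as_imp, fn=imp_as_auth)
```

With these made-up counts the majority class scores close to 1 even though most minority sessions are misclassified, which is the same pattern as the reported results.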
Clustering Results
• Approach I: Simple ranking by distance
- 19.75% of the time the actual user's profile was predicted
- 58% - 80% of the time the actual user was in the top 5 and 10 profiles
• Approach II: Filtering by profile standard error
- 33% of profiles marked as possible
- 83.5% of the time the actual user was in the possible profiles
Questions?
Image Sources
• http://www.gadgetreview.com/wp-content/uploads/2012/11/applewireless-keyboard-2.jpg
• http://nakedconsumerism.com/wp-content/uploads/2016/02/mousejackexplained.jpg
• http://www.mrgeek.me/wp-content/uploads/2013/06/Passcode-iOS-7.png
• http://www.thegoodguys.com.au/cs/groups/public/documents/graphic/apple-iphone6s-img-7.png
• http://www.aoa.org/Images/public/GoodPosture.jpg
• http://www.mathworks.com/matlabcentral/mlcdownloads/downloads/submissions/24616/versions/12/screenshot.jpg
