pptx - UMIACS - University of Maryland

Download Report

Transcript pptx - UMIACS - University of Maryland

5. Machine Learning
ENEE 759D | ENEE 459D | CMSC 858Z
Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD
Today’s Lecture
• Where we’ve been
– Big Data
– Statistics
– MapReduce
– Interpretation of results
• Where we’re going today
– Machine learning
• Where we’re going next
– Part 2 of course: Security and InSecurity in the Real World
– 2 readings each lecture
2
Machine Learning
Systems that automatically learn programs from data
P. Domingos, CACM 2012
• Supervised learning: have inputs and associated outputs
– Learn relationships between them using available training data (also
called “labeled data”, “ground truth”)
– Predict future values
– Classification: The output (learned attribute) is categorical
– Regression: The output (learned attribute) is numeric
• Unsupervised learning: have only inputs
– Learn “latent” labels
– Clustering: Identify natural groups in the data
3
Rules
Weather and golf
outlook
overcast
overcast
overcast
overcast
rainy
rainy
rainy
rainy
rainy
sunny
sunny
sunny
sunny
sunny
temp
cool
hot
hot
mild
cool
mild
cool
mild
mild
hot
hot
mild
cool
mild
humidity windy
normal
TRUE
high
FALSE
normal
FALSE
high
TRUE
normal
TRUE
high
TRUE
normal
FALSE
high
FALSE
normal
FALSE
high
FALSE
high
TRUE
high
FALSE
normal
FALSE
normal
TRUE
play
yes
yes
yes
yes
no
no
yes
yes
yes
no
no
no
yes
yes
• Want to decide when to play
– Create rules based on attributes
• Example: 1 attribute
if (outlook == “rainy”)
then play = “no”
else play = “yes”
– Errors: 6/14
• Can refine rule by adding
conditions on other attributes
– Create a decision tree
4
Entropy
Which attribute do we choose at each level?
• Consider two sequences of coin flips
– How much information do we get after flipping each coin once?
– We want some function “Information” that satisfies:
Information1&2(p1p2) = Information1(p1) + Information2(p2)
– Expected Information = “Entropy”
I(X) = -log2 px
H (X) = E(I(X)) = -å px log2 px
x
• Examples
– Flipping a coin
H(X) = - ( 0.5log2 0.5+ 0.5log2 0.5) =1
! In learning the outcome of the coin flip we learned 1 bit of information
æ1
1ö
– Rolling a fair die
H (X) = 6 ´ - ç log 2 ÷ » 2.58
è6
6ø
! A die is more unpredictable than a coin
H(X) » 2.16
– Rolling a weighted die with p1..5=0.1, p6=0.5
! A weighted die is less unpredictable than a fair die
Decision Tree
• At each level, choose the attribute
with the highest information gain
Weather and golf
outlook
overcast
overcast
overcast
overcast
rainy
rainy
rainy
rainy
rainy
sunny
sunny
sunny
sunny
sunny
temp
cool
hot
hot
mild
cool
mild
cool
mild
mild
hot
hot
mild
cool
mild
humidity windy
normal
TRUE
high
FALSE
normal
FALSE
high
TRUE
normal
TRUE
high
TRUE
normal
FALSE
high
FALSE
normal
FALSE
high
FALSE
high
TRUE
high
FALSE
normal
FALSE
normal
TRUE
play
yes
yes
yes
yes
no
no
yes
yes
yes
no
no
no
yes
yes
– The one that reduces the
unpredictability the most
• Before: 9/14 “yes” outcomes =>
H=0.94
– Outlook: H=0.69
• 4/4 “yes” for overcast (H=0)
• 3/5 “yes” for rainy (H=0.97)
• 2/5 “yes” for sunny (H=0.97)
– Temperature: H=0.91
– Humidity: H=0.94
– Windy: 0.87
• Outlook provides highest
information gain: 0.94 – 0.69 = 0.25
6
Resulting Decision Tree
• Putting the decision tree together
– Choose the attribute with the highest Information Gain
– Create branches for each value of attribute
– Discretize continuous attributes (choose partition with highest gain)
– R package: rpart
• Not a perfect classification (still makes some incorrect decisions)
7
Overfitting
• Low error on training data and high error on test data
– “If the knowledge and data we have
are not sufficient to completely
determine the correct classifier, […]
we run the risk of just hallucinating
a classifier that […] simply encodes
random quirks in the data.”
– P. Domingos, CACM’12
Underfitting
• Some algorithms can prune the tree to avoid overfitting
Overfitting
8
Confusion Matrix
How to determine if the classifier does a good job?
• You need a training set (ground truth) and a testing set
– Or you can split your ground truth into two data sets
– Even better: K-fold cross-validation
• Select K samples without replacement and train classifier multiple times
• You can make a mistake in two different ways
Predicted Predicted +
True True Negative (TN)
Correct decision
False Positive (FP)
Type 1 error
True +
False Negative (FN)
Type 2 error
True Positive (TP)
Correct decision
9
Evaluating Results
Is it better to have low FPs or low FNs?
• There is usually a trade-off between FPs and FNs
• Sensitivity = TP / (TP+FN)
– Ability to identify true positives
– Also called true positive rate
• Specificity = TN / (FP + TN)
– Ability to rule out true negatives
TP rate (Sensitivity)
– Reducing type 1 errors causes more type 2 errors, and vice-versa
Evaluating keystroke dynamics
[Killourhy & Maxion, DSN’09]
– Also called true negative rate
FP rate (1 – Specificity)
• Can plot a Receiver Operating Characteristic (ROC) curve
– R package: ROCR
10
Unsupervised Learning
• Agglomerative hierarchical clustering (R: hclust)
– No ground truth; goal is to identify patterns that describe the data
– Start from individual points and progressively merge nearby clusters
– Distance metric (e.g. Euclidian, rank correlation, Gower)
– Linkage: how to aggregate pairwise point distances into cluster distances
• Average? Minimum (single)? Maximum (complete)? Variance decrease (Ward)?
Dendrogram of 1970 cars
(features: MPG, weight, drive ratio)
! Choose classification or
clustering features carefully
Additional Machine Learning Resources
• Classification
– We saw: decision trees
– Other classifiers: naïve Bayes, Support Vector Machines (SVM)
– Natural language processing
• Text mining (R package: tm)
• Sentiment analysis (annotated English wordlist:
http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010)
• Clustering
– We saw: hierarchical clustering
– Other clustering techniques: k-means, k-medoids, time series clustering
– Dimensionality reduction: principal component analysis (PCA)
• Machine learning tools
– For R: http://cran.r-project.org/web/views/MachineLearning.html
– For Hadoop: Mahout (http://mahout.apache.org/)
12
Project Peer-Reviews
• Pilot project reports
– Reports due today
• Discuss hypothesis (security problem and data analyzed to solve it)
• Feasibility study
• Report data volume, velocity, variety and quality
– Post report on Piazza
• Pilot project peer reviews
– Review at least 2 project reports from other students
• Use skills learned from paper reviews
– Peer reviews are a part of your grade
– Post reviews on Piazza (as follow-ups to report posts) by Monday
13
Review of Lecture
• What did we learn?
– Classification
– Clustering
• What’s next?
– Paper discussion: ‘Sex, Lies and Cyber-crime Surveys’
– Next lecture: start of part 2 of course – 2 readings / lecture
• Deadline reminders
– Pilot project reports due today
– Pilot project reviews due Monday
– Group project proposals due Monday, 09/30
14