CS 590M: Security Issues in Data Mining
Download
Report
Transcript CS 590M: Security Issues in Data Mining
CS 590M Fall 2001: Security
Issues in Data Mining
Lecture 3: Classification
What is Classification?
• Problem: assign items to pre-defined
classes
– Sample Y = Y1 … Yn
– Set of classes X
– Given Y, choose C that contains Y
• How do we know how to do this?
– Training data: set of items for which proper
Xi is known.
Issues
• Classification accuracy
– False positives, False negatives
– No clear “best” metric
• Computation cost
– Training
– Classification
Approaches:
•
•
•
•
Naïve Bayes
K-Nearest Neighbor
Decision rules/Decision trees
Neural Networks
Naïve Bayes: History
Bayes classifier: From
Probability Theory
• Idea: A-posteriori
P[Y | Xi]P[ Xi]
P[ Xi | Y ]
probability of class
j P[Y | Xj]P[ Xj]
given all inputs is best
possible classifier
• Problem: doesn’t
Y2
generalize.
Y1
Y4
• Solution: Bayesian
Y3
Belief network
P(Xi|Y) = P(Y4|Y2,Y3)P(Y2|Y1)P(Y3|Y1)P(Y1)
Problems with Bayesian Belief
Network
• What should the network structure be?
– Some work in how to learn the structure
– Getting it wrong results in over-specificity
• What are the probabilities?
– Learning techniques exist here
• Computational cost to learn network
Naïve Bayes
• Two-layer Bayes network
– No need to learn structure
– Assumes inputs independent
• Learn the probabilities that work best on
training data
Y1
Y2
X
Y3
P(X|Y1...Yn) = P(X)*Πi P(Yi|X)
K-Nearest Neighbor
• Idea: Choose “closest” training item
– Class of test is same as class of closest
training item
– Need to define distance
• What if this is a bad match?
– Find K closest items
– Use most common class in those K
KNN: Advantages
• As training set → ∞, K → ∞, result
approaches optimal
– View as “best probability over all samples”:
this is Bayes theorem
• Training simple
– Just put training set into a data structure
KNN: Problems
• With small K, only captures convex
classes
• High dimensionality: may be “nearest”
in irrelevant attributes
• Query time: Search all training data
– Algorithms to make this faster
• But good enough to be “standard” for
comparison
Classification and Security
• Ideas on how to use classifiers to
improve security
– Intrusion Detection
–?
• Potential risks
– Identifying private information based on
similarity with training data