Transcript Review

Review
1
Distributions
2
Distribution Definitions
• Discrete Probability Distribution
• Continuous Probability Distribution
• Cumulative Distribution Function
3
Discrete Distribution
• A r.v. X is discrete if it takes countably many
values {x1,x2,….}
• The probability function or probability mass
function for X is given by
– fX(x)= P(X=x)
• From previous example
1 / 4
1 / 2

f X ( x)  
1 / 4

 0
x0
x 1
x2
otherwise
4
Continuous Distributions
• A r.v. X is continuous if there exists a function fX
such that
$$f_X(x) \ge 0, \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1, \qquad P(a \le X \le b) = \int_a^b f_X(x)\,dx$$
5
Example: Continuous Distribution
• Suppose X has the pdf
1 0  x  1
f X ( x)  
0 otherwise
• This is the Uniform (0,1) distribution
6
Binomial Distribution
• A coin comes up Heads with probability p. Flip it n
times and let X be the number of Heads.
Assume the flips are independent.
• Let f(x) =P(X=x), then
$$f(x) = \begin{cases} \binom{n}{x}\, p^x (1-p)^{n-x} & x = 0, 1, \dots, n \\ 0 & \text{otherwise} \end{cases}$$
7
Binomial Example
• Let p =0.5; n = 5 then
$$P(X = 4) = \binom{5}{4}\, 0.5^4\, (1 - 0.5)^{5-4} = 0.1562$$
• In Matlab
>>binopdf(4,5,0.5)
8
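The same value can be checked directly from the binomial formula in base MATLAB (no toolbox needed):

% P(X = 4) for n = 5 flips with p = 0.5, computed from the binomial pmf
n = 5; p = 0.5; x = 4;
prob = nchoosek(n, x) * p^x * (1 - p)^(n - x)   % 0.15625, i.e. about 0.1562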
Normal Distribution
• X has a Normal (Gaussian) distribution with
parameters μ and σ if
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
• X is standard Normal if μ =0 and σ =1. It is
denoted as Z.
• If X ~ N(μ, σ²) then
$$\frac{X - \mu}{\sigma} \sim Z$$
9
Normal Example
• The number of spam emails received by an email server in
a day follows a Normal distribution N(1000, 500). What is
the probability of receiving 2000 spam emails in a day?
• Let X be the number of spam emails
received in a day. We want P(X = 2000).
• The answer is P(X = 2000) = 0, since X is a continuous random variable.
• It is more meaningful to ask P(X ≥ 2000).
10
Normal Example
• This is
$$P(X \ge 2000) = 1 - P(X < 2000) = 1 - \int_{-\infty}^{2000} f(x)\,dx$$
• In Matlab: >> 1 - normcdf(2000,1000,500)
• The answer is 1 – 0.9772 = 0.0228 or 2.28%
• This type of analysis is so common that there is a
special name for it: cumulative distribution function F.
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
11
Conditional Independence
• If A and B are independent then
P(A|B)=P(A)
• P(AB) = P(A|B)P(B)
• Law of Total Probability.
12
Bayes Theorem
13
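For reference, the standard statement of Bayes' theorem (presumably what this slide displayed) is

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$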
Question 1
Gender   % of credit card holders   % of gender who default
Male     60                         55
Female   40                         35
• Question: Suppose you randomly select a credit card holder
and the person has defaulted on their credit card. What is
the probability that the person selected is a ‘Female’?
14
Answer to Question 1
$$P(G = F \mid D = Y) = \frac{P(D = Y \mid G = F)\,P(G = F)}{P(D = Y \mid G = F)\,P(G = F) + P(D = Y \mid G = M)\,P(G = M)} = \frac{0.35 \times 0.40}{0.35 \times 0.40 + 0.55 \times 0.60} \approx 0.30$$
But what does G=F and D=Y mean? We have not even formally defined them.
15
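A quick MATLAB check of the computation above:

% P(Female | Default) via Bayes' theorem, using the percentages from the table
pF = 0.40; pM = 0.60;          % P(G=F), P(G=M)
pD_F = 0.35; pD_M = 0.55;      % P(D=Y|G=F), P(D=Y|G=M)
pF_D = (pD_F * pF) / (pD_F * pF + pD_M * pM)   % about 0.2979, i.e. roughly 0.30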
Clustering
16
Types of Clusterings
• A clustering is a set of clusters
• Important distinction between
hierarchical and partitional sets of
clusters
• Partitional Clustering
– A division of data objects into non-overlapping
subsets (clusters) such that each data object is in
exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical
tree
17
Partitional Clustering
Original Points
A Partitional Clustering
18
Hierarchical Clustering
[Figures: points p1–p4 shown under a traditional hierarchical clustering with its traditional dendrogram, and under a non-traditional hierarchical clustering with its non-traditional dendrogram]
19
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple (see the MATLAB sketch below)
20
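A minimal MATLAB sketch of the basic algorithm (the variable names and toy data are illustrative, not from the slides):

% Basic K-means: assign points to the nearest centroid, then recompute centroids
X = [randn(50,2); randn(50,2) + 4];          % toy data: two 2-D blobs
K = 2;
centroids = X(randperm(size(X,1), K), :);    % random initial centroids
for iter = 1:100
    % assignment step: index of the closest centroid for each point
    d = zeros(size(X,1), K);
    for j = 1:K
        d(:,j) = sum((X - centroids(j,:)).^2, 2);
    end
    [~, labels] = min(d, [], 2);
    % update step: centroid = mean of the points assigned to it
    newCentroids = centroids;
    for j = 1:K
        if any(labels == j)
            newCentroids(j,:) = mean(X(labels == j, :), 1);
        end
    end
    if isequal(newCentroids, centroids), break; end   % converged
    centroids = newCentroids;
end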
K-means Clustering – Details
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to ‘Until relatively few points change clusters’
• Complexity is O( n * K * I * d )
  – n = number of points, K = number of clusters,
    I = number of iterations, d = number of attributes
21
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
– For each point, the error is the distance to the nearest cluster
– To get SSE, we square these errors and sum them.
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
– x is a data point in cluster Ci and mi is the representative point for cluster
Ci
• can show that mi corresponds to the center (mean) of the cluster
– Given two clusters, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
22
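Continuing the k-means sketch above (same illustrative variables X, labels, centroids, K), SSE can be computed as:

% Sum of Squared Error: squared distance of each point to its cluster centroid
SSE = 0;
for j = 1:K
    diffs = X(labels == j, :) - centroids(j,:);
    SSE = SSE + sum(sum(diffs.^2));
end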
Hierarchical Clustering
• Produces a set of nested clusters
organized as a hierarchical tree
• Can be visualized as a dendrogram
– A tree like diagram that records the
sequences of merges or splits
[Figure: nested clusters over points 1–6 and the corresponding dendrogram, with merge heights on a 0–0.2 scale]
23
Strengths of Hierarchical
Clustering
• Do not have to assume any particular
number of clusters
– Any desired number of clusters can be
obtained by ‘cutting’ the dendrogram at the
proper level
• They may correspond to meaningful
taxonomies
– Example in biological sciences (e.g., animal
kingdom, phylogeny reconstruction, …)
24
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one
cluster (or k clusters) left
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a
point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
25
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward (a MATLAB sketch follows below)
  1. Compute the proximity matrix
  2. Let each data point be a cluster
  3. Repeat
  4.   Merge the two closest clusters
  5.   Update the proximity matrix
  6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
  – Different approaches to defining the distance between clusters distinguish the different algorithms
26
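A minimal MATLAB sketch, assuming the Statistics and Machine Learning Toolbox is available (pdist/linkage/cluster); the toy data is illustrative:

% Agglomerative clustering on a toy data set using a proximity (distance) matrix
X = [randn(20,2); randn(20,2) + 5];   % toy data
D = pdist(X, 'euclidean');            % step 1: proximity matrix (condensed form)
Z = linkage(D, 'single');             % steps 3-6: repeatedly merge the two closest clusters
T = cluster(Z, 'maxclust', 2);        % cut the hierarchy to obtain 2 clusters
dendrogram(Z);                        % visualize the sequence of merges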
EM Algorithm
27
Missing Data
• We think of clustering as a problem of
estimating missing data.
• The missing data are the cluster labels.
• Clustering is only one example of a
missing data problem. Several other
problems can be formulated as missing
data problems.
28
Missing Data Problem
• Let D = {x(1),x(2),…x(n)} be a set of n
observations.
• Let H = {z(1),z(2),..z(n)} be a set of n
values of a hidden variable Z.
– z(i) corresponds to x(i)
• Assume Z is discrete.
29
EM Algorithm
• The log-likelihood of the observed data is
l ( )  log p( D |  )  log  p( D, H |  )
H
• Not only do we have to estimate θ but also H
• Let Q(H) be the probability distribution on the missing
data.
30
EM Algorithm
Inequality is because of Jensen's Inequality.
This means that F(Q, θ) is a lower bound on l(θ).
Notice that the log of sums becomes a sum of logs.
31
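For completeness, the bound referred to above is obtained by introducing Q(H) and applying Jensen's inequality to the concave log function (a standard derivation, stated in the notation of the previous slide):

$$\ell(\theta) = \log \sum_{H} p(D, H \mid \theta) = \log \sum_{H} Q(H)\,\frac{p(D, H \mid \theta)}{Q(H)} \;\ge\; \sum_{H} Q(H)\,\log \frac{p(D, H \mid \theta)}{Q(H)} \;=\; F(Q, \theta)$$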
EM Algorithm
• The EM Algorithm alternates between
maximizing F with respect to Q (theta
fixed) and then maximizing F with respect
to theta (Q fixed).
32
EM Algorithm
• It turns out that the E-step is just
• And, furthermore
• Just plug-in
33
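For reference, in the standard EM derivation the E-step sets Q(H) to the posterior of the hidden variables,

$$Q(H) = p(H \mid D, \theta),$$

and plugging this choice into F(Q, θ) makes the bound tight, i.e. F(Q, θ) = ℓ(θ) at the current θ.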
EM Algorithm
• The M-step reduces to maximizing the first term with respect to θ, as there is no θ in the second term.
34
EM Algorithm for Mixture of
Normals
Mixture of
Normals
E Step
M-Step
35
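For reference, the standard E- and M-step updates for a mixture of Normals with mixing weights π_k, means μ_k and covariances Σ_k (a sketch of what this slide summarizes, not a verbatim reproduction) are:

E-step (responsibilities):
$$\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x^{(i)} \mid \mu_k, \Sigma_k)}{\sum_j \pi_j\,\mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)}$$

M-step:
$$N_k = \sum_i \gamma_{ik}, \qquad \pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,x^{(i)}, \qquad \Sigma_k = \frac{1}{N_k}\sum_i \gamma_{ik}\,(x^{(i)} - \mu_k)(x^{(i)} - \mu_k)^t$$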
What is Association Rule Mining?
• Association rule mining finds
• combinations of items that typically occur
together in a database (market-basket analysis)
• Sequences of items that occur frequently
(sequential analysis) in a database
• Originally introduced for market-basket analysis, useful for analysing the purchasing behaviour of
customers.
36
Market-Basket Analysis – Examples
• Where should strawberries be placed to maximize their sale?
• Services purchased together by telecommunication customers (e.g. broadband Internet, call forwarding, etc.) help determine how to bundle these services together to maximize revenue
• Unusual combinations of insurance claims can be a sign of fraud
• Medical histories can give indications of complications based on combinations of treatments
• Sport: analyzing game statistics (shots blocked, assists, and fouls) to gain competitive advantage
  – “When player X is on the floor, player Y’s shot accuracy decreases from 75% to 30%”
  – Bhandari et al. (1997). Advanced Scout: data mining and knowledge discovery in NBA data, Data Mining and Knowledge Discovery, 1(1), pp. 121-125
37
Support and Confidence - Example
• What is the support and confidence of the following
rules?
• {Beer} → {Bread}
• {Bread, PeanutButter} → {Jelly} ?
support(X → Y) = support(X ∪ Y)
confidence(X → Y) = support(X ∪ Y) / support(X)
38
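As a sketch of how these two quantities would be computed from a 0/1 transaction matrix (rows = transactions, columns = items; the data below is made up, not the table from the slides):

% Hypothetical transactions over items {Beer, Bread, Jelly, Milk, PeanutButter}
items = {'Beer','Bread','Jelly','Milk','PeanutButter'};
T = [1 1 0 0 0;    % each row is one transaction (1 = item present)
     0 1 1 0 1;
     0 1 0 1 1;
     1 0 0 1 0;
     0 1 1 1 1];
n = size(T, 1);
X = [2 5];                                    % antecedent {Bread, PeanutButter} (column indices)
Y = 3;                                        % consequent {Jelly}
suppXY = sum(all(T(:, [X Y]) == 1, 2)) / n;   % support(X u Y)
suppX  = sum(all(T(:, X) == 1, 2)) / n;       % support(X)
conf   = suppXY / suppX;                      % confidence(X -> Y)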
Association Rule Mining Problem Definition
• Given a set of transactions T = {t1, t2, …, tn} and two thresholds,
minsup and minconf,
• Find all association rules X → Y with support ≥ minsup and
confidence ≥ minconf
• I.e., we want rules with high confidence and support
• We call these rules interesting
• We would like to
• Design an efficient algorithm for mining association rules
in large data sets
• Develop an effective approach for distinguishing
interesting rules from spurious ones
39
Generating Association Rules – Approach 1
(Naïve)
• Enumerate all possible rules and select those of them
that satisfy the minimum support and confidence
thresholds
• Not practical for large databases
• For a given dataset with m items, the total number of
possible rules is 3^m − 2^(m+1) + 1 (Why?*)
• And most of these will be discarded!
• We need a strategy for rule generation -- generate only
the promising rules
• rules that are likely to be interesting, or, more accurately, don’t
generate rules that can’t be interesting.
*hint: use inclusion-exclusion principle
40
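One way to see the count (a sketch of the hinted argument): an itemset of size k contributes 2^k − 2 rules, one for every non-empty proper subset chosen as antecedent, so

$$\sum_{k=1}^{m} \binom{m}{k}\left(2^k - 2\right) = \sum_{k=0}^{m} \binom{m}{k} 2^k \;-\; 2\sum_{k=0}^{m}\binom{m}{k} \;-\; (2^0 - 2) = 3^m - 2^{m+1} + 1.$$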
Generating Association Rules – Approach 2
• What do these rules have in common?
A,B → C
A,C → B
B,C → A
• The support of a rule X → Y depends only on the support
of its itemset X ∪ Y
Answer: they have the same support: support({A,B,C})
• Hence, a better approach: find Frequent itemsets first,
then generate the rules
• A frequent itemset is an itemset that occurs at least
minsup times
• If an itemset is infrequent, all the rules that contain it will
have support<minsup and there is no need to generate
them
41
Generating Association Rules – Approach 2
• 2-step approach:
Step 1: Generate frequent itemsets -- Frequent
Itemset Mining (i.e. support ≥ minsup)
  • e.g. {A,B,C} is frequent (so A,B → C, A,C → B and B,C → A
    satisfy the minSup threshold).
Step 2: From them, extract rules that satisfy
the confidence threshold (i.e. confidence ≥ minconf)
  • e.g. maybe only A,B → C and C,B → A are confident
• Step 1 is the computationally difficult part
(the next slides explain why, and a way to reduce
the complexity….)
42
Frequent Itemset Generation (Step 1)
– Brute-Force Approach
• Enumerate all possible itemsets and scan the dataset to
calculate the support for each of them
• Example: I={a,b,c,d,e}
[Figure: search space showing superset / subset relationships among the itemsets]
Given d items, there are 2^d − 1 possible (non-empty) candidate itemsets => not practical for large d
43
Frequent Itemset Generation (Step 1)
-- Apriori Principle (1)
A subset of any frequent itemset is also frequent
Example: If {c,d,e} is frequent then {c,d}, {c,e}, {d,e}, {c}, {d} and {e} are also frequent
44
Frequent Itemset Generation (Step 1)
-- Apriori Principle (2)
If an itemset is not frequent, a superset of it is also not
frequent
Example: If we know that {a,b} is infrequent, the entire sub-graph can be pruned.
I.e. {a,b,c}, {a,b,d}, {a,b,e}, {a,b,c,d}, {a,b,c,e}, {a,b,d,e} and {a,b,c,d,e} are infrequent
45
Recall the 2 Step process
for Association Rule Mining
Step 1: Find all frequent Itemsets
So far: main ideas and concepts (Apriori
principle).
Later: algorithms
Step 2: Generate the association rules from
the frequent itemsets.
46
ARGen Algorithm (Step 2)
• Generates interesting rules from the frequent itemsets
• Already know the rules are frequent (Why?), just need
to check confidence.
ARGen algorithm
for each frequent itemset F
    generate all non-empty proper subsets S of F
    for each s in S do
        if confidence(s → F-s) ≥ minConf
        then output rule s → F-s
end
Example:
F = {a,b,c}
S = {{a,b}, {a,c}, {b,c}, {a}, {b}, {c}}
rules output: {a,b} → {c}, etc. (see the MATLAB sketch below)
47
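A minimal MATLAB sketch of this rule-generation loop; the support values in the map are made up for illustration:

% Generate confident rules s -> F\s from one frequent itemset F
supp = containers.Map( ...                      % toy support lookup (hypothetical values)
    {'a','b','c','a,b','a,c','b,c','a,b,c'}, ...
    [0.6 0.7 0.6 0.4 0.4 0.5 0.3]);
F = {'a','b','c'};  minConf = 0.5;
n = numel(F);
suppF = supp(strjoin(sort(F), ','));
for mask = 1:(2^n - 2)                          % every non-empty proper subset of F
    inS = bitget(mask, 1:n) == 1;
    s = F(inS);  rest = F(~inS);
    conf = suppF / supp(strjoin(sort(s), ','));  % confidence(s -> F\s)
    if conf >= minConf
        fprintf('{%s} -> {%s}  (conf %.2f)\n', strjoin(s, ','), strjoin(rest, ','), conf);
    end
end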
ARGen - Example
• minsup=30%, minconf=50%
• The set of frequent itemsets
L={{Beer},{Bread}, {Milk},
{PeanutButter},
{Bread, PeanutButter}}
• Only the last itemset from L consists of 2 nonempty
subsets of frequent itemsets – Bread and PeanutButter.
$$\text{confidence}(Bread \Rightarrow PeanutButter) = \frac{\text{support}(\{Bread, PeanutButter\})}{\text{support}(\{Bread\})} = \frac{60}{80} = 0.75 \ge minconf$$
$$\text{confidence}(PeanutButter \Rightarrow Bread) = \frac{\text{support}(\{Bread, PeanutButter\})}{\text{support}(\{PeanutButter\})} = \frac{60}{60} = 1 \ge minconf$$
• => 2 rules will be generated
48
Bayes Classifier
• A probabilistic framework for solving
classification problems
• Conditional Probability:
$$P(C \mid A) = \frac{P(A, C)}{P(A)}, \qquad P(A \mid C) = \frac{P(A, C)}{P(C)}$$
• Bayes theorem:
$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$
49
Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the
probability he/she has meningitis?
$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
50
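In MATLAB, the same arithmetic:
>> (0.5 * (1/50000)) / (1/20)   % = 2.0e-04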
Bayesian Classifiers
• Consider each attribute and class label as
random variables
• Given a record with attributes (A1, A2,…,An)
– Goal is to predict class C
– Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
• Can we estimate P(C| A1, A2,…,An ) directly
from data?
51
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for
all values of C using the Bayes theorem
$$P(C \mid A_1 A_2 \cdots A_n) = \frac{P(A_1 A_2 \cdots A_n \mid C)\,P(C)}{P(A_1 A_2 \cdots A_n)}$$
– Choose value of C that maximizes
P(C | A1, A2, …, An)
– Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
52
Naïve Bayes Classifier
• Assume independence among attributes Ai
when class is given:
– P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj)
– Can estimate P(Ai| Cj) for all Ai and Cj.
– New point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is
maximal.
53
How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

• Class: P(C) = Nc/N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples:
    P(Status=Married|No) = 4/7
    P(Refund=Yes|Yes) = 0
54
How to Estimate Probabilities from Data?
• For continuous attributes:
– Discretize the range into bins
• one ordinal attribute per bin
• violates independence assumption
– Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
– Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
• Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)
55
How to Estimate Probabilities from Data?

(Same training data table as on the previous slide: Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)
• Normal distribution:
$$P(A_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\!\left(-\frac{(A_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
  – One for each (Ai, ci) pair
• For (Income, Class=No):
  – If Class=No: sample mean = 110, sample variance = 2975
$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\!\left(-\frac{(120 - 110)^2}{2(2975)}\right) = 0.0072$$
56
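Assuming the Statistics and Machine Learning Toolbox, the value can be checked with normpdf (which takes the standard deviation, i.e. the square root of the sample variance):
>> normpdf(120, 110, sqrt(2975))   % about 0.0072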
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Married, Income = 120K)
naive Bayes Classifier:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No)=1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes)=1/7
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No:
sample mean=110
sample variance=2975
If class=Yes: sample mean=90
sample variance=25
• P(X|Class=No) = P(Refund=No|Class=No)
                × P(Married|Class=No)
                × P(Income=120K|Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X|Class=Yes) = P(Refund=No|Class=Yes)
                 × P(Married|Class=Yes)
                 × P(Income=120K|Class=Yes)
                 = 1 × 0 × 1.2 × 10^-9 = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes)
Therefore P(No|X) > P(Yes|X)
=> Class = No
57
Naïve Bayes Classifier
• If one of the conditional probabilities is zero,
then the entire expression becomes zero
• Probability estimation:
$$\text{Original: } P(A_i \mid C) = \frac{N_{ic}}{N_c}$$
$$\text{Laplace: } P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$$
$$\text{m-estimate: } P(A_i \mid C) = \frac{N_{ic} + mp}{N_c + m}$$
c: number of classes
p: prior probability
m: parameter
58
Example of Naïve Bayes
Classifier
A: attributes
M: mammals
N: non-mammals

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A:
Give Birth  Can Fly  Live in Water  Have Legs  Class
yes         no       yes            no         ?

$$P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$$
$$P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$$
$$P(A \mid M)\,P(M) = 0.06 \times \frac{7}{20} = 0.021$$
$$P(A \mid N)\,P(N) = 0.004 \times \frac{13}{20} = 0.0027$$
P(A|M)P(M) > P(A|N)P(N)
=> Mammals
59
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handle missing values by ignoring the instance during
probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some
attributes
– Use other techniques such as Bayesian Belief
Networks (BBN)
60
Principal Component Analysis
61
Motivation
• Bulk of data has a time component
• For example, retail transactions, stock
prices
• Data set can be organized as N x M table
• N customers and the price of the calls they
made in 365 days
• M << N
62
Objective
• Compress the data matrix X into Xc, such
that
– The compression ratio is high and the
average error between the original and the
compressed matrix is low
– N could be in the order of millions and M in
the order of hundreds
63
Example database
       Wed   Thr   Fri   Sat   Sun
       7/10  7/11  7/12  7/13  7/14
ABC    1     1     1     0     0
DEF    2     2     2     0     0
GHI    1     1     1     0     0
KLM    5     5     5     0     0
smith  0     0     0     2     2
john   0     0     0     3     3
64
Decision Support Queries
• What was the amount of sales to GHI on
July 11?
• What were the total sales to business customers
for the week ending July 12th?
65
Intuition behind SVD
[Figure: customers plotted as 2-D points in the (x, y) plane, with rotated axes x' and y']
Customers are 2-D points
66
SVD Definition
• An N x M matrix X can be expressed as
$$X = U \Lambda V^t$$
Λ (Lambda) is a diagonal r × r matrix.
67
SVD Definition
• More importantly X can be written as
$$X = \lambda_1 u_1 v_1^t + \lambda_2 u_2 v_2^t + \dots + \lambda_r u_r v_r^t$$
Where the eigenvalues are in decreasing order.
$$X_c = \lambda_1 u_1 v_1^t + \lambda_2 u_2 v_2^t + \dots + \lambda_k u_k v_k^t, \qquad k < r$$
68
Example
.18
0
.36 .58 t
0  0 
   
  

.
58
0
.18  
0 

X  9.64  .90  .58  5.29   0    0 
   
  

0
0
.
71
0  
.53 

0 0
.80 0.71

   
  
0
.27
69
Compression
$$X = \sum_{i=1}^{r} \lambda_i u_i v_i^t$$
$$X_c = \sum_{i=1}^{k} \lambda_i u_i v_i^t$$
Where k ≤ r ≤ M
70
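A minimal MATLAB sketch of this compression applied to the example customer-by-day matrix from earlier (svd returns the singular values in decreasing order on the diagonal of S):

% Rank-k approximation of the customer-by-day matrix via SVD
X = [1 1 1 0 0;
     2 2 2 0 0;
     1 1 1 0 0;
     5 5 5 0 0;
     0 0 0 2 2;
     0 0 0 3 3];
[U, S, V] = svd(X);              % X = U*S*V'
k = 2;                           % keep the k largest singular values
Xc = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
err = norm(X - Xc, 'fro')        % reconstruction error (near 0 here, since rank(X) = 2)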
Density-based: LOF approach
• For each point, compute the density of its local
neighborhood
• Compute local outlier factor (LOF) of a sample p as
the average of the ratios of the density of sample p
and the density of its nearest neighbors
• Outliers are points with largest LOF value
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
[Figure: points p1 and p2 relative to two clusters of different density]
71
Clustering-Based
• Basic idea:
– Cluster the data into
groups of different density
– Choose points in small
cluster as candidate
outliers
– Compute the distance
between candidate points
and non-candidate clusters.
• If candidate points are far
from all other non-candidate
points, they are outliers
72
Base Rate Fallacy
• Bayes theorem:
• More generally:
73
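For reference, the "more generally" form is presumably the Law of Total Probability applied inside Bayes' theorem: for a partition A_1, A_2, …,

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)}$$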
Base Rate Fallacy (Axelsson,
1999)
74
Base Rate Fallacy in Intrusion
Detection
• I: intrusive behavior, ¬I: non-intrusive behavior
  A: alarm, ¬A: no alarm
• Detection rate (true positive rate): P(A|I)
• False alarm rate: P(A|¬I)
• Goal is to maximize both
  – Bayesian detection rate, P(I|A)
  – P(¬I|¬A)
75
Detection Rate vs False Alarm Rate
• Suppose:
• Then:
• False alarm rate becomes more dominant
if P(I) is very low
76
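As an illustration with assumed values (all hypothetical, not Axelsson's figures):

% Base rate fallacy: even a good detector yields a low P(I|A) when P(I) is tiny
pI   = 1e-5;     % assumed prior probability of intrusive behavior
pAI  = 1.0;      % assumed detection rate P(A|I)
pAnI = 1e-3;     % assumed false alarm rate P(A|~I)
pIA = (pAI * pI) / (pAI * pI + pAnI * (1 - pI))   % about 0.01: only ~1% of alarms are real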
Detection Rate vs False Alarm Rate
• Axelsson: We need a very low false alarm rate to
achieve a reasonable Bayesian detection rate
77