Transcript Probability
Statistics for
Data Miners: Part I
S.T. Balke
Statistics
Statistics is concerned with how to collect and
analyze data in the presence of variability.
• Variability=random error + systematic error
• Precision: reproducibility; associated with
random error.
• Accuracy: deviation from the truth; associated
with systematic error.
Accuracy and Reproducibility
• Error rate is an expression of accuracy of a data mining
method.
• Our estimate of error rate is based upon the data that we
have.
• We want to say that the error rate will be the same when
other data of the same type is used.
• However, all data has some random error.
• Thus, our estimate of error rate obtained using data is
affected by the presence of random error.
• Our estimate has some uncertainty.
• Statistics can quantify that uncertainty and can tell us how
to decrease it.
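The point about uncertainty in an estimated error rate can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" error rate of 20%, the test-set sizes, and the function name `estimated_error_rates` are all hypothetical and do not come from the lecture.

```python
import random
import statistics

def estimated_error_rates(true_rate, n_cases, n_repeats, seed=0):
    """Simulate n_repeats test sets of n_cases each and return the
    error-rate estimate obtained from each test set."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        # Each case is misclassified with probability true_rate.
        errors = sum(rng.random() < true_rate for _ in range(n_cases))
        estimates.append(errors / n_cases)
    return estimates

# Hypothetical data mining method with a true error rate of 20%.
small = estimated_error_rates(0.20, n_cases=100, n_repeats=200)
large = estimated_error_rates(0.20, n_cases=1000, n_repeats=200)

# The spread of the estimates is the uncertainty caused by random error.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"spread of estimates with  100 cases: {spread_small:.4f}")
print(f"spread of estimates with 1000 cases: {spread_large:.4f}")
```

In this sketch the estimates scatter around the true rate, and the scatter shrinks when more data is used, which is one way statistics tells us how to decrease the uncertainty.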
Statistics
Descriptive Statistics
Inferential Statistics
Probability
Statistics Lectures:
Part I: The Basics: S.T. Balke
Part II: The Analysis of Count Data: S. Sayad
Part III: Hypothesis Testing for Numeric Values: S.T.
Balke
Statistics in Data Mining
• Data visualization and cleaning
• Rule and tree construction
• Basis for Bayesian Approaches
• Assessment of competing data mining methods
• Assessing the significance of an error rate
• Fitting equations to data
• Reducing the dimensions of the problem
Part I: The Basics
• Discrete Distributions: the Binomial Distribution
• Continuous Distributions:
– Histograms
– Distributions
– Measures of Distributions
• The Normal Distribution
• The Central Limit Theorem
• Confidence Intervals
Data Types
• Categorical (Nominal): labels
• Numerical:
– Discrete: integers
– Continuous: numeric non-integers
Initial Focus
Random Variables
• Quantities that cannot be predicted with
certainty
• If only distinct values: discrete random
variable
• If any value in a continuum: continuous
random variable
Distributions
Probability
The Relative Frequency Concept
• If an experiment is repeated n times and
event A is observed b times, then
for large n: P(A) = b/n
• Or: P(A) = (number of times A is observed)/(total number of
observations)
• Simply put:
Probability = relative frequency
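The relative frequency concept can be tried out numerically. The sketch below assumes a simulated fair coin (p = 0.5), which is not part of the lecture; `relative_frequency` is a hypothetical helper name.

```python
import random

def relative_frequency(n, p=0.5, seed=1):
    """Repeat a simulated experiment n times and return b/n, where b is
    the number of times event A (probability p) is observed."""
    rng = random.Random(seed)
    b = sum(rng.random() < p for _ in range(n))
    return b / n

# For large n the relative frequency settles near P(A) = 0.5.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```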
Probability = Relative Frequency
Distributions
• Portray what happens when the same
experiment is repeated a number of times.
• When you see a distribution think:
REPRODUCIBILITY
Typical Distribution for a Discrete
Variable
[Figure: Binomial Distribution (n=10, p=0.50). Bar chart of Probability versus No. of Successes (0 to 10).]
The Binomial Distribution
• Consider a random experiment where one
of two mutually exclusive and exhaustive outcomes
can occur (success and failure or heads and
tails, etc.).
• Repeat n times.
• Outcomes are mutually independent.
• The probability, p, of success is the same in
each trial.
The Binomial Distribution
• The probability of y
successes in n trials is:
$$g(y) = \frac{n!}{y!\,(n-y)!}\; p^{y} (1-p)^{n-y}, \qquad y = 0, 1, \dots, n$$
This distribution is denoted b(n, p).
The total probability of having any number of successes is
the sum of all the g(y), which is unity.
The probability of having any number of successes up to a
certain value y’ is the sum of g(y) up to that value of y.
See page 178 regarding quantifying the value of a rule.
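The binomial formula and the two sums described above can be checked with a few lines of Python. This is a minimal sketch; the function names `g` and `cumulative` are chosen here for illustration, and n = 10, p = 0.5 matches the first plotted distribution.

```python
from math import comb

def g(y, n, p):
    """Probability of exactly y successes in n independent trials,
    each with success probability p."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def cumulative(y_max, n, p):
    """Probability of having any number of successes up to y_max."""
    return sum(g(y, n, p) for y in range(y_max + 1))

n, p = 10, 0.5
probs = [g(y, n, p) for y in range(n + 1)]
total = sum(probs)  # sum over all y should be unity

for y, prob in enumerate(probs):
    print(f"y={y:2d}  g(y)={prob:.4f}")
print("total probability:", total)
print("P(y <= 3):", cumulative(3, n, p))
```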
Binomial Distribution
(n=10, p=0.30)
[Figure: Binomial Distribution (n=10, p=0.30). Bar chart of Probability versus No. of Successes (0 to 10).]
Binomial Distribution
(n=10,p=0.80)
[Figure: Binomial Distribution (n=10, p=0.80). Bar chart of Probability versus No. of Successes (0 to 10).]
Binomial Distribution
(n=25,p=0.80)
[Figure: Binomial Distribution (n=25, p=0.80). Bar chart of Probability versus No. of Successes (0 to 25).]
Typical Distribution for a
Continuous Variable
Normal (Gaussian) Distribution
Example
• In 1798, Henry Cavendish estimated the
density of the earth by using a torsion
balance
Cavendish Experiment
$$\rho_{earth} = \frac{m_{earth}}{\frac{4}{3}\pi r_{earth}^{3}}$$
$$F_g = G\,\frac{m_1 m_2}{d^{2}}$$
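As a rough illustration of how the two Cavendish relations combine, the sketch below applies the gravitational force law at the earth's surface (force on a mass m2 equals m2 times g, with d equal to the earth's radius) to get the earth's mass, and then divides by the volume of a sphere. The numerical constants are modern reference values assumed for this example; they are not taken from the slides or from Cavendish's measurements.

```python
import math

# Assumed modern reference values (not from the lecture).
G = 6.674e-11       # gravitational constant, N m^2 / kg^2
g = 9.81            # acceleration of gravity at the surface, m / s^2
r_earth = 6.371e6   # mean radius of the earth, m

# F_g = G * m_earth * m2 / r_earth^2 and F_g = m2 * g at the surface,
# so the test mass m2 cancels and m_earth = g * r_earth^2 / G.
m_earth = g * r_earth**2 / G

# Density = mass / volume of a sphere.
volume = (4.0 / 3.0) * math.pi * r_earth**3
density_kg_m3 = m_earth / volume
density_g_cm3 = density_kg_m3 / 1000.0

print(f"mass of the earth: {m_earth:.3e} kg")
print(f"density of the earth: {density_g_cm3:.2f} g/cm^3")
```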
Cavendish Experiment:
Sources of Error
• torsional strength of wire
• air currents
• body mass of experimenter
• placement of masses
• measurement of distances
• measurement of angle
• contribution of damping device
• radius of the earth from Eratosthenes, 200 BC
Density of the Earth [g/cm3]
5.50  5.57  5.42  5.61  5.53  5.47
4.88  5.62  5.63  4.07  5.29  5.34
5.26  5.44  5.46  5.55  5.34  5.30
5.36  5.79  5.75  5.29  5.10  5.86
5.58  5.27  5.85  5.65  5.39
Note: water = 1 g/cc; granite = 2.7 g/cc
Density Data
Ascending Order
Rank  Density   Rank  Density   Rank  Density   Rank  Density
1     4.07      8     5.30      15    5.46      22    5.61
2     4.88      9     5.34      16    5.47      23    5.62
3     5.10      10    5.34      17    5.50      24    5.63
4     5.26      11    5.36      18    5.53      25    5.65
5     5.27      12    5.39      19    5.55      26    5.75
6     5.29      13    5.42      20    5.57      27    5.79
7     5.29      14    5.44      21    5.58      28    5.85
                                                29    5.86
Accepted Value: 5.50 g/mL
Iron: 7.85 g/mL
Nickel: 8.90 g/mL
Basic Histogram Calculations
Density  Freq.
4        0
4.25     1
4.5      0
5        1
5.25     1
5.5      14
5.75     9
6        3
more     0
Total    29
Histogram
[Figure: Histogram of Frequency versus density Bin (4 to 6, plus "More"). Annotation: Freq. = 14 for earth density values between 5.25 and 5.5.]
Histograms
• Height of bars=frequency
• Frequency obtained depends upon total
number of observations
• We would like to remove that dependency!
Basic Histogram Calculations
Density  Freq.  Rel. Freq.
4        0      0
4.25     1      0.034483
4.5      0      0
5        1      0.034483
5.25     1      0.034483
5.5      14     0.482759
5.75     9      0.310345
6        3      0.103448
more     0      0
Total    29     1
Histogram
[Figure: Relative Frequency versus Density (bins 4 to 6, plus "more"). Annotation: Rel. Freq. = 0.483 for earth density values between 5.25 and 5.5.]
Histograms
• The eye reacts to area more than to height
of a bar (important if class sizes are
different!)
• We want the area of a bar to be the relative
frequency
Basic Histogram Calculations
Density  Freq.  Rel. Freq.  Rel. Freq./Width
4        0      0           0
4.25     1      0.034483    0.137931
4.5      0      0           0
5        1      0.034483    0.137931
5.25     1      0.034483    0.137931
5.5      14     0.482759    1.931034
5.75     9      0.310345    1.241379
6        3      0.103448    0.413793
more     0      0           0
Total    29     1           4
Histogram
[Figure: Relative Frequency/Width versus Density (bins 4 to 6, plus "more"). Annotation: the area of the bar gives Rel. Freq. = 1.931 x 0.25 = 0.483 for earth density values between 5.25 and 5.5.]
Basic Histogram Calculations
Density  Freq.  Rel. Freq.  Rel. Freq./Width  Cum. Rel. Freq.
4        0      0           0                 0
4.25     1      0.034483    0.137931          0.034483
4.5      0      0           0                 0.034483
5        1      0.034483    0.137931          0.068966
5.25     1      0.034483    0.137931          0.103448
5.5      14     0.482759    1.931034          0.586207
5.75     9      0.310345    1.241379          0.896552
6        3      0.103448    0.413793          1
more     0      0           0                 1
Total    29     1           4
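The histogram columns can be reproduced from the 29 density values. The sketch below assumes uniform bins 0.25 wide, each including its upper limit (the spreadsheet convention), so it adds an empty bin ending at 4.75 that the table above does not list; the variable names are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate

densities = [5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39]

width = 0.25
# Upper limits of the bins: 4.00, 4.25, ..., 6.00 (assumed uniform width).
upper_edges = [4.0 + width * k for k in range(9)]

# One extra slot at the end plays the role of the "more" bin.
counts = [0] * (len(upper_edges) + 1)
for x in densities:
    # Index of the first upper limit that is >= x.
    counts[bisect_left(upper_edges, x)] += 1

n = len(densities)
rel_freq = [c / n for c in counts]             # bar height = rel. freq.
rel_per_width = [r / width for r in rel_freq]  # bar area = rel. freq.
cum_rel_freq = list(accumulate(rel_freq))      # running total

labels = [f"{e:.2f}" for e in upper_edges] + ["more"]
for row in zip(labels, counts, rel_freq, rel_per_width, cum_rel_freq):
    print("{:>5}  {:>3}  {:.6f}  {:.6f}  {:.6f}".format(*row))
```

Dividing by the bin width is what makes the area of a bar, rather than its height, equal to the relative frequency.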
Histogram
[Figure: Differential and Cumulative Histograms of Density. Relative Frequency/Width and Cumulative Relative Frequency (separate vertical axes) versus Density.]
When will you see a histogram in
this course?
• Data Visualization
http://stat.skku.ac.kr/~myhuh/software/DAVIS/DAVIS.htm
• Even more important: Probability Density
Functions and Probability Distributions are
both related to histograms!
Differential Probability Density
Distribution
• Picture a histogram with the area of each
bar equal to the relative frequency
• Assume that the histogram represents a very
large number of observations
• Reduce the width of the bars until they each
reach dx and the height of a bar is f(x)
• The area of a bar is then f(x) dx
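The limiting process described above can be imitated numerically. The sketch below draws a large number of simulated observations, builds one narrow bar, and compares its height (relative frequency divided by width) with a density curve. Using the standard Normal curve as the test case, the sample size, and the bar width are all assumptions made for this illustration.

```python
import math
import random

rng = random.Random(42)
n = 100_000                      # a "very large" number of observations
samples = [rng.gauss(0.0, 1.0) for _ in range(n)]

dx = 0.1                         # width of a narrow bar

def bar_height(x_left):
    """Relative frequency of observations in [x_left, x_left + dx),
    divided by the bar width dx."""
    count = sum(x_left <= x < x_left + dx for x in samples)
    return (count / n) / dx

def normal_pdf(x):
    """Standard Normal probability density f(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

height = bar_height(-0.05)       # bar centred on x = 0
print(f"bar height near x=0: {height:.4f}")
print(f"f(0) from the curve: {normal_pdf(0.0):.4f}")
print(f"area of the bar (approx. f(x) dx): {height * dx:.4f}")
```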
Probability Density Distribution
[Figure: Histogram of Probability Density (f(x)) versus Observation (x); dx = width of a bar.]
Probability Density Function
[Figure: Probability Density (f(x)) versus Observation (x), x from about 0.82 to 0.85; dx = width of a bar.]
[Figure: Histogram Fit by Gaussian Curve. Probability Density (f(x)) versus Observation (x).]
Probability Density Function:
The Normal Distribution
The Normal Distribution
(Also termed the “Gaussian Distribution”)
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]$$
Note: f(x)dx is the probability of observing a value of x between
x and x+dx. Note the statement on page 87 of the text re: dx
canceling for the Bayesian method.
The Normal Distribution:
Areas
Referring to the x axis:
• Area from μ-σ to μ+σ is 0.6826
• Area from μ-2σ to μ+2σ is 0.9544
• Area from μ-3σ to μ+3σ is 0.9974
• Area from μ-1.96σ to μ+1.96σ is 0.9500
• Total area under the curve = 1.0000
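These areas can be checked without tables. For a Normal curve, the area within k standard deviations of the mean equals erf(k/√2); the sketch below uses that identity with Python's `math.erf`. The helper name `area_within` is illustrative. The computed values agree with the areas listed above to within about 0.0001, the kind of difference that table rounding produces.

```python
import math

def area_within(k):
    """Area under a Normal curve between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 1.96):
    print(f"area within {k} standard deviations: {area_within(k):.4f}")
```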
Excel: Descriptive Statistics
Mean                      5.42
Standard Error            0.0629
Median                    5.46
Mode                      5.29
Standard Deviation        0.3388
Sample Variance           0.1148
Kurtosis                  8.487
Skewness                  -2.329
Range                     1.79
Minimum                   4.07
Maximum                   5.86
Sum                       157.17
Count                     29
Largest(1)                5.86
Smallest(1)               4.07
Confidence Level(95.0%)   0.1289
Accepted Value: 5.50 g/mL
Measures of Location
The Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
The Median:
is the value of xi at rank (n+1)/2 in an array ordered from
lowest to highest.
About 50% of the ordered density values observed
fall below the median.
Comments on
the Mean versus the Median
Rank   x
1      5
2      7
3      9
4      10
5      14
6      16
7      17
8      50
rank of median = 4.5
median = 12
mean = 16
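The comparison can be reproduced with Python's `statistics` module. As an added illustration (not on the slide), the sketch also drops the extreme value 50 to show how much more the mean moves than the median.

```python
import statistics

x = [5, 7, 9, 10, 14, 16, 17, 50]
n = len(x)

median_rank = (n + 1) / 2          # 4.5: halfway between ranks 4 and 5
mean = statistics.mean(x)          # pulled upward by the value 50
median = statistics.median(x)      # average of the 4th and 5th values

# Added illustration: remove the extreme value and recompute.
trimmed = x[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print("rank of median:", median_rank)
print("mean:", mean, " median:", median)
print("without 50 -> mean:", round(mean_trimmed, 2), " median:", median_trimmed)
```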
Measures of Location (Con.)
The Mode:
is the value of xi at the peak of the histogram (the
most frequent value as defined by the mid-point of
the bar corresponding to the peak).
Measures of Dispersion
Range:
highest value of xi - lowest value of xi
Variance:
$$s^{2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1}$$
Standard Deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}{n-1}}$$
Comments on
Standard Deviation
x      x-xbar    (x-xbar)^2
35     -18.67    348.44
47     -6.67     44.44
48     -5.67     32.11
50     -3.67     13.44
51     -2.67     7.11
53     -0.67     0.44
54     0.33      0.11
70     16.33     266.78
75     21.33     455.11
sum    0.00      1168.00
xbar = 53.67
sum/(n-1) = 146.00
stdev = 12.08
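The worked table follows the variance and standard deviation formulas step by step; a minimal sketch of the same arithmetic in Python (variable names are illustrative):

```python
import math

x = [35, 47, 48, 50, 51, 53, 54, 70, 75]
n = len(x)

xbar = sum(x) / n                          # mean
deviations = [xi - xbar for xi in x]       # x - xbar (these sum to zero)
squares = [d * d for d in deviations]      # (x - xbar)^2
sum_sq = sum(squares)                      # sum of squared deviations
variance = sum_sq / (n - 1)                # divide by n - 1, not n
stdev = math.sqrt(variance)

print(f"xbar = {xbar:.2f}")
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"variance = {variance:.2f}")
print(f"stdev = {stdev:.2f}")
```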
Quartiles and Quantiles
Rank of the First Quartile:
i=0.25n+0.5
(Value of First Quartile=Q1)
Rank of the Third Quartile:
i=0.75n+0.5
(Value of Third Quartile=Q3)
Rank of the bth Quantile:
i=bn+0.5
Value of the Inter Quartile Range=Q3-Q1
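The rank formulas can be applied to the ordered Cavendish density data. The slides give the rank but do not say what to do when it is not a whole number; the sketch below assumes linear interpolation between the two neighbouring ordered values, which is one common convention and not necessarily the one intended in the lecture. The function names are illustrative.

```python
densities = [5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39]
ordered = sorted(densities)

def value_at_rank(sorted_x, rank):
    """Value at a 1-based rank; fractional ranks are handled by linear
    interpolation (an assumption, see the note above)."""
    lower = int(rank)
    frac = rank - lower
    if frac == 0:
        return sorted_x[lower - 1]
    below, above = sorted_x[lower - 1], sorted_x[lower]
    return below + frac * (above - below)

def quantile(sorted_x, b):
    """Value of the b-th quantile using rank i = b*n + 0.5."""
    n = len(sorted_x)
    rank = min(max(b * n + 0.5, 1), n)   # keep the rank inside 1..n
    return value_at_rank(sorted_x, rank)

q1 = quantile(ordered, 0.25)   # rank 7.75
q3 = quantile(ordered, 0.75)   # rank 22.25
iqr = q3 - q1
print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
print(f"median (b = 0.5) = {quantile(ordered, 0.5):.2f}")
```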
Use of Quartiles and Quantiles
• Box Plots
• Defining Cumulative Distributions
Summary to this Point:
• Discrete variables
– Binomial Distribution
• Continuous variables:
– Probability is the same as relative frequency
– Relative frequency can be expressed as a histogram
– The fit of a “narrow bar” histogram where relative
frequency has been replaced by probability is a
probability density function
– The most famous p.d.f. is the Normal Curve (or
Gaussian Distribution)
Need for the Standard Normal
Distribution
• The mean, μ, and standard deviation, σ,
depend upon the data, so a wide variety of
values is possible
• To generalize about data we need:
– to define a standard curve and
– a method of converting any Normal curve to
the standard Normal curve
The Standard Normal
Distribution
μ = 0
σ = 1
Transforming Normal to
Standard Normal Distributions
• Observations xi are transformed to zi:
$$z_i = \frac{x_i - \mu}{\sigma}$$
The Standard Normal
Distribution
$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^{2}}{2}\right)$$
The Use of
Standard Normal Curves
Statistical Tables
• Convert x to z
• Use tables of area of curve segments
between different z values on the standard
normal curve to define probabilities
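The two steps above (convert x to z, then look up an area) can be sketched in Python, with `statistics.NormalDist` standing in for the printed z table. The example treats the Cavendish densities as Normal with the mean and standard deviation from the Excel summary (5.42 and 0.3388); that is an assumption made only for illustration, since the data are visibly skewed.

```python
from statistics import NormalDist

def to_z(x, mu, sigma):
    """Transform an observation x to the standard Normal variable z."""
    return (x - mu) / sigma

std_normal = NormalDist()        # standard Normal: mu = 0, sigma = 1

# Assumed parameters, taken from the Excel descriptive statistics.
mu, sigma = 5.42, 0.3388

# Step 1: convert x to z (x = 5.50 is the accepted density value).
z = to_z(5.50, mu, sigma)

# Step 2: area under the standard Normal curve to the left of z,
# which plays the role of the table lookup.
p_below = std_normal.cdf(z)

# Area between two z values, as read from a table for z = -1.96 and 1.96.
central_area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

print(f"z = {z:.4f}")
print(f"P(x <= 5.50) under the Normal assumption = {p_below:.4f}")
print(f"area between z = -1.96 and z = 1.96 = {central_area:.4f}")
```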
Z Table
http://www.statsoft.com/textbook/stathome.html
[Figure: Standard Normal Curve. P.D.F. of z, f(z) versus z, for z from -6 to 6.]