
Al-Imam Mohammad bin Saud University
CS433: Modeling and Simulation
Lecture 05: Statistical Analysis Tools
Dr. Anis Koubâa
15 October 2010
Textbook Reading

Section 7.5: Goodness-of-Fit Tests for Distributions, page 134
 7.5.1 Chi-Square Test, page 134
 7.5.2 Kolmogorov-Smirnov (K-S) Test, page 137
Section 3.6: Correlation, page 32
Goals of Today

 Know how to compare two distributions
 Know how to evaluate the relationship between two random variables
Outline

Comparing Distributions: Tests for Goodness-of-Fit
 Chi-Square Test (for discrete models: PMF)
 Kolmogorov-Smirnov Test (for continuous models: CDF)
Evaluating the Relationship
 Linear Regression
 Correlation
Goodness-of-Fit

Statistical tests enable us to compare two distributions; such comparisons are known as goodness-of-fit tests.
The goodness-of-fit of a statistical model describes how well it fits a set of observations.
Measures of goodness-of-fit typically summarize the discrepancy between observed values and the values expected under the model in question.
In short, goodness-of-fit means how well a statistical model fits a set of observations.
Pearson's χ²-Tests
Chi-Square Tests for Discrete Models
Pearson's chi-square test enables us to compare the probability mass functions of two distributions.
If the difference value (error) is greater than the critical value, the two distributions are said to be different, i.e., the first distribution does not fit the second distribution (well).
If the difference is smaller than the critical value, the first distribution fits the second distribution well.
(Pearson's) Chi-Square Test

Pearson's chi-square test is used for two types of comparison:
 Tests of goodness of fit: establish whether or not an observed frequency distribution differs from a theoretical distribution.
 Tests of independence: assess whether paired observations on two variables are independent of each other.
For example, whether people from different regions differ in the frequency with which they report that they support a political candidate.
If the chi-square probability is less than or equal to 0.05, we reject the hypothesis that
 both distributions are equal (goodness-of-fit), or that
 the row variable is unrelated (that is, only randomly related) to the column variable (test of independence).
Chi-Square Distribution
http://en.wikipedia.org/wiki/Chi-square_distribution
(Pearson's) Chi-Square Test

The chi-square test, in general, can be used to check whether an empirical distribution follows a specific theoretical distribution.
Chi-square is calculated by finding the difference between each observed (O) and theoretical or expected (E) frequency for each possible outcome, squaring it, dividing each square by the theoretical frequency, and summing the results.
For n possible outcomes (observations), the chi-square statistic is defined as:

\[ \chi^2_{n-1} = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \]

where
 Oi = the observed frequency for a given outcome;
 Ei = the expected (theoretical) frequency for a given outcome;
 n = the number of possible outcomes of each event.
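To make the formula concrete, here is a minimal Python sketch of this computation (the die-roll counts are hypothetical, not from the slides):

```python
import numpy as np

def chi_square_statistic(observed, expected):
    # Chi-square statistic: sum over all outcomes of (O_i - E_i)^2 / E_i
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

# Hypothetical example: 60 rolls of a supposedly fair six-sided die
observed = [8, 9, 12, 11, 6, 14]   # observed count for each face
expected = [10] * 6                # expected count under fairness
print(chi_square_statistic(observed, expected))  # 4.2
```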
(Pearson's) Chi-Square Test
A chi-square probability of 0.05 or less is the usual criterion for concluding that the empirical and theoretical distributions differ.
Chi-Square Test: General Algorithm
http://en.wikipedia.org/wiki/Inverse-chi-square_distribution

We say that the observed (empirical) distribution fits the expected (theoretical) distribution well if:

\[ \chi_0^2 \le \chi^2_{\alpha,\,k-1-c} \]

which means

\[ P\left( \chi_0^2 \le \chi^2_{\alpha,\,k-1-c} \right) = 1 - \alpha \]

where

\[ \chi^2_{\text{critical}} = \chi^2_{\alpha,\,k-1-c} = \text{idfChiSquare}(k-1-c,\ \alpha) \]

• (k − 1 − c) is the degrees of freedom, where
 k is the number of possible outcomes and
 c is the number of estimated parameters.
• 1 − α is the confidence level (typically we use α = 0.05).
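The inverse distribution function idfChiSquare corresponds to the chi-square quantile function; a sketch of obtaining the critical value with SciPy (the values of k and c are hypothetical):

```python
from scipy.stats import chi2

alpha = 0.05
k = 10   # number of possible outcomes (hypothetical)
c = 0    # number of parameters estimated from the data
df = k - 1 - c

# Critical value: the (1 - alpha) quantile of the chi-square distribution,
# i.e. idfChiSquare(k - 1 - c, alpha) in the slide's notation
chi2_critical = chi2.ppf(1 - alpha, df)
print(chi2_critical)  # about 16.92 for df = 9
```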
Chi-Square Test: Example
[Figure: chi-square test of a sample against the uniform distribution on [0 .. 9]; result: PASS]
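A possible reconstruction of this example in Python (the sample here is simulated, since the slide's original data are not given):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(42)
sample = rng.integers(0, 10, size=1000)   # 1000 draws from uniform {0, ..., 9}

k = 10
observed = np.bincount(sample, minlength=k).astype(float)
expected = np.full(k, len(sample) / k)    # 100 expected per outcome

chi2_0 = np.sum((observed - expected) ** 2 / expected)
chi2_crit = chi2.ppf(0.95, df=k - 1)      # alpha = 0.05, no estimated parameters

print(f"chi2_0 = {chi2_0:.2f}, critical = {chi2_crit:.2f}")
print("PASS" if chi2_0 <= chi2_crit else "FAIL")
```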
(K-S Test) Kolmogorov–Smirnov Test for Continuous Models
In statistics, the Kolmogorov–Smirnov test (K–S test) quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the expected distribution, or between the empirical distribution functions of two samples.

It can be used for both continuous and discrete models.
Basic idea: compute the maximum distance between two cumulative distribution functions and compare it to a critical value.
 If the maximum distance is smaller than the critical value, the first distribution fits the second distribution.
 If the maximum distance is greater than the critical value, the first distribution does not fit the second distribution.
Kolmogorov–Smirnov Test

In statistics, the Kolmogorov–Smirnov test is used to determine
 whether two one-dimensional probability distributions differ, or
 whether a probability distribution differs from a hypothesized distribution,
in either case based on finite samples.
The Kolmogorov–Smirnov test statistic measures the largest vertical distance between an empirical CDF calculated from a data set and a theoretical CDF.
The one-sample K-S test compares the empirical distribution function with a cumulative distribution function.
The main applications are testing goodness-of-fit with the normal and uniform distributions.
Kolmogorov–Smirnov Statistic

Let X1, X2, …, Xn be iid random variables with CDF F(x).
The empirical distribution function Fn(x) based on the sample X1, X2, …, Xn is a step function defined by:

\[ F_n(x) = \frac{\text{number of elements in the sample} \le x}{n} \]

The Kolmogorov–Smirnov test statistic for a given CDF F(x) is:

\[ D_n = \sup_x \left| F_n(x) - F(x) \right| \]
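A minimal sketch of computing D_n directly, here against a standard normal F(x) for illustration:

```python
import numpy as np
from scipy.stats import norm

def ks_statistic(sample, cdf):
    # D_n = sup_x |F_n(x) - F(x)|. Since F_n is a step function, the
    # supremum is attained at a data point, so it suffices to compare
    # F(x) with F_n just before and just after each order statistic.
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    theo = cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - theo)
    d_minus = np.max(theo - np.arange(0, n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
print(ks_statistic(sample, norm.cdf))  # small, since the model is correct
```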
Facts

By the Glivenko–Cantelli theorem, if the sample comes from a distribution F(x), then Dn converges to 0 almost surely.
In other words, if X1, X2, …, Xn really come from the distribution with CDF F(x), the distance Dn should be small.
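A quick simulated illustration of this convergence (not from the slides), using SciPy's built-in one-sample K-S test:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
for n in (10, 100, 1000, 10000):
    sample = rng.normal(size=n)
    d_n = kstest(sample, "norm").statistic   # one-sample K-S statistic
    print(n, round(d_n, 4))
# D_n shrinks toward 0 (roughly like 1/sqrt(n)) as the sample grows
```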
Example
[Figure: two CDFs with the maximum vertical distance Dmax marked]
Example: Grade Distribution?

We would like to know the distribution of the grades of students.
 First, determine the empirical distribution.
 Second, compare it to the Normal and Poisson distributions.
Data sample: 50 grades in a course, from which we computed the empirical distribution:
 Mean = 63
 Standard Deviation = 15
Example: Grade Distribution?

\[ D_n = \sup_x \left| F_n(x) - F(x) \right| \]

\[ \text{Frequency}(X \le \text{grade}) = \text{number of grades} \le \text{grade} \]

\[ \text{Empirical Distribution} = F(\text{grade}) = P(X \le \text{grade}) = \frac{\text{Frequency}(X \le \text{grade})}{\text{Sample Size}} \]
Example: Grade Distribution?

\[ D_n = \sup_x \left| F_n(x) - F(x) \right| \]

Dmax,Poisson = 0.153
Dmax,Normal = 0.119
Kolmogorov–Smirnov Acceptance Criteria

Rejection criterion: we consider that the two distributions are not equal if the empirical CDF is too far from the theoretical CDF of the proposed distribution.
This means: we reject if Dn is too large.
But the question is: what does "large" mean? For which values of Dn should we accept the distribution?
Kolmogorov–Smirnov test
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
In the 1930s, Kolmogorov and Smirnov showed that

\[ \lim_{n \to \infty} P\left( \sqrt{n}\, D_n \le t \right) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2} \]

So, for large sample sizes, you could assume

\[ P\left( \sqrt{n}\, D_n \le t \right) \approx 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2} \]

α-level test: find the value tα such that

\[ 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t_\alpha^2} = \alpha \]

So, the test is accepted if

\[ D_n \le \frac{t_\alpha}{\sqrt{n}} \quad \text{(critical value)} \]
Kolmogorov–Smirnov Test
 For small samples, people have worked out and tabulated critical values, but there is no nice closed-form solution.
• J. Pomeranz (1973)
• J. Durbin (1968)
 For large samples, good approximations for n > 40:

α      critical value
0.20   1.0730 / √n
0.10   1.2239 / √n
0.05   1.3581 / √n
0.02   1.5174 / √n
0.01   1.6276 / √n
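A small helper encoding the large-sample approximation from the table above:

```python
import math

# Large-sample (n > 40) K-S critical value coefficients, from the table above
KS_COEFF = {0.20: 1.0730, 0.10: 1.2239, 0.05: 1.3581, 0.02: 1.5174, 0.01: 1.6276}

def ks_critical(alpha, n):
    # Approximate critical value for the one-sample K-S test when n > 40
    return KS_COEFF[alpha] / math.sqrt(n)

print(round(ks_critical(0.05, 50), 3))   # 0.192, as in the example below
print(round(ks_critical(0.05, 100), 4))  # 0.1358
```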
Example: Grade Distribution?

For our example, we have n = 50.
The critical value for α = 0.05 is:

\[ D_{\text{critical}} = \frac{1.3581}{\sqrt{50}} \approx 0.192 \]

Dmax,Normal = 0.119 < Dcritical = 0.192 → ACCEPT
Dmax,Poisson = 0.153 < Dcritical = 0.192 → ACCEPT
Example: Grade Distribution?

If we get the same distances for n = 100:
The critical value for α = 0.05 is:

\[ D_{\text{critical}} = \frac{1.3581}{\sqrt{100}} \approx 0.1358 \]

Dmax,Normal = 0.119 < Dcritical = 0.1358 → ACCEPT
Dmax,Poisson = 0.153 > Dcritical = 0.1358 → REJECT
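A sketch of how such a check could be run end to end with SciPy (the grades are simulated stand-ins, since the slide's raw data are not given; mean 63 and standard deviation 15 come from the example):

```python
import numpy as np
from scipy.stats import kstest, norm

rng = np.random.default_rng(7)
grades = rng.normal(loc=63, scale=15, size=50)   # simulated stand-in data

# One-sample K-S test against Normal(63, 15)
result = kstest(grades, norm(loc=63, scale=15).cdf)
d_critical = 1.3581 / np.sqrt(len(grades))       # alpha = 0.05, n > 40

print(f"D_n = {result.statistic:.3f}, D_critical = {d_critical:.3f}")
print("ACCEPT" if result.statistic < d_critical else "REJECT")
```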
http://en.wikipedia.org/wiki/Linear_regression
Linear Regression: Least Squares Method
In statistics, linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modeled by a least squares function, called the linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients.
A linear regression equation with one independent variable represents a straight line. The results are subject to statistical analysis.
The Method of Least Squares

The equation of the best-fitting line is calculated using a set of n pairs (xi, yi).
We choose our estimates a and b of α and β so that the vertical distances of the points from the line are minimized.

\[ \text{Best-fitting line:}\quad \hat{y} = a + bx \]

Choose a and b to minimize:

\[ \text{SSE} = \sum (y - \hat{y})^2 = \sum (y - a - bx)^2 \]

SSE: Sum of Squared Errors
Least Squares Estimators
Calculate the sums of squares:

\[ S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} \qquad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} \]

\[ \text{Best-fitting line:}\quad \hat{y} = a + bx \quad \text{where} \quad b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x} \]
Example
The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades.

Student            1   2   3   4   5   6   7   8   9  10
Math test, x      39  43  21  64  57  47  28  75  34  52
Calculus grade, y 65  78  52  82  92  89  73  98  56  75

Use your calculator to find the sums and sums of squares:

\[ \sum x = 460 \qquad \sum y = 760 \qquad \sum x^2 = 23634 \qquad \sum y^2 = 59816 \qquad \sum xy = 36854 \]

\[ \bar{x} = 46 \qquad \bar{y} = 76 \]
Example
(460) 2
Sxx  23634 
 2474
10
(760) 2
Syy  59816 
 2056
10
(460)(760)
Sxy  36854 
 1894
10
1894
b
 .76556 and a  76  .76556(46)  40.78
2474
Bestfitting line : yˆ  40.78  .77 x
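For reference, a short sketch reproducing these numbers (data taken from the table above):

```python
import numpy as np

x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([65, 78, 52, 82, 92, 89, 73, 98, 56, 75], dtype=float)

n = len(x)
sxx = np.sum(x**2) - np.sum(x)**2 / n          # 2474.0
sxy = np.sum(x*y) - np.sum(x)*np.sum(y) / n    # 1894.0

b = sxy / sxx                 # ~0.76556
a = y.mean() - b * x.mean()   # ~40.78
print(f"y-hat = {a:.2f} + {b:.5f} x")
```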
Correlation Analysis
In probability theory and statistics, correlation (often measured
as a correlation coefficient) indicates the strength and
direction of a linear relationship between two random
variables.
In general statistical usage, correlation or co-relation refers to
the departure of two variables from independence. In this broad
sense there are several coefficients, measuring the degree of
correlation, adapted to the nature of the data.
Correlation Analysis
• The strength of the relationship between x and y is measured using the coefficient of correlation:

\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \]

• The sign of r indicates the direction of the relationship:
 • r near 0 indicates no linear relationship,
 • r near 1 or −1 indicates a strong linear relationship.
• A test of the significance of the correlation coefficient is identical to the test of the slope b.
Example
The table shows the heights and weights of n = 10 randomly selected college football players.

Player      1    2    3    4    5    6    7    8    9   10
Height, x  73   71   75   72   72   75   67   69   71   69
Weight, y 185  175  200  210  190  195  150  170  180  175

Use your calculator to find the sums and sums of squares:

\[ S_{xy} = 328 \qquad S_{xx} = 60.4 \qquad S_{yy} = 2610 \]

\[ r = \frac{328}{\sqrt{(60.4)(2610)}} = 0.8261 \]
Football Players
[Scatterplot of Weight vs Height: r = .8261, a strong positive correlation]
As the player's height increases, so does his weight.
Some Correlation Patterns
 r = 0: no correlation
 r = 1: exact linear relationship
 r = .931: strong positive correlation
 r = -.67: weaker negative correlation