Review of Basic Statistics

Parameters and Statistics
• Parameters are characteristics of
populations, and are knowable only
by taking a census.
• Statistics are estimates of
parameters made from samples.
Descriptive Statistics Review
Measures of Location
The Mean
The Median
The Mode
Measures of Dispersion
The variance
The standard deviation
Mean
The mean (or average) is the basic measure of location or “central tendency” of the data.
• The sample mean $\bar{x}$ is a sample statistic.
• The population mean $\mu$ is a population parameter.
Sample Mean
$$\bar{x} = \frac{\sum x_i}{n}$$
where the numerator is the sum of the values of the n observations:
$$\sum x_i = x_1 + x_2 + \dots + x_n$$
The Greek letter Σ is the summation sign.
Example: College Class Size
We have the following sample of data
for 5 college classes:
46 54 42 46 32
We use the notation x1, x2, x3, x4, and x5 to represent the
number of students in each of the 5 classes:
x1 = 46, x2 = 54, x3 = 42, x4 = 46, x5 = 32
Thus we have:
$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{46 + 54 + 42 + 46 + 32}{5} = 44$$
The average class size is 44 students.
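The class-size calculation can be reproduced in a few lines of Python. This is a minimal sketch; the function name `sample_mean` is an illustrative choice, not something from the slides.

```python
# Class sizes from the example above.
class_sizes = [46, 54, 42, 46, 32]

def sample_mean(values):
    """Return the sum of the observations divided by the number of observations n."""
    return sum(values) / len(values)

mean_class_size = sample_mean(class_sizes)
print(mean_class_size)  # 44.0
```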
Population Mean (μ)
$$\mu = \frac{\sum x_i}{N}$$
The number of observations in the population is denoted by the upper case N.
The sample mean $\bar{x}$ is a point estimator of the population mean $\mu$.
Median
The median is the value in the
middle when the data are arranged in
ascending order (from smallest value
to largest value).
a. For an odd number of observations the median
is the middle value.
b. For an even number of observations the
median is the average of the two middle values.
The College Class Size example
First, arrange the data in ascending order:
32 42 46 46 54
Notice that n = 5, an odd number. Thus the
median is given by the middle value.
32 42 46 46 54
The middle (third) value is 46, so the median class size is 46.
Median Starting Salary For a Sample of 12
Business School Graduates
A college placement office has obtained the
following data for 12 recent graduates:
Graduate   Starting Salary   Graduate   Starting Salary
1          2850              7          2890
2          2950              8          3130
3          3050              9          2940
4          2880              10         3325
5          2755              11         2920
6          2710              12         2880
First we arrange the data in ascending order:
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Notice that n = 12, an even number. Thus we take the
average of the middle two observations, 2890 and 2920:
$$\text{Median} = \frac{2890 + 2920}{2} = 2905$$
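The odd and even rules for the median can be sketched in Python as follows. The function `median` below is a hand-written illustration of the two rules, applied to the class-size and starting-salary data from the examples.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: take the single middle value.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

class_sizes = [46, 54, 42, 46, 32]
salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

class_median = median(class_sizes)
salary_median = median(salaries)
print(class_median)   # 46
print(salary_median)  # 2905.0
```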
Mode
The mode is the value that occurs with
greatest frequency
Soft Drink Example
Soft Drink      Frequency
Coke Classic    19
Diet Coke       8
Dr. Pepper      5
Pepsi Cola      13
Sprite          5
Total           50
The mode is Coke Classic. A mean or median is
meaningless for qualitative data.
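For qualitative data like this, finding the mode amounts to picking the category with the largest count. A minimal Python sketch using the standard library's `Counter`, with the frequencies from the soft drink table (the variable names are illustrative):

```python
from collections import Counter

# Frequencies from the soft drink example.
purchases = Counter({
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi Cola": 13,
    "Sprite": 5,
})

# most_common(1) returns a list holding the single most frequent (item, count) pair.
mode_brand, mode_count = purchases.most_common(1)[0]
total = sum(purchases.values())

print(mode_brand, mode_count)  # Coke Classic 19
print(total)                   # 50
```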
Using Excel to Compute the Mean, Median,
and Mode
Enter the data into cells A1:B13 for the starting salary
example.
• To compute the mean, activate an empty cell and enter
the following in the formula bar:
=AVERAGE(B2:B13) and click the green checkmark.
• To compute the median, activate an empty cell and enter
the following in the formula bar:
=MEDIAN(B2:B13) and click the green checkmark.
• To compute the mode, activate an empty cell and enter
the following in the formula bar:
=MODE(B2:B13) and click the green checkmark.
The Starting Salary Example
Mean     Median   Mode
2940     2905     2880
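The same three values can be cross-checked outside Excel. The sketch below uses Python's standard `statistics` module as an illustrative alternative; it is not part of the original exercise.

```python
import statistics

# Starting salaries for the 12 graduates.
salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

salary_mean = statistics.mean(salaries)
salary_median = statistics.median(salaries)
salary_mode = statistics.mode(salaries)  # 2880 is the only value that appears twice

print(salary_mean, salary_median, salary_mode)
```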
Variance
The variance is a measure of variability that uses all the data.
• The variance is based on the difference between each
observation ($x_i$) and the mean ($\bar{x}$ for the sample, $\mu$ for the population).
The variance is the average of the
squared differences between the
observations and the mean value.
For the population:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
For the sample:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Standard Deviation
• The Standard Deviation of a data set is
the square root of the variance.
• The standard deviation is measured in
the same units as the data, making it
easy to interpret.
Computing a standard deviation
For the population:
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
For the sample:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Measures of Association
Between two Variables
•Covariance
•Correlation coefficient
Covariance
• Covariance is a measure of linear association
between variables.
• Positive values indicate a positive correlation
between variables.
• Negative values indicate a negative correlation
between variables.
To compute a covariance for variables x and y:
For populations:
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
For samples:
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
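The sample formula can be sketched in Python as below. The data are hypothetical, made up only to show a negative covariance; they do not come from the slides, and the function name `sample_covariance` is likewise illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of the products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / (n - 1)

# Hypothetical data: y tends to fall as x rises.
x = [1, 2, 3, 4, 5]
y = [10, 8, 7, 4, 1]

s_xy = sample_covariance(x, y)
print(s_xy)  # -5.5
```

The negative sign reflects that above-average x values are paired with below-average y values.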
Mortgage Interest Rates and Monthly Home Sales, 1980-2004
[Figure: scatter plot of Mortgage Interest Rate (percent, vertical axis) against Monthly Home Sales (thousands, horizontal axis); n = 299, $\bar{x} = 60.3$, $\bar{y} = 9.02$. The plot is divided into quadrants I, II, III, and IV.]
If the majority of the
sample points are
located in quadrants II
and IV, you have a
negative correlation
between the variables—
as we do in this case.
Thus the covariance will
have a negative sign.
The (Pearson) Correlation Coefficient
A covariance will tell you if 2
variables are positively or
negatively correlated—but it will
not tell you the degree of
correlation. Moreover, the
covariance is sensitive to the unit
of measurement. The correlation
coefficient does not suffer from
these defects
For populations:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
For samples:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
Note that:
$$-1 \le \rho_{xy} \le 1 \quad \text{and} \quad -1 \le r_{xy} \le 1$$
[Figure: Distance Traveled in 5 Hours (miles) plotted against Average Speed (MPH). Correlation coefficient = 1.]
[Figure: Time Spent Swimming (hours) plotted against Time Spent Jogging (hours), given "I have 7 hours per week for exercise". Correlation coefficient = -1.]
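Both extreme cases can be checked numerically with the sample formula $r_{xy} = s_{xy} / (s_x s_y)$. In the sketch below the specific speeds and jogging hours are assumed values chosen for illustration; only the relationships (distance = 5 hours × speed, and swimming = 7 hours minus jogging) come from the two examples.

```python
import math

def sample_correlation(xs, ys):
    """Pearson r: sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)

# Assumed speeds (MPH); distance covered in 5 hours is exactly 5 * speed.
speeds = [20, 40, 60, 80, 100]
distances = [5 * v for v in speeds]

# Assumed jogging hours; swimming takes whatever is left of the 7 weekly hours.
jogging = [0, 1, 2, 3, 4, 5, 6, 7]
swimming = [7 - h for h in jogging]

r_distance = sample_correlation(speeds, distances)
r_exercise = sample_correlation(jogging, swimming)
print(round(r_distance, 6), round(r_exercise, 6))
```

Rescaling either variable (for example, miles to kilometers) changes the covariance but leaves r unchanged, which is why r is unit-free.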
Normal Probability Distribution
The normal distribution is by
far the most important
distribution for continuous
random variables. It is widely
used for making statistical
inferences in both the natural
and social sciences.
It has been used in a wide variety of applications:
• Heights of people
• Scientific measurements
• Test scores
• Amounts of rainfall
The Normal Distribution
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}$$
Where:
μ is the mean
σ is the standard deviation
π = 3.14159
e = 2.71828
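The density formula translates directly into code. A minimal sketch, assuming a hypothetical helper named `normal_pdf` with the standard normal (μ = 0, σ = 1) as its default:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

peak = normal_pdf(0.0)             # curve height at the mean
left = normal_pdf(-1.0)
right = normal_pdf(1.0)

print(round(peak, 4))              # 0.3989
print(left == right)               # True: the curve is symmetric about the mean
```

Note that f(x) is the height of the curve, not a probability; probabilities come from areas under the curve.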
Normal Probability Distribution
Characteristics
• The distribution is symmetric, and is bell-shaped.
• The entire family of normal probability distributions is defined by its mean μ and its standard deviation σ.
• The highest point on the normal curve is at the mean, which is also the median and mode.
• The mean can be any numerical value: negative, zero, or positive (for example, -10, 0, or 20).
• The standard deviation determines the width of the curve: larger values result in wider, flatter curves (for example, σ = 25 gives a wider, flatter curve than σ = 15).
• Probabilities for the normal random variable are given by areas under the curve. The total area under the curve is 1 (.5 to the left of the mean and .5 to the right).
The Standard Normal Distribution
The standard normal distribution is a normal distribution with the
special properties that its mean is zero and its standard deviation is one:
$\mu = 0$, $\sigma = 1$
Standard Normal Probability Distribution
The letter z is used to designate the standard
normal random variable.
1
z
0
Cumulative Probability
Probability that z ≤ 1 is the area under the curve
to the left of 1.
[Figure: standard normal curve with the area P(z ≤ 1) shaded to the left of z = 1.]
What is P(z ≤ 1)?
To find out, use the Cumulative Probabilities
Table for the Standard Normal Distribution
z      .00     .01     .02
...
.9     .8159   .8186   .8212
1.0    .8413   .8438   .8461
1.1    .8643   .8665   .8686
1.2    .8849   .8869   .8888
...
P(z ≤ 1) = .8413
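Table entries like these can also be computed rather than looked up. The sketch below uses the identity Φ(z) = ½[1 + erf(z/√2)] with `math.erf` from Python's standard library; the function name `standard_normal_cdf` is an illustrative choice.

```python
import math

def standard_normal_cdf(z):
    """Cumulative probability P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Reproduce the .00 column of the table excerpt.
for z in (0.9, 1.0, 1.1, 1.2):
    print(f"{z:.1f}  {standard_normal_cdf(z):.4f}")

p_z_le_1 = standard_normal_cdf(1.0)   # row 1.0, column .00
```

Row and column labels add: for example, row 1.0 and column .02 give the entry for z = 1.02.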
Area under the curve
• 68.27 percent of the total area under the curve is within (±) 1 standard deviation of the mean.
• 95.45 percent of the area under the curve is within (±) 2 standard deviations of the mean.
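These percentages follow from the cumulative probabilities: the area within k standard deviations is P(z ≤ k) − P(z ≤ −k). A short sketch using `statistics.NormalDist` (available in Python 3.8 and later); the helper name `area_within` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()   # defaults to mean 0, standard deviation 1

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1 = area_within(1)
within_2 = area_within(2)
print(f"{within_1:.2%}")   # 68.27%
print(f"{within_2:.2%}")   # 95.45%
```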
Exercise 1
a) What is P(z ≤ 2.46)?
b) What is P(z > 2.46)?
Answer:
a) .9931
b) 1 - .9931 = .0069
Exercise 2
a) What is P(z ≤ -1.29)?
b) What is P(z > -1.29)?
Answer:
a) 1 - .9015 = .0985
b) .9015
Note that:
$$P(z \le -1.29) = 1 - P(z \le 1.29)$$
Because of the symmetry, the area to the left of -1.29 (red-shaded) is
the same as the area to the right of 1.29 (green-shaded).
Exercise 3
What is P(.00 ≤ z ≤ 1.00)?
$$P(.00 \le z \le 1.00) = P(z \le 1) - P(z \le 0) = .8413 - .5000 = .3413$$