L00D MGS8110 Statist.. - Georgia State University

Download Report

Transcript L00D MGS8110 Statist.. - Georgia State University

Descriptive
Statistics
Prerequisite Material
MGS 8110
Regression & Forecasting
Descriptive Analysis of Data
Given a bunch of data (simple data for only one
variable), how is the best way to summarize the
data?
L00D MGS 8110 - Descriptive Statistics
2
Measures of Central Tendency
•
Mean
•
Median
x
if n is odd
x
i
n
M  x n 1
M  xn
2
1
2 2
if n is even
•

M  (x n  x n ) / 2
2
2
1
•
Mode
•
Mid Range
M R  ( x min  x max ) / 2
•
Mid InterQuartile
M Q  ( x .25  x .75 ) / 2
the xi value that occurs most frequently
L00D MGS 8110 - Descriptive Statistics
3
Measures of Variability
Av Deviate
(x  x)


•
Av Absolute Deviate
x x


•
Variance
•
i
n
i
n
s
2
 (x  x)

2
i
n 1
 (x  x)
2
•
Standard Deviation
•
Range
R  ( x max  x min )
•
Inter Quartile Range (IQR)
RQ  ( x .75  x .25 )
s
i
n 1
L00D MGS 8110 - Descriptive Statistics
4
Lets Review
- Statistical Precepts
Use the Mean for quantitative, symmetric data.
Use the Medium for quantitative non-symmetric data.
Use the mode for categorical data.
Use the Variance when doing calculations.
Use the Standard Deviation when presenting the results of
the calculations.
Major Teaching Points are frequently
shown in green boxes
L00D MGS 8110 - Descriptive Statistics
5
Appropriate Statistics
discussed in chapter 2
Quantitative - Bell Shaped
Quantitative - NOT Bell Shaped
Ordinal
Nominal
Mean
Median
Median
Mode
St Dev
Quartile Range
Quartile Range
NA
Histogram
Histogram
Tables or Bar charts
Tables or Bar charts
L00D MGS 8110 - Descriptive Statistics
6
More Review
s2
(x  x)


s
2
– Need to memorize formules
Used for calculation, but not for presentations. Units
are squared (e.g., inches squared).
i
n 1
 ( xi  x )2
Used for presentations. Common units (e.g., inches).
n 1
Divide by n-1 instead of n in order to get an unbiased estimate.
L00D MGS 8110 - Descriptive Statistics
7
Interpretation of Standard
Deviation
(1 of 2)
If the data is normally distributed
( x  s , x  s ) contains 68.27% of thepopulation
( x  2s , x  2s ) contains 95.45% of thepopulation
( x  3s , x  3s ) contains 99.73 of thepopulation
Statistical Precepts
Two-thirds of the data is contained in  one sigma.
95% of the data is contained in  two sigma.
Almost all of the data is contained in  three sigma.
L00D MGS 8110 - Descriptive Statistics
8
Interpretation of Standard
Deviation
(2 of 2)
If ever asked to explain what the Standard Deviation means, say
“two-thirds of the data will be within plus or minus one
Standard Deviation from the mean”.
If ever asked for the “worst case” or “best case” outcome
calculate “mean – (2)sigma” and/or “mean + (2)sigma”.
Statistical Precepts
Definition of Standard Deviation - two-thirds of the
data is contained in a range of values that are two sigma
wide.
Worst case outcome is
m – 2s.
Best case outcome is
m + 2s.
L00D MGS 8110 - Descriptive Statistics
9
Other Measures
•
Percentiles
xp is the xi such that
p% of the xi < xp
(measure of tails)
•
Quartiles
are percentile where p = 25, 50 or 75
the lower, middle or upper quartile.
(measure of tails)
•
C.25  C1  x 1 or x 1
4
n
4
C.50  C2  x n 1
n
3
4
or
or x n 1
4
 (x n  x n ) / 2
2
2
C.75  C3  x 3  x 3
4
n
1
n
4 4
 x 3n 1
2
1
C0  x minimum
C4  x maximum
4
L00D MGS 8110 - Descriptive Statistics
10
Other Measures
•
Coefficient of
Variation
(percentage measure of variability)
•
Correlation
Coefficient
(measure of
linear association)
s
Cv 
x
  ( xi  x )( y i  y ) 
 / sx s y

n 1


  
  1,1
L00D MGS 8110 - Descriptive Statistics
11
4
Interpretation
of Correlation
Coefficient
n=25
   1,1
3
2
Y variable
1
0
-4
-3
-2
-1
0
1
2
3
4
-1
-2
=0
-3
-4
4
4
3
3
2
2
1
1
0
-4
-3
-2
-1
0
1
2
3
4
Y variable
Y variable
X variable
0
-4
-3
-2
-1
0
-1
-1
-2
-2
-3
-3
-4
X variable
=.8
-4
1
2
3
4
=1
X variable
L00D MGS 8110 - Descriptive Statistics
12
4
Interpretation
of Correlation
Coefficient
n=25
   1,1
3
2
Y variable
1
0
-4
-3
-2
-1
0
1
2
3
4
-1
-2
=0
-3
-4
4
4
3
3
2
2
1
1
0
-4
-3
-2
-1
0
1
2
3
4
Y variable
Y variable
X variable
0
-4
-3
-2
-1
0
-1
-1
-2
-2
-3
-3
-4
X variable
=-.8
-4
1
2
3
4
=-1
X variable
L00D MGS 8110 - Descriptive Statistics
13
Interpretation of Correlation Coefficient
1
  ( xi  x )( yi  y ) 




n 1



sx s y
0.9
Percentage of Variability Explained
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Correlation Coefficient
L00D MGS 8110 - Descriptive Statistics
14
Interpretation of Correlation Coefficient
(1 of 2)
Statistical tests of correlation coefficients are relatively
meaningless. These tests are based on the hypothesis
that “ = 0”. Based on the previous graph, knowing that
a correlation coefficient is greater than zero is not
necessarily a valuable piece of information.
In terms of “Practical Significance” (compared to
“Statistical Significance”), the correlation coefficient
has to be at least greater than .5. From the previous
graph it can be seen that  = .5 only explains 25% of the
variability in the data.
Statistical Precept
 must be greater than .5 to be of Practical Significance
L00D MGS 8110 - Descriptive Statistics
15
Other Measures
•
Skew
(measure of symmetry)
(page 3 of 3)
 (x  x)
3
i
3 
n
  ( xi  x )


n

2




3
2
 3   , more typicall y   3,3
•
Kurtosis
(measure of peakedness)
 (x  x)
4
i
4 
n
2
  ( xi  x ) 2 




n


3
 3   , more typicall y   3,3
L00D MGS 8110 - Descriptive Statistics
16
Verifying Bell Shape
(Normal Distribution)
•
Negative Skew if the distribution has a ‘long’ tail to the left,
measured as skewness  -1
•
Positive Skew if the distribution has a ‘long’ tail to the right (more
common situation), measured as skewness  +1
•
Symmetric if
-1  skewness  +1
•
Peaked Distribution if Kurtosis is a large positive number ( +1).
•
Flat Distribution if Kurtosis is a large negative number ( -1).
•
Normal shape (proportionally S-shaped sides) if Kurtosis near zero.
L00D MGS 8110 - Descriptive Statistics
17
Verifying Bell Shape
(Normal Distribution)
Statistical Precept
Bell shaped (Normally distributed)
if -1  skewness  +1
and
if -1  kurtosis  +1
L00D MGS 8110 - Descriptive Statistics
18
How is the best way to
summarize data?
(our original question)
•
Central Tendency (Mean, Median & Mode)
•
Variability (Variance & Standard Deviation)
•
Shape (Percentiles, Skewness & Kurtosis)
•
Association (correlation)
L00D MGS 8110 - Descriptive Statistics
19
Notations
Parameter
Population
True Value
Sample Value
Mean
(average)
m
X
Standard
Deviation
s
s
Variance
s2
s2
Correlatio n

p
L00D MGS 8110 - Descriptive Statistics
20
Standard Deviation
of Sample Mean
sX
sX 
n
Called the “Standard Error” of the Mean
L00D MGS 8110 - Descriptive Statistics
21
Insert / Function examples
(1 of 3)
•
Mean
Average(A1:A10)
•
Median
Median(A1:A10)
•
Mode
Mode(A1:A10)
•
Mid Range
( MAX(A1:A10) + MIN(A1:A10) ) / 2
•
InterQuartile
( Quartile(A1:A10,1) + Quartile(A1:A10,3) ) / 2
L00D MGS 8110 - Descriptive Statistics
22
Insert / Function examples
(2 of 3)
•
Av Deviate
NA
•
Av Absolute Deviate
AveDev(A1:A10)
•
Variance
Var(A1:A10)
•
Standard Deviation
StDev(A1:A10)
•
Range
MAX(A1:A10) - MIN(A1:A10)
•
Inter Quartile Range (IQR)
Quartile(A1:A10,3) - Quartile(A1:A10,1)
L00D MGS 8110 - Descriptive Statistics
23
Insert / Function examples
(3 of 3)
•
Percentiles
Percentile(A1:A10,.05)
•
Quartiles
Quartile(A1:A10,q) where q=0,1,2,3,4
•
Coef. of Variation
StDev(A1:A10)/Average(A1:A10)
•
Correlation
Correl(A1:A10,B1:B10)
•
Skew
Skew(A1:A10)
•
Kurtosis
Kurt(A1:A10)
L00D MGS 8110 - Descriptive Statistics
24
Example
Calculations
Q. Should I use the Mean or
the Median to state the
Central value of this data?
-0.31
-1.27
=SKEW(Height)
=KURT(Height)
0.17
-1.03
=SKEW(Weight)
=KURT(Weight)
Answer – Both variables have a somewhat
peaked distributions (Kurtosis greater
than 1), but both variables have very
symmetric distributions (non-skewed
distribution); hence, use Mean.
L00D MGS 8110 - Descriptive Statistics
25
Example
Calculations
Q. The Standard Deviation for
Height is almost 2 inches,
what is the practical
interpretation of this value?
67.1
71.0
3.87
=B14-B15
=B14+B15
=F10-F9
Answer – The height of 2/3 of the
population will vary by less than 4
inches (3.87”).
L00D MGS 8110 - Descriptive Statistics
26
Example
Calculations
Q. What is the height of the
shortest person and the
tallest person that I may
meet today (worst case and
best case)?
65.2
72.9
=AvHt-2*StDevHt
=AvHt+2*StDevHt
Answer – The shortest person will be 5’-5”
(65.2”) and the tallest person will be 6’-1”
(72.9”).
L00D MGS 8110 - Descriptive Statistics
27
Example
Calculations
Q. What is the height of the
shortest person that I may
meet over the next year?
66.0
=PERCENTILE(Height,0.01)
Answer – The shortest person that I
am likely to meet in the foreseeable
future will be 5’-6” (66.0”).
L00D MGS 8110 - Descriptive Statistics
28
Example
Calculations
Q. The answers to the two
previous questions are not
consistent. The 5% values
calculated as Mean –
2(Sigma) was 5’-5” where as
the 1% value calculated as a
Percentile was 5’-6”.
Answer – These types of inconsistencies (i.e., errors) will occur
with small samples. The procedure used by the PERCENTILE
function is based on an interpolated calculation with the two
smallest values in the sample.
L00D MGS 8110 - Descriptive Statistics
29
Example
Calculations
Q. Which variable, Height or
Weight, has the greatest
relative variability?
Coef of Variation
3%
11%
=StDevHt/AvHt
=StDevWt/AvWt
Answer – In agreement with our intuition,
Weight is 3 to 4 times more variable
than height (11/3 = 3.67).
L00D MGS 8110 - Descriptive Statistics
30
Example
Calculations
Q. Is there a relationship
between Height and Weight
and if so how large is the
relationship?
0.783
0.61
=CORREL(Height,Weight)
=G23^2
1
0.9
Percentage of Variability Explained
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Answer – The correlation between
Height and Weight is .78 which
means that about 60% (.61) of the
variability in weight is due to
differences in Height.
Correlation Coefficient
L00D MGS 8110 - Descriptive Statistics
31
Example
Calculations
Q. Given that there is a
relationship between Height
and Weight, is the
relationship linear or nonlinear?
0.783
0.61
=CORREL(Height,Weight)
=G23^2
Answer – Simple statistics cannot be used to determine linear
versus non-linear, would need to plot the data. The correlation
indicates that there is a relatively strong linear relationship,
but a plot of the data (Weight vs. Height) may indicate that
there is an even stronger non-linear relationship
L00D MGS 8110 - Descriptive Statistics
32
Example
Calculations
Q. Are Height and Weight
Normally distributed?
-0.31
-1.27
=SKEW(Height)
=KURT(Height)
0.17
-1.03
=SKEW(Weight)
=KURT(Weight)
Answer – Based on out Rule-of-Thumb test
(-1 < Skew < +1 and -1 < Kurt < +1),
neither of these variables are normally
distributed.
L00D MGS 8110 - Descriptive Statistics
33
Example
Calculations
Q. Given that the variables are
NOT Normally distributed,
why do I care?
Answer – You previous interpretation of the Standard Deviation
maybe somewhat inaccurate (“The height of 2/3 of the
population will vary by less than 4 inches “). Also, you
previous interpretation of Worst Case and Best Case maybe
somewhat inaccurate (“The shortest person will be 5’-5” and
the tallest person will be 6’-1”).
L00D MGS 8110 - Descriptive Statistics
34
Example
Calculations
Q. The average Height is
estimated to be 69.1”, how
good is that estimate?
67.8
70.3
=AvHt-2*StErrorHt
=AvHt+2*StErrorHt
Answer – The true average height could be
anywhere between 67.8 inches to 70.3
inches. A better estimate could be
obtained if a large sample was available.
L00D MGS 8110 - Descriptive Statistics
35
More about Variability
s
 (x  x)
s
 ( xi  x )2
s
 (x  m)
2
Use StDev (or Var) in Excel
i
n 1
N
2
i
n
Alternative formulation
1) if every item in the Universe is included in the
Sample or
2) The Mean is know with certainty.
Use StDevP (or VarP) in Excel
L00D MGS 8110 - Descriptive Statistics
36
0.18
0.16
0.14
Probability Density
Normal
Calculations
Probability Density Function for Male Height
0.12
0.10
0.08
0.06
.95
0.04
0.02
0.00
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
76
77
78
79
Height
NORMINV(0.95,68.8,2.6)=73.08
Cumulative Distribution Function for Male Height
1.00
.95
0.90
Cumulative Distribution
0.80
0.70
0.60
0.50
0.40
0.30
0.20
0.10
0.00
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
Height
L00D MGS 8110 - Descriptive Statistics
37
Probability Density Function for Male Height
0.16
0.14
Probability Density
Normal
Calculations
0.18
0.12
0.10
0.08
0.06
0.04
0.02
.244
0.00
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
76
77
78
79
Height
NORMDIST(67,68.8,2.6,TRUE)=.244
Cumulative Distribution Function for Male Height
1.00
0.90
Cumulative Distribution
0.80
0.70
0.60
0.50
0.40
0.30
.244
0.20
0.10
0.00
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
Height
L00D MGS 8110 - Descriptive Statistics
38
x x
z i
s
0.45
0.40
0.35
Probability Density
Standardized
Normal
Calculations
Probability Density Function for Male Height
0.30
0.25
0.20
0.15
0.10
.95
0.05
X variable has mean and
StDev of m and s which are
estimated by x bar and s.
0.00
-5
-4
-3
-2
-1
0
1
2
3
4
5
4
5
Height
Cumulative Distribution Function for Male Height
1.00
NORMSINV(0.95)=1.645
0.80
Cumulative Distribution
Z variable has mean=0 and
StDev=1. Z is a “standardized
normal”.
.95
0.90
0.70
0.60
0.50
0.40
0.30
0.20
0.10
NORMSINV(.05)=-1.645
0.00
-5
-4
-3
-2
-1
0
1
2
3
Height
L00D MGS 8110 - Descriptive Statistics
39
0.45
0.40
0.35
Probability Density
Standardized
Normal
Calculations
Probability Density Function for Male Height
0.30
0.25
0.20
0.15
0.10
0.05
.159
0.00
-5
-4
-3
-2
-1
0
1
2
3
4
5
4
5
Height
NORMSDIST(-1)=.159
Cumulative Distribution Function for Male Height
NORMSDIST(+1)=.841
1.00
0.90
Cumulative Distribution
0.80
0.70
0.60
0.50
0.40
0.30
0.20
.159
0.10
0.00
-5
-4
-3
-2
-1
0
1
2
3
Height
L00D MGS 8110 - Descriptive Statistics
40
t-distribution
Probability Density Function for Male Height
0.45
t-distribution is
needed if s is not
know and estimated
by s and n<30.
0.40
Probability Density
0.35
0.30
0.25
0.20
0.15
0.10
.95
0.05
0.00
-5
-4
-3
-2
-1
0
1
2
3
4
5
Height
t-Dist df=4
t-Dist df=16
z Normal
L00D MGS 8110 - Descriptive Statistics
41
Probability Density Function for Male Height
0.35
0.30
Probability Density
t-distribution
Calculations
0.40
0.25
0.20
0.15
one-tail
0.10
0.05
.058
0.00
-5
-4
-3
-2
-1
0
1
2
3
4
5
Height
TDIST(2,4,1)=.058
Cumulative Distribution Function for Male Height
1.0
TDIST(X, d.f., # tails)
0.9
“t” with tails=1 sums from +
infinity.
“Z” and “Normal” sums from –
infinity.
Cumulative Distribution
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
.058
0.0
-5
-4
-3
-2
-1
0
1
2
3
4
Height
L00D MGS 8110 - Descriptive Statistics
42
5
two-tail
0.40
0.35
0.30
Probability Density
t-distribution
Calculations
Probability Density Function for Male Height
0.25
0.20
0.15
0.10
0.05
.058
-5
-4
-3
.058
0.00
-2
-1
0
1
2
3
4
5
Height
TDIST(2,4,2)=.116
Cumulative Distribution Function for Male Height
1.0
TDIST(X, d.f., # tails)
0.9
“t (2 tail)” sums simultaneously
from both – infinity and +
infinity. Undefined for negative
values of t.
Cumulative Distribution
0.8
0.7
0.6
0.5
0.4
0.3
0.2
.116
0.1
0.0
-5
-4
-3
-2
-1
0
1
2
3
4
5
Height
L00D MGS 8110 - Descriptive Statistics
43
Loading “Data
Analysis” in Office
2003
Both
NO
/ Tools / Add-Ins /
Will need to have original
MS Office CD.
L00D MGS 8110 - Descriptive Statistics
44
Example of
Tools / Data Analysis / Descriptive Statistics
L00D MGS 8110 - Descriptive Statistics
45
Example Output of
Tools / Data Analysis / Descriptive Statistics
Mean
Standard Error
Median
Mode
Standard Deviation
Sample Variance
Kurtosis
Skewness
Range
Minimum
Maximum
Sum
Count
47.84
2.63
48.00
41.00
11.46
131.25
-0.73
0.29
41
30
71
909
19
L00D MGS 8110 - Descriptive Statistics
46
Loading “Data
Analysis” in Office
2007
1) Click the Office button in
the upper left hand corner
of the Excel.
Both
NO
NO
2) Click the “Excel Options”
tab in the bottom righthand cornet of the dropdown menu gotten from
step #1.
L00D MGS 8110 - Descriptive Statistics
47
Loading “Data Analysis” in
Office 2007
Both
Both
3) Click Add-Ins in the left
banner of the Excel
Options menu.
NO
NO
4) Click “Analysis ToolPak”
in the Add-ins menu.
Then, select BOTH
“Analysis ToolPak” and
“Analysisi ToolPak – VBA”
5) Click the Go button at
the bottom right hand
corner of the Excel
Options menu. Don’t
click the “OK” button”.
L00D MGS 8110 - Descriptive Statistics
48
Loading “Data
Analysis” in Office
2007
L00D MGS 8110 - Descriptive Statistics
49
Example of
Tools / Data Analysis
/ Descriptive
Statistics
L00D MGS 8110 - Descriptive Statistics
50
Precision of numerical results –
state “3 Significant Digits”
Too Many Digits
Too Few Digits
Correct 3+ significant digits
Wt
Wt
Wt
Mean
Mean
76.5
Mean
76.5
Standard Error
4.8
Standard Error
4.78
76.506
Standard Error 4.780314
Median
73.58
Median
73.6
Median
73.6
Mode
#N/A
Mode
#N/A
Mode
#N/A
Standard Deviation 15.11668
Standard Deviation
15.1
Standard Deviation
15.1
Sample Variance 228.514
Sample Variance
228.5
Sample Variance
228.5
Kurtosis -0.67472
Kurtosis
-0.7
Kurtosis
-0.675
Skewness
0.4
Skewness
0.362
Range
48.0
Range
48.0
Minimum
54.3
Minimum
54.3
Maximum
102.3
Maximum
102.3
Sum
765.1
Sum
765.1
Count
10.0
Count
Skewness 0.362102
too few digits
Range
47.98
Minimum
54.28
Maximum
102.26
Sum
meaningless
765.06
extra digits
Count
10
(.651 to .744)
OK
10
L00D MGS 8110 - Descriptive Statistics
51
Precision of numerical results –
state “3 Significant Digits”
(continued)
CORRECT
3 significant digits
INCORRECT
3 decimal places
Wt
Wt
Mean
76.5
Mean
76.506
Standard Error
4.78
Standard Error
4.780
Median
73.6
Median
73.580
Mode
#N/A
Mode
#N/A
Standard Deviation
15.1
Standard Deviation
15.117
Sample Variance
229
Sample Variance 228.514
Kurtosis
-0.675
Kurtosis
-0.675
Skewness
0.362
Skewness
0.362
Range
48.0
Range
47.980
Minimum
54.3
Minimum
54.280
Maximum
102
Maximum 102.260
Sum
765
Sum 765.060
Count
10
Count
10.000
L00D MGS 8110 - Descriptive Statistics
52
Some Great Rules of Thumb
Data is Normally distributed (Bell
shaped)
if -1  skewness  +1
if -1  kurtosis  +1
and
Data is a potential outlier
Symmetric distribution
xi < mean – 3s
or
Skewed distribution
xi < Q1 – (1.5)RQ
xi > mean + 3 s
or
xi > Q3 + (1.5)RQ
L00D MGS 8110 - Descriptive Statistics
53
Prerequisite Spreadsheet Skills
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
Cut, Copy, Paste & Paste Special
Cell corner Copy
Add or delete Rows or Columns
Change width/height of row/column
Font, alignment, boarder & number of cell
Referencing and calculations with cells.
Data / Sort
Naming cell or range of cells
Insert / Function / Average
Sum, Max, Min, Count, Small and Large
(Tools / Add-ins / Data analysis)
Tools / Data Analysis / Descriptive statistics
Single quote for equation statement.
REPLACE command.
DATA / Group & FORMAT: Column, Hide.
Grab an entire column of data (CTRL+SHIFT, down arrow).
L00D MGS 8110 - Descriptive Statistics
54