Slide 1

Transcript Slide 1

CHAPTER 2
• 2.1 - Basic Definitions and Properties
  o Population Characteristics = “Parameters”
  o Sample Characteristics = “Statistics”
  o Random Variables (Numerical vs. Categorical)
• 2.2, 2.3 - Exploratory Data Analysis
  o Graphical Displays
  o Descriptive Statistics
    • Measures of Center (mode, median, mean)
    • Measures of Spread (range, variance, standard deviation)
“Classical Scientific Method”
• Hypothesis – Define the study population...
What’s the question?
• Experiment – Designed to test hypothesis
• Observations – Collect sample measurements
• Analysis – Do the data formally tend to support or
refute the hypothesis, and with what strength?
(Lots of juicy formulas...)
• Conclusion – Reject or retain hypothesis; is the
result statistically significant?
• Interpretation – Translate findings in context!
Statistics is implemented in each step of the classical
scientific method!
• Analysis – Do the data formally tend to support or
refute the hypothesis, and with what strength?
(Lots of juicy formulas...)
To help answer this question, we should first try to
obtain an informal “feel” for the sample data we
have collected, and see if it suggests anything
about the population distribution.
~ Exploratory Data Analysis ~
1. Visual Displays (charts, tables, graphs, etc.)
“What do the data look like?”
2. “Descriptive Statistics” (measures of center,
measures of spread, proportions, etc.)
“How can the data be summarized?”
Example: Suppose the random variable is X = Age (years) in a certain population of individuals,
and we select the following random sample of n = 20 ages.
{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
Counts by class interval: 4 values in [10, 20), 8 values in [20, 30), 5 values in [30, 40), 2 values in [40, 50), 1 value in [50, 60).
From these values, we can construct a table which consists of the frequencies of each age-interval
in the dataset, i.e., a frequency table.
Class Interval    Frequency
[10, 20)          4
[20, 30)          8
[30, 40)          5
[40, 50)          2
[50, 60)          1
Total             n = 20

[Frequency Histogram: bar heights 4, 8, 5, 2, 1 over the class intervals above.]
Suggests population may be skewed to the right (i.e., positively skewed).

“Endpoint convention”
Here, the left endpoint is included, but not the right.
Note!... Stay away from “10-20,” “20-30,” “30-40,” etc.
In published journal articles, the original data are almost never shown; instead, they are
summarized in tabular form as above. This summary is called “grouped data.”
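The frequency table above can be rebuilt from the raw ages with a few lines of Python. This is only a minimal sketch: the names `ages`, `edges`, and `frequency_table` are illustrative choices, not part of the lecture. It applies the same endpoint convention (left endpoint included, right endpoint excluded).

```python
# Raw sample of n = 20 ages from the example.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]

# Class boundaries; consecutive pairs define the intervals [lo, hi).
edges = [10, 20, 30, 40, 50, 60]

def frequency_table(values, edges):
    """Count the values falling in each half-open class interval [lo, hi)."""
    table = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        table.append(((lo, hi), count))
    return table

freq = frequency_table(ages, edges)
for (lo, hi), count in freq:
    print(f"[{lo}, {hi})  {count}")
print("Total ", sum(count for _, count in freq))
```

Because each interval is half-open, a value such as 20 is counted in [20, 30) only, which is why labels like “10-20” are ambiguous.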
Often though, it is preferable to work with proportions, i.e., relative frequencies…
Divide frequencies by n = 20.

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 40)          5            5/20 = 0.25
[40, 50)          2            2/20 = 0.10
[50, 60)          1            1/20 = 0.05
Total             n = 20       20/20 = 1.00

[Relative Frequency Histogram: bar heights 0.20, 0.40, 0.25, 0.10, 0.05; vertical axis from 0.0 to 0.4.]
Relative frequencies are always between 0 and 1, and sum to 1.
Class Interval    Frequency    Relative Frequency    Cumulative Relative Frequency
(below 10)                                           (0.00)
[10, 20)          4            4/20 = 0.20           0.20
[20, 30)          8            8/20 = 0.40           0.60
[30, 40)          5            5/20 = 0.25           0.85
[40, 50)          2            2/20 = 0.10           0.95
[50, 60)          1            1/20 = 0.05           1.00
Total             n = 20       20/20 = 1.00

[Relative Frequency Histogram, as before, annotated with the cumulative relative frequencies:]
“0.00 of the sample is under 10 yrs old”
“0.20 of the sample is under 20 yrs old”
“0.60 of the sample is under 30 yrs old”
“0.85 of the sample is under 40 yrs old”
“0.95 of the sample is under 50 yrs old”
“1.00 of the sample is under 60 yrs old”
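The relative and cumulative relative frequency columns follow mechanically from the class counts. The sketch below (variable names are illustrative) accumulates the counts first and divides by n afterwards, which avoids adding up rounded decimals.

```python
from itertools import accumulate

# Class frequencies for [10,20), [20,30), [30,40), [40,50), [50,60).
counts = [4, 8, 5, 2, 1]
n = sum(counts)  # sample size, 20

# Relative frequency = class frequency / n.
rel_freqs = [c / n for c in counts]

# Cumulative relative frequency = running total of counts / n.
cum_rel_freqs = [c / n for c in accumulate(counts)]

for rf, crf in zip(rel_freqs, cum_rel_freqs):
    print(f"{rf:.2f}  {crf:.2f}")
```

The last cumulative value is always 1, since the running total of the counts ends at n.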
Cumulative relative frequencies always increase from 0 to 1.

Example: Approximately (not exactly, since only the grouped data are used) what proportion of the sample is under 34 years old?
Solution: [0, 30) contains 0.60; [30, 34) contains 4/10 of 0.25 = 0.10; sum = 0.70.
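The solution above is a linear interpolation inside the class interval that contains 34. A small sketch of that calculation follows; the function name `proportion_below` is hypothetical, and the code assumes values are spread evenly within each class, which is what makes the answer approximate. For comparison, it also counts the raw ages directly.

```python
def proportion_below(x, edges, rel_freqs):
    """Approximate the proportion of the sample below x from grouped data,
    assuming values are spread evenly within each class interval."""
    total = 0.0
    for lo, hi, rf in zip(edges[:-1], edges[1:], rel_freqs):
        if x >= hi:
            total += rf                          # whole class lies below x
        elif x > lo:
            total += rf * (x - lo) / (hi - lo)   # partial class
    return total

edges = [10, 20, 30, 40, 50, 60]
rel_freqs = [0.20, 0.40, 0.25, 0.10, 0.05]
approx = proportion_below(34, edges, rel_freqs)

# Direct count from the raw data, for comparison.
ages = [18, 19, 19, 19, 20, 21, 21, 23, 24, 24,
        26, 27, 31, 35, 35, 37, 38, 42, 46, 59]
exact = sum(1 for a in ages if a < 34) / len(ages)

print(f"grouped-data approximation: {approx:.2f}")
print(f"raw-data proportion:        {exact:.2f}")
```

The two numbers differ because the five ages in [30, 40) are not evenly spread across that interval.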
But alas, there is a major problem….
Suppose that, for the purpose of the study, we are not primarily concerned with those 30 or older,
and wish to “lump” them into a single class interval.
{18, 19, 19, 19, 20, 21, 21, 23, 24, 24, 26, 27, 31, 35, 35, 37, 38, 42, 46, 59}
Counts by class interval: 4 values in [10, 20), 8 values in [20, 30), 8 values in [30, 60).
What effect will this have on the histogram?

Class Interval    Frequency    Relative Frequency
[10, 20)          4            4/20 = 0.20
[20, 30)          8            8/20 = 0.40
[30, 60)          8            8/20 = 0.40
Total             n = 20       20/20 = 1.00

[Relative Frequency Histogram: bar heights 0.20, 0.40, 0.40.]
The skew no longer appears.
The histogram is distorted because of the presence of an outlier (59) in the data,
creating the need for unequal class widths.
Outliers (A Pain in the Tuches)
• What are they?
  Informally, an outlier is a sample data value that is either
  “much” smaller or larger than the other values.
• How do they arise?
  o experimental error
  o measurement error
  o recording error
  o not an error; genuine
• What can we do about them?
  o double-check them if possible
  o delete them?
  o include them… somehow
  o perform analysis both ways
IDEA: Instead of having height of each class rectangle = relative frequency,
make...
area of each class rectangle = relative frequency.
height × width = relative frequency, so height (“Density”) = relative frequency / width.

Class Interval    Width    Relative Frequency    Density (= height)
[10, 20)          10       0.20                  0.20/10 = 0.020
[20, 30)          10       0.40                  0.40/10 = 0.040
[30, 60)          30       0.40                  0.40/30 = 0.0133…
Total                      20/20 = 1.00

[Density Histogram: bar heights 0.020, 0.040, 0.0133…; bar areas 0.20, 0.40, 0.40.]
Total Area = 1!
The outlier is included, and the overall skewed appearance is restored.
Exercise: What if the outlier was 99 instead of 59?
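The density heights can be checked numerically. The sketch below (illustrative names only) divides each relative frequency by its class width and confirms that the rectangle areas add back up to 1.

```python
# Unequal-width classes: [10,20), [20,30), [30,60).
edges = [10, 20, 30, 60]
rel_freqs = [0.20, 0.40, 0.40]

widths = [hi - lo for lo, hi in zip(edges[:-1], edges[1:])]

# Density (rectangle height) = relative frequency / class width.
densities = [rf / w for rf, w in zip(rel_freqs, widths)]

# Area of each rectangle = height x width = relative frequency.
total_area = sum(d * w for d, w in zip(densities, widths))

for (lo, hi), d in zip(zip(edges[:-1], edges[1:]), densities):
    print(f"[{lo}, {hi})  density = {d:.4f}")
print(f"total area = {total_area:.2f}")
```

Changing the last entry of `edges` is one way to experiment with a wider final class.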
“Measures of Center”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}

  i      Data values x_i    Frequencies f_i
  1      70                 1
  2      80                 4
  3      90                 2
  4      100                3
  Total                     n = 10

• sample mode
  most frequent value = 80
• sample median
  “middle” value = (80 + 90)/2 = 85
  (Quartiles are found similarly: Q1 = 80, Q2 = 85, Q3 = 100)
• sample mean
  average value:
  $\bar{x} = \frac{1}{n} \sum_i x_i f_i = \frac{1}{10} [(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$
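The three measures of center can be computed from the frequency table with Python's standard `statistics` module. This is a sketch with illustrative variable names; the table is expanded back into the ten individual scores before the mode and median are taken.

```python
from statistics import mean, median, mode

values = [70, 80, 90, 100]   # distinct data values x_i
freqs = [1, 4, 2, 3]         # frequencies f_i

# Expand the table into the n = 10 individual scores.
scores = [x for x, f in zip(values, freqs) for _ in range(f)]
n = len(scores)

sample_mode = mode(scores)       # most frequent value
sample_median = median(scores)   # average of the two middle values when n is even

# Sample mean from the table: (1/n) * sum of x_i * f_i.
sample_mean = sum(x * f for x, f in zip(values, freqs)) / n

print(sample_mode, sample_median, sample_mean)
```

The table-based mean agrees with `mean(scores)` computed on the expanded list.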
“Measures of Center”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}

  Data values x_i    Frequencies f_i    Relative Frequencies f(x_i) = f_i/n
  70                 1                  1/10 = 0.1
  80                 4                  4/10 = 0.4
  90                 2                  2/10 = 0.2
  100                3                  3/10 = 0.3
  Total              n = 10             10/10 = 1.0

• sample mean
  $\bar{x} = \frac{1}{n} \sum_i x_i f_i = \frac{1}{10} [(70)(1) + (80)(4) + (90)(2) + (100)(3)] = 87$
  $\bar{x} = \sum_i x_i f(x_i) = (70)(\frac{1}{10}) + (80)(\frac{4}{10}) + (90)(\frac{2}{10}) + (100)(\frac{3}{10}) = 87$
“Notation, notation, notation.”
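The two forms of the sample mean give the same number, which is easy to confirm. The sketch below uses exact fractions (an implementation choice made here, not something the slides prescribe) so the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

values = [70, 80, 90, 100]   # x_i
freqs = [1, 4, 2, 3]         # f_i
n = sum(freqs)

# Relative frequencies f(x_i) = f_i / n, kept as exact fractions.
rel_freqs = [Fraction(f, n) for f in freqs]

# Form 1: (1/n) * sum of x_i * f_i
mean_from_counts = Fraction(sum(x * f for x, f in zip(values, freqs)), n)

# Form 2: sum of x_i * f(x_i)
mean_from_rel = sum(x * p for x, p in zip(values, rel_freqs))

print(mean_from_counts, mean_from_rel)
```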
“Measures of Spread”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}
• sample mean
  $\bar{x} = 87$
… but how do we measure the “spread” of a set of values?

  Data values x_i    Frequencies f_i
  70                 1
  80                 4
  90                 2
  100                3
  Total              n = 10

First attempt:
• sample range = x_n – x_1 = 100 – 70 = 30. Simple, but…
  Ignores all of the data except the extreme points, thus
  far too sensitive to outliers to be of any practical value.
  Example: Company employee salaries, including CEO
Can modify with…
• sample interquartile range (IQR) = Q3 – Q1 = 100 – 80 = 20.
We would still prefer a measure that uses all of the data.
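Both the range and the IQR are quick to compute. The sketch below finds the quartiles as medians of the lower and upper halves of the sorted data; this "median of halves" rule is an assumption made here because it reproduces Q1 = 80 and Q3 = 100, and other quartile conventions can give slightly different values.

```python
from statistics import median

scores = sorted([70, 80, 80, 80, 80, 90, 90, 100, 100, 100])

# Range uses only the two extreme values.
sample_range = scores[-1] - scores[0]

# Quartiles as medians of the lower and upper halves
# (for odd n, the middle value is excluded from both halves).
half = len(scores) // 2
q1 = median(scores[:half])
q3 = median(scores[-half:])
iqr = q3 - q1

print(sample_range, q1, q3, iqr)
```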
“Measures of Spread”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}
• sample mean
  $\bar{x} = 87$
… but how do we measure the “spread” of a set of values?
Better attempt: Calculate the average of the “deviations from the mean.”
$\frac{1}{n} \sum_i (x_i - \bar{x}) f_i = \frac{1}{10} [(-17)(1) + (-7)(4) + (3)(2) + (13)(3)] = 0$ ????????

  Data values x_i    Frequencies f_i    Deviations from mean x_i – $\bar{x}$
  70                 1                  70 – 87 = –17
  80                 4                  80 – 87 = –7
  90                 2                  90 – 87 = +3
  100                3                  100 – 87 = +13
  Total              n = 10

This is not a coincidence: the deviations always sum to 0*, so it
is not a good measure of variability.
* Physically, the sample mean is a “balance point” for the data.
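Why the deviations always sum to 0 follows in one line from the definition of the sample mean. In the short derivation below, k denotes the number of distinct data values (a symbol introduced only for this sketch), and it uses the two facts that the frequencies sum to n and that the sum of x_i f_i equals n times the sample mean.

```latex
\sum_{i=1}^{k} (x_i - \bar{x})\, f_i
  = \sum_{i=1}^{k} x_i f_i \;-\; \bar{x} \sum_{i=1}^{k} f_i
  = n\bar{x} - \bar{x}\, n
  = 0
```

The positive and negative deviations cancel exactly, which is the algebraic version of the “balance point” remark.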
“Measures of Spread”
Example: Sample exam scores = {70, 80, 80, 80, 80, 90, 90, 100, 100, 100}
• sample mean
  $\bar{x} = 87$
Calculate a modified average of the “squared deviations from the mean.”
$\frac{1}{9} [(-17)^2 (1) + (-7)^2 (4) + (3)^2 (2) + (13)^2 (3)] = 112.22$

  Data values x_i    Frequencies f_i    Deviations from mean x_i – $\bar{x}$
  70                 1                  70 – 87 = –17
  80                 4                  80 – 87 = –7
  90                 2                  90 – 87 = +3
  100                3                  100 – 87 = +13
  Total              n = 10

• sample variance
  $s^2 = \frac{1}{n-1} \sum_i (x_i - \bar{x})^2 f_i$
• sample standard deviation
  $s = \sqrt{s^2}$
  s = 10.59
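The variance and standard deviation can be reproduced directly from the frequency table. The following is a minimal sketch with illustrative names; it uses the n – 1 denominator from the formula above and cross-checks the result against `statistics.variance`, which also divides by n – 1.

```python
import math
from statistics import variance

values = [70, 80, 90, 100]   # x_i
freqs = [1, 4, 2, 3]         # f_i
n = sum(freqs)

sample_mean = sum(x * f for x, f in zip(values, freqs)) / n

# Sum of squared deviations, weighted by frequency.
sum_of_squares = sum((x - sample_mean) ** 2 * f for x, f in zip(values, freqs))

sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library on the expanded data.
scores = [x for x, f in zip(values, freqs) for _ in range(f)]
library_variance = variance(scores)

print(f"s^2 = {sample_variance:.2f}, s = {sample_sd:.2f}")
```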
Comments
• $\bar{x}$ is an unbiased estimator of the population mean μ,
  s² is an unbiased estimator of the population variance σ².
  (Their “expected values” are μ and σ², respectively.)
• Beware of roundoff error!!! There is an alternate, more
  computationally stable formula for sample variance s².
• The numerator of s² is called a sum of squares (SS); the
  denominator “n – 1” is the number of degrees of freedom
  (df) of the n deviations x_i – $\bar{x}$, because they must satisfy a
  constraint (sum = 0), hence 1 degree of freedom is “lost.”
• A natural setting for these formulas and concepts is
  geometric, specifically, the Pythagorean Theorem:
  a² + b² = c². See lecture notes appendix…
  [Figure: right triangle with legs a and b and hypotenuse c.]
