Transcript Lecture 10

Statistics 111 - Lecture 10
Midterm review
Chapters 1-5
June 11, 2008
Administrative Notes
• Homework 3 is due Monday
– Covers material from Chapter 5, so worth doing as
practice for the midterm!
• Exam on Monday
– Starts exactly at 10:40 – get here early
Some Topics Not Covered on Midterm
• Continuity correction for binomial calculations (Chapter 5)
• Normal quantile plots (Chapter 1)
Experiments
• Used to examine the effect of a treatment, e.g. medical trials, education interventions
[Diagram: population → experimental units divided into a treatment group (treatment → result) and a control group (no treatment → result)]
• Different from an observational study, where no treatment is imposed
• Observational studies can only examine associations between variables, whereas experiments try to establish causal effects
• Experiments can still be biased though!
Sampling and Surveys
[Diagram: population and its parameter → sampling → sample and its statistic → estimation and inference back to the population parameter]
• Just like in experiments, we must be cautious of
potential sources of bias in our sampling results
• Voluntary response samples, undercoverage, non-response, untruthful responses, wording of questions
• Simple Random Sampling: less biased since each
individual in the population has an equal chance of
being included in the sample
Distributions
• A distribution describes what values a variable
takes and how frequently these values occur
• Boxplots are good for center and spread, but don’t indicate the shape of a distribution
• Histograms are much more effective at displaying the shape of a distribution
Numerical Measures of Center
• Mean: X̄ = (x₁ + x₂ + ⋯ + xₙ) / n = (1/n)·Σ xᵢ
• Median: “middle number in distribution”
• Mean is more affected by large outliers and
asymmetry than the median
• Symmetric: Mean ≈ Median
• Skewed Left: Mean<Median
• Skewed Right: Mean>Median
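As a quick check of this outlier behaviour, here is a minimal sketch in Python with NumPy (not part of the course material; the data values are made up for illustration):

```python
import numpy as np

# A small, roughly symmetric dataset (hypothetical values)
x = np.array([4, 5, 5, 6, 6, 7, 8])
print(np.mean(x), np.median(x))          # mean ~ median when symmetric

# Add one large outlier: the mean shifts far more than the median
x_out = np.append(x, 50)
print(np.mean(x_out), np.median(x_out))  # mean > median (skewed right)
```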
Numerical Measures of Spread
• Variance: average of the squared deviations of each observation
  s² = Σ(xᵢ - X̄)² / (n - 1)
• Standard Deviation: s = √s² (the square root of the variance)
• Inter-Quartile Range: IQR = Q3 - Q1
• First Quartile (Q1) is the median of the smaller half of data
• Third Quartile (Q3) is the median of the larger half of data
• With outliers or asymmetry, median and IQR are
better but we will use mean and SD more since most
distributions we use (e.g. the normal distribution) are
symmetric with no outliers
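A sketch of these spread measures in Python/NumPy follows (the data are made up; note that np.percentile uses a slightly different quartile convention than "median of each half", so hand calculations can differ a little):

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical data

# Sample variance and standard deviation (ddof=1 gives the n - 1 denominator)
s2 = np.var(x, ddof=1)
s = np.std(x, ddof=1)

# Quartiles and the inter-quartile range
q1, q3 = np.percentile(x, [25, 75])
print(s2, s, q3 - q1)
```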
Scatterplots of two variables
• Positive association vs. negative association
• Some associations are not just positive or negative,
but also appear to be linear
• Correlation is a measure of the strength of linear
relationship between variables X and Y
• r near 1 or -1 means strong linear relationship
• r near 0 means weak linear relationship
• Negative r means negative association
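To see what r looks like numerically, a minimal sketch with NumPy (invented data chosen to be roughly linear):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly linear in x

# np.corrcoef returns a 2x2 correlation matrix; r is an off-diagonal entry
r = np.corrcoef(x, y)[0, 1]
print(r)                           # close to 1: strong positive linear relationship
print(np.corrcoef(x, -y)[0, 1])    # negating y flips the sign of r
```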
Linear Regression
• Best fit line between X and Y:  Y = a + b·X
• The slope b: the average change in the Y variable if the X variable is increased by one
• The intercept a: the average value of the Y variable when the X variable is equal to zero
• The regression equation is used to predict the response variable Y for a value of our explanatory variable X
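A least-squares fit of this line can be sketched with NumPy as below (same made-up data as above; np.polyfit is simply one convenient way to get the slope and intercept):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares slope b and intercept a for the line y = a + b*x
b, a = np.polyfit(x, y, deg=1)    # polyfit returns (slope, intercept) for degree 1
print(a, b)

# Predict the response Y at a new value of the explanatory variable X
x_new = 6.0
print(a + b * x_new)
```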
Probability
• Random process: outcome not known exactly, but
have probability distribution of possible outcomes
• Event: outcome of random process with prob. P(A)
• Additive Rule for Disjoint Events:
P(A or B) = P(A) + P(B) if A and B are disjoint
• Multiplication Rule for Independent Events:
P(A and B) = P(A) x P(B) if A and B are independent
• Often need to combine different rules (e.g. Lecture 8)
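A tiny worked example of the two rules, written in Python only for concreteness (the events are hypothetical: two rolls of a fair die):

```python
from fractions import Fraction

p_A = Fraction(1, 6)   # A: the first roll of a fair die is a 6
p_B = Fraction(1, 6)   # B: the second roll is a 6 (independent of the first)
p_C = Fraction(1, 6)   # C: the first roll is a 1 (disjoint from A)

# Multiplication rule for independent events
print(p_A * p_B)       # P(A and B) = 1/36

# Addition rule for disjoint events
print(p_A + p_C)       # P(A or C) = 1/3
```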
Probability and Random Variables
• Conditional Probability: P(A | B) = P(A and B) / P(B)
• Random variable: numerical outcome or summary of
a random process
• A discrete random variable has a finite number of
distinct values
• Continuous random variables can take an uncountable number of values
Discrete vs. Continuous RV’s
• Probability histogram for distribution of discrete r.v.
• Calculate probabilities by adding up bars of histogram
• Density curve used for distribution of continuous r.v.
• Calculate probabilities by integrating area under curve
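The same contrast in a short Python sketch using scipy.stats (the Binomial(10, 0.3) and standard Normal choices are illustrative, not from the lecture):

```python
from scipy import stats

# Discrete r.v.: Binomial(n=10, p=0.3) -- add up the histogram bars (the pmf)
p_discrete = sum(stats.binom.pmf(k, 10, 0.3) for k in range(0, 4))  # P(X <= 3)

# Continuous r.v.: standard Normal -- area under the density curve (the cdf)
p_continuous = stats.norm.cdf(1.0)                                  # P(Z < 1)

print(p_discrete, p_continuous)
```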
Linear Transformations of Variables
• Same rules for both data and random variables:
mean(a·X + c) = a·mean(X) + c
variance(a·X + c) = a²·variance(X)
SD(a·X + c) = |a|· SD(X)
• Adding constants does not change spread measures
• Can also do combinations of more than one variable:
If X and Y are variables and Z = a·X + b·Y + c
mean(Z) = a·mean(X) + b·mean(Y) + c
If X and Y are also independent then
Variance(Z) = a²·Variance(X) + b²·Variance(Y)
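These rules can be checked by simulation; a minimal sketch with NumPy (the distributions and the constants a, b, c are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(5, 2, size=100_000)
Y = rng.normal(1, 3, size=100_000)   # generated independently of X

a, b, c = 2.0, -1.0, 4.0
Z = a * X + b * Y + c

# Simulated moments should be close to the values the rules predict
print(Z.mean(), a * X.mean() + b * Y.mean() + c)
print(Z.var(),  a**2 * X.var() + b**2 * Y.var())
```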
The Normal Distribution
• The Normal distribution has the shape of a “bell curve” with parameters μ and σ², denoted N(μ, σ²)
[Figure: normal density curves for N(0,1), N(2,1), N(-1,2), and N(0,2)]
• Standard Normal: μ = 0 and σ² = 1
• Normal distribution follows the 68-95-99.7 rule:
• 68% of observations are between μ - σ and μ + σ
• 95% of observations are between μ - 2σ and μ + 2σ
• 99.7% of observations are between μ - 3σ and μ + 3σ
• Have tables for any probability from the standard
normal distribution
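The 68-95-99.7 figures can be reproduced from the standard normal cdf; a short sketch with scipy.stats (on the exam you would use the printed table instead):

```python
from scipy import stats

# Probability of falling within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(p, 4))   # ~0.6827, 0.9545, 0.9973
```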
Standardization
• For non-standard normal probabilities, need to
transform to a standard normal distribution
• If X has a N(μ, σ²) distribution, then we can convert to Z = (X - μ)/σ, which follows a N(0,1) distribution
• Can then calculate P(Z < k) using the table
• Reverse standardization: converting a standard
normal Z into a non-standard normal X
X = σZ + μ
• Practice makes perfect!
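A small practice example, with the N(100, 15²) distribution and the cutoff 120 invented purely for illustration (scipy.stats stands in for the Z table):

```python
from scipy import stats

mu, sigma = 100, 15            # hypothetical: X ~ N(100, 15^2)
x = 120

# Standardize, then look up the standard normal probability
z = (x - mu) / sigma
print(z, stats.norm.cdf(z))    # P(X < 120) = P(Z < 1.33...)

# Reverse standardization: which X corresponds to a given Z?
z_star = 1.645
print(sigma * z_star + mu)     # X = sigma * Z + mu
```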
Inference for Continuous Data
• Continuous data is summarized by the sample mean X̄
• Sample mean is used as our estimate of the
population mean, but how does sample mean vary
between samples?
Population parameters: μ and σ²
[Diagram: many samples of size n are drawn from the population; each sample gives a sample mean X̄; what is the distribution of these X̄ values?]
Sampling Distribution of Sample Mean
• The center of the sampling distribution of the sample mean is the population mean: mean(X̄) = μ
• Over all samples, the sample mean will, on average, be equal to the population mean (no guarantees for one sample!)
• The spread of the sampling distribution of the sample mean is SD(X̄) = σ/√n
• As sample size increases, variance of the sample mean
decreases!
• Central Limit Theorem: if the sample size is large enough, then the sample mean X̄ has an approximately Normal distribution
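A simulation sketch of this sampling distribution (population chosen as N(10, 4²) and n = 50 just for illustration; any population with a mean and variance would do):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 10.0, 4.0, 50

# Draw many samples of size n and record each sample mean
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

print(sample_means.mean())        # close to mu
print(sample_means.std(ddof=1))   # close to sigma / sqrt(n)
print(sigma / np.sqrt(n))
```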
Inference for Count Data
• Goal for count data is to estimate the population proportion p
• From a sample of size n, we can calculate two statistics:
  1. the sample count Y
  2. the sample proportion p̂ = Y/n
• Use the sample proportion p̂ as our estimate of the population proportion p
• Sampling Distribution of the Sample Proportion
• how does sample proportion change over different samples?
Population parameter: p
[Diagram: many samples of size n are drawn from the population; each sample gives a sample proportion p̂; what is the distribution of these p̂ values?]
Sampling Distribution for Proportion
• For small samples, use the Binomial distribution to calculate
probabilities for the sample count or sample proportion
• Definition of “small”: n·p < 10 or n·(1-p) < 10
• For large samples, we use the Normal approximation to the
Binomial distribution for the sample count or sample proportion
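To compare the two approaches, a sketch with scipy.stats (n = 100 and p = 0.4 are arbitrary but "large" by the rule above; no continuity correction, since that is not on the midterm):

```python
import numpy as np
from scipy import stats

n, p = 100, 0.4                 # here n*p and n*(1-p) are both >= 10

# Exact Binomial probability that the sample count Y is at most 35
exact = stats.binom.cdf(35, n, p)

# Normal approximation: Y is roughly N(n*p, n*p*(1-p))
mean, sd = n * p, np.sqrt(n * p * (1 - p))
approx = stats.norm.cdf((35 - mean) / sd)

print(exact, approx)
```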
Next Week - Lecture 11
• Chapter 6
• Good luck on the midterm next Monday!