standard deviation


And Here We Go …
Get ready to study for the
AP Stats test!
Only 1050 minutes of class
time until the big day…
Friday, May 10!
How much
studying will you
do for
$521.04?
plus book…
The Exam Itself
To maximize your score on the AP Statistics Exam, you
first need to know how the exam is organized and how
it will be scored.
The AP Statistics Exam consists of two separate
sections:
Section I
40 multiple-choice questions
90 minutes
counts 50 percent of exam score
Section II
Free-response questions
90 minutes
counts 50 percent of exam score
Questions are
designed to test your
statistical reasoning
and your
communication skills.
SCORING:
Five open-ended problems @ 13 minutes; each counts 15 percent of free-response score
One investigative task @ 25 minutes; counts 25 percent of free-response score
Each free-response question is scored on a 0 to 4 scale. General descriptors for
each of the scores are:
4 Complete Response: NO statistical errors and clear communication
3 Substantial Response: Minor statistical error/omission or fuzzy communication
2 Developing Response: Important statistical error/omission or lousy communication
1 Minimal Response: A "glimmer" of statistical knowledge related to the problem
0 Inadequate Response: No glimmer; statistically dangerous to himself and others
Your work is graded holistically, meaning that your entire response to a problem is
considered before a score is assigned.
Calculator Policy
Each student is expected to bring to the exam a graphing calculator with
statistical capabilities. The computational capabilities should include standard
statistical univariate and bivariate summaries, through linear regression. The
graphical capabilities should include common univariate and bivariate displays such
as histograms, boxplots, and scatterplots.
• You can bring two calculators to the exam.
• The calculator memory will not be cleared but you may only use the memory to
store programs, not notes.
• For the exam, you're not allowed to access any information in your graphing
calculators or elsewhere if it's not directly related to upgrading the statistical
functionality of older graphing calculators to make them comparable to statistical
features found on newer models. The only acceptable upgrades are those that
improve the computational functionalities and/or graphical functionalities for data
you key into the calculator while taking the examination. Unacceptable
enhancements include, but aren't limited to, keying or scanning text or response
templates into the calculator.
• During the exam, you can't use minicomputers, pocket organizers, electronic
writing pads, or calculators with QWERTY (i.e., typewriter) keyboards.
2008-09 List of Graphing Calculators
Graphing calculators having the expected built-in capabilities listed above are
indicated with an asterisk (*). However, students may bring any calculator on the list
to the exam; any model within each series is acceptable.
Casio
FX-6000 series
FX-6200 series
FX-6300 series
FX-6500 series
FX-7000 series
FX-7300 series
FX-7400 series
FX-7500 series
FX-7700 series
FX-7800 series
FX-8000 series
FX-8500 series
FX-8700 series
FX-8800 series
FX-9700 series *
FX-9750 series *
FX-9860 series *
CFX-9800 series *
CFX-9850 series *
CFX-9950 series *
CFX-9970 series *
FX 1.0 series *
Algebra FX 2.0 series *
Hewlett-Packard
HP-9G
HP-28 series *
HP-38G *
HP-39 series *
HP-40 series*
HP-48 series *
HP-49 series *
HP-50 series*
Radio Shack
EC-4033
EC-4034
EC-4037
Sharp
EL-5200
EL-9200 series *
EL-9300 series *
EL-9600 series * †
EL-9900 series *
Texas Instruments
TI-73
TI-80
TI-81
TI-82 *
TI-83/TI-83 Plus *
TI-83 Plus Silver *
TI-84 Plus *
TI-84 Plus Silver *
TI-85 *
TI-86 *
TI-89 *
TI-89 Titanium *
TI-Nspire *
TI-Nspire CAS *
Other
Datexx DS-883
Micronta
Smart²
1st AP Statistics test: 1997 ~ 7500 students
2008 AP Stat test:
~ 100,000 students
Exam grade            2008 Statistics     Goins 2008
5                     14,009 (12.8%)      3 (12%)
4                     24,528 (22.6%)      7 (28%)
3                     25,707 (23.8%)      8 (32%)
2                     20,403 (18.8%)      4 (16%)
1                     23,637 (21.9%)      3 (12%)
Number of students    108,284             25
3 or higher / %       64,244 / 59.2%      18 / 72%
Mean grade            2.86                3.12
Standard deviation    1.34
1st AP Statistics test: 1997 ~ 7500 students
2009 AP Stat test:
116,876 students
Exam grade            2009 Statistics     Goins 2009
5                     12.3%               2 (4.3%)
4                     22.3%               6 (12.8%)
3                     24.2%               17 (36.2%)
2                     19.1%               12 (25.5%)
1                     22.2%               10 (21.3%)
Number of students    116,876             47
3 or higher / %       68,679 / 58.8%      25 / 53.3%
Mean grade            2.83                2.56
Standard deviation    1.33
1st AP Statistics test: 1997 ~ 7500 students
2010 AP Stat test:
~ 109,609 students
Exam grade            2010 Statistics     Goins 2010
5                     12.8%               5 (13.9%)
4                     22.4%               10 (27.8%)
3                     23.5%               11 (30.6%)
2                     18.2%               6 (16.7%)
1                     23.1%               4 (11.1%)
Number of students    129,899             36
3 or higher / %       58.7%               72.3%
Mean grade            2.84                3.167
Standard deviation    1.35                1.2
1st AP Statistics test: 1997 ~ 7500 students
2011 AP Stat test:
~ 137,498 students
Exam grade            2011 Statistics     Goins 2011
5                     12.1%               8 (16.0%)
4                     21.3%               18 (36.0%)
3                     25.0%               14 (28.0%)
2                     17.8%               7 (14.0%)
1                     23.9%               3 (6.0%)
Number of students    142,910             50
3 or higher / %       58.8%               80.0%
Mean grade            2.82                3.42
Standard deviation    1.34                1.1
1st AP Statistics test: 1997 ~ 7500 students
2012 AP Stat test:
~ 143,554 students
Exam grade            2012 Statistics     Goins 2012
5                     12.5%               5 (8.2%)
4                     21.1%               13 (21.3%)
3                     25.6%               17 (27.9%)
2                     18.0%               16 (26.2%)
1                     22.8%               10 (16.4%)
Number of students    153,859             61
3 or higher / %       59.2%               57.4%
Mean grade            2.83                2.62
Standard deviation    1.33
The AP Statistics Exam covers material in these areas:
I. Exploring data: describing patterns and departures from patterns (20–30%)
• Analyze data using graphical and numerical techniques
• Emphasis on interpreting info from graphical and numerical displays and summaries
II. Sampling and experimentation: planning and conducting a study (10–15%)
• Collecting data with a well developed plan
• Clarifying the question and deciding on a method of data collection and analysis
III. Anticipating patterns: exploring random phenomena using probability and simulations (20–30%)
• Anticipating what the distribution of data should look like under a given model
IV. Statistical inference: estimating population parameters and testing hypotheses (30–40%)
• Selecting appropriate models for statistical inferences
So. . .
Let’s get
started!
What do you call data
that has only ONE
variable?
UNIVARIATE DATA
What are the two types of
univariate data sets?
Categorical:
qualitative (brand)
Type of computer you use
Car you drive
Area codes
Numerical:
quantitative
(numerical in nature)
height
Price of textbook
Amount of cola in can
What are the two types of
numerical data?
Discrete: possible values are
isolated points on a number line
Number of AP classes
Continuous: possible values form
an interval (measurements are
usually continuous)
Distance a student lives from school
What are appropriate graphical
displays for categorical data?
Bar Graphs
• Bars do not touch
• Categorical variable is
typically on the horizontal
axis
• To describe – comment on
which occurred the most
often or least often
• May make a double bar graph
or segmented bar graph for
bivariate categorical data
sets
[Figure: bar graph "Subject Preference" with categories History, Math, Science, English, Business, Foreign language]
[Figure: double bar graph "Subject preference by gender" (Male, Female) over the same categories]
What are appropriate graphical
displays for categorical data?
Pie Charts
• To make:
– Proportion × 360°
– Using a protractor, mark off each part
• To describe – comment on which occurred the most
often or least often
[Figure: pie chart "Subject Preference"]
Math 44%
Science 27%
English 13%
Foreign language 8%
History 6%
Business 2%
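The "Proportion × 360°" step can be sketched in a few lines of Python. This is only an illustration: the percentages are the ones read off the Subject Preference pie chart, and the function name `sector_angles` is made up for this example.

```python
# Percentages read off the "Subject Preference" pie chart.
shares = {
    "Math": 44,
    "Science": 27,
    "English": 13,
    "Foreign language": 8,
    "History": 6,
    "Business": 2,
}


def sector_angles(percentages):
    """Convert each percentage to a central angle: proportion x 360 degrees."""
    return {name: pct / 100 * 360 for name, pct in percentages.items()}


angles = sector_angles(shares)
for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

The angles should add up to a full circle, which is a handy check before reaching for the protractor.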
What are appropriate graphical
displays for numerical data?
Dot Plot
• Used with numerical data (either discrete or continuous)
• Made by putting dots (or X's) on a number line
• Can make comparative dotplots by using the same axis for multiple groups
Stem (and leaf) Plot
• Used with univariate, numerical data
• Must have key so that we know how to read numbers
• Can split stems when you have long list of leaves
• Can have a comparative stemplot with two groups (back to back)
What are appropriate graphical
displays for numerical data?
Histograms
• Used with numerical data
• Bars touch on histograms
• Two types
– Discrete
• Bars are centered over discrete values
– Continuous
• Bars cover a class (interval) of values
• For comparative histograms – use two separate graphs
with the same scale on the horizontal axis
• Use no fewer than 5 classes (bars)
• Check to see if scale is misleading
• Look for symmetry and skewness
What are appropriate graphical displays
for numerical data?
Cumulative Relative
Frequency Plot
(Ogive)
• . . . is used to answer questions about percentiles.
• Percentiles are the percent of individuals that are at or below a
certain value.
• Quartiles are located every 25% of the data. The first quartile
(Q1) is the 25th percentile, while the third quartile (Q3) is the
75th percentile. What is the special name for Q2?
• Interquartile Range (IQR) is the range of the middle half (50%)
of the data.
IQR = Q3 – Q1
What are appropriate graphical displays
for numerical data?
Boxplot (and
whisker)
• Used with numerical data
(either discrete or
continuous)
• Modified boxplot shows outliers
• Can make comparative by
showing side-by-side on
same scale
• Good for comparing
quartiles, medians, and
spread
Why use boxplots?
• ease of construction
• convenient handling of outliers
• construction is not subjective (like histograms)
• Used with medium or large size data sets (n > 10)
• useful for comparative displays
Why not use boxplots?
• does not retain the individual observations
• should not be used with small data sets (n < 10)
How to construct
• find five-number summary: Min Q1 Med Q3 Max
• draw box from Q1 to Q3
• draw median as center line in the box
• extend whiskers to min & max
Modified boxplots
• display outliers
• fences mark off mild & extreme outliers
• whiskers extend to largest (smallest) data value inside the fence
ALWAYS use modified boxplots in this class!!!
Inner fence
Q1 – 1.5IQR
Q3 + 1.5IQR
Interquartile Range (IQR) is the range (length) of the box: Q3 - Q1
Any observation outside this fence is an outlier! Put a dot for the outliers.
Modified Boxplot . . .
Draw the "whisker" from the quartiles to the observation that is within the fence!
Outer fence
Q1 – 3IQR
Q3 + 3IQR
Any observation between the fences is considered a mild outlier.
Any observation outside this fence is an extreme outlier!
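The fence rules can be tried out with a short Python sketch. The data set here is made up for illustration, and the helper names (`quartiles`, `classify_outliers`) are hypothetical. Quartiles are taken as the medians of the lower and upper halves of the ordered data; a calculator or software package may use a slightly different quartile rule.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]          # values below the median position
    upper = data[(n + 1) // 2:]     # values above the median position
    return median(lower), median(data), median(upper)


def classify_outliers(values):
    """Return inner fences, outer fences, mild outliers, extreme outliers."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Mild outliers sit between the inner and outer fences.
    mild = [x for x in values
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    # Extreme outliers sit beyond the outer fences.
    extreme = [x for x in values if x < outer[0] or x > outer[1]]
    return inner, outer, mild, extreme


# Made-up sample, not from the slides.
sample = [10, 12, 13, 14, 15, 15, 16, 17, 18, 27, 40]
inner, outer, mild, extreme = classify_outliers(sample)
print("inner fences:", inner)
print("outer fences:", outer)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

For this sample Q1 = 13 and Q3 = 18, so the whiskers of the modified boxplot would stop at 10 and 18, the most extreme values still inside the inner fences.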
Symmetrical boxplots
Approximately symmetrical boxplot
Skewed boxplot
Variable | Type of variable | Graph
the heights of male students in your school | Continuous numerical | Histogram
the income of adults in your city | Discrete numerical | Stem Plot
the color of M&M candies selected at random from a bag | Categorical | Bar graph
the number of TV's in the homes of AP Stat students | Discrete numerical | Dot Plot
the number of speeding tickets each student in AP Stat received | Discrete numerical | Dot Plot
the birth weights of female babies born at a large hospital | Continuous numerical | Histogram
the favorite movie type of AP Stat students by gender | Categorical | Bar graph – segmented or double
the area code of an individual | Categorical | Bar graph
the Math SAT Score for students at your school | Discrete numerical | Histogram
the average number of texts sent per month | Continuous numerical | Cumulative frequency plot (ogive)
How do you describe univariate
data?
Just
CUSS
and
BS!
Center
“the typical value”
Mean
Median
Unusual Features
Outliers
Gaps
Shape
single vs. multiple modes
(unimodal, bimodal)
symmetry vs. skewness
Illustrated Distribution Shapes
Unimodal
Skew negatively
(left)
Bimodal
Symmetric
Multimodal
Skew positively
(right)
Spread
“how tightly values cluster
around the center”
Standard deviation
IQR
Range
5-number summary
And Be Specific!
Measures of Central Tendency
• Median - the middle of the data;
50th percentile
–Observations must be in
numerical order
–Is the middle single value if n is
odd
–The average of the middle two
values if n is even
NOTE: n denotes the sample size
Measures of Central Tendency
• Mean - the arithmetic average
–Use μ to represent a population mean (parameter)
–Use x̄ to represent a sample mean (statistic)
Formula: $\bar{x} = \frac{\sum x}{n}$
Σ is the capital Greek letter sigma – it means to sum the values that follow
Measures of Central Tendency
• Mode – the observation that occurs
the most often
–Can be more than one mode
–If all values occur only once – there
is no mode
–Not used as often as mean &
median
Suppose we are interested in the number of
lollipops that are bought at a certain store. A
sample of 5 customers buys the following number
of lollipops. Find the median.
2 3 4 8 12
The numbers are in order & n is odd – so find the middle observation.
The median is 4 lollipops!
Suppose we have sample of 6 customers that buy
the following number of lollipops. The median is …
2 3 4 6 8 12
The numbers are in order & n is even – so find the middle two observations.
Now, average these two values.
The median is 5 lollipops!
Suppose we have sample of 6 customers that buy
the following number of lollipops. Find the mean.
2 3 4 6 8 12
To find the mean number of lollipops add the observations and divide by n.
$\bar{x} = \frac{2 + 3 + 4 + 6 + 8 + 12}{6} = 5.833$
What would happen to the median & mean if the
12 lollipops were 20?
2 3 4 6 8 20
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 20}{6} = 7.17$
What happened?
What would happen to the median & mean if the
20 lollipops were 50?
2 3 4 6 8 50
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 50}{6} = 12.17$
What happened?
Resistant
• Statistics that are not affected by outliers
• Is the median resistant? YES
• Is the mean resistant? NO
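The lollipop numbers can be rechecked with Python's standard `statistics` module. This is a small sketch; the helper name `summarize` is made up for the example.

```python
from statistics import mean, median

# The five smallest lollipop counts stay fixed; only the largest value changes.
base = [2, 3, 4, 6, 8]


def summarize(largest):
    """Return (median, mean rounded to 2 places) when the top value is `largest`."""
    sample = base + [largest]
    return median(sample), round(mean(sample), 2)


for largest in (12, 20, 50):
    med, avg = summarize(largest)
    print(f"largest = {largest}: median = {med}, mean = {avg}")
```

The median stays at 5 each time while the mean climbs from 5.83 to 12.17, which is what "resistant" versus "non-resistant" means in practice.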
Look at the following data set. Find
the mean.
22 23 24 25 25 26 29 30
$\bar{x} = 25.5$
Now find how each observation deviates from the mean. This is the deviation from the mean.
What is the sum of the deviations from the mean?
$\sum (x - \bar{x}) = 0$
Will this sum always equal zero? YES
Look at the following data set. Find the
mean & median.
Mean = 27
Median = 27
Create a histogram with the data (use x-scale of 2). Then find the mean and median.
Look at the placement of the mean and median in this symmetrical distribution.
21 27 23 23 24 25 25 27 27 28
30 30 26 26 26 27 30 31 32 32
Look at the following data set. Find the
mean & median.
Mean = 28.176
Median = 25
Create a histogram with the data (use x-scale of 8). Then find the mean and median.
Look at the placement of the mean and median in this right skewed distribution.
22 29 28 22 24 25 28 21 23
62 23 24 23 26 36 38 25
Look at the following data set. Find the
mean & median.
Mean = 54.588
Median = 58
Create a histogram with the data. Then find the mean and median.
Look at the placement of the mean and median in this skewed left distribution.
21 46 54 47 53 60 55 55 56
63 64 58 58 58 58 62 60
Recap:
• In a symmetrical distribution, the mean
and median are equal.
• In a skewed distribution, the mean is
pulled in the direction of the skewness.
• In a symmetrical distribution, you
should report the mean!
• In a skewed distribution, the median
should be reported as the measure of
center!
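The recap can be checked against the three data sets from the preceding slides. A minimal sketch using the standard `statistics` module; the variable and function names are chosen for this example only.

```python
from statistics import mean, median

# Data sets copied from the three histogram slides.
symmetric = [21, 27, 23, 23, 24, 25, 25, 27, 27, 28,
             30, 30, 26, 26, 26, 27, 30, 31, 32, 32]
right_skewed = [22, 29, 28, 22, 24, 25, 28, 21, 23,
                62, 23, 24, 23, 26, 36, 38, 25]
left_skewed = [21, 46, 54, 47, 53, 60, 55, 55, 56,
               63, 64, 58, 58, 58, 58, 62, 60]


def compare(data):
    """Return (mean rounded to 3 places, median)."""
    return round(mean(data), 3), median(data)


for label, data in [("symmetric", symmetric),
                    ("right skewed", right_skewed),
                    ("left skewed", left_skewed)]:
    avg, med = compare(data)
    print(f"{label}: mean = {avg}, median = {med}")
```

In the right-skewed set the mean lands above the median, and in the left-skewed set it lands below: the mean is pulled toward the long tail.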
Trimmed mean:
Purpose is to remove outliers from a
data set
To calculate a trimmed mean:
• Multiply the % to trim by n
• Truncate that many observations from
BOTH ends of the distribution (when
listed in order)
• Calculate the mean with the
shortened data set
Find a 10% trimmed mean with the following data.
12 14 19 20 22 24 25 26 26 35
10%(10) = 1
So remove one observation from each side!
$\frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$
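The trimming steps translate directly into a short Python function. This is a sketch under the rule stated on the slide (multiply the percent by n, truncate, drop that many values from each end); the name `trimmed_mean` is made up for the example.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` percent of the observations from EACH end."""
    data = sorted(values)
    k = int(len(data) * percent / 100)   # truncate to a whole number of observations
    kept = data[k: len(data) - k]
    return sum(kept) / len(kept)


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 35]
print(trimmed_mean(data, 10))   # drops 12 and 35, then averages the other eight
```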
Why is the study of variability
important?
• Allows us to distinguish between
usual & unusual values
• In some situations, want
more/less variability
–scores on standardized tests
–time bombs
–medicine
Range:
• Single number – not an interval
• Sensitive to outliers
• Midrange – average of the max and min values - VERY sensitive to outliers
Interquartile Range (IQR):
IQR = Q3 – Q1
Quartiles:
The first quartile (Q1) is the value that 25% of the
observations fall below. It is the median of the first
half of the set of observations. (the 25th percentile)
The third quartile (Q3) is the value that 75% of the
observations fall below. It is the median of the second half
of the set of observations. (the 75th percentile)
IQR is insensitive to outliers.
The average of the deviations squared is called the variance.
Population parameter: $\sigma^2$
Sample statistic: $s^2$
A standard deviation is a measure of the average deviation from the mean.
Population: $\sigma$
Sample: $s$
Suppose that we have this population:
24 16 34 28 26 21 30 35 37 29
Find the mean (μ).
Find the deviations: $x - \mu$
What is the sum of the deviations from the mean?
Square the deviations: $(x - \mu)^2$
Find the average of the squared deviations:
$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$
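The steps for this population can be followed one at a time in Python. A minimal sketch; the variable names are chosen for this example, and the printed values are what the code computes rather than figures quoted from the slides.

```python
from math import sqrt

population = [24, 16, 34, 28, 26, 21, 30, 35, 37, 29]

# Step 1: the population mean.
mu = sum(population) / len(population)

# Step 2: each observation's deviation from the mean (these always sum to zero).
deviations = [x - mu for x in population]

# Step 3: square the deviations so positives and negatives no longer cancel.
squared = [d ** 2 for d in deviations]

# Step 4: the population variance is the average squared deviation.
variance = sum(squared) / len(population)

# Step 5: the standard deviation is the square root of the variance.
sigma = sqrt(variance)

print("mean:", mu)
print("sum of deviations:", sum(deviations))
print("variance:", variance)
print("standard deviation:", round(sigma, 3))
```

Because the deviations always cancel to zero, squaring them is what makes an "average deviation" meaningful.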
Calculation of variance of a sample
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Degrees of Freedom (df): n – 1
• n deviations contain (n - 1) independent pieces of information about variability
Calculation of standard deviation of a sample
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
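As a sketch of the sample formulas, the function below divides by n - 1 and is compared with `statistics.variance` and `statistics.stdev`, which use the same divisor. The data are the six lollipop counts from the earlier example; the function name `sample_variance` is made up here.

```python
from math import sqrt
from statistics import stdev, variance


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


lollipops = [2, 3, 4, 6, 8, 12]
s_squared = sample_variance(lollipops)
s = sqrt(s_squared)

print("by hand:   ", round(s_squared, 3), round(s, 3))
print("statistics:", round(variance(lollipops), 3), round(stdev(lollipops), 3))
```

Dividing by n - 1 instead of n makes the sample variance slightly larger, which reflects the n - 1 independent pieces of information in the deviations.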
When to use what??????
Note: Variance and Standard Deviation are used to
measure spread when the mean is used to describe
center.
Note: IQR is typically used to describe spread when
Median is used to describe center.
Note: When the distribution is approximately
symmetric, the mean and standard deviation are
generally used to summarize the distribution. If the
distribution is skewed, a five-number summary is
generally used.
Which measure(s) of
variability is/are
resistant?
Linear transformation rule
• When adding a constant to a random
variable, the mean changes but not
the standard deviation.
• When multiplying a random variable
by a constant, both the mean and the
standard deviation change.
An appliance repair shop charges a $30 service call
to go to a home for a repair. It also charges $25 per
hour for labor. From past history, the average length
of repairs is 1 hour 15 minutes (1.25 hours) with
standard deviation of 20 minutes (1/3 hour).
Including the charge for the service call, what are the
mean and standard deviation of the total charge for a
repair?
$\mu = 30 + 25(1.25) = \$61.25$
$\sigma = 25\left(\frac{1}{3}\right) = \$8.33$
Rules for Combining two variables
• To find the mean for the sum (or difference),
add (or subtract) the two means
• To find the standard deviation of the sum (or
differences), ALWAYS add the variances,
then take the square root.
• Formulas:
$\mu_{a+b} = \mu_a + \mu_b$
$\mu_{a-b} = \mu_a - \mu_b$
$\sigma_{a \pm b} = \sqrt{\sigma_a^2 + \sigma_b^2}$ (if variables are independent)
Bicycles arrive at a bike shop in boxes. Before they can be
sold, they must be unpacked, assembled, and tuned
(lubricated, adjusted, etc.). Based on past experience, the
times for each setup phase are independent with the
following means & standard deviations (in minutes). What
are the mean and standard deviation for the total bicycle
setup times?
Phase        Mean    SD
Unpacking    3.5     0.7
Assembly     21.8    2.4
Tuning       12.3    2.7
$\mu_T = 3.5 + 21.8 + 12.3 = 37.6$ minutes
$\sigma_T = \sqrt{0.7^2 + 2.4^2 + 2.7^2} = 3.680$ minutes
Normal Distributions
• Symmetrical bell-shaped (unimodal) density curve
• Above the horizontal axis
• N(μ, σ)
• The transition points occur at μ ± σ
• Probability is calculated by finding the area under the curve (How is this done mathematically?)
• As σ increases, the curve flattens & spreads out
• As σ decreases, the curve gets taller and thinner
Normal distributions occur
frequently.
• Length of newborn child
• Height
• Weight
• ACT or SAT scores
• Intelligence
• Number of typing errors
• Chemical processes
[Figure: two normal curves, A and B, both centered at 6]
Do these two normal curves have the same mean?
If so, what is it?
YES
Which normal curve has a standard deviation of 3?
B
Which normal curve has a standard deviation of 1?
A
Empirical Rule
• Approximately 68% of the
observations fall within σ of μ
• Approximately 95% of the
observations fall within 2σ of μ
• Approximately 99.7% of the
observations fall within 3σ of μ
Suppose that the height of male
students at SHS is normally
distributed with a mean of 71 inches
and standard deviation of 2.5 inches.
What is the probability that the
height of a randomly selected male
student is more than 73.5 inches?
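73.5 inches is one standard deviation above the mean of 71.
About 68% of heights fall within one standard deviation, so 1 - .68 = .32 fall outside it.
Half of that .32 lies in the upper tail:
P(X > 73.5) = 0.16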
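The Empirical Rule figures are rounded. As a rough check, Python's `statistics.NormalDist` (available in Python 3.8 and later) gives the areas for the height example; the helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

# Heights of male students: mean 71 inches, standard deviation 2.5 inches.
heights = NormalDist(mu=71, sigma=2.5)


def share_within(k):
    """Proportion of the distribution within k standard deviations of the mean."""
    return heights.cdf(71 + k * 2.5) - heights.cdf(71 - k * 2.5)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")

# Upper tail beyond one standard deviation above the mean.
upper_tail = 1 - heights.cdf(73.5)
print(f"P(X > 73.5) = {upper_tail:.4f}")
```

The computed tail is close to the 0.16 obtained from the 68% rule; the small difference comes from 68% being a rounded figure.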
Standard Normal Density
Curves
Always has μ = 0 & σ = 1
To standardize:
$z = \frac{x - \mu}{\sigma}$
Must have this memorized!
Strategies for finding probabilities
or proportions in normal distributions
1. State the probability
statement
2. Draw a picture
3. Calculate the z-score
4. Look up the probability
(proportion) in the table
The lifetime of a certain type of battery
is normally distributed with a mean of
200 hours and a standard deviation of 15
hours. What proportion of these
batteries can be expected to last less
than 220 hours?
Write the probability statement: P(X < 220)
Draw & shade the curve
Calculate z-score: $z = \frac{220 - 200}{15} = 1.33$
Look up z-score in table: P(X < 220) = .9082
The lifetime of a certain type of battery
is normally distributed with a mean of
200 hours and a standard deviation of 15
hours. What proportion of these
batteries can be expected to last more
than 220 hours?
$z = \frac{220 - 200}{15} = 1.33$
P(X > 220) = 1 - .9082 = .0918
The lifetime of a certain type of battery
is normally distributed with a mean of
200 hours and a standard deviation of 15
hours. How long must a battery last to be
in the top 5%?
P(X > ?) = .05
Look up 0.95 in table to find z-score: z = 1.645
$1.645 = \frac{x - 200}{15}$
$x = 224.675$
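The three battery questions can be cross-checked with `statistics.NormalDist` from the Python standard library. This is a sketch, not a replacement for the table method; the software works with the unrounded z-score, so its answers differ slightly from table values based on z = 1.33.

```python
from statistics import NormalDist

# Battery lifetime: mean 200 hours, standard deviation 15 hours.
battery = NormalDist(mu=200, sigma=15)

# Standardize 220 hours by hand.
z = (220 - battery.mean) / battery.stdev

# Area to the left of 220, and its complement.
p_less = battery.cdf(220)
p_more = 1 - p_less

# Lifetime that separates the top 5% from the bottom 95%.
cutoff = battery.inv_cdf(0.95)

print(f"z = {z:.2f}")
print(f"P(X < 220) = {p_less:.4f}")
print(f"P(X > 220) = {p_more:.4f}")
print(f"top 5% starts at {cutoff:.2f} hours")
```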
The heights of the female students at
SHS are normally distributed with a
mean of 65 inches. What is the
standard deviation of this distribution
if 18.5% of the female students are
shorter than 63 inches?
P(X < 63) = .185
What is the z-score for 63? z = -0.9
$-0.9 = \frac{63 - 65}{\sigma}$
$\sigma = \frac{-2}{-0.9} = 2.22$
The heights of female teachers at SHS
are normally distributed with mean of
65.5 inches and standard deviation of
2.25 inches. The heights of male
teachers are normally distributed with
mean of 70 inches and standard
deviation of 2.5 inches.
•Describe the distribution of differences
of heights (male – female) teachers.
Normal distribution with
$\mu = 70 - 65.5 = 4.5$ & $\sigma = \sqrt{2.5^2 + 2.25^2} = 3.3634$
• What is the probability that a
randomly selected male teacher is
shorter than a randomly selected
female teacher?
$z = \frac{0 - 4.5}{3.3634} = -1.34$
P(X < 0) = .0901
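The "always add the variances" rule behind this answer can be sketched in Python. The helper `combined_sd` is a made-up name, and the rule it applies assumes the variables are independent. The bicycle setup times from the earlier example are included as a second check.

```python
from math import sqrt
from statistics import NormalDist


def combined_sd(*sds):
    """Standard deviation of a sum or difference of independent variables."""
    return sqrt(sum(sd ** 2 for sd in sds))


# Difference in height (male - female teachers).
mu_diff = 70 - 65.5
sigma_diff = combined_sd(2.5, 2.25)
difference = NormalDist(mu=mu_diff, sigma=sigma_diff)

# The male teacher is shorter when the difference is negative.
p_male_shorter = difference.cdf(0)

print(f"difference: mean = {mu_diff}, SD = {sigma_diff:.4f}")
print(f"P(male shorter than female) = {p_male_shorter:.4f}")

# Bicycle setup: unpacking, assembly, tuning (independent phases).
bike_sd = combined_sd(0.7, 2.4, 2.7)
print(f"total setup SD = {bike_sd:.3f} minutes")
```

The probability printed here uses the unrounded z-score, so it differs slightly from the table value based on z = -1.34.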
Will my calculator do any
of this normal stuff?
• Normalpdf – use for graphing ONLY
• Normalcdf – will find probability of
area from lower bound to upper bound
• Invnorm (inverse normal) – will find the z-score for a given probability
Bivariate data
• x – variable: is the independent or
explanatory variable
• y- variable: is the dependent or
response variable
• Use x to predict y
$\hat{y} = a + bx$
ŷ (y-hat) means the predicted y
Be sure to put the hat on the y
b – is the slope
– it is the approximate amount by which
y increases when x increases by 1 unit
a – is the y-intercept
– it is the approximate height of the line
when x = 0
– in some situations, the y-intercept has
no meaning
Least Squares Regression Line
LSRL
• The line that gives the best fit to
the data set
• The line that minimizes the sum
of the squares of the deviations
from the line
Interpretations
Slope:
For each unit increase in x, there is an
approximate increase/decrease of b in y.
Correlation coefficient:
There is a [direction], [strength], linear
association between x and y.
Identify as having a positive
association, a negative association,
or no association.
1. Heights of mothers & heights of their adult daughters: +
2. Age of a car in years and its current value: –
3. Weight of a person and calories consumed: +
4. Height of a person and the person's birth month: NO
5. Number of hours spent in safety training and the number of accidents that occur: –
Correlation Coefficient (r)
• A quantitative assessment of the
strength & direction of the linear
relationship between bivariate, quantitative
data
• Pearson's sample correlation is used most
• parameter - ρ (rho)
• statistic - r
$r = \frac{1}{n - 1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$
Properties of r
(correlation coefficient)
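• legitimate values of r are in [-1, 1]
[Figure: number line from -1 to 1 marked at ±.5 and ±.8; no correlation at 0, weak correlation below .5 in absolute value, moderate correlation between .5 and .8, strong correlation beyond .8]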
Properties of r
(correlation coefficient)
•value of r is non-resistant
•value of r does not depend on which
of the two variables is labeled x
•value of r is not changed by a change
of units (adding a constant to, or multiplying
by a positive constant, all values of x or y)
•value of r is a measure of the extent
to which x & y are linearly related
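Two of these properties can be checked numerically. The sketch below computes r from the formula for Pearson's sample correlation and shows that it does not change when x and y are swapped or when x is converted to different units. The data are made up for illustration, and `pearson_r` is a hypothetical helper name.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's sample correlation: average product of z-scores, using n - 1."""
    n = len(xs)
    xbar, ybar = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    total = sum(((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return total / (n - 1)


# Made-up bivariate data, not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
r_swapped = pearson_r(ys, xs)                 # relabel which variable is x
xs_rescaled = [2.54 * x + 10 for x in xs]     # change of units for x
r_rescaled = pearson_r(xs_rescaled, ys)

print(f"r = {r:.4f}")
print(f"r with x and y swapped = {r_swapped:.4f}")
print(f"r after rescaling x = {r_rescaled:.4f}")
```

Because r is built from z-scores, anything that leaves the z-scores alone (a shift or a positive rescaling) leaves r alone too.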
The correlation coefficient
and the LSRL are both
non-resistant measures.
Correlation does not imply
causation
Interpolation (good):
• Using a regression line for estimating
predicted values between known values.
Extrapolation (bad):
• It is unknown whether the pattern observed
in the scatterplot continues outside this
range.
• The LSRL should not be used to predict y
for values of x outside the data set.
Formulas – on chart
$\hat{y} = b_0 + b_1 x$
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
$b_1 = r \frac{s_y}{s_x}$
The following statistics are found for the
variables posted speed limit and the
average number of accidents.
$\bar{x} = 40$, $s_x = 11.6$, $\bar{y} = 18$, $s_y = 8.4$, $r = .9981$
Find the LSRL & predict the number of
accidents for a posted speed limit of 50 mph.
$\hat{y} = .723x - 10.92$
$\hat{y} = 25.23$ accidents
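The slope and intercept can be recomputed from the summary statistics with the chart formulas. A minimal sketch; `lsrl_from_summary` is a made-up name. The code keeps full precision, so the intercept comes out near -10.91 instead of the -10.92 obtained by rounding the slope to .723 first; the prediction still rounds to 25.23.

```python
def lsrl_from_summary(xbar, sx, ybar, sy, r):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    b1 = r * sy / sx          # slope from the correlation and the two SDs
    b0 = ybar - b1 * xbar     # the line passes through (x-bar, y-bar)
    return b0, b1


b0, b1 = lsrl_from_summary(xbar=40, sx=11.6, ybar=18, sy=8.4, r=0.9981)
predicted = b0 + b1 * 50      # posted speed limit of 50 mph

print(f"slope = {b1:.3f}, intercept = {b0:.2f}")
print(f"predicted accidents at 50 mph = {predicted:.2f}")
```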
Residuals (error)
• The vertical deviation between the
observations & the LSRL
• the sum of the residuals is always zero
• error = observed - expected
residual = $y - \hat{y}$
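To see that the residuals from a least squares line sum to zero, the sketch below fits the line with the chart formulas and then computes observed minus predicted for each point. The data are made up for illustration, and the function name `lsrl` is hypothetical.

```python
def lsrl(xs, ys):
    """Least squares slope and intercept from the raw data."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1


# Made-up bivariate data, not from the slides.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = lsrl(xs, ys)

# residual = observed y minus predicted y-hat
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print(f"y-hat = {b0:.1f} + {b1:.1f}x")
print("residuals:", [round(e, 2) for e in residuals])
print("sum of residuals:", round(sum(residuals), 10))
```

Plotting these residuals against x gives the residual plot used to judge whether a line is an appropriate model.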
Residual plot
• A scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other
statistics besides x
• Purpose is to tell if a linear association
exists between the x & y variables
• If no pattern exists between the points in
the residual plot, then the association is
linear.
[Figure: two residual plots (residuals vs. x), one labeled "Linear" and one labeled "Not linear"]
Coefficient of determination ($r^2$)
• gives the approximate proportion of
variation in y that can be attributed
to a linear relationship between x &
y
• remains the same no matter which
variable is labeled x
Interpretation of $r^2$
Approximately $r^2$ (expressed as a percent) of the
variation in y can be explained
by the LSRL of x & y.
Outlier
• In a regression setting, an outlier is a
data point with a large residual
Influential point
• A point that influences where the LSRL is
located
• If removed, it will significantly change the
slope of the LSRL
(189,30) could be influential. Remove & recalculate LSRL
(189,30) was influential since it moved the LSRL
Which of these measures are
resistant?
• LSRL
• Correlation coefficient
• Coefficient of determination
NONE – all are affected by outliers
What to do if the data is not
linear…
1. Calculate the LSRL
2. Is the residual plot scattered?
– YES: appropriate model
– NO: transform the data, then recalculate the LSRL
Transform data:
x & log y
log x & log y
x & √y
x & 1/y