0
A statistics learning project:
By: Chris Hartl
1
2
How the Presentation Works
A sentence including a long underscore
(______) denotes a question-and-answer
exercise. Your next click will put an answer
into the space.
2
Presentation Enhancement
• This presentation is enhanced by the
statistical features of the TI-83 calculator.
• The data used in the presentation can be
found in TI-program form here. The raw
data is located at the end of the presentation.
3
Worksheets on This Presentation
• Some slides are worksheet slides,
containing only open-ended questions. It is
recommended that the user answer the
questions with the data provided before
proceeding to the answer slides.
• The answers to the worksheet are on the
slides following the worksheet slide.
4
1
• There are two basic types of transformations:
single-variable, and
multi-variable transformations.
• This slideshow will concentrate on one-variable and two-variable transformations.
5
1
I. Single Variable Transformations
6
Sample Data
This is a sample.
This is the sample median.
This is the sample mean.
While this mean is accurate, the sample does not appear to come from a
normal (~N) population, and thus the assumptions for a mean
test or a mean confidence interval are not met (assuming that
n<30).
7
Sample Data
This is the sample median.
This is the sample mean.
It is common to think of a population in terms of its
mean and standard deviation rather than its median,
and most statistical tests involve these numbers.
The median of the sample is robust. It is not as
affected by outliers and skewness as the mean is.
There are statistical median tests which can be used for
small, skewed samples.
8
0
Influence of Skewed Data on the Mean
• In a skewed distribution, the mean is influenced by the data found in the
tail, and shifts towards the tail.
• The mean and standard deviation do not summarize the data as well as the
five-number summary because of the skewness.
9
1
Transformation Can Normalize Data
In a skewed distribution, the mean is biased, and
moves towards the tail. The standard deviation is
very high: those values do not summarize the
data as well as the five-number summary due to
the skew.
Using a transformation will truncate the tail of the
distribution, making the data more symmetric.
The mean and standard deviation are not affected by the tail as much. In
other words, using a transformation makes the mean and
standard deviation more robust. (There is an example of this
later in the presentation.)
10
The Goals of Single-Variable
Transformations
• Make the display of data easier to
read and analyze.
• Reduce the effects of outliers and
skewness on the mean and standard
deviation of a sample.
11
2
Goal 1: Simplifying Data Display
Ease of contextual analysis is based on the
“simplicity” of the data description.
The complexity of the display of the word
“simplicity” makes it hard to
read and understand. The same thing
occurs when you look at data: some
data will be hard to analyze due to its
display and summary.
12
3
Transformation is changing the system of
measurement of the data so that things become
easier to interpret.
It is often worthwhile to search for a
transformation of the data to simplify its
description and analysis.
Many Standard Statistical Procedures
Require Normally Distributed Data
• Transformation can help satisfy this
requirement by at least helping the distribution be
symmetric instead of skewed.
• The main goals that can be achieved by
transforming a single group of numbers are:
• Effective display of the data
• Symmetry of the distribution
13
Single Variable Worksheet #1
Transformations can help to interpret both single and multi-variable data; they
change both the shape and the summaries of the data. For an example of how
transformations can affect outliers, run the program “Island” (use the sheet that
has been passed out to you).
1) Construct a boxplot of the data. What do you notice? Do you have any ideas of how to improve the
display?
2) Construct a stemplot of the data. From the display below, identify what causes the display in
question one. What could we do to change it?
3) Rather than splitting the data and examining histograms, take the base 10 logarithm of the data, and
store it in another list. Use the command Log(Area) → List. Construct a boxplot of this new data.
What do you notice? Does transforming the data appear to have an effect on outliers?
14
Solution: Worksheet #1(a); Question #1
1. Construct a boxplot of the data. What do you notice? Do
you have any suggestions of how to improve the display?
• The boxplot has at least one outlier, which makes the
display very hard to read.
• The whisker is many times wider than the IQR. I would
recommend using a modified box plot.
15
Solution: Worksheet #1(a); Follow-up to Question #1
Construct a modified boxplot of the data. What do you
notice? Do you have any suggestions of what display to use
for the data?
The outliers on the modified boxplot are still making it hard
to read. Because of the scale, very little information can be
determined from this graph. My next recommendation
would be to examine a histogram or stemplot of the
distribution.
16
Solution: Worksheet #1; Question #2
2) Construct a stemplot of the data. From the display
below, identify what causes the display in question one.
What could we do to change it?
Ten-Thousands Thousands
0 000000000000000000000
1 59
2
3
4
02
5
6
84 0
The stemplot reveals that the scale required
to display all the data forces most of the data
to cluster together, even though the values
range from 7 to 9,000. A histogram would
have the same shape as this stemplot, so my
recommendation (excluding use of a
transformation) would be to split the data
into x<10,000 and x>10,000.
17
Solution: Worksheet #1; Question #3
3) Rather than splitting the data and examining histograms,
take the base 10 logarithm of the data, and store it in
another list. Construct a boxplot of this new data. What do
you notice? Does transforming the data appear to have an
effect on outliers?
Not only has the transformation of the data made the plot
easier to analyze, but the outliers have, in this case, been
brought within the statistical cutoff for outliers (namely 1.5
times the IQR). Transforming data seems to have a
profound effect on outliers.
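The outlier check described above can be sketched in Python. This is only an illustration: the "Island" data is not reproduced here, so the `areas` list is hypothetical, and the quartiles come from Python's `statistics.quantiles`, which may differ slightly from the TI-83's quartile rule.

```python
import math
import statistics

def outliers(values):
    """Return the values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical right-skewed data (NOT the Island data from the worksheet)
areas = [10, 20, 30, 50, 80, 100, 150, 200, 300, 500, 5000]
log_areas = [math.log10(a) for a in areas]

print("Outliers before the transformation:", outliers(areas))
print("Outliers after the transformation: ", outliers(log_areas))
```

For this made-up sample, the large value is flagged before the logarithm is taken and is inside the fences afterwards. Other data sets may keep some outliers even after the transformation.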
18
13
Single Variable Transformation Continued
To examine the effect of functions on the spread of data,
construct four parallel number lines:
[Figure: four parallel number lines labeled x, Log(x), x, and √x; the x lines are marked from 1 to 12.]
And match up values for x with the corresponding values
for Log(x) and √x, like so (click):
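The matching step can also be tabulated. The sketch below uses Python instead of the TI-83 purely for illustration; it lists Log(x) and √x for x = 1 to 12 and compares the gap between neighboring values at the low and high ends of each scale.

```python
import math

# x values from the number line on the slide (1 through 12)
xs = list(range(1, 13))
log_values = [math.log10(x) for x in xs]
sqrt_values = [math.sqrt(x) for x in xs]

print(f"{'x':>3} {'log10(x)':>9} {'sqrt(x)':>8}")
for x, lg, rt in zip(xs, log_values, sqrt_values):
    print(f"{x:>3} {lg:>9.3f} {rt:>8.3f}")

# Gap between neighbors at the low end (1 to 2) and the high end (11 to 12)
first_log_gap = log_values[1] - log_values[0]
last_log_gap = log_values[-1] - log_values[-2]
first_sqrt_gap = sqrt_values[1] - sqrt_values[0]
last_sqrt_gap = sqrt_values[-1] - sqrt_values[-2]

print("log gaps  (low end, high end):", round(first_log_gap, 3), round(last_log_gap, 3))
print("sqrt gaps (low end, high end):", round(first_sqrt_gap, 3), round(last_sqrt_gap, 3))
```

Both functions squeeze the larger values together, and the logarithm squeezes them more strongly than the square root.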
19
Worksheet #2
Run the program CARPRI. This program will help to identify the
ways in which transformations can change the shape of data, and
help calculate a more meaningful mean.
1) Examine the histogram for the variable “Price,” and calculate the
mean, standard deviation, and the mean ± standard deviation. If you
were to use these values in a statistical test, what would be your
concern? What would better values be?
2) Based on the results from the previous slide, which transformation
would you recommend, a square-root function, or a logarithmic
function?
3) Perform the appropriate transformation and compare its histogram
to the histogram of the original data. Calculate the same values that
you did in question one. What has changed? Do these values describe
the data better before or after the transformation?
20
Solution: Worksheet #2; Question #1
1) Examine the histogram for the variable “Price,” and calculate
the mean, standard deviation, and the mean ± standard
deviation. If you were to use these values in a statistical test or
confidence interval, what would be your concern? What would
a more appropriate number summary be?
The histogram of the data shows that the distribution is highly skewed. Single-variable
statistics (1-Var Stats) give the mean and the standard deviation; the interval
for mean ± SD is (-19160, 214644). Because the data is skewed, the mean is
pulled to the right, and the standard deviation is very high. I would not use
these values for a confidence interval or significance test. Since medians are
robust, I would suggest a 5-number summary as a better summary of the data.
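The slide reports only the interval, but the mean and standard deviation behind it can be recovered with a little arithmetic. A small Python sketch (Python is used for illustration only; the two endpoints are the ones quoted above):

```python
# Endpoints of the mean ± SD interval quoted in the solution
lower, upper = -19160, 214644

mean = (lower + upper) / 2   # midpoint of the interval: 97742.0
sd = (upper - lower) / 2     # half-width of the interval: 116902.0

print("mean =", mean)
print("SD   =", sd)
# The SD is larger than the mean, so mean - SD is negative,
# which cannot be a car price.
print("SD exceeds mean:", sd > mean)
```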
21
Solution: Worksheet #2; Question #2
2) Based on the results from the previous slide, which
transformation would you recommend, a square-root
function, or a logarithmic function?
• The activity we did on the previous slide showed that the logarithm
function pulls higher values of x much closer to the other values than
the square-root function does. For instance, the square-root of
100,000 is approximately 316. The logarithm of 100,000 is 5.
Because the data in Question #1 has a very pronounced skew with a
very long tail, a logarithm function would probably work better than a
square-root function.
• I would recommend using the logarithm function.
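The comparison in the answer can be checked numerically. The sketch below (Python, for illustration only) evaluates both functions at a few round values, including the 100,000 used above, and compares how far apart the smallest and largest results end up.

```python
import math

values = [100, 1_000, 10_000, 100_000]

print(f"{'x':>8} {'sqrt(x)':>9} {'log10(x)':>9}")
for v in values:
    print(f"{v:>8} {math.sqrt(v):>9.1f} {math.log10(v):>9.1f}")

# How many times larger is the transformed 100,000 than the transformed 100?
sqrt_ratio = math.sqrt(100_000) / math.sqrt(100)
log_ratio = math.log10(100_000) / math.log10(100)
print("ratio after square root:", round(sqrt_ratio, 1))
print("ratio after logarithm:  ", round(log_ratio, 1))
```

On the original scale the largest value is 1,000 times the smallest; the square root leaves a much larger ratio than the logarithm does, which is why the logarithm suits a very long tail.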
22
Solution: Worksheet #2; Question #3
3) Perform the appropriate transformation and compare its
histogram to the histogram of the original data. Calculate the same
values that you did in question one. What has changed? Do these
values describe the data better before or after the transformation?
The histogram now is much less skewed, and much more
symmetrical after the transformation. The new mean is
10^(4.783) = 60,673. This is a more logical result (considering
it is the mean car price). The new mean ± standard deviation no
longer includes negative numbers, which is reassuring.
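The back-transformation used in this answer can be sketched in Python. The mean of the logarithms, 4.783, is the value from the slide; the standard deviation of the logarithms is not given in the presentation, so `sd_log` below is a hypothetical placeholder.

```python
mean_log = 4.783   # mean of log10(Price), from the slide
sd_log = 0.4       # HYPOTHETICAL SD of log10(Price); not given in the presentation

# Undo the base 10 logarithm by raising 10 to each value
center = 10 ** mean_log
low = 10 ** (mean_log - sd_log)
high = 10 ** (mean_log + sd_log)

print(f"back-transformed mean: {center:,.0f}")
print(f"back-transformed mean ± SD: ({low:,.0f}, {high:,.0f})")
```

Because 10 raised to any power is positive, the back-transformed interval can never include negative prices, whatever the actual standard deviation is.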
23
The “Issue” With Changing Data
• We’ve used a transformation which changes
the shape of the data.
• As you saw in the answer to question 3, the
summary statistics of the transformed data
are “un-transformed” to put the statistics
back into the original units and preserve
context.
24
The “Issue” With Changing Data
• The five-number summary of the
transformed data, when put back in the
context of the data (un-transformed) is the
same as the 5-number summary of the
original data.
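This holds because the logarithm keeps the data in the same order. A minimal Python sketch with hypothetical data (an odd number of values, so the median is an actual data point):

```python
import math
import statistics

# HYPOTHETICAL skewed sample with an odd number of values
data = [7, 40, 150, 900, 3000, 15000, 84000]
logs = [math.log10(x) for x in data]

# Minimum, median, and maximum of the original data
summary_original = (min(data), statistics.median(data), max(data))

# The same three numbers computed on the logs, then un-transformed
summary_back = tuple(10 ** v for v in (min(logs), statistics.median(logs), max(logs)))

print("original:        ", summary_original)
print("un-transformed:  ", tuple(round(v, 6) for v in summary_back))
```

The match is exact (up to rounding) when each summary value is an actual data point. When a quartile or median is the average of two data points, the un-transformed value can differ slightly from the original one.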
25
The “Issue” With Changing Data
• The mean and standard deviation from the
transformed data, however, are not the same
as the mean and standard deviation of the
original data. The transformed mean is a
new center of the data, which is viable for a
statistical mean test.
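A short Python sketch of this difference, using hypothetical prices. For a logarithmic transformation, the un-transformed mean is the geometric mean of the data, which is never larger than the ordinary (arithmetic) mean.

```python
import math
import statistics

# HYPOTHETICAL right-skewed prices (not the CARPRI data)
prices = [12_000, 15_000, 18_000, 22_000, 30_000, 45_000, 250_000]

arithmetic_mean = statistics.mean(prices)

# Mean of the logs, then un-transformed back to the original units
mean_of_logs = statistics.mean([math.log10(p) for p in prices])
back_transformed_mean = 10 ** mean_of_logs

print(f"mean of original data:  {arithmetic_mean:,.0f}")
print(f"un-transformed mean:    {back_transformed_mean:,.0f}")
print(f"geometric mean (check): {statistics.geometric_mean(prices):,.0f}")
```

The un-transformed mean sits closer to the bulk of the data because the single large price no longer dominates it.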
26
The “Issue” With Changing Data
• In small, skewed samples, the conditions for
a two-sample mean test of equality are not
met when using the original means, but are
met when using the transformed means.
• By using the transformed means rather than
the original means, one can do a mean test
of equality on two small, skewed samples.
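As an illustration, the sketch below computes a two-sample t statistic on the logarithms of two small samples. Several things here are assumptions: the presentation does not name a specific test, so Welch's unequal-variance statistic is used as one common choice; both samples are hypothetical; and only the statistic is computed, not the P-value.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# HYPOTHETICAL small, right-skewed samples
group_a = [12, 15, 18, 22, 30, 45, 250]
group_b = [20, 26, 31, 40, 55, 90, 400]

# Transform both samples the same way before comparing means
log_a = [math.log10(x) for x in group_a]
log_b = [math.log10(x) for x in group_b]

t_raw = welch_t(group_a, group_b)
t_log = welch_t(log_a, log_b)
print(f"t on original data:    {t_raw:.3f}")
print(f"t on transformed data: {t_log:.3f}")
```

The test on the transformed data compares the means of the logarithms, so any conclusion is about the transformed centers, not the original means.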
27
Two-Variable Transformation
[Scatterplot: RESPONSE (y) vs. EXPLANATORY VARIABLE, with LSR line]
This is a graph of two-variable data.
28
Two-Variable Transformation
[Scatterplot: RESPONSE (y) vs. EXPLANATORY VARIABLE (x), with LSR line]
A statistician needs to test whether the population has a
linear association, by using a test on the slope, b, of the LSR line.
29
Two-Variable Transformation
[Scatterplot: RESPONSE (y) vs. EXPLANATORY VARIABLE (x), with LSR line]
For this data, the conditions for the test are violated because
the data does not vary normally about the LSR line.
30
Two-Variable Transformation
What Two-Variable Transformations Do:
• Linearize data so that higher statistical calculations
(such as inference tests for a and b) can be used on data
which is originally not linear.
For this reason, two-variable transformations are
often called “linear transformations.”
31
Two-Variable Transformation
[Scatterplot: RESPONSE (y) vs. EXPLANATORY VARIABLE (x)]
In other words, the goal of linear transformations is to turn
this into:
32
Two-Variable Transformation
[Scatterplot: RESPONSE (√y) vs. EXPLANATORY VARIABLE (x), with LSR line]
This.
Note the change from y to
the square-root of y.
This is the transformation.
33
Two-Variable Transformation
Linear Transformations
In the demonstration, data appeared to be
modeled by a power function: it appeared to
be a quadratic.
34
Two-Variable Transformation
Linear Transformations
The square-root function was the best
candidate to linearize the sample data
because it was the inverse of the apparent
functional relationship between the
explanatory and response variables.
35
Two-Variable Transformation
Linear Transformations
This is true of all linearizing functions: the best
candidates for functions to use are the ones
which would best “undo” the non-linear
relationship—the inverse of the apparent
functional relationship.
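The idea of "undoing" the relationship can be sketched in Python. The data below is hypothetical and exactly quadratic, so the square root (the inverse of squaring for positive values) straightens it completely; real data would only come close. The helper `pearson_r` is written out here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL data following an exact quadratic relationship
xs = list(range(1, 21))
ys = [x ** 2 for x in xs]

r_raw = pearson_r(xs, ys)
r_sqrt = pearson_r(xs, [math.sqrt(y) for y in ys])

print(f"r for (x, y):       {r_raw:.4f}")
print(f"r for (x, sqrt(y)): {r_sqrt:.4f}")
```

A high r on its own does not prove linearity (the untransformed r is already high here), so the residual plot should still be checked.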
36
Two-Variable Transformation: Linear Transformation
There are many different types of transformations to
use on two-variable data. The three which we will
examine are:
• Square-root function
(used for parabolic or power data)
• Logarithmic function
(used for exponential data)
• Inverse function (1/x)
(used for asymptotic data)
37
Worksheet #3
Run the program CARPRI again. Set up a scatterplot of Price vs
Mileage. Try several different transformations, including the
square-root and the logarithmic function.
1) Of the functions tried, which one renders a more linear graph?
2) Can you draw any conclusions about what shape of scatterplots the
LOG function will linearize?
3) What transformation would you suggest for this graph?
38
Worksheet 3; Question #1
1) Of the functions tried, which one renders a
more linear graph?
[Scatterplot: Price vs. Mileage]
This is the graph of the original data.
39
Worksheet 3; Question #1, cont.
1) Of the functions tried, which one renders a more
linear graph?
[Scatterplot: √Price vs. Mileage]
This is the data after a
square-root transformation.
[Scatterplot: Log(Price) vs. Mileage]
This is the data after a
logarithmic transformation.
40
Worksheet 3; Question #1, cont.
1) Of the functions tried, which one renders a
more linear graph?
[Scatterplot: √Price vs. Mileage]
The square-root transformation
improves linearity, but the downward
trend near the origin of the data
persists before the linear pattern
becomes evident.
[Scatterplot: Log(Price) vs. Mileage]
The logarithmic transformation
has less downward trend
compared to the square-root
function, resulting in a more
linear plot.
41
Worksheet 3; Question #1, cont.
1) Of the functions tried, which one renders a more
linear graph?
Residual plots of linear regression models fit to the respective graphs.
[Residual plot: resid vs. Mileage for the √Price model]
The square root transformation
appears not to have eliminated the
trend in the data, but only
lessened the eccentricity of the
relationship.
[Residual plot: resid vs. Mileage for the Log(Price) model]
The residuals in the residual plot
for the logarithm function appear
more random than those for the
square-root function.
42
Solution: Worksheet 3; Question #1, cont.
Examining the Model
[Scatterplot: Log(Price) vs. Mileage, and residual plot: resid vs. Mileage]
Since we have identified the transformation which best linearizes
the data, we can treat the graph as a linear relationship between x
and log(y).
43
Solution: Worksheet 3; Question #1, cont.
Examining the Model
[Scatterplot: Log(Price) vs. Mileage, and residual plot: resid vs. Mileage]
Here’s the big question: Is the new relationship linear?
We can examine the model like any other linear model to answer
this question.
44
Solution: Worksheet 3; Question #1, cont.
Examining the Model
The values for r and r² are very high, and the residual plot seems fairly
random, or at the very least, the most random residual plot obtained.
The linear model appears to fit the data. My conclusion is that there is
a linear association between Mileage and Log(Price).
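The same examination can be sketched in Python. The CARPRI data is not reproduced here, so the mileage and price values below are hypothetical (an exponential decline with small made-up deviations), and `fit_line` is an illustrative helper, not the TI-83's LinReg routine.

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (a, b, r_squared, residuals)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    ss_res = sum(e ** 2 for e in residuals)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot, residuals

# HYPOTHETICAL data: mileage in thousands of miles, price in dollars
mileage = [5, 10, 20, 30, 45, 60, 80, 100, 120, 150]
wiggle = [1.03, 0.97, 1.02, 0.98, 1.01, 0.99, 1.04, 0.96, 1.02, 0.98]
price = [30_000 * math.exp(-0.02 * m) * w for m, w in zip(mileage, wiggle)]

_, _, r2_raw, _ = fit_line(mileage, price)
a, b, r2_log, resid = fit_line(mileage, [math.log10(p) for p in price])

print(f"r^2 for Price vs. Mileage:      {r2_raw:.4f}")
print(f"r^2 for Log(Price) vs. Mileage: {r2_log:.4f}")
print("residuals of the Log(Price) model:", [round(e, 3) for e in resid])
```

As on the slide, a high r² is only part of the check; the residuals should also show no pattern when plotted against Mileage.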
45
Worksheet 3; Question #2
2) Can you draw any conclusions about the shape
of scatterplots the LOG function will linearize?
Since the graph of Price vs. Mileage was a horizontally asymptotic
graph, I think that for most graphs which look similar to the
demonstrated graph, the logarithm function (of no particular base) of
the response variable will linearize the data.
46
Worksheet 3; Question #3
3) What transformation would you suggest
to linearize this graph:
This data appears to follow either a second-degree relationship, or
an exponential relationship. I would suggest using either a
square-root transformation on the response variable, or a normal
logarithm on the response variable.
47
Guidelines for Transformations
The last question of the exercise is
intended to reveal that choosing a
transformation to linearize data requires
thought, and perhaps trial and error,
especially when specific values for the
data are not given (i.e., when only a graph is available).
48
Guidelines for Transformations
Luckily, some guidelines exist which tell us
which transformations to use on which
patterns.
In these guidelines, the ln function is a
recommendation for a logarithmic function:
depending on the data, the base of the
function may differ.
49
Guidelines for Transformations
[Graph: y vs. x]
Contains (0,0) and appears to be a power curve, or a
curve asymptotic to both horizontal and vertical axes.
Suggested transformation: (x,y) → (ln(x),ln(y))
xi>0; yi>0
50
Guidelines for Transformations
[Graph: y vs. x]
Contains a nonzero y-intercept and appears
exponential (either growth or decay).
Suggested transformation: (x,y) → (x,ln(y))
yi>0
51
Guidelines for Transformations
[Graph: y vs. x]
Contains (0,0) and appears logarithmic.
Suggested transformation: (x,y) → (√x,y)
xi≥0
52
Guidelines for Transformations
[Graph: y vs. x]
Contains a nonzero y-intercept and appears
logarithmic.
Suggested transformation: (x,y) → (ln(x),y)
xi>0
53
Guidelines for Transformations
[Graph: y vs. x]
Has nonzero horizontal and vertical asymptotes.
Suggested transformation: (x,y) → (1/x,1/y)
xi≠0; yi≠0
54
(x,y) → (ln(x),ln(y))
(x,y) → (ln(x),y)
(x,y) → (x,ln(y))
(x,y) → (√x,y)
(x,y) → (1/x,1/y)
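The five guidelines can be collected into one small lookup table and tried out. In the Python sketch below, each model curve is a hypothetical example of the shape named in the guideline (the coefficients are made up), and the code checks that the suggested transformation turns that curve into a straight line.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def identity(v):
    return v

def reciprocal(v):
    return 1 / v

# shape -> (HYPOTHETICAL model curve, transformation of x, transformation of y)
guidelines = {
    "power curve through (0,0)": (lambda x: 3 * x ** 2.5, math.log, math.log),
    "exponential": (lambda x: 5 * math.exp(0.3 * x), identity, math.log),
    "concave curve through (0,0)": (lambda x: 2 * math.sqrt(x), math.sqrt, identity),
    "logarithmic": (lambda x: 1 + 2 * math.log(x), math.log, identity),
    "nonzero asymptotes": (lambda x: x / (2 * x + 1), reciprocal, reciprocal),
}

xs = [float(x) for x in range(1, 11)]
results = {}
for shape, (curve, transform_x, transform_y) in guidelines.items():
    ys = [curve(x) for x in xs]
    new_xs = [transform_x(x) for x in xs]
    new_ys = [transform_y(y) for y in ys]
    results[shape] = pearson_r(new_xs, new_ys)
    print(f"{shape}: r after transformation = {results[shape]:.6f}")
```

Real data will not follow any of these curves exactly, so the guidelines are a starting point; the residual plot of the transformed data remains the deciding check.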
55