Stat 31, Section 1, Last Time
• Distributions (how are data “spread out”?)
• Visual Display: Histograms
• Binwidth is critical
• Bivariate display: scatterplot
• Course Organization & Website
https://www.unc.edu/%7Emarron/UNCstat31-2005/Stat31sec1Home.html
Exploratory Data Analysis 4
“Time Plots”, i.e. “Time Series”:
Idea: when time structure is important,
plot variable as a function of time:
(plot: variable on the vertical axis, time on the horizontal axis)
Often useful to “connect the dots”
Class Time Series Example
Monthly Airline Passenger Numbers
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Done.xls
• Increasing Trend (long term growth, over years)
• Increasing Variation (appears proportional to trend)
• “Seasonal Effect” - 12 Month Cycle (Peak in summer, less in winter)
Airline Passengers Example
Interesting variation:
log transformation
• Stabilizes variation
• Since log of product is sum
• Shows changing variation proportional to trend
• Log10 is “most interpretable” (log10(1000) = 3, …)
• Generally useful trick (there are others)
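A small sketch of why the log transformation stabilizes variation. The numbers are made up (not the airline data): a trend that doubles each year, multiplied by a fixed within-year seasonal factor. Since log10(trend × seasonal) = log10(trend) + log10(seasonal), the seasonal swing has the same size every year on the log scale. The helper name `yearly_spread` is hypothetical.

```python
import math

# Made-up multiplicative series: trend doubles each "year",
# and the same seasonal factors multiply it within every year.
trend = [100, 200, 400, 800]
seasonal = [0.8, 1.0, 1.25, 1.0]

series = [t * s for t in trend for s in seasonal]
logged = [math.log10(v) for v in series]


def yearly_spread(values, period=4):
    """Max minus min within each consecutive block of `period` values."""
    return [
        max(values[i:i + period]) - min(values[i:i + period])
        for i in range(0, len(values), period)
    ]


raw_spread = yearly_spread(series)   # grows in proportion to the trend
log_spread = yearly_spread(logged)   # essentially the same for every year
```

On the raw scale the within-year spread grows with the trend; on the log10 scale it stays constant up to rounding.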
Airline Passengers Example
A look under the hood
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg5Raw.xls
• Use Chart Wizard
• Chart Type: Line (or could do XY)
• Use subtype for points & lines
• Use menu for first log10
• Although could just type it in
• Drag down to repeat for whole column
Time Series HW
HW: 1.36, 1.37
• Use EXCEL
Exploratory Data Analysis 5
Numerical Summaries of Quant. Variables:
Idea: Summarize distributional information
(“center”, “spread”, “skewed”)
In Text, Sec. 1.2
for data $x_1, x_2, \ldots, x_n$
(subscripts allow “indexing numbers” in list)
Numerical Summaries
A. “Centers” (note there are several)
1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Greek letter “Sigma” ($\Sigma$), for “sum”
In EXCEL, use “AVERAGE” function
Numerical Summaries of Center
2. “Median” = Value in middle (of sorted list)
Unsorted E.g: 3, 1, 27, 2, 0 (27 “in middle”? no)
Sorted E.g: 0, 1, 2, 3, 27 (2 better “middle”!)
EXCEL: use function “MEDIAN”
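The two definitions of center can be checked by hand on the five-number list from this slide. This is only a sketch in Python rather than EXCEL. The even-length branch (averaging the two middle values) is the usual convention, not something stated on the slide.

```python
data = [3, 1, 27, 2, 0]           # the unsorted list from the slide

# Mean: add everything up and divide by how many values there are.
mean = sum(data) / len(data)

# Median: sort first, then take the middle value.
ordered = sorted(data)            # [0, 1, 2, 3, 27]
n = len(ordered)
mid = n // 2
if n % 2 == 1:
    median = ordered[mid]                              # odd n: single middle value
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2     # even n: average the middle pair
```

Here the mean comes out to 6.6, well above the median of 2, because the value 27 pulls the mean up.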
Difference Betw’n Mean & Median
Symmetric Distribution: Essentially no difference
Right Skewed: median M splits the area 50% / 50%; mean $\bar{x}$ is bigger since it “feels tails more strongly”
Difference Betw’n Mean & Median
Outliers (unusual values):
Nice Web Example:
http://www.stat.sc.edu/~west/applets/box.html
• Mean feels outliers much more strongly
• Leaves “range of most of data”
• Good notion of “center”? (perhaps not)
• Median affected very minimally
• Robustness Terminology:
Median is “resistant to the effect of outliers”
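A quick illustration of resistance, using made-up numbers: replace the largest value with an extreme one and compare how far each measure of center moves. The Python standard library `statistics` module stands in for the EXCEL functions here.

```python
import statistics

bulk = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 500]   # hypothetical: largest value replaced by an outlier

# How far does each "center" move when the outlier appears?
mean_shift = statistics.mean(with_outlier) - statistics.mean(bulk)
median_shift = statistics.median(with_outlier) - statistics.median(bulk)
```

The mean moves from 3 to 102, outside the range of most of the data, while the median stays at 3.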
Difference Betw’n Mean & Median
A more flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Get various dist’ns, by manipulating bar heights
• See Mean, Median and more
• Similar for symmetric distributions
• Very different when skewed
• “Big Gap”, can make median jump a lot
• But mean is less sensitive (more “continuous”)
Numerical Centerpoint HW
HW: 1.49 a (but make histograms), b
• Use EXCEL
Numerical Summaries (cont.)
B. “Spreads” (again there are several)
1. Range = biggest $x_i$ - smallest $x_i$
Problems:
• Feels only “outliers”
• Not “bulk of data”
• Very non-resistant to outliers
Numerical Summaries of Spread
2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
= “average squared distance to $\bar{x}$”
EXCEL: VAR
Drawback: units are wrong
e.g. For $x_i$ in feet, $s^2$ is in square feet
Numerical Summaries of Spread
3. Standard Deviation = $s = \sqrt{s^2}$
EXCEL: STDEV
• Scale is right
• But not resistant to outliers
• Will use quite a lot later (for reasons described later)
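The variance and standard deviation formulas can be worked through by hand on the small list 0, 1, 2, 3, 27 from the median slide. This is a sketch in Python rather than EXCEL. It uses the n - 1 divisor from the variance formula, which is also what `statistics.variance` and `statistics.stdev` use.

```python
import math
import statistics

data = [0, 1, 2, 3, 27]
n = len(data)
xbar = sum(data) / n                               # mean = 6.6

# Variance: sum of squared distances to the mean, divided by n - 1.
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)

# Standard deviation: square root, which puts the units back
# on the original scale (feet instead of square feet).
s = math.sqrt(s2)

# Cross-check against the library versions.
same_variance = math.isclose(s2, statistics.variance(data))
same_sd = math.isclose(s, statistics.stdev(data))
```

The single large value 27 contributes most of the sum of squares, which is the non-resistance noted in the bullets.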
Interactive View of S. D.
Revisit flexible web example:
http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Note SD range centered at mean
• Can put SD “right near middle” (densely packed data)
• Can put SD at “edges of data” (U shaped data)
• Can put SD “outside of data” (big spike + outlier)
• But generally “sensible measure of spread”
Variance – S. D. HW
HW: for both data sets in 1.49, find the:
i. Variance (698.9, 1079)
ii. Standard Deviation (26.4, 32.9)
• Use EXCEL
Numerical Summaries of Spread
4. Interquartile Range = IQR
Based on “quartiles”, Q1 and Q3
(idea: shows where are 25% & 75% “through the data”)
(figure: Q1, Q2 = median and Q3 split the data into four parts of 25% each)
IQR = Q3 – Q1
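A minimal sketch of computing quartiles and the IQR on a small list. Several quartile conventions exist, and the one used here is an assumption: the `inclusive` method of Python's `statistics.quantiles`, which interpolates linearly between sorted values. The textbook may use a different rule and give slightly different answers on some data.

```python
import statistics

data = [0, 1, 2, 3, 27]

# Cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1        # spread of the middle half of the data
```

For this list Q1 = 1, Q2 = 2 (the median) and Q3 = 3, so IQR = 2. The value 27 has no effect on the IQR, unlike the range or the standard deviation.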
Quartiles Example
Revisit flexible web example:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• Right skewness gives:
– Median < Mean (mean “feels farther points more strongly”)
– Q1 near median
– Q3 quite far (makes sense from histogram)
Quartiles Example
A look under the hood:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Raw.xls
• Can compute as separate functions for each
• Or use: Tools → Data Analysis → Descriptive Stats
• Which gives many other measures as well
• Use “k-th largest & smallest” to get quartiles
5 Number Summary
1. Minimum
2. Q1 - 1st Quartile
3. Median
4. Q3 - 3rd Quartile
5. Maximum
Summarize Information About:
a) Center - from 3
b) Spread - from 2 & 4 (maybe 1 & 5)
c) Skewness - from 2, 3 & 4
d) Outliers - from 1 & 5
5 Number Summary
How to Compute?
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
• EXCEL function QUARTILE
• “One stop shopping”
• IQR seems to need explicit calculation
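For comparison with the EXCEL approach, here is a sketch of a 5 number summary in Python. The function name `five_number_summary` and the data are made up for illustration, and the quartile convention (`inclusive` interpolation) is an assumption that may differ from the textbook's rule.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))


# Made-up right-skewed data.
data = [1, 2, 2, 3, 4, 6, 9, 15, 30]
summary = five_number_summary(data)
low, q1, med, q3, high = summary

iqr = q3 - q1                          # still needs its own explicit subtraction
right_skewed = (q3 - med) > (med - q1) # Q3 farther from the median than Q1
```

Reading the summary: center from the median, spread from Q1 and Q3, skewness from how unevenly Q1 and Q3 sit around the median.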
Rule for Defining “Outliers”
Caution: There are many of these
Textbook version:
Above Q3 + 1.5 * IQR
Below Q1 – 1.5 * IQR
For stamps data:
https://www.unc.edu/~marron/UNCstat31-2005/Stat31Eg6Done.xls
– No outliers at “low end”
– Some at “high end”
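The textbook rule can be sketched as a pair of “fences”. The data below are made up (not the stamps data), and the quartiles again assume the `inclusive` interpolation convention, so the exact fence values could shift under another quartile rule.

```python
import statistics

# Made-up data with one large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

low_outliers = [x for x in data if x < lower_fence]
high_outliers = [x for x in data if x > upper_fence]
```

With these numbers Q1 = 3.25 and Q3 = 7.75, so the fences are -3.5 and 14.5. Nothing is flagged at the low end and only 40 is flagged at the high end.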
Box Plot
• Additional Visual Display Device
• Again legacy from pencil & paper days
• Not supported in EXCEL
• We will skip
5 Number Sum. & Outliers HW
1.49 c, d
1.46 and add:
(d) How much does the mean change if you
omit Montana and Wyoming?