Exploring Data

Download Report

Transcript Exploring Data

Chapter 1 - Exploring
Data
AP Statistics
Amanda Carter
Borrowed from Brian
Hamilton & Jason Molesky
Chapter 1 Section 1
Displaying Distributions with
Graphs
HW: 1.1(a, b), 1.2, 1.3, 1.4, 1. 5, 1.6
1.8, 1.9, 1.10, 1.11
1.13, 1.14, 1.16
Exploratory Data Analysis
o Exploratory Data Analysis: Statistical
practice of analyzing distributions of data
through graphical displays and numerical
summaries.
o Distribution: Description of the values a
variable takes on and how often the variable
takes on those values.
o An EDA allows us to identify patterns and
departures from patterns in distributions.
Categorical Data
• Categorical Variable: Values are labels or
categories. Distributions list the categories
and either the count or percent of
individuals in each.
• Displays: BarGraphs & Pie chart(not done in AP)
Bar Graph
Quantitative Data
• Quantitative Variable:
• Values are numeric - arithmetic computation
makes sense (average, etc.)
• Distributions list the values and # of times the
variable takes on that value.
• Displays:
•
•
•
•
Dotplots
Stemplots
Histograms
Boxplots
Only organized Data can Illuminate!
Your goal is to make neat, organized,
labeled graphs that display the
distribution of data effectively and
provide an insight into patterns and
departures from patterns.
Dotplots
• Small datasets with a small range (max-min)
can be easily displayed using a dotplot.
Draw and label a number line from min to max.
Place one dot per observation above its value.
Stack multiple observations evenly.
34 values
ranging from
0 to8.
Stemplots
• A stemplot gives a quick picture of the
shape of a distribution while including the
numerical values.
• Separate each observation into a stem and a
leaf.
• eg. 14g -> 1|4 256 -> 25|6 32.9oz -> 32|9
• Write stems in a vertical column and draw a
vertical line to the right of the column.
• Write each leaf to the right of its stem.
Stemplots
• Example1.4, pages 42-43
• Literacy Rates in Islamic Nations
Stemplots
• Note: Stemplots do not work well for large
data sets
• Back-to-Back Stemplots: Compare datasets
• Splitting Stems: Double the number of
stems, writing 0-4 after the first and 5-9
after second.
Histograms
• Histograms break the range of data values
into classes and displays the count/% of
observations that fall into that class.
Divide the range of data into equal-width classes.
Count the observations in each class “frequency”
Draw bars to represent classes - height =
frequency
Bars should touch (unlike bar graphs).
Histograms
• Let’s look ex. 1.6,
page 49 on 5th
grade IQ scores
Histograms
TAKE THE DATA AND MAKE
CLASSES WITH FREQUENCY
• Describe the SOCS or
SECS
• What do these data
suggest?
EDA Summary
• The purpose of an Exploratory Data Analysis is
to organize data and identify
patterns/departures.
PLOT YOUR DATA - Choose an appropriate graph
Look for overall pattern and departures from
pattern
Shape {mound, bimodal, skewed, uniform}
Outliers {points clearly away from body of data}
Center {What number “typifies” the data?}
Spread {How “variable” are the data values?}
Let’s Look/Investigate
Draw different “shaped” histograms
Cumulative Relative Frequency
Graphs
(OGIVE)
Histograms do a good job of displaying the
distribution of values or a quantitative
variable, but tell us little about the relative
standing of an individual observation. If we
want relative standing information, we use an
ogive.
Cumulative Relative Frequency
Graphs
(OGIVE)
Cumulative Relative Frequency
Graphs
• Decide on class intervals and determine the frequency in
each class (like with a histogram), but add some additional
columns to your table.
• Label and scale your axes and title your graph.
• Plot a point corresponding to the relative cumulative frequency in
each class interval at the left endpoint of the next class interval.
Using Ogives
Bill Clinton took office
at the age of 46. Find
his relative standing
for the age he took
office.
About 10% of all U.S.
Presidents were the
same age as or
younger than Bill
Clinton when they
were inaugurated.
Average Price of Gasoline from 1996-2006
Chapter 1 Section 2
Describing Distributions with Numbers
HW: 1.27, 1.29-1.31, 1.33, 1.36,
1.37, 1.41, 1.43, 1.45, 1.46
Describing Distributions with
Numbers
What three things do we always include
when describing a distribution of
numbers?
• Shape
• Center
• Spread
Describing Distributions with
Numbers
What three things do we always include
when describing a distribution of numbers?
• Shape- use a histogram
• Center
• Spread
Describing Distributions with
Numbers
What three things do we always include
when describing a distribution of numbers?
• Shape- use a histogram
• Center• Spread-
Describing Center
Find the mean for the Fuel Economy for
2004 model motor vehicles
You can use
a graphing
calculator
How Do I write this?
• Use your notation and begin the
calculation (this tells me you
understand the formula) and it will
allow for partial credit versus no
credit….
• (show example)
Means & Outliers
Are there any values that appear to be
outliers in our fuel economy data?
If so, find the mean without the outlier.
The single outlier adds 2 mpg onto the mean
highway mileage of 2004 motor vehicles.
The mean is sensitive to the influence of a
few extreme observations.
It is not a resistant measure of center.
Finding the median
• Median highway mileage for two-seaters
13
15
16
16
17
19
20
22
23
23
23
24
25
25
26
28
28
28
29
32
6
Without the Honda insight?
13
15
16
16
17
19
20
22
23
23
23
24
25
25
26
28
28
28
29
32
• The median is a more resistant measure of
center than the mean.
Mean verses Median
How do you think the mean and median
compare in a symmetric distribution?
In a right skewed distribution?
Don't confuse the “average” value (mean)
with the “typical” value (median).
Mean vs. Median
The mean and the median are the most
common measures of center.
If a distribution is perfectly symmetric,
the mean and the median are the same.
The mean is not resistant to outliers.
You must decide which number is the
most appropriate description of the
center...
MeanMedian Applet
Mean verses Median
Generally, when you encounter outliers,
correct them if wrongly recorded, delete
them for good reason, or otherwise give them
individual attention.
If correctly recorded outliers cannot be
discarded, consider using resistant measures.
Measuring Spread
Quartiles
Quartiles Q1 and Q3 represent the 25th
and 75th percentiles.
To find them, order data from min to max.
Determine the median - average if necessary.
The first quartile is the middle of the ‘bottom
half’.
The third quartile is the middle of the ‘top
half’.
19
22
23
23
23
Q1=23
45
68
74
Q1
26
26
27
28
med
75
76
med=79
29
30
31
32
Q3=29.5
82
82
91
Q3
93
98
Finding the Quartiles
Find the Quartiles, and IQR.
13
15
16
16
17
19
20
22
23
23
23
24
25
25
26
28
28
28
2
3
Computer Software Output
5-Number Summary, Boxplots
We can visualize the 5 Number
Summary with a boxplot.
min=45
45
50
Q1=74
55
60
med=79
65
70
Q3=91
75
80
Quiz Scores
85
max=98
90
95 100
Measures of Spread
Variability is the key to Statistics.
Without variability, there would be no
need for the subject.
When describing data, never rely on center
alone.
Measures of Spread:
Range - {rarely used...why?}
Quartiles - InterQuartile Range {IQR=Q3Q1}
Variance and Standard Deviation {var and sx}
Like Measures of Center, you must
choose the most appropriate measure of
What about OUTLIERS?
THIS HELPS SETS UP “FENCES” TO SEE IF THERE ARE ANY
DATA THAT FALLS OUTSIDE..IF SO THEY ARE OUTLIERS
1.5 • IQR Rule
To determine outliers:
Find 5 Number Summary
Determine IQR
Multiply 1.5xIQR
Set up “fences” Q1-(1.5IQR) and
Q3+(1.5IQR)
Observations “outside” the fences are
outliers.
Outlier Example
All data
on p. 48.
IQR=45.72-19.06
IQR=26.66
fence: 19.0639.99
= -20.93
fence: 45.72+39.99
= 85.71
outliers
}
{
0
1.5IQR=1.5(26.66)
1.5IQR=39.99
10
20
30
40
50 60 70
Spending ($)
80
90
100
Let’s CREATE a Box-Plot
What do we need?
Standard Deviation
Another common measure of spread is
the Standard Deviation: a measure of
the “average” deviation of all
observations from the mean.
To calculate Standard Deviation:
Calculate the mean.
Determine each observation’s deviation (x
- xbar).
“Average” the squared-deviations by
dividing the total squared deviation by (n1).
This quantity is the Variance.
Standard Deviation
(x1  x ) 2  (x2  x ) 2  ... (xn  x ) 2
Variance:var 
n 1
sx 
Standard
Deviation:

2
(x

x
)
 i
n 1
Example 1.16 (p.85): Metabolic Rates
1792
1666
1362

1614
1460
1867
1439
Standard Deviation
1792
1666
1362
1614
1460
1867
1439
Metabolic Rates: mean=1600
x
(x - x)
(x - x)2
1792
192
36864
1666
66
4356
1362
-238
56644
1614
14
196
1460
-140
19600
1867
267
71289
1439
-161
25921
Totals:
0
214870
Total
Squared
Deviation
214870
Variance
var=214870/
6
var=35811.6
6
Standard s=√35811.66
Deviation s=189.24 cal
What does this value, s,
mean?
Linear Transformations
Variables can be measured in different units (feet vs
meters, pounds vs kilograms, etc)
When converting units, the measures of center and
spread will change.
Linear Transformations (xnew=a+bx) do not change
the shape of a distribution.
Multiplying each observation by b multiplies both
the measure of center and spread by b.
Adding a to each observation adds a to the
measure of center, but does not affect spread.
Data Analysis Toolbox
•To answer a statistical question of interest:
Data: Organize and Examine
Who are the individuals described?
What are the variables?
Why were the data gathered?
When,Where,How,By Whom were data gathered?
Graph: Construct an appropriate graphical display
Describe SOCS
Numerical Summary: Calculate appropriate center and
spread (mean and s or 5 number summary)
Interpretation: Answer question in context!
Chapter 1 Summary
Data Analysis is the art of
describing data in context using
graphs and numerical summaries.
The purpose is to describe the most
important features of a dataset.