Transcript ch_02
Chapter 2
Exploring Data with Graphs and
Numerical Summaries
Learn ….
The Different Types of Data
The Use of Graphs to Describe
Data
The Numerical Methods of
Summarizing Data
Agresti/Franklin Statistics, 1 of 63
Section 2.1
What are the Types of Data?
In Every Statistical Study:
Questions are posed
Characteristics are observed
Characteristics are Variables
A Variable is any characteristic that
is recorded for subjects in the study
Variation in Data
The term "variable" highlights
the fact that data values vary.
Example: Students in a
Statistics Class
Variables:
• Age
• GPA
• Major
• Smoking Status
•…
Data values are called
observations
Each observation can be:
• Quantitative
• Categorical
Categorical Variable
Each observation belongs to one of a set of
categories
Examples:
• Gender (Male or Female)
• Religious Affiliation (Catholic, Jewish, …)
• Place of residence (Apt, Condo, …)
• Belief in Life After Death (Yes or No)
Quantitative Variable
Observations take numerical values
Examples:
• Age
• Number of siblings
• Annual Income
• Number of years of education completed
Graphs and Numerical
Summaries
Describe the main features of a
variable
For Quantitative variables: key
features are center and spread
For Categorical variables: key feature
is the percentage in each of the
categories
Quantitative Variables
Discrete Quantitative Variables
and
Continuous Quantitative Variables
Discrete
A quantitative variable is discrete if its
possible values form a set of separate
numbers such as 0, 1, 2, 3, …
Examples of discrete
variables
Number of pets in a household
Number of children in a family
Number of foreign languages spoken
Continuous
A quantitative variable is continuous
if its possible values form an interval
Examples of Continuous
Variables
Height
Weight
Age
Amount of time it takes to complete
an assignment
Frequency Table
A method of organizing data
Lists all possible values for a variable
along with the number of
observations for each value
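As a rough illustration of the definition above, the sketch below builds a frequency table with counts, proportions, and percentages. The `smoking_status` observations are hypothetical values invented for this example; they do not come from the slides.

```python
from collections import Counter

# Hypothetical observations of a categorical variable (not from the slides).
smoking_status = ["No", "Yes", "No", "No", "Former",
                  "No", "Yes", "No", "Former", "No"]

counts = Counter(smoking_status)   # number of observations per category
n = len(smoking_status)            # total number of observations

# Each entry: category -> (count, proportion, percentage).
# proportion = count / n, percentage = 100 * proportion.
table = {cat: (c, c / n, 100 * c / n) for cat, c in counts.items()}

for cat, (c, prop, pct) in table.items():
    print(f"{cat:<8}{c:>3}{prop:>8.2f}{pct:>8.1f}%")
```

The proportions in such a table sum to 1 and the percentages to 100, which is a quick check that no observation was dropped.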
Example: Shark Attacks
What is the variable?
Is it categorical or quantitative?
How is the proportion for Florida
calculated?
How is the % for Florida calculated?
Example: Shark Attacks
Insights – what the data tells us about
shark attacks
Identify the following variable as
categorical or quantitative:
Choice of diet
(vegetarian or non-vegetarian):
a. Categorical
b. Quantitative
Identify the following variable as
categorical or quantitative:
Number of people you have known who have
been elected to political office:
a. Categorical
b. Quantitative
Identify the following variable as
discrete or continuous:
The number of people in line at a box office to
purchase theater tickets:
a. Continuous
b. Discrete
Identify the following variable as
discrete or continuous:
The weight of a dog:
a. Continuous
b. Discrete
Section 2.2
How Can We Describe Data Using
Graphical Summaries?
Graphs for Categorical Data
Pie Chart: A circle having a “slice of
pie” for each category
Bar Graph: A graph that displays a
vertical bar for each category
Example: Sources of Electricity Use
in the U.S. and Canada
Pie Chart
Bar Chart
Pie Chart vs. Bar Chart
Which graph do you prefer?
Why?
Graphs for Quantitative Data
Dot Plot: shows a dot for each
observation
Stem-and-Leaf Plot: portrays the
individual observations
Histogram: uses bars to portray the
data
Example: Sodium and Sugar
Amounts in Cereals
Dotplot for Sodium in Cereals
Sodium Data:
0 210 260 125 220 290 210 140 220 200 125
170 250 150 170 70 230 200 290 180
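Since the plot itself did not survive in this transcript, here is a minimal text sketch of a dot plot for the sodium data above. It is turned sideways (one row per distinct value rather than a horizontal number line), which is a simplification chosen for plain-text output.

```python
from collections import Counter

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

counts = Counter(sodium)

# One row per distinct value, one "*" per observation at that value.
dot_rows = [f"{value:>3} | {'*' * counts[value]}" for value in sorted(counts)]
print("\n".join(dot_rows))
```

Rows are not spaced in proportion to the values, so gaps such as the one between 0 and 70 are less visible here than in a true dot plot.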
Stem-and-Leaf Plot for
Sodium in Cereal
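A stem-and-leaf plot for the sodium data can be sketched in a few lines. The split used here (stem = hundreds and tens digits, leaf = ones digit) is an assumption; the original figure is not available in this transcript.

```python
from collections import defaultdict

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

# Assumption: stem = value // 10 (hundreds and tens), leaf = value % 10 (ones).
stems = defaultdict(list)
for value in sorted(sodium):
    stems[value // 10].append(value % 10)

# List every stem from smallest to largest, including empty ones,
# so that gaps in the data remain visible.
lines = []
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    lines.append(f"{stem:>2} | {leaves}")

print("\n".join(lines))
```

Because each leaf is a digit of an actual observation, the original data values can be read back from the display.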
Frequency Table
Histogram for Sodium in Cereals
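The histogram figure is missing from this transcript, so the following sketch shows how the sodium data could be grouped into intervals and drawn as a text histogram. The interval width of 40 mg and the starting point of 0 are assumptions, not values confirmed by the slides.

```python
sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

width = 40                           # assumed interval width
n_bins = max(sodium) // width + 1    # enough intervals to cover the maximum

# counts[i] holds the number of observations in [i*width, (i+1)*width).
counts = [0] * n_bins
for value in sodium:
    counts[value // width] += 1

for i, c in enumerate(counts):
    low = i * width
    print(f"{low:>3}-{low + width - 1:<3} {'#' * c}")
```

Changing `width` changes the apparent shape, which is the flexibility (and the risk) of choosing intervals for a histogram.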
Which Graph?
Dot plot and stem-and-leaf plot:
• More useful for small data sets
• Data values are retained
Histogram:
• More useful for large data sets
• Most compact display
• More flexibility in defining intervals
Shape of a Distribution
Overall pattern
• Clusters?
• Outliers?
• Symmetric?
• Skewed?
• Unimodal?
• Bimodal?
Symmetric or Skewed?
Example: Hours of TV Watching
Identify the minimum and maximum
sugar values:
a. 2 and 14
b. 1 and 3
c. 1 and 15
d. 0 and 16
Consider a data set containing IQ
scores for the general public:
What shape would you expect a histogram of
this data set to have?
a. Symmetric
b. Skewed to the left
c. Skewed to the right
d. Bimodal
Consider a data set of the scores of
students on a very easy exam in which most
score very well but a few score very poorly:
What shape would you expect a histogram of
this data set to have?
a. Symmetric
b. Skewed to the left
c. Skewed to the right
d. Bimodal
Section 2.3
How Can We Describe the Center of
Quantitative Data?
Mean
The sum of the observations
divided by the number of
observations
$\bar{x} = \frac{\sum x}{n}$
Median
The midpoint of the observations
when they are ordered from the
smallest to the largest (or from the
largest to the smallest)
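The two definitions of center can be written out directly. The sketch below applies them to the cereal sodium data from Section 2.2; the even/odd handling of the median (averaging the two middle values when the count is even) is the usual convention and is assumed here.

```python
import statistics

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

# Mean: sum of the observations divided by the number of observations.
mean = sum(sodium) / len(sodium)

# Median: midpoint of the ordered observations.
ordered = sorted(sodium)
n = len(ordered)
mid = n // 2
if n % 2 == 1:
    median = ordered[mid]                            # single middle value
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2   # average of the two middle values

print(mean, median)
```

Python's standard library offers `statistics.mean` and `statistics.median`, which give the same results for this data.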
Find the mean and median
CO2 pollution levels in the 8 largest nations, measured in
metric tons per person:
2.3 1.1 19.7 9.8 1.8 1.2 0.7 0.2
a. Mean = 4.6, Median = 1.5
b. Mean = 4.6, Median = 5.8
c. Mean = 1.5, Median = 4.6
Outlier
An observation that falls well above
or below the overall set of data
The mean can be highly influenced by
an outlier
The median is resistant: it is not strongly
affected by an outlier
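To see the difference in sensitivity, one can recompute both measures with and without an extreme value. The sketch below uses the CO2 data from the earlier exercise; treating the largest value (19.7) as the outlier is an assumption made for illustration.

```python
import statistics

co2 = [2.3, 1.1, 19.7, 9.8, 1.8, 1.2, 0.7, 0.2]

# Assumption for illustration: treat the largest value as the outlier.
largest = max(co2)
without_largest = [x for x in co2 if x != largest]

mean_all = statistics.mean(co2)
median_all = statistics.median(co2)
mean_trimmed = statistics.mean(without_largest)
median_trimmed = statistics.median(without_largest)

print(f"mean:   {mean_all:.2f} -> {mean_trimmed:.2f}")
print(f"median: {median_all:.2f} -> {median_trimmed:.2f}")
```

Dropping one observation moves the mean by more than two units while the median shifts only slightly.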
Mode
The value that occurs most
frequently.
The mode is most often used with
categorical data
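One way to see why the mode is less useful for quantitative data is to compute it for the cereal sodium values. This sketch uses `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count.

```python
import statistics

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

# multimode returns all values that share the highest frequency.
modes = statistics.multimode(sodium)
print(sorted(modes))
```

Here several values tie for most frequent, so the mode says little about where the data are centered.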
Section 2.4
How Can We Describe the Spread of
Quantitative Data?
Measuring Spread: Range
Range: difference between the largest
and smallest observations
Measuring Spread: Standard
Deviation
Creates a measure of variation by
summarizing the deviations of each
observation from the mean and
calculating an adjusted average of these
deviations
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
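The standard deviation formula can be followed step by step in code. This sketch uses the cereal sodium data from Section 2.2 and also prints the range for comparison; the variable names are illustrative.

```python
import math
import statistics

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

n = len(sodium)
mean = sum(sodium) / n

# Deviation of each observation from the mean, squared.
squared_deviations = [(x - mean) ** 2 for x in sodium]

# "Adjusted average": divide by n - 1 rather than n, then take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))

# Range, for comparison: largest minus smallest observation.
data_range = max(sodium) - min(sodium)

print(f"s = {s:.2f}, range = {data_range}")
```

The standard library function `statistics.stdev` uses the same n - 1 divisor and should agree with `s`.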
Empirical Rule
For bell-shaped data sets:
Approximately 68% of the observations fall
within 1 standard deviation of the mean
Approximately 95% of the observations fall
within 2 standard deviations of the mean
Nearly all (about 99.7%) of the observations fall
within 3 standard deviations of the mean
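The rule can be checked informally by simulation. The sketch below draws a bell-shaped sample with `random.gauss`; the mean of 100, standard deviation of 15, sample size, and seed are all arbitrary assumptions chosen for the illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical bell-shaped data: 10,000 draws from a normal distribution.
sample = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(sample)
s = statistics.stdev(sample)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * s for x in sample) / len(sample)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sd: {share:.3f}")
```

The proportions should land close to the stated percentages; for data that are skewed or bimodal they can differ considerably.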
Parameter and Statistic
A parameter is a numerical summary of
the population
A statistic is a numerical summary of a
sample taken from a population
Section 2.5
How Can Measures of Position
Describe Spread?
Quartiles
Splits the data into four parts
The median is the second quartile, Q2
The first quartile, Q1, is the median of the lower
half of the observations
The third quartile, Q3, is the median of the
upper half of the observations
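The median-of-halves definition above translates into a short function. One detail is assumed here: when the number of observations is odd, the overall median is left out of both halves. Software packages often use other interpolation rules, so their quartiles may differ slightly from this method.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n, the middle observation belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

q1, q2, q3 = quartiles(sodium)
print(q1, q2, q3)
```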
Example: Find the first and third
quartiles
Prices per share of the 10 most actively traded stocks on
the NYSE (rounded to the nearest $):
2 4 11 12 13 15 31 31 37 47
a. Q1 = 2, Q3 = 47
b. Q1 = 12, Q3 = 31
c. Q1 = 11, Q3 = 31
d. Q1 = 11.5, Q3 = 32
Measuring Spread: Interquartile
Range
The interquartile range is the distance
between the third quartile and first
quartile:
IQR = Q3 – Q1
Detecting Potential Outliers
An observation is a potential outlier if
it falls more than 1.5 × IQR below the
first quartile or more than 1.5 × IQR
above the third quartile
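The 1.5 × IQR criterion amounts to computing two cutoffs and flagging anything outside them. The sketch below applies it to the cereal sodium data, with quartiles taken as medians of the lower and upper halves; the names `lower_fence` and `upper_fence` are illustrative, not terms from the slides.

```python
import statistics

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

ordered = sorted(sodium)
half = len(ordered) // 2              # 20 observations, so two halves of 10
q1 = statistics.median(ordered[:half])
q3 = statistics.median(ordered[half:])

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr          # below this: potential outlier
upper_fence = q3 + 1.5 * iqr          # above this: potential outlier

potential_outliers = [x for x in ordered
                      if x < lower_fence or x > upper_fence]
print(iqr, lower_fence, upper_fence, potential_outliers)
```

The halving step assumes an even number of observations, as is the case for this data set.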
The Five-Number Summary
The five number summary of a
dataset:
• Minimum value
• First Quartile
• Median
• Third Quartile
• Maximum value
Boxplot
A box is constructed from Q1 to Q3
A line is drawn inside the box at the median
A line extends outward from the lower end of
the box to the smallest observation that is not
a potential outlier
A line extends outward from the upper end of
the box to the largest observation that is not a
potential outlier
Boxplot for Sodium Data
Sodium Data (ordered):
0 70 125 125 140 150 170 170 180 200
200 210 210 220 220 230 250 260 290 290
Five Number Summary:
Min: 0
Q1: 145
Med: 200
Q3: 225
Max: 290
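The five-number summary for the sodium data, along with the points where the boxplot's lines would end, can be reproduced as follows. The names `whisker_low` and `whisker_high` are illustrative labels for the ends of those lines.

```python
import statistics

sodium = [0, 70, 125, 125, 140, 150, 170, 170, 180, 200,
          200, 210, 210, 220, 220, 230, 250, 260, 290, 290]

ordered = sorted(sodium)
half = len(ordered) // 2              # even count: two halves of 10
q1 = statistics.median(ordered[:half])
med = statistics.median(ordered)
q3 = statistics.median(ordered[half:])

summary = (ordered[0], q1, med, q3, ordered[-1])

# The lines extend to the most extreme observations that are
# not potential outliers under the 1.5 x IQR criterion.
iqr = q3 - q1
inside = [x for x in ordered if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
whisker_low, whisker_high = inside[0], inside[-1]

print(summary, whisker_low, whisker_high)
```

Because 0 lies more than 1.5 × IQR below Q1, the lower line stops at the next smallest value and 0 would be drawn as a separate point.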
Boxplot for Sodium in Cereals
Z-Score
The z-score for an observation measures how far
an observation is from the mean in standard
deviation units
$z = \frac{\text{observation} - \text{mean}}{\text{standard deviation}}$
An observation in a bell-shaped distribution is a
potential outlier if its z-score < -3 or > +3
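As a small worked example, the sketch below computes z-scores for the cereal sodium data and applies the ±3 criterion. Whether these data are close enough to bell-shaped for the criterion to be appropriate is not established here; the code only illustrates the arithmetic.

```python
import statistics

sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

mean = statistics.mean(sodium)
s = statistics.stdev(sodium)          # sample standard deviation (n - 1 divisor)

# z = (observation - mean) / standard deviation
z_scores = [(x - mean) / s for x in sodium]

# Flag observations more than 3 standard deviations from the mean.
flagged = [x for x, z in zip(sodium, z_scores) if abs(z) > 3]

for x, z in sorted(zip(sodium, z_scores)):
    print(f"{x:>3}  z = {z:+.2f}")
print("flagged:", flagged)
```

For this data set the lowest value, 0, has a z-score between -3 and -2, so the z-score criterion does not flag it even though the 1.5 × IQR criterion does; the two methods need not agree.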