Chapter 1: Data Collection


Chapter 2: Descriptive Statistics
2.1 Organizing Qualitative Data
2.2 Organizing Quantitative Data
2.3 Additional Displays
2.4 Misrepresentations of Data
September 10, 2008
1
Categorical Variables
• Each observation (data point) for a categorical variable belongs to exactly one of
several categories.
• Examples of categorical variables:
– Gender (Categories: male or female)
– Religious Affiliation (Protestant, Catholic, Jew, Muslim, etc.)
– Home State or Country (NJ, AR, CA, FL, Canada, etc.)
– Favorite Singer (Elvis, Sting, Sinatra, etc.)
– Eye Color (brown, green, blue, hazel, black)
– Favorite Type of Music (jazz, country, rock, etc.)
Section 2.1
2
Frequency Tables for
Categorical Data
Consider a population that has $N$ categorical variables: $C_1, C_2, \dots, C_N$. For example,
consider the population of freshmen at Vanderbilt during the present semester and
the categorical variables for this population: $(C_1, C_2, C_3)$ = (gender, state, favorite color).
For each categorical variable $C_j$, we list its possible categories: say it can take $k$ values,
$x_{j1}, x_{j2}, \dots, x_{jk}$. For example, $C_1$ has the possible categories male and female,
i.e., $k = 2$; $C_2$ has 51 possible categories (50 states + other), i.e., $k = 51$.
Definition: For a population or a sample and a particular categorical variable,
the number of times that the variable falls in a particular category is called the
frequency of this category. The category that has the highest frequency is
called the mode for the variable. A table composed of the frequencies for the
categories is sometimes called the frequency distribution or simply the
distribution of the categorical variable.
Remark: It makes sense to construct frequency tables for a discrete
quantitative variable since we can consider each discrete value of the variable
a category.
3
Relative Frequency
Definition: Suppose that a categorical variable has $N$ categories. Furthermore, suppose that
category $k$ has frequency $f_k$ and that $n = f_1 + f_2 + \dots + f_N$ is the total number of data
points in the sample. Then the relative frequency of the $k$th category is defined as
$r_k = f_k / n$, $k = 1, 2, \dots, N$. The relative frequency is also called the proportion.
Example: The categorical variable is the color of a ball in a population. A sample
of 10 red, green and blue balls:
Category   Frequency   Relative Frequency
Red        5           5/10 = 0.5
Green      2           2/10 = 0.2
Blue       3           3/10 = 0.3
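The table above can be reproduced with a few lines of Python. This is only a sketch: the list `sample` is a stand-in for the ten observed balls, built to match the counts in the table.

```python
from collections import Counter

# Stand-in for the sample of 10 balls, matching the counts in the table.
sample = ["red"] * 5 + ["green"] * 2 + ["blue"] * 3

frequency = Counter(sample)    # category -> number of observations
n = sum(frequency.values())    # total number of data points
relative_frequency = {color: count / n for color, count in frequency.items()}

# Print one row per category: name, frequency, relative frequency.
for color in frequency:
    print(f"{color:<6} {frequency[color]:>2}  {relative_frequency[color]:.1f}")
```

The relative frequencies always sum to 1, which is a quick check on any frequency table.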
4
Example
Consider the population of vehicles that are parked in the 25th Avenue
Garage and consider the categorical variable for the type of
transmission (automatic or manual) in the vehicles. One hundred cars
were surveyed. We construct a frequency table.
Category     Number of Vehicles
Automatic    73
Manual       27
The frequency of automatics is 73 and the frequency of manuals
is 27. The mode for the categorical variable and sample is automatic,
the category with the highest frequency (73).
The relative frequency of automatics is 73/100 = 0.73 (73%).
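As a small illustration, the mode can be read off a frequency table by picking the category with the largest count. The helper name `mode` below is chosen for this sketch only; note that it returns a category, not a count.

```python
# Frequency table from the garage survey above.
frequency = {"Automatic": 73, "Manual": 27}

def mode(freq):
    """Return the category with the highest frequency (the first one wins a tie)."""
    return max(freq, key=freq.get)

n = sum(frequency.values())
modal_category = mode(frequency)
print(modal_category)                  # the category itself
print(frequency[modal_category] / n)   # its relative frequency
```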
5
Remarks on Frequency Tables
• A method of organizing data
• Lists all possible categories for a variable along with the number of
observations in each category.
• In addition, we sometimes add columns for the proportion and
percentage for each value of the variable.
6
Example
Florida: 289/735 = 0.3931972 and (289/735) × 100 ≈ 39.3%
7
Example (categorical)
We are interested in the dominant color of cars that are parked on the Vanderbilt campus.
Suppose we go to the 25th Avenue Garage and survey the color (black, white, red, blue, green,
other) of 110 cars for a sample. In the table below we summarize the counts of this
categorical variable.
Color      Frequency
Black      20
White      10
Red        15
Blue       35
Green      10
Other      20
8
Bar Chart
Definition: A bar chart for a categorical variable is a series of horizontal or vertical bars, with
the height (or length) of each bar representing the frequency of a particular category of the variable.
Color      Frequency
Black      20
White      10
Red        15
Blue       35
Green      10
Other      20
Bar charts can also be
constructed using Excel.
9
Bar Chart for Relative Frequency
Remark: Instead of the bars representing the frequency of a category, they could represent the
relative frequency.
Color      Frequency   Relative Frequency
Black      20          0.182
White      10          0.091
Red        15          0.136
Blue       35          0.318
Green      10          0.091
Other      20          0.182
10
Pie Chart
Definition: A pie chart for a categorical variable is a circle divided into sectors with each
sector representing the frequency of a category for the variable.
Color      Frequency
Black      20
White      10
Red        15
Blue       35
Green      10
Other      20
11
Variations of Pie Chart
12
Pie Chart with Excel
Create a pie chart for the following data using Excel.
Color      Frequency
Black      20
White      10
Red        15
Blue       35
Green      10
Other      20
13
Example (Doctorates)
Doctorate Recipients: 1983, 1993, 2003. For each year we have six
categories: type of degree.
Year   Physical Sciences   Engineering   Life Sciences   Social Sciences   Humanities   Education
1983   4425                2781          5553            6096              3500         7174
1993   6496                5698          7395            6545              4481         6689
2003   5963                5265          8369            6777              5412         6627
14
(continued)
Chart legend: green = 1983, red = 1993, orange = 2003.
15
Pareto Charts
Definition: A Pareto Chart is a bar graph whose bars are drawn in decreasing order
of frequency or relative frequency.
In a bar chart, if we order the bars (categories) from tallest to shortest, then this bar
chart is called a Pareto Chart. The reason for doing this is that the “most important”
category appears first.
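The ordering step can be sketched in Python as follows, using the car-color frequencies from the earlier example. How ties are ordered is not specified in the definition; here tied categories simply keep their original order.

```python
# Car-color frequencies from the earlier example.
frequency = {"Black": 20, "White": 10, "Red": 15, "Blue": 35, "Green": 10, "Other": 20}

# Sort categories by decreasing frequency. sorted() is stable, so categories
# with equal frequency keep the order they had in the original table.
pareto_order = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

for color, count in pareto_order:
    print(f"{color:<6} {count}")
```

Drawing the bars in the order given by `pareto_order` turns an ordinary bar chart into a Pareto chart.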
16
Example
Consider the following sample composed of Vanderbilt students who are
studying at least one foreign language.
Spanish, Chinese, Spanish, Spanish, Spanish, Chinese,
German, Spanish, Spanish, French, Spanish, Spanish,
Japanese, Latin, Spanish, German, German, Spanish,
Italian, Spanish, Italian, Japanese, Chinese, Spanish,
French, Spanish, Spanish, Russian, Latin, French
(a) Construct the frequency distribution for this sample.
(b) Construct the relative frequency distribution.
(c) Construct the bar chart for the frequency.
(d) Construct the bar chart for the relative frequency.
(e) What is the mode of the frequency distribution?
17
Solution
Category    Frequency   Relative Frequency
French      3           3/30 = 0.100
Latin       2           2/30 = 0.067
Russian     1           1/30 = 0.033
Japanese    2           2/30 = 0.067
Italian     2           2/30 = 0.067
German      3           3/30 = 0.100
Chinese     3           3/30 = 0.100
Spanish     14          14/30 = 0.467
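For parts (c) and (d), a rough text version of the two bar charts can be sketched as below. The function `bar_chart` and its `scale` parameter are names invented for this illustration; a real chart would normally be drawn with Excel or a plotting package.

```python
# Frequencies from the solution table above.
frequency = {"French": 3, "Latin": 2, "Russian": 1, "Japanese": 2,
             "Italian": 2, "German": 3, "Chinese": 3, "Spanish": 14}
n = sum(frequency.values())
relative = {name: count / n for name, count in frequency.items()}

def bar_chart(values, scale=1):
    """Return one text line per category; bar length is round(value * scale)."""
    width = max(len(name) for name in values)
    return [f"{name:<{width}} | {'#' * round(value * scale)}"
            for name, value in values.items()]

print("\n".join(bar_chart(frequency)))           # bars show frequencies
print()
print("\n".join(bar_chart(relative, scale=n)))   # bars show relative frequencies
```

With the relative frequencies scaled by the sample size, the two charts have identical shape; only the labeling of the frequency axis differs.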
18
Organizing Quantitative Data
Two Types of Quantitative Data
• Discrete
– Tables (frequency tables, relative frequency tables)
– Dot Plots
– Stem-and-Leaf Plots
– Histograms
• Continuous
– Histograms
Section 2.2
19
Tables and Discrete Data
Remark: For the purpose of building frequency tables, there is essentially no difference
between categorical data and discrete quantitative data. Each distinct number serves as a category.
Example: Consider a discrete set of quantitative data:
{1,-1,1,0,0,2,3,1,0,2} .
We can construct a frequency table for the numbers in this set of
numbers.
Data Point   Frequency   Relative Frequency
-1           1           1/10 = 0.1
0            3           3/10 = 0.3
1            3           3/10 = 0.3
2            2           2/10 = 0.2
3            1           1/10 = 0.1
Sum          10          1.0
20
Frequency Chart
21
Histograms
Definition: A histogram is a special type of bar chart that shows the
frequency of quantitative data that is separated into intervals (bins
or classes).
22
Example
Construct a histogram for the data {1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0},
using the bins: [0,1), [1,2), [2,3).
[0,1): 0.9, 0.2 (frequency = 2)
[1,2): 1.1, 1.8, 1.3 (frequency = 3)
[2,3): 2.5, 2.1, 2.1, 2.9, 2.0 (frequency = 5)
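The counting step of this example can be sketched in Python. The function name `bin_counts` is made up for this illustration; each bin is half-open, so a value equal to a bin's left edge belongs to that bin.

```python
from bisect import bisect_right

data = [1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0]
edges = [0, 1, 2, 3]   # consecutive edges define the bins [0,1), [1,2), [2,3)

def bin_counts(values, edges):
    """Count how many values fall in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        if not edges[0] <= x < edges[-1]:
            raise ValueError(f"{x} lies outside the bins")
        # bisect_right gives the number of edges <= x, so subtract 1 for the bin index.
        counts[bisect_right(edges, x) - 1] += 1
    return counts

print(bin_counts(data, edges))
```

The resulting counts are the bar heights of the histogram.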
23
Dot Plots
Definition: A dot plot is a chart for discrete quantitative data in which each observation is
represented by a dot and the possible values of the data are shown along the horizontal axis.
• Primarily for discrete quantitative data
• Similar to a bar chart or histogram
• Includes information about frequency, i.e., how many times a data
point appears as a single number or in a range of values.
24
Example (quantitative)
Suppose we stand at the entrance of the Math. Building and count the number of people
entering over a 10 minute period in 1 minute increments. Below we have a table that
summarizes our sample and the resulting dot plot.
Time Interval   Count
1 (0-1)         3
2 (1-2)         1
5 (4-5)         3
6 (5-6)         4
10 (9-10)       7
In the table, we omitted intervals during which no people entered.
25
Example
This table summarizes the amount of sodium (mg) and sugar (g) for some popular
breakfast cereals. It also characterizes the type (adult or child) of cereal. Hence, we
have three pieces of data (variables) for each cereal: 2 quantitative and 1 categorical.
We will use the dot plot for the sodium.
26
Dot Plot of Sodium
Notice that a dot plot gives information about how often a number in a
numerical data sample recurs, e.g., 70 occurs once and 200 twice.
27
Stem-and-Leaf Plots
• A stem-and-leaf plot organizes data to show its shape and distribution.
• Each data point is represented by a stem and a leaf.
• Usually, the leaf is the last digit of the numerical data point and the other
digits to the left of the leaf form the stem. For example, if 9834 is a data
point, then 4 is the leaf and 983 is the stem. (stem | leaf)
• In a set of data, a stem may have several leaves.
• For one-digit data (0,1,2,…,9), we can represent the data as 00,01,…,09.
For a data point 0X, the leaf is X and the stem is 0.
• We usually organize by stems.
• It is sometimes necessary to modify this representation when large numbers are
involved. In this case the stem will represent a class of numbers of the
form $d \times 10^s$.
28
Example
Suppose a sample contains the following data points: {9, 15, 17, 24, 50, 65, 101, 170, 171}.
Number    Stem   Leaf
9 = 09    0      9
15        1      5
17        1      7
24        2      4
50        5      0
65        6      5
101       10     1
170       17     0
171       17     1

Stems   Leaves
0       9
1       57
2       4
3
4
5       0
6       5
7
8
9
10      1
11
12
13
14
15
16
17      01
29
Example
Construct a Stem-and-Leaf plot for the data: {5.4, 4.3, 4.1, 8.6, 6.0, 7.9, 9.1, 6.1, 3.1, 14.5,
12.5, 8.3, 10.1, 8.2, 6.8, 10.9, 2.3, 1.0, 8.3, 8.9, 6.1, 6.5, 6.0, 9.4, 0.1, 13.9, 3.7, 10.1, 9.9,
4.9, 6.4, 10.3, 2.3, 11.9, 11.7, 12.1, 9.8, 7.8, 2.9, 6.7}.
We ignore the decimal point or, alternatively, multiply each number by 10.
Stems   Leaves
0       1
1       0
2       339
3       17
4       139
5       4
6       00114578
7       89
8       23369
9       1489
10      1139
11      79
12      15
13      9
14      5
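The construction can be sketched in Python as follows. The function name `stem_and_leaf` is chosen for this illustration, and it assumes non-negative integers, so the data are first multiplied by 10 as described above.

```python
from collections import defaultdict

data = [5.4, 4.3, 4.1, 8.6, 6.0, 7.9, 9.1, 6.1, 3.1, 14.5,
        12.5, 8.3, 10.1, 8.2, 6.8, 10.9, 2.3, 1.0, 8.3, 8.9,
        6.1, 6.5, 6.0, 9.4, 0.1, 13.9, 3.7, 10.1, 9.9, 4.9,
        6.4, 10.3, 2.3, 11.9, 11.7, 12.1, 9.8, 7.8, 2.9, 6.7]

def stem_and_leaf(values):
    """Map each stem to a string of its sorted leaves (non-negative integers only)."""
    leaves = defaultdict(list)
    for v in values:
        leaves[v // 10].append(v % 10)   # last digit is the leaf, the rest is the stem
    # List every stem between the smallest and largest, including empty ones.
    return {stem: "".join(str(d) for d in sorted(leaves.get(stem, [])))
            for stem in range(min(leaves), max(leaves) + 1)}

scaled = [round(x * 10) for x in data]   # remove the decimal point
plot = stem_and_leaf(scaled)
for stem, leaf_string in plot.items():
    print(f"{stem:>2} | {leaf_string}")
```

Rounding after multiplying by 10 guards against floating-point results such as 83.00000000000001.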
30
On-line Stem-and-Leaf Plotter
http://www.shodor.org/interactivate/activities/StemAndLeafPlotter/
31
Stem-and-leaf Plots and Frequency
Consider a sample {101,103,104,108,109}. If we construct the
stem-and-leaf plot for this data, then there is a single stem (10) and
five leaves (1,3,4,8,9). Hence, the number of leaves, i.e., 5, is the
frequency with which the data appear in the interval [100,109]. We
can conclude that the number of leaves on a stem equals the number of
data points that fall in the corresponding interval of ten consecutive integers.
32
Bottom Line
Dot plots and stem-and-leaf plots segregate the data into bins (or
numerical ranges or classes) and show the frequency of data
within those classes. This is useful information, but these plots are not
practical when a sample has a large number of data points.
33
Remark: Frequency Tables & Dot Plots
A frequency table and a dot plot give basically the same information.
Sodium Data:
000 210
260 125
220 290
210 140
220 200
125 170
250 150
170 70
230 200
290 180
The frequency of a sodium level or interval can be read off the dot plot.
34
Continuous Data described by
Histograms
Definition: A histogram is a type of bar chart that gives the frequencies or
relative frequencies of occurrences of a quantitative variable (either discrete
or continuous) in specified intervals.
Interval   Frequency
0-39       1
40-79      1
80-119     0
120-159    4
160-199    3
200-239    7
240-279    2
280-319    2
35
Construction of Histograms
• Define intervals of equal width for the variable under consideration. For
example, if the data in our sample are integers and range from 0 to 50, we
might choose the intervals (bins) [0,9],[10,19],[20,29],[30,39],[40,49],[50,59].
• The intervals or bins are called classes. The length of a class is called the
class width.
• Count the number of data points in each bin. In the above example, this
gives six nonnegative integer counts, one per bin.
• Construct a bar chart with the intervals specifying the width of the bars and
the frequencies giving the height of the bars. Note that the drawn width of the bar
is arbitrary as long as we know the length of the intervals over which we do
the frequency counting.
• The heights of the bars in the histogram are called the distribution of the
sample.
• Histograms could be used for categorical data.
Remark: Instead of using the frequency counts, we could use the fraction of
the total sample size (percentage) as the height.
36
Example
Construct a histogram (using percentages) for the following sample:
{1.1, -1.0, 2.1, 3.5, -2.1, 0.9, 0.75, -0.5, 0.25, 4.5, 4.1}.
Interval   Frequency   Fraction
[-3,-2)    1           1/11 ≈ 0.091
[-2,-1)    0           0/11 = 0
[-1,0)     2           2/11 ≈ 0.182
[0,1)      3           3/11 ≈ 0.273
[1,2)      1           1/11 ≈ 0.091
[2,3)      1           1/11 ≈ 0.091
[3,4)      1           1/11 ≈ 0.091
[4,5)      2           2/11 ≈ 0.182
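For equal-width bins, the bin index of a data point can be computed directly instead of searching through the bins. The sketch below assumes bins of width 1 starting at -3, as in the table; the function name and its parameters are invented for this illustration.

```python
import math

data = [1.1, -1.0, 2.1, 3.5, -2.1, 0.9, 0.75, -0.5, 0.25, 4.5, 4.1]

def equal_width_histogram(values, start, width, num_bins):
    """Count values in the bins [start + i*width, start + (i+1)*width), i = 0..num_bins-1."""
    counts = [0] * num_bins
    for x in values:
        index = math.floor((x - start) / width)
        if not 0 <= index < num_bins:
            raise ValueError(f"{x} lies outside the bins")
        counts[index] += 1
    return counts

counts = equal_width_histogram(data, start=-3, width=1, num_bins=8)
fractions = [c / len(data) for c in counts]   # heights when plotting fractions

for i, (c, f) in enumerate(zip(counts, fractions)):
    print(f"[{-3 + i},{-2 + i})  {c}  {f:.3f}")
```

Multiplying each fraction by 100 gives the percentages used for the bar heights.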
37
Histogram for Example
38
Example (IQ Scores)
IQ Range   Frequency
60-69      2
70-79      3
80-89      13
90-99      42
100-109    58
110-119    40
120-129    31
130-139    8
140-149    2
150-159    1
39
(continued)
How many students were sampled?
What is the width of the intervals?
Which range of IQ had the highest frequency?
Which range of IQ had the lowest frequency?
40
Dot, Stem-and-leaf, or Histogram?
• Dot plot and Stem-and-Leaf plot:
– Useful for showing information about small data
sets.
– Shows actual data.
• Histogram
– Useful for showing information about large data
sets.
– Can be used for continuous or discrete data.
– Most compact plot.
– Has flexibility in defining intervals.
41
The Shape of the Distribution
For a histogram, we can associate the graph of a function by
drawing a smooth curve through the midpoints of each bar.
The shape of this curve can be used to describe the shape
of the histogram.
42
Unimodal and Bimodal
Unimodal: one hump
Bimodal: two humps
43
Skewed Distributions
Skewed to the right
Skewed to the left
44
Symmetric
Distribution Terminology
• The class with the highest bar in a histogram is called the mode of
the distribution. Hence, the terminology unimodal and bimodal.
• A distribution is said to be symmetric if there is a vertical line
that separates the distribution into two mirror-image halves.
• A distribution that is not symmetric is said to be skewed.
• The “ends” of a distribution are called the tails of the distribution.
45
Outliers
A bar that is completely separated from the cluster of bars is called an outlier.
46
Hours of TV Watching
47
Wechsler Adult Intelligence Scale (IQ)
Range      %
<55        0.15
55-70      1.85
70-85      13.0
85-100     35.0
100-115    33.0
115-130    15.0
130-145    1.80
>145       0.20
The distribution is almost symmetric.
48
Additional Displays for
Quantitative Data
Alternative to histograms for quantitative data: Frequency
Polygons.
Definition: Suppose that an interval, [a,b), represents a class for a
set of quantitative data. The class midpoint is defined as (a+b)/2.
Definition: A frequency polygon is a graph that is constructed from
the class midpoints and their frequencies.
Bins (class)   Class Midpoint   Frequency
[a,b)          (a+b)/2          f
…              …                …
…              …                …
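As a small illustration, the points of a frequency polygon can be computed from the bin edges and the frequencies. The bins and counts below are borrowed from the earlier histogram example with bins [0,1), [1,2), [2,3); the function name `class_midpoints` is chosen for this sketch.

```python
def class_midpoints(edges):
    """Return (a + b) / 2 for each pair of consecutive bin edges a, b."""
    return [(a + b) / 2 for a, b in zip(edges, edges[1:])]

edges = [0, 1, 2, 3]      # bins [0,1), [1,2), [2,3)
frequencies = [2, 3, 5]   # frequency of each bin

# Each polygon vertex is (class midpoint, frequency); joining the vertices
# with line segments gives the frequency polygon.
polygon = list(zip(class_midpoints(edges), frequencies))
print(polygon)
```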
Section 2.3
49
Example
Mathematica Demonstration
50
Cumulative Frequency
Distribution
Suppose that $\{f_1, f_2, \dots, f_k\}$ is the set of frequencies for some data set of size $n$.
That is, suppose that we subdivide the interval between the smallest and largest values of the
data set into $k$ categories (subintervals). We then count the number of data points that lie in
each subinterval. The cumulative frequency of category $j$ is defined as
$f_1 + f_2 + \dots + f_j = \sum_{i=1}^{j} f_i$.
Note that the cumulative frequency of category $k$ is $f_1 + f_2 + \dots + f_k = n$.
51
Cumulative Frequency
52
Example
data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}
bins = [0,1), [1,2), [2,3), [3,4)
n = 13
k=4
Bin     Frequency   Cumulative Frequency
[0,1)   3           3
[1,2)   6           3+6 = 9
[2,3)   2           3+6+2 = 11
[3,4)   2           3+6+2+2 = 13
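The running sums in the last column can be sketched in Python with `itertools.accumulate`, starting from the bin frequencies in the table.

```python
from itertools import accumulate

frequencies = [3, 6, 2, 2]   # bins [0,1), [1,2), [2,3), [3,4)

# accumulate() yields the running sums f1, f1+f2, f1+f2+f3, ...
cumulative = list(accumulate(frequencies))
print(cumulative)

# The last cumulative frequency equals the sample size n.
n = sum(frequencies)
print(cumulative[-1] == n)
```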
53
Cumulative Relative
Frequency Distribution
If $\{f_1, f_2, \dots, f_k\}$ are the frequencies in bins (classes)
$[a_1, a_2), [a_2, a_3), \dots, [a_k, a_{k+1})$ for a set of data such that
$f_1 + f_2 + \dots + f_k = n$, then we define the relative frequencies $r_j = f_j / n$. We note that
$r_1 + r_2 + \dots + r_k = 1$. The cumulative relative frequency for bin $j$ is defined as
$r_1 + r_2 + \dots + r_j$.
54
Example
data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}
bins = [0,1), [1,2), [2,3), [3,4)
n = 13
k=4
Bin     Frequency   Cumulative Frequency   Relative Frequency (rounded)   Cumulative Relative Frequency
[0,1)   3           3                      3/13 ≈ 0.231                   3/13 ≈ 0.231
[1,2)   6           3+6 = 9                6/13 ≈ 0.462                   9/13 ≈ 0.692
[2,3)   2           3+6+2 = 11             2/13 ≈ 0.154                   11/13 ≈ 0.846
[3,4)   2           3+6+2+2 = 13           2/13 ≈ 0.154                   13/13 = 1.000
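Adding relative frequencies that have already been rounded can drift in the last digit. One way to avoid this, sketched below, is to keep exact fractions and round only when printing; the use of `Fraction` is a choice made for this illustration, not part of the definition.

```python
from fractions import Fraction
from itertools import accumulate

frequencies = [3, 6, 2, 2]   # bins [0,1), [1,2), [2,3), [3,4)
n = sum(frequencies)

relative = [Fraction(f, n) for f in frequencies]   # exact r_j = f_j / n
cumulative_relative = list(accumulate(relative))   # exact r_1 + ... + r_j

# Round to three decimals only for display.
for r, c in zip(relative, cumulative_relative):
    print(f"{float(r):.3f}  {float(c):.3f}")
```

The final cumulative relative frequency is exactly 1, as the definition requires.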
55
Relative Frequency
Distribution (histogram)
56
Ogive
Definition: An ogive is a graph of the cumulative frequency or the
relative cumulative frequency as a function of the bins used to construct
the cumulative or relative cumulative frequency. It is constructed by
using a cumulative frequency (or relative cumulative frequency) table.
57
Example
Bin     Frequency   Cumulative Frequency   Relative Frequency (rounded)   Cumulative Relative Frequency
[0,1)   3           3                      3/13 ≈ 0.231                   3/13 ≈ 0.231
[1,2)   6           3+6 = 9                6/13 ≈ 0.462                   9/13 ≈ 0.692
[2,3)   2           3+6+2 = 11             2/13 ≈ 0.154                   11/13 ≈ 0.846
[3,4)   2           3+6+2+2 = 13           2/13 ≈ 0.154                   13/13 = 1.000
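The points of an ogive for this table can be sketched as follows. The slides do not say where along each bin the cumulative value is plotted; this sketch assumes the common convention of plotting it at the upper edge of the bin, starting from 0 at the lowest edge.

```python
from itertools import accumulate

edges = [0, 1, 2, 3, 4]      # edges of the bins [0,1), [1,2), [2,3), [3,4)
frequencies = [3, 6, 2, 2]

# Assumed convention: cumulative frequency is 0 at the lowest edge and is
# plotted at the upper edge of each bin.
cumulative = [0] + list(accumulate(frequencies))
ogive = list(zip(edges, cumulative))
print(ogive)

# Dividing by n gives the points of the relative cumulative frequency ogive.
n = sum(frequencies)
relative_ogive = [(x, c / n) for x, c in ogive]
```

Joining consecutive points with line segments gives a curve that never decreases and ends at n (or at 1 for the relative version).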
58
Time-series Data
Definition: Data about a particular variable collected over a period of
time is called time-series data.
Example: Closing prices of IBM stock since Jan. 1, 2008.
59
Bad Graphical Representation
of Data
Problem: Graphs can give an incomplete picture or even a
misrepresentation of the sample (data).
Section 2.4
60
The Scale Problem
The number of bachelor’s degrees in engineering for 1999-2003 is
given in the following table:
Year   Number of Degrees
1999   62,372
2000   63,731
2001   65,113
2002   67,301
2003   70,949
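The effect of the vertical scale can be made concrete by comparing bar heights under two baselines. The baseline of 60,000 used below is a hypothetical value picked for this sketch; the slides do not state where the axis of the misleading chart starts.

```python
# Bachelor's degrees in engineering, from the table above.
degrees = {1999: 62372, 2000: 63731, 2001: 65113, 2002: 67301, 2003: 70949}

def bar_height_ratio(first, last, baseline=0):
    """Ratio of the drawn bar heights when the vertical axis starts at `baseline`."""
    return (last - baseline) / (first - baseline)

# Axis starting at 0: the 2003 bar is about 1.14 times as tall as the 1999 bar.
honest = bar_height_ratio(degrees[1999], degrees[2003])

# Hypothetical axis starting at 60,000: the same data now look like
# a more than fourfold increase.
truncated = bar_height_ratio(degrees[1999], degrees[2003], baseline=60000)

print(f"{honest:.2f}  {truncated:.2f}")
```

The data are identical in both cases; only the choice of baseline changes the visual impression.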
61
Misleading Bar Chart
62