Transcript STT 315

STT 315
This lecture is based on Chapter 2 of the textbook.
Acknowledgement: Author is thankful to Dr. Ashok Sinha, Dr. Jennifer Kaplan
and Dr. Parthanil Roy for allowing him to use/edit some of their slides.
1
Topic of this chapter
• These materials can be read from Sections 2.1-2.5 of Chapter 2 of the textbook.
• We shall first cover some descriptive statistics of
qualitative variables (Ch2.1).
• Later we shall study descriptive statistics of
quantitative variables (Ch2.2-2.5).
• In descriptive statistics we summarize data
through graphs and tables.
2
How to display Qualitative Data?
• Frequency Tables
• Bar graph (or bar chart)
• Pie chart (or pie diagram)
• Pareto chart (or Pareto diagram)
3
Qualitative variables
• A qualitative (or categorical) variable usually cannot be
measured on a numerical scale; it simply records a
quality.
• Each category of a qualitative variable is also called
class or level.
For instance, the qualitative variable GENDER has two
classes, namely Male and Female.
• If we count the number of observations belonging to each
class, then this count is called the class frequency or
simply the frequency.
• The relative frequency of a class is obtained by dividing
the class frequency by the total number of observations.
4
Frequency Tables
• These are tables in which classes (categories)
are written in the left most column and the
corresponding counts are written in the second
column. Count is also known as frequency.
• Sometimes proportions (or percentages) are
also written instead of or in addition to the actual
counts. Proportion is also called relative
frequency.
5
Frequency Table: An Example
Frequency table of the number of golf balls sold
on different days of a week

Day          # of Golf Balls Sold    % of Golf Balls Sold
             (Frequency)
Monday               17                    19.54
Tuesday              13                    14.94
Wednesday            15                    17.24
Thursday             20                    22.99
Friday               22                    25.29
Total                87                   100
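The percentages in the table can be reproduced with a short Python sketch. The variable names (`sales`, `percentages`) are chosen here for illustration only; the counts are the ones in the table.

```python
# Golf-ball sales per day, taken from the frequency table.
sales = {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}

total = sum(sales.values())

# Relative frequency (in %) = class frequency / total number of observations * 100.
percentages = {day: round(100 * count / total, 2) for day, count in sales.items()}

for day, count in sales.items():
    print(f"{day:<10} {count:>3} {percentages[day]:>7.2f}")
print(f"{'Total':<10} {total:>3} {100:>7.2f}")
```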
6
Bar Charts
• A bar chart or bar graph is a chart with
rectangular bars with lengths proportional
to the values that they represent.
• The bars can be plotted vertically (more
common) or horizontally (less common).
• The percentage or relative proportions can
also be plotted instead of the actual
values.
7
Bar Chart: Golf Balls Sold
[Figure: vertical bar chart of # of golf balls sold by day. Monday 17, Tuesday 13, Wednesday 15, Thursday 20, Friday 22.]
8
Bar Chart: % Golf Balls Sold
[Figure: vertical bar chart of % of golf balls sold by day. Monday 19.54, Tuesday 14.94, Wednesday 17.24, Thursday 22.99, Friday 25.29.]
9
Pie Chart
• A pie chart (or a circle graph) is a circular
chart divided into sectors, illustrating
proportion.
• The arc length of each sector (and
consequently its central angle and area),
is proportional to the quantity it represents.
• The math is carried out based on the
following: 100% is same as 360 degrees.
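As a rough illustration of the "100% is 360 degrees" rule, the central angle of each sector can be computed as follows. This is only a sketch using the golf-ball counts from the frequency table; the names `sales` and `angles` are illustrative.

```python
# Golf-ball sales per day, taken from the frequency table.
sales = {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}

total = sum(sales.values())

# The whole circle (100%) is 360 degrees, so each class gets
# (class frequency / total) * 360 degrees.
angles = {day: 360 * count / total for day, count in sales.items()}

for day, angle in angles.items():
    print(f"{day:<10} {angle:6.1f} degrees")
```

The angles always add up to 360 degrees, which is a quick check on the arithmetic.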
10
Pie Chart: Golf Balls Sold
[Figure: pie chart of % of golf balls sold. Monday 20%, Tuesday 15%, Wednesday 17%, Thursday 23%, Friday 25%.]
11
Pie Chart: An Example
Pie Chart of English Native Speakers
12
Bar Chart vs. Pie Chart
• Bar chart is used more often to represent
the actual values while pie chart is used to
represent relative proportions (in %).
• When comparison of relative proportion is
important, pie chart is more appropriate.
• When the absolute counts or values are
more important, a bar chart should be
used.
13
Major points so far
• First step in
organizing data
– draw a picture
• Appropriate pictures
for categorical data
– Pie chart
– Bar chart
14
Pareto diagram
• Pareto diagram is a particular type of bar
diagram in which the classes are arranged
on the horizontal axis in decreasing
frequencies.
• That means in Pareto diagram the leftmost
class has the highest frequency bar,
followed by the class with next highest
frequency bar, and so on.
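The ordering used in a Pareto diagram amounts to sorting the classes by decreasing frequency. A minimal sketch, reusing the golf-ball counts from the frequency table as example data (the text-based bars are just for illustration):

```python
# Golf-ball sales per day, taken from the frequency table.
sales = {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}

# Pareto ordering: the class with the highest frequency comes first (leftmost bar).
pareto_order = sorted(sales.items(), key=lambda item: item[1], reverse=True)

for day, count in pareto_order:
    print(f"{day:<10} {'#' * count} {count}")
```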
15
The following Pareto diagram
represents the incarceration rate (per
100000 people) of various countries.
16
Displaying Quantitative Data
• Histograms
• Stem-and-Leaf Displays
• Dotplots
17
Histograms
• Histogram is a graphical representation,
showing a visual impression of the distribution of
quantitative data.
• It consists of adjacent rectangles, erected over
intervals (also known as bins or classes).
The widths of the intervals may differ.
An interval may contain a single value.
• The heights are equal to the number (frequency)
of the observations in the corresponding bins.
• Sometimes percentages (or relative frequencies)
are also represented by the heights.
18
Histogram: An Example
The heights of 31 Black Cherry trees
19
A Few Questions
• How to choose the bin size?
Let the computer decide it for you.
• What happens to the observations on the boundary of two bins?
Put them in the higher bin.
• Don’t we lose information?
Yes, we do.
20
Stem-and-Leaf Display
• Another device for presenting quantitative data
in a graphical format.
• Assists in visualizing the shape of the
distribution of the observations.
• Unlike histograms, stem-and-leaf displays retain
the original data.
• Contains two columns separated by a vertical
line. The left column contains the stems and the
right column contains the leaves.
Suppose we have the following data on weights (in
lb) of 17 school-kids:
88 47 68 76 46 106 49 63 72 64 84 66
68 75 72 81 44
21
How do they work?
Sorted data:
44 46 47 49 63 64 66 68 68 72
72 75 76 81 84 88 106
Stem | Leaf
   4 | 4 6 7 9
   5 |
   6 | 3 4 6 8 8
   7 | 2 2 5 6
   8 | 1 4 8
   9 |
  10 | 6
key: 6|3 = 63
leaf unit: 1.0
stem unit: 10.0
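The construction can be sketched in Python. This is an illustrative helper (the name `stem_and_leaf` is made up here) that assumes non-negative integer data with stem unit 10 and leaf unit 1, as in the weights example.

```python
def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers.

    Stem = tens part (value // 10), leaf = units digit (value % 10).
    """
    leaves = {}
    for v in sorted(values):
        leaves.setdefault(v // 10, []).append(v % 10)

    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines


weights = [88, 47, 68, 76, 46, 106, 49, 63, 72, 64, 84, 66, 68, 75, 72, 81, 44]
for line in stem_and_leaf(weights):
    print(line)
```

Because every leaf is kept, the original data can be read back from the display, which is the advantage over a histogram noted earlier.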
22
Dotplots
• A dotplot is a statistical chart consisting of
group of data points plotted on a simple
scale.
• They can be drawn both horizontally and
vertically.
23
Summary
• We have learnt three methods of
displaying quantitative data: histogram,
stem-and-leaf display and dotplot.
• When the data-size is small, stem-and-leaf
display and dotplot are more useful.
• When the data-size is large, histogram is
more useful.
24
Distribution of the Data-points
Three important features:
• Shape of the distribution,
• Center of the distribution,
• Spread of the distribution.
25
Shape of a Distribution: Modes
The peaks of a histogram are called modes.
A distribution is
unimodal if it has one mode,
bimodal if it has two modes,
multimodal if it has three or more
modes.
26
Unimodal, Bimodal or Multimodal?
Unimodal
Bimodal
Multimodal
27
Uniform Histogram
• A histogram that doesn’t appear to have
any mode.
• All the bars are approximately the same.
28
Shape of a Distribution: Symmetry
• If the histogram can be folded along a vertical
line through the middle and have the edges
match pretty closely, then the distribution is
symmetric.
• Otherwise, it is skewed.
29
Skewed to the left or right?
• Skewed to the left (tail is on the left)
• Skewed to the right (tail is on the right)
30
Shape of a Distribution: Outliers
• Outliers are the data-points that stand off
away from the body of the histogram.
• They are too high or too low compared to
most of the observations.
31
The following distribution is …
A. Unimodal and skewed to the left
B. Bimodal and skewed to the right
C. Bimodal and symmetric
D. Multimodal and symmetric
E. Unimodal and skewed to the right
32
Does this distribution have an
outlier?
(a) Yes, it does
(b) No, it doesn’t
33
The following distribution is …
A. Unimodal and skewed to the left
B. Bimodal and skewed to the right
C. Bimodal and symmetric
D. Multimodal and symmetric
E. Unimodal and skewed to the right
34
Numerical measures for
quantitative data
35
Center of a Distribution
• Median: The middlemost observation
when the data is sorted in increasing order
• Median can always be used as the center
of a distribution.
• Mean: The average of all data-points.
• Mean can be used as the center of a
distribution when the distribution is
symmetric.
36
What is Median?
• Median is the middlemost observation
when the data is sorted in increasing
order.
• Data: 23, 33, 12, 39, 27
• Sorted Data: 12, 23, 27, 33, 39
• Median: 27
37
What if there is an even number of
observations?
• Take the average of two middlemost
observations in that case
• Data: 23, 33, 12, 39, 27, 10
• Sorted Data: 10, 12, 23, 27, 33, 39
• Median = (23+27)/2 = 25.
38
What is the general rule?
• Suppose there are n observations.
• Sort them in increasing order.
• If n is odd then the median is the
observation in the (n+1)/2th position.
• If n is even, then the median is the
average of the observations in the (n/2)th
and (n/2 + 1)th positions.
39
When n is odd
• Data: 23, 33, 12, 39, 27
• n = 5 (odd)
• Sorted Data: 12, 23, 27, 33, 39
• Median = observation in the (5+1)/2th
position
= observation in the 3rd position
= 27.
40
When n is even
• Data: 23, 33, 12, 39, 27, 10
• n = 6 (even)
• Sorted Data: 10, 12, 23, 27, 33, 39
• Median = average of the observations in the
(6/2)th and (6/2 +1)th positions
= average of the observations in the 3rd
and 4th positions
= (23+27)/2
= 25.
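The general rule can be written as a small Python function. This is a sketch for illustration (Python's standard library also provides `statistics.median`); note that the 1-based positions in the rule become 0-based list indices in the code.

```python
def median(data):
    """Median by the general rule: sort, then take the middle value(s)."""
    values = sorted(data)
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th observation (1-based) is index n // 2 (0-based).
        return values[mid]
    # n even: average of the (n/2)-th and (n/2 + 1)-th observations (1-based).
    return (values[mid - 1] + values[mid]) / 2


print(median([23, 33, 12, 39, 27]))      # odd n
print(median([23, 33, 12, 39, 27, 10]))  # even n
```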
41
What is mean?
• Mean is the average of all the observations (i.e.,
add up all the values and divide by the number
of values).
• If an observation repeats, we add it the number
of times it repeats when we calculate the
average.
• Mean can be used as the center of a distribution
when the distribution is symmetric.
Data: 10, 13, 18, 22, 29
Mean = (10 + 13 + 18 + 22 + 29)/5 = 18.40
42
Mean vs. Median
• Data: 10, 13, 18, 22, 29
Without the outlier:
• Mean = 18.40
Median = 18
• Data: 10, 13, 18, 22, 29, 68
With the outlier:
• Mean = 26.67
Median = 20
Conclusion: Mean is more outlier-sensitive
compared to the median.
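The comparison can be checked with Python's standard `statistics` module. The variable names are illustrative; the data are the ones on this slide.

```python
from statistics import mean, median

data = [10, 13, 18, 22, 29]
with_outlier = data + [68]

# Adding one extreme value moves the mean much more than the median.
mean_shift = mean(with_outlier) - mean(data)
median_shift = median(with_outlier) - median(data)

print(f"mean:   {mean(data):.2f} -> {mean(with_outlier):.2f} (shift {mean_shift:.2f})")
print(f"median: {median(data):.2f} -> {median(with_outlier):.2f} (shift {median_shift:.2f})")
```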
43
Mean vs. Median
• Mean is more outlier-sensitive compared to
median.
• For a symmetric distribution, mean = median.
Thus mean is more useful as the center of a
distribution when the distribution is symmetric.
But median can always be used as the center
of a distribution.
• For a right-skewed distribution, mean > median.
• For a left-skewed distribution, mean < median.
Learn to use TI 83/84 Plus to compute mean and median.
44
TI 83/84 Plus commands
• To enter the data:
– Press [STAT].
– Under EDIT select 1: Edit and press ENTER.
– Columns with names L1, L2, etc. will appear.
– Type each data value under the column; each data
entry will be followed by ENTER.
• To clear data:
– Pressing CLEAR will clear the particular data.
– To clear all data from all columns press [2nd] & + and
then choose 4: ClrAllLists.
45
TI 83/84 Plus commands
46
Effect of Linear Transformation
• Suppose every observation is multiplied by a
fixed constant. Then
median of transformed observations is the median of
the original observations times that same constant.
mean of transformed observations is the mean of the
original observations times that same constant.
Data: 10, 13, 18, 22, 29
Mean = 18.40.
Median = 18.
Suppose transformed data = (-3)*original data.
So transformed data: -30, -39, -54, -66, -87
Mean = (-3)*18.40 = -55.20.
Median = (-3)*18 = -54.
47
Effect of Linear Transformation
• Suppose a fixed constant is added to (or
subtracted from) each observation. Then
median of transformed observations is the median of
the original observations plus (or minus) that same
constant.
mean of transformed observations is the mean of the
original observations plus (or minus) that same
constant.
Data: 10, 13, 18, 22, 29
Mean = 18.40.
Median = 18.
Suppose transformed data = original data + 2.5.
Hence transformed data: 12.5, 15.5, 20.5, 24.5, 31.5
Mean = 18.40 + 2.5 = 20.90. Median = 18 + 2.5 = 20.50.
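Both transformation rules for the center can be verified numerically. A minimal sketch with Python's `statistics` module, using the same data and the same constants (-3 and 2.5) as the two slides above:

```python
from statistics import mean, median

data = [10, 13, 18, 22, 29]

scaled = [-3 * x for x in data]    # multiply every observation by -3
shifted = [x + 2.5 for x in data]  # add 2.5 to every observation

# Multiplying by a constant multiplies the mean and the median by that constant.
print(mean(scaled), -3 * mean(data))
print(median(scaled), -3 * median(data))

# Adding a constant adds that constant to the mean and the median.
print(mean(shifted), mean(data) + 2.5)
print(median(shifted), median(data) + 2.5)
```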
48
Spread of a Distribution
Are the values concentrated around the center of
the distribution, or are they spread out?
Range,
Interquartile Range,
Variance,
Standard Deviation.
Note: Variance and standard deviation are more
appropriate when the distribution is symmetric.
49
Range
• Range of the data is defined as the difference
between the maximum and the minimum values.
• Data: 23, 21, 67, 44, 51, 12, 35.
Range = maximum – minimum = 67 – 12 = 55.
• Disadvantage: A single extreme value can make
it very large, giving a value that does not really
represent the data overall. On the other hand, it is
not affected at all if some observation changes in
the middle.
50
Interquartile Range (IQR)
• What is IQR?
IQR = Third Quartile (Q3) – First Quartile (Q1).
• What are quartiles?
Recall: Median divides the data into 2 equal
halves.
The first quartile, median and the third
quartile divide the data into 4 roughly
equal parts.
51
Quartiles
• The first quartile (Q1, lower quartile) is that value
which is larger than 25% of observations, but
smaller than 75% of observations.
• The second quartile (Q2) is the median, which is
larger than 50% of observations, but smaller
than 50% of observations.
• The third quartile (Q3, upper quartile) is that
value which is larger than 75% of observations,
but smaller than 25% of observations.
• In general, Q1 ≤ Q2 (= median) ≤ Q3.
• How to compute the quartiles?
We shall use TI 83/84 Plus.
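For readers without a calculator at hand, here is a sketch of one common quartile convention: Q1 is the median of the lower half of the sorted data and Q3 is the median of the upper half, leaving the overall median out of both halves when n is odd. It is an assumption that this matches the calculator's output; other software may use different rules and report slightly different quartiles. The function name `quartiles` is illustrative.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the 'median of each half' convention.

    Assumes at least two observations. When n is odd, the overall median
    is excluded from both halves.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 == 1 else values[half:]
    return median(lower), median(values), median(upper)


q1, q2, q3 = quartiles([10, 13, 18, 22, 29])
print(q1, q2, q3, "IQR =", q3 - q1)
```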
52
IQR vs. Range
• IQR is a better summary of the spread of a
distribution than the range because it uses
some information about the entire data,
whereas the range only uses information on
the extreme values of the data.
• IQR is less outlier-sensitive than range.
53
Outlier-sensitivity
• Data: 10, 13, 17, 21, 28, 32
Without the outlier
• IQR = 28 – 13 = 15
Range = 22
• Data: 10, 13, 17, 21, 28, 32, 59
With the outlier
• IQR = 32 – 13 = 19
Range = 49
Conclusion: IQR is less outlier-sensitive than range.
54
Variance and Standard Deviation
• The sample variance ($s^2$) is defined as:
$$ s^2 = \frac{1}{n-1}\left[ (x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2 \right]. $$
• Subtract the mean from each value, square each
difference, add up the squares, divide by one
fewer than the sample size.
• The sample standard deviation (s) is the
positive square root of the sample variance, i.e.
$$ s = \sqrt{s^2}. $$
55
Variance and Standard Deviation
• The larger the variance (and standard
deviation), the more dispersed the
observations are around the mean.
• The unit of variance is square of the unit of
the original data,
whereas standard deviation has the same
unit as the original data.
• Both variance and standard deviation are
more appropriate for symmetric
distributions.
56
Standard Deviation: An Example
Data: 3, 12, 8, 9, 3 (n=5 in this case)
Mean = (3+12+8+9+3)/5 = 35/5 =7.
Data    Deviation from mean    Squared deviation
--------------------------------------------------
  3     3 – 7 = -4             (-4) x (-4) = 16
 12     12 – 7 = 5             5 x 5 = 25
  8     8 – 7 = 1              1 x 1 = 1
  9     9 – 7 = 2              2 x 2 = 4
  3     3 – 7 = -4             (-4) x (-4) = 16
--------------------------------------------------
                               Total = 62
Now divide by n-1 = 4: s² = 62/4 = 15.50, so s = √15.5 = 3.94.
Answer: The standard deviation in this example is 3.94
and the variance is 15.50.
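The same steps (subtract the mean, square, add up, divide by n - 1, take the square root) can be followed in a short Python sketch. The variable names are illustrative.

```python
import math

data = [3, 12, 8, 9, 3]
n = len(data)

mean = sum(data) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: divide the total by one fewer than the sample size.
variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)

print(f"mean = {mean}, total of squares = {sum(squared_deviations)}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```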
57
Effect of Linear Transformation
• Suppose every observation is multiplied by a fixed
constant. Then
range/IQR/standard deviation of transformed observations is
the range/IQR/standard deviation of the original observations
times the absolute value of that same constant.
variance of transformed observations is the variance of the
original observations times the square of that same constant.
Temperature data (in °F): 10, 13, 18, 22, 29
Range = 19 °F, IQR = 14 °F, s = 7.50 °F, s² = 56.30 °F².
Suppose transformed data = (-3)*original data.
So transformed data (in °F): -30, -39, -54, -66, -87
Range = |-3|*19 = 57 °F, IQR = |-3|*14 = 42 °F,
s = |-3|*7.50 = 22.5 °F, s² = (-3)²*56.30 = 506.70 °F².
58
Effect of Linear Transformation
• Suppose a fixed constant is added to (or subtracted
from) each observation. Then
range/IQR/standard deviation/variance of
transformed observations remains the same as
that of the original observations.
Temperature data (in °F): 10, 13, 18, 22, 29
Range = 19 °F, IQR = 14 °F, s = 7.50 °F, s² = 56.30 °F².
Suppose transformed data = original data + 2.5.
Hence transformed data (in °F): 12.5, 15.5, 20.5, 24.5, 31.5
Range = 19 °F, IQR = 14 °F, s = 7.50 °F, s² = 56.30 °F².
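The effect of both kinds of transformation on the spread can be checked numerically. The sketch below uses Python's `statistics.stdev` and `statistics.variance` (both divide by n - 1); the helper `data_range` is a name introduced here for illustration.

```python
from statistics import stdev, variance


def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


data = [10, 13, 18, 22, 29]
scaled = [-3 * x for x in data]    # multiply every observation by -3
shifted = [x + 2.5 for x in data]  # add 2.5 to every observation

# Multiplying by c scales range and standard deviation by |c|, variance by c squared.
print(data_range(scaled), abs(-3) * data_range(data))
print(stdev(scaled), abs(-3) * stdev(data))
print(variance(scaled), (-3) ** 2 * variance(data))

# Adding a constant leaves every measure of spread unchanged.
print(data_range(shifted), data_range(data))
print(stdev(shifted), stdev(data))
```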
59
Empirical rule
&
Chebyshev’s rule
60
Empirical rule
For an approximately symmetric unimodal (bell-shaped/mound-shaped) distribution:
• Approximately 68% of observations fall within 1
standard deviation of mean.
• Approximately 95% of observations fall within 2
standard deviations of mean.
• Approximately 99.7% of observations fall within 3
standard deviations of mean.
61
Empirical rule
62
Empirical rule
63
Chebyshev’s rule
For any distribution, at least $1 - \frac{1}{k^2}$ of the observations
will fall within k standard deviations of the mean, where $k \geq 1$.
• Chebyshev’s rule is for any distribution, whereas the
empirical rule is valid only for approximately symmetric
unimodal (mound-shaped) distribution.
• If k=1, not much information is available from
Chebyshev’s rule.
• According to Chebyshev at least 75% observations fall
within 2 standard deviations of mean.
• According to Chebyshev at least 88.9% of observations
fall within 3 standard deviations of mean.
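The bounds quoted above follow directly from the formula. A minimal sketch (the function name `chebyshev_lower_bound` is illustrative):

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("Chebyshev's rule is stated for k >= 1")
    return 1 - 1 / k**2


for k in (1, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.1f}%")
```

For k = 1 the bound is 0%, which is why the rule gives no useful information in that case.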
64
Box plot
65
Box Plot
A box plot is a graphical representation of the
following 5-number summary:
1. Minimum Value,
2. Lower Quartile,
3. Median (the middle value),
4. Upper Quartile,
5. Maximum Value.
NOTE: Data must be ordered from lowest
value to highest value before finding
the 5 number summary.
66
Box Plots
• Are a representation of the
five number summary
(Minimum, Maximum,
Median, Lower Quartile,
Upper Quartile).
• Half the data are in the box
• One-quarter of the data are in
each whisker.
• If one part of the plot is long,
the data are skewed.
• Box-plot is very useful for
comparing distributions
• This box plot indicates data
are skewed to the left.
67
Box Plot
• Box Plot is a pictorial representation of the 5-number
summary.
68
Outliers
• Any observation farther than 1.5
times IQR from the closest
boundary of the box is an
outlier.
• If it is farther than 3 times IQR, it
is an extreme outlier, otherwise
a mild outlier.
• One can also indicate the
outliers in a box plot, by drawing
the whiskers only up to 1.5 times
IQR on both sides, and indicating
outliers with stars or crosses (or
other symbols).
69
An example
Suppose
min = 2, Q1 = 18, median = 20, Q3 = 22, max = 35.
Which of the following observations are
outliers?
A. 10
B. 15
C. 25
D. 30
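The 1.5 times IQR rule can be applied mechanically. The sketch below is illustrative (the function name `classify_observation` is made up here); it measures how far a value lies beyond the closest boundary of the box and compares that distance with 1.5 and 3 times the IQR.

```python
def classify_observation(value, q1, q3):
    """Classify a value as 'not an outlier', 'mild outlier' or 'extreme outlier'."""
    iqr = q3 - q1
    # Distance from the closest boundary of the box (0 if the value is inside the box).
    if value < q1:
        distance = q1 - value
    elif value > q3:
        distance = value - q3
    else:
        distance = 0
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


# Q1 = 18 and Q3 = 22 give IQR = 4, so the cut-offs are 18 - 6 = 12 and 22 + 6 = 28.
for x in (10, 15, 25, 30):
    print(x, classify_observation(x, q1=18, q3=22))
```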
70
Histogram vs. Box plot
• Both histogram and box plot capture the
symmetry or skewness of distributions.
• Box plot cannot indicate the modality of
the data.
• Box plot is much better in finding
outliers.
• The shape of histogram depends to
some extent on the choice of bins.
71
Comparing Distributions
We can compare the distributions of
various data-sets using
• Box Plots (or the 5-Number Summary),
• Histograms.
We shall first compare distributions using
box plots.
Which type of car has the largest median Time to
accelerate?
A. upscale
B. sports
C. small
D. large
E. family
73
Which type of car has the smallest median
time value?
A. upscale
B. sports
C. small
D. Large
E. Luxury
74
Which type of car always takes less than 3.6
seconds to accelerate?
A. upscale
B. sports
C. small
D. Large
E. Luxury
75
Which type of car has the smallest IQR
for Time to accelerate?
A. upscale
B. sports
C. small
D. Large
E. Luxury
76
What is the shape of the distribution of
acceleration times for luxury cars?
A. Left skewed
B. Right skewed
C. Roughly symmetric
D. Cannot be determined from the information given.
77
What percent of luxury cars accelerate to 30
mph in less than 3.5 seconds?
A. Roughly 25%
B. Exactly 37.5%
C. Roughly 50%
D. Roughly 75%
E. Cannot be determined from the information given
78
What percent of family cars
accelerate to 30 mph in less than
3.5 seconds?
A. Less than 25%
B. More than 50%
C. Less than 50%
D. Exactly 75%
E. None of the above
79
Comparing Distributions
Use of Histograms
Which data have more variability?
[Figure: histograms A and B; vertical axis FREQUENCY (0 to 6), horizontal axis SCORE (0 to 12).]
A. Graph A
B. Graph B
C. Both have the same variability
81
A
Which data have more
variability?
B
A. Graph A
B. Graph B
C. Both have the same variability
82
A
Which data have a higher
median?
B
A. Graph A
B. Graph B
C. Both have the same median
83
Which data have more variability?
[Figure: histograms A and B; vertical axis FREQUENCY (0 to 6), horizontal axis SCORE (0 to 12).]
A. Graph A
B. Graph B
C. Roughly, both have the same variability
84
z-score
85
How to compare apples with oranges?
• A college admissions committee is looking at
the files of two candidates, one with a total
SAT score of 1500 and another with an ACT
score of 22. Which candidate scored better?
• How do we compare things when they are
measured on different scales?
• We need to standardize the values.
86
How to standardize?
• Subtract mean from the value and then divide
this difference by the standard deviation.
• The standardized value is the z-score:
$$ z = \frac{\text{value} - \text{mean}}{\text{std. dev.}} $$
• z-scores are free of units.
87
z-scores: An Example
Data: 4, 3, 10, 12, 8, 9, 3 (n=7 in this case)
Mean = (4+3+10+12+8+9+3)/7 = 49/7 =7.
Standard Deviation = 3.65.
Original Value    z-score
------------------------------------------
  4               (4 – 7)/3.65 = -0.82
  3               (3 – 7)/3.65 = -1.10
 10               (10 – 7)/3.65 = 0.82
 12               (12 – 7)/3.65 = 1.37
  8               (8 – 7)/3.65 = 0.27
  9               (9 – 7)/3.65 = 0.55
  3               (3 – 7)/3.65 = -1.10
------------------------------------------
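The z-scores in this example can be reproduced with Python's `statistics` module, where `stdev` is the sample standard deviation (dividing by n - 1). The variable names are illustrative.

```python
from statistics import mean, stdev

data = [4, 3, 10, 12, 8, 9, 3]

m = mean(data)
s = stdev(data)  # sample standard deviation (divides by n - 1)

# z-score = (value - mean) / standard deviation
z_scores = [(x - m) / s for x in data]

print(f"mean = {m}, standard deviation = {s:.2f}")
for x, z in zip(data, z_scores):
    print(f"{x:>3}  z = {z:.2f}")
```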
88
Interpretation of z-scores
• The z-scores measure the distance of the data values
from the mean in the standard deviation scale.
• A z-score of 1 means that data value is 1 standard
deviation above the mean.
• A z-score of -1.2 means that data value is 1.2
standard deviations below the mean.
• Regardless of the direction, the further a data value is
from the mean, the more unusual it is.
• A z-score of -1.3 is more unusual than a z-score of
1.2.
89
How to use z-scores?
• A college admissions committee is looking at the files
of two candidates, one with a total SAT score of 1500
and another with an ACT score of 22. Which
candidate scored better?
• SAT score mean = 1600, std dev = 500.
• ACT score mean = 23, std dev = 6.
• SAT score 1500 has z-score = (1500-1600)/500 = -0.2.
• ACT score 22 has z-score = (22-23)/6 = -0.17.
• ACT score 22 is better than SAT score 1500.
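The comparison takes only a few lines of Python. The helper `z_score` is a name introduced here for illustration; the means and standard deviations are the ones given on this slide.

```python
def z_score(value, mean, std_dev):
    """Standardize a value: how many standard deviations it lies from the mean."""
    return (value - mean) / std_dev


z_sat = z_score(1500, mean=1600, std_dev=500)
z_act = z_score(22, mean=23, std_dev=6)

print(f"SAT 1500: z = {z_sat:.2f}")
print(f"ACT 22:   z = {z_act:.2f}")

# The larger z-score is the better score relative to its own scale.
better = "ACT" if z_act > z_sat else "SAT"
print("Better relative score:", better)
```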
90
Which is more unusual?
A. A 58 in tall woman
z-score = (58-63.6)/2.5 = -2.24.
B. A 64 in tall man
z-score = (64-69)/2.8 = -1.79.
C. They are the same.
Heights of adult women have
• mean of 63.6 in.
• std. dev. of 2.5 in.
Heights of adult men have
• mean of 69.0 in.
• std. dev. of 2.8 in.
91
Using z-scores to solve problems
An example using height data and U.S. Marine and
Army height requirements
Question: Are the height restrictions set up by the
U.S. Army and U.S. Marine more restrictive for
men or women or are they roughly the same?
92
Data from a National Health Survey
Heights of adult women have
– mean of 63.6 in.
– standard deviation of 2.5 in.
Heights of adult men have
– mean of 69.0 in.
– standard deviation of 2.8 in.
Height Restrictions
                     Men Minimum    Women Minimum
U.S. Army            60 in          58 in
U.S. Marine Corps    64 in          58 in
93
Heights of adult men have
– mean of 69.0 in.
– standard deviation of 2.8 in.
Heights of adult women have
– mean of 63.6 in.
– standard deviation of 2.5 in.

               Men minimum                Women minimum
U.S. Army      60 in, z-score = -3.21     58 in, z-score = -2.24
               Less restrictive           More restrictive
U.S. Marine    64 in, z-score = -1.79     58 in, z-score = -2.24
               More restrictive           Less restrictive
94
Effect of Standardization
• Standardization into z-scores does not change
the shape of the histogram.
• Standardization into z-scores changes the
center of the distribution by making the mean
0.
• Standardization into z-scores changes the
spread of the distribution by making the
standard deviation 1.
95
Z-score and Empirical Rule
When data are bell shaped, the z-scores of the data
values follow the empirical rule.
96
Outlier detection with z-score
• The Empirical Rule tells us that if the data have a mound-shaped
distribution, then almost all the data-points are within
plus or minus 3 standard deviations of the mean. So an
observation whose z-score has absolute value larger than 3 can be
considered an outlier.
97
2004 Olympics
Women’s Heptathlon
Austra Skujyte (Lithuania)
Shot Put = 16.40m,
Long Jump = 6.30m.
Carolina Kluft (Sweden)
Shot Put = 14.77m,
Long Jump = 6.78m.

(all contestants)    Shot Put    Long Jump
Mean                 13.29m      6.16m
Std.Dev.             1.24m       0.23m
n                    28          26
98
Which performance was better?
A. Skujyte’s shot put,
z-score of Skujyte’s shot put = 2.51.
B. Kluft’s long jump,
z-score of Kluft’s long jump = 2.70.
C. Both were the same.

(all contestants)    Shot Put    Long Jump
Mean                 13.29m      6.16m
Std.Dev.             1.24m       0.23m
n                    28          26
99
Based on shot put and long jump whose
performance was better?
A. Skujyte’s,
z-score: shot put = 2.51, long jump = 0.61.
Total z-score = (2.51+0.61) = 3.12.
B. Kluft’s,
z-score: shot put = 1.19, long jump = 2.70.
Total z-score = (1.19+2.70) = 3.89.
C. Both were the same.
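The combined comparison can be sketched in Python: standardize each result with the mean and standard deviation of its own event, then add the z-scores for each athlete. The dictionary layout and names below are illustrative; the numbers are those given for the 2004 heptathlon.

```python
def z_score(value, mean, std_dev):
    """Standardize a value: how many standard deviations it lies from the mean."""
    return (value - mean) / std_dev


# (mean, standard deviation) over all contestants, per event, in metres.
events = {"shot put": (13.29, 1.24), "long jump": (6.16, 0.23)}

results = {
    "Skujyte": {"shot put": 16.40, "long jump": 6.30},
    "Kluft": {"shot put": 14.77, "long jump": 6.78},
}

totals = {}
for athlete, marks in results.items():
    zs = {event: z_score(mark, *events[event]) for event, mark in marks.items()}
    totals[athlete] = sum(zs.values())
    detail = ", ".join(f"{event} = {z:.2f}" for event, z in zs.items())
    print(f"{athlete}: {detail}, total = {totals[athlete]:.2f}")
```

Adding z-scores works here because both events are "larger is better"; for an event where a smaller value is better (such as a race time), the sign of the z-score would have to be flipped first.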
100
Scatterplot
101
Example: Height and Weight
• How is weight of an individual related to
his/her height?
• Typically, one can expect a taller person to be
heavier.
• Is it supported by the data?
• If yes, how to determine this “association”?
102
What is a scatterplot?
• A scatterplot is a diagram which is used to
display values of two quantitative variables
from a data-set.
• The data is displayed as a collection of points,
each having the value of one variable
determining the position on the horizontal
axis and the value of the other variable
determining the position on the vertical axis.
103
Example 1: Scatterplot of height and
weight
104
Example 2: Scatterplot of hours watching
TV and test scores
105
Looking at Scatterplots
We look at the following features of a scatterplot:
• Direction (positive or negative)
• Form (linear, curved)
• Strength (of the relationship)
• Unusual Features.
When we describe
histograms we mention
• Shape
• Center
• Spread
• Outliers
106
Asking Questions on a Scatterplot
• Are test scores higher or lower when the TV watching
is longer? Direction (positive or negative association).
• Does the cloud of points seem to show a linear
pattern, a curved pattern, or no pattern at all? Form.
• If there is a pattern, how strong does the relationship
look? Strength.
• Are there any unusual features? (2 or more groups or
outliers).
107
Positive and Negative Associations
• Positive association means that for most of the data-points,
a higher value of one variable corresponds to
a higher value of the other variable and a lower
value of one variable corresponds to a lower value of
the other variable.
• Negative association means that for most of the data-points,
a higher value of one variable corresponds to
a lower value of the other variable and vice-versa.
108
This association is:
A. positive
B. negative.
109
This association is:
A. positive
B. negative.
110
Linear Scatterplot
• Unless we see a curve, we shall call the scatterplot
linear.
111
Curved Scatterplot
• When the plot shows a clear curved pattern,
we shall call it a curved scatterplot.
112
Which one has stronger linear association?
A. left one,
B. right one.
Because, in the right
graph the points are
closer to a straight line.
113
Which one has stronger linear
association?
A. left one,
B. right one.
Hard to say.
114
Unusual Feature: Presence of Outlier
• This scatterplot clearly has an outlier.
115
Unusual Feature: Two Subgroups
• This scatterplot clearly has two subgroups.
116
Time series plot
(Time plot)
117
Time plot
• A time series is a collection of observations made
sequentially through time.
• In a time plot (or time series plot) the time series data are
plotted (on the vertical axis) against time (on the horizontal
axis), and the points are connected with straight lines.
• From a time series plot one can see the movement of the
observed values over time and find patterns such as:
– Trend
– Seasonality
– Business cycle (for business data)
– Unusual features
118
Example: US population
[Figure: time series plot of US population (vertical axis, 0 to 350,000,000) against year t (horizontal axis, 1800 to 2000).]
119
Example: US accidental deaths
[Figure: time series plot of deaths (vertical axis, 7,000 to 11,000) against Index (horizontal axis, 1 to 70).]
120
Example: Australian red wine sales
121