Summary Measures

Chapter 3, Numerical Descriptive Measures
• Data analysis is objective
– Should report the summary measures that best
meet the assumptions about the data set
• Data interpretation is subjective
– Should be done in a fair, neutral, and clear manner
Summary Measures
Describing Data Numerically
• Central Tendency
– Arithmetic Mean
– Median
– Mode
– Geometric Mean
• Quartiles
• Variation
– Range
– Interquartile Range
– Variance
– Standard Deviation
– Coefficient of Variation
• Shape
– Skewness
Arithmetic Mean
• The arithmetic mean (mean) is the most common measure of
central tendency
• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
where n is the sample size and X_1, ..., X_n are the observed values
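A minimal Python sketch of the mean and of its sensitivity to an extreme value. The function name and the two small data sets are made up for illustration.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [1, 2, 3, 4, 5]            # illustrative sample
with_outlier = [1, 2, 3, 4, 10]   # same sample, largest value replaced by an outlier

print(arithmetic_mean(data))          # 3.0
print(arithmetic_mean(with_outlier))  # 4.0
```

Changing a single observation moves the mean from 3.0 to 4.0, which is the outlier effect noted above.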
Geometric Mean
• Geometric mean
– Used to measure the rate of change of a variable over time
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
• Geometric mean rate of return
– Measures the status of an investment over time
$$\bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1$$
– Where R_i is the rate of return in time period i
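The two geometric-mean formulas can be sketched in Python as follows. The function names and the sample returns are hypothetical, and `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.prod(values) ** (1 / len(values))


def geometric_mean_return(returns):
    """[(1 + R1) * (1 + R2) * ... * (1 + Rn)] ** (1/n) - 1."""
    if not returns:
        raise ValueError("at least one period return is required")
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1


# Hypothetical investment: lose 50% in period 1, gain 100% in period 2.
returns = [-0.5, 1.0]
print(sum(returns) / len(returns))     # arithmetic mean: 0.25
print(geometric_mean_return(returns))  # 0.0
print(geometric_mean([2, 8]))          # 4.0
```

The investment ends exactly where it started, which the geometric mean return of 0 reflects; the arithmetic mean of 25% per period would overstate the result.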
Median: Position and Value
• In an ordered array, the median is the “middle”
number (50% above, 50% below)
• The location (position) of the median:
Median position = (n + 1)/2 position in the ordered data
• The value of median is NOT affected by
extreme values
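A short Python sketch of the position rule. One assumption is not stated on the slide: when n is even the position (n + 1)/2 falls halfway between two observations, and the sketch averages them, which is the usual convention.

```python
def median(values):
    """Value at position (n + 1) / 2 in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = (n + 1) / 2               # 1-based position
    lower = ordered[int(position) - 1]   # observation at or just below the position
    if position.is_integer():
        return lower
    upper = ordered[int(position)]       # observation just above the position
    return (lower + upper) / 2           # assumed convention for even n


print(median([1, 2, 3, 4, 5]))    # 3
print(median([1, 2, 3, 4, 100]))  # 3, unchanged by the extreme value
print(median([1, 2, 3, 4]))       # 2.5
```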
Mode
• A measure of central tendency
• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes
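The "no mode" and "several modes" cases can be illustrated with a small Python sketch. The function name is hypothetical, and treating a data set in which every value occurs exactly once as having no mode is an assumption of this sketch.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often (empty list if nothing repeats)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 2, 2, 3]))             # [2]
print(modes([1, 1, 2, 2, 3]))          # [1, 2]  (several modes)
print(modes([1, 2, 3]))                # []      (no mode)
print(modes(["red", "blue", "red"]))   # ['red'] (categorical data)
```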
Quartiles
• Quartiles split the ranked data into 4 segments
with an equal number of values per segment
• Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = 2(n+1)/4 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
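A Python sketch of the position rule. The slide does not say what to do when a position is not a whole number; this sketch assumes linear interpolation between the two neighbouring values, which is one of several conventions in use. Function names and data are illustrative.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    # Assumed convention: interpolate linearly between the two neighbours.
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def quartiles(values):
    """Q1, Q2, Q3 taken at positions (n+1)/4, 2(n+1)/4, 3(n+1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three values for these positions to exist")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


data = [11, 12, 13, 16, 16, 17, 18, 21, 22, 24, 27]  # n = 11, positions 3, 6, 9
print(quartiles(data))  # (13, 17, 22)
```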
Measures of Variation
• Measures of variation give information on the spread or variability of the data values.
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
[Figure: two distributions with the same center but different variation]
Range and Interquartile Range
• Range
– Simplest measure of variation
– Difference between the largest and the smallest observations:
Range = Xlargest – Xsmallest
– Ignores the way in which data are distributed
– Sensitive to outliers
• Interquartile Range
– Eliminate some high- and low-valued observations and calculate
the range from the remaining values
– Interquartile range = 3rd quartile – 1st quartile
= Q3 – Q1
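A Python sketch comparing the two measures on made-up data. It relies on `statistics.quantiles` (Python 3.8+) with `method="exclusive"`, which places the quartiles at positions based on n + 1 and interpolates when a position is fractional; the function names are hypothetical.

```python
import statistics


def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 - Q1, with quartiles located via the (n + 1)-based positions."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1


data = [11, 12, 13, 16, 16, 17, 18, 21, 22, 24, 27]
with_outlier = [11, 12, 13, 16, 16, 17, 18, 21, 22, 24, 270]

print(data_range(data), interquartile_range(data))                  # range 16, IQR 9
print(data_range(with_outlier), interquartile_range(with_outlier))  # range 259, IQR 9
```

The single extreme value changes the range from 16 to 259 while the interquartile range stays at 9.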
Variance
• Average (approximately) of squared
deviations of values from the mean
– Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Where
$\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
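The sample variance can be computed directly from its definition. A minimal sketch with an illustrative data set:

```python
def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]          # mean 5, squared deviations sum to 32
print(round(sample_variance(data), 3))   # 4.571  (= 32 / 7)
```

Dividing by n - 1 rather than n is why the slide calls the variance only "approximately" an average of the squared deviations.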
Standard Deviation
• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data
• It is a measure of the “average” spread around the mean
• Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
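A short Python sketch of the sample standard deviation as the square root of the sample variance. The heights are invented to make the point about units.

```python
import math


def sample_std_dev(values):
    """Square root of the sample variance (divisor n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("sample standard deviation needs at least two values")
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)


heights_cm = [160, 170, 180]        # illustrative data, in centimetres
print(sample_std_dev(heights_cm))   # 10.0, also in centimetres
```

The variance of these heights is 100 in squared centimetres; taking the square root brings the result back to the original unit.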
Coefficient of Variation
• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Can be used to compare two or more sets of data
measured in different units
$$CV = \left( \frac{S}{\bar{X}} \right) \cdot 100\%$$
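As an illustration of comparing data measured in different units, here is a minimal Python sketch using the standard `statistics` module. The function name and both data sets are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean * 100


heights_cm = [160, 170, 180]   # S = 10 cm, mean = 170 cm
weights_kg = [60, 70, 80]      # S = 10 kg, mean = 70 kg

print(round(coefficient_of_variation(heights_cm), 2))  # 5.88
print(round(coefficient_of_variation(weights_kg), 2))  # 14.29
```

Both samples have a standard deviation of 10 in their own units, but relative to the mean the weights vary more than twice as much as the heights.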
Shape of a Distribution
• Describes how data are distributed
• Measures of shape
– Symmetric or skewed
• Left-Skewed: Mean < Median
• Symmetric: Mean = Median
• Right-Skewed: Median < Mean
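The mean-versus-median comparison can be sketched in Python as a rough check. This is only the rule of thumb from the slide: equal mean and median do not by themselves guarantee a symmetric distribution. The function name, the `tolerance` parameter, and the data are hypothetical.

```python
import statistics


def shape_from_mean_and_median(values, tolerance=0.0):
    """Classify shape by comparing the mean with the median (rule of thumb)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if abs(mean - median) <= tolerance:
        return "symmetric"
    return "left-skewed" if mean < median else "right-skewed"


print(shape_from_mean_and_median([1, 2, 3, 4, 5]))       # symmetric    (mean 3,  median 3)
print(shape_from_mean_and_median([1, 2, 3, 4, 20]))      # right-skewed (mean 6,  median 3)
print(shape_from_mean_and_median([1, 17, 18, 19, 20]))   # left-skewed  (mean 15, median 18)
```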
Using the Five-Number Summary to Explore the Shape
• Box-and-Whisker Plot: a graphical display of data using the 5-number summary:
Minimum, Q1, Median, Q3, Maximum
• The box and central line are centered between the endpoints if the data are symmetric around the median
[Figure: box-and-whisker plot with Min, Q1, Median, Q3, and Max marked]
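The five numbers behind a box-and-whisker plot can be computed with a few lines of Python. The sketch uses `statistics.quantiles` (Python 3.8+) with `method="exclusive"`, which locates quartiles from (n + 1)-based positions; the function name and data are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, med, q3, max(values)


data = [11, 12, 13, 16, 16, 17, 18, 21, 22, 24, 27]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)   # min 11, Q1 13, median 17, Q3 22, max 27

# Distances on each side of the median hint at the shape.
print(med - minimum, maximum - med)    # 6 below the median, 10 above
```

Here the upper whisker side is longer than the lower one, which suggests some right skew in this made-up sample.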
Distribution Shape and Box-and-Whisker Plot
[Figure: box-and-whisker plots for left-skewed, symmetric, and right-skewed distributions, each marking Q1, Q2, and Q3]
Relationship between Std. Dev. and Shape: The Empirical Rule
• If the data distribution is bell-shaped, then the interval:
– μ ± 1σ contains about 68% of the values in the population or the sample
– μ ± 2σ contains about 95% of the values in the population or the sample
– μ ± 3σ contains about 99.7% of the values in the population or the sample
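The rule can be checked empirically on simulated bell-shaped data. The sketch below draws a pseudo-random normal sample (mean 50, standard deviation 10, both arbitrary choices) and counts the share of values within k standard deviations of the sample mean; the exact shares printed depend on the seed and will only be close to 68%, 95%, and 99.7%.

```python
import random
import statistics


def share_within(values, k):
    """Fraction of values within k sample standard deviations of the sample mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for x in values if abs(x - mean) <= k * sd)
    return inside / len(values)


rng = random.Random(42)                               # fixed seed for repeatability
sample = [rng.gauss(50, 10) for _ in range(10_000)]   # simulated bell-shaped data

for k in (1, 2, 3):
    print(k, round(share_within(sample, k), 3))
```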
Population Mean and Variance
• Population mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
• Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
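A minimal Python sketch of the population formulas, treating a small made-up list as the entire population. The only difference from the sample variance is the divisor: N instead of n - 1.

```python
def population_mean(values):
    """Sum of all N population values divided by N."""
    if not values:
        raise ValueError("the population must contain at least one value")
    return sum(values) / len(values)


def population_variance(values):
    """Sum of squared deviations from mu, divided by N."""
    mu = population_mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


population = [2, 4, 4, 4, 5, 5, 7, 9]    # treated as the whole population
print(population_mean(population))       # 5.0
print(population_variance(population))   # 4.0  (32 / 8; the sample formula would give 32 / 7)
```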
Covariance and Coefficient of Correlation
• The sample covariance measures the strength of the
linear relationship between two variables (called
bivariate data)
• The sample covariance:
$$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
• Only concerned with the strength of the relationship
• No causal effect is implied
• Covariance between two random variables:
– cov(X,Y) > 0: X and Y tend to move in the same direction
– cov(X,Y) < 0: X and Y tend to move in opposite directions
– cov(X,Y) = 0: X and Y have no linear relationship (this alone does not make them independent)
• Covariance does not say anything about the relative strength of
the relationship.
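The sign cases for the sample covariance can be sketched in Python. The function name and the paired data are made up for illustration.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("sample covariance needs at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs = [1, 2, 3, 4, 5]
print(sample_covariance(xs, [2, 4, 6, 8, 10]))   # 5.0   (move in the same direction)
print(sample_covariance(xs, [10, 8, 6, 4, 2]))   # -5.0  (move in opposite directions)

# y = x**2: the covariance is 0 even though y is fully determined by x,
# so a zero covariance only rules out a linear relationship.
print(sample_covariance([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))   # 0.0
```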
• Coefficient of Correlation measures the relative strength of the
linear relationship between two variables
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$$
• Coefficient of Correlation:
– Is unit free
– Ranges between –1 (perfect negative) and 1 (perfect positive)
– The closer to –1, the stronger the negative linear relationship
– The closer to 1, the stronger the positive linear relationship
– The closer to 0, the weaker the linear relationship
– At 0 there is no linear relationship
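A Python sketch of the coefficient of correlation computed from sums of deviations, with made-up data showing the two extremes and the unit-free property. The function name is hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample coefficient of correlation r for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable does not vary")
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
print(correlation(xs, [2, 4, 6, 8, 10]))   # 1.0   (perfect positive)
print(correlation(xs, [10, 8, 6, 4, 2]))   # -1.0  (perfect negative)

# Rescaling X (for example metres to centimetres) leaves r unchanged.
print(correlation([x * 100 for x in xs], [2, 4, 6, 8, 10]))   # 1.0
```

The n - 1 divisors in cov(X, Y), S_X, and S_Y cancel, which is why the sketch can work with the raw sums of deviations.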
Correlation vs. Regression
• A scatter plot (or scatter diagram) can be used
to show the relationship between two
variables
• Correlation analysis is used to measure the
strength of the association (linear
relationship) between two variables
– Correlation is only concerned with strength of the
relationship
– No causal effect is implied with correlation