Transcript Powerpoint
Notes Unit 1
Chapters 2-5
Univariate Data
• Statistics is the science of data. A set of data
includes information about individuals. This
information is organized into different categories
or characteristics called variables. For example,
in our class survey, each one of you is an
individual represented in the data set. We
collected information about the variables gender,
height, etc…
• We are always interested in the context of
the data. That means: where did it come
from, who did we include, when was it
collected, why were we interested, what
did we collect, etc. Without context, data
is meaningless.
• After we understand the context, the next
thing we should always do is GRAPH the
data.
Graphs
• Be sure to always:
   * Title your graphs
   * Label your axes, including units of measure
   * Number your axes in a consistent and reasonable manner
Categorical Data
Categorical variables record which of
several groups or categories an individual
belongs to.
Quantitative Data
Quantitative variables take numerical
values for which it makes sense to do
arithmetic operations like adding or
averaging.
Quantitative Data
The distribution of a variable tells
us what values the variable
typically takes and how often it
takes them. It is a generalization
about the variable values.
• When describing any Quantitative distribution:
   C – Center
   U – Unusual Features
   S – Shape
   S – Spread
   &
   B – Be
   S – Specific
• Common Shapes of distributions/graphs
   * Symmetric
   * Skewed to the right
   * Skewed to the left
   * Bimodal
   * Uniform
• Once you have chosen a
shape, you choose a
measure of center and
spread based on that
shape.
If a distribution is symmetric, we use the mean
for center.
Mean: the average
formula: $\bar{x} = \frac{\sum x_i}{n}$
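As a quick check of the mean formula, here is a minimal Python sketch. The list `heights` is made-up data for illustration (it is not from the class survey), and the variable names are our own.

```python
from statistics import mean

# Hypothetical heights in inches (made-up values, not the class survey data).
heights = [62, 64, 65, 67, 72]

# Mean = sum of the observations divided by the number of observations.
x_bar = sum(heights) / len(heights)

# The standard library's mean() should agree with the hand calculation.
library_mean = mean(heights)
```

Here the observations add up to 330, and dividing by n = 5 gives a mean of 66 inches.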
If the distribution is symmetric, we
use standard deviation for spread.
Standard deviation: $s_x = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}$
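The standard deviation formula can be followed step by step in code. This is only a sketch: the data values are made up, and the variable names are our own. It assumes the sample version of the formula, which divides by n - 1.

```python
from math import sqrt
from statistics import stdev

# Hypothetical heights in inches (made-up values for illustration).
heights = [62, 64, 65, 67, 72]

n = len(heights)
x_bar = sum(heights) / n

# Step 1: square each deviation from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in heights]

# Step 2: add them up and divide by n - 1 (sample version, an assumption here).
# Step 3: take the square root.
s_x = sqrt(sum(squared_deviations) / (n - 1))

# statistics.stdev() also uses the n - 1 divisor, so it should agree.
library_s_x = stdev(heights)
```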
Measure of Center when the
distribution is not symmetric:
Median – the middle value in an ordered
list. If there are two values in the middle,
then average them.
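The rule for the median can be written as a short function. This is a sketch with a hypothetical helper name, `middle_value`, and made-up inputs.

```python
def middle_value(values):
    """Return the median: the middle value of the ordered list,
    or the average of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one value sits in the middle.
        return ordered[mid]
    # Even count: average the two values in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = middle_value([3, 1, 2])      # ordered: 1, 2, 3
even_example = middle_value([4, 1, 3, 2])  # ordered: 1, 2, 3, 4
```

Note that the data must be put in order first. Taking the middle of an unordered list is a common mistake.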
Measure Spread or Variability when
the distribution is not Symmetric
• We can also examine spread by looking at
the range of the middle 50% of the data. This
is called the:
Interquartile Range (IQR).
IQR = Q3 – Q1
We also need to talk about the 5-number
summary.
The 5-number summary is made up of the
minimum, the first quartile, Q1 (where 25%
of the data lies below this value), the median,
the third quartile, Q3 (where 75% of the data
lies below this value), and the maximum.
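Here is one way to sketch the 5-number summary and IQR in Python. Textbooks and software use slightly different rules for quartiles. This sketch assumes the "median of each half" rule (leaving the overall median out of both halves when the count is odd), so other tools may give slightly different Q1 and Q3. The function name and data are made up for illustration.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Hypothetical data set (made-up values).
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
minimum, q1, med, q3, maximum = five_number_summary(data)

# IQR = Q3 - Q1, the range of the middle 50% of the data.
iqr = q3 - q1
```

For this data the lower half is 1, 3, 4, 6 and the upper half is 8, 10, 12, 15, so Q1 = 3.5, Q3 = 11, and IQR = 7.5.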
Another Measure of Spread or
Variability
• Range – the difference between the
maximum and the minimum observations.
This is the simplest measure of spread.
We typically use this as preliminary
information or if it is the only measure of
spread we can calculate.
Another measure of spread or
variability
• Variance is (approximately) the average of the squares of
the deviations of the observations from
their mean. For a sample, the sum of the squared
deviations is divided by n – 1. It is the standard deviation
squared.
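A short sketch can confirm that the variance is the standard deviation squared. The data are made up, and the sketch assumes the sample version that divides by n - 1, to match the standard deviation formula.

```python
from statistics import stdev

# Hypothetical data (made-up values for illustration).
data = [62, 64, 65, 67, 72]

n = len(data)
x_bar = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1 (assumed here).
sample_variance = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Squaring the standard deviation should give back the same number.
stdev_squared = stdev(data) ** 2
```

Because the deviations are squared, the variance is in squared units (square inches, if the data are heights in inches). That is one reason we usually report the standard deviation instead.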
• An outlier is an individual observation in
data that falls outside the overall pattern of
the data.
Using the IQR, we can perform a test for
outliers.
Outlier Test:
Any value below
Q1 – 1.5(IQR)
or above
Q3 + 1.5 (IQR)
is considered an outlier.
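The outlier test above can be sketched as a pair of small functions. The function names, the quartile values, and the data are all made up for illustration. Q1 and Q3 are passed in directly so the sketch does not depend on any particular quartile rule.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Hypothetical example: suppose Q1 = 10 and Q3 = 20, so IQR = 10.
low, high = outlier_fences(10, 20)
flagged = find_outliers([2, 12, 18, 36, 40], 10, 20)
```

With these numbers the cutoffs are 10 - 15 = -5 and 20 + 15 = 35, so 36 and 40 are flagged as outliers and 2 is not.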
Measures that are not strongly affected by
extreme values are said to be resistant.
The median and IQR are more resistant than
the mean and standard deviation.
The standard deviation is even less resistant
than the mean.
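To see resistance in action, we can change one value to an extreme one and watch which measures move. This is a sketch with made-up data. The IQR is left out here because its value depends on the quartile rule used.

```python
from statistics import mean, median, stdev

# Hypothetical data (made-up values), then the same data with the
# largest value replaced by an extreme one.
data = [10, 12, 13, 15, 20]
with_extreme = data[:-1] + [200]

# The median does not move: the middle value is still 13.
median_before, median_after = median(data), median(with_extreme)

# The mean is pulled toward the extreme value.
mean_before, mean_after = mean(data), mean(with_extreme)

# The standard deviation grows even more, relatively, than the mean.
sd_before, sd_after = stdev(data), stdev(with_extreme)
mean_growth = mean_after / mean_before
sd_growth = sd_after / sd_before
```

In this example the mean jumps from 14 to 50 while the median stays at 13, and the standard deviation grows by a larger factor than the mean does.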
Measures of Spread or Variability – Why?
We measure spread because it’s an important
description of what is happening with the data.
We need to know about the amount of variation
we can expect in a data set.
Simpson’s Paradox can occur when we
combine groups or look at marginal totals
instead of looking at groups individually: a
comparison that holds within each group can
reverse in the combined data.
The combined percentages can be misleading, so it
is better to compare percentages
within each level of a variable.
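A small made-up example shows how the reversal can happen. All counts below are hypothetical, chosen only to illustrate the paradox. "A" and "B" stand for any two treatments, and "Group 1" and "Group 2" for the levels of some other variable.

```python
# Hypothetical counts (made up for illustration): (successes, total)
# for treatments A and B within each of two groups.
counts = {
    "Group 1": {"A": (8, 10), "B": (70, 100)},
    "Group 2": {"A": (30, 100), "B": (2, 10)},
}

# Success rate within each group, for each treatment.
within = {
    group: {t: successes / total for t, (successes, total) in treatments.items()}
    for group, treatments in counts.items()
}

# Success rate after combining the groups (the marginal totals).
combined = {}
for t in ("A", "B"):
    successes = sum(counts[g][t][0] for g in counts)
    total = sum(counts[g][t][1] for g in counts)
    combined[t] = successes / total
```

Within Group 1, A succeeds 80% of the time against 70% for B, and within Group 2, A succeeds 30% against 20% for B. Combined, though, A is 38 out of 110 (about 35%) and B is 72 out of 110 (about 65%). The reversal happens because most of A's cases come from the group with the lower success rates.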