Introduction - Texas A&M University Department of Statistics
Chapter 1
Introduction
• Individuals: the objects described by a set of
data (people, animals, or things)
• Variable: Characteristic of an individual. It
can take on different values for different
individuals.
Examples: age, height, gender, favorite
class, speed, moisture, etc.
Types of Variables
• Quantitative: numerical values, can be added,
subtracted, averaged, etc.
– ________: takes on values which are spaced. That is,
for two values of a discrete variable that are adjacent,
there is no value that goes between them.
– ________: values are all numbers in a given interval.
That is, for two values of a continuous variable that are
adjacent, there is another value that can go between the
two.
• Categorical: An individual is placed into one of
several groups or categories. These groups or
categories are not usually numerical.
Types of Variables
Examples:
Variable          Numeric   Discrete   Continuous   Categorical
Length
Hours Enrolled
Major
Zip Code
Distribution of a Variable
• The distribution of a variable tells us the
possible values for the variable and the
probability that the variable takes these
values.
• Two ways to describe a distribution
– Numerically
– Graphically
Categorical Variables
• Suppose we poll 46 people on an issue. How can
we exhibit their response?
• Numerically:
– Counts
– Proportions
– Percentages
• Graphically:
– Frequency Tables
– Bar Charts
– Pie Charts
Categorical Variables
• Suppose we poll 46 people on an issue.
How can we exhibit their response?
– Frequency Tables:
• counts (14 agree)
• proportions (14/46 = .304 agree)
• percents (30.4% agree)
VOTE
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  agree           14        30.4        30.4              30.4
       disagree        23        50.0        50.0              80.4
       undecid.         9        19.6        19.6             100.0
       Total           46       100.0       100.0
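The counts, proportions, and percents in the table can be reproduced with a short sketch. The counts come from the VOTE table; the function name `frequency_table` and the variable names are illustrative choices, not part of the slides.

```python
# Responses from the poll of 46 people (counts taken from the VOTE table).
counts = {"agree": 14, "disagree": 23, "undecid.": 9}

total = sum(counts.values())


def frequency_table(counts):
    """Return rows of (category, count, proportion, percent, cumulative percent)."""
    n = sum(counts.values())
    rows = []
    running = 0  # running count used for the cumulative percent column
    for category, count in counts.items():
        running += count
        rows.append((category, count, count / n, 100 * count / n, 100 * running / n))
    return rows


table = frequency_table(counts)
for category, count, proportion, percent, cumulative in table:
    print(f"{category:<10}{count:>4}{proportion:>8.3f}{percent:>8.1f}{cumulative:>8.1f}")
```

Each row holds the same information in three forms: the count, the proportion (count divided by 46), and the percent (proportion times 100).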
Categorical Variables
• Suppose we poll 46 people on an issue.
How can we exhibit their response?
– Bar Chart: can have counts, percents, or proportions on the
vertical axis
[Figure: bar chart of VOTE (agree, disagree, undecid.) with Count on the vertical axis, 0 to 30]
Categorical Variables
• Suppose we poll 46 people on an issue.
How can we exhibit their response?
– Pie Chart:
[Figure: pie chart of VOTE with slices agree, disagree, and undecid.]
Examining a Distribution
• To describe a distribution we need 3 items:
– Shape: modes, symmetric, skewed
– Center: mean, median
– Spread: range, standard deviation, IQR
• Look for the overall pattern and for striking
deviations
– Outlier: an individual value that falls outside the
overall pattern
Numeric Variable Distributions
Shape:
Modes: Major peaks in the distribution
Symmetric: The values smaller and larger than the midpoint are mirror
images of each other
Skewed to the right: Right tail is much longer than the left tail
Skewed to the left: Left tail is much longer than the right tail
Center:
Mean: The arithmetic average. Add up the numbers and divide by the
number of observations.
Median: List the data from smallest to largest. If there are an odd
number of data values, the median is the middle one in the list. If
there are an even number of data values, average the middle two
in the list
Numeric Variable Distributions
Spread:
Range: The difference between the largest and smallest values. (Max – Min)
Standard Deviation: Measures spread by looking at how far observations are
from their mean.
The computational formula for the standard deviation is
$s = \sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
Interquartile Range (IQR): Distance between the first quartile (Q1) and the third
quartile (Q3). IQR = Q3 – Q1
Q1 – 25% of the observations are less than Q1 and 75% are greater
than Q1.
Q3 – 75% of the observations are less than Q3 and 25% are greater
than Q3.
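As a minimal sketch of the definitions above, the mean, median, range, and standard deviation can be written out by hand. The five-value `sample` is hypothetical and chosen only so the arithmetic is easy to follow; the function names are illustrative.

```python
import math


def mean(values):
    """Arithmetic average: add up the numbers and divide by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted list, or the average of the middle two."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def value_range(values):
    """Max minus Min."""
    return max(values) - min(values)


def sample_std(values):
    """Standard deviation with the n - 1 divisor, as in the formula for s."""
    xbar = mean(values)
    squared_deviations = sum((x - xbar) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


sample = [1, 3, 5, 7, 9]  # hypothetical data, not from the slides
print(mean(sample), median(sample), value_range(sample), sample_std(sample))
```

For this sample the squared deviations are 16, 4, 0, 4, and 16, so the variance is 40 / 4 = 10 and s is the square root of 10.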
Numeric Variable Distributions
• Example 1.5 on page 11 of the book shows
how much 50 consecutive shoppers spent in a
store. The data appear as follows:
$3.11    $18.30   $24.50   $36.30   $50.30
$8.88    $18.40   $25.10   $38.60   $52.70
$9.26    $19.20   $26.20   $39.10   $54.80
$10.80   $19.50   $26.20   $41.00   $59.00
$12.60   $19.50   $27.60   $42.90   $61.20
$13.70   $20.10   $28.00   $44.00   $70.30
$15.20   $20.50   $28.00   $44.60   $82.70
$15.60   $22.20   $28.30   $45.40   $85.70
$17.00   $23.00   $32.00   $46.60   $86.30
$17.30   $24.40   $34.90   $48.60   $93.30
Numerical Variables
• How can we describe the distribution of these 50
numbers?
– Numerically
• Center: Mean or Median
• Spread: Quartiles, Range, IQR, or Standard deviation
– Graphically
• Frequency Table
• Histogram
• Boxplot
• Stem and Leaf
• Normal Quantile Plot
Descriptive statistics
The descriptives box from SPSS gives the mean,
median, variance, standard deviation, minimum,
maximum, range, and IQR.
Descriptives
                                                 Statistic   Std. Error
Mean                                              34.6550      3.0682
95% Confidence Interval for Mean   Lower Bound    28.4891
                                   Upper Bound    40.8209
5% Trimmed Mean                                   33.1929
Median                                            27.8000
Variance                                          470.704
Std. Deviation                                    21.6957
Minimum                                           3.11
Maximum                                           93.3
Range                                             90.2
Interquartile Range                               26.7000
Skewness                                          1.104        .337
Kurtosis                                          .711         .662
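Several entries in the Descriptives box can be checked against the 50 shopper amounts with Python's standard `statistics` module. This is only a sketch: the variable names are illustrative, and the quartiles here use the module's default "exclusive" method, which is one of several quartile conventions and is assumed rather than known to be what SPSS reports.

```python
import math
import statistics

# The 50 amounts spent, in dollars, sorted from smallest to largest.
spending = [
    3.11, 8.88, 9.26, 10.80, 12.60, 13.70, 15.20, 15.60, 17.00, 17.30,
    18.30, 18.40, 19.20, 19.50, 19.50, 20.10, 20.50, 22.20, 23.00, 24.40,
    24.50, 25.10, 26.20, 26.20, 27.60, 28.00, 28.00, 28.30, 32.00, 34.90,
    36.30, 38.60, 39.10, 41.00, 42.90, 44.00, 44.60, 45.40, 46.60, 48.60,
    50.30, 52.70, 54.80, 59.00, 61.20, 70.30, 82.70, 85.70, 86.30, 93.30,
]

n = len(spending)
mean = statistics.mean(spending)
median = statistics.median(spending)
variance = statistics.variance(spending)   # sample variance (n - 1 divisor)
std_dev = statistics.stdev(spending)       # square root of the sample variance
std_error = std_dev / math.sqrt(n)         # standard error of the mean
data_range = max(spending) - min(spending)

# Quartiles by the "exclusive" method (assumed convention; others exist).
q1, q2, q3 = statistics.quantiles(spending, n=4)
iqr = q3 - q1

print(f"mean={mean:.4f} median={median:.4f} variance={variance:.3f}")
print(f"sd={std_dev:.4f} se={std_error:.4f} range={data_range:.2f} IQR={iqr:.4f}")
```

The mean is well above the median, which is consistent with the positive skewness reported in the table.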
Percentiles
• 50th percentile is also called the median – the
middle data value if ordered smallest to
largest
• 25th and 75th percentiles are also called the
quartiles: Q1 and Q3 respectively – the middle
data value of each half
Percentiles
Percentile                           5        10        25        50        75        90        95
Weighted Average (Definition 1)   9.0890   12.7100   19.0000   27.8000   45.7000   69.3900   85.9700
Tukey's Hinges                                        19.2000   27.8000   45.4000
Frequency Table
– Notice the amount
spent is broken into
categories or groups
– Recall, frequency
tables can be used for
categorical variables
as well
Category    Count or Frequency   Percent
0 - 10              3             6.00%
10 - 20            12            24.00%
20 - 30            13            26.00%
30 - 40             5            10.00%
40 - 50             7            14.00%
50 - 60             4             8.00%
60 - 70             1             2.00%
70 - 80             1             2.00%
80 - 90             3             6.00%
90 - 100            1             2.00%
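The grouping in this frequency table can be sketched by assigning each amount to a $10-wide interval. The rule that a value exactly on a boundary goes into the higher interval is an assumption; no amount in this data set sits on a boundary, so the choice does not affect the counts here.

```python
from collections import Counter

# The 50 amounts spent, in dollars.
spending = [
    3.11, 8.88, 9.26, 10.80, 12.60, 13.70, 15.20, 15.60, 17.00, 17.30,
    18.30, 18.40, 19.20, 19.50, 19.50, 20.10, 20.50, 22.20, 23.00, 24.40,
    24.50, 25.10, 26.20, 26.20, 27.60, 28.00, 28.00, 28.30, 32.00, 34.90,
    36.30, 38.60, 39.10, 41.00, 42.90, 44.00, 44.60, 45.40, 46.60, 48.60,
    50.30, 52.70, 54.80, 59.00, 61.20, 70.30, 82.70, 85.70, 86.30, 93.30,
]

bin_width = 10
number_of_bins = 10  # intervals 0-10, 10-20, ..., 90-100

# Interval index for each amount: 0 for [0, 10), 1 for [10, 20), and so on.
bin_counts = Counter(int(x // bin_width) for x in spending)
frequencies = [bin_counts.get(b, 0) for b in range(number_of_bins)]

for b, count in enumerate(frequencies):
    low, high = b * bin_width, (b + 1) * bin_width
    percent = 100 * count / len(spending)
    print(f"{low:>3} - {high:<4}{count:>4}{percent:>8.2f}%")
```

The same `frequencies` list gives the bar heights of a histogram built on these intervals.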
Histogram
– Breaks the range of values
of a variable into intervals
(midpoint is displayed here)
– Displays only the count
or percent of the observations
that fall into each interval
[Figure: histogram of the amounts spent; horizontal axis shows interval midpoints 5 to 95, vertical axis shows counts from 0 to 14]
Box Plot
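Minimum, Q1, Median, Q3, and Maximum
These five numbers are called the ____________________
[Figure: boxplot of the 50 amounts, vertical axis from -20 to 100, N = 50; three points above the upper whisker are labeled 48, 49, and 50]
What are these points?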
Stem and Leaf Plot
• Works best for smaller data sets
– Example 1.4 on pg 10
• Here are the numbers of home runs that Babe Ruth hit in each
of his 15 years with the New York Yankees from 1920-1934:
– 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
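One way to build a stem and leaf plot for these 15 values is to use the tens digit as the stem and the ones digit as the leaf. The sketch below assumes two-digit data; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

# Babe Ruth's home runs per season with the Yankees, 1920-1934.
home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def stem_and_leaf(values):
    """Return one text line per stem (tens digit) with its sorted leaves (ones digits)."""
    leaves_by_stem = defaultdict(list)
    for value in sorted(values):
        leaves_by_stem[value // 10].append(value % 10)
    lines = []
    # Include every stem between the smallest and largest so gaps stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = " ".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


plot = stem_and_leaf(home_runs)
print("\n".join(plot))
```

Turned on its side, the plot looks like a histogram with intervals of width 10, but it keeps every individual data value.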
Normal Quantile Plot
– Normal Quantile Plot: compares the distribution of the
sample to the Normal distribution.
– The straight line is normal; compare the dots to the line.
– If the dots fall close to the normal line, then the data
plausibly come from a normal distribution.
[Figure: Normal Q-Q plot; horizontal axis Observed Value, vertical axis Expected Normal Value, both from -20 to 100]
Describing Numeric Variable Distributions
• Now, we examine the appearance of other data:
– Modes are major peaks in the distribution
One histogram below has two modes (bimodal); the other has one
mode (unimodal).
[Figure: two histograms of DATA, horizontal axis 1.0 to 9.0; the legends read Std. Dev = 1. (truncated) and 2.67, Mean = 5.1 and 5.4, N = 59.00 and 60.00]
Describing Numeric Variable Distributions
• Now, we examine the appearance of other data:
– This example is called right
skewed since the distribution has
a long right tail.
[Figure: histogram of data, horizontal axis 4.00 to 16.00, vertical axis Count, with a long right tail]
This is an example of a boxplot
that is skewed to the _______.
[Figure: boxplot of DATA, N = 46]
Describing Numeric Variable
Distributions
• ________: observations that are unusually far from the
bulk of the data.
• What are some possible explanations for outliers?
– The data point was recorded incorrectly.
– The data point wasn’t actually a member of the population we
were trying to sample.
– We just happened to get an extreme value in our sample.
• The 1.5 x IQR Criterion for Outliers: Designate an
observation a suspected outlier if it falls more than
1.5 x IQR below the first quartile or above the third
quartile.
1.5*IQR Criterion Example
• Suppose you had the following data set:
-2, 15, 3, 7, 10, 21, 1, 5, 12, 8, 1, 35, 10
List data from smallest to largest:
Find Q1, Median, Q3, Min, and Max:
IQR = Q3 – Q1 = ______
1.5*IQR = _______
Q1 – 1.5*IQR = ________ (a value less than this number is a suspected outlier)
Q3 + 1.5*IQR = ________ (a value more than this number is a suspected outlier)
Are there any outliers in this data set?
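The steps of the 1.5 x IQR criterion can be sketched in code. This version takes Q1 and Q3 as the medians of the lower and upper halves of the sorted data, leaving out the overall median when the number of values is odd; that is an assumed convention, and software using a different quartile rule may give slightly different fences. The function names are illustrative.

```python
import statistics


def quartiles_by_halves(values):
    """Q1, median, Q3, where Q1 and Q3 are the medians of each half of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return statistics.median(lower), statistics.median(ordered), statistics.median(upper)


def suspected_outliers(values):
    """Return the values outside the fences, plus the (lower, upper) fences."""
    q1, _, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    flagged = [x for x in values if x < lower_fence or x > upper_fence]
    return flagged, (lower_fence, upper_fence)


data = [-2, 15, 3, 7, 10, 21, 1, 5, 12, 8, 1, 35, 10]
q1, med, q3 = quartiles_by_halves(data)
flagged, fences = suspected_outliers(data)
print(sorted(data))
print(q1, med, q3, fences, flagged)
```

A flagged value is only a suspected outlier; whether to keep it depends on which of the explanations for outliers applies.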
Describing Numeric Variable
Distributions
• Symmetry versus Skewness:
[Figure: three histograms, each with a matching boxplot (N = 41, 48, and 73); one histogram's legend reads Std. Dev = 3.68, Mean = 8.4, N = 41.00. Each histogram and boxplot pair has a blank label to fill in.]
__________     _________     ___________
Mean versus Median:
• For a skewed distribution, the mean is farther out in the longer tail than is the
median.
[Figure: three histograms, each with a matching boxplot (N = 41, 48, and 73), labeled Symmetric, Left Skewed, and Right Skewed]
To describe distributions use:
– Left Skewed (mean < median): Median and IQR
– Symmetric (mean = median): Mean and standard deviation
– Right Skewed (mean > median): Median and IQR
Strategy for Exploring Data on a
Single Quantitative Variable
1) Always plot your data: make a graph, usually a
stem and leaf plot or a histogram
2) Look for overall pattern and for outliers
3) Calculate an appropriate numerical summary to
briefly describe center and spread
4) Sometimes the overall pattern of a large number
of observations is so regular that it can be
described by a smooth curve
Introducing the Normal Distribution
It is customary to describe a normal distribution in the following way:
$N(\mu, \sigma^2)$
Properties of the Normal Distribution:
1) Symmetric, bell-shaped
2) Mean, μ and standard deviation, σ
3) Area under the curve is 1
[Figure: bell-shaped normal curve centered at μ, with σ marking the spread]
The Normal Distribution
Normal distributions can take on many different means and
standard deviations. Only the general bell shape must
remain the same.
Here are some examples of normal distributions:
– μ = 0, σ = 1: $N(0, 1)$
– μ = 3, σ = 2: $N(3, 2^2)$
– μ = -2, σ = 0.5: $N(-2, 0.5^2)$
[Figure: three normal curves centered at 0, 3, and -2]
Distribution Properties
• Introducing:
The Standard Normal
Distribution
Properties:
1. _________________
2. _________________
3. _________________
Distribution Properties
• Empirical Rule (The 68-95-99.7 Rule): If the
distribution is normal, then
– Approximately 68% of the data falls within one standard
deviation of the mean
– Approximately 95% of the data falls within two standard
deviations of the mean
– Approximately 99.7% of the data falls within three
standard deviations of the mean
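The three percentages in the rule can be checked numerically with `NormalDist` from Python's standard library. This is a sketch using the standard normal curve; the helper name `area_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def area_within(k):
    """Area under the standard normal curve between -k and +k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


areas = {k: area_within(k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} standard deviation(s): {100 * area:.2f}%")
```

Because any normal distribution can be standardized, the same areas apply for every mean and standard deviation.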
Distribution Properties
Empirical Rule
Percentiles of a Standard Normal Curve
Empirical Rule Example
• If the grades on an exam are normally
distributed with a mean of 68 and a variance
of 16, what grade do you have to make to be
in the top 15% of the class?
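A sketch of this example: a variance of 16 means a standard deviation of 4. By the empirical rule, about 68% of grades fall within one standard deviation of the mean, so roughly 16% fall above 68 + 4 = 72. The code compares that rule-of-thumb cutoff with the 85th percentile computed from the normal curve; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mean_grade = 68
variance = 16
sd = math.sqrt(variance)  # the standard deviation is the square root of the variance

# Empirical rule: about 16% of a normal distribution lies above mean + 1 sd.
rule_of_thumb_cutoff = mean_grade + sd

# Cutoff from the normal curve: the grade with 85% of the class below it.
grades = NormalDist(mu=mean_grade, sigma=sd)
exact_cutoff = grades.inv_cdf(0.85)

print(f"rule of thumb: {rule_of_thumb_cutoff:.1f}, normal curve: {exact_cutoff:.2f}")
```

Both approaches put the cutoff at about 72, so a grade a little above 72 is needed to be in the top 15%.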
Distribution Properties
• Shift Changes: adding a number to, or subtracting a number
from, each of the values.
[Figure: a distribution centered at mean, shifted to mean + c and to mean - c]
Distribution Properties
• The mean, median, Q1, Q3, minimum, and
maximum all shift when there is a shift
change. The shift change, say c, is added or
subtracted to each of the statistics
accordingly.
• The measures of spread (standard deviation,
variance, IQR, and range) do not change
when there is a shift change.
Distribution Properties
• Scale Changes: multiplying or dividing each
of the values by a number.
[Figure: a distribution centered at mean, rescaled to mean*c and to mean/c]
Distribution Properties
• The mean, median, Q1, Q3, minimum, and
maximum all change when there is a scale change
unless they are zero. Each is multiplied or divided
by the scale change c.
• The measures of spread (standard deviation,
variance, IQR, and range) always change when
there is a scale change. The standard deviation,
IQR, and range are multiplied or divided by the
scale change c. The variance is multiplied or
divided by $c^2$.
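The effect of shift and scale changes on the summary statistics can be illustrated on a small data set. The six values and the constant c = 3 are hypothetical, and the quartiles use the `statistics` module's default method, which is an assumed convention.

```python
import statistics


def summarize(values):
    """Center and spread statistics for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "variance": statistics.variance(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }


data = [2.0, 4.0, 4.0, 6.0, 9.0, 11.0]  # hypothetical data, not from the slides
c = 3.0                                  # hypothetical shift and scale constant

original = summarize(data)
shifted = summarize([x + c for x in data])  # shift change: add c to every value
scaled = summarize([x * c for x in data])   # scale change: multiply every value by c

for name in original:
    print(f"{name:<9}{original[name]:>9.3f}{shifted[name]:>9.3f}{scaled[name]:>9.3f}")
```

In the printed table, the shift moves the mean and median by c and leaves every measure of spread alone, while the scale change multiplies everything by c except the variance, which is multiplied by c squared.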
Shift Change Example
• Suppose we measure the weight of everyone on
a football team and obtain the following
statistics for a team report:
– Mean: 230 lbs.
– Std. Dev.: 50 lbs.
– Variance: 2500 sq. lbs.
– Min.: 170 lbs.
– Max.: 350 lbs.
– Median: 240 lbs.
– Q1: 200 lbs., Q3: 280 lbs.
– IQR: 80 lbs.
– Range: 180 lbs.
Shift Change Example
• Now suppose we found out the scale was 10
lbs. under so we need to add 10 lbs. to every
weight. What would happen to each of the
following statistics?
Original               After Shift Change
Mean: 230 lbs.         Mean: ________
Median: 240 lbs.       Median: ________
s: 50 lbs.             s: ________
Q1: 200 lbs.           Q1: ________
Q3: 280 lbs.           Q3: ________
Shift Change Example
• Now suppose we found out the scale was 10
lbs. under so we need to add 10 lbs. to every
weight. What would happen to each of the
following statistics?
Original                    After Shift Change
Variance: 2500 sq. lbs.     Variance: ________
IQR: 80 lbs.                IQR: ________
Min: 170 lbs.               Min: ________
Max: 350 lbs.               Max: ________
Range: 180 lbs.             Range: ________
Shift and Scale Change Example
• Further, suppose we found out that we are supposed to
report the weights and statistics in kilograms, not lbs.
(Remember, 1 lb ≈ 0.45 kilograms.) What would
happen to each of the following statistics?
After Shift Change          After Shift and Scale Change
Mean: 240 lbs.              Mean: ______________
Median: 250 lbs.            Median: ______________
s: 50 lbs.                  s: ______________
Q1: 210 lbs.                Q1: ______________
Q3: 290 lbs.                Q3: ______________
Variance: 2500 sq. lbs.     Variance: ______________
IQR: 80 lbs.                IQR: ______________
Min: 180 lbs.               Min: ______________
Max: 360 lbs.               Max: ______________
Range: 180 lbs.             Range: ______________
Linear Transformations
If you are given a mean, $\bar{x}$ (or $\mu$), and a standard
deviation, $s$ (or $\sigma$), and want to convert your data so you
have a new mean, $\bar{x}_{new}$ (or $\mu_{new}$), and new standard
deviation, $s_{new}$ (or $\sigma_{new}$), all you need is to remember what
shift and scale changes affect.
• In our linear transformation formula: $x_{new} = a + bx$
– a is the shift change
– b is the scale change
• Standard deviations are only affected by scale changes, but
means are affected by both shift and scale changes.
$s_{new} = \text{scale} \cdot s$
$\bar{x}_{new} = \text{shift} + \text{scale} \cdot \bar{x}$
Linear Transformation Example
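• For example: $\bar{x} = 12$ and $s = 7$, but we want $\bar{x}_{new} = 25$ and $s_{new} = 10$.
$s_{new} = \text{scale} \cdot s$
$10 = \text{scale} \cdot 7$
$\text{scale} = 10/7$
$\text{scale} \approx 1.43$
• Substituting in: $\bar{x}_{new} = \text{shift} + \text{scale} \cdot \bar{x}$
$25 = \text{shift} + 1.43 \cdot 12$
$\text{shift} = 25 - 1.43 \cdot 12$
$\text{shift} = 7.84$
• So our linear transformation equation is: $x_{new} = 7.84 + 1.43x$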
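The same two steps can be sketched as a small function: solve for the scale from the standard deviations first, then solve for the shift from the means. The function name `linear_transformation` is illustrative. Note that the worked example rounds the scale to 1.43 before computing the shift, which gives 7.84; carrying the unrounded scale 10/7 gives a shift of about 7.86.

```python
def linear_transformation(old_mean, old_sd, new_mean, new_sd):
    """Return (shift, scale) so that x_new = shift + scale * x has the new mean and sd."""
    scale = new_sd / old_sd                  # standard deviation only sees the scale
    shift = new_mean - scale * old_mean      # the mean sees both shift and scale
    return shift, scale


shift, scale = linear_transformation(old_mean=12, old_sd=7, new_mean=25, new_sd=10)
print(f"x_new = {shift:.2f} + {scale:.2f} * x")

# Check: transforming the old mean and old sd should give the targets.
print(shift + scale * 12, scale * 7)
```

Solving for the scale first matters because the shift depends on the scale, but the scale does not depend on the shift.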