Transcript lecture1

Topic 1: Descriptive Statistics
CEE 11 Spring 2001
Dr. Amelia Regan
These notes draw liberally from the class text, Probability and Statistics for
Engineering and the Sciences by Jay L. Devore, Duxbury 1995 (4th edition)
definitions

A population consists of all objects of a certain type that are relevant to a particular study or analysis.
- All students at UCI represent a population.
A sample is a subset or portion of the population.
- Students in this class represent a sample of the population of students at UCI.
frequency distributions and histograms



A frequency is a count: the number of observations in the sample that take a particular value or fall within a particular class.
Classes must be mutually exclusive (no
overlap allowed) and collectively exhaustive
(the full range of the data must be covered).
A histogram is a bar chart of the frequency
distribution.
guidelines for forming class intervals
- Use intervals of equal length with midpoints at convenient round numbers.
- For large data sets use more intervals.
- For small data sets use a small number of intervals.
Example
- 30 students are asked to submit their weights (in pounds), with these results:

Men (18 in sample): 140, 170, 145, 157, 160, 130, 190, 185, 155, 190, 165, 155, 130, 155, 150, 148, 150, 140
Women (12 in sample): 140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150
Example
- We might break the sample into classes and construct the following frequency table:

class       frequency   rel. freq.   class midpoint
100-<120        4         0.133          110
120-<140        8         0.267          130
140-<160       11         0.367          150
160-<180        3         0.100          170
180-<200        4         0.133          190
From the table we can easily construct a
histogram for the sample
[Figure: Frequency Histogram for Weight Data. Horizontal axis: classes (100-<120 through 180-<200); vertical axis: number of observations (0 to 12).]
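As a rough illustration of how a frequency table and histogram are built, the sketch below bins the 12 women's weights from the example into the same classes. The function name `frequency_table` and the text-bar output are assumptions made for this illustration, not part of the course materials.

```python
def frequency_table(values, edges):
    """Count how many values fall in each class [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            # Half-open classes keep the classes mutually exclusive.
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
        else:
            # No class matched: the classes are not collectively exhaustive.
            raise ValueError(f"{v} is outside every class")
    return counts


# Women's weights from the example (12 values, in pounds).
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]
edges = [100, 120, 140, 160, 180, 200]

counts = frequency_table(women, edges)
n = len(women)

# One row per class: class label, frequency, relative frequency, text bar.
for lo, hi, f in zip(edges, edges[1:], counts):
    print(f"{lo}-<{hi}  {f:2d}  {f / n:.3f}  {'#' * f}")
```

The half-open test `edges[i] <= v < edges[i + 1]` mirrors the "100-<120" class notation, so a weight of exactly 120 lands in the second class only.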
mean

The mean of a sample or data set is simply
the arithmetic average of the values in the
set, obtained by summing the values and
dividing by the number of values.
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The mean of the sample of weights is
144.63 pounds
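A minimal sketch of the calculation, applied here only to the 12 women's weights from the example so the arithmetic is easy to check by hand. The helper name `mean` is an assumption for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty sample is undefined")
    return sum(values) / len(values)


# Women's weights from the example (in pounds).
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]

x_bar = mean(women)
print(f"mean of the women's weights: {x_bar:.2f} lbs")
```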
mean of a frequency distribution

When we summarize a data set in a frequency
distribution, we are approximating the data set by
"rounding" each value in a given class to the class
mark.
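$$\bar{x} \approx \frac{1}{n}\sum_{i=1}^{k} f_i x_i = \sum_{i=1}^{k} p_i x_i$$
where k is the number of classes, x_i is the class mark (midpoint) of the ith class, f_i is the frequency of the ith class, and p_i = f_i/n is the proportion of observations in the ith class.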
The mean of the weight data obtained in this way is 146.67 lbs.
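The grouped calculation can be sketched directly from the frequency table for the weight data. The variable names are assumptions for illustration. Both forms of the formula (frequencies divided by n, and proportions) are computed to show that they agree.

```python
# Class midpoints and frequencies taken from the frequency table.
midpoints = [110, 130, 150, 170, 190]
freqs = [4, 8, 11, 3, 4]

n = sum(freqs)  # total number of observations (30)

# Form 1: (1/n) * sum of f_i * x_i
grouped_mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Form 2: sum of p_i * x_i, with p_i = f_i / n
props = [f / n for f in freqs]
grouped_mean_p = sum(p * x for p, x in zip(props, midpoints))

print(f"grouped mean: {grouped_mean:.2f} lbs")  # 146.67
```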
median


The median is the value that is roughly in the
middle of the data set. If n is odd, the median is the
single value in the middle, namely the value with
rank (n + 1)/2.
If n is even, there is not a single value in the
middle, so the median is defined to be the average
of the two middle values, namely the values with
ranks n/2 and n/2 + 1.
The median for our example is (140+145)/2 =
142.5 lbs.
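The two cases of the rule can be sketched as follows, using the 30 weights as listed in the example. The function name `median` is an assumption for illustration. Ranks count from 1, while Python list indices count from 0, which is why the indices are shifted by one.

```python
def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: rank (n + 1)/2 is zero-based index n // 2.
        return ordered[mid]
    # n even: ranks n/2 and n/2 + 1 are zero-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


men = [140, 170, 145, 157, 160, 130, 190, 185, 155,
       190, 165, 155, 130, 155, 150, 148, 150, 140]
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]

print(median(men + women))  # 142.5
```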
mode

The mode of a data set is the value that
appears most often.
The modal values for our sample are 130 and 140
-- the mode need not be a single value

If data are broken into classes, the modal
class is the class with the most members.
The modal class for our sample is 140-<160
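A small sketch of both ideas. The list passed to `modes` is a hypothetical toy sample chosen to show a tie. The modal class is read off the frequency table for the weight data. The function and variable names are assumptions for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Hypothetical toy sample: 2 and 3 both appear twice, so there are two modes.
print(modes([1, 2, 2, 3, 3]))  # [2, 3]

# Modal class from the frequency table: the class with the largest frequency.
classes = ["100-<120", "120-<140", "140-<160", "160-<180", "180-<200"]
freqs = [4, 8, 11, 3, 4]
modal_class = classes[freqs.index(max(freqs))]
print(modal_class)  # 140-<160
```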
range

The range or spread of a data set is the difference between its largest and smallest values.
The weight data run from 102 to 190, so the range is 88 lbs.
variance


The variance of a population is the average
of the squared deviations from the mean
The variance of a sample is approximately
the average of the squared deviations from
the mean (note that we divide the sum of the
squared deviations by n-1 rather than n)

Population variance:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
Sample variance:
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
standard deviation
The standard deviation is the square root of the
variance

The standard deviation is useful because it is in the same units as the mean (and the original data), so it provides better insight into the relative variability of a sample.

Population standard deviation:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$
Sample standard deviation:
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
The variance and standard deviation of the weight data are 559.14 lbs² and 23.64 lbs.
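The difference between dividing by N and by n - 1 can be sketched as below, again using only the 12 women's weights from the example. The function names are assumptions for illustration.

```python
import math


def population_variance(values):
    """Average squared deviation from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


# Women's weights from the example (in pounds).
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]

s2 = sample_variance(women)
s = math.sqrt(s2)  # standard deviation, back in pounds
print(f"sample variance: {s2:.2f} lbs^2, standard deviation: {s:.2f} lbs")
```

Dividing by n - 1 always gives a slightly larger value than dividing by n. The gap shrinks as the sample grows.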
coefficient of variation
The coefficient of variation is the standard deviation divided by the mean:
$$\text{c.v.} = \frac{\sigma}{\mu} \ \text{(population)}, \qquad \text{c.v.} = \frac{s}{\bar{x}} \ \text{(sample)}$$
The coefficient of variation is used to compare the relative variability of more than one data set.
For the weight data the coefficient of variation is 0.163.
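Since the coefficient of variation is meant for comparing data sets, one possible sketch compares the men's and women's weights as listed in the example. It relies on `statistics.stdev`, which computes the sample (n - 1) standard deviation. The helper name is an assumption for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)


men = [140, 170, 145, 157, 160, 130, 190, 185, 155,
       190, 165, 155, 130, 155, 150, 148, 150, 140]
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]

cv_men = coefficient_of_variation(men)
cv_women = coefficient_of_variation(women)
print(f"c.v. men: {cv_men:.3f}, c.v. women: {cv_women:.3f}")
```

Because the units cancel, the two values can be compared directly even though the two groups have different means.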

shortcut formula for the variance

It is sometimes more convenient to use the following formula for the variance:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$
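As a quick check that the shortcut agrees with the definition, the sketch below computes the sample variance both ways for the women's weights from the example. The function names are assumptions for illustration.

```python
def variance_definition(values):
    """Sample variance from squared deviations about the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """Sample variance from the running sums of x and x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Women's weights from the example (in pounds).
women = [140, 118, 120, 122, 130, 115, 138, 102, 121, 115, 125, 150]

v_def = variance_definition(women)
v_short = variance_shortcut(women)
print(f"definition: {v_def:.4f}, shortcut: {v_short:.4f}")
```

The shortcut needs only one pass over the data, which is convenient by hand. In floating-point arithmetic it can lose precision when the mean is very large compared with the spread, so the definition form is usually the safer choice in code.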
Class exercise


The National Weather Service maintains and publishes historical weather data for 100 US cities. The average annual rainfall, in inches, for the cities in the database whose names begin with A is listed below. Calculate the mean, median, range, variance and standard deviation for the following data.
city             rainfall (in.)
Albany           35.74
Albuquerque       8.12
Anchorage        15.20
Asheville        47.71
Atlanta          48.61
Atlantic City    41.93
Austin           31.50
properties of S²

Let x1, x2, ..., xn be a sample and c be any nonzero constant.
If y1 = x1 + c, y2 = x2 + c, ..., yn = xn + c, then $S_y^2 = S_x^2$.
If y1 = cx1, y2 = cx2, ..., yn = cxn, then $S_y^2 = c^2 S_x^2$ and $S_y = |c| S_x$.
In other words: if we add a constant to every value in a sample, the variance does not change; if we multiply every value by a constant, the variance is multiplied by the square of the constant.
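Both properties can be checked numerically. The sample `x` and the constant `c` below are hypothetical values chosen for illustration. `statistics.variance` and `statistics.stdev` use the n - 1 divisor.

```python
import statistics

x = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical sample
c = -3.0                        # hypothetical nonzero constant

shifted = [xi + c for xi in x]  # y_i = x_i + c
scaled = [c * xi for xi in x]   # y_i = c * x_i

var_x = statistics.variance(x)
sd_x = statistics.stdev(x)
var_shifted = statistics.variance(shifted)
var_scaled = statistics.variance(scaled)
sd_scaled = statistics.stdev(scaled)

print(f"variance of x:           {var_x:.4f}")
print(f"variance after shifting: {var_shifted:.4f}")   # same as variance of x
print(f"variance after scaling:  {var_scaled:.4f}")    # c**2 times variance of x
print(f"std dev after scaling:   {sd_scaled:.4f}")     # |c| times std dev of x
```

A negative constant is used on purpose: the standard deviation scales by |c|, not c, so it stays nonnegative.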
related properties of the sample mean

Let x1, x2, ..., xn be a sample and c be any nonzero constant.

If y1 = x1 + c, y2 = x2 + c, ..., yn = xn + c, then $\bar{y} = \bar{x} + c$.

If y1 = cx1, y2 = cx2, ..., yn = cxn, then $\bar{y} = c\bar{x}$.

In other words, if we add a constant to every value in the sample, or multiply every value by a constant, the mean is shifted or multiplied by the same constant.
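Both results follow in one line each from the definition of the sample mean. A short derivation, using the same symbols as above:

```latex
% Adding a constant c to every observation:
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} (x_i + c)
        = \frac{1}{n}\sum_{i=1}^{n} x_i + \frac{nc}{n}
        = \bar{x} + c

% Multiplying every observation by a constant c:
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} c\,x_i
        = c \cdot \frac{1}{n}\sum_{i=1}^{n} x_i
        = c\,\bar{x}
```

In the shifted case each deviation is unchanged, since $y_i - \bar{y} = (x_i + c) - (\bar{x} + c) = x_i - \bar{x}$, which is why adding a constant leaves the variance alone.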
Class exercise
Without using your calculators, calculate the mean
and variance of the following data

Xi | 35  40  45  50  55
---+--------------------
fi | 13  11  14  13  12

Hint: shift the observations “to the left” by subtracting a constant, and then divide by another constant.