Transcript 7 & 8
Exploratory Data Analysis
Observations of a single variable
Example
In 1798 Cavendish made 29 determinations
of the density of the Earth, relative to that of
water. His results are stored in R:
> density
[1] 5.50 5.57 5.42 5.61 5.53 5.47 4.88 5.62 5.63 4.07 5.29 5.34 5.26 5.44 5.46
[16] 5.55 5.34 5.30 5.36 5.79 5.75 5.29 5.10 5.86 5.58 5.27 5.85 5.65 5.39
Source: The Data and Story Library:
http://lib.stat.cmu.edu/DASL. Note that these
are observations of a continuous variable, as
in general are measurements of all kinds.
Of interest, of course, is to estimate the
true density of the Earth. A useful simple
display is given by stem(density), while
simple summary statistics are produced
by the use of the functions mean, median,
sd, summary, etc. In particular we have
> mean(density)
[1] 5.42
> median(density)
[1] 5.46
The standard deviation of the
observations is 0.34.
A histogram is given by
> hist(density, breaks=seq(4,6,0.2),
xlab="relative density of Earth")
Clearly there is at least one low outlier
in the data. Thus the median, which is
less affected by outliers, may give a better
estimate of the true density than the mean.
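The sensitivity of the two summaries to the outlier can be checked directly. The following is a minimal sketch, assuming the data are re-entered by hand from the listing above; the names trimmed, mean_shift and median_shift are illustrative only.

```r
# Cavendish's 29 determinations, copied from the listing above
density <- c(5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39)

# Drop the single lowest observation (4.07) and recompute both summaries
trimmed <- density[density > min(density)]

mean_all    <- mean(density)
mean_trim   <- mean(trimmed)
median_all  <- median(density)
median_trim <- median(trimmed)

# Change in each summary caused by removing the one observation
c(mean_shift   = mean_trim - mean_all,
  median_shift = median_trim - median_all)
```

Removing the lowest value moves the mean by roughly ten times as much as it moves the median, which is the sense in which the median is the more robust estimate here.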
Now, let us investigate the extent to
which the data can be modelled as a
random sample from some underlying
normal distribution.
A normal Q-Q plot can be used to
examine this. Recall that this is a plot
of the sorted observations against
what is effectively an idealised sample
of the same size from the N(0, 1)
distribution.
The fitted line corresponds to the
normal distribution with the same first
and third quartiles as the data.
The plot and the fitted line are
constructed with
> qqnorm(density)
> qqline(density)
The line has intercept 5.46 and slope
0.23, which provide reasonable
estimates of the mean and standard
deviation of the best fitting normal
distribution. The plot again suggests
that at least the lowest observation
should be ignored.
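The intercept and slope of the fitted line can be recovered from the quartiles. This is a sketch, assuming the default settings of qqline (probabilities 0.25 and 0.75, and the default quantile type); the names q_data, q_normal, slope and intercept are illustrative.

```r
# Cavendish's 29 determinations, copied from the listing above
density <- c(5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39)

q_data   <- quantile(density, c(0.25, 0.75))  # first and third sample quartiles
q_normal <- qnorm(c(0.25, 0.75))              # corresponding N(0, 1) quartiles

# Line through the two points (q_normal[i], q_data[i])
slope     <- unname(diff(q_data) / diff(q_normal))
intercept <- unname(q_data[1] - slope * q_normal[1])

# Roughly 5.455 and 0.230, in line with the 5.46 and 0.23 quoted above
c(intercept = intercept, slope = slope)
```

Because the N(0, 1) quartiles are symmetric about zero, the intercept is simply the midpoint of the two sample quartiles, so it does not depend on the outlying low values.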
An approximate 95% confidence
interval for the true mean of the
underlying distribution of the data,
based on using all the data, is given by
> mean(density)+c(-1,1)*qnorm(0.975)*
sqrt(var(density)/length(density))
This gives a response of:
[1] 5.30 5.54
To correct for the fact that the sample
variance is an estimate of the underlying
true variance, we can use t.test(density)
which gives a 95% confidence interval
of [5.29, 5.55].
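The interval reported by t.test can also be built by hand, which makes the correction explicit: the normal quantile is replaced by the quantile of the t distribution with n - 1 degrees of freedom. This is a sketch with the data re-entered by hand; se, ci_normal and ci_t are illustrative names.

```r
# Cavendish's 29 determinations, copied from the listing above
density <- c(5.50, 5.57, 5.42, 5.61, 5.53, 5.47, 4.88, 5.62, 5.63, 4.07,
             5.29, 5.34, 5.26, 5.44, 5.46, 5.55, 5.34, 5.30, 5.36, 5.79,
             5.75, 5.29, 5.10, 5.86, 5.58, 5.27, 5.85, 5.65, 5.39)

n  <- length(density)
se <- sqrt(var(density) / n)   # estimated standard error of the mean

# Normal-based interval, as in the command above
ci_normal <- mean(density) + c(-1, 1) * qnorm(0.975) * se

# t-based interval: same centre and standard error, slightly larger multiplier
ci_t <- mean(density) + c(-1, 1) * qt(0.975, df = n - 1) * se

rbind(normal = ci_normal, t = ci_t)
```

The t-based interval is slightly wider than the normal-based one, since qt(0.975, 28) is larger than qnorm(0.975).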
The generally accepted modern day true
value for the relative density of Earth is
5.52.
Example
The R variable photons contains
a count of the number of photons produced
in each of 60 successive seconds by a very
weak light source.
> photons
[1] 1 4 1 0 0 1 0 1 2 1 2 2 1 2 4 1 4 5 1 2 1 4 4 1 1 2 4 2 0 3 4 4 4 4 3 1 1 3
[39] 1 2 6 2 1 2 0 3 0 2 1 2 4 6 1 2 0 1 1 0 3 4
Here the variable photons is a count (and
so discrete). We have 60 observations of it.
In addition to the usual R summary
functions, the R function table gives a
frequency table:
> table(photons)
photons
0 1 2 3 4 5 6
8 19 13 5 12 1 2
A histogram can be produced with
> hist(photons, breaks=seq(-1,6))
However, since this variable is a count it
is interesting to compare its distribution
with that of the Poisson distribution with
the same mean (2.08). The appropriate
diagrams are produced with the
commands:
> barplot(table(photons), xlab="photon count",
ylab="frequency", ylim=c(0,20))
>barplot(60*dpois(0:8,2.08),
names=0:8, xlab="photon count",
ylab="Poisson expected frequency",
ylim=c(0,20))
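The two diagrams can also be combined into a single grouped bar chart, which makes the comparison easier to read. This is one possible sketch, not the only way to draw it. It assumes the counts are rebuilt from the frequency table above (so their order differs from the original sequence), and it uses the unrounded sample mean rather than 2.08. The names observed, expected and counts are illustrative.

```r
# Rebuild the 60 counts from the frequency table given above
photons <- rep(0:6, times = c(8, 19, 13, 5, 12, 1, 2))

# Tabulate over 0..8 so that the empty categories 7 and 8 appear as zeros
observed <- as.vector(table(factor(photons, levels = 0:8)))

# Poisson expected frequencies with the same mean as the data
expected <- length(photons) * dpois(0:8, mean(photons))

# One row per series; beside = TRUE draws the bars in adjacent pairs
counts <- rbind(observed = observed, expected = expected)
colnames(counts) <- 0:8
barplot(counts, beside = TRUE, legend.text = TRUE,
        xlab = "photon count", ylab = "frequency", ylim = c(0, 20))
```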
The chi-squared distribution, $\chi^2$, can be
used to check whether there is a
significant difference between the
observed and the expected frequencies.
x            Observed (O)  Expected (E)  (O - E)  (O - E)^2/E
0                 8            7.50        0.50     0.03333
1                19           15.59        3.41     0.74587
2                13           16.21       -3.21     0.63566
3                 5           11.24       -6.24     3.46420
4                12            5.85        6.15     6.46538
5                 1            2.43       -1.43     0.84152
6 or more         2            1.18        0.82     0.56983
The sum of the last column is 12.7558.
This value of
$\chi^2_{\text{obs}} = \sum \frac{(O - E)^2}{E}$
can then be compared with tabulated values
of chi-squared for a particular number of
degrees of freedom (here 6).
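Since 12.7558 exceeds the tabulated 5% value
of 12.59 for 6 degrees of freedom, the two
distributions are significantly different at
the 5% level of significance, although not at
the 1% level (tabulated value 16.81).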
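The calculation in the table can be reproduced in R. This is a sketch under two assumptions: the Poisson mean is taken as 2.08 and the degrees of freedom as 6, both as quoted in the text. The names observed, expected, chi_sq_obs and critical are illustrative.

```r
observed <- c(8, 19, 13, 5, 12, 1, 2)  # counts for x = 0,...,5 and "6 or more"
n        <- sum(observed)              # 60 observations
lambda   <- 2.08                       # Poisson mean quoted in the text

# Poisson probabilities for 0..5, plus the upper tail P(X >= 6)
p        <- c(dpois(0:5, lambda), ppois(5, lambda, lower.tail = FALSE))
expected <- n * p

# Observed value of the statistic: sum of (O - E)^2 / E
chi_sq_obs <- sum((observed - expected)^2 / expected)

# Tabulated 5% value; df = 6 is the figure quoted in the text
critical <- qchisq(0.95, df = 6)

c(statistic = chi_sq_obs, critical = critical)
```

With unrounded expected frequencies the statistic comes out at about 12.78 rather than 12.7558, because the table rounds each expected value to two decimal places first. The tabulated 5% value for 6 degrees of freedom is about 12.59.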