Introduction to Statistics
Central Tendency
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, M.J. (2005) Statistics: An Introduction Using R. Wiley.
• Gonick, L., and Smith, W. (1993) The Cartoon Guide to Statistics. HarperResource (for fun).
Clustering
• Most data cluster around an intermediate value.
• If the values you measure are in fact the sum of many independent random variables, this clustering can be proved to occur.
• This is the Central Limit Theorem: the sum of a large number of independent random variables has an approximately normal (bell-shaped) distribution.
• In particular, this is why estimates of the mean (or ‘average’) obtained from repeated experiments are approximately normally distributed (see the R sketch under ‘Example: Normal Distribution’ below).
Example: Normal Distribution
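A minimal R sketch of this kind of example, assuming the aim is to illustrate the Central Limit Theorem: each simulated value is the sum of 20 independent uniform random variables, and the histogram of those sums is approximately bell-shaped. The sample size (1000) and the number of summands (20) are arbitrary illustrative choices.

# Central Limit Theorem demonstration (illustrative values only):
# each observation is the sum of 20 independent uniform(0, 1) variables.
set.seed(1)                               # reproducible example
sums <- replicate(1000, sum(runif(20)))   # 1000 sums of 20 uniforms
hist(sums)                                # approximately bell-shaped
qqnorm(sums)                              # points fall close to a straight line
qqline(sums, lty = 2)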
Other Measures of Clustering
• The median is the middle value of a sample or a
distribution.
• The mode is the most frequent value in a sample
or a distribution.
• These can be convenient to use, especially if the data are not normally distributed (see the R sketch below).
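A minimal R sketch of these measures, using made-up data. median() is a base R function; base R has no function for the statistical mode, so the table()-based line is one common workaround for discrete data.

y <- c(2, 3, 3, 3, 5, 7, 9)        # illustrative sample
median(y)                          # middle value: 3
names(which.max(table(y)))         # most frequent value (mode): "3"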
Application to Experimental Design
• One way to disprove a null hypothesis:
– show that the mean (average) of your experimental data is so far from the mean implied by the null hypothesis that a difference this large would occur only with very small probability.
– You first need to show that your data are (approximately) normally distributed to be able to estimate this probability (see the sketch below).
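As a hedged sketch, one common way to make this comparison in R for approximately normal data is a one-sample t-test; the hypothesised mean of 0 and the simulated data are purely illustrative.

set.seed(2)
y <- rnorm(30, mean = 0.8)   # simulated experimental data
t.test(y, mu = 0)            # small p-value: the observed mean is unlikely under the null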
To Check the Data are Normal
yvals <- read.table("c:\\wherever\\yvalues.txt", header = TRUE)  # read the data file
attach(yvals)         # make the column y available by name
hist(y)               # histogram: roughly bell-shaped if normal
qqnorm(y)             # normal quantile-quantile plot
qqline(y, lty = 2)    # points close to the dashed line suggest approximate normality
What it Looks Like
Normal Data
y <- rnorm(1000)      # 1000 draws from a standard normal distribution
hist(y)
qqnorm(y)
qqline(y, lty = 2)
Appearance of Normal Data
Non-Normal Data
y <- seq(0, 1.0, 0.001)   # evenly spaced values on [0, 1]: uniform, not normal
hist(y)
qqnorm(y)
qqline(y, lty = 2)
Appearance of Non-Normal Data
Geometric Mean
• This is used when the data are generated as the product
rather than the sum of independent random variables.
An example might be a series of risks, each being the
product of a rate, a probability of success, and an
estimate of the consequences.
• The geometric mean of n values y₁, …, yₙ is calculated as
(∏ yᵢ)^(1/n),
i.e. the n-th root of the product of the values.
• In R, you calculate this as exp(mean(log(data)))
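A short R sketch with made-up values, checking that exp(mean(log(data))) matches the product formula (the data must be positive for the logarithm).

data <- c(10, 1, 1000, 100)   # illustrative positive values
exp(mean(log(data)))          # geometric mean via logs: 31.62278
prod(data)^(1/length(data))   # same value from the product formula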
Harmonic Mean
• If your concern is not the absolute value of the
random variables, but rather their ratios, the
mean of interest is the harmonic mean. An
example might be current population relative to
the ‘carrying capacity’ of a region.
• This is the ‘reciprocal of the average of the
reciprocals’.
• To calculate this in R, use 1/mean(1/data).
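A short R sketch with made-up values, comparing 1/mean(1/data) with the explicit ‘n over the sum of the reciprocals’ form.

data <- c(1, 2, 4)            # illustrative non-zero values
1/mean(1/data)                # harmonic mean: 1.714286
length(data)/sum(1/data)      # equivalent: n divided by the sum of reciprocals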
R Demonstrations of all this…
• From the book.