Introduction to Statistics
Single Samples
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Crawley, MJ (2005) Statistics: An Introduction
Using R. Wiley.
• Gentle, JE (2002) Elements of Computational
Statistics. Springer.
• Gonick, L., and Smith, W. (1993) The Cartoon
Guide to Statistics. HarperResource (for fun).
Questions of Interest About Single Samples
• What is the mean value?
• Is the mean value significantly different from
expectation or theory?
• What is the level of uncertainty associated with
the estimate of the mean value?
Facts Needed for Answers
• Are the values normally distributed (bell-shaped) or not?
• Are there outliers in the data?
• If the data were collected over a period of time,
is there evidence for serial correlation?
To use standard parametric tests, you need
normal data, without outliers, and without
serial correlation.
Data Summary
> data <- read.table("das.txt", header = TRUE)
> names(data)
[1] "y"
> attach(data)
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.904   2.241   2.414   2.419   2.568   2.984
> plot(y)
plot(y)
Querying your data
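> y[50] <- 21.79386    # put a misplaced decimal point into observation 50
> plot(y)              # the bad value stands out in the index plot
> which(y > 10)        # find the position of the outlier
[1] 50
> y[50] <- 2.179386    # put the correct value back
> boxplot(y, ylab = "data values")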
Results
Normal Distribution
• The Central Limit Theorem implies that anything
produced by adding a large number of independent random values
(such as the mean) is approximately normally distributed.
– dnorm(z) is the density of the standard normal distribution, with mean 0.0 and
standard deviation (i.e., √variance) of 1.0. (z here is the
standard unit for the normal distribution)
– pnorm(x) is the probability of a z value of x or less.
– qnorm(c(p1,p2)) gives the values of z whose
cumulative probabilities are p1 and p2.
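A minimal sketch of these points in R, using simulated values rather than the lecture's data set. The seed, the sample sizes, the choice of a uniform distribution, and all variable names are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

# Central Limit Theorem: single uniform values are not bell-shaped,
# but means of 30 of them pile up in a roughly normal shape.
sample_means <- replicate(5000, mean(runif(30)))
hist(sample_means, main = "Means of 30 uniform values", xlab = "sample mean")

# Standard normal distribution: density, cumulative probability, quantiles.
density_at_zero <- dnorm(0)            # height of the curve at z = 0
prob_below <- pnorm(1.96)              # P(Z <= 1.96), about 0.975
z_limits <- qnorm(c(0.025, 0.975))     # z values cutting off 2.5% in each tail
print(c(density_at_zero, prob_below, z_limits))
```

Note that qnorm is the inverse of pnorm, so the second element of z_limits is close to the 1.96 passed to pnorm.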
Plots for Testing Normality
• The simplest and often the best test of normality is the
quantile-quantile plot
– qqnorm(y)
– qqline(y, lty=2)
• If the resulting plot shows a marked S-shape, it
indicates non-normality. You’ve already seen this
demonstrated.
• If the data are non-normal, use Wilcoxon's signed rank
test (wilcox.test) rather than Student's t-test (t.test)
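As a rough illustration of the plots and tests named above, the sketch below compares a simulated normal sample with a simulated heavy-tailed one. Both samples, the seed, and the hypothesised values passed as mu are assumptions made for the example; they are not the lecture's data.

```r
set.seed(2)  # arbitrary seed for repeatability

y_norm  <- rnorm(100, mean = 2.4, sd = 0.25)  # simulated normal sample
y_heavy <- rt(100, df = 2)                    # simulated heavy-tailed sample

# Quantile-quantile plots: points near the dashed line suggest normality,
# a marked S-shape suggests heavy (or light) tails.
par(mfrow = c(1, 2))
qqnorm(y_norm, main = "Normal sample")
qqline(y_norm, lty = 2)
qqnorm(y_heavy, main = "Heavy-tailed sample")
qqline(y_heavy, lty = 2)

# Student's t-test for the normal sample (hypothesised mean 2.4).
t_res <- t.test(y_norm, mu = 2.4)

# Wilcoxon's signed rank test for the non-normal sample (hypothesised centre 0).
wilcox_res <- wilcox.test(y_heavy, mu = 0)

print(c(t_test_p = t_res$p.value, wilcoxon_p = wilcox_res$p.value))
```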
Inference
• Demonstration with speed of light data
• Another way to test this is bootstrapping
– Demonstration
• Demonstration of Student's t
– dt(z,df)
– pt(z,df)
– qt(c(p,q),df)
• Comparison between Student's t and normal
distributions.
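The demonstrations themselves are not reproduced in this transcript. As a rough stand-in, the sketch below uses a small simulated sample (not the speed of light data; the sample, the seed, and the variable names are assumptions) to compare Student's t with the normal distribution and to set a t-based confidence interval for the mean beside a simple percentile bootstrap interval.

```r
set.seed(42)  # arbitrary seed for repeatability

y <- rnorm(20, mean = 2.4, sd = 0.3)  # simulated stand-in sample (assumption)
n <- length(y)
df <- n - 1

# Student's t has heavier tails than the normal, so its critical
# values are further from zero, especially for small df.
normal_limits <- qnorm(c(0.025, 0.975))
t_limits <- qt(c(0.025, 0.975), df)

# 95% confidence interval for the mean based on Student's t.
t_ci <- mean(y) + t_limits * sd(y) / sqrt(n)

# Percentile bootstrap: resample the data with replacement many times
# and look at the spread of the resampled means.
boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))

print(rbind(normal = normal_limits, t = t_limits))
print(rbind(t_interval = t_ci, bootstrap = as.numeric(boot_ci)))
```

The t-based interval computed by hand here should agree with the one reported by t.test(y).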
Skew
• Dimensionless version of the third moment
about the mean.
$m_3 = \sum (y - \bar{y})^3 / n$
$s_3 = (\sqrt{s^2})^3$
$\text{skew} = \gamma_1 = m_3 / s_3$
• Measures the extent to which the distribution
has a tail on one or the other side.
• Demo of skew test.
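The demo is not reproduced in this transcript. A minimal sketch of the calculation follows, with a hypothetical helper skew() written directly from the definitions above. The rough test at the end assumes sqrt(6/n) as an approximate standard error and n - 2 degrees of freedom; both are assumptions for illustration, as is the simulated sample.

```r
# Skew: third moment about the mean divided by the cube of the
# standard deviation (hypothetical helper, not a base R function).
skew <- function(y) {
  m3 <- sum((y - mean(y))^3) / length(y)
  s3 <- sqrt(var(y))^3
  m3 / s3
}

set.seed(7)                      # arbitrary seed for repeatability
y <- rexp(50)                    # simulated right-skewed sample (assumption)

g1 <- skew(y)
se_g1 <- sqrt(6 / length(y))     # approximate standard error (assumption)
t_value <- g1 / se_g1
p_value <- 2 * pt(-abs(t_value), df = length(y) - 2)  # two-sided; df is an assumption

print(c(skew = g1, t = t_value, p = p_value))
```

A positive value indicates a longer tail to the right, a negative value a longer tail to the left.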
Kurtosis
• Dimensionless version of the fourth moment
about the mean.
$m_4 = \sum (y - \bar{y})^4 / n$
$s_4 = (s^2)^2$
$\text{kurtosis} = \gamma_2 = m_4 / s_4 - 3$
• Measures the extent to which the distribution is
peaky or flat-topped.
• Demo of kurtosis test.
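The demo is not reproduced in this transcript. A minimal sketch of the calculation follows, with a hypothetical helper kurtosis() written directly from the definitions above. The rough test assumes sqrt(24/n) as an approximate standard error and compares the ratio with a standard normal; both choices, and the simulated flat-topped sample, are assumptions for illustration.

```r
# Kurtosis: fourth moment about the mean divided by the squared
# variance, minus 3 so that a normal distribution scores about zero
# (hypothetical helper, not a base R function).
kurtosis <- function(y) {
  m4 <- sum((y - mean(y))^4) / length(y)
  s4 <- var(y)^2
  m4 / s4 - 3
}

set.seed(11)                     # arbitrary seed for repeatability
y <- runif(200)                  # simulated flat-topped sample (assumption)

g2 <- kurtosis(y)
se_g2 <- sqrt(24 / length(y))    # approximate standard error (assumption)
z_value <- g2 / se_g2
p_value <- 2 * pnorm(-abs(z_value))   # two-sided, normal reference (assumption)

print(c(kurtosis = g2, z = z_value, p = p_value))
```

Negative values indicate a flatter top than the normal distribution, positive values a sharper peak with heavier tails.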
Conclusions
• A generalisation of these individual tests is the
Kolmogorov-Smirnov test (ks.test), which is
usually used to compare two distributions.
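• If the variance is ill-behaved (for example, because of
outliers), skew and kurtosis are worse, because they depend
on higher powers of the deviations from the mean.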
• We've seen ways of testing for normality and
outliers. Serial correlation will be discussed
when we learn about analysis of variance.
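A minimal sketch of ks.test in both of its forms, using simulated samples (the data, the seed, and the variable names are assumptions). In the one-sample call the normal parameters are estimated from the same data, so the reported p-value should be treated as approximate.

```r
set.seed(3)  # arbitrary seed for repeatability

y <- rnorm(60, mean = 2.4, sd = 0.25)   # simulated normal sample
x <- runif(60, min = 1.9, max = 3.0)    # simulated non-normal sample

# One sample: compare y with a normal distribution that has
# the same mean and standard deviation as the sample.
one_sample <- ks.test(y, "pnorm", mean = mean(y), sd = sd(y))

# Two samples: compare the empirical distributions of y and x.
two_sample <- ks.test(y, x)

print(c(one_sample_p = one_sample$p.value, two_sample_p = two_sample$p.value))
```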