Introduction to Statistics

Statistical Analysis
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland
Resources
• Rowntree, D. (1981) Statistics Without Tears.
Harmondsworth: Penguin.
• Hinton, P.R. (1995) Statistics Explained. London:
Routledge.
• Hatch, E.M. and Farhady, H. (1982) Research Design and
Statistics for Applied Linguistics. Rowley, Mass.: Newbury
House.
• Crawley, M.J. (2005) Statistics: An Introduction Using R.
Wiley.
• Gonick, L. and Smith, W. (1993) The Cartoon Guide to
Statistics. HarperResource (for fun).
Module Outline
• Day One Lectures
– Introduction
– Using R
– Probability (the laws of chance)
• Day Two Lectures
– Data analysis (the gathering, display, and summarisation of
data)
– Experimental design (planning and sampling)
– Statistical inference (the drawing of conclusions from your
data knowing probability)
– Data modelling (regression, ANOVA and ANCOVA)
Why the second day is important
• You don’t know which tests to use unless you know
how your data are structured, so you do data analysis.
• Your experimental design is based on what you know
beforehand of the data.
• Inference is the drawing of conclusions for your
research: what you can prove.
• Modelling tells you what more detailed conclusions are
supportable. This involves throwing out the factors that
are not important.
Data Analysis
• Central tendency
• Degrees of freedom
• Variance
• A worked example
• Confidence intervals
• Single sample
Measures of Central Tendency
• yvals<-read.table("yvalues.txt", header=T)
• attach(yvals)
• Create a histogram of the data: hist(y)
• Observe the mode, the most common value.
• Arithmetic mean is (sum of data values)/(number of values)
• total <- sum(y)
• n <- length(y)
• ybar <- total/n
• ybar
• mean(y)
Median
• The ‘middle value’
• ysorted<-sort(y)
• middleIndex<-ceiling(length(y)/2) (the middle position when length(y) is odd)
• ysorted[middleIndex]
• median(y) (averages the two middle values when length(y) is even)
• set<-c(1,10,1000,10,1)
• Geometric mean: exp(mean(log(set)))
• Harmonic mean: 1/mean(1/set)
• detach(yvals)
• ls()
• rm() any variables you don’t need
Measures of Spread
• In addition to describing the central point of a
data set, we’re concerned with the data spread.
• Two measures:
– Interquartile range (IQR)
– Standard deviation/variance
Interquartile Range
• Break the data into four equal groups:
– First through third quartiles
– The median is the second quartile, Q2
– The median of the low group is the first quartile or
Q1
– The median of the high group is Q3
– The IQR is Q3-Q1
Box and Whiskers Plot (Tukey)
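[Figure: box-and-whisker plot. The box runs from Q1 to Q3 (the IQR) with a line at the median. Points more than 1.5 x IQR outside the box are drawn as outliers.]
“Whiskers” extend to the furthest non-outlier in both directions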
Standard Deviation and Variance
• Standard measure of spread (called sd in R)
• Roughly, the typical distance of a value from the
mean. It is built from “squared” distances (remember
geometry?): average the squared deviations, then take
the square root. The square of the standard
deviation is the variance (called var in R).
• When sample data (count = N) are used to compute
estimates of both the mean and the variance, the latter
is computed by dividing by N-1. If the variance is
estimated by dividing by N, the result is biased low.
• The sample mean and standard deviation describe a
bell-shaped curve very well if N is at least 30.
• For N<30, the t distribution applies.
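A minimal sketch of the N-1 versus N point, using the same five values that appear on the "Using R for this" slide. The variable names are only illustrative.

```r
# Five values, taken from the "Using R for this" slide.
x <- c(3, 5, 7, 7, 38)
n <- length(x)

# Sum of squared deviations from the sample mean.
ss <- sum((x - mean(x))^2)

var_n_minus_1 <- ss / (n - 1)  # divides by N-1; this is what var() does
var_n         <- ss / n        # divides by N; always smaller, so biased low

c(n_minus_1 = var_n_minus_1, n = var_n, builtin = var(x))
sqrt(var_n_minus_1)            # the same number as sd(x)
```

Dividing by N gives the smaller number every time, which is the sense in which it is biased low.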
Using R for this
• data<-c(3,5,7,7,38)
• mean(data)
• sd(data)
• var(data)
• median(data)
• quantile(data)
• fivenum(data)
• boxplot(data)
Random Variables
• Imagine an experiment repeated many times.
The notation for a random variable is X.
• The notation for a single value of X is x.
• You can define central tendency and spread just
like you can for sample data. You can also
predict their values.
• R gives you basic functions to compute these.
Plotting a random variable
• hist(rbinom(10000,2,0.5)) (heads in two coin flips)
die<-c(1,2,3,4,5,6)
a<-numeric(10000)
for(i in 1:10000){
  a[i]<-sample(die,1,replace=TRUE,c(1,1,1,1,1,1))}
• hist(a,breaks=0:6+0.5) (die roll)
for(i in 1:10000){
  a[i]<-sample(die,1,replace=TRUE,c(1,1,1,1,1,1))+
    sample(die,1,replace=TRUE,c(1,1,1,1,1,1))}
• hist(a, breaks=0:12+0.5) (dice roll)
Mean and Variance of Random
Variables
• µ = sum over all possible values x of (x times the
probability of x)
• Note this involves area and can deal with continuous
probability like the normal distribution.
• This is the mean
• The variance, 2 is the sum over all x of ((x- µ)2 times
the probability of x)
• The standard deviation is .
Some Continuous Distributions
• The density function at x, multiplied by a small width
∆x, gives the probability of a sample X lying between
x and x+∆x.
• In R the density is named d followed by the
distribution's name, for example dnorm (dbinom is the
discrete counterpart). The integral of the density
curve is called the cumulative probability distribution.
So you get:
– dnorm, the density function
– pnorm, the cumulative probability function
– qnorm, the inverse of the cumulative probability function
– rnorm, to draw random numbers from the distribution
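A short sketch of the four functions for the standard normal distribution (mean 0, standard deviation 1, which are R's defaults). The seed is arbitrary and only makes the random draw repeatable.

```r
dnorm(0)        # height of the density curve at the mean, about 0.399
pnorm(1.96)     # P(X <= 1.96), about 0.975
qnorm(0.975)    # the value with 97.5% of the probability below it, about 1.96

# qnorm undoes pnorm, and vice versa.
round_trip <- pnorm(qnorm(0.975))

set.seed(1)         # arbitrary seed
draws <- rnorm(5)   # five random values from the standard normal
draws
```

The same d/p/q/r prefixes work for the other distributions in R, such as t (dt, pt, qt, rt) and binom.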
Excursus: Degrees of Freedom
• Suppose you have a sample of five numbers (2,7,4,0,7)
and their mean is 4. What is the sum of the five
numbers?
• If you know the mean and four of the numbers, how
many values can the fifth one have?
• This means that if you are calculating the sample
standard deviation and you have the sample mean, you
have one less data point than you think you do.
• df = sample size minus the number of parameters, p,
you’ve estimated from the data. (Memorize!)
• variance = (sum of squares)/(degrees of freedom)
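The argument can be checked in R with the five numbers from this slide. This is only a sketch; the variable names are illustrative.

```r
x <- c(2, 7, 4, 0, 7)
sum(x)                    # 20, i.e. 5 times the mean of 4

# Knowing the mean and the first four values fixes the fifth.
fifth <- 5 * 4 - sum(x[1:4])
fifth                     # 7: it can take only one value

# One parameter (the mean) was estimated, so df = n - 1.
ss <- sum((x - mean(x))^2)   # sum of squares
df <- length(x) - 1
ss / df                      # same as var(x)
```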
A Worked Example
• gardens.txt in Data
• Note that you can test whether two samples plausibly
have the same variance (the null hypothesis).
You do this by calculating the ratio of the variances
and applying the F test.
• In R, this is handled by applying var.test.
• Tests that compare means, such as Student’s t and
ANOVA, assume equal variance, so you must check this
first! If the F test tells you that you don’t have equal
variance, don’t go any further.
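A sketch of the variance-ratio test. The two vectors below are illustrative stand-ins typed in by hand, not values read from gardens.txt. The first was chosen to have mean 3.0 and standard error 0.365, matching the figures quoted on the confidence-interval slide.

```r
# Illustrative ozone readings (pphm); assumed values, not the actual file.
garden_a <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
garden_c <- c(3, 3, 2, 1, 10, 4, 3, 11, 3, 10)

# The ratio of the two sample variances is the F statistic.
f_ratio <- var(garden_a) / var(garden_c)

# var.test() computes the same ratio and its two-sided p value.
res <- var.test(garden_a, garden_c)
res

# A small p value says the variances differ, so a test that
# assumes equal variance should not be applied to this pair.
res$p.value < 0.05
```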
Confidence Intervals
• Variance is used for testing hypotheses and for
establishing confidence intervals (measures of
unreliability)
• You want your measure of unreliability to
– Go up if variance increases
– Go down if the sample size increases
• SE (standard error) = sqrt(s²/n), where s² is the sample variance, has those properties.
• You write this as:
– “the mean ozone concentration in Garden A was 3.0+/-0.365
pphm (1 s.e., n=10)”
More on Confidence Intervals
• You can use the assumption of a normal distribution if
n>= 30, but if you have a smaller sample, you usually
use Student’s t-distribution.
• For the quantiles of this distribution, use qt()
• For a 95% confidence interval (alpha = 0.05, two-sided),
use the t quantile at 0.975. With 9 degrees of freedom,
qt(0.975,9) = 2.262 standard errors. For 99% and 99.5%
intervals, qt(0.995,9) = 3.249836 and qt(0.9975,9) = 3.689662.
• For Garden B (small sample)
– “the mean ozone concentration in Garden B was 5.0+/-0.826
pphm (95% C.I., n = 10).”
• There is a better way—bootstrapping—but it’s complex.
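A sketch of how the standard error and the 95% interval are computed for a small sample. The vector is an illustrative stand-in chosen to have mean 5.0 and n = 10 like the Garden B figures above; it is not read from the data file.

```r
# Illustrative sample (pphm); assumed values with mean 5 and n = 10.
garden_b <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)
n <- length(garden_b)

se <- sqrt(var(garden_b) / n)      # standard error, about 0.365
t_crit <- qt(0.975, df = n - 1)    # about 2.262 for 9 degrees of freedom
half_width <- t_crit * se          # about 0.826

c(mean = mean(garden_b),
  lower = mean(garden_b) - half_width,
  upper = mean(garden_b) + half_width)
```

The interval narrows if the variance goes down or n goes up, which is the behaviour asked for on the previous slide.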
Single Sample
• Questions to answer:
– What is the mean value?
– Is the mean value significantly different from expectation or
theory?
– What is the level of uncertainty associated with our estimate
of the mean?
• To be reasonably certain, we need to know if the data
are normally distributed, have outliers, or show serial
correlation.
Worked Example
• Load das.txt and follow me.
• summary()
• plot()
• boxplot()
• hist()
Normal Distribution
• According to the central limit theorem, if you
take many samples from a population and take
their means, the means will be approximately
normally distributed, provided each sample is
reasonably large.
• Why this works is deep math.
• The quantiles of the normal distribution are
calculated by qnorm()
• Examples from book (55ff)
Testing Normality
• A normal distribution is very easy to use, but you need
to check first.
• Use qqnorm() and qqline()
• Examples (y)
• Examples (speed)
• Note non-normality. To test a mean when the
distribution is non-normal, you don’t use Student’s t.
Instead you use Wilcoxon’s signed rank test.
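• library(ctest) (only needed in old versions of R; wilcox.test is now in the stats package, which loads by default)
• wilcox.test(speed, mu=990)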
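A sketch of the same check on simulated data, since the speed data set is not reproduced here. The sample is hypothetical: 40 right-skewed values drawn from an exponential distribution, with an arbitrary seed.

```r
set.seed(42)                          # arbitrary seed
skewed <- rexp(40, rate = 1 / 1000)   # hypothetical non-normal sample

# Points that bend away from the line indicate non-normality.
qqnorm(skewed)
qqline(skewed, lty = 2)

# Wilcoxon's signed rank test of the location against 990.
res <- wilcox.test(skewed, mu = 990)
res$p.value
```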
Student’s t
• Use if the sample size is <30 and the data are
normally distributed.
• Use pt instead of pnorm; qt instead of qnorm
• Examples from book (67ff)
Test Statistics for the Mean
• If you have 30 or more observations (n), the distribution of
(X-µ)/(s/√n) is approximately normal, where X is the sample
mean and s the sample standard deviation. You can test
whether the mean you computed (X) is significantly
different from µ by calculating the probability of a value
that extreme.
• If you have fewer than 30 observations, (X-µ)/(s/√n) follows
Student’s t distribution with n-1 degrees of freedom, and
you need to use that instead.
• Guess why ‘30’ is important…
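A sketch of the test statistic computed by hand and checked against t.test(). The sample and the hypothesised mean mu0 = 4 are made up for illustration.

```r
x <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)   # hypothetical sample, n = 10
mu0 <- 4                               # hypothetical value to test against
n <- length(x)

# (sample mean - mu0) / (s / sqrt(n))
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# Two-sided p value from Student's t with n - 1 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

check <- t.test(x, mu = mu0)           # should agree with the hand calculation
c(by_hand = t_stat, t.test = unname(check$statistic))

# One way to look at the '30' question: by 29 degrees of freedom
# the t quantile (about 2.045) is already close to the normal one (about 1.96).
c(t_29 = qt(0.975, df = 29), normal = qnorm(0.975))
```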
Comparing two samples
• To compare two variances, use Fisher’s F test,
var.test(). Do this first!
• For comparing sample means with normal errors,
Student’s t test, t.test() (can be used for paired data)
• For comparing two samples with nonnormal errors,
Wilcoxon’s rank sum test, wilcox.test()
• For proportions, use the binomial test, binom.test()
(binary data) or prop.test() (binomial proportions)
• For independence in contingency tables, chi-square
test, chisq.test(), or Fisher’s exact test, fisher.test()
• For two correlated variables, cor.test()
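A sketch of the "variances first, then means" order for two samples. Both vectors are illustrative stand-ins entered by hand, not the course data.

```r
# Hypothetical samples with equal spread but different means.
garden_a <- c(3, 4, 4, 3, 2, 3, 1, 3, 5, 2)
garden_b <- c(5, 5, 6, 7, 4, 4, 3, 5, 6, 5)

# Step 1: compare the variances with Fisher's F test.
vt <- var.test(garden_a, garden_b)
vt$p.value        # large here, so no evidence the variances differ

# Step 2: compare the means with Student's t test.
# var.equal = TRUE requests the classic pooled-variance test;
# the default in R is the Welch version, which does not assume it.
tt <- t.test(garden_a, garden_b, var.equal = TRUE)
tt
```

For paired measurements, t.test() takes paired = TRUE.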
Two Sample Examples
• Follow me on these. (73ff)
Using 2
• Lots of statistical data are in the form of counts
• Contingency tables show all the possible occurrences
in a sample.
             Blue eyes   Brown eyes
Fair hair        38          11
Dark hair        14          51
The question is are these statistically different?
Completing the table
                Blue eyes   Brown eyes   Row totals
Fair hair           38          11           49
Dark hair           14          51           65
Column totals       52          62          114
Computing the probability of fair
hair and blue eyes
• If and only if the two traits are independent, the probability
of the combination will equal the product of the probabilities of
the individual cases.
• That gives an expected count of (49/114) x (52/114) x 114, or
about 22 cases.
• Since the cell value is 38, the assumption of independence is at
risk
• What is the probability of the observed frequencies occurring by
chance?
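A sketch of the expected-count calculation for the hair and eye colour table. The row and column labels are added here only for readability.

```r
count <- matrix(c(38, 14, 11, 51), nrow = 2,
                dimnames = list(hair = c("fair", "dark"),
                                eyes = c("blue", "brown")))
n <- sum(count)                        # 114

p_fair <- sum(count["fair", ]) / n     # 49/114
p_blue <- sum(count[, "blue"]) / n     # 52/114

# Under independence, P(fair and blue) = P(fair) * P(blue).
expected_fair_blue <- p_fair * p_blue * n   # about 22.35

# The same calculation for every cell at once.
expected <- outer(rowSums(count), colSums(count)) / n
expected
```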
The 2 Test
• The degrees of freedom in a contingency table equal (r-1)x(c-1), where
r and c are the numbers of rows and columns.
• Here, df = 1.
• What certainty level do you want? 95% is typical.
• qchisq(0.95,1) = 3.841459
• count<-matrix(c(38,14,11,51),nrow=2)
• The data should be entered columnwise (like before)
• To test, chisq.test(count)
• Here, the association between fair hair and blue eyes is highly
significant.
• If any expected frequencies are less than 5, use Fisher’s exact test
instead, fisher.test(count), or combine cells.
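A sketch that runs the steps on this slide in order. Note that for a 2x2 table chisq.test() applies Yates' continuity correction by default.

```r
count <- matrix(c(38, 14, 11, 51), nrow = 2)   # entered columnwise

# Critical value for 95% certainty with (2-1) x (2-1) = 1 degree of freedom.
crit <- qchisq(0.95, df = 1)                   # 3.841459

res <- chisq.test(count)
res

# Reject independence if the statistic exceeds the critical value.
unname(res$statistic) > crit

# The expected counts used by the test; check that none is below 5.
res$expected

# Fisher's exact test, for when expected counts are small.
fisher.test(count)$p.value
```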
Summary
• We have seen ways of
– Describing data
– Testing single sample data against null hypotheses
– Testing two sample data against null hypotheses
Experimental Design
Lecture Outline
• Experimental Design
– The process of defining how to collect data that will
allow you to falsify a hypothesis.
– How to do it.
– Replication
– Randomization
Categorical variables
• These take discrete values.
• A complete experimental design investigates
every combination. This is called a factorial
design. This is required for reliable results.
• For example, if you have two categorical
variables, A and B, with two states, 1 and 2,
each, you have to explore A1B1, A1B2, A2B1,
and A2B2.
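As a sketch, expand.grid() lists every combination of the factor levels, which is the set of runs a factorial design needs.

```r
# Two categorical variables, two states each.
design <- expand.grid(A = c("A1", "A2"), B = c("B1", "B2"))
design          # four rows: A1B1, A2B1, A1B2, A2B2
nrow(design)    # 4
```

Adding a third two-state variable would double this to eight combinations.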
Continuous Variables
• You have to sample at multiple values.
• For example, if an explanatory variable ranges
between 1 and 10, you should run an
experiment at 1 and another at 10, and a few
between.
• This converts the continuous variable into a
categorical variable.
Sampling
• You may not be able to control the values of the
categorical and continuous variables. In natural
experiments, you need to sample randomly.
• The goal of random sampling is to move
systematic response into the error term
• Take care to avoid systematic sampling. If
necessary, flip a coin or generate a random
number.
Replication
• This means you repeat a measurement with a
specific value of a categorical and/or continuous
explanatory variable.
• This allows you to assess natural variability and
measurement error.
• In many experiments, 30 replications is about
the maximum necessary. Fewer may have to be
accepted, but then take care in your analysis.
Randomization
• You randomize to eliminate systematic errors.
• Avoid correlating your measurements in time
and space.
• Avoid doing things that might introduce
systematic effects.
• Avoid allowing your judgment to affect when,
where, and with what/whom you do a given
experiment. Assign treatments randomly.
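A sketch of letting R assign treatments, so that your own judgment plays no part. The unit names, treatment names, and group sizes are hypothetical, and the seed is arbitrary.

```r
set.seed(2024)                        # arbitrary seed, so the plan is repeatable
units <- paste0("unit", 1:12)         # twelve hypothetical experimental units
treatments <- rep(c("control", "A", "B"), each = 4)

# sample() with no size argument returns a random permutation.
design <- data.frame(unit = units, treatment = sample(treatments))
design
table(design$treatment)               # still four units per treatment
```

The same idea can be used to randomize the order in which the runs are carried out.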
The Design
• The elements of an experimental design are the
experimental units.
• The treatments are assigned to the units. (Note
that this translates continuous variables to
categorical ones).
• The objective of the design is to compare the
treatments.
Local Control
• Consider ways to reduce natural variability.
• One way is to group similar experimental units
into blocks.
• Running all treatments on all blocks produces a
complete randomised block design.
• If you have enough subjects, you can repeat the
design. This increases replication.
Analyzing the Results
• You allocate total variability among the
different sources: each factor, systematic
effects, and natural variability/measurement
error.
• This is done using analysis of variance.
Statistical Inference
Statistical Inference
• Statistical inference is the drawing of conclusions from
specific data knowing probability.
• Basically, you are assessing how well a
hypothesis stands up to your data.
• A null hypothesis is plausible but probably not true.
• You show this by demonstrating that the probability of
the data you collected being generated if the null
hypothesis were true is very small.
• This is called ‘falsifying a hypothesis’.
How Do You Falsify a Hypothesis?
• Discuss
The Null Hypothesis
• You start with a null hypothesis—a statistical statement that you
intend to show is very unlikely.
• This is usually that the observations are due to chance
• Testing can involve the mean, the variance, or a comparison
between two (or more) samples where one has a treatment and
the other doesn’t.
The Test Statistic
• This will be a statistic that assesses the evidence
against the null hypothesis.
• Its reference distribution may be normal (continuous
data) or binomial (coin flipping), or the comparison
may be to a second experiment with the
treatment missing.
Calculating the p value
• This is the probability of getting results at least as
extreme as yours, assuming the null hypothesis is true.
Compare the p-value to a fixed
significance level, α
• α is the probability you are accepting of rejecting
a null hypothesis that is actually true. 0.05, 0.01,
and 0.001 are typical.
• Choose α before calculating p. Otherwise you’re
cheating.
Large sample significance test for
proportions
• The null hypothesis corresponds to a binomial
distribution with some probability p0 of the coin
coming up heads.
• The alternate hypothesis depends on the direction of
the effect (p greater than or smaller than p0 or both)
• The test statistic is
– z = (p_exp-p0)/√(p0(1-p0)/n), where p_exp is the observed proportion
• For large n, this has approximately the standard normal distribution
• You can use qnorm() to calculate the values that
correspond to your significance level
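A sketch of the calculation with made-up numbers: 60 heads in 100 flips, tested against a fair coin at a significance level of 0.05.

```r
heads <- 60      # hypothetical count
n     <- 100     # hypothetical number of flips
p0    <- 0.5     # null hypothesis: a fair coin
alpha <- 0.05    # chosen before looking at the result

p_exp <- heads / n
z <- (p_exp - p0) / sqrt(p0 * (1 - p0) / n)   # 2 for these numbers

crit <- qnorm(1 - alpha / 2)    # two-sided critical value, about 1.96
reject <- abs(z) > crit

p_value <- 2 * pnorm(-abs(z))   # two-sided p value
c(z = z, critical = crit, p = p_value)
reject
```

For a one-sided alternative, use qnorm(1 - alpha) and a one-tailed p value instead.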
Tests for the population mean
• The test statistic for the sample mean X is
– (X-µ0)/(s/√n)
• For n>= 30, this is normally distributed
• Apply qnorm() to calculate the values for the
significance level.
• If n<30, use Student’s t, qt().
Comparing Samples
• Discussed earlier
– First compare the variances using Fisher’s F test.
– If they’re not significantly different, then compare
the means.
• Back to the gardening example
Data Modelling
Modelling
• This is the process of defining a minimal model
for the data.
• There are five kinds of models
– Null model
– Minimal adequate model
– Current model
– Maximal model
– Saturated model
Parsimony
• Prefer:
– A model with fewer parameters
– A model with fewer explanatory variables
– A linear model
– A model without a hump
– A model without interactions
Regression Analysis
• Handles continuous data.
• Fits a linear combination of explanatory
variables to the data.
– y = ax+b.
• x is the independent or predictor variable
• y is the dependent or response variable.
• This says the value of y is equal to ax+b plus an
error term.
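A sketch of fitting y = ax+b with lm() on simulated data. The true values a = 2 and b = 5, the noise level, and the seed are all made up for illustration.

```r
set.seed(1)                            # arbitrary seed
x <- 1:20
y <- 2 * x + 5 + rnorm(20, sd = 1)     # hypothetical a = 2, b = 5, plus error

fit <- lm(y ~ x)    # response ~ predictor
coef(fit)           # estimated intercept (b) and slope (a)
summary(fit)        # standard errors, t tests, and R-squared
```

plot(x, y) followed by abline(fit) draws the data with the fitted line.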
Example
• We’ll work an example using R. (125ff)
ANOVA
• When we’re working with categorical variables, we
use analysis of variance. The best model for the data is
the one that minimizes the average error term. You
minimize that by minimizing the error variance.
• Worked example (155ff)
• One categorical variable results in one-way ANOVA.
• N categorical variables result in N-way ANOVA, in which
we also consider interactions between variables.
ANCOVA
• A mix of variables results in a mixed approach
called analysis of covariance.
• Example. (187ff)
Simplifying the Model
• You start with a model containing all variables
and interactions and remove one by one the
ones that aren’t significant. If deleting a term
causes no significant increase in deviance, leave
it out; otherwise put it back.
• Example (103ff)
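A sketch of stepwise deletion with update() and anova() on simulated data. The data are made up so that x2 and the interaction have no real effect; the seed is arbitrary.

```r
set.seed(3)                              # arbitrary seed
n  <- 60
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 3 * x1 + rnorm(n, sd = 0.5)    # hypothetical: only x1 matters

# Maximal model: both variables and their interaction.
full <- lm(y ~ x1 * x2)

# Remove the interaction and compare the two fits.
no_int <- update(full, . ~ . - x1:x2)
anova(no_int, full)     # a large p value supports leaving the term out

# Then try removing x2 from the simpler model.
no_x2 <- update(no_int, . ~ . - x2)
anova(no_x2, no_int)

formula(no_x2)          # y ~ x1, if both deletions are accepted
```

Each anova() call is the test of whether that deletion is acceptable. If a p value comes out small, keep the term.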
Other Actions
• You can transform the response and explanatory
variables.
• You need to consider
– Constancy of variance
– Normality of errors
– Additivity
• Check your models!
Summary
• There’s a lot more to statistics. We’d need about four
times as much time to cover introductory statistics
adequately
• Crawley is a good reference if you’re planning to do a
statistical analysis.
• If your analysis has any complexity, consult a working
statistician. I am an experimental scientist, not a
working statistician, but I do run a weekly statistical
surgery. This semester, it meets in DGIC 109 from 2-3
pm on Wednesdays.
• Good luck!