3. Introductory Statistical Principles
Sihua Peng, PhD
Shanghai Ocean University
2016.9
Contents
1. Introduction to R
2. Data sets
3. Introductory Statistical Principles
4. Sampling and experimental design with R
5. Graphical data presentation
6. Simple hypothesis testing
7. Introduction to Linear models
8. Correlation and simple linear regression
9. Single factor classification (ANOVA)
10. Nested ANOVA
11. Factorial ANOVA
12. Simple Frequency Analysis
3. Introductory Statistical Principles
Statistics is a branch of the mathematical sciences that relates to the collection, analysis, presentation and interpretation of data, and is therefore central to most scientific fields.
Four Fundamental Terms
Fundamental to statistics is the concept that samples are collected and statistics are calculated to estimate populations and their parameters.
Terminology
The population parameters are the characteristics (such as the population mean, variability, etc.) of the population.
Since it is usually not possible to observe an entire population, the population parameters must be estimated from corresponding statistics calculated from a subset of the population known as a sample.
Population, Target population, and Sample
Sample observations are drawn randomly from populations. A target population is a particular subgroup of a population.
3.1 Distributions
• The set of observations in a sample can be represented by a sampling or frequency distribution.
• A frequency distribution (or just distribution) represents how often observations in certain ranges occur (see Figure 3.1a).
Discrete Probability Distributions
• For discrete random variables, the probability distribution is fully defined by the probability mass function.
Bernoulli Distribution
• The binary random variable X with possible values 0 and 1 has a Bernoulli distribution with parameter θ, where P(X = 1) = θ and P(X = 0) = 1 − θ. We denote this as X ∼ Bernoulli(θ), where 0 ≤ θ ≤ 1.
Plot of Bernoulli(0.8) distribution
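Such a plot can be reproduced in base R; a minimal sketch, using the fact that dbinom() with size = 1 gives the Bernoulli probability mass function:
## Bernoulli(0.8): P(X = 0) = 0.2 and P(X = 1) = 0.8
theta <- 0.8
p <- dbinom(0:1, size = 1, prob = theta)
barplot(p, names.arg = c("0", "1"), xlab = "x", ylab = "P(X = x)",
        main = "Bernoulli(0.8) distribution")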
Bernoulli trial
• In probability theory and statistics, a Bernoulli trial (or binomial trial) is a random experiment with exactly two possible outcomes, “success” and “failure”, in which the probability of success is the same every time the experiment is conducted. It is named after Jacob Bernoulli, a Swiss mathematician of the 17th century.
Binomial Distribution
The random variable representing the number of times the outcome of interest occurs in n Bernoulli trials (i.e., the sum of Bernoulli trials) has a Binomial(n, θ) distribution, where θ is the probability of the outcome of interest (a.k.a. the probability of success).
A binomial distribution is defined by the number of Bernoulli trials n and the probability of the outcome of interest θ for the underlying Bernoulli trials.
Binomial Distribution
The probability of observing k successes in n trials is given by the probability mass function
$$P(X = k) = \binom{n}{k}\,\theta^k (1 - \theta)^{n-k}, \qquad k = 0, 1, \ldots, n.$$
• Plot of Binomial(50, 0.8) distribution
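A sketch of how such a plot can be produced in base R, using dbinom() for the binomial probability mass function:
## Binomial(50, 0.8): probability of k successes in 50 trials
k <- 0:50
plot(k, dbinom(k, size = 50, prob = 0.8), type = "h",
     xlab = "Number of successes k", ylab = "Probability",
     main = "Binomial(50, 0.8) distribution")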
Poisson Distribution
• In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
• A Poisson distribution is specified by a parameter λ, which is interpreted as the rate of occurrence within a time period or space limit. We show this as X ∼ Poisson(λ), where λ is a positive real number (λ > 0).
• The mean and variance of a random variable with a Poisson(λ) distribution are the same and equal to λ. That is, μ = λ and σ² = λ.
Poisson Distribution
• Probability of events for a Poisson distribution
• An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated λ. Lambda is the event rate, also called the rate parameter. The probability of observing k events in an interval is given by the equation
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$
Poisson Distribution
• Plot of Poisson(2.5) distribution
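A sketch reproducing such a plot with dpois(), together with a simulation check that the sample mean and variance are both approximately λ:
## Poisson(2.5) probability mass function
k <- 0:12
plot(k, dpois(k, lambda = 2.5), type = "h",
     xlab = "Number of events k", ylab = "Probability",
     main = "Poisson(2.5) distribution")
## For simulated Poisson data, the sample mean and variance are both close to lambda
x <- rpois(10000, lambda = 2.5)
mean(x); var(x)  # both approximately 2.5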
Continuous Probability Distributions
• For continuous random variables, we use probability density functions to specify the distribution.
e.g.,
> MACNALLY <- read.table("macnally.csv", header = TRUE, sep = ",")
> plot(density(MACNALLY$EYR))
The normal distribution
• It has been a long-observed mathematical phenomenon that the accumulation of a set of independent random influences tends to converge upon a central value (the central limit theorem), and that the distribution of such accumulated values follows a specific ‘bell-shaped’ curve called a normal or Gaussian distribution (see Figure 3.1b).
• The normal distribution is a symmetrical distribution.
• Many biological measurements are likewise influenced by an almost infinite number of factors, and thus many biological variables also follow a normal distribution.
How to perform a test to see if a data set follows a normal distribution?
• One of the methods is to perform the Shapiro-Wilk test using the R function shapiro.test().
> mydata <- c(3.4, 4.2, 1.9, 5.2, 3.5, 4.2, 3.7, 3.2)
> shapiro.test(mydata)
Shapiro-Wilk normality test
data: mydata
W = 0.95509, p-value = 0.7623
If the p-value > 0.05, we cannot reject the hypothesis that the data follow a normal distribution.
How to perform a test to see if a data set follows a normal distribution?
## Generate two data sets
## First from a normal distribution, second from a t-distribution
words1 <- rnorm(100); words2 <- rt(100, df = 3)
## Have a look at the densities
plot(density(words1)); plot(density(words2))
## Perform the test
shapiro.test(words1); shapiro.test(words2)
## Plot using a Q-Q plot
qqnorm(words1); qqline(words1, col = 2)
qqnorm(words2); qqline(words2, col = 2)
How to perform a test to see if a data set follows a normal distribution?
• The boxplot() function can also be used to assess, visually, whether a data set is consistent with a normal distribution: a roughly symmetric box and whiskers suggests normality, while strong asymmetry suggests skew.
> VAR1 <- rlnorm(15, 4, .5)   # log-normal data: boxplot is asymmetric
> boxplot(VAR1)
> VAR2 <- rnorm(25, 2, .5)    # normal data: boxplot is roughly symmetric
> boxplot(VAR2)
> VAR3 <- log(VAR1)           # log-transforming VAR1 restores symmetry
> boxplot(VAR3)
Student’s t-distribution
• Another continuous probability distribution that is used very often in statistics is Student’s t-distribution, or simply the t-distribution.
• As we will see in later chapters, the t-distribution plays an especially important role in testing hypotheses regarding the population mean.
• A t-distribution is specified by only one parameter, called the degrees of freedom df. The t-distribution with df degrees of freedom is usually denoted as t(df), where df is a positive real number (df > 0).
• The mean of this distribution is μ = 0, and the variance is determined by the degrees of freedom parameter, σ² = df/(df − 2), which is of course defined when df > 2.
Student’s t-distribution
Comparison of a standard normal distribution to t-distributions with 1 degree of freedom and with 4 degrees of freedom.
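A minimal base-R sketch of this comparison, overlaying the densities with curve():
## Standard normal density versus t-densities with 1 and 4 degrees of freedom
curve(dnorm(x), from = -4, to = 4, lty = 1, ylab = "Density")
curve(dt(x, df = 1), add = TRUE, lty = 2)  # heaviest tails
curve(dt(x, df = 4), add = TRUE, lty = 3)  # closer to the normal
legend("topright", legend = c("N(0, 1)", "t(1)", "t(4)"), lty = 1:3)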
How to obtain random data sets with various distributions?
Normal distribution: rnorm(n, mean = 0, sd = 1)
Chi-square distribution: rchisq(n, df, ncp = 0)
t distribution: rt(n, df, ncp = 0)
F distribution: rf(n, df1, df2, ncp = 0)
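For example, to draw samples and inspect their shapes (set.seed() makes the draws reproducible):
set.seed(1)               # reproducible random numbers
x <- rnorm(1000)          # 1000 standard normal values
y <- rchisq(1000, df = 5) # 1000 chi-square values with 5 df
hist(x); hist(y)          # symmetric versus right-skewed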
Parameter Estimation
• Estimation refers to the process of guessing the unknown value of a parameter (e.g., the population mean) using the observed data. For this, we will use an estimator, which is a statistic.
• A statistic is a function of the observed data only. That is, it does not depend on any unknown parameter, and given the observed data, we should be able to find its value.
• For example, the sample mean is a statistic. Given a sample of data, we can find the sample mean by adding the observed values and dividing the result by the sample size. No unknown parameter is involved in this process.
Population Mean
• For a population of size N, μ is calculated as
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},$$
where x_i is the value of the random variable for the ith member of the population.
Given n observed values, X1, X2, …, Xn, from the population, we can estimate the population mean μ with the sample mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.$$
In this case, we say that X̄ is an estimator for μ.
Point Estimation of Population Mean
• We usually have only one sample of size n from the population, x1, x2, …, xn. Therefore, we only have one value for X̄, which we denote x̄:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n},$$
where x_i is the ith observed value of X in our sample, and x̄ is the observed value of X̄.
Population Variance
• The population variance is denoted σ² and calculated as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}.$$
This is the average of squared deviations of each observation x_i from the population mean μ.
Sample variance
• Given n randomly sampled values X1, X2, …, Xn from the population and their corresponding sample mean X̄, we can estimate the variance. A natural estimator for variance is
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$
Point Estimation of Population Variance
Again, we regard the estimator S² as a random variable, since it changes as we change the sample.
However, in practice, we usually have one set of observed values, x1, x2, …, xn, and therefore only one value for S², which we denote s²:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}.$$
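In R, the sample mean and the sample variance (with its n − 1 denominator) are computed with mean() and var(). A small sketch comparing the point estimates with known parameters:
## Simulate one sample of size 30 from N(mu = 10, sigma = 2)
smp <- rnorm(30, mean = 10, sd = 2)
mean(smp)  # xbar: point estimate of mu (about 10)
var(smp)   # s^2: point estimate of sigma^2 (about 4); uses the n - 1 denominator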
Four important distributions
• 1. Let X ∼ N(μ, σ²), and let (X1, X2, …, Xn) be a sample from this population with sample mean X̄. Then the following distributions hold:
$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right), \qquad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
• 2. Let X ∼ N(μ, σ²), and let (X1, X2, …, Xn) be a sample from this population with sample mean X̄ and sample variance S². Then the following distribution holds:
$$\frac{\bar{X} - \mu}{\sqrt{S^2/n}} \sim t(n - 1).$$
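Result 2 is the basis of the one-sample t-test; for example, t.test() computes (X̄ − μ)/(S/√n) and compares it to the t(n − 1) distribution:
## Test H0: mu = 10 for a simulated sample
x <- rnorm(20, mean = 10, sd = 2)
t.test(x, mu = 10)  # reports the t statistic, df = 19, and the p-value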
Four important distributions
• 3. Let X ∼ N(μ1, σ1²) and Y ∼ N(μ2, σ2²), with X and Y independent of each other, and let (X1, X2, …, Xn1) and (Y1, Y2, …, Yn2) be samples from populations X and Y, with means X̄ and Ȳ, respectively. Then the following distributions hold:
$$\bar{X} - \bar{Y} \sim N\!\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right), \qquad \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0, 1).$$
Four important distributions
• 4. Let X ∼ N(μ1, σ1²) and Y ∼ N(μ2, σ2²), with X and Y independent of each other and with equal population variances, and let (X1, X2, …, Xn1) and (Y1, Y2, …, Yn2) be samples from populations X and Y, with means X̄ and Ȳ and variances s1² and s2². Then the following distribution holds:
$$\frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t(n_1 + n_2 - 2).$$
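Result 4 is the basis of the pooled two-sample t-test, which t.test() performs when var.equal = TRUE; a minimal sketch:
## Pooled two-sample t-test (assumes equal population variances)
x <- rnorm(15, mean = 5, sd = 1)
y <- rnorm(20, mean = 6, sd = 1)
t.test(x, y, var.equal = TRUE)  # statistic follows t(n1 + n2 - 2) under H0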
Log-normal distribution
• Many biological variables have a lower limit of zero.
• Such circumstances can result in asymmetrical distributions that are highly truncated towards the left with a long right tail (see Figure 3.1c).
• In such cases, the mean and median present different values (see Figure 3.1d).
• These distributions can often be described by a log-normal distribution.
• Consequently, when such data are collected on a linear scale, they might be expected to follow a non-normal distribution.
Log-normal distribution
In probability theory and statistics, the log-normal distribution is the probability distribution of a random variable whose logarithm is normally distributed.
If X is a random variable that follows a normal distribution, then exp(X) follows a log-normal distribution; similarly, if Y follows a log-normal distribution, then ln(Y) follows a normal distribution.
If a variable can be seen as the product of many small independent factors, then this variable can be modelled with a log-normal distribution.
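This relationship is easy to verify by simulation; a minimal sketch:
## exp() of normal values gives a log-normal sample; log() recovers normality
x <- rnorm(1000)
y <- exp(x)           # log-normal: skewed, with a long right tail
hist(y, breaks = 40)
shapiro.test(log(y))  # log(y) equals x, so normality is not rejected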
3.2 Scale transformations
Data transformation is the process of converting the scale in which the observations were measured into another scale.
Scale transformations: an example
• Fig 3.2 Fictitious illustration of scale transformations. Leaf length measurements collected on a linear a) and logarithmic b) scale, yielding log-normal and normal sampling distributions, respectively. Leaf length measurements collected on a linear scale can be normalized by applying a logarithmic function (inset) to each measurement. Such a scale transformation only alters the relative spacing of measurements c). The largest leaf has the largest value on both scales.
Scale transformations
• The purpose of scale transformation is to normalize the data so as to satisfy the underlying assumptions of a statistical analysis.
• As such, it is possible to apply any function to the data. Nevertheless, certain data types respond more favourably to certain transformations due to characteristics of those data types.
• Common transformations and R syntax are provided in Table 3.2.
Scale transformations
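As a sketch (the specific entries of Table 3.2 may differ), transformations commonly used to normalize biological data, with their R syntax, include the following; which one is appropriate depends on the data type:
## Common normalizing transformations (illustrative choices, not a fixed rule)
y <- rlnorm(50, meanlog = 2, sdlog = 0.6)  # skewed, strictly positive data
log(y)         # logarithmic: growth data, concentrations
log(y + 1)     # logarithmic when zeros are present
sqrt(y)        # square root: count data
p <- runif(50) # proportions between 0 and 1
asin(sqrt(p))  # arcsine square root: percentages and proportions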