Pharmamatrix Workshop 2010
Statistics in Biology and Medicine
Richard Tseng
July 14, 2010
The goal of statistics is to analyze, interpret, and present data collected to study systems of interest.
Outline
• Descriptive statistics
• Inferential statistics
– Probability theory
– Hypothesis test
– Regression
• Some other tools
– Tools for component analysis
– Bayesian statistics
• Summary
Descriptive statistics
• Definitions
– Set: a well-defined collection of objects, each of which is called an element
– Operations on sets: union and intersection
For example,
$A = \{1, 2, 3, 4\}$ and $B = \{3, 4, 5, 6\}$
$A \cup B = \{1, 2, 3, 4, 5, 6\}$
$A \cap B = \{3, 4\}$
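The union and intersection in the example above can be checked with Python's built-in set type. This is only an illustrative sketch; the variable names are not from the slides.

```python
# Sets from the example on the slide.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

union = A | B          # elements in A or B (or both)
intersection = A & B   # elements in both A and B

print(sorted(union))
print(sorted(intersection))
```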
• Data type
– Interval scale
• For example, body weight (g): 1, 1.5, 2, 3 …
– Ordinal scale
• For example, scores for patient responses to treatment

Response       Score
Much worse     -2
Bit worse      -1
About same      0
Bit better      1
Much better     2
– Nominal scale
• Categorical data. For example, factors that influence treatments
• How large are the
numbers?
– Mean
– Median
[1]
• How variable are the
numbers?
– Standard deviation (SD)
– Coefficient of variation
(CV = SD/mean)
[1]
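A minimal sketch of the four summaries above, using the body-weight values from the interval-scale example (1, 1.5, 2, 3 g) as illustrative data. It assumes the sample (n - 1) form of the standard deviation, which the slides do not specify.

```python
import statistics

# Illustrative body weights in grams, taken from the interval-scale example.
weights = [1.0, 1.5, 2.0, 3.0]

mean = statistics.mean(weights)      # how large are the numbers?
median = statistics.median(weights)
sd = statistics.stdev(weights)       # sample SD (divides by n - 1), an assumption
cv = sd / mean                       # coefficient of variation, CV = SD/mean

print(f"mean={mean}, median={median}, SD={sd:.4f}, CV={cv:.4f}")
```

Because the CV is a ratio, it has no units and can be compared across measurements on different scales.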
Inferential Statistics:
Probability theory
• Law of large numbers
– The mean of the elements in a sample converges to the
expected value as the number of elements
approaches infinity
• Law of small numbers
– There are not enough small numbers to satisfy all
the demands placed on them
• Central limit theorem
– States the conditions under which the mean of a
sufficiently large number of independent random
variables, each with finite mean and variance, will
be approximately normally distributed
http://www.stat.sc.edu/~west/javahtml/CLT.html
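The law of large numbers and the central limit theorem can both be seen in a small simulation. The sketch below assumes Uniform(0, 1) draws (expected value 0.5, variance 1/12); the sample sizes and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n independent Uniform(0, 1) draws (expected value 0.5)."""
    return statistics.fmean(random.random() for _ in range(n))

# Law of large numbers: the sample mean settles near 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 4))
big_mean = sample_mean(100_000)

# Central limit theorem: means of n = 50 draws are roughly normal with
# mean 0.5 and standard deviation sqrt((1/12) / 50).
n = 50
means = [sample_mean(n) for _ in range(2_000)]
mean_of_means = statistics.fmean(means)
theory_sd = (1 / 12 / n) ** 0.5
within_1sd = sum(abs(m - 0.5) <= theory_sd for m in means) / len(means)

# A normal distribution puts about 68% of its mass within one SD of the mean.
print(round(mean_of_means, 4), round(within_1sd, 3))
```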
• Probability
– Meaning
• Frequency interpretation: A number associated
with the rate of occurrence of an event in a well-defined
random physical system
• Bayesian interpretation: A number assigned to any
statement whatsoever, even when no random process
is involved, as a way to represent the degree to which
the statement is supported by the available evidence
• Probability
– Basic rules
• Subtraction
$P(A) = 1 - P(A')$
• Addition
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
• Multiplication
$P(A \cap B) = P(A)\, P(B \mid A)$
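The three rules can be verified on a small example with exact fractions. The sketch assumes one roll of a fair six-sided die, with A = "even" and B = "greater than 3"; these events are chosen only for illustration.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}   # outcomes of one fair die roll (illustrative)
A = {2, 4, 6}                # event: the roll is even
B = {4, 5, 6}                # event: the roll is greater than 3

def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(omega))

# Subtraction rule: P(A) = 1 - P(A')
subtraction_ok = P(A) == 1 - P(omega - A)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
addition_ok = P(A | B) == P(A) + P(B) - P(A & B)

# Multiplication rule: P(A and B) = P(A) * P(B given A)
p_B_given_A = Fraction(len(A & B), len(A))
multiplication_ok = P(A & B) == P(A) * p_B_given_A

print(subtraction_ok, addition_ok, multiplication_ok)
```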
• Probability
– Bayes' rule
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
Posterior: $P(A \mid B)$
Likelihood: $P(B \mid A)$
Prior: $P(A)$
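A short worked example of Bayes' rule: updating the probability of disease after a positive diagnostic test. The prevalence, sensitivity, and false-positive rate below are hypothetical numbers chosen for illustration and do not come from the slides.

```python
# Hypothetical inputs (not from the slides).
prior = 0.01            # P(A): prevalence of the disease
sensitivity = 0.95      # P(B | A): positive test given disease (likelihood)
false_positive = 0.05   # P(B | not A): positive test given no disease

# Total probability of a positive test, P(B).
evidence = sensitivity * prior + false_positive * (1 - prior)

# Bayes' rule: posterior = likelihood * prior / evidence.
posterior = sensitivity * prior / evidence

print(f"P(disease | positive test) = {posterior:.3f}")
```

Even with a sensitive test, the posterior stays modest here because the prior is small.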
• Probability
– Maximum entropy principle: The most honest
probability distribution to assign to a system is
the one that maximizes the entropy of the system
subject to all the information at hand.
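To make the principle concrete, the sketch below compares the Shannon entropy of a few candidate distributions over four outcomes. It assumes the only available information is that the probabilities sum to one, in which case the uniform distribution has the largest entropy. The candidate distributions are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy -sum(p_i * ln p_i), with 0 * ln 0 taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical candidate assignments over four outcomes.
candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],
    "skewed": [0.4, 0.3, 0.2, 0.1],
    "peaked": [0.97, 0.01, 0.01, 0.01],
    "certain": [1.0, 0.0, 0.0, 0.0],
}

entropies = {name: entropy(p) for name, p in candidates.items()}
best = max(entropies, key=entropies.get)

for name, h in entropies.items():
    print(f"{name:8s} entropy = {h:.4f}")
print("largest entropy:", best)
```

Adding further constraints, such as a known mean, would change which distribution maximizes the entropy.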
Inferential Statistics:
Regression
• Goal: To relate the study outcomes of the
systems of interest to possible explanatory factors.
• Model:
– Linear model: $R = a + bx$
– Logistic model: $R = \frac{\exp(a + bx)}{\exp(a + bx) + 1}$
• Optimization
Suppose there are n outcomes $d_i$ of a study
– Least-squares method
$(\hat{a}, \hat{b}) = \arg\min_{a,b} \sum_{i} (R_i - d_i)^2$
– Maximum likelihood estimate: Suppose a
likelihood function is given by $L(a, b \mid d)$
$(\hat{a}, \hat{b}) = \arg\max_{a,b} L(a, b \mid d)$
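For the linear model R = a + bx, the least-squares estimates have a closed form. The sketch below applies it to a small hypothetical data set (the x and d values are made up for illustration). If the errors are assumed independent and Gaussian with equal variance, the same estimates also maximize the likelihood.

```python
# Hypothetical factor levels x and study outcomes d.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
d = [1.1, 2.9, 5.2, 7.1, 8.8]

n = len(x)
x_bar = sum(x) / n
d_bar = sum(d) / n

# Closed-form minimizer of sum_i (R_i - d_i)^2 with R_i = a + b * x_i.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xd = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
b_hat = s_xd / s_xx
a_hat = d_bar - b_hat * x_bar

print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```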
• Regression tests
– Residual analysis
residual $= R_i - d_i$
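– Standard errors of regression
coefficients, e.g. for the slope $\hat{b}$ of the linear model
$SE(\hat{b}) = \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n} (R_i - d_i)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$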
– Coefficient of determination (linear model)
$R^2 = \left(\hat{b}\,\frac{SD(x)}{SD(d)}\right)^2$
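The three regression checks can be computed directly for a fitted straight line. The sketch below reuses a small hypothetical data set and assumes the textbook forms for simple linear regression: residual variance with n - 2 degrees of freedom for the slope's standard error, and the coefficient of determination as 1 - SSE/SST.

```python
import math
import statistics

# Hypothetical factor levels x and study outcomes d.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
d = [1.1, 2.9, 5.2, 7.1, 8.8]
n = len(x)

# Least-squares fit of R = a + b * x.
x_bar, d_bar = statistics.fmean(x), statistics.fmean(d)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d)) / s_xx
a = d_bar - b * x_bar

# Residuals R_i - d_i between model and data.
residuals = [(a + b * xi) - di for xi, di in zip(x, d)]
sse = sum(r ** 2 for r in residuals)

# Standard error of the slope.
se_b = math.sqrt(sse / (n - 2) / s_xx)

# Coefficient of determination, two equivalent ways for a straight line.
sst = sum((di - d_bar) ** 2 for di in d)
r2 = 1 - sse / sst
r2_from_slope = (b * statistics.stdev(x) / statistics.stdev(d)) ** 2

print([round(r, 2) for r in residuals])
print(f"SE(b) = {se_b:.4f}, R^2 = {r2:.4f}")
```

Residuals that scatter around zero without a trend suggest the straight line is an adequate description of these data.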
• Example 1: Linear regression
• Example 2: MLE solution of Emax and EC50 in the
Michaelis-Menten equation
Likelihood function
MLE solution
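The likelihood function and MLE solution from the slide are not reproduced in this transcript, so the following is only a sketch under stated assumptions: the model is taken to be E = Emax * C / (EC50 + C), the errors are assumed independent and Gaussian with equal variance (so maximizing the likelihood is the same as minimizing the sum of squared residuals), and the data are synthetic and noise-free. The function names and the grid over EC50 are hypothetical choices.

```python
def emax_model(conc, emax, ec50):
    """Michaelis-Menten-type response at concentration conc."""
    return emax * conc / (ec50 + conc)

def fit_emax(concs, effects):
    """Profile the Gaussian likelihood over a grid of EC50 values.

    For a fixed EC50 the model is linear in Emax, so the best Emax has a
    closed form; the EC50 with the smallest squared error is then kept.
    """
    best = None
    for i in range(1, 101):                  # EC50 grid: 0.1, 0.2, ..., 10.0
        ec50 = i / 10
        f = [c / (ec50 + c) for c in concs]
        emax = sum(e * fi for e, fi in zip(effects, f)) / sum(fi * fi for fi in f)
        sse = sum((emax * fi - e) ** 2 for e, fi in zip(effects, f))
        if best is None or sse < best[0]:
            best = (sse, emax, ec50)
    return best[1], best[2]

# Synthetic, noise-free data generated with Emax = 100 and EC50 = 2.0.
concs = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
effects = [emax_model(c, 100.0, 2.0) for c in concs]

emax_hat, ec50_hat = fit_emax(concs, effects)
print(f"Emax = {emax_hat:.2f}, EC50 = {ec50_hat:.2f}")
```

With real, noisy data a finer grid or a numerical optimizer would be the more usual choice; the grid keeps the example dependency-free.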
Inferential Statistics:
Hypothesis test
• Goal: Test of significance
• Rationale
– Null hypothesis: H0, outcomes of a study result
purely from chance
– Alternative hypothesis: H1, outcomes of a study
are influenced by non-random sources
– Appropriate model: normal distribution, t-distribution, …
• Rationale
– Appropriate analysis method
• P-value: The probability of observing a sample statistic
as extreme as the test statistic, assuming the null
hypothesis is true.
• Parametric method: t-test, F-test, Chi-square test
• Non-parametric method: Kolmogorov-Smirnov test,
Mann-Whitney test
P-value for a significance test:
1. What is the probability of obtaining the test value from a random
population? One- or two-tailed?
t-distribution
http://socr.ucla.edu/htmls/dist/StudentT_Distribution.html
2. If the p-value is less than the significance level $\alpha$, the null
hypothesis is rejected
• Parametric tests
– One-sample t-test
Test statistic: $t = \frac{R - \bar{d}}{SD/\sqrt{n}}$
Null hypothesis: the means of two normally distributed populations are equal
– Two-sample F-test
Test statistic: $F = \frac{SD_1^2}{SD_2^2}$
Null hypothesis: the means of normally distributed populations, all having the same standard deviation, are equal
– Pearson chi-square test
Test statistic: $\chi^2 = \sum_{i=1}^{n} \frac{(R_i - d_i)^2}{d_i}$
Null hypothesis: the theoretical population R and the real population d do not differ
Two sample t-test: (Online calculator
http://www.usablestats.com/calcs/2samplet)
            N     Mean     StDev     SE Mean
Sample 1    15    0.633    0.2162    0.056
Sample 2    15    0.931    0.2021    0.052
Observed difference (Sample 1 - Sample 2): -0.298
Standard Deviation of Difference : 0.0764
Unequal Variances
DF : 27
95% Confidence Interval for the Difference ( -0.4548 , -0.1412 )
T-Value -3.9005
Population 1 ≠ Population 2: P-Value = 0.0006
Population 1 > Population 2: P-Value = 0.9997
Population 1 < Population 2: P-Value = 0.0003
Equal Variances
Pooled Standard Deviation: 0.2093
Pooled DF: 28
95% Confidence Interval for the Difference ( -0.4545 , -0.1415 )
T-Value -3.8992
Population 1 ≠ Population 2: P-Value = 0.0006
Population 1 > Population 2: P-Value = 0.9997
Population 1 < Population 2: P-Value = 0.0003
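The calculator output above can be approximately reproduced from the summary statistics alone. The sketch below uses only the standard library; because it has no t-distribution function, the tail probability is obtained by numerically integrating the Student t density with Simpson's rule, which is an implementation choice made here and not the calculator's method. Small differences from the listed values are expected because the inputs are rounded.

```python
import math

# Summary statistics from the table above.
n1, mean1, sd1 = 15, 0.633, 0.2162
n2, mean2, sd2 = 15, 0.931, 0.2021
diff = mean1 - mean2

# Unequal variances (Welch): standard error, t-value, degrees of freedom.
v1, v2 = sd1**2 / n1, sd2**2 / n2
se_welch = math.sqrt(v1 + v2)
t_welch = diff / se_welch
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Equal variances: pooled standard deviation and t-value.
df_pooled = n1 + n2 - 2
sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df_pooled)
t_pooled = diff / (sd_pooled * math.sqrt(1 / n1 + 1 / n2))

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t, df, steps=2000):
    """P(T > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3

tail = upper_tail(t_pooled, df_pooled)
p_two_sided = 2 * tail                             # Population 1 != Population 2
p_less = tail if t_pooled < 0 else 1 - tail        # Population 1 < Population 2
p_greater = 1 - p_less                             # Population 1 > Population 2

print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.1f}")
print(f"Pooled: t = {t_pooled:.3f}, SD = {sd_pooled:.4f}, df = {df_pooled}")
print(f"p (two-sided) = {p_two_sided:.4f}, p (<) = {p_less:.4f}, p (>) = {p_greater:.4f}")
```

The two-sided p-value is twice the one-sided value in the direction of the observed difference, which matches the pattern of the three p-values listed above.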
Some Statistics Worth Knowing
• Tools for component analysis:
– Principal Component Analysis (PCA): A way to
identify patterns in data and express the data in a way that
highlights their similarities and differences
– Independent Component Analysis (ICA): A way to
separate independent components in data
– Variable and model selection: Akaike Information
Criterion (AIC), Bayesian Information Criterion
(BIC)
• Bayesian statistics
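For the variable and model selection criteria named above, the standard definitions are AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of fitted parameters, n the number of observations, and ln L the maximized log-likelihood. The sketch below compares two hypothetical models; the log-likelihood values and sample size are invented for illustration.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood

n_obs = 30  # hypothetical number of observations

# Hypothetical fits: (maximized log-likelihood, number of parameters).
models = {
    "two-parameter model": (-50.0, 2),
    "three-parameter model": (-49.5, 3),
}

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()
}
for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.2f}, BIC = {b:.2f}")

# Lower values are preferred under both criteria.
preferred = min(scores, key=lambda name: scores[name][0])
print("preferred by AIC:", preferred)
```

Here the extra parameter improves the fit too little to pay for itself, so both criteria favor the smaller model.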
Summary
• What is the “right” null hypothesis?
• What is the appropriate distribution function?
• What is the appropriate test statistic?
“Know” your data before you analyze it!
Information theory based statistics:
Bayesian statistics
• Goal: Use Bayesian methods to design studies and
analyze data
• Bayesian inference
– Appropriate distribution functions
– Appropriate sampling techniques
• Maximum entropy method based inference
– Appropriate form of entropy
– Appropriate constraints
Information theory based statistics:
Method of maximum entropy
Reference
[1] P. Rowe, Essential Statistics for
Pharmaceutical Sciences, Wiley 2007.