
Probability & Statistics-Based Models
Raina Robeva – Sweet Briar College
August 1, 2007
MathFest 2007
San Jose, CA
Probability & Statistics-Based Models
Introduction
Quantitative Traits (Limit Theorems)
Luria – Delbruck Experiments
Evaluating risks from time series data
Elementary Probability
Random Variables
$X = X(\omega), \ \omega \in \Omega$
$(\Omega, \mathcal{F}, P)$ - Probability Space
Histograms
[Figure: histograms of RR interval data; axes: RR interval (msec) and Frequency]
Elementary Probability
Set of all outcomes - $\Omega$
Examples:
1) Flipping a coin: $\Omega = \{H, T\}$
2) Rolling a die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
3) Rolling two dice: $\Omega = \{(i, j) : i, j = 1, \dots, 6\}$
Elementary Probability
Elementary Events – the elements of $\Omega$: $\omega \in \Omega$
Events – the subsets of $\Omega$: $A, B, C \subseteq \Omega$
Definition of Probability:
$P(A) = \frac{|A|}{|\Omega|}$, where $|\cdot|$ = number of elements
How do we find probabilities?
We Count!
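A minimal sketch of "we count", assuming equally likely outcomes. The event "the two dice sum to 7" and the names `omega` and `probability` are illustrative choices, not part of the slides.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two dice: all ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))

def probability(event):
    """P(A) = |A| / |Omega| for equally likely outcomes."""
    favorable = [w for w in omega if event(w)]
    return Fraction(len(favorable), len(omega))

# Hypothetical example event: the two dice sum to 7.
p_sum_is_7 = probability(lambda w: w[0] + w[1] == 7)
print(len(omega), p_sum_is_7)  # 36 1/6
```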
Chromosomes and Genes
Chromosomes are large DNA
molecules found in the cell’s nucleus
Genes are found on chromosomes
and code for a specific trait
Each gene has a specified place on
the chromosome called a locus.
The possible alternative forms of the
genes are called alleles.
The human Chromosome 11 contains
28 genes. The first 5 genes from the
tip of the short arm form a cluster of
genes that encode components of
hemoglobin
Problem
One gene, two types of alleles: a (recessive) and A (dominant)
k = number of dominant alleles (0, 1, or 2)
If E = “exactly k dominant alleles”, find P(E).
$\Omega$ - All possible sequences of length 2 comprised of a and A
$\Omega = \{aa, aA, Aa, AA\}$
$P(E) = \frac{|E|}{|\Omega|}$
$|E| = 1$ when $k = 0$; $|E| = 2$ when $k = 1$; $|E| = 1$ when $k = 2$
Problem (cont.)
$\Omega = \{aa, aA, Aa, AA\}$
$P(E) = \frac{|E|}{|\Omega|} = 1/4$ when $k = 0$; $1/2$ when $k = 1$; $1/4$ when $k = 2$
Gregor Mendel – experiments with peas
Round - dominant
Wrinkled - recessive
[Figure: crosses from the Parental Generation (P) through the First Filial Generation (F1) to the Second Filial Generation (F2)]
Only round peas in F1
3:1 ratio of round vs. wrinkled in F2
Phenotypic Ratios: 1:3 (1:2:1)
P(wrinkled) = 1/4 = 25%
P(round) = 1 - 1/4 = 75%
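The 1:2:1 and 3:1 ratios can be recovered by listing the four equally likely allele pairs. The sketch below assumes both F1 parents are heterozygous (Aa), which is the standard reading of Mendel's cross but is not spelled out on the slide; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption: each F1 parent is Aa and passes either allele with probability 1/2.
parent_1, parent_2 = "Aa", "Aa"
offspring = [g1 + g2 for g1, g2 in product(parent_1, parent_2)]

# k = number of dominant alleles (0, 1, or 2) in each offspring genotype.
counts = Counter(genotype.count("A") for genotype in offspring)
probs = {k: Fraction(c, len(offspring)) for k, c in counts.items()}

p_wrinkled = probs[0]       # aa only
p_round = 1 - p_wrinkled    # at least one dominant allele
print(sorted(probs.items()), p_round)
```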
Quantitative Traits (1909)
Herman Nilsson-Ehle
[Figure: crosses from the Parental Generation (P) through the First Filial Generation (F1) to the Second Filial Generation (F2)]
F1: All of intermediate color
F2: Two new shades appear
Phenotypic Ratios: 1:4:6:4:1; 1 : 6 : 15 : 20 : 15 : 6 : 1; …
Quantitative Traits – Examples
Polygenic Hypothesis
n genes, two types of alleles: a and A
N = 2n – total positions
k = number of dominant alleles (0, 1, 2, …, N)
If E = “exactly k dominant alleles”, find P(E) = ?
Polygenic Hypothesis – set of outcomes
[Figure: four genes, labeled 1 to 4]
$\Omega$ - All possible sequences of length 8 comprised of a and A
$|\Omega| = 2^8$. In general, $|\Omega| = 2^N = 2^{2n}$
Polygenic Hypothesis
N = 2n – total positions
Alleles a and A are equally likely
k = number of dominant alleles (0, 1, 2, …, N)
If E = “exactly k dominant alleles”, find P(E).
$|\Omega| = 2^N$
$|E| = \binom{N}{k} = \frac{N!}{k!(N-k)!}$
$P(E) = \frac{|E|}{|\Omega|} = \frac{\binom{N}{k}}{2^N}$
Example: Nilsson-Ehle (1909)
Nilsson-Ehle: Two genes (n = 2), N = 2n = 4 alleles in total
X – number of a alleles in the N loci
$P(X = k) = \frac{\binom{N}{k}}{2^N} = \frac{\binom{4}{k}}{16}$
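The phenotypic ratios quoted earlier (1:4:6:4:1 and 1:6:15:20:15:6:1) are the binomial coefficients for N = 4 and N = 6. A short sketch, assuming equally likely alleles as in the polygenic hypothesis; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def phenotypic_ratio(n_genes):
    """Counts |E| = C(N, k) for k = 0..N, where N = 2 * n_genes."""
    N = 2 * n_genes
    return [comb(N, k) for k in range(N + 1)]

def allele_count_distribution(n_genes):
    """P(X = k) = C(N, k) / 2^N for k = 0..N."""
    N = 2 * n_genes
    return [Fraction(comb(N, k), 2 ** N) for k in range(N + 1)]

print(phenotypic_ratio(2))  # [1, 4, 6, 4, 1]
print(phenotypic_ratio(3))  # [1, 6, 15, 20, 15, 6, 1]
```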
Random Variables
$X = X(\omega)$
Discrete – X takes integer values
X is “known” when we know P(X=k) for all possible k
Continuous – X can be any value from an interval
X is “known” when we know:
the distribution function $F(x) = P(X < x) = \int_{-\infty}^{x} f(t)\,dt$;
the probability density function $f(x) = \frac{d}{dx} F(x)$
Common Discrete Random Variables
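Bernoulli (p): X takes values k = 0, 1
P(X=1) = p; P(X=0) = 1-p
Binomial Bin(N, p): X takes values k = 0, 1, 2, …, N
$P(X = k) = \binom{N}{k} p^k (1-p)^{N-k}$
[Figure: binomial distributions for N = 20, p = 0.5; N = 20, p = 0.7; N = 20, p = 0.2]
Poisson Po($\lambda$): X takes values k = 0, 1, 2, 3, …
$P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$
Parameters: p for Bernoulli (p); N and p for Bin(N, p); $\lambda$ for Po($\lambda$)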
Common Continuous Random Variables
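Exponential: X takes values $x \in (0, \infty)$
$F(x) = 1 - e^{-\lambda x}$
$f(x) = \lambda e^{-\lambda x}$
Gaussian (Normal): X takes values $x \in (-\infty, \infty)$
$N(\mu, \sigma)$: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
$N(0, 1)$: $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$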
Bell - Shaped Distr. of Quantitative Traits
Traits are controlled not by one
but by several different genes. The
genes are independent and
contribute cumulatively to the
expression of the characteristic
(Polygenic Hypothesis)
Distribution of the trait is
Binomial (2n, p), where n is the
number of genes and p is the frequency
of the non-contributing allele in
the population.
Distribution is approximately
Gaussian.
Further “smoothing” by
environmental factors
Why the “bell-shaped” distribution of
quantitative traits?
Central Limit Theorem
When Np is large and N(1-p) is large,
then
$\text{Binomial}(N, p) \sim \text{Normal}(Np, \sqrt{Np(1-p)})$
de Moivre (1667 - 1754), Laplace (1749 - 1827)
[Figure: binomial distributions for N = 20, p = 0.5; N = 8, p = 0.2; N = 50, p = 0.7]
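One way to see the approximation at work is to compare the binomial probabilities with the normal density at the integers, for the three parameter pairs listed above. This is a sketch only; measuring the gap by the largest pointwise difference is a choice made here, not something taken from the slides.

```python
import math

def binomial_pmf(N, p, k):
    return math.comb(N, k) * p ** k * (1 - p) ** (N - k)

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def max_abs_error(N, p):
    """Largest gap between Bin(N, p) and Normal(Np, sqrt(Np(1-p))) over k = 0..N."""
    mu, sigma = N * p, math.sqrt(N * p * (1 - p))
    return max(abs(binomial_pmf(N, p, k) - normal_pdf(k, mu, sigma))
               for k in range(N + 1))

for N, p in [(20, 0.5), (8, 0.2), (50, 0.7)]:
    print(N, p, round(max_abs_error(N, p), 4))
```

The gap is largest for N = 8, p = 0.2, where Np = 1.6 is not large.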
Aggregate Characteristics
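Mean Value
$E(X) = \sum_k k\,P(X = k)$ (discrete)
$E(X) = \int x f(x)\,dx$ (continuous)
Variance (Standard Deviation = $\sqrt{\mathrm{Var}(X)}$)
$\mathrm{Var}(X) = E[X - E(X)]^2 = E(X^2) - [E(X)]^2$
Moments of order m
$E(X^m) = \sum_k k^m P(X = k)$ (discrete)
$E(X^m) = \int x^m f(x)\,dx$ (continuous)
Examples
Binomial (N, p): $E(X) = Np$, $\mathrm{Var}(X) = Npq$, where $q = 1 - p$
Poisson ($\lambda$): $E(X) = \lambda$, $\mathrm{Var}(X) = \lambda$
Gaussian ($\mu$, $\sigma$): $E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$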
Poisson Distribution Arises When…
Events of low intensity occurring in time
[Figure: events marked on a time axis from 0 to t]
Average number of events per unit time = $\lambda$
X(t) – the number of events that have occurred in [0,t]
X(t) has a Poisson distribution with parameter $\lambda t$
$P(X(t) = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$
Poisson Distribution Arises When…
Events of low intensity occurring independently of one
another
Average number of events per unit surface/volume per unit
time = $\lambda$
X – the number of events that have occurred in a unit
surface/volume over time t
X has a Poisson distribution with parameter $\lambda t$
$P(X = k) = \frac{(\lambda t)^k e^{-\lambda t}}{k!}$
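A small numerical check that the Poisson probabilities with parameter λt sum to 1 and have mean and variance both equal to λt. The values λ = 2.0 and t = 1.5 are hypothetical, and the sums are truncated at k = 99, which is an approximation.

```python
import math

def poisson_pmf(k, rate, t):
    """P(X = k) = (rate * t)^k * exp(-rate * t) / k!"""
    m = rate * t
    return m ** k * math.exp(-m) / math.factorial(k)

rate, t = 2.0, 1.5  # hypothetical intensity and time window, so rate * t = 3
ks = range(100)     # truncation point chosen for illustration
total = sum(poisson_pmf(k, rate, t) for k in ks)
mean = sum(k * poisson_pmf(k, rate, t) for k in ks)
second_moment = sum(k * k * poisson_pmf(k, rate, t) for k in ks)
variance = second_moment - mean ** 2
print(total, mean, variance)
```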
The Law of Large Numbers (1713)
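If $X_1, X_2, \dots, X_n$ are independent observations of a random variable X with
$E(X) = \mu$,
then
$\frac{X_1 + X_2 + \dots + X_n}{n} \to \mu$, as $n \to \infty$,
or, equivalently,
$\bar{X} \to \mu$, as $n \to \infty$.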
Example – Ordinary Coin Toss Game
1. Toss a coin
2. If Heads, win $1
3. If Tails, win nothing
4. Let Xi be your win for game i
5. Average payback to you
$\mu = E(X_i) = (1/2) \cdot 0 + (1/2) \cdot 1 = \$0.50$
6. By the Law of Large Numbers
$\frac{X_1 + X_2 + \dots + X_n}{n} \to \mu = 0.5$, as $n \to \infty$.
Simulation Example
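A minimal simulation sketch of the ordinary coin toss game. The seed and the number of games are arbitrary choices; the running average should settle near $0.50.

```python
import random

def running_average(n_games, seed=0):
    """Average winnings after each game; Heads pays $1, Tails pays nothing."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for n in range(1, n_games + 1):
        total += 1 if rng.random() < 0.5 else 0
        averages.append(total / n)
    return averages

averages = running_average(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(averages[n - 1], 4))
```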
Example – St. Petersburg Game
1. Toss a coin
2. If Heads, win $2
3. If Tails, keep tossing until it falls Heads
4. If first Heads on N-th toss, win $2^N
H: $2; TH: $4; TTH: $8; TTTH: $16; etc.
5. With probability 1/2^N we win $2^N
6. Average payback to you
$\mu = \left(\frac{1}{2}\right) \cdot 2 + \left(\frac{1}{2^2}\right) \cdot 2^2 + \left(\frac{1}{2^3}\right) \cdot 2^3 + \dots = 1 + 1 + 1 + \dots = \infty$
St. Petersburg Game – a sample run
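A sample run can be sketched as follows. The seed and the number of games are arbitrary, and `truncated_mean` is a helper introduced here to show that each possible toss count adds exactly $1 to the average payback.

```python
import random

def play_once(rng):
    """Toss until the first Heads; first Heads on toss N pays 2^N dollars."""
    n = 1
    while rng.random() < 0.5:  # Tails with probability 1/2: keep tossing
        n += 1
    return 2 ** n

def truncated_mean(max_tosses):
    """Average payback if the game were cut off after max_tosses tosses."""
    return sum((0.5 ** n) * (2 ** n) for n in range(1, max_tosses + 1))

rng = random.Random(1)
payouts = [play_once(rng) for _ in range(10_000)]
print("largest payout:", max(payouts))
print("sample average:", sum(payouts) / len(payouts))
print("truncated mean, 10 tosses:", truncated_mean(10))
```

Unlike the ordinary coin toss game, the sample average does not settle down as more games are played, because rare long runs of Tails contribute very large payouts.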
Random Processes (Temporal Stochastic Models)
Random Process: X(t) – Random variable that changes in time
When t = 0, 1, 2, … – Discrete Random Process
When t changes continuously – Continuous Random Process
In addition, since for any value of t, X(t) can be discrete or
continuous random variable, there are four possibilities for the
process {X(t), t}.
{X(t), t} is defined through its probability distribution.
$p_{xi}(t) = P(X(t) = x \mid X(0) = i)$
For example, if X(t) can take values x = 0,1,2,…, then
$p_i(t) = [p_{0i}(t), p_{1i}(t), p_{2i}(t), \dots]$ is the probability distribution
of X(t).
Single Population Immigration-Death Process
Deterministic Model
X(t) = population size at time t
I = rate of immigration
a = per capita death rate
$\frac{dX}{dt} = I - aX$
Stochastic Model (Kolmogorov – Chapman DE)
$X(t + \Delta t) = x$ can happen when:
X(t) = x and no change over $\Delta t$. (Event A)
X(t) = x + 1 and one death over $\Delta t$. (Event B)
X(t) = x - 1 and one immigration over $\Delta t$. (Event C)
Probability for more than unit change over $\Delta t$ is $o(\Delta t)$. (Event D)
Kolmogorov – Chapman Equations
$p_n(t) = \Pr(X(t) = n)$
$p_n(t + \Delta t) = a(n+1)\Delta t\, p_{n+1}(t) + I \Delta t\, p_{n-1}(t) + (1 - (an + I)\Delta t)\, p_n(t) + o(\Delta t)$
The four terms on the right are P(B), P(C), P(A), and P(D), in that order.
Subtract $p_n(t)$, divide by $\Delta t$, and let $\Delta t \to 0$:
$\frac{d}{dt} p_n(t) = a(n+1) p_{n+1}(t) + I p_{n-1}(t) - (I + an) p_n(t), \quad n > 0$
$\frac{d}{dt} p_0(t) = -I p_0(t) + a p_1(t), \quad n = 0$
Demo
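The equations for $p_n(t)$ can be integrated numerically. The sketch below is not the demo from the talk: it uses a plain Euler step, truncates the state space at a hypothetical `n_max`, starts from an empty population, and uses made-up rates I = 2 and a = 0.5. It then compares the mean of the computed distribution with the solution of the deterministic model dX/dt = I - aX with X(0) = 0.

```python
import math

def integrate_master_equation(I, a, t_end, dt=0.001, n_max=60):
    """Euler integration of dp_n/dt = a(n+1)p_{n+1} + I p_{n-1} - (I + an)p_n."""
    p = [0.0] * (n_max + 1)
    p[0] = 1.0  # assumption: X(0) = 0
    for _ in range(int(round(t_end / dt))):
        new_p = p[:]
        for n in range(n_max + 1):
            death_in = a * (n + 1) * p[n + 1] if n < n_max else 0.0
            immigration_in = I * p[n - 1] if n > 0 else 0.0
            # No immigration out of the top state, so total probability is conserved.
            immigration_out = I if n < n_max else 0.0
            out = (immigration_out + a * n) * p[n]
            new_p[n] = p[n] + dt * (death_in + immigration_in - out)
        p = new_p
    return p

I, a, t_end = 2.0, 0.5, 5.0  # hypothetical rates and time horizon
p = integrate_master_equation(I, a, t_end)
stochastic_mean = sum(n * pn for n, pn in enumerate(p))
deterministic = (I / a) * (1 - math.exp(-a * t_end))
print(stochastic_mean, deterministic)
```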
How are the Stochastic and Deterministic Models Related?
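Define $\langle X \rangle = E X = \sum_n n\, p_n(t)$
Multiply the K-C equation by n and sum over n
$\frac{d}{dt} p_n(t) = a(n+1) p_{n+1}(t) + I p_{n-1}(t) - (I + an) p_n(t), \quad n \ge 0$
$\frac{d}{dt} \langle X \rangle = \sum_n n a (n+1) p_{n+1}(t) + \sum_n n I p_{n-1}(t) - \sum_n n (I + an) p_n(t)$
$\frac{d}{dt} \langle X \rangle = \sum_n [n p_{n+1} a(n+1) - n p_n a n] + I \sum_n [n p_{n-1} - n p_n]$
The first sum equals $-a \sum_n n p_n(t) = -a \langle X \rangle$ and the second sum equals 1.
The mean value of the stochastic process X satisfies the deterministic equation
$\frac{d}{dt} \langle X \rangle = I - a \langle X \rangle$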
Luria-Delbruck Experiments
When do mutations occur?
Lamarckian Model - mutations
evolve only in response to an
environmental cue.
Darwinian Model - mutations are
equally likely to occur at any
moment in time.
Luria-Delbruck Experiments (1943)
Luria SE & Delbruck M. Mutations of Bacteria from Virus Sensitivity to Virus
Resistance. Genetics 28:491(1943).
Large number of bacterial cultures, each started from a
small number of cells.
Control
Plate the cultures on nutrient agar plates
on which a large amount of a virus has
been plated first. Incubate.
Hypotheses
Hypothesis 2 (Mutation): Mutations occur randomly, but the
probability that a bacterium mutates from sensitive to resistant is
small. This mutation is completely independent from the
presence of the virus. When the bacteria are added to the plates,
the mutants are already resistant to the virus. Only these mutants
proliferate into colonies on the plate.
Hypothesis 1 (Acquired Immunity): A small number of bacteria
mutated to acquire resistance only after they are exposed to the
virus. Survival confers immunity not only to the individual but
also to its offspring, and the colonies grow.
Count the Number of Colonies
Two opposing hypotheses
Hypothesis 1 (Acquired Immunity, Directed Mutation): A small
number of bacteria mutated to acquire resistance only after they
are exposed to the virus. Survival confers immunity not only to
the individual but also to its offspring, and the colonies grow.
killer virus
Two opposing hypotheses
Hypothesis 2 (Mutation + Selection): Mutations occur randomly,
but the probability that a bacterium mutates from sensitive to
resistant is small. This mutation is completely independent from
the presence of the virus. When the bacteria are added to the
plates, the mutants are already resistant to the virus. Only these
mutants proliferate into colonies on the plate.
killer virus
What is the Distribution of the Mutant
Cells at the time of plating?
Under the Directed Mutation Hypothesis
Poisson
$E(X) = \mathrm{Var}(X)$
$E(X) / \mathrm{Var}(X) = 1$
Under the Mutation + Selection Hypothesis
Non-Poisson
$E(X) \ne \mathrm{Var}(X)$
$\mathrm{Var}(X)$ is very large
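The contrast between the two hypotheses can be illustrated with a small simulation. Everything here is a simplified sketch under stated assumptions: each culture starts from one cell and doubles for a fixed number of generations, mutants stay mutant, and the values p = 0.002, 10 generations, and 500 cultures are made up for illustration. The ratio Var(X)/E(X) should be close to 1 under directed mutation and much larger under random mutation.

```python
import random
import statistics

def binomial(rng, n, p):
    """Number of successes in n independent trials with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)

def mutants_random_mutation(rng, generations, p):
    """Mutations arise at division, in any generation; mutant clones keep doubling."""
    sensitive, mutants = 1, 0
    for _ in range(generations):
        new_mutants = binomial(rng, sensitive, p)
        sensitive = 2 * (sensitive - new_mutants)
        mutants = 2 * (mutants + new_mutants)
    return mutants

def mutants_directed(rng, generations, p):
    """Mutations arise only on exposure: each final cell mutates independently."""
    return binomial(rng, 2 ** generations, p)

def variance_to_mean(samples):
    return statistics.pvariance(samples) / statistics.mean(samples)

rng = random.Random(2007)
generations, p, cultures = 10, 0.002, 500  # hypothetical values
directed = [mutants_directed(rng, generations, p) for _ in range(cultures)]
random_mut = [mutants_random_mutation(rng, generations, p) for _ in range(cultures)]
ratio_directed = variance_to_mean(directed)
ratio_random = variance_to_mean(random_mut)
print(ratio_directed, ratio_random)
```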
Luria-Delbruck Distribution
Large variation in the number of mutants
What is the average number of resistant cells under continuous mutation?
Assume that mutation can only occur at the time of division
Assume that each cell can mutate with a constant probability p
Generation (i) | Average number of mutant cells in generation i | Expected number of mutants at the end from this generation
0 | $p$ | $p 2^N$
1 | $2p$ | $2p \cdot 2^{N-1} = p 2^N$
2 | $2^2 p$ | $2^2 p \cdot 2^{N-2} = p 2^N$
3 | $2^3 p$ | $2^3 p \cdot 2^{N-3} = p 2^N$
4 | $2^4 p$ | $2^4 p \cdot 2^{N-4} = p 2^N$
5 | $2^5 p$ | $2^5 p \cdot 2^{N-5} = p 2^N$
6 | $2^6 p$ | $2^6 p \cdot 2^{N-6} = p 2^N$
$E(X) = p 2^N + p 2^N + p 2^N + \dots = p 2^N (1 + 1 + 1 + \dots)$
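The table can be reproduced in a few lines. The sketch assumes mutations arise in generations 0 through N - 1, so that there are N equal contributions and the total is N p 2^N; that count of generations is an assumption made here, and p = 0.001, N = 7 are illustrative values.

```python
import math

def expected_mutants_by_generation(p, N):
    """Rows (i, new mutants in generation i, their expected descendants at the end)."""
    rows = []
    for i in range(N):
        new_mutants = (2 ** i) * p              # 2^i cells, each mutating with probability p
        final = new_mutants * 2 ** (N - i)      # each mutant leaves 2^(N-i) descendants
        rows.append((i, new_mutants, final))
    return rows

p, N = 1e-3, 7  # hypothetical values; N = 7 gives generations 0..6 as in the table
rows = expected_mutants_by_generation(p, N)
total = sum(final for _, _, final in rows)
for row in rows:
    print(row)
print("total:", total)
```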
Biological ESTEEM: Mutation.xls, AcqIm.xls
Lea and Coulson (1949)
Theorem. Let $X_t$ denote the number of mutant cells in the
culture at time t. If p is the probability for a single cell to
mutate and $m = p 2^n$, then the probability generating function
of the distribution, defined by
$\phi(x, m) = \sum_k P(X_t = k)\, x^k$,
has the form
$\phi(x, m) = (1 - x)^{m(1-x)/x}$
Lea, D.E. and Coulson, C.A. (1949) The distribution of the number of mutants
in bacterial populations. J. Genetics 49, 264-285
More recent work on the Luria-Delbruck
distribution
Evaluating risk from time series data
Glucose Variability and Risk Assessment
in Diabetes
Heart Rate Variability and the Risk for
Neonatal Sepsis
Blood Glucose Fluctuation Characteristics
Quantified from Self-Monitoring Data
In both human and economic
terms, diabetes is one of the nation's most
costly diseases. Diabetes is the leading cause of kidney failure,
blindness in adults, and amputations. It is a major risk factor for
heart disease, stroke, and birth defects. Diabetes shortens
average life expectancy by up to 15 years, and costs our nation in
excess of $100 billion annually in health-related
expenditures, more than any other single chronic disease.
Diabetes spares no group, affecting young and old, all
races and ethnic groups, the rich and the poor.
Sixteen million people
in the United States have
Diabetes Mellitus.
Definitions
• Type 1 Diabetes, also referred to as Insulin Dependent
Diabetes Mellitus (IDDM), is the type of diabetes in which the
pancreas produces no insulin or extremely small amounts;
• Type 2 Diabetes is the type of diabetes in which the body
doesn’t use its insulin effectively or doesn’t produce enough
insulin;
• Insulin is a hormone secreted by the pancreas that regulates
the metabolism of glucose;
• Blood Glucose (BG) is the concentration of glucose in the
bloodstream;
• The BG levels are measured in mg/dl (USA) and in mmol/L
(most elsewhere);
• The two scales are directly related by: 18 mg/dl = 1 mM.
[Figure: blood glucose regulation diagram. Labels: Hyperglycemia; Target Blood Glucose Range: 70-180 mg/dl (DCCT, 1993); Hypoglycemia; Severe Hypoglycemia; Food; Insulin; Counterregulation]
Severe Hypoglycemia
• Defined as a low BG resulting in stupor, seizure, or unconsciousness that
precludes self-treatment (The Diabetes Control and Complications Trial
Research Group, 1997). Four percent of the deaths among individuals with
IDDM are attributed to SH (DCCT Study Group, 1991).
• Although most severe hypoglycemic episodes are not fatal, there remain
numerous negative sequelae leading to compromised occupational and
scholastic functioning, social embarrassment, poor judgment, serious accidents,
and possible permanent cognitive dysfunction (Gold AE et al., 1993; Deary et al.,
1993; Lincoln et al., 1996).
• Fear of severe hypoglycemia is identified as the major barrier to improved
metabolic control (Cryer et al., 1994).
BG Fluctuations: T1DM
[Figure: BG time series; vertical axis from 0 to 600, horizontal axis from 0 to 30]
BG Fluctuations: T2DM
[Figure: BG time series; vertical axis from 0 to 600, horizontal axis from 0 to 30]
Average Glycemia and Glucose Variability
[Figure: two panels, Person A: HbA1c=8.0% and Person B: HbA1c=8.0%; vertical axis: Blood Glucose (mg/dl), 0 to 400; horizontal axis: Time (days), 0 to 30]
Blood Glucose (BG) Monitoring Systems
Self-Monitoring BG Devices
(typically 3-10 measurements/24 hours)
Continuous BG Monitoring Systems
(up to 288 measurements/24 hours)
The Distribution of the BG Levels:
(Mean=6.7, SD=3.6, Normality hypothesis is rejected, P<0.05)
[Figure: histogram of BG levels; horizontal axis: BG (mM), 0 to 20; vertical axis: Frequency, 0 to 30; regions marked Hypoglycemia, Target Range, Hyperglycemia; Clinical Center and Numerical Center marked; Standard Data Range shown alongside Data Range, if Symmetrization is used]
Symmetrization of the BG Scale:
Assumptions:
A1: The transformed whole BG range should be symmetric around 0.
A2: The transformed target BG range should be symmetric around 0.
Transformation:
$f(BG, a, b) = (\ln(BG))^a - b$, $a, b > 0$
That satisfies the conditions:
A1: $f(33.3, a, b) = -f(1.1, a, b)$ and A2: $f(10, a, b) = -f(3.9, a, b)$.
Which leads to the equations:
$(\ln(33.3))^a - b = -[(\ln(1.1))^a - b]$
$(\ln(10.0))^a - b = -[(\ln(3.9))^a - b]$
$g \cdot [(\ln(33.3))^a - b] = -g \cdot [(\ln(1.1))^a - b] = \sqrt{10}$ (scaling)
When solved numerically:
a = 1.033, b = 1.871 and g = 1.774 (when BG is in mM)
a = 1.084, b = 5.3811 and g = 1.509 (when BG is in mg/dl)
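The mM parameters can be recovered numerically. The sketch below is one possible approach, not necessarily the method used by the authors: the two symmetry equations both give 2b as a sum of two powers, so a is found by bisection on the difference of those sums, and g then follows from the scaling condition. The bracket [1, 2] and the function names are choices made here.

```python
import math

def mismatch(a):
    """Difference between the two expressions for 2b implied by A1 and A2."""
    whole_range = math.log(33.3) ** a + math.log(1.1) ** a
    target_range = math.log(10.0) ** a + math.log(3.9) ** a
    return whole_range - target_range

def solve_parameters(lo=1.0, hi=2.0, tol=1e-10):
    """Bisection for a, then b from A1 and g from the scaling condition."""
    if not (mismatch(lo) < 0 < mismatch(hi)):
        raise ValueError("bracket does not contain a sign change")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mismatch(mid) < 0:
            lo = mid
        else:
            hi = mid
    a = 0.5 * (lo + hi)
    b = 0.5 * (math.log(33.3) ** a + math.log(1.1) ** a)
    g = math.sqrt(10) / (math.log(33.3) ** a - b)
    return a, b, g

a, b, g = solve_parameters()
print(round(a, 3), round(b, 3), round(g, 3))
```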
Symmetrization Function:
f(BG) = 1.774 * (ln(BG)^1.033 - 1.871)
[Figure: plot of f(BG) against BG (mM), from 1 to 34; vertical axis from -3.5 to 3.5; Numerical Center and Clinical Center marked]
Distribution of the Transformed BG Levels:
[Figure: histogram of transformed BG levels; horizontal axis: f(BG), Symmetrized Data Range from -2.5 to 2.5; vertical axis: Frequency, 0 to 50; regions marked Hypoglycemia, Target Range, Hyperglycemia; a single Clinical and Numerical Center marked]
Defining the Low and High
Blood Glucose Indices:
The BG risk function: $r(BG) = 10 \cdot f(BG)^2$
Let x1, x2, ... xn be a series of n BG readings,
and let
rl(BG)=r(BG) if f(BG)<0 and 0 otherwise;
rh(BG)=r(BG) if f(BG)>0 and 0 otherwise.
The Low Blood Glucose [Risk] Index (LBGI) and the
High BG [Risk] Index (HBGI) are then defined as:
$LBGI = \frac{1}{n} \sum_{i=1}^{n} rl(x_i)$
$HBGI = \frac{1}{n} \sum_{i=1}^{n} rh(x_i)$
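The definitions translate directly into code. This sketch uses the mM parameters quoted earlier (a = 1.033, b = 1.871, g = 1.774); the readings are made up for illustration, and the guard against BG values at or below 1 mM is an addition here, needed because ln(BG) must be positive before it is raised to a non-integer power.

```python
import math

A, B, G = 1.033, 1.871, 1.774  # parameters for BG in mM, as quoted on the slides

def f(bg_mM):
    """Symmetrized BG value; defined here only for BG above 1 mM."""
    if bg_mM <= 1.0:
        raise ValueError("BG must be above 1 mM for this transformation")
    return G * (math.log(bg_mM) ** A - B)

def risk(bg_mM):
    return 10 * f(bg_mM) ** 2

def lbgi(readings):
    """Average of r(BG) over all readings, counting only those with f(BG) < 0."""
    return sum(risk(x) for x in readings if f(x) < 0) / len(readings)

def hbgi(readings):
    """Average of r(BG) over all readings, counting only those with f(BG) > 0."""
    return sum(risk(x) for x in readings if f(x) > 0) / len(readings)

readings = [3.2, 4.5, 6.0, 7.8, 11.0, 14.5]  # hypothetical BG readings in mM
print(round(lbgi(readings), 2), round(hbgi(readings), 2))
```

With these parameters the risk is close to 100 at both ends of the BG range (1.1 mM and 33.3 mM), which is what the scaling condition was chosen to achieve.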
Symmetrization of the BG Measurement Scale
Risk Analysis of Blood Glucose Data: Theory and Algorithms
[Figure: risk function r(BG), y = 10 * x^2, plotted on the Transformed BG Scale from -3 to 3; vertical axis from 0 to 100; Low BG Risk branch over Hypoglycemia, High BG Risk branch over Hyperglycemia, Target Range and Clinical and Numerical Center marked]
• Evaluation of HbA1c
• Assessment of Long-Term Risk for [Severe] Hypoglycemia
• Assessment of Short-Term Risk for [Severe] Hypoglycemia
• Predicts 40% of SH episodes for the subsequent 6 months;
• Predicts 50% of imminent SH episodes (24 hours);
• The technology has been licensed by Lifescan Inc, Milpitas, CA;
The Blood Glucose Risk Function:
(As Defined on the Original Blood Glucose Scale)
[Figure: r(BG) against BG Level (mM), from 0 to 34; vertical axis from 0 to 100; Low BG Risk and High BG Risk branches with the Target Range marked]
Heart Rate Variability and the Risk for Neonatal
Sepsis
• 4 million births
• 40,000 very low birth weight
(<1500 grams) infants
• 15,000 NICU beds
• 400,000 NICU admissions
Neonatal Sepsis: A Major Public Health
Problem
• Risk of sepsis is high
– 25 - 40% of VLBW infants develop sepsis while
in the neonatal intensive care unit
• Significant mortality and morbidity
– In VLBW infants, sepsis doubles the risk of
dying
– Length of stay is increased by 1 month
– Health care costs are increased
Current Practice for Infants at Risk for
Sepsis
• Nurse relates that infant in NICU is “not
acting right” or “looks a little off”
• Physicians must take the cautious approach,
suspecting sepsis
• Assessment includes invasive tests:
– CBC, blood culture, urine culture, lumbar
puncture
• Intervention: antibiotics
Problems with
Current Medical Practice
• Nurses and physicians’ subjective assessments
are neither sensitive nor specific
• Diagnostic tests have important limitations:
– invasive
– not performed until infant has clinical signs
– various CBC components range from 11% to 77%
Need for Better Risk
Assessment for Neonatal Sepsis
• Tremendous need for continuous non-invasive
monitoring for sepsis
• Any device that adds objective information about
infant’s state of health from continuous risk
assessment monitoring would be helpful
[Figure: RR interval time series, panels A, B, and C; vertical axis: Magnitude of RR interval [msec], 300 to 600; horizontal axis: Time [RR interval number], 0 to 4096]
[Figure: the same RR interval series, panels A, B, and C, each paired with a histogram of Difference from median [msec], with frequency ticks from 1 to 10,000 and the median marked]
A: Sample Asymmetry=1.37, R1=42, R2=57.5
B: Sample Asymmetry=2.97, R1=27, R2=79.5
C: Sample Asymmetry=11.8, R1=45.5, R2=538.5