P(t) - Rice University Statistics


Modeling Signals in DNA
Objectives
• Finding patterns in DNA
– Annotation: detecting coding and regulatory regions
– Detecting repeat regions
– Detecting regularities like CpG-islands
– …
• Developing statistical tests
– Differentiating apparent from true regularities
– …
Dogma of molecular biology
•  – time of gene transcription
•  – turnover rate of the protein
• R(t) – measured “gene expression” level
• P(t) – measured “protein expression” level
Gene regulation
Gene structure in Eukaryotes
Regulatory logic
Some transcription factors act at certain times, some in certain cells, and others in response to signals
DNA as a sequence of independent random variables

Joint frequencies

Joint frequencies p_ij of adjacent nucleotides X_{n−1} = i (rows) and X_n = j (columns):

             Xn:   A     C     G     T
  Xn−1 = A       pAA   pAC   pAG   pAT
         C       pCA   pCC   pCG   pCT
         G       pGA   pGC   pGG   pGT
         T       pTA   pTC   pTG   pTT
Association test
• If independence holds, we should observe p_ij = p_i· p_·j
• We do not observe probabilities, but frequencies (counts) Y_ij:

             Xn:   A     C     G     T
  Xn−1 = A       YAA   YAC   YAG   YAT
         C       YCA   YCC   YCG   YCT
         G       YGA   YGC   YGG   YGT
         T       YTA   YTC   YTG   YTT

• We compare Y_ij against E_ij = Y_i· Y_·j / Y··
• Under the null hypothesis of independence, the index

    I = Σ_ij (Y_ij − E_ij)² / E_ij

  has a χ² distribution with (4 − 1)(4 − 1) = 9 df
• Usual testing theory applies (Section 3.4.3, Example 4)
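As an illustration only (not from the slides), the χ² index can be computed directly from a sequence; the sequence below is simulated iid DNA, and the function name is a choice made here:

```python
import random
from collections import Counter

def dinucleotide_chi_square(seq):
    """Index I = sum_ij (Y_ij - E_ij)^2 / E_ij for adjacent-nucleotide
    counts Y_ij, with E_ij = Y_i. * Y_.j / Y.. ; approximately
    chi-square with 9 df under independence of adjacent letters."""
    bases = "ACGT"
    y = Counter(zip(seq, seq[1:]))                             # Y_ij
    total = sum(y.values())                                    # Y..
    row = {a: sum(y[(a, b)] for b in bases) for a in bases}    # Y_i.
    col = {b: sum(y[(a, b)] for a in bases) for b in bases}    # Y_.j
    stat = 0.0
    for a in bases:
        for b in bases:
            e = row[a] * col[b] / total                        # E_ij
            if e > 0:
                stat += (y[(a, b)] - e) ** 2 / e
    return stat

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(10_000))
I = dinucleotide_chi_square(seq)   # compare against chi-square(9 df) quantiles
```

For a truly iid sequence, I should only rarely exceed the 0.95 quantile of the χ² distribution with 9 df (about 16.9).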
Weight matrices
• Representation of nucleotide frequencies at n (here, n = 5) distinctive points in the DNA sequence (indexed i1, i2, …, in and called “the signal”).
• In the case of a Markov-process representation, these are the marginal distributions of the state of the process at discrete times i1, i2, …, in.

         Xi1   Xi2   Xi3   Xi4   Xi5
    A    pA1   pA2   pA3   pA4   pA5
    C    pC1   pC2   pC3   pC4   pC5
    G    pG1   pG2   pG3   pG4   pG5
    T    pT1   pT2   pT3   pT4   pT5
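A weight matrix is simply the table of column-wise frequencies over a set of aligned signals. A minimal sketch (the aligned sequences and the function name are invented for illustration):

```python
def weight_matrix(signals):
    """Column-wise nucleotide frequencies for aligned signal sequences.

    Returns a dict mapping each base to its per-position frequencies
    p_{base,k}, k = 1..n, i.e. the rows of the weight matrix."""
    n = len(signals[0])
    counts = {b: [0] * n for b in "ACGT"}
    for s in signals:
        for k, b in enumerate(s):
            counts[b][k] += 1
    m = len(signals)
    return {b: [c / m for c in counts[b]] for b in "ACGT"}

# hypothetical aligned 5-base signals (e.g. a splice-site-like motif)
signals = ["GTAAG", "GTGAG", "GTAAG", "GTATG"]
W = weight_matrix(signals)
```

Each column of W sums to 1; column k is the marginal distribution of X_{i_k}.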
Maximal dependence decomposition
• Purposes:
– To determine the residue of the signal most strongly impacting other residues
in the signal.
– To compute weight matrices for the remaining residues, conditional on the
value of the impacting residue.
• Independent reading, Section 5.3.4.
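A minimal sketch of the first MDD step, using the χ² association index from the earlier slide as the dependence measure between pairs of positions (the function name and this particular choice of measure are mine, not necessarily those of Section 5.3.4):

```python
from itertools import product

def mdd_split_position(signals):
    """First step of maximal dependence decomposition: for each signal
    position i, sum the chi-square association indices between column i
    and every other column j, and return the position with the largest
    total (the residue most strongly impacting the others)."""
    n, m, bases = len(signals[0]), len(signals), "ACGT"

    def chi2(i, j):
        # 4x4 counts of the pair (column i, column j) across signals
        y = {}
        for s in signals:
            y[(s[i], s[j])] = y.get((s[i], s[j]), 0) + 1
        stat = 0.0
        for a, b in product(bases, repeat=2):
            e = (sum(y.get((a, c), 0) for c in bases)
                 * sum(y.get((c, b), 0) for c in bases)) / m
            if e > 0:
                stat += (y.get((a, b), 0) - e) ** 2 / e
        return stat

    totals = [sum(chi2(i, j) for j in range(n) if j != i) for i in range(n)]
    return max(range(n), key=totals.__getitem__)
```

The signals would then be split by the residue at this position, and a conditional weight matrix computed within each group.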
Detection of long repeats
• Purpose: To detect whether long runs of a nucleotide (say, A) are “accidental”, i.e., caused by a Bernoulli mechanism.
• Technique: Derive the probability that the maximum repeat length (in a sequence of length N) exceeds a given number under Bernoulli trials.
• Derivation:
  – Probability of #(repeats) equal to y:
      Pr[Y = y] = (1 − p) p^y,   y = 0, 1, 2, …
  – Probability of #(repeats) at least y:
      Pr[Y ≥ y] = p^y
  – Probability of the maximum run of repeats being at least y, given the number n of runs:
      Pr[Ymax ≥ y | n] = 1 − (1 − p^y)^n,   Ymax = max(Y1, …, Yn)
  – Expected #(failures) (non-A) = #(runs of length ≥ 0):
      n ≈ N(1 − p)
  – Finally:
      Pr[Ymax ≥ y] ≈ 1 − exp[−p^y N(1 − p)]
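A numeric sketch of the final approximation, taking p = 1/4 for a single nucleotide in uniform iid DNA (the function names and the simulated sequence are mine):

```python
import math
import random

def p_max_run_at_least(y, N, p=0.25):
    """Approximate Pr[Ymax >= y], the chance that the longest run of a
    fixed nucleotide in a Bernoulli(p) sequence of length N reaches y,
    via Pr[Ymax >= y] ~ 1 - exp(-p**y * N * (1 - p))."""
    return 1.0 - math.exp(-(p ** y) * N * (1.0 - p))

def longest_run(seq, base="A"):
    """Length of the longest run of `base` in seq."""
    best = cur = 0
    for b in seq:
        cur = cur + 1 if b == base else 0
        best = max(best, cur)
    return best

random.seed(1)
N = 100_000
seq = "".join(random.choice("ACGT") for _ in range(N))
# a run of 20 A's in an iid sequence of this length would be astonishing:
p_20 = p_max_run_at_least(20, N)
```

Here p_20 is on the order of 10⁻⁷, so observing such a run would strongly reject the Bernoulli mechanism.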
More accuracy
• Improvement: Use the distribution of the number of failures:

    Pr[n = j] = C(N, j) (1 − p)^j p^(N − j)

    Pr[Ymax ≥ y] = Σ_{j=0}^{N} Pr[n = j] [1 − (1 − p^y)^j]
• Testing (explain using examples)
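The refined formula can be evaluated directly; this sketch (names invented here) also shows that it stays close to the cruder exponential approximation:

```python
import math

def p_max_run_exact_mix(y, N, p=0.25):
    """Pr[Ymax >= y] averaged over the binomial(N, 1-p) distribution of
    the number of failures j:
        sum_j C(N, j) (1-p)^j p^(N-j) * [1 - (1 - p**y)**j]."""
    q = 1.0 - p
    total = 0.0
    for j in range(N + 1):
        w = math.comb(N, j) * q ** j * p ** (N - j)   # Pr[n = j]
        total += w * (1.0 - (1.0 - p ** y) ** j)
    return total

crude = 1.0 - math.exp(-0.25 ** 5 * 200 * 0.75)   # exponential approximation
fine = p_max_run_exact_mix(5, 200)                # binomial mixture
```

For N = 200 and y = 5 the two values closely agree, which is why the simpler exponential form is often adequate.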
r-scans (1-scans)
• Purpose: Inspect occurrences of a relatively short “word” (motif) in a long sequence, and determine whether they are “randomly” distributed, against the possibilities that they are clumped or overdispersed.
• Idealizations:
  – Occurrences of the motif are represented as points of the unit interval
  – Null hypothesis: coordinates of the points are iid uniform on (0, 1)
  – Alternative hypotheses are not clearly specified
Partitions of the unit interval
• Distances between words = differences of order statistics of iid uniform(0,1) rv’s:

    U_i = V_i − V_{i−1},   i = 1, …, n + 1,
    0 = V_0 ≤ V_1 ≤ … ≤ V_i ≤ V_{i+1} ≤ … ≤ V_{n+1} = 1,

  giving the spacings U_1, U_2, …, U_{n+1}.
• Distances can be represented in terms of ratios of iid exponential rv’s with arbitrary parameter λ:

    U_i = X_i / (Σ_{k=1}^{n+1} X_k),   i = 1, …, n + 1
Distribution of the maximum

    X_1, X_2, …, X_{n+1} iid ~ exp(λ)
    Y_1 = min(X_1, …, X_{n+1}) ~ exp[(n + 1)λ]
    Y_2 = min(remaining X_i’s); by the memoryless property, Y_2 ~ exp[nλ]
    Y_3 ~ exp[(n − 1)λ]
    ⋮
    Y_n ~ exp[2λ]
    Y_{n+1} ~ exp[λ]
    X_max = Y_1 + … + Y_{n+1}   (independent)

For mathematical background see Sections 2.10 and 2.11.

    E(X_max) = (1/λ) [1/(n + 1) + 1/n + … + 1/2 + 1]
    V(X_max) = (1/λ²) [1/(n + 1)² + 1/n² + … + 1/4 + 1]
Asymptotic expressions
• For large n, we have

    E(X_max) = (1/λ) [1/(n + 1) + 1/n + … + 1] ~ (1/λ) [ln(n + 1) + γ]
    V(X_max) = (1/λ²) [1/(n + 1)² + 1/n² + … + 1] ~ π² / (6λ²)

  where γ = 0.577216… is Euler’s constant.
• Using asymptotic expansions (Chapter 2), we obtain

    Pr[U_max ≥ u] ~ 1 − exp{−(n + 1) exp[−(n + 1)u]}
    Pr[U_max ≥ (ln(n + 1) + u)/(n + 1)] ~ 1 − exp[−exp(−u)]

  which provides us with a test statistic.
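The limiting Gumbel form turns the maximal spacing into a usable test statistic. A sketch (the function name and the decision to report the upper-tail probability are choices made here):

```python
import math
import random

def max_spacing_p_value(points):
    """Approximate upper-tail p-value for the maximal gap among n points
    on (0,1): with Umax the largest spacing,
        Pr[Umax >= u] ~ 1 - exp(-exp(-t)),  t = (n + 1) u - ln(n + 1),
    the Gumbel limit above. A small p-value flags one unusually large
    gap, i.e. the points are clumped elsewhere in the interval."""
    n = len(points)
    v = [0.0] + sorted(points) + [1.0]
    umax = max(b - a for a, b in zip(v, v[1:]))
    t = (n + 1) * umax - math.log(n + 1)
    return 1.0 - math.exp(-math.exp(-t))

random.seed(3)
pts = [random.random() for _ in range(200)]
pv = max_spacing_p_value(pts)   # unremarkable for iid uniform points
```

For motif occurrences rescaled to (0, 1), a tiny p-value is evidence against the iid-uniform null.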
Generalizations
• More stable r-scans:

    R_i^(r) = U_i + U_{i+1} + … + U_{i+r−1} = Σ_{j=i}^{i+r−1} U_j

  (“sliding windows” of U_j’s; Karlin and Macken)
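The r-scan statistic is just a sliding-window sum of r consecutive spacings; a one-line sketch (function name invented):

```python
def r_scans(spacings, r):
    """Sliding-window sums R_i^(r) = U_i + ... + U_{i+r-1} over r
    consecutive spacings, in the spirit of Karlin and Macken."""
    return [sum(spacings[i:i + r]) for i in range(len(spacings) - r + 1)]
```

Extremes of R_i^(r) are less volatile test statistics than a single spacing, since each window pools r gaps.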
Analysis of Patterns
• Purpose: Answer questions concerning the occurrence of a word of finite length (“gaga”) in a long sequence of nucleotides.
• Null hypothesis: As usual, the DNA sequence is the result of a series of Bernoulli trials.
• Specific questions:
  – How frequent is the word in a sequence of length N?
  – What is the distance between successive repeats of the word?
• Counting method: Including overlaps (otherwise difficult)
Counting word occurrences with overlaps
• Take the word “gaga”; denote by I_j the indicator that the end-point of “gaga” is located at position j in the sequence. So the number of occurrences is equal to

    Y_1(N) = I_4 + I_5 + … + I_N

• The expected value of the indicator I_j is equal to (1/4)^4 = 1/256, independently of the word, so

    E[Y_1(N)] = (N − 3)/256

• What about the variance?

    Var[Y_1(N)] = E[(I_4 + I_5 + … + I_N)²] − ((N − 3)/256)²
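Counting with overlaps is direct; this sketch (names invented, sequence simulated) checks the count against E[Y1(N)] = (N − 3)/256:

```python
import random

def count_overlapping(seq, word="gaga"):
    """Number of (possibly overlapping) occurrences of `word`:
    Y1(N) = I_4 + ... + I_N, where I_j indicates that an occurrence
    ends at position j (1-based)."""
    k = len(word)
    return sum(seq[j - k:j] == word for j in range(k, len(seq) + 1))

random.seed(4)
N = 256_000
seq = "".join(random.choice("acgt") for _ in range(N))
y = count_overlapping(seq)   # expected about (N - 3)/256, i.e. near 1000
```

Note that "gagaga" contains two overlapping occurrences of "gaga", which a non-overlapping count would miss.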
Variance of the repeat count

    E[(I_4 + I_5 + … + I_N)²]
      = E(I_4² + I_5² + … + I_N²)
      + 2(N − 4) E[I_j I_{j+1}]
      + 2(N − 5) E[I_j I_{j+2}]
      + 2(N − 6) E[I_j I_{j+3}]
      + (N − 6)(N − 7) E[I_j I_k]   (|j − k| > 3)

For “gaga”, E[I_j I_{j+1}] = E[I_j I_{j+3}] = 0 (the word cannot overlap itself at shifts 1 or 3), E[I_j I_{j+2}] = (1/4)^6 (the overlap “gagaga”), and indicators with |j − k| > 3 are independent, which yields

    Var[Y_1(N)] = (281N − 895)/65536
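The closed-form variance can be verified by exhaustive enumeration for a small length N (a brute-force check written for this transcript, not part of the original derivation):

```python
from fractions import Fraction
from itertools import product

def theoretical_var(N):
    """Variance of the overlap count of 'gaga' in an iid uniform
    sequence of length N: (281 N - 895)/65536."""
    return Fraction(281 * N - 895, 65536)

def exact_var(N):
    """Exact variance by enumerating all 4**N sequences and counting
    overlapping occurrences of 'gaga' in each."""
    word = ("g", "a", "g", "a")
    total = total_sq = 0
    for seq in product("acgt", repeat=N):
        y = sum(seq[j:j + 4] == word for j in range(N - 3))
        total += y
        total_sq += y * y
    mean = Fraction(total, 4 ** N)
    return Fraction(total_sq, 4 ** N) - mean * mean
```

For N = 9 the enumeration over all 4⁹ sequences reproduces (281·9 − 895)/65536 exactly.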
Remarks
• The variance depends on the word’s overlap structure (5.29); the expected count does not
• The distribution of Y_1(N) is not binomial
• Read Section 5.7.3 for the distance between occurrences
• When overlaps are not counted, things are much more challenging