Applications of scan statistics in molecular biology and neuroscience
by Chan Hock Peng
Dept of Statistics and Applied Probability
Outline
• 1. General introduction
• 2. Applications in molecular biology
(weighted scan statistics)
• 3. Tail probability computations
• 4. Applications in neuroscience
(template matching problem)
• 5. Tail probability computations
• 6. Extensions and other applications
Notation
• M_u: The maximum score in any window of length u.
• λ: The underlying rate of events occurring under normal circumstances.
• n: The length of the interval under consideration.
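To make the notation concrete, here is a minimal sketch of how M_u could be computed from a list of event times when every event scores 1. The function name `scan_statistic`, the half-open window [t, t + u), and the sample data are assumptions made for illustration; they do not come from the talk.

```python
def scan_statistic(event_times, u):
    """Largest number of events in any window [t, t + u) (unit score per event)."""
    times = sorted(event_times)
    best = 0
    left = 0
    for right, t in enumerate(times):
        # Advance the left end until times[left] and t fit in one window of length u.
        while t - times[left] >= u:
            left += 1
        best = max(best, right - left + 1)
    return best


# Hypothetical event days: three events fall within one 30-day window.
event_days = [3, 20, 29, 400, 900]
print(scan_statistic(event_days, 30))  # 3
```

Because all scores are positive, the maximum is attained by a window that starts at an event, so a single pass with two pointers is enough.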
Example 1
• (USA Today, 1996) On Feb 22, the US Navy suspended all operations of the F-14 jet after the third crash in one month.
• Three crashes in a month was seven times the expected rate based on a 5-year period.
• M_30 = 3, n = 5*365, λ = 1/70.
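The rate follows from the article's figures: three crashes in 30 days being seven times the expected rate gives an expected 3/7 crashes per 30 days, i.e. λ = (3/7)/30 = 1/70 per day. A rough Monte Carlo check of how unusual M_30 ≥ 3 is under a homogeneous Poisson model might look like the sketch below. The Poisson assumption, the function names, the replication count and the seed are all choices made here for illustration, not part of the talk.

```python
import random


def simulate_poisson_times(lam, n, rng):
    """Event times of a rate-lam Poisson process on [0, n), via exponential gaps."""
    times = []
    t = rng.expovariate(lam)
    while t < n:
        times.append(t)
        t += rng.expovariate(lam)
    return times


def max_window_count(times, u):
    """Largest number of (sorted) event times in any window [t, t + u)."""
    best = 0
    left = 0
    for right, t in enumerate(times):
        while t - times[left] >= u:
            left += 1
        best = max(best, right - left + 1)
    return best


def mc_tail_probability(lam, n, u, k, reps=2000, seed=0):
    """Monte Carlo estimate of P{M_u >= k} under the Poisson null model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        if max_window_count(simulate_poisson_times(lam, n, rng), u) >= k:
            hits += 1
    return hits / reps


# Example 1 parameters: u = 30 days, n = 5 * 365 days, lambda = 1/70 per day.
print(mc_tail_probability(lam=1 / 70, n=5 * 365, u=30, k=3))
```

The estimate varies with the seed and the number of replications, so it should be read as an order-of-magnitude check rather than a p-value to quote.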
Example 2
• (Home News, 1995) In a 10-month period, 11 residents died at a Tennessee state institution. The number was twice what was expected.
• The judge was angry and ordered the mental health commissioner to spend one in four weekends at the institution.
• M_10 = 11, n = ?, λ = 11/20.
Clusters of DAM sites in E. coli DNA
• Karlin and Brendel (1992).
• DAM site: occurrence of the pattern GATC.
• Important in repair and replication of DNA.
• M_245 = 8, n = 4.7 million, λ = 1.1/250.
• P-value approximation of Naus (1982): P{M_245 ≥ 8} ≈ 0.87, P{M_245 ≥ 10} ≈ 0.03.
Palindromes in DNA
• A-T and C-G are complementary
bases.
• Complement of CCACGTGG is
GGTGCACC.
• CCACGTGG is a palindromic pattern because its complement reads the same as itself backwards.
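The definition above can be checked mechanically. The following sketch uses helper names (`complement`, `is_palindromic`) invented here for illustration.

```python
# Watson-Crick base pairing: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq):
    """Base-by-base complement of a DNA string (same reading direction)."""
    return "".join(COMPLEMENT[base] for base in seq)


def is_palindromic(seq):
    """True if the complement, read backwards, equals the sequence itself."""
    return complement(seq)[::-1] == seq


print(complement("CCACGTGG"))      # GGTGCACC
print(is_palindromic("CCACGTGG"))  # True
```

By the same test, the DAM site pattern GATC is itself palindromic.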
Palindromic sequences in
viruses
• Masse et al. (1992) & Leung et al. (1994).
• Palindromic sequences cluster around the origin of replication.
• An event occurs if there is a palindromic pattern of length at least 10 base pairs.
• HCMV sequence. M_1000 = 10, n = 229354, λ = 0.001. p-value = 0.00195.

Extensions to general scoring
functions (weighted scan)
• In Chew, Choi and Leung (2005),
longer palindromic patterns are given
larger weights.
• For example, a pattern of length k
can be given score of k/10.
• p-value computations ?
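As an illustration of the weighted scan, the sketch below sums scores k/10 over palindromes whose start positions fall in a window of length u and takes the maximum over windows. The function name, the window convention [t, t + u), and the sample positions and lengths are hypothetical. The sliding-window shortcut also assumes nonnegative scores.

```python
def weighted_scan(events, u):
    """Max total score in any window [t, t + u).

    events: list of (position, score) pairs with nonnegative scores.
    """
    events = sorted(events)
    best = 0.0
    total = 0.0
    left = 0
    for pos, score in events:
        total += score
        # Drop events that no longer fit in a window of length u ending at pos.
        while pos - events[left][0] >= u:
            total -= events[left][1]
            left += 1
        best = max(best, total)
    return best


# Hypothetical palindromes given as (start position, length k); score is k / 10.
palindromes = [(100, 10), (350, 14), (900, 12), (5000, 10)]
scored = [(pos, k / 10) for pos, k in palindromes]
print(weighted_scan(scored, u=1000))
```

With unit scores this reduces to the ordinary scan statistic M_u.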
Other applications of
weighted scan
• Rajewsky et al. (2002) & Lifanov et
al. (2003).
• Scanning for clusters of transcription
factor binding sites.
• Position weighted matrices to score
words for similarity to a given motif.
• Siepel et al. (2005). Searching for
segments of high evolutionary
conservation.
P-value computations for
weighted scan
• Chan and Zhang (2006).
\[
P\{M_u \ge k\} \approx 1 - \exp\left\{ -\frac{(n-u)\, e^{-uI(k/u)}\, (k/u - \lambda)\, \nu}{\sqrt{2\pi u K''}} \right\}
\]
where
• I is a large deviation rate function.
• ν is an overshoot function.
• K is the moment generating function of the scores.
Template matching in
neuroscience
• Neurons are basic units of
information processing in brain.
• Generate small and highly peaked
electric potentials known as spikes.
• Pattern of spikes modeled as point or
counting process, e.g. Poisson
process.
Template pattern
• Dave and Margoliash (2000) and Mooney (2000): w = (w^(1), ..., w^(d)) are the spike patterns of a zebra finch when it is listening to a bird song.
• Each w^(i) contains the times at which spikes were generated for the ith neuron in an interval of time [0,T).
Longer spike train patterns
• Let y = (y^(1), ..., y^(d)) be the corresponding spike train patterns when the finch is sleeping, observed over a longer period of time [0,a).
• If w matches well with a segment of y, then there is evidence of bird song replay and hence song learning during sleep.
Scoring function
• Consider kernel function f, e.g. let
f(x) = 1 if x < 0.025 ms, f(x)=-0.3 if
x> 0.025 ms.
• For the illustration below, consider
d=1 and T=0.2ms.
• Let w={.01, .05, .09, .12}.
• Let y ={.32, .75, 1.03, 1.15, 1.25 }.
• To check if there is a match between w and the segment of y starting at time t=1, compare w = {.01, .05, .09, .12} against y_1 = {.03, .15}, the points of y in [1, 1.2) shifted back by t.
• The point .03 provides a score of 1 because there is a point in w less than 0.025ms away.
• The point .15 provides a score of -0.3 because the nearest point in w is more than 0.025ms away.
• Overall score at time t=1 is 1-0.3=0.7.
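The worked example can be reproduced with a few lines of code. This is a minimal sketch for d=1: the function names are invented here, and treating a distance of exactly 0.025 as a miss is an assumption, since the slide does not define f at that point.

```python
def kernel(x, tol=0.025, hit=1.0, miss=-0.3):
    """Kernel f from the example; the value at x == tol is an assumption (miss)."""
    return hit if x < tol else miss


def match_score(w, y, t, T):
    """Score of template w against the segment of y in [t, t + T), shifted by t."""
    segment = [s - t for s in y if t <= s < t + T]
    # Each shifted spike is scored by its distance to the nearest template spike.
    return sum(kernel(min(abs(s - p) for p in w)) for s in segment)


w = [0.01, 0.05, 0.09, 0.12]
y = [0.32, 0.75, 1.03, 1.15, 1.25]
print(round(match_score(w, y, t=1.0, T=0.2), 10))  # 0.7
```

The spike at 1.25 is excluded because it lies outside [1, 1.2), which is why only .03 and .15 are compared against w.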
Scan statistics
• For d>1, add up scores over all neurons
starting at same time t.
• The scan statistic M_T is the maximum score over all t in the interval [0,a-T).
• Chi (2004) obtained an approximation of log(P{M_T ≥ c}).
• Chan & Loh (2005) obtained a more precise approximation of P{M_T ≥ c}.
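A simple way to approximate M_T numerically is to evaluate the summed score on a grid of start times t. The sketch below does this for d spike trains. The grid step, the function names, and the convention at a distance of exactly 0.025 are assumptions made here, and a grid search can miss the exact maximum over continuous t.

```python
def kernel(x):
    """Kernel from the scoring example: 1 within 0.025 of a template spike, else -0.3."""
    return 1.0 if x < 0.025 else -0.3


def score_at(w, y, t, T):
    """Sum over neurons of the template score for the segment of y in [t, t + T)."""
    total = 0.0
    for w_i, y_i in zip(w, y):  # one template and one spike train per neuron
        for s in y_i:
            if t <= s < t + T:
                total += kernel(min(abs((s - t) - p) for p in w_i))
    return total


def scan_statistic(w, y, T, a, step=0.005):
    """Grid approximation of M_T: max score over t = 0, step, 2*step, ... in [0, a - T)."""
    n_steps = int(round((a - T) / step))
    return max(score_at(w, y, i * step, T) for i in range(n_steps))


# d = 1 data from the scoring example, observed on [0, 1.3).
w = [[0.01, 0.05, 0.09, 0.12]]
y = [[0.32, 0.75, 1.03, 1.15, 1.25]]
print(scan_statistic(w, y, T=0.2, a=1.3))
```

A finer step gives a closer approximation at proportionally higher cost.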
Assumptions and related
information
• Each w^(i) is stationary while y^(1), ..., y^(d) are independent Poisson processes.
• Separate formulas when kernel f is
continuous and when it is not continuous.
• The number of times a large score c is exceeded is a Poisson random variable.
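The Poisson property is what links exceedance counts to tail probabilities. As a sketch, writing N_c for the number of exceedances of c and μ_c for its mean (both symbols are introduced here for illustration and are not from the talk), the event {M_T ≥ c} is the event that at least one exceedance occurs, so

\[
P\{M_T \ge c\} = P\{N_c \ge 1\} \approx 1 - e^{-\mu_c}.
\]

This matches the 1 - exp{...} shape of the weighted scan approximation attributed to Chan and Zhang (2006) earlier in the talk.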
Table of approximations
c        MC (s.e.)          C&L
0.017    0.0387 (0.0019)    0.0383
0.018    0.0237 (0.0012)    0.0241
0.019    0.0158 (0.0008)    0.0149
0.020    0.0095 (0.0005)    0.0091
0.021    0.0054 (0.0003)    0.0055
0.022    0.0033 (0.0002)    0.0033
MC: Monte Carlo estimate with standard error. C&L: approximation of Chan & Loh (2005).
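One way to read the table of approximations is to express each difference between the C&L value and the Monte Carlo estimate in units of the Monte Carlo standard error. The sketch below does that with the tabulated numbers. The variable names and the choice of this comparison are assumptions made here, not part of the talk.

```python
# Rows from the table: (c, Monte Carlo estimate, standard error, C&L approximation).
rows = [
    (0.017, 0.0387, 0.0019, 0.0383),
    (0.018, 0.0237, 0.0012, 0.0241),
    (0.019, 0.0158, 0.0008, 0.0149),
    (0.020, 0.0095, 0.0005, 0.0091),
    (0.021, 0.0054, 0.0003, 0.0055),
    (0.022, 0.0033, 0.0002, 0.0033),
]

# Difference between the analytic approximation and the simulation, in s.e. units.
z_scores = [(approx - mc) / se for _, mc, se, approx in rows]

for (c, _, _, _), z in zip(rows, z_scores):
    print(f"c = {c:.3f}: (C&L - MC) / s.e. = {z:+.2f}")
```

Every difference is within two standard errors, so the approximation and the simulation agree to within Monte Carlo noise over this range of c.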
Future work
• Higher-dimensional Poisson processes, e.g. 2 or 3 dimensions.
• Applications in astronomy and
imaging.
• Varying window-sizes.