Random Walk - Muskingum University

Download Report

Transcript Random Walk - Muskingum University

Random walk
Presented by Changqing Li
Mathematics
•Probability
•Statistics
What is a Random Walk?

An Intuitive understanding: A series of
movement which direction and size are randomly
decided (e.g., the path a drunk person left behind).

Formal Definition: Let X a fixed vector in the
d-dimensional Euclidean space R d and X , n  1 a
sequence of independent, identically distributed (i.i.d.)
real-valued random variables in R d. The discrete-time
stochastic process S  S n : n  1 defined by
0
n
Sn  X 0  X1   X n
is called a d-dimensional random walk
Why Random Walks?


A random walk (RW) is a useful model in
understanding stochastic processes across
a variety of scientific disciplines.
Random walk theory supplies the basic
probability theory behind BLAST ( the most
widely used sequence alignment theory).
Definitions (cont.)

If X 0 and RVs X n take values in I d ,
then S , n  1
is called d-dimensional
lattice random walk.
In the lattice walk
 case, if we only allow
the jump from X  ( x ,...,x ) to Y  ( x   ,...,x  
where
 k  1 or  1 , then the
process is called d-dimensional sample
random walk.
n

1
1
d
1
1
d
d
)
Definitions (cont.)
A random walk is defined as restricted
walk if the walk is limited to the interval
[a, b].
The endpoints a and b are called
absorbing barriers if the random walk
eventually stays there forever;
or reflecting barriers if the walk reaches
the endpoint and bounces back.
Example: DNA sequence alignment
modeled as RW
|
|
|
|||
||
|||
ggagactgtagacagctaatgctata
Gaacgccctagccacgagcccttatc
Simple scoring schemes:
at a position: +1, same nucleotides
-1, different nucleotides
*
Example: simple RW
Ladder point
Ladder Point (LP):the point in the walk lower
than any previously reached points.
Excursion: the part of the walk from a LP until
the highest point attained before the next LP.
Excursions in Fig: 1, 1, 4, 0, 0, 0, 3;
BLAST theory focused on the maximum heights
achieved by these excursions.
Example : General RW

Consider arbitrary scoring scheme
(e.g. substitution matrix)
General Walk
Suppose generally the possible step sizes are,
and their respective
 c,c  1,...,0,...,d  1, d
probabilities are, pc , pc 1,...,pd


The mean of step size is negative, i.e.,
E(S ) 
d

j c

jp j  0
The mgf of S(step size) is,
m( ) 
d

j c
p j e j
General Walk

There exists unique positive  * , such that,
d
p
j c

j
e
j *
1
To consider the walk that start at 0, with stopping
boundary at -1 and without upper boundary, impose
an artificial barrier at
y0
The possible stopping points
can be,
 c,c  1,..., y,..., y  d  1.

And Wald’s Identity states,
E(e T )  1 where, TN is the total displacement
when the walk stops.
*
N
General Walk

Thus,
1
P e
k  c
k
k
*

y  d 1

ky
Pk e
k *
1
Where, Pk is the probability that the walk
finishes at the point k.
The mean of number of steps until the walk
stops A or m0 would be
jR j

E (T N )
j 1
A

d
E (S )  
jp j
c
j  c
Random Walks in real life!
In Supernova stars – how “star stuff” gets to be inside us (eventually!)
Random Walks (in your body)
Cells inside your body
How two liquids (and air!)
mix together! (Osmosis)
Random Walks and $$ (Wall
Street)
Stock Market – predicting the price /cost of a stock in the future
Application: BLAST


BLAST is the most frequently used method for
assessing which DNA or protein sequences in a large
database have significant similarity to a given query
sequence;
a procedure that searches for high-scoring local
alignments between sequences and then tests for
significance of the scores found via P-value.
The null hypothesis to be test is that for each
aligned pair of animo acids, the two amino acids
were generated by independent mechanism.
BLAST : modeling


The positions in the alignment are numbered from
left to right as 1, 2,…, N. A score S(j, k) is allocated
to each position where the aligned amino acid pair
(j,k) is observed, where S(j,k) is the (j,k) element in
the substitution matrix chosen.
An accumulated score at position i is calculated as
the sum of the scores for the various amino acid
comparison at position 1, 2,…,i. As i increases, the
accumulated score undergoes a random walk.
BLAST : calculating parameters


Let Y1, Y2,… be the respective maximum heights of the
excursions of this walk after leaving one ladder point
and before arriving the next, and let Ymax be the
maximum of these maxima. It is in effect the test
statistic used in BLAST. So it is necessary to find its null
hypothesis distribution.
The asymptotic probability distribution of any Yi is
shown to be the geometric-like distribution. The values
of C and  * in this distribution depend on the
substitution matrix used and the amino acid frequencies
{pj} and {pj’}. The probability distribution of Ymax also
depends on n, the mean number of ladder points in the
walk.
Reference




http://mathworld.wolfram.com/RandomWal
k2-Dimensional.html
http://mathworld.wolfram.com/BorelTannerDistribution.html
http://www.bioss.ac.uk/~dirk/talks/tutorial
_Blast.pdf#page=5&zoom=auto,53,792
http://www.jstor.org/discover/10.2307/278
51819?uid=3739840&uid=2129&uid=2&uid
=70&uid=4&uid=3739256&sid=211029915
85977