DNA Analysis - Thayer School of Engineering at Dartmouth

Download Report

Transcript DNA Analysis - Thayer School of Engineering at Dartmouth

DNA Analysis
Amir Golnabi
ENGS 112
Spring 2008
1
Outline:
1. Markov Chain
2. DNA and Modeling
3. Markovian Models for DNA Sequences
4. Hidden Markov Models (HMM)
5. HMM for DNA Sequences
6. Future Works
7. References
2
1.Markov Chain :
• Alphabet: S  s1 , s2 ,...,sJ 
• s j are called states, and S is the state
space
• Notation >
S  1,2,...,J 
• Sequence of random variables: X 0 , X1 ,...,X n ,...
• A sequence of random variables X n n0 is
called a Markov Chain, (MC), if for all n>=1
and

PX n  jn X 0  j0 ,...,X n1  jn1   PX n  jn X n1  jn1 
• The conditional probability of a future
event X n  jn depends only upon the immediate
past event X n1  jn1
3
1.Markov Chain (cont.)
• Conditional Probability:
Pi j  P X n  j X n1  i , n  1, i , j  S
 
P  Pi j
• Transition Matrix
• Property:
Pi j  0 ,
J
P
j 1
i j
J ,J
i 1 , j 1
1
• Higher-Order Markov Chains:
– Second order MC:
Pi ,k j  P X n  j X n1  i , X n 2  k , n  1, i , j , k  S
4
2.DNA and Modeling:
• Bases: {A,T,C,G}
• Complementary strands > sequence of bases in a single
strand
• Sequences are always read from 5’ to 3’ end.
• DNA  mRNA  proteins (transcription and translation)
• Codons: Triples of bases which code for amino acids
• 61 + 3 ‘stop’ codons
• Specific sequence of codons  gene  Chromosomes 
genome
• exons: coding portion of genes
• introns: non-coding regions
• Goal: To determine the nucleotide sequence of entire
genomes
5
3.Markov Chains for DNA Sequences
• Nucleotides are chained linearly one by one  local
dependence between the bases and their neighbors
• Markov chains offer computationally effective ways of
expressing the various frequencies and local dependencies
• Alphabet of bases = {A,T,C,G}  not uniformly
distributed in any sequence and the composition vary
within and between sequences
• The probability of finding a particular base at one
position can depend not only on the immediate adjacent
bases, but also on several more distant bases upstream or
downstream  higher order Markov model, (heterogeneous)
• Gene finding: Markov models of coding and non-coding
regions to classify segments as either exons or introns.
• Segmentation for decomposing DNA sequences into
homogeneous regions  Hidden Markov Models
6
4.Hidden Markov Models (HMM)
• Stochastic process generated by two interrelated
probabilistic mechanisms
• Underlying Markov chain with a finite number of states
and a set of random functions, each associated with its
respective state
• Changing the states: according to transition matrix
• Only the output of the random functions can be seen
• Advantage: HMM allow for local characteristics of
molecular sequences to be modeled and predicted within a
rigorous statistical framework, and also allow the
knowledge from prior investigations to be incorporated
into analysis.
7
5.HMM for DNA Sequences
• Every nucleotide in a DNA belongs to either a “Normal”
region (N), or a GC-rich region (R).
• No random distribution: Larger regions of (N) sequence
• Example of such a sequence:
NNNNNNNNNRRRRRNNNNNNNNNNNNNNNNNRRRRRRRNNNN
• States of HMM: {N,R}
• Possible DNA sequence with this underlying collection:
TTACTTGACGCCAGAAATCTATATTTGGTAACCCGACGCTAA
• No typical random collection of nucleotides: GC in R
regions: 83% vs. 23% in N regions
• HMM: Identify these types of feature in sequences
• Ability to capture both the patchiness of N and R and
different compositional frequencies within the categories
8
6.Future work…
• Better and deeper understanding of HMM
• Different applications of HMM, such as, Segmentation of
DNA Sequence and Gene Finding
• Build an automata for a simple case
7.References
• Koski, Timo. Hidden Markov Models for Bioinformatics. Sweden:
Kluwer Academic , 2001.
• Birney, E.. "Hidden Markov models in biological sequence
analysis". July 2001:
• Haussler, David. David Kulp, Martin Reese Frank Eeckman "A
Generalized Hidden Markov Model for the Recognition of Human
Genes in DNA".
• Boufounos, Petros, Sameh El-Difrawy, Dan Ehrlich. "HIDDEN
MARKOV MODELS FOR DNA SEQUENCING".
9