1 - Ipatimup

Download Report

Transcript 1 - Ipatimup

New applications of alignment-free
methods for biological sequence
analysis and comparison
Susana Vinga
13th Portugaliae Genetica
IPATIMUP, Porto
19 March 2010
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
1
The –omics era
2
Bioinformatics & Computational Biology
Bioinformatics
Alignment-free
methodologies
based on vector maps
Multidisciplinary area
Experimental (manipulation and
measurement)
Computational (mining and modeling)
Computer science, dynamic systems,
statistics, graph theory, molecular
biology, biochemistry, physiology,…
•Genome analysis
•Sequence analysis
•Structural bioinformatics
•Gene expression
•Genetics and population analysis
•Systems biology
•Databases and ontologies
Phylogenetics
Data and text mining
http://bioinformatics.oupjournal.org
Outline
• Motivation & Introduction
– Biological sequence analysis and comparison – alignmentbased and alignment-free strategies
– Modeling strategies, problem definition and concepts
• Methods & Examples (global and local analysis)
– Resolution based methods – L-tuple composition
– Iterated function systems – CGR/USM and genomic
signatures
– Information theory – Renyi entropy of DNA
– Markov chain models – statistical significance of motifs
– Entropic profiles
• Summary & Conclusions
GenBank
DNA
How to…
•Classify
•Analyze
•Integrate
this increasing
amount of
complex
information?
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Sequence alignment
• The fundamental idea of
alignment is that sequences that
share the same substrings might
have the same function or be
related by homology.
6
Aligment-based algorithms
Needleman-Wunsh (1970)
Smith-Waterman (1981)
• Build up the best alignment by using optimal alignments of
smaller subsequences
• Example of dynamic programming (Richard Bellman,1953):
–
–
–
–
A divide-and-conquer strategy:
Break the problem into smaller subproblems.
Solve the smaller problems optimally.
Use the sub-problem solutions to construct an optimal solution for
the original problem.
7
Alignment-based algorithms
8
Multiple aligment
Clustal
9
BLAST
Times Cited
(ISI):
27,556
10
11
Aligment-free methods review (2003)
12
Vector-valued functions of biological
sequences
GAATTCT
AATCTCC
CTCTCAA
CCCTACA
GTACCCA
13
(f1,…,fn)
Words in sequences
L-tuple composition
• Count “words” in sequence

cLX  cLX,1 ,, cLX,K
f LX 
c LX
X
c
 j 1 L, j
K
 f LX,i 

c LX,i
n  L 1
X  s1s2
si
sN
WL  wL,1 , wL,2 ,
, wL,K 
K  rL
• Example DNA sequence X=GTGTGA, extract and
count (overlapping) 3-tuples
WL  GTG, TGT , TGA, AAA, AAC ,
c3X   2,1,1,0,0,

f 3X   0.5,0.25,0.25,0,0,
14


Metrics and dissimilarities
• Euclidean
5
4
• Cosine
3
2
q
1
• Minkowski - City-block (m=1)
0
15
0
1
2
3
4
5
Dissimilarity
Euclidean
Dissimilarities
Weighted
Euclidean d2
Equation

d LE  X , Y   c LX  c LY
u
  c
T

K
X
L
d 2  X , Y     i c LX,i  c LY,i


K
 c LY   c LX,i  c LY,i

2
i 1

2
L l i 1
Standard
Euclidean
• L-tuple
• Resolution-free
d
SE
L
 X ,Y   c
X
L
c
  diag s
Y T
L

,, s KK   c  c
1
11
X
L
 
K
Y
L
c
X
L ,i
 c LY,i
i 1
sii



2
u
d SE*   d LSE
L l
Mahalanobis

d LM  X , Y   c LX  c LY

T


K
K

 S 1  c LX  c LY   c LX,i  c LY,i  sijinv  c LX, j  c LY, j
i 1 j 1
u
d M *   d LM
L l
• Sucessuful
applications
Linear
correlation
coefficient
KullbackLeibler
K
d LLCC  X , Y  
i 1
 K
 K  f LX,i
 i 1
 
K
d
KL
L
K
K
i 1
i 1
K  f LX,i  f LY,i   f LX,i   f LY,i
 X ,Y    f
i 1
Cosine
d LCOS  X ,Y   q XY
X
L ,i
2
 K

   f LX,i 
 i 1

 f LX,i
 log 2  Y
 f L ,i
2 12



 K
  K  f LY,i
 i 1
 
Vinga (2007) Editors: T.D.
Pham et al, pp. 71-105.
16
CGR/USM
c 
, where cosq  
XY
Kolmogorov
complexity
d LEVOL  X ,Y    ln 1  cosq XY  2

d USM a, b   log 2 max ai  bi
d KC  X , Y   1 
i
 K

   f LY,i 
 i 1

2 12







c
K
X T
L
X
L
 cLY
c
Y
L

c
i 1

K X   K X | Y 
K  XY 
X
L ,i
 cLY,i
 c    c 
K
i 1
Evolutionary
2
X 2
L ,i
K
j 1
2
Y
L, j

W-metric
• Based in 1-tuple frequencies and PAM/Blosum weights
Vinga et al. Bioinformatics, 2004. 20(2): p. 206-215.
17
Transforming L-tuples
Processing frequnecy vectors
Normalize, filtrate, feature selection, using algebraic and statistical tools
• Normalize by expected frequencies (Pietrokovski et al., 1990)
– according to (L-1)-tuple and (L-2)-tuple: contrast L-vocabulary (CV)
• Oligonucleotide bias (Rocha et al., 1998)
– over- and under represented L-tuples in genomic datasets might indicate
phenomena of positive/negative selection in B.subtilis
• Shortest unique substrings (Haubold et al., 2005)
– Occur only once and cannot be further reduced in length without losing
the property of uniqueness. Caenorhabditis elegans, human and mouse
genomes
• UniMarkers (Chen et al., 2002)
– fixed-length unique sequence markers might be used to assign the
genomic positions of SNP sites. UM’s appear only once in the genome
thus allowing to locate SNP’s much faster that alignment-based methods.
Create synteny maps (Liao et al., 2004)
18
Human beta globin genes
19
HUMHBB classification
20
10 European Languages
http://commons.wikimedia.org/wiki/File:Languages_of_Europe.png
Natural languages
22
Phylogenetic inference
• Genome trees and the nature of genome evolution (Snel
et al., Annual Rev Microbiol, 2005)
– Alignment-Free Genome Trees - reconstruction methods use a
statistic of the entire genomic DNA, or of all encoded proteins, to
derive a distance between genomes that is then used to cluster
them.
• “The fact that these alignment-free methods do not
incorporate so much standard molecular evolutionary
methodology and proven powerful evolutionary concepts
raises interesting questions, especially because they
perform reasonably well”
– Kolmogorov complexity (Li et al. Bioinformatics 2001) and LempelZiv complexity (Otu et al. Bioinformatics 2003) - Mitochondria
– Qi J et al (2004).; Volkovich Z et al. (2010); Zheng, XQ et al.
(2009).
– Haubold et al (2009). Estimate mutation distances
23
Hybrid approaches
• Edgar, R. C. (2004). MUSCLE: a multiple sequence
alignment method with reduced time and space
complexity. BMC Bioinformatics, 5, 113.
• Sperisen, P. & Pagni, M. (2005). JACOP: a simple and
robust method for the automated classification of protein
sequences with modular architecture. BMC Bioinformatics,
6, 216.
24
Iterated Function Systems (IFS)
• Chaos Game Representation (CGR) for DNA
• Higher-order generalization Universal Sequence Maps
(USM)
• Represent discrete sequences in a continuous map
(bijection)
• Relation with Markov Chains and suffix properties
Jeffrey, H. J. (1990). Chaos game representation of gene structure. Nucleic Acids Res,
18(8):2163–2170.
Almeida, J. S. and Vinga, S. (2002). Universal sequence map (USM) of arbitrary
discrete sequences. BMC Bioinformatics, 3(1):6.
CGR/USM Algorithm
(0,1)
C
ATGCGAGTGT...
X=ATGCGAGTGT...
G
1
(1,1)
ATGCG ...
0.9
ATGC ...
0.8
ATGCGAG ...
ATGCGAGTG ...
ATG ...
0.7
0.6
0.5
ATGCGA ...
0.4
ATGCGAGT ...
0.3
A ...
ATGCGAGTGT...
0.2
(0,0)
AT ...
0.1
(1,0)
0
A
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
T
CGR/USM Example
C
A
G
T
Suffix property
CCC
GCC
CGC
ACC
TCC
AGC TGC
CAC
C
GAC CTC
AAC TAC
CCA
ATC
GGC CCG
GCG CGG GGG
ACG TCG
AGG TGG
G
GTC
CAG GAG CTG
GTG
TTC
AAG TAG
TTG
GCA CGA GGA CCT
ATG
GCT CGT
GGT
CT
A
ACA TCA
AGA TGA
ACT
TCT
AGT
TGT
CAA GAA CTA
GTA
CAT
GAT
CTT
GTT
AAA TAA
TTA
AAT
TAT
ATT
TTT
ATA
AT
AAAGTAGGATAGTT
TT
A given L-tuple
(e.g. AGT) is always
mapped in the same
region – fractal
properties
Generalization of Markov
chain models
Transition probability tables
k-order  2k+1 divisions
Genomic signatures
Pervasive patterns even for
‘short’ DNA segments
GENSTYLE Deschavanne et al. 1999 Mol Biol Evol
Generalizations
• How to maximally “fill” the
space?
• Sierpinsky triangle (n=3)
• CGR (n=4)
• …
• Very academic…
Almeida and Vinga BMC Bioinformatics 2009 10:100
30
Information Theory
• Definition Shannon’s entropy
N states with probabilities pi, i=1,...,N
L-tuple frequencies
Measures randomness level of given system
(or predictability, complexity, ...)
Rényi entropy order  of probability distributions
•
discrete
•
continuous
Generalization of Shannon
Example: N=2 Coin
A not necessarily fair coin…
0.6
H
0.4
0.2
prob(HEADS)  p
prob(TAILS)  1  p
0
0
0.5
1
1
1.5

0.8
0.6
2
0.4
2.5
3
0.2
0
p
Maximum entropy when p=0.5
Highest unpredictability
Global Rényi entropy of DNA sequences
1. Represent the DNA sequence
Chaos Game
Representation/Universal
Sequence Maps (CGR/USM)
s2
s2
s2
2. Estimate probability density function
(pdf)
Parzen’s window method using
normal or gaussian kernels with
different variances s2
Normal distribution
3. Calculate entropy of estimated pdf
Rényi continuous entropy of DNA
Simplification
4. Compare with random model
Algebraic simplification
• When using Gaussian kernels, Rényi quadratic
entropy =2 simplifies to:
Exact results
No numerical integration
Vinga & Almeida 2004 J Theor Biol
Results – pdf estimation
-ATCmotif
Over-represented
Suffix
property
=
High
density
pdf estimation ^f
Global Rényi entropy
10
5
0
more
random
-5
rand
m3
m4
m5
m7e
R1
R5
MC0
MC1
Es
H2
-10
-15
-20
less
random
-25
-30
-35
-40
-20
-15
-10
-5
0
lns2
length N=2000
Random sequences - medians
10
+
2
2
H = 2 ln s + ln16p
2
5
0
-5
length N
ln N
1
10
20
50
100
200
500
1000
2
2000
4000
H2
-10
-15
-20
-25
-30
2
2
H = 2ln s + ln16p + ln N
-35
-40
-20
-15
-10
lns2
-5
0
Local sequence information
Statistical significance of motifs
• SMILE program: Structured Motifs Inference and Evaluation
(Sagot et al.) and RISO (Carvalho et al. 2006)
– Exact algorithm to find motifs or models in sequence sets
– Calculates statistical significance of patterns found based on
permutation tests and Markov Chain Models
• Flexible input parameters – Example:
BOX 1 = 6 - 9
BOX 2 = 5 - 7
* * * * * * (* * *)
* * * * * (* * )
SPACER = 15-19
ERROR for each box and total
Application
Inference of conserved motifs
• Organisms: E.coli, B.subtilis, H.pylori
• Several promoter families:
– TTAAGC [19-23] TATAAT
– TTTTAA [10-14] TATAAT
– TTGACA [15-19] TATAAT
High statistical significance
Biological significance?!?
Vanet et al. 2000 JMB
Motifs
• Xu, M. L. and Z. C. Su (2010). "A Novel Alignment-Free
Method for Comparing Transcription Factor Binding Site
Motifs." Plos One 5(1).
• Comin, M. and D. Verzotto (2010). "Classification of
protein sequences by means of irredundant patterns."
BMC Bioinformatics 11.
• Casimiro, A.C., Vinga, S., Freitas, A.T., and Oliveira, A.L.
(2008) An analysis of the positional distribution of DNA
motifs in promoter regions and its biological relevance.
BMC Bioinformatics 9, 89.
41
Distribution of motifs
• Analyze the
position where the
motif occurs
• Test for uniformity
• Motifs that are both
overrepresented
and non-uniformly
distributed might
be biologically
significant
42
S. Cerevisiae
Promoter sequences
TF Aft2p, Dal80p,
Gln3p, Met4p and
Gat1p
Casimiro, A.C., et al. (2008) BMC Bioinformatics 9, 89.
Uniformity tests
• Small samples  choose “best” uniformity test
– Chi-Square, Kolmogorov-Smirnov and bootstrap Chi-Square
– Optimize number of bins
– Specificity and ROC curves
43
Applications of entropy to extract
local information
• Linguistic complexity
S. cerevisiae
H. influenzae
Low entropy=high repeatability
Crochemore & Vérin 1999
Comput Chem
Troyanskaya et al. 2002
Bioinformatics
Coding vs. non-coding comparisons
Entropic Profiles
• Local information plots that indicate overall conservation of
motifs in genomes
• Obtained by unfolding the probability density function
used in Renyi global entropy estimation using CGR
• New tool implementation, based on new data structures
and algorithmic simplifications, allows to process whole
genomes in few minutes.
• http://kdbio.inesc-id.pt/software/ep/
Vinga and Almeida (2007) BMC Bioinformatics, 8:393
Fernandes et al. (2009) BMC Research Notes 2:72.
45
New kernel function applied to CGR
• L resolution
• f smoothing
Almeida and Vinga Algorithms for Molecular Biology 2006 1:18
46
Properties
• CGR properties are
maintained
• Domain is respected
• Estimation of pdf is also
straightforward
Almeida and Vinga Algorithms for Molecular Biology
2006 1:18
47
EP algorithm
• Calculation: simplified to suffix counts
1 L kk
1

4
fc
([
i
k

1
,i
])
k

1
N
ˆ(
f
x
)

L
,
f i
L
k
f

k

0
Entropic profile for the ith symbol
si, coordinate xi
 L is the length resolution
 f is a smoothing parameter
48
Number of motifs
(si-k+1…si) in the
whole sequence
Vinga and Almeida (2007) BMC Bioinformatics
Whole genome case-studies
Statistical
significance
Input
parameters
Entropic
profile
49
For L>6 Chi site
motif emerges
Whole genome case-studies
Maximum at L=8
(motif length)
EPmax>7 std
50
Position study
Escherichia coli genome
 Corresponds to a Chi site
(Crossover Hotspot Instigator)
(5’-GCTGGTGG-3’)
*key region that modulates the exonuclease activity of
RecBCD, an enzyme that is necessary for chromosomal
dsDNA repair and integration of exogenous dsDNA)
51
The detection of relevant and
statistically significant segments can
be accomplished unsupervisedly by
spanning the parameters space to find
local maxima.
Motif Study
Haemophilus influenza genome
USS+ highly overrepresented
EPmax>10
Histogram
 Analysis of the motif
which represents a USS+
(Uptake Signal Sequence)
5’-AAGTGCCGGT-3’)
52
*USSs are involved in natural
competence, which is a
genetically controlled form of
horizontal gene transfer in
some bacterial species
EP conclusions
• Entropic profiles (EP) provide local information about
global features of DNA
• Excellent performance for sequences up to 2Gbp (time
and memory)
• Whole genomes testing corroborate the strengths of this
approach to detect biologically meaningful DNA segments,
related with the detection of local scales and suffix/motifs
over or under-representation
53
Alignment-free algorithms
• Advantages
–
–
–
–
–
–
General approach with rich collection of methods
Robust to shuffling and recombination events
Applicable even when less conservation is present
All genome information can be considered
Computationally less intensive
Symbol order is sometimes neglected
• Disadvantages
– Methods are less developed and integrated
– Limited detailed local information – hard to identify point
mutations, indels
– Less discriminating for querying databases and genome searches
– Symbol order is sometimes neglected
54
Summary and conclusions
• Global and local sequence analysis
– Metrics, CGR, information theory, statistics
Alignment-free techniques
can provide new tools to… “All Models Are Wrong
But Some Are Useful”
Classify
George E.P. Box
Analyze
Integrate
… biological sequence data
Acknowledgments
• Prof. Jonas S. Almeida
• KDBIO Group @ INESC-ID
• Biomathematics Group @
ITQB
• Project DynaMo (PTDC/EEAACR/69530/2006) from FCT
Thank you!
http://kdbio.inesc-id.pt/~svinga