Phylogenetic Reconstruction
based on RNA Secondary
Structural Alignment
Benny Chor, Tel-Aviv Univ.
Joint work with Moran Cabili, Assaf
Meirovich, and Metsada Pasmanik-Chor
Phylogenetic Trees Based on What ?
• Morphology
(1800 - )
• Single gene sequence (DNA or AA)
(1960 - )
• Whole genomes
(2002 - )
More Sources to Base Phylogeny On?
A Proposed, Metric Induced Approach
1. Find a reliable metric between pairs of objects.
2. Design / choose / modify a good algorithm for
determining metric (pairwise distances).
3. Compute distance matrix.
4. Construct a Neighbor Joining tree from the
distance matrix.
5. As a sanity check, compare resulting tree to
“standard & accepted” ones.
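Step 4 can be sketched in a few lines of Python. This is a minimal illustration of the Neighbor Joining topology step only (branch lengths are omitted), not the implementation used in the talk; the function name `neighbor_joining` and the toy four-taxon matrix are assumptions made for the example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the NJ topology as nested tuples (branch lengths omitted).

    names: list of taxon labels; dist: symmetric matrix with zero diagonal.
    """
    nodes = list(names)
    # Store distances in a dict so that merged nodes can be added freely.
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        total = {a: sum(d[a, b] for b in nodes) for a in nodes}
        # Join the pair that minimises Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p] - total[p[0]] - total[p[1]])
        new = (a, b)
        rest = [c for c in nodes if c != a and c != b]
        for c in rest:
            # Distance from the new internal node to every remaining node.
            d[new, c] = d[c, new] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        d[new, new] = 0.0
        nodes = rest + [new]
    return tuple(nodes)

# Toy additive distances (hypothetical): A and B are neighbours, as are C and D.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
tree = neighbor_joining(toy_names, toy_dist)
print(tree)
```

Nested tuples keep the sketch dependency-free; a real pipeline would more likely write Newick output with branch lengths.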
Metric Induced Approach
Already applied, fairly successfully, e.g. to
construct phylogenies based on whole
genomes/proteomes (Burstein et al., 2005)
and on metabolic networks
(Tuller et al., 2006).
Of course, distances that are
appropriate to each domain must
be applied (or specially designed).
Our Question
Can phylogenetic reconstruction be
based on RNA secondary structures ?
Answer: Yes, And Even Quite Well
Our tree, based on
secondary structs.
of 16s rRNA
from 91 species
Archaea
Eukarya
Bacteria
Metric Induced Approach: Specifics
1. Find an efficient (similarity based) algorithm
for pairwise alignment of RNA secondary
structures.
2. Transform similarity to distance.
3. Use RNA databases to get the RNA molecules
and structures. Apply the algorithm to compute
the distance for each pair of molecules.
4. Run NJ to produce trees.
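Step 3 amounts to filling a symmetric matrix by calling a pairwise distance function once per unordered pair. The sketch below assumes a hypothetical `pair_distance` callable; `toy_distance` is only a stand-in so the example runs, and is not RSmatch or the distance used in the talk.

```python
from itertools import combinations

def distance_matrix(items, pair_distance):
    """Build a symmetric matrix from a pairwise distance function."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Compute each unordered pair once and mirror it.
        matrix[i][j] = matrix[j][i] = pair_distance(items[i], items[j])
    return matrix

def toy_distance(s, t):
    # Stand-in only: position-wise mismatches plus the length difference.
    return sum(a != b for a, b in zip(s, t)) + abs(len(s) - len(t))

# Hypothetical dot-bracket strings, used purely as input data.
structures = ["((..))", "(....)", "......"]
matrix = distance_matrix(structures, toy_distance)
for row in matrix:
    print(row)
```

With k molecules this makes k(k-1)/2 calls, so the cost of the pairwise algorithm dominates the running time.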
The Alignment Algorithm Chosen
- We chose RSmatch: a sophisticated dynamic
  programming algorithm, based on the “dot bracket”
  representation of the secondary structure.
  J. Liu, J.T. Wang, J. Hu, B. Tian. BMC Bioinformatics 2005, 6:89.
- RSmatch decomposes the dots and brackets into components,
  and then compares components according to their
  order in the secondary structure.
- RSmatch employs both sequences and structures.
- Complexity: O(nm), where n and m are the lengths of the two RNA
  molecules being compared.
TAATTATCGGAAGCAGTGCCTTCCATAATTA
(((((((.(((((......))))))))))))
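To make the dot bracket notation concrete, the sketch below reads the example molecule above and recovers which positions are paired. It uses the standard stack-based reading of the notation; the helper name `pair_table` is an assumption for this example and is not part of RSmatch.

```python
def pair_table(structure):
    """Map every paired position to its partner (0-based indices).

    Unpaired positions (dots) do not appear in the result.
    """
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)          # remember the opening position
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()          # the most recent unmatched '('
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs

sequence = "TAATTATCGGAAGCAGTGCCTTCCATAATTA"
structure = "(((((((.(((((......))))))))))))"
pairs = pair_table(structure)

# List each base pair once, outermost first.
paired_bases = [sequence[i] + sequence[j]
                for i, j in sorted(pairs.items()) if i < j]
unpaired = [i for i in range(len(structure)) if i not in pairs]
print(paired_bases)
print(unpaired)
```

The single dot at position 7 is a one-base bulge, and the run of six dots is the hairpin loop closed by the innermost pair.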
From Similarity to Distance
In transforming the scoring matrix from similarity to
distance, we tried to preserve the ratios between
mismatch values, while ensuring that lower similarity
implies higher distance.
Distance metric requirements:
Symmetry, Δ inequality, non-negativity, self-distance = 0
Actual Distance Matrices: Higher
Mismatch Penalties at “Dots”
Base pairs (brackets):
      AU    CG    GC    GU    UA    UG
AU    0     1     1     0.5   0.5   0.5
CG    1     0     0.5   1     1     1
GC    1     0.5   0     1     1     1
GU    0.5   1     1     0     0.5   0.5
UA    0.5   1     1     0.5   0     0.5
UG    0.5   1     1     0.5   0.5   0

Unpaired bases (dots):
      A     C     G     U
A     0     2     2     2
C     2     0     2     2
G     2     2     0     2
U     2     2     2     0
- Gap cost: 3 per nucleotide involved.
- Δ inequality: mismatch < 2 * gap cost
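The metric requirements can be checked mechanically for the two matrices and the gap cost given above. The sketch below is only a verification aid under the assumption that the values are exactly those on the slide; the names `is_metric` and `gap_compatible` are introduced here for illustration.

```python
from itertools import product

PAIRS = ["AU", "CG", "GC", "GU", "UA", "UG"]
PAIR_COST = [
    [0,   1,   1,   0.5, 0.5, 0.5],
    [1,   0,   0.5, 1,   1,   1],
    [1,   0.5, 0,   1,   1,   1],
    [0.5, 1,   1,   0,   0.5, 0.5],
    [0.5, 1,   1,   0.5, 0,   0.5],
    [0.5, 1,   1,   0.5, 0.5, 0],
]
BASES = "ACGU"
# Every mismatch between unpaired bases costs 2.
BASE_COST = [[0 if a == b else 2 for b in BASES] for a in BASES]
GAP_COST = 3

def is_metric(m):
    """Check zero self-distance, non-negativity, symmetry, triangle inequality."""
    idx = range(len(m))
    return (all(m[i][i] == 0 for i in idx)
            and all(m[i][j] >= 0 and m[i][j] == m[j][i]
                    for i, j in product(idx, idx))
            and all(m[i][k] <= m[i][j] + m[j][k]
                    for i, j, k in product(idx, idx, idx)))

def gap_compatible(m, gap):
    # A substitution must be cheaper than a deletion followed by an insertion.
    return max(max(row) for row in m) < 2 * gap

for name, m in [("pairs", PAIR_COST), ("bases", BASE_COST)]:
    print(name, is_metric(m), gap_compatible(m, GAP_COST))
```

The gap condition matters because an alignment can always replace a mismatch column by two gap columns; if that were cheaper, the triangle inequality could fail for whole structures.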
DBs of Reliable Secondary Struc.
DBs constructed with manual intervention
• RNaseP DB:
http://www.mbio.ncsu.edu/RNaseP/
Sequence length: ~300-400 nucleotides
RNaseP function:
Cleaves off an extra, or precursor,
sequence of RNA on tRNA molecules.
• 16S rRNA:
Comparative RNA Web Site: http://www.rna.icmb.utexas.edu/
Sequence length: ~1,500 nucleotides
16S function:
Component of the small ribosomal subunit: takes part in mRNA
and tRNA binding and in decoding during translation.
Our results …ahhm… trees
RNaseP Tree, 51 Species
Secondary structure based tree
• Good partition into 3
kingdoms.
• Bacteria
(characterized by
Bxy) also look good.
RNaseP 51 Species
Sequence based tree
Eukarya
Bacteria
Archaea
Eukaryotes are not
monophyletic (yeast
external).
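Whether a group such as the eukaryotes is monophyletic can be checked programmatically once a tree is available. The sketch below treats a tree as nested tuples of leaf names and as unrooted, so a group counts as monophyletic when it, or its complement, is exactly the leaf set of some subtree. The two toy trees and all species labels are hypothetical and are not the trees from the talk.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Yield the leaf set of every subtree, including the whole tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)

def is_monophyletic(tree, group):
    group = frozenset(group)
    complement = leaves(tree) - group
    # In an unrooted tree a group and its complement define the same split.
    return any(c == group or c == complement for c in clades(tree))

# Hypothetical toy trees: in the second one "yeast" falls outside the others.
good = ((("human", "mouse", "yeast"), ("arc1", "arc2")), ("bac1", "bac2"))
bad = ((("human", "mouse"), ("arc1", "arc2")), ("yeast", ("bac1", "bac2")))
eukaryotes = {"human", "mouse", "yeast"}

print(is_monophyletic(good, eukaryotes))
print(is_monophyletic(bad, eukaryotes))
```

The same check can serve as the sanity test of the metric induced approach: list the accepted groups and count how many of them the reconstructed tree recovers.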
16s rRNA – 20 Species
Secondary structure based tree
Fungi
Mammalia
Bacillariophyta
Amphibia
Viridiplantae
16s rRNA – 91 Species
Secondary structure based tree
Archaea
Eukarya
Bacteria
Collins et al., 2000
After completing this project, we discovered a
related, earlier work from David Penny’s group. When
determining evolutionary relationships between some
catalytic RNA molecules, they constructed a 16S
rRNA tree based on a similar “distance approach”.
We compared our results to
the trees published in their article,
which were built with a different distance algorithm
(RNAdistance, by Shapiro & Zhang).
Collins et al., 2000: 16 Species
Collins’ 16s rRNA sequence based tree
Archaea
Bacteria
Collins’ 16s rRNA secondary struct based tree
Archaea
Bacteria
Our Tree, 13 Out of 16 Collins’ Species
Secondary structure based tree
Archaea
Bacteria
A Close Look at the Trees
Collins’ 16s rRNA seq based tree
outgroups
Our 16s second. struct. tree
Collins’ 16s second. struct. based tree
A Close Look at Sec. Strucs. Supports
a “Thermoplasma Outgroup” Theory
Methanobacterium
Methanococcus
Thermoplasma
Conclusions
1. Encouraging results
2. Accuracy of structure based trees is
comparable to sequence based trees.
3. Warning: Reliable secondary structures
are crucial for accurate tree
reconstruction.