Genomic signatures of an aquatic lifestyle: rate
variation of orthologous genes from arthropods
living in water versus those living on land
Prediction of genes involved in influencing
terrestrial and aquatic lifestyles in arthropods
A Bioinformatics PipeLine
Mortha Sharat Kumar
Sumit Middha
John K. Colbourne
May 17th 2007
Overview
Problem Statement
Background
Data
Methods
Results
Future Work
References
Acknowledgements
Problem Statement
Based on knowledge of:
Organismal lineages of arthropods
Morphology
Habitat diversity
Gene sequence data
Consider arthropods with aquatic and terrestrial lifestyles, using only
comparative genomics techniques and tools to estimate rate variation
among orthologs.
Can we predict the genes that might play a key role in supporting
aquatic or terrestrial lifestyles in arthropods?
Problem Statement
A Bioinformatics Pipeline:
Have a structured methodology with fixed steps: a pipeline for future
projects.
Spend less time and effort working out the correct steps to
follow.
Learn from mistakes.
Spend minimal time tweaking the code.
Spend more time exploring the data and analyses than
writing code for future projects.
Is it a tool?
No.
You won't get all the results at the click of a button; too many
things are involved.
It is a set of programs. Some tweaking is necessary depending on the data and the number of
organisms.
Background
Homologs, Orthologs and Paralogs
Homologs: A gene related to a second gene by descent from a common
ancestral DNA sequence.
Superset of orthologs and paralogs.
Orthologs: Genes in different species that evolved from a
common ancestral gene.
Result of a speciation event.
Normally, orthologs retain the same function in the course of
evolution.
Paralogs: Genes related by duplication within a genome.
Paralogs often evolve new functions, even if these are related to the
original one.
Background
Homologs, Orthologs and Paralogs
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/orthologs3.gif
What we are interested in are orthologs.
Background
Evolutionary Rates: The rate at which genes evolve in a particular lineage.
Measured by the number of amino acid substitutions under an
underlying substitution model.
Molecular Clock Hypothesis: Postulates that the rate of evolution,
measured by amino acid substitutions, is roughly constant over time and
across different lineages.
However, the evolutionary rates of some genes are higher or lower in
certain lineage groups.
What does correlation in evolutionary rates mean?
Selective forces acting on these genes have been similar between the
lineages.
What does this mean?
It may reflect the lifestyles / environment of the organisms.
Species Introduced
Anopheles gambiae
Daphnia magna
Apis mellifera
Tribolium castaneum
Drosophila melanogaster
Daphnia pulex
Caenorhabditis elegans
Fruit Fly [Drosophila melanogaster]
Lifestyle - Terrestrial
Mosquito [Anopheles gambiae]
Lifestyle - Aquatic / Terrestrial
Beetle [Tribolium castaneum]
Lifestyle - Terrestrial
Honey Bee [Apis mellifera]
Lifestyle - Terrestrial
Water Flea [Daphnia magna]
Lifestyle - Aquatic
Water Flea [Daphnia pulex]
Lifestyle - Aquatic
Nematode worm [Caenorhabditis elegans]
Lifestyle - Terrestrial
Phylogeny
[Figure: Phylogeny of the Species]
Aquatic Genes
Ox'
Ox''
Ox'''
The ortholog cluster Ox has similar substitution rates, hence similar
evolutionary rates. Are similar selective forces acting on them?
More closely related to other species.
Could they play a role in supporting an aquatic lifestyle?
Terrestrial Genes
Oy'
Oy''
Oy'''
What about ortholog cluster Oy?
Could they play a role in supporting a terrestrial lifestyle?
Data
About the data: varied sources.
The number of sequences varies for each organism.
Sequences range from annotated amino acid sequences to EST contigs.
Lengths of the sequences differ greatly.
Data - Sequence Lengths in base pairs
Methods
Detect Orthologs
All-against-All Criteria
Alignments
Cleaning of Alignments
Evolutionary rate tests
Analysis
Methods - The PipeLine
Detect Orthologs: RBBH
All-against-All Criteria: Scripts
Alignments: TCoffee
Cleaning of Alignments: Scripts
Evolutionary rate tests: RRTree
Results
Analysis
Methods - RBBH
RBBH [Reciprocal Best BLAST Hits] - What is it?
Proteins from different organisms that are each other's top BLAST hit.
Ax <-> By
Gene x from A and gene y from B are orthologs.
What if we have:
Ax, By, Cz
Can x, y and z be considered an ortholog cluster?
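The RBBH definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual scripts: the function names and the toy bit scores are hypothetical, and the input is assumed to be one dictionary of BLAST scores per search direction.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict {(query, subject): bit_score} for ONE search direction,
    e.g. all proteins of organism A searched against organism B.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_gene, b_gene) pairs that are each other's top hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Toy scores (hypothetical): x and y pick each other; w prefers y,
# but y prefers x, so (w, y) is not a reciprocal best hit.
a_vs_b = {("x", "y"): 250.0, ("x", "v"): 90.0, ("w", "y"): 120.0}
b_vs_a = {("y", "x"): 245.0, ("y", "w"): 118.0, ("v", "w"): 60.0}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Only the pair (x, y) survives, because a one-directional best hit such as w to y is discarded.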
Methods - All-Against-All Criteria
One protein sequence from each organism is accepted into an
ortholog cluster if each protein has an RBBH from every other
organism.
All-against-All is very stringent: very high confidence in the
inferred orthology.
For 5 organisms, we have:
[Figure: organisms A, B, C, D and E]
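A minimal sketch of the All-against-All check, assuming the RBBH pairs have already been computed. The helper names (`rbbh`, `meets_all_against_all`) and the toy genes are hypothetical; the pipeline's own scripts are not shown in the slides.

```python
from itertools import combinations


def rbbh(org1, gene1, org2, gene2):
    """Represent one reciprocal best hit as an unordered pair of (organism, gene)."""
    return frozenset(((org1, gene1), (org2, gene2)))


def meets_all_against_all(cluster, rbbh_pairs):
    """True if every pair of cluster members is a reciprocal best hit.

    cluster: dict {organism: gene}, one gene per organism.
    rbbh_pairs: set of frozensets as built by rbbh().
    """
    members = list(cluster.items())
    return all(
        frozenset((m1, m2)) in rbbh_pairs
        for m1, m2 in combinations(members, 2)
    )


cluster = {"A": "x", "B": "y", "C": "z"}

# x-y and y-z are RBBHs, but x-z is missing: the cluster is rejected.
pairs = {rbbh("A", "x", "B", "y"), rbbh("B", "y", "C", "z")}
incomplete = meets_all_against_all(cluster, pairs)

# Once x-z is also an RBBH, every organism pair is covered.
pairs.add(rbbh("A", "x", "C", "z"))
complete = meets_all_against_all(cluster, pairs)
print(incomplete, complete)
```

Under this criterion x, y and z form a cluster only when all three pairwise RBBHs exist, which is why the test is stringent.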
Methods - All-Against-All Criteria
No. of Organisms    No. of BLASTs
2                   2
3                   6
4                   12
5                   20
6                   30
7                   42
After checking the All-Against-All Criteria we are left with high
confidence ortholog clusters.
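The BLAST counts listed above match n x (n - 1), that is, one search per ordered pair of organisms. The sketch below simply reproduces those numbers; the function name is hypothetical.

```python
def directed_blast_runs(n_organisms):
    """Number of organism-vs-organism BLAST searches when every organism
    is searched against every other one, in both directions."""
    return n_organisms * (n_organisms - 1)


counts = {n: directed_blast_runs(n) for n in range(2, 8)}
for n, runs in counts.items():
    print(n, runs)
```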
Methods - Alignments and Cleaning
Alignments were carried out using T-Coffee.
The leading and trailing gaps:
Do not correspond to indels.
Do not have information associated with them.
What if the leading and trailing gaps are not clipped?
Inaccurate substitution rates result.
The leading and trailing gaps have to be clipped:
Clipping proceeds from the start and the end of an alignment and stops when a
highly conserved block is encountered.
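One way to sketch the clipping step is shown below. The slides do not define a "highly conserved block", so this example assumes a simple stand-in: a run of `block` consecutive columns with no gap in any sequence. The function name, the `block` parameter and the toy alignment are all hypothetical.

```python
def trim_alignment(aligned, block=3):
    """Clip columns from both ends of a gapped alignment.

    aligned: list of equal-length strings using '-' for gaps.
    Clipping stops at the first run of `block` consecutive gap-free
    columns (an assumed stand-in for a highly conserved block).
    Internal gaps are left untouched.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    gap_free = [all(seq[i] != "-" for seq in aligned) for i in range(length)]

    def first_block(flags):
        """Index where the first run of `block` True values starts, or None."""
        run = 0
        for i, ok in enumerate(flags):
            run = run + 1 if ok else 0
            if run == block:
                return i - block + 1
        return None

    start = first_block(gap_free)
    if start is None:  # no gap-free block anywhere: nothing reliable to keep
        return ["" for _ in aligned]
    end = length - first_block(gap_free[::-1])
    return [seq[start:end] for seq in aligned]


aligned = [
    "--MKTAYIAKQR--",
    "MSMKTA-IAKQRLL",
    "---KTAYIAKQRL-",
]
trimmed = trim_alignment(aligned)
for seq in trimmed:
    print(seq)
```

The ragged ends are removed while the internal gap in the second sequence, which may be a real indel, is kept.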
Methods - Alignments and Cleaning
[Figure: Black - before trimming, Red - after trimming]
Relative Rate Tests - RRTree
What exactly is a Relative Rate Test?
It compares the rate of amino acid/nucleotide substitution between
lineages with respect to an outgroup.
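The idea behind a relative rate test can be illustrated with three sequences and uncorrected distances. This is only a sketch of the principle, not RRTree's implementation (RRTree compares groups of sequences on a tree and reports a significance test); the function names and toy sequences are hypothetical.

```python
def p_distance(seq1, seq2):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    compared = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in compared) / len(compared)


def relative_rates(seq_a, seq_b, outgroup):
    """Split the A-B distance into the two branches leading to A and B.

    With an outgroup O, the branch lengths since the A/B split are
    (d_AB + d_AO - d_BO) / 2 for A and (d_AB + d_BO - d_AO) / 2 for B.
    Equal rates predict d_AO - d_BO close to zero.
    """
    d_ab = p_distance(seq_a, seq_b)
    d_ao = p_distance(seq_a, outgroup)
    d_bo = p_distance(seq_b, outgroup)
    branch_a = (d_ab + d_ao - d_bo) / 2
    branch_b = (d_ab + d_bo - d_ao) / 2
    return branch_a, branch_b, d_ao - d_bo


outgroup = "MKTAYIAKQR"
lineage_a = "MKTAYIAKQK"   # 1 difference from the outgroup
lineage_b = "MRTSYIAEQR"   # 3 differences from the outgroup
branch_a, branch_b, difference = relative_rates(lineage_a, lineage_b, outgroup)
print(branch_a, branch_b, difference)
```

In this toy case lineage B has accumulated more change than lineage A since they diverged, so the difference is negative.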
Relative Rate Tests - Models
Kimura 2-Parameter
Jukes-Cantor
Uncorrected Distance
Substitution Matrix
Base Frequencies?
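For reference, the standard nucleotide forms of the distance corrections named above can be written as follows. This is a sketch of the textbook formulas, not of RRTree's code, and the function names are hypothetical. The uncorrected distance is simply p, the proportion of differing sites.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion p of differing nucleotide sites:
    d = -(3/4) * ln(1 - 4p/3)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(transitions, transversions):
    """Kimura 2-parameter distance from the proportions of sites showing
    transitions (P) and transversions (Q):
    d = -(1/2) * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    a = 1 - 2 * transitions - transversions
    b = 1 - 2 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a * math.sqrt(b))


p = 0.1                                   # uncorrected distance
d_jc = jukes_cantor(p)
d_k2p = kimura_2p(p / 3, 2 * p / 3)       # transitions are 1/3 of changes under JC
print(p, d_jc, d_k2p)
```

When transitions make up one third of the observed differences, the two corrections coincide; they diverge when transitions are over-represented, which is the case the two-parameter model is meant to capture.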
Results
Pairwise Ortholog Distribution between species :
Ortholog Detection Tools
Each has its own underlying algorithm.
COGs - Clusters of Orthologous Groups
OrthoMCL
InParanoid
KOG - euKaryotic Orthologous Groups
The paper by Tim Hulsen, Martijn A Huynen, Jacob de Vlieg and Peter MA Groenen,
“Benchmarking ortholog identification methods using functional
genomics data”,
rated InParanoid as the best ortholog detection tool.
InParanoid is also one of the most widely used tools.
Why not just use a published tool like InParanoid for ortholog
detection?
In the benchmarking paper, InParanoid gave the largest number of
false positives.
False positives here are paralogs.
Paralogs are undesirable in our study; we are interested in genes with
the same function.
RBBH gave the fewest false positives.
How did our RBBH method fare when compared to InParanoid?
Ortholog clusters present in all:
Drosophila melanogaster
Anopheles gambiae
Tribolium castaneum
Apis mellifera
Daphnia pulex
932 - 380 = 552
~59% met All-Against-All
Daphnia magna
Caenorhabditis elegans
The 5 species with at least Daphnia magna or Daphnia pulex
69
1052 - 360 = 692
~65% met All-Against-All
Total genes to work with = 1244
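The arithmetic above can be checked with a short script. The interpretation used here is an assumption, not something the slides state: 932 and 1052 are read as candidate clusters, and 380 and 360 as the clusters that failed the All-Against-All criteria. The function name is hypothetical.

```python
def passed_fraction(candidates, rejected):
    """Return (number passing, fraction passing) for one set of clusters."""
    passed = candidates - rejected
    return passed, passed / candidates


# Assumed reading of the slide: candidates minus rejected = clusters kept.
first_passed, first_fraction = passed_fraction(932, 380)
second_passed, second_fraction = passed_fraction(1052, 360)
total_genes = first_passed + second_passed

print(first_passed, f"{first_fraction:.1%}")
print(second_passed, f"{second_fraction:.1%}")
print(total_genes)
```

Under this reading the two retained sets add up to the 1244 genes quoted as the working total.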
When considering all seven species, ~6% of the genes
had highly similar evolutionary rates in Anopheles gambiae and
Daphnia (both Daphnia pulex and Daphnia magna).
Aquatic lifestyle?
Now What? - We have Gene IDs
See if the genes belong to some gene families?
Statistical Tests.
GO !
What is Gene Ontology?
The Gene Ontology project provides a controlled vocabulary to
describe gene and gene product attributes in any organism.
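The slides do not say which statistical test is applied to the GO annotations. One common choice, shown here purely as an assumption, is a hypergeometric over-representation test: given a list of candidate genes, how surprising is the number that carry a particular GO term? The function name and all counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the GO term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 genes overall, 40 carry the term,
# 60 genes in the candidate list, 8 of which carry the term.
p_value = hypergeom_enrichment_p(population=1000, annotated=40, sample=60, hits=8)
print(p_value)
```

A small p-value would suggest that the term is over-represented among the candidate genes; with many GO terms tested, a multiple-testing correction would also be needed.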
Future Work/Project
Prediction of genes involved in influencing social behavior in
insects.
Use the same methodology: the pipeline.
The approach would be exactly the same. Instead of arthropod
species with aquatic and terrestrial lifestyles, the study will have insect
species with known social and non-social behavioral traits.
References
Zdobnov EM, von Mering C, et al. - Comparative genome and protein analysis of
Anopheles gambiae and Drosophila melanogaster.
Dirk Steinke, Walter Salzburger, Ingo Braasch and Axel Meyer - Many genes in fish have
species-specific asymmetric rates of molecular evolution.
J. W. Kijas, M. Menzies and A. Ingham - Sequence diversity and rates of molecular
evolution between sheep and cattle genes.
“Phylogenetic Inference”, Swofford, Olsen, Waddell, and Hillis, in Molecular Systematics,
2nd ed., Sinauer Ass., Inc., 1996, Ch. 11.
F. Tajima and M. Nei, Mol. Biol. Evol. 1984, 1, 269.
M. Kimura, J. Mol. Evol. 1980, 16, 111.
K. Tamura, Mol. Biol. Evol. 1992, 9, 678.
L. Jin and M. Nei, Mol. Biol. Evol. 1990, 7, 82.
M. Kimura, The Neutral Theory of Molecular Evolution, Camb. Uni. Press, Camb., 1983.
Insights into social insects from the genome of the honeybee Apis mellifera,
Nature 443, 931-949 (26 October 2006).
Alexandre Hassanin (2006). Phylogeny of Arthropoda inferred from
mitochondrial sequences: Strategies for limiting the misleading effects of
multiple changes in pattern and rates of substitution. Molecular Phylogenetics
and Evolution 38: 100-116.
Joel Savard, Diethard Tautz and Martin J Lercher, Genome-wide acceleration
of protein evolution in flies (Diptera), BMC Evolutionary Biology 2006.
Cedric Notredame, Desmond Higgins and Jaap Heringa, T-Coffee: A Novel
Method for Fast and Accurate Multiple Sequence Alignment, JMB 2000.
Robinson-Rechavi M, Huchon D, RRTree: Relative-rate tests between groups
of sequences on a phylogenetic tree, Bioinformatics 2000, 16, 296-297.
Tim Hulsen, Martijn A Huynen, Jacob de Vlieg and Peter MA Groenen,
Benchmarking ortholog identification methods using functional genomics data,
Genome Biology 2006.
Jukes TH, Cantor CR (1969) Evolution of protein molecules. In Munro HN
(Ed.) Mammalian protein metabolism. Academic Press, New York
13:2178-2189.
Acknowledgements
This would not have been possible without the aid, support, guidance
and patience of:
John K. Colbourne
Sumit Middha
The CGB Staff - The Bioinformatics Group & The Genomics Group
Thanks to Memo Dalkilic and Haixu Tang for their valuable
feedback on the project.
Computing Facilities - CGB
Special Thanks to my family and friends and Professor Edward L
Robertson.
Thank You