Comparative Genomics of Microbes


Comparative genomics:
Overview & Tools +
MUMmer algorithm
Urmila Kulkarni-Kale
Bioinformatics Centre
University of Pune, Pune 411 007.
[email protected]
Genome sequence: Fact file
• 1995: The first complete genome sequence of Haemophilus influenzae Rd was published
• Biological systems are dynamic and evolving
• The fourth dimension: Time
• Genome sequence is a snapshot of evolution
• Correlation between phenotypic properties and genomic regions is not straightforward, as phenotypic properties are the result of many-to-many interactions
Jan 21, 2010
© UKK, Bioinformatics Centre,
University of Pune.
Genomes: the current status (GOLD database)
• Published complete genomes: 403
» Archaeal: 81
» Bacterial: 1226
» Eukaryal: 169
• Ongoing:
» Archaeal: 107
» Prokaryotic: 3478
» Eukaryotic: 1209
• Metagenomics: 203
• Viral: >4500
As of Jan 21, 2010
Genome databases
• Genomes at NCBI, EBI, TIGR
H. influenzae Complete Genome
Function information clock of E. coli
Generated on March 2K4
Comparison of the coding regions
• Begins with the gene identification algorithm: infer what portions of the genomic sequence actively code for genes.
• There are four basic approaches.
Knowledge of Full Genome sequence:
Solutions or new questions…?
Correct # of genes…?
• Still struggling with the gene counters…
Genome analyses
• Variation in
– Genome size
  E. coli: 4.6 Mbp; M. pneumoniae: 0.81 Mbp; B. subtilis: 4.20 Mbp
– GC content
  B. burgdorferi: 29%; M. tuberculosis: 68%
– Codon usage
– Amino acid composition
  G, A, P, R: GC rich; I, F, Y, M, D: AT rich
– Genome organisation
  • Single circular chromosomes
  • Linear chromosome + extra chromosomal elements
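The GC content figures quoted above can be computed directly from a genome sequence. A minimal sketch, assuming the sequence is already loaded as a plain string; the function name gc_content is illustrative and not part of any tool mentioned in these slides.

```python
def gc_content(seq):
    """Return the GC percentage of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    acgt = sum(1 for base in seq if base in "ACGT")
    # Avoid division by zero for empty or fully ambiguous input.
    return 100.0 * gc / acgt if acgt else 0.0


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genome data.
    print(gc_content("ATGCGCGATATTNN"))
```

Ambiguous bases such as N are left out of the denominator so that unsequenced stretches do not pull the percentage down.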
CG: Comparisons between genomes
• The strains of the same species
• The closely related species
• The distantly related species
– List of Orthologs
– Evolution of individual genes
– Evolution of organisms
CG helps to ask some interesting questions
• Identification of similarities/differences between genomes may allow us to understand:
– How did 2 organisms evolve?
– Why do certain bacteria cause diseases while others do not?
– Identification and prioritization of drug targets
CG: Unit of comparison
• Unit of comparison: Gene/Genome
– Number
– Content (sequence)
– Location (map position)
– Gene Order
– Gene Cluster (genes that are part of a known metabolic pathway are found to exist as a group)
– Colinearity of gene order is referred to as synteny
– A conserved group of genes in the same order in two genomes is referred to as a syntenic group or syntenic cluster
– Translocation: movement of a genomic part from one position to another
Structure of tryptophan operon (Dandekar et al., 1998)
• Numbers: Gene number
• Arrows: Direction of transcription
• //: Dispersion of operon by 50 genes
• trpB and trpA: genetically linked separate genes
• Domain fusion: trpD and trpG; trpF and trpC
Important observations with regard to Gene Order
• Order is highly conserved in closely related species but gets changed by rearrangements
• With more evolutionary distance, there is no correspondence between the gene order of orthologous genes
• Groups of genes having similar biochemical function tend to remain localized
– Genes required for synthesis of tryptophan (trp genes) in E. coli and other prokaryotes
Synteny
• Refers to regions of two genomes that show
considerable similarity in terms of
– sequence and
– conservation of the order of genes
• likely to be related by common descent.
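Conservation of gene order can be illustrated with a small sketch that reports runs of genes appearing consecutively and in the same order in two genomes. It assumes orthologs carry the same label in both gene lists and that each label occurs once per genome; the function name syntenic_blocks and the gene labels are hypothetical, and reversed blocks are not detected.

```python
def syntenic_blocks(order_a, order_b, min_genes=2):
    """Return runs of genes that are adjacent and in the same order in both lists."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            gene in pos_b
            and current
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
        else:
            # Close the running block and start a new one if the gene is shared.
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if gene in pos_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    # Hypothetical gene orders, for illustration only.
    genome_a = ["g1", "g2", "g3", "g4", "g5", "a_only", "g6"]
    genome_b = ["g6", "g1", "g2", "g3", "b_only", "g4", "g5"]
    print(syntenic_blocks(genome_a, genome_b))
```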
COGs:
Phylogenetic classification of proteins
encoded in complete genomes
Genome analyses@NCBI
Pairwise genome comparison of protein
homologs (symmetrical best hits)
http://www.ncbi.nlm.nih.gov/sutils/geneplot.cgi
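Symmetrical best hits can be sketched as follows: a protein pair is kept only when each protein is the other's top-scoring hit. The input format (nested dictionaries of similarity scores) and the function name are assumptions made for illustration; this is not the NCBI implementation.

```python
def symmetrical_best_hits(scores_ab, scores_ba):
    """Return (a, b) pairs where b is a's best hit and a is b's best hit.

    scores_ab maps each protein of genome A to {protein of B: score};
    scores_ba is the same in the opposite direction.
    """
    best_ab = {q: max(hits, key=hits.get) for q, hits in scores_ab.items() if hits}
    best_ba = {q: max(hits, key=hits.get) for q, hits in scores_ba.items() if hits}
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    ab = {"a1": {"b1": 200, "b2": 50}, "a2": {"b1": 150, "b2": 40}}
    ba = {"b1": {"a1": 210, "a2": 140}, "b2": {"a2": 45, "a1": 30}}
    print(symmetrical_best_hits(ab, ba))
```

In the toy data a2's best hit is b1, but b1's best hit is a1, so only the pair (a1, b1) is symmetrical.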
Integr8: CG site at EBI
http://www.ebi.ac.uk/integr8
Comparative Genomics Tools
• BLAST2
• MUMmer
• PipMaker
• AVID/VISTA
• Comparisons and analyses at both
– Nucleic acid and protein level
BLAST2
• Available at NCBI
• Input: GI or FASTA sequence (range can be specified)
• Output:
– Graphical
– Alignment of 2 genomes
Genome Alignment Algorithm:
MUMmer
• Developed by
– Dr. Steven Salzberg’s group at TIGR
– NAR (1999) 27:2369-2376
– NAR (2002) 30:2478-2483
• Availability
– Free
– TIGR site
Features of MUMmer
• The algorithm assumes that sequences are closely related
• Can quickly compare millions of bases
• Outputs:
– Base to base alignment
– Highlights the exact matches and differences in the genomes
– Locates
  • SNPs
  • Large inserts
  • Significant repeats
  • Tandem repeats and reversals
Definitions are drawn from biology
• SNP: Single mutation surrounded by two matching regions
• Mutation regions: regions of DNA where the 2 sequences have diverged by more than one SNP
• Large inserts: regions inserted into one of the genomes
– Sequence reversals, lateral gene transfer
• Repeats: the form of duplication that has occurred in either genome.
• Tandem repeats: regions of repeated DNA in immediate succession but with different copy number in different genomes.
– A repeat can occur 2.5 times
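Once two genomes are anchored by exact matches, the unmatched stretch between two consecutive anchors can be given a rough label from its length alone. The sketch below is a simplification: the function name classify_gap and the length rules are illustrative assumptions, not MUMmer's actual criteria, and separating repeats or tandem repeats from plain inserts needs further checks that are not shown.

```python
def classify_gap(gap_a, gap_b):
    """Label the unmatched region between two consecutive exact matches.

    gap_a and gap_b are the bases lying between the matches in each genome.
    """
    if not gap_a and not gap_b:
        return "no difference"
    if len(gap_a) == 1 and len(gap_b) == 1:
        # One differing base flanked by matching regions on both sides.
        return "SNP"
    if not gap_a or not gap_b:
        # Extra bases in one genome only (could also be a repeat copy).
        return "insert"
    return "mutation region"


if __name__ == "__main__":
    print(classify_gap("A", "G"))
    print(classify_gap("", "ACGTACGT"))
    print(classify_gap("ATT", "C"))
```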
Techniques used in the MUMmer Algorithm
• Compute suffix trees for every genome
• Longest Increasing Subsequence (LIS)
• Alignment using Smith & Waterman algorithm
Integration of these techniques for genome alignment
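Of the three techniques, the Smith & Waterman step is the easiest to show in a few lines. A minimal sketch of the local alignment score with a linear gap penalty; the scoring values (match=2, mismatch=-1, gap=-2) and the function name are assumptions chosen for illustration, not the parameters used by MUMmer.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

Only two rows of the table are kept because the sketch returns the score, not the alignment itself; recovering the alignment would need the full table and a traceback.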
MUMmer: Steps in the alignment process
• Read two genomes
• Perform Maximal Unique Match (MUM) of genomes
• Sort and order the MUMs using LIS
• Close the gaps in the alignment, using SNPs, mutation regions, repeats, tandem repeats
• Output alignment
– MUMs
– regions that do not match exactly
MUMmer steps
• Locating MUMs
• Sorting MUMs
• Closure with gaps
G1: ACTGATTACGTGAACTGGATCCA
G2: ACTCTAGGTGAAGTGATCCA
Genome1: ACTGATTACGTGAACTGGATCCA
Genome2: ACTCTAGGTGAAGTGATCCA
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGT-GATCCA
What is a MUM?
• MUM is a subsequence that occurs exactly once in each genome and is NOT part of any longer such sequence
• Two characters that bound a MUM are always mismatches
GenA: tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta
GenB: gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag
• Principle: if a long matching sequence occurs exactly once in each genome, it is almost certain to be part of the global alignment
Similar to BLAST & FASTA!!
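The MUM definition can be checked on the GenA/GenB example with a brute-force sketch. MUMmer finds MUMs with suffix trees; the quadratic scan below is a swapped-in technique that only suits short sequences, and the names find_mums and count_occurrences are illustrative.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=20):
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1  # extend to the right until a mismatch or sequence end
            if k < min_len:
                continue
            match = a[i:i + k]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, k))
    return mums


# Sequences copied from the slide; case is folded before comparison.
GEN_A = "tcgatcGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAcgactta".upper()
GEN_B = "gcattaGACGATCGCCGCCGTAGATCGAATAACGAGAGAGCATAAtccagag".upper()

if __name__ == "__main__":
    for pos_a, pos_b, length in find_mums(GEN_A, GEN_B):
        print(pos_a, pos_b, GEN_A[pos_a:pos_a + length])
```

The min_len threshold plays the role of the "long matching sequence" in the principle above: short unique matches are more likely to be chance hits.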
Sorting & ordering MUMs
• MUMs are sorted according to their position in Genome A
• The order of matching MUMs in Genome B is considered
– MUM3: Random match / Inexact repeat
– MUM5: transposition
• LIS algorithm to locate longest set of MUMs which occur in ascending order in both genomes
Leads to Global MUM-alignment
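The sorting and LIS step can be sketched as below. This is the plain longest increasing subsequence; MUMmer uses a variation of it, which is not reproduced here. Each MUM is reduced to a (position in A, position in B) pair, and the positions in the example are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_chain(mums):
    """Return the longest set of MUMs whose positions ascend in both genomes.

    mums is a list of (pos_a, pos_b) pairs with distinct pos_a values.
    """
    mums = sorted(mums)          # sort by position in Genome A
    tails = []                   # tails[k]: smallest pos_b ending a chain of length k + 1
    tails_idx = []               # index in mums of the MUM stored in tails[k]
    prev = [None] * len(mums)    # back-pointers used to rebuild the chain
    for idx, (_, pos_b) in enumerate(mums):
        k = bisect_left(tails, pos_b)
        if k == len(tails):
            tails.append(pos_b)
            tails_idx.append(idx)
        else:
            tails[k] = pos_b
            tails_idx[k] = idx
        prev[idx] = tails_idx[k - 1] if k > 0 else None
    chain = []
    idx = tails_idx[-1] if tails_idx else None
    while idx is not None:
        chain.append(mums[idx])
        idx = prev[idx]
    return chain[::-1]


if __name__ == "__main__":
    # Hypothetical MUMs 1-7: the third and fifth are out of order in Genome B.
    example = [(10, 12), (50, 55), (90, 400), (130, 140),
               (170, 20), (210, 220), (250, 260)]
    print(longest_increasing_chain(example))
```

In the example the third and fifth MUMs are dropped from the chain, mirroring the random match and the transposition on the slide.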
MUMmer Results
• 2 strains of M. tuberculosis
– H37Rv & CDC1551
– Genome size: 4Mb
– Time: 55 s
• Generating suffix tree: 5 s
• Sorting MUMs: 45s
• S&W alignment: 5 s
Alignment of M. tuberculosis strains
CDC1551 (Top) & H37Rv (bottom)
Single green lines indicate SNPs
Blue lines indicate insertions
Comparison of 2 Mycoplasma genomes
cousins that are distantly related
• M. genitalium: 580 074 nt
• M. pneumoniae: 816 394 nt (+236 320)
• Analysis of proteins tells us that all M.g. proteins are present in M.p.
• Alignment was carried out using
– FASTA (dividing each genome into 1000 bp segments)
– All-against-all searches
– Fixed length of pattern (25)
– Using MUMmer (length = 25)
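The fixed-length pattern comparison can be sketched as an index of all k-mers of one genome that is then scanned with the k-mers of the other. The function name shared_kmers is illustrative, and the sketch looks at the forward strand only, which is an assumption and not a statement about how the original comparison was run.

```python
from collections import defaultdict


def shared_kmers(a, b, k=25):
    """Return (pos_a, pos_b) for every k-mer of b that also occurs in a."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    hits = []
    for j in range(len(b) - k + 1):
        # Every position of the same k-mer in a gives one dot in a dot plot.
        for i in index.get(b[j:j + k], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences with k=4 for readability; the slide uses k=25.
    print(shared_kmers("ACGTACGT", "TTACGTA", k=4))
```

Unlike MUMs, fixed-length matches are neither required to be unique nor maximal, so repeats produce many hits.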
Comparison of 2 Mycoplasma genomes
Using FASTA
Fixed length patterns: 25mers
MUMmer
Post-sequencing challenges
• Genome sequencing is just the beginning to appreciate biocomplexity
• Sequence-based function assignment approaches fail as the sequence similarity drops …
• Structure-based function prediction approaches are limited by the availability of structures, association of structural motifs & associated functional descriptors
• As a result, in any genome,
– Genes with known function: ~40%
– Genes with unknown function: ~60%