Transcript Lecture 8

Introduction to
Bioinformatics
LECTURE 8: Whole genome comparisons
Chapter 8: Welcome to the Hotel Chlamydia
8.1 Uninvited guests
* Symbionts: organisms that live together in a
mutually beneficial relationship
* E. coli and humans: the bacterium receives nutrients and supplies vitamin K
* Numerous examples in Nature: flowers and bees, tickbird and rhinoceros, pea aphid and Buchnera,
mitochondria and Eukaryotes
* Some symbionts have moved permanently into the cells of
the host
* They have become entirely dependent on the host to
provide them with nutrients, oxygen, specific proteins …
* In the process they have lost many genes necessary to
produce such products themselves
* As a result, intracellular obligate symbionts have the
smallest genomes, both in total size and in number of genes
Chlamydia trachomatis
* Chlamydia trachomatis is an intracellular symbiont that
gives no benefit to the host: it is a parasite
* C. trachomatis causes the most common sexually transmitted
disease, with roughly 3 million new infections per year in the USA
* It has lost the ability to produce many biochemical
products and must live in specific cells of the human host
(hence the characterisation as an obligate endosymbiont)
[Figure: Chlamydia trachomatis]
Chlamydia pneumoniae
* Chlamydia pneumoniae is a related bacterial parasite of the
human respiratory tract: it causes pneumonia and bronchitis
* Like C. trachomatis it has a very small genome, about 1 Mb
[Figure: Chlamydia pneumoniae in the human respiratory tract]
Phylogenetic analysis shows that the parasitic lifestyle of Chlamydia
dates back some 700 million years, to the emergence of
eukaryotes
* This is the same date as the purely symbiotic lifestyle of
mitochondria
Whole genome comparisons
* In this lecture we study the problems involved in
comparing entire genomes
* Because of their very small genomes, the Chlamydiae are a
perfect case study
* Moreover, Chlamydia shows high conservation of
gene order and virtually no horizontal gene transfer.
Hotel Chlamydia
‘Hotel Chlamydia’ refers not so much to us being a living
hotel for Chlamydia (which is true), but to genes being
guests in Hotel Chlamydia: guests that come in, reshuffle
rooms, move out, pass along …
8.2 Patterns of genome evolution
* Genome comparison looks at the differences between
the entire gene sets of two organisms
* This provides insight into the evolution and function of genes
* Single nucleotide polymorphisms form the bulk of the
genetic variability
* Genes are also rearranged and shuffled: inversion,
duplication, translocation
* Translocation often occurs between two organisms:
horizontal gene transfer
* Some 20% of E. coli's genes derive from horizontal transfer
* Chromosomes can break apart and/or stick together
* Whole genomes can be duplicated → polyploid individuals
* This is the basis for new functions, as the extra genes
are free to evolve
8.3 Beanbag genomics
* Genome = beanbag of genes + junk-DNA
* Comparison of whole genomes is more than comparing
individual genes
* Genomes also differ by inversions, transpositions, duplications, deletions,
and chromosomal rearrangements
* Therefore an alignment of entire genomes will not work
Comparison of two genomes
[Figure: Basic mechanisms of gene evolution]
Gene evolution: an alignment of entire genomes will not work!
GAC ACTTTTTGG GGG TATATA CATGTAGTAC AAATAAT CG AACCCCCG
GAC ACTTTTTGG GGG TATATA CATGTAGTAC AAATAAT CG AACCCCCG
[Figure labels: inversion, duplication, transposition, deletion]
Therefore we have to break the problem into
smaller pieces and build it back up
with multiple single-gene analyses.
Comparison of two chromosomes: chromosome evolution
GAC ACTTTTTGG GGG TATATA CATGTAGTAC AAATAAT CG AACCCCCG
AACCCCCG AAATAAT CATGTAGTAC GGG TATATA GAC ACTTTTTGG CG
Splitting into new chromosomes
Reshuffling of genes over the chromosomes
* STEP 1: Find which genes are present in both genomes
* STEP 2: Use an ORF finder with a threshold of 100 codons
* EXAMPLE: Chlamydia trachomatis and C. pneumoniae:
--- Organism --------- size (nt) --- ORFs ---
C. trachomatis         1 042 519     916
C. pneumoniae          1 229 853     1048
E. coli                (not given)   5000
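The ORF-finding step can be sketched as follows. This is only a minimal illustration, not the ORF finder used for the lecture: the function name find_orfs, the use of ATG as the only start codon, the three standard stop codons, and the scan over both strands are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=100):
    """List ORFs as (start, end, strand) tuples.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon. Its length is counted in codons, start included, stop excluded.
    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned (so "-" hits refer to the reverse complement).
    ORFs that run off the end without a stop codon are ignored.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((start, i + 3, strand))
                    start = None
    return orfs


toy = "ATG" + "GCT" * 5 + "TAA"        # a 6-codon toy ORF
print(find_orfs(toy, min_codons=3))    # lowered threshold for the toy case
```

With the lecture's threshold of 100 codons the toy sequence yields no ORF; the threshold trades missed short genes against spurious hits.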
* Intracellular symbionts like CT and CP have lost many genes:
they parasitise their host and ‘steal’ its gene products
* CT lives in the urinary tract and CP lives in the respiratory tract
* The differences between their genomes tell us something about the
function of their retained genes
* What are suitable algorithmic methods for comparing lost
and gained genes in Chlamydia?
Similarity on a genomic scale
Similarity between pairs of genes informs about:
* Blocks of conserved gene order
* Changes in size of gene families
* Nucleotide substitutions between orthologous genes
Central idea for genomic comparison
Define the similarity between two genomes as follows:
* Take the nucleotide sequences of all genes found in the two
genomes
* Fill out a matrix with alignment scores between
each possible pair of sequences
* For the Chlamydiae CT and CP this is a 1048x916
matrix
* Use Needleman-Wunsch or BLAST to compute
similarity scores
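A minimal sketch of the genome similarity matrix, using a plain Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1) and the function names nw_score and similarity_matrix are assumptions chosen for illustration; a real analysis would more likely use BLAST or a tuned scoring scheme.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of strings a and b.

    Only the score is returned, so one row of the dynamic programming
    table is kept at a time.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity_matrix(genes_1, genes_2):
    """Matrix of alignment scores: rows = genes_1, columns = genes_2."""
    return [[nw_score(g1, g2) for g2 in genes_2] for g1 in genes_1]


# Toy 'genomes' with two short genes each (made-up sequences).
print(similarity_matrix(["ACGT", "TTTT"], ["ACGT", "TTT"]))
```

For CP against CT the same call would fill the 1048x916 matrix, at the cost of roughly a million pairwise alignments.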
Identifying orthologous and paralogous genes
* Use the genome similarity matrix to distinguish between
paralogs and orthologs
* Remember: homologs are genes that have a common
ancestor; orthologs arise as homologs evolve in sister species;
paralogs arise from duplication and subsequent
specialisation
* Result of the evolution of orthologs and paralogs: no one-to-one
relationship, but (many/one)-to-(many/one)
Reciprocal similarity
* Recognition of orthologs: Best Reciprocal similarity Hits
(BRHs)
* A pair of ORFs is a BRH if each ORF is the other's best match
in the other genome (using alignment scores)
* Possible: ORFs without a BRH
* Possible: an ORF with an ortholog in the other species and a paralog
in the same species.
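The BRH rule can be sketched directly on a similarity matrix. This is an illustrative sketch only: the function name best_reciprocal_hits and the toy scores are made up, and ties are broken arbitrarily in favour of the first best match.

```python
def best_reciprocal_hits(sim):
    """Return (i, j) pairs where row gene i and column gene j are
    each other's best-scoring match in the similarity matrix sim."""
    n_rows, n_cols = len(sim), len(sim[0])
    # Best column for every row gene, and best row for every column gene.
    best_col = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]


# Rows: three ORFs of genome 1; columns: three ORFs of genome 2.
toy_scores = [
    [9, 2, 1],
    [3, 8, 7],
    [2, 6, 5],   # best match is column 1, but column 1 prefers row 1
]
print(best_reciprocal_hits(toy_scores))
```

In the toy matrix the third ORF of genome 1 has no BRH, which is the "ORFs without a BRH" case above.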
EXAMPLE 8.1: Homology in Chlamydia
* Similarity matrix → BRH → orthologs
* With threshold = 100 codons we find 1964 ORFs
(CT: 916 and CP: 1048)
* Among these 1964 ORFs are 728 ortholog pairs
* Also 126 pairs of paralogs (CT: 56, CP: 70)
* These paralogs are more similar to each other than to
orthologs → result of duplication after the species split
* The remaining 13% (= 253 ORFs) are perhaps older paralogs
whose counterparts have been lost in the other species due to specialisation
Identifying gene families
* Defining a gene family is a tricky thing; at some high level
all genes are ‘family’
* The very first DNA-based organism, ancestral to all
present living beings, had a set of genes.
* All (?) genes have derived from these ancestral genes
through duplication and subsequent specialization.
* Genes that cooperate tend to move close together
(Dawkins: like rowers in a rowing boat)
* Practical solution: only consider genes that are > 50%
similar; they are ‘closely’ related and probably have a
similar ‘function’
* Method for finding ‘similar’ genes: clustering
* Drawback: all clustering methods have some degree
of arbitrariness
* Cluster both genomes simultaneously,
then count the number of genes in each cluster (= gene family).
Identify gene families with Hierarchical Clustering
* input: genome similarity matrix d
* method: cluster d with NJ- or UPGMA-algorithm
to group the genes in families
* This is called Hierarchical Clustering (HC)
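A minimal sketch of grouping genes into families by average-linkage agglomeration (the merging rule behind UPGMA). Assumptions: the input is a symmetric matrix of pairwise similarities scaled to the range 0 to 1, merging stops once the best average similarity drops below a threshold (0.5 here, mirroring the "> 50% similar" rule), and all names and toy values are hypothetical.

```python
def average_linkage_families(sim, threshold=0.5):
    """Greedily merge the two most similar clusters (average linkage)
    until no pair of clusters reaches the similarity threshold."""
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, (a, b) = max(
            (avg_sim(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def family_sizes(families, genome_of):
    """Count (CT genes, CP genes) in each family."""
    return [
        (sum(genome_of[g] == "CT" for g in fam),
         sum(genome_of[g] == "CP" for g in fam))
        for fam in families
    ]


# Four genes from both genomes clustered together (toy similarities).
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
families = average_linkage_families(sim)
print(families, family_sizes(families, ["CT", "CP", "CT", "CP"]))
```

The threshold is exactly the arbitrary choice the slides warn about: moving it changes how many families come out.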
Hierarchical Clustering on both Chlamydiae
Application of HC on Chlamydia reveals
a large number of small gene families
and a small number of large families
Similar function of gene families
Largest gene families in C. trachomatis and C. pneumoniae:
--- CT --- CP --- Function ---
    12     12     ABC transporters
    6      15     G family outer membrane protein
    9      10     Function not known
    9      10     Function not known
ABC transporters are transmembrane proteins with binding
sites on both sides: they play a major role in transport into and out of the cell
They are very old: they are (near) identical in all organisms
[Figure: ABC transporter (ATP-binding cassette); hydrophobicity]
[Figure: Schematic of the E. coli vitamin B12 importer system]
[Figure: Snapshots of pore formation in the bilayer, with an applied field of 0.5 V/nm in the presence of 1 M NaCl]
Alternative approaches to finding orthologs
* Clustering of genes in families has some arbitrariness
* Orthologous genes are separated by a speciation event
* Thus, a phylogenetic tree is also a useful metaphor
* A phylogenetic tree is a better representation, but it is
less amenable to an automated analysis
8.4 Synteny
* In Section 8.3 the emphasis was on analysis of genes,
here the emphasis is on chromosomes
* syn- = together, tenia = ribbon, band
* synteny: the relative ordering of genes
on the same chromosome
Major mechanisms of reshuffling of synteny are:
inversions and transpositions
Noise on synteny is caused by:
insertions, duplications, and deletions
Blocks of synteny: long stretches of DNA where the
relative ordering of orthologous genes is conserved
Synteny allows for annotation of non-coding sequences
and identification of homologous intergenic regions
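Finding blocks of synteny can be sketched as follows, assuming the orthologs are numbered by their rank along genome 1 and listed in the order in which they appear along genome 2. A block is then a maximal run whose ranks change by +1 (conserved order) or by -1 (an inverted block). The function name synteny_blocks and the input encoding are assumptions for illustration.

```python
def synteny_blocks(order):
    """Split a gene order into maximal runs of consecutive ranks.

    order: ranks (in genome 1) of the orthologs, listed along genome 2.
    Runs going up by 1 are conserved blocks; runs going down by 1 are
    inverted blocks.
    """
    if not order:
        return []
    blocks = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        step = cur - prev
        block = blocks[-1]
        same_direction = len(block) == 1 or block[1] - block[0] == step
        if abs(step) == 1 and same_direction:
            block.append(cur)
        else:
            blocks.append([cur])
    return blocks


print(synteny_blocks([3, 2, 1, 4, 8, 7, 6, 5, 9]))
```

Insertions, duplications, and deletions would break the "step of exactly 1" test, which is why they count as noise on synteny; a practical version would tolerate small gaps.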
[Figure: Conserved synteny map, cat on human]
Visualising Synteny
Dot-plot:
* x-axis = position on genome_1,
* y-axis = position on genome_2,
* For a homologous gene with genome_1-position x,
and genome_2-position y: put a dot ‘*’ on (x,y)
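The dot-plot recipe can be sketched as a small text renderer. This is a hedged illustration: the function name ascii_dot_plot, the grid size, and the toy positions are assumptions, and a real plot would normally use a plotting library.

```python
def ascii_dot_plot(points, size_1, size_2, cols=20, rows=10):
    """Render homologous gene positions as a text dot-plot.

    points: (x, y) pairs, x = position on genome_1, y = position on genome_2.
    size_1, size_2: genome lengths, used to scale positions to the grid.
    The origin is in the bottom-left corner.
    """
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        c = min(x * cols // size_1, cols - 1)
        r = min(y * rows // size_2, rows - 1)
        grid[rows - 1 - r][c] = "*"
    return "\n".join("".join(row) for row in grid)


# Conserved order gives a rising diagonal; an inverted block would show
# up as a falling diagonal segment.
print(ascii_dot_plot([(0, 0), (1, 1), (2, 2), (3, 3)], 4, 4, cols=4, rows=4))
```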
Chlamydia Synteny
The high level of conservation in Chlamydia is remarkable
but typical for all intracellular symbionts
For instance Buchnera aphidicola, an intracellular symbiont
in pea aphids, has retained synteny for 50 million years.
[Figure: pea aphid and Buchnera aphidicola]
SYMBIOSIS BETWEEN PEA APHIDS AND BUCHNERA
Plant sap contains little protein and aphids cannot produce
ten essential amino acids
The required amino acids come from their symbiotic friends,
the bacterium Buchnera aphidicola.
The symbiont's genome reflects this biosynthetic activity.
Buchnera aphidicola carries the
two genes trpEG for tryptophan
synthesis. Each bacterium contains
three or four plasmids that each contain
four tandem repeats of these genes,
resulting in 12 to 16 copies of trpEG.
Thus, the symbionts supply the host with the essential
amino acids and receive free nutrients and shelter.
The relation between pea aphids and Buchnera aphidicola is very old …
Endosymbionts and Synteny
Buchnera aphidicola has retained synteny for 50 million
years.
IN GENERAL: The cloistered lifestyle of endo-symbiotic
organisms shields them from viruses and other bacteria that
may induce gene rearrangement
Homologous intergenic regions and
‘phylogenetic footprinting’
Intergenic regions are mostly not under selection → fast evolution
Non-protein-coding regions of the genome that are
conserved are suspicious: they may be RNA-coding or
regulatory sequences
Using syntenic coding regions as anchors we can find
intergenic regions that are highly conserved.
This is called phylogenetic footprinting
[Figure: intergenic regions; ORF labels: CT671-CP1949, CT672, CT673-CP1951, CP1950]
A metric for the syntenic distance
Two genomes can be formed by many smaller syntenic
blocks rearranged by inversions or transpositions
Can we define a metric for the syntenic distance of
these two genomes?
We are not interested in nucleotide differences but in
the number of genomic rearrangements that separate
the two species.
METRIC = smallest number of operations (=inversion or
transposition) that transform one genome into the other
EXAMPLE: Sorting by reversals
As an example let us consider only the case of inversions
“sorting by reversals” = minimum number of inversions to
transform one genome into the other
Algorithm: given a permutation of N numbers, find the
shortest series of reversals that sorts them back into their
original order
3 2 1 4 8 7 6 5 9
1 2 3 4 8 7 6 5 9
1 2 3 4 5 6 7 8 9
2 reversals → syntenic distance = 2
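For small permutations the syntenic distance under reversals can be checked by brute force. The sketch below is a breadth-first search over all permutations reachable by reversals; it is an illustration with a hypothetical function name (reversal_distance), feasible only for short gene orders, and not an efficient algorithm for real genomes.

```python
from collections import deque


def reversal_distance(perm, target=None):
    """Minimum number of reversals turning perm into target
    (default: perm in sorted order), found by breadth-first search."""
    start = tuple(perm)
    goal = tuple(sorted(perm)) if target is None else tuple(target)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        if state == goal:
            return dist
        # Try every reversal of a segment state[i..j] with i < j.
        for i in range(len(state)):
            for j in range(i + 1, len(state)):
                nxt = state[:i] + state[i:j + 1][::-1] + state[j + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # target is not a rearrangement of perm


print(reversal_distance([3, 2, 1, 4, 8, 7, 6, 5, 9]))  # prints 2
```

Because breadth-first search visits states in order of distance, the first time the target is reached the number of reversals is minimal.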
Sorting by reversals
NOTE:
In practice we do not know the original genome, so we
select either one of the two as ‘the’ standard
SIMPLE REVERSAL ALGORITHM
STEP 1: designate one sequence as the standard s
and the other as t
STEP 2: i=1; increase(i) until s(i) ≠ t(i) or i=length(t)
STEP 3: j=i; increase(j) until t(j) = s(i); reverse(t(i:j))
STEP 4: if t = s, stop; else goto STEP 2
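The steps above can be sketched in Python as follows. Assumptions: s and t contain the same distinct genes, indices are 0-based (the slide counts from 1), and the function name simple_reversal_sort is made up for illustration.

```python
def simple_reversal_sort(s, t):
    """Greedy reversal sort: walk along the standard s and, at the first
    mismatch, reverse the segment of t that brings the right gene into
    place. Returns the list of reversals as (i, j) index pairs."""
    t = list(t)                      # work on a copy
    reversals = []
    for i in range(len(s)):
        if t[i] != s[i]:
            j = t.index(s[i], i)     # where the wanted gene currently sits
            t[i:j + 1] = t[i:j + 1][::-1]
            reversals.append((i, j))
    return reversals


standard = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(simple_reversal_sort(standard, [3, 2, 1, 4, 8, 7, 6, 5, 9]))
# [(0, 2), (4, 7)]: two reversals, as in the example
print(len(simple_reversal_sort([1, 2, 3, 4, 5, 6], [6, 1, 2, 3, 4, 5])))
# 5, although 2 reversals suffice for this order
```

The second call shows that the greedy procedure always produces a valid series of reversals but not necessarily the shortest one, so it gives only an upper bound on the syntenic distance.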
QUESTION:
Can this algorithm solve overlapping reversals?
REMARK :
Involving transpositions is even more complex …
END of LECTURE 8