Genomic and comparative genomic analysis

BIO520 Bioinformatics
Jim Lund
Comparative genomics delivers
• Clues as to human disease genes and evolutionary history
• Evidence of general trends in genome evolution
• Previously unknown regulatory strategies
• “Natural history” of species as apparent in genome records
• Surprises
Difference is in Scale and Direction
Other “omics”:
• One or several genes compared against all other known genes.
• Use the genome to inform us about the entire organism.
Comparative:
• Entire genome compared to other entire genomes.
• Use information from many genomes to learn more about the individual genes.
What are some questions that comparative genomics can address?
• How has the organism evolved?
• What differentiates species?
• Which non-coding regions are important?
• Which genes are required for organisms to survive in a certain environment? (prokaryotes)
Genomic characteristics observed in recently diverged species
[Figure: phylogenetic tree of species A–F on a time axis (My) with ticks at -200, -150, -80, -10, and 0.]
• Organism-specific differences in gene regulation are more apparent than differences in genome sequence or structure
• Relatively small amount of neutral drift
• Apparent positive selection
• Some chromosomal rearrangement
• Minimal species-specific gene innovation
Genomic characteristics observed in species that have diverged ~80 MYA
[Figure: phylogenetic tree of species A–F on a time axis (My) with ticks at -200, -150, -80, -10, and 0.]
• Chromosomal rearrangements dominate organizational change.
• Changes in chromosome number likely.
• Conservation of syntenic regions within rearrangements.
• Highly conserved features indicate purifying selection against a background of drift, and so mark important genomic features held in common.
• Protein domain arrangements largely conserved among orthologs.
• Species-specific gene duplication, divergence, and/or loss.
Genomic characteristics observed between species that have diverged ~1 BYA
[Figure: phylogenetic tree of species A, E, F, and G on a time axis (My) with ticks at -1000, -500, and 0.]
• Genome structure has no resolvable large or small-scale homology.
• Cis-regulatory regions do not correspond.
• Greatest conservation at the functional level in some protein domains and functional RNA.
• Different strategies in gene organization and regulation.
• Apparent homology in shared-ancestral systems, such as energy processing and storage.
Different Questions Require Different Comparisons
From: Hardison. PLoS Biology. Vol 1 (2): 156-160
What is compared?
• Gene location
• Gene structure
  – Exon number
  – Exon lengths
  – Intron lengths
  – Sequence similarity
• Gene characteristics
  – Splice sites
  – Codon usage
  – Conserved synteny
[Figure: timescale in millions of years.]
From: Miller et al. Annu. Rev. Genom. Human. Genet. 2004.5:15-56.
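Reminder: Orthologues & Paralogues
[Figure: globin gene tree along a time axis t (0 to 3). An early globin gene undergoes a first duplication event, giving the alpha chain and beta chain lineages. A second event (speciation) then gives frog alpha, human alpha, human beta, and frog beta. Genes related through the speciation event (e.g. frog alpha and human alpha) are labelled orthologues; genes related through the duplication event (e.g. human alpha and human beta) are labelled paralogues.]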
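The orthologue/paralogue distinction can be sketched in a few lines of Python. This is a minimal illustration, not a real phylogenetics tool: the tree encoding (nested tuples with an event label) and the gene names are assumptions made for this example. Each pair of genes is labelled by the event at its most recent common ancestor in the gene tree.

```python
# Hypothetical gene tree for the globin example: (event, left, right).
# Leaves are gene names; the encoding is an assumption made for this sketch.
GLOBIN_TREE = (
    "duplication",
    ("speciation", "frog_alpha", "human_alpha"),
    ("speciation", "frog_beta", "human_beta"),
)


def leaves(node):
    """Return all gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its most recent common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologues" if event == "speciation" else "paralogues"
    # Genes on opposite sides of this node have this node as their last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(GLOBIN_TREE)
for pair, label in sorted(relations.items(), key=lambda kv: sorted(kv[0])):
    print(" / ".join(sorted(pair)), "->", label)
```

Under this encoding, frog alpha and human beta also come out as paralogues, because their last common ancestor is the duplication event.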
Figure 1 Regions of the human and mouse homologous genes: Coding exons
(white), noncoding exons (gray), introns (dark gray), and intergenic regions
(black). Corresponding strong (white) and weak (gray) alignment regions of GLASS
are shown connected with arrows. Dark lines connecting the alignment regions
denote very weak or no alignment. The predicted coding regions of ROSETTA in
human, and the corresponding regions in mouse, are shown (white) between the
genes and the alignment regions.
Functional elements: gene regulation? Chromatin structure?
Example
Terminologies (Cont’d)
– Synteny
  • Two or more genes that are located on the same chromosome.
  • Relevant within a species.
– Conserved synteny
  • Orthologs of genes that are syntenic in one species are also located on a single chromosome in a second species.
  • Gene order is irrelevant.
– Conserved segments/linkages
  • In a segment of DNA, the order of multiple orthologous genes is the same in two species.
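The difference between conserved synteny and a conserved segment can be made concrete with a small Python sketch. The data layout (gene name mapped to a chromosome and a position) and all gene and chromosome names are hypothetical. Treating an inverted order as "the same order" is an assumption of this sketch, not something the definitions above state.

```python
def conserved_synteny(genes, loc_a, loc_b):
    """True if the genes share one chromosome in species A and their
    orthologs share one chromosome in species B. Gene order is ignored."""
    chroms_a = {loc_a[g][0] for g in genes}
    chroms_b = {loc_b[g][0] for g in genes}
    return len(chroms_a) == 1 and len(chroms_b) == 1


def conserved_segment(genes, loc_a, loc_b):
    """True if, in addition, the orthologs occur in the same order.
    Assumption: an inverted segment still counts as the same order."""
    if not conserved_synteny(genes, loc_a, loc_b):
        return False
    order_a = sorted(genes, key=lambda g: loc_a[g][1])
    order_b = sorted(genes, key=lambda g: loc_b[g][1])
    return order_a == order_b or order_a == order_b[::-1]


# Hypothetical locations: ortholog name -> (chromosome, position).
species_a = {"g1": ("chr1", 100), "g2": ("chr1", 200), "g3": ("chr1", 300)}
species_b = {"g1": ("chr4", 900), "g2": ("chr4", 50), "g3": ("chr4", 500)}
species_c = {"g1": ("chr7", 10), "g2": ("chr7", 20), "g3": ("chr7", 30)}

genes = ["g1", "g2", "g3"]
print(conserved_synteny(genes, species_a, species_b))  # same chromosome, order shuffled
print(conserved_segment(genes, species_a, species_b))
print(conserved_segment(genes, species_a, species_c))
```

A stricter version would also require that no unrelated genes interrupt the segment; that check is left out here.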
Image credit: U.S. Department of Energy Human Genome Program
From: http://www.macdevcenter.com/pub/a/mac/2004/06/29/bioinformatics.html
Q: Why do gene pairs in syntenic regions have more significant E-values?
VISTA
A genomic alignment and visualization program
http://genome.lbl.gov/vista/index.shtml
• VISTA automatically finds an orthologue for your input sequence and performs a VISTA similarity plot
• Example: Rat BAC: gj (AC097115)
• For alignment, uses the AVID or LAGAN programs
  • Quickly aligns 100’s of kb
  • Can handle sequence in draft format
  • Uses HMM-like algorithm to find strong anchors from a collection of maximal matches
• Uses VISTA browser – sequence alignment visualization tool
  • Allows easy visualization of areas with high similarity.
  • Visualization is scalable – allows you to zoom in/out.
Gene: CARP – cardiac ankyrin repeat protein
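A similarity plot of this kind is, at its core, percent identity computed in windows along a pairwise alignment. The sketch below shows that general idea only; it is not VISTA's own code. The function name, the window and step sizes, and the toy sequences are all assumptions.

```python
def windowed_identity(aln_a, aln_b, window=100, step=None):
    """Percent identity per window across a gapped pairwise alignment.

    aln_a and aln_b are the two aligned rows (same length, '-' for gaps).
    Returns a list of (window_start, percent_identity) points to plot.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    step = step or window  # default: non-overlapping windows
    points = []
    for start in range(0, len(aln_a) - window + 1, step):
        columns = zip(aln_a[start:start + window], aln_b[start:start + window])
        # A column counts as a match only if both residues agree and are not gaps.
        matches = sum(1 for x, y in columns if x == y and x != "-")
        points.append((start, 100.0 * matches / window))
    return points


# Toy alignment (hypothetical sequences).
profile = windowed_identity("ACGTACGT", "ACGTTCG-", window=4)
print(profile)
```

Peaks in such a profile that fall outside annotated exons are the candidates for conserved non-coding sequence.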
There are many genomic alignment and visualization tools:
• BLASTZ/PipMaker: http://bio.cse.psu.edu/
• AVID/VISTA: http://www-gsd.lbl.gov/vista/
• LAGAN/Multi-LAGAN: http://lagan.stanford.edu
• AVID: http://baboon.math.berkeley.edu/mAVID
• BLAT: http://www.genome.ucsc.edu/
• SSAHA: http://bioinfo.sarang.net/wiki/SSAHA
• CONREAL: http://conreal.niob.knaw.nl/
• MUMmer: http://www.tigr.org/software/mummer.
Example output from PipMaker
Genomic view of simple sequence categories
Q: What general patterns can be seen?
Q: Why do some of the factors correlate w/ gene density?
Multi-species conservation
Conserved Non-Coding Sequences
What are those multi-species conserved sequences (MCS)?
• Regulatory
– Transcription factor binding sites
– miRNAs or miRNA target sites
– Chromosome structure
– Insulator sequences
• Structural
– Replication
– Recombination
– Chromosome structure
Between-proteome comparisons
Used to identify orthologs.
Protein alignments involving a search of one protein from species A against the proteome of a species B
Several different bioinformatic approaches have been used to make the comparison.
• High scoring reciprocal best hits.
• COGs (and KOGs)
• Genome-wide phylogenetic analysis
Using high scoring reciprocal best hits
• High scoring reciprocal best hits with the same domain structure are most likely orthologs
  – Share common ancestry
  – Likely to have the same function
  – Function likely to be more essential (replication, etc.)
  – Genes are not unique to either organism.
  – E-value should be <0.01 and alignment should stretch over >60% of each protein
• High scoring hits with slightly different domain structures may be orthologous, but it is difficult to tell due to common, conserved domains that have complicated histories
• Cluster analysis can help sort this out
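The reciprocal-best-hit test can be sketched as follows. This is a minimal illustration under stated assumptions: the `Hit` record and its field names are hypothetical stand-ins for parsed search output, and "best" is taken to mean the highest bit score. The E-value cutoff is treated as an upper bound (E-value below 0.01), and the alignment must cover more than 60% of each protein.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed search hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    bitscore: float
    query_cov: float    # fraction of the query covered by the alignment
    subject_cov: float  # fraction of the subject covered by the alignment


def best_hits(hits, max_evalue=0.01, min_cov=0.6):
    """Map each query to its best-scoring subject among hits that pass the filters."""
    best = {}
    for h in hits:
        if h.evalue >= max_evalue or h.query_cov <= min_cov or h.subject_cov <= min_cov:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **filters):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b, **filters)
    reverse = best_hits(b_vs_a, **filters)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up hits for illustration only.
worm_vs_yeast = [
    Hit("wormA", "yeast1", 1e-50, 200, 0.90, 0.85),
    Hit("wormA", "yeast2", 1e-20, 90, 0.80, 0.80),
    Hit("wormB", "yeast2", 1e-30, 120, 0.70, 0.90),
    Hit("wormC", "yeast3", 0.5, 20, 0.90, 0.90),   # fails the E-value filter
]
yeast_vs_worm = [
    Hit("yeast1", "wormA", 1e-48, 195, 0.85, 0.90),
    Hit("yeast2", "wormA", 1e-19, 88, 0.80, 0.80),
    Hit("yeast2", "wormB", 1e-29, 118, 0.90, 0.70),
    Hit("yeast3", "wormC", 0.4, 21, 0.90, 0.90),   # fails the E-value filter
]

pairs = reciprocal_best_hits(worm_vs_yeast, yeast_vs_worm)
print(pairs)
```

Checking that the two proteins share the same domain structure is a separate step and is not covered by this sketch.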
Worm v. yeast sequences

Cut-off p-value | Total num seq groups | Num groups w/ > 2 members | Num (%) of all (6217) yeast proteins in groups | Num (%) of all worm proteins in groups
<e-10           | 1171                 | 560                       | 2697 (40)                                      | 3653 (19)
<e-20           | 984                  | 442                       | 1848 (30)                                      | 2497 (13)
<e-50           | 552                  | 230                       | 888 (14)                                       | 1094 (6)
<e-100          | 236                  | 79                        | 330 (5)                                        | 370 (2)
What is COG?
• The database of Clusters of Orthologous Groups of proteins (COGs) represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes.
• Each COG consists of individual orthologous proteins or orthologous sets of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
• http://www.ncbi.nlm.nih.gov/COG
A shortcut for identifying orthologs: the genome-specific best hit (BeT)
• Given a gene from one genome, the gene from another genome with the highest sequence similarity (the BeT) is the ortholog.
Algorithm of clustering orthologous groups (overview)
[Flowchart: Input protein sequences → All-against-all sequence comparison (gapped BLAST) → paralogs → Graph of BeTs → Ortholog triangle → Merge triangles → Quality control → COG database]
The ortholog triangle
[Figure: triangle linking genes a, b, and c from genomes A, B, and C; multiple alignment.]
• Comparing pairwise alignments of AC and AB, we deduce the alignment of BC.
• Comparing the calculated and deduced alignments of BC: if the two alignments are consistent, the BeTs triangle is a triangle of orthologs and can initiate a new COG group.
Algorithm – merging triangles
• Merge triangles that share a common side until no new ones can be joined.
• The merged groups are the candidate orthologous sets.
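The triangle step and the merging step can be sketched together. This is a simplified illustration, not the COG pipeline itself. It assumes BeT relationships are supplied as undirected edges between (genome, gene) pairs, it uses made-up gene names, and it leaves out both the alignment-consistency check on triangles and the later quality-control step.

```python
from itertools import combinations


def find_triangles(edges):
    """Find triangles of BeT edges whose three genes come from three different genomes."""
    adjacent = {}
    for a, b in edges:
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)
    triangles = set()
    for a, b, c in combinations(sorted(adjacent), 3):
        linked = b in adjacent[a] and c in adjacent[a] and c in adjacent[b]
        if linked and len({a[0], b[0], c[0]}) == 3:
            triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two genes) until nothing more can be joined."""
    tris = sorted(triangles, key=sorted)
    parent = list(range(len(tris)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tris)), 2):
        if len(tris[i] & tris[j]) == 2:  # two shared genes = one common side
            parent[find(i)] = find(j)

    groups = {}
    for i, tri in enumerate(tris):
        groups.setdefault(find(i), set()).update(tri)
    return sorted(groups.values(), key=sorted)


# Hypothetical BeT edges between (genome, gene) pairs.
bet_edges = [
    (("E", "e1"), ("H", "h1")), (("H", "h1"), ("G", "g1")), (("E", "e1"), ("G", "g1")),
    (("E", "e1"), ("Y", "y1")), (("H", "h1"), ("Y", "y1")),
    (("E", "e2"), ("H", "h2")), (("H", "h2"), ("G", "g2")), (("E", "e2"), ("G", "g2")),
]

triangles = find_triangles(bet_edges)
candidate_groups = merge_triangles(triangles)
for group in candidate_groups:
    print(sorted(group))
```

Because side-sharing is transitive under repeated merging, grouping triangles with union-find gives the same result as joining them one at a time until no more can be added.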
A simple COG with two yeast paralogs: isoleucyl-tRNA synthetase
Functional and phylogenetic patterns
Species codes: E, E. coli; H, H. influenzae; G, M. genitalium; P, M. pneumoniae; C, Synechocystis sp.; M, M. jannaschii; Y, S. cerevisiae.
Phyletic patterns of COGs (2003)
~500 COGs
• 74% of COGs show a scattered distribution, which reflects frequent lineage-specific gene loss and horizontal gene transfer in prokaryotic evolution.
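A phyletic pattern is the presence/absence profile of a COG across genomes. The sketch below writes such profiles as strings, using the one-letter species codes E, H, G, P, C, M, and Y. The COG names and their memberships are made up for illustration, and no threshold for "scattered" is implied.

```python
from collections import Counter

GENOMES = "EHGPCMY"  # one-letter species codes, in a fixed order


def phyletic_pattern(present):
    """Write presence as the species letter and absence as '-'."""
    return "".join(code if code in present else "-" for code in GENOMES)


# Hypothetical COG memberships: COG name -> genomes in which it has a member.
cog_members = {
    "COG_a": set("EHGPCMY"),  # present in all seven genomes
    "COG_b": {"E", "H", "Y"},
    "COG_c": {"E", "H", "Y"},
    "COG_d": {"G", "P", "M"},
}

patterns = {name: phyletic_pattern(genomes) for name, genomes in cog_members.items()}
pattern_counts = Counter(patterns.values())
for name, pattern in patterns.items():
    print(name, pattern)
```

Counting how many COGs share each pattern gives a first view of which genes tend to be gained or lost together.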
Representation of the 7 analyzed eukaryotic species in KOGs
• KOG: eukaryotic orthologous groups