Comparative Genomics

Comparative Gene Prediction in the
Human Genome
Maribel Hernandez Rosales
What is Comparative Genomics?





Comparative genomics is the analysis and comparison of genomes
from different species.
The purpose is to gain a better understanding of how species have
evolved and to determine the function of genes and noncoding
regions of the genome.
Researchers have learned a great deal about the function of human
genes by examining their counterparts in simpler model organisms
such as the mouse.
Genome researchers look at many different features when
comparing genomes: sequence similarity, gene location, the length
and number of coding regions (called exons) within genes, the
amount of noncoding DNA in each genome, and highly conserved
regions maintained in organisms as simple as bacteria and as
complex as humans.
Comparative genomics involves the use of computer programs that
can line up multiple genomes and look for regions of similarity
among them.
What are the comparative genome sizes of humans and other organisms being studied?

Organism                              Estimated size        Estimated gene number   Average gene density        Chromosome number
Homo sapiens (human)                  2,900 million bases   ~30,000                 1 gene per 100,000 bases    46
Rattus norvegicus (rat)               2,750 million bases   ~30,000                 1 gene per 100,000 bases    42
Mus musculus (mouse)                  2,500 million bases   ~30,000                 1 gene per 100,000 bases    40
Drosophila melanogaster (fruit fly)   180 million bases     13,600                  1 gene per 9,000 bases      8
Arabidopsis thaliana (plant)          125 million bases     25,500                  1 gene per 4,000 bases      5
Caenorhabditis elegans (roundworm)    97 million bases      19,100                  1 gene per 5,000 bases      6
Saccharomyces cerevisiae (yeast)      12 million bases      6,300                   1 gene per 2,000 bases      16
Escherichia coli (bacteria)           4.7 million bases     3,200                   1 gene per 1,400 bases      1
H. influenzae (bacteria)              1.8 million bases     1,700                   1 gene per 1,000 bases      1
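The "average gene density" column is simply genome size divided by gene number. A minimal Python sketch of that arithmetic follows, using sizes and gene counts copied from the table; the function name `bases_per_gene` is illustrative only.

```python
# Genome size (bases) and estimated gene number, copied from the table.
GENOMES = {
    "Homo sapiens": (2_900_000_000, 30_000),
    "Drosophila melanogaster": (180_000_000, 13_600),
    "Arabidopsis thaliana": (125_000_000, 25_500),
    "Caenorhabditis elegans": (97_000_000, 19_100),
    "Saccharomyces cerevisiae": (12_000_000, 6_300),
    "Escherichia coli": (4_700_000, 3_200),
    "H. influenzae": (1_800_000, 1_700),
}


def bases_per_gene(genome_size: int, gene_count: int) -> float:
    """Average number of bases per gene (the inverse of gene density)."""
    return genome_size / gene_count


for organism, (size, genes) in GENOMES.items():
    print(f"{organism}: 1 gene per {bases_per_gene(size, genes):,.0f} bases")
```

Most results land close to the rounded figures in the table (for example about 96,667 bases per gene for human). The fruit fly entry does not follow from the listed size and gene count, so it was presumably derived from a different size estimate.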
Eukaryotic Gene Finding
Comparative Gene Prediction





GenScan: ab initio gene prediction.
GeneWise, Procrustes: homology guided.
Rosetta, SGP-1 (Syntenic Gene Prediction), CEM
(Conserved Exon Method): gene prediction and
sequence alignment are clearly separated.
GenomeScan: ab initio prediction modified by BLAST
homologies.
SGP-2, TwinScan, SLAM, DoubleScan:
modification of the GenScan scoring scheme to
incorporate similarity to known proteins.
GenScan





A general probabilistic model for the gene structure of
human genomic sequences.
Gene identification by identifying complete exon/intron
structures of genes in genomic DNA.
Includes the capacity to predict multiple genes in a sequence,
to deal with partial as well as complete genes, and to predict
consistent sets of genes occurring on either or both DNA
strands.
Markov model of coding regions: predictions do not depend
on the presence of a similar gene in the protein sequence
databases, and they complement the information provided by
homology-based gene identification methods (BLASTX).
Maximal Dependence Decomposition (MDD): a new
statistical model of donor and acceptor splice sites which
captures important dependencies between signal positions.
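To make the "Markov model of coding regions" idea concrete, here is a minimal sketch of scoring a sequence by its log-odds under a coding versus a noncoding Markov chain. It assumes a first-order, homogeneous chain with add-one smoothing and tiny made-up training sequences. GenScan's own coding model is more elaborate, and none of the names below come from it.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts[context]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return (row[base] + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=1):
    """Sum of log(P_coding / P_noncoding) over all transitions in seq."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score


# Toy training data (hypothetical): GC-rich "coding", AT-rich "noncoding".
coding_model = train_markov(["ATGGCCGCCGCCGGC", "ATGGCGGCCGGCGCC"])
noncoding_model = train_markov(["ATATATTTAAATATA", "TTTAAATATATTAAT"])

print(log_odds("GCCGCCGGC", coding_model, noncoding_model))  # positive: coding-like
print(log_odds("ATATTTAAA", coding_model, noncoding_model))  # negative: noncoding-like
```

A positive score means the sequence is better explained by the coding model. No protein database is consulted, which is the point made on the slide.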
Pre-mRNA Splicing
[Figure: exon definition vs. intron definition in pre-mRNA splicing. Labels: SR proteins; 5' splice signal; branch signal; polyY; 3' splice signal; exonic enhancers; exonic repressor; intronic enhancers; intronic repressor; assembly of spliceosome, catalysis.]
Hidden Semi-Markov Model (HSMM)
GenScan HMM



N - intergenic region
P - promoter
F - 5' untranslated region
Esngl - single exon, intronless (translation start -> stop codon)
Einit - initial exon (translation start -> donor splice site)
Ek - phase k internal exon (acceptor splice site -> donor splice site)
Eterm - terminal exon (acceptor splice site -> stop codon)
Ik - phase k intron: 0 - between codons; 1 - after the first base of a codon; 2 - after the second base of a codon
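The intron phase defined above can be computed from the coding lengths of the exons that precede the intron. A small sketch, assuming exon lengths are counted from the translation start and cover coding bases only (the function name is illustrative):

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron that follows each exon except the last.

    The phase is the number of bases of an unfinished codon that lie
    before the intron: 0 (between codons), 1 or 2.
    """
    phases = []
    coding_bases = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases


# Three coding exons of 100, 50 and 63 bases (213 bases = 71 codons).
print(intron_phases([100, 50, 63]))  # [1, 0]
```

The first intron interrupts a codon after its first base (phase 1); the second falls between codons (phase 0).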
GenScan Features






Model both strands at once
Each state may output a string of symbols
(according to some probability distribution).
Explicit intron/exon length modeling
Advanced splice site modeling
Parameters learned from annotated genes
Prediction of multiple genes in a sequence
(partial or complete).
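The features "each state may output a string of symbols" and "explicit length modeling" are what distinguish a hidden semi-Markov model from a plain HMM. Below is a toy sketch of Viterbi decoding over segments with explicit duration distributions. The two states, their probabilities and the sequence are invented for illustration and bear no relation to GenScan's trained parameters or full state graph.

```python
import math


def segment_viterbi(seq, states, start, trans, duration, emit):
    """Most probable segmentation of seq under a hidden semi-Markov model.

    start[s]       : probability that the first segment is in state s
    trans[r][s]    : probability of moving from state r to state s
    duration[s][d] : probability that a segment in state s has length d
    emit[s][base]  : per-base emission probability inside state s
    Returns a list of (state, begin, end) tuples, end exclusive.
    """
    n = len(seq)
    neg_inf = float("-inf")
    # best[j][s]: best log probability of seq[:j] whose last segment is in s
    best = [{s: neg_inf for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for d, p_d in duration[s].items():
                i = j - d
                if i < 0 or p_d <= 0:
                    continue
                seg = math.log(p_d) + sum(math.log(emit[s][b]) for b in seq[i:j])
                if i == 0:
                    p0 = start.get(s, 0)
                    cands = [(math.log(p0) + seg, None)] if p0 > 0 else []
                else:
                    cands = [(best[i][r] + math.log(trans[r][s]) + seg, r)
                             for r in states
                             if best[i][r] > neg_inf and trans[r].get(s, 0) > 0]
                for score, r in cands:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = (i, r)
    s = max(states, key=lambda t: best[n][t])
    if best[n][s] == neg_inf:
        raise ValueError("no segmentation is possible under this model")
    segments, j = [], n
    while j > 0:  # follow back-pointers from the end of the sequence
        i, r = back[j][s]
        segments.append((s, i, j))
        j, s = i, r
    return segments[::-1]


# Toy model (hypothetical): AT-rich spacer "N" alternating with GC-rich exon "E".
states = ["N", "E"]
start = {"N": 1.0}
trans = {"N": {"E": 1.0}, "E": {"N": 1.0}}
duration = {"N": {d: 1 / 6 for d in range(1, 7)}, "E": {3: 0.2, 6: 0.6, 9: 0.2}}
emit = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "E": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

print(segment_viterbi("ATTAGCCGGCTATA", states, start, trans, duration, emit))
```

Because each segment is scored as a whole, the exon length distribution can take any shape (here it peaks at 6 bases) rather than the geometric shape that a plain HMM with self-transitions would impose.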
GenomeScan



We can enhance our gene prediction by using
external information: DNA regions with
homology to known proteins are more likely to
be coding exons.
Combine probabilistic ‘extrinsic’ information
(BLAST hits) with a probabilistic model of gene
structure/composition (GenScan).
Focus on the ‘typical case’, in which homologous but
not identical proteins are available.
Ab Initio modified by BLAST
homologies
GeneWise


Motivation: use a well-curated database of protein domains (Pfam) to
help annotate genomic DNA
The GeneWise algorithm aligns a profile HMM directly to
the DNA
GeneWise
Start with a Pfam domain HMM
  - Replace amino-acid emissions with codon emissions
  - Allow for sequencing errors (deletions/insertions)
  - Add a 3-state intron model

GeneWise Model
[Figure: GeneWise model.]
GeneWise Intron Model
[Figure: GeneWise intron model. Labels: 5' site; central; PY tract; spacer; 3' site.]
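The step "replace amino-acid emissions with codon emissions" can be sketched as follows. The sketch assumes the standard genetic code and splits each amino-acid probability evenly among its synonymous codons. The even split is an assumption made here for illustration; the slides do not say how GeneWise weights codons.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codon_emissions(aa_emissions):
    """Spread each amino-acid probability evenly over its synonymous codons."""
    synonyms = {}
    for codon, aa in CODON_TABLE.items():
        synonyms.setdefault(aa, []).append(codon)
    return {codon: aa_emissions.get(aa, 0.0) / len(synonyms[aa])
            for codon, aa in CODON_TABLE.items() if aa != "*"}


# Hypothetical match-state emissions of one profile HMM column.
column = {"M": 0.5, "L": 0.3, "W": 0.2}
emissions = codon_emissions(column)
print(emissions["ATG"], emissions["CTG"], emissions["TGG"])
```

Leucine has six codons, so each receives one sixth of its probability, while methionine and tryptophan keep theirs whole. The 61 sense-codon emissions still sum to 1.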
GeneWise Features & Problems
Finds the "best" alignment of DNA to a protein domain
Alignment gives exact exon-intron boundaries
Parameters learned from species-specific statistics
Only provides a partial prediction, and only where
the homology lies
  - Does not find "more" genes
Pseudogenes and retrotransposons are picked up
CPU intensive
  - Solution: pre-filter with BLAST
Rosetta


Gene prediction is separated from sequence alignment.
First, two homologous genomic sequences are aligned with the
global alignment program GLASS.
Then, gene structures (splice sites, exon number and length, etc.)
are predicted that are compatible with this alignment, meaning
that predicted exons fall in the aligned regions.
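The compatibility rule ("predicted exons fall in the aligned regions") can be illustrated with a simple interval filter. This is only a sketch: it treats the alignment as a list of aligned blocks on one sequence, with half-open coordinates, and the function names are hypothetical rather than taken from Rosetta.

```python
def inside_aligned_region(exon, aligned_blocks):
    """True if the exon (start, end) lies entirely within one aligned block."""
    start, end = exon
    return any(b_start <= start and end <= b_end
               for b_start, b_end in aligned_blocks)


def compatible_exons(candidate_exons, aligned_blocks):
    """Keep only candidate exons that fall inside aligned regions."""
    return [exon for exon in candidate_exons
            if inside_aligned_region(exon, aligned_blocks)]


aligned_blocks = [(100, 250), (400, 520)]
candidates = [(120, 200), (240, 300), (410, 500)]
print(compatible_exons(candidates, aligned_blocks))  # [(120, 200), (410, 500)]
```

The candidate at (240, 300) is dropped because it runs past the end of the first aligned block.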
Syntenic Gene Prediction



This approach does not require the comparison
of two homologous genomic sequences.
A query sequence from a target genome is
compared against a collection of sequences from
a second (informant, reference) genome, and the
results of the comparison are used to modify the
scores of the exons produced by an underlying "ab
initio" gene prediction algorithm.
Gene prediction and sequence alignment are
separated.
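As an illustration of "modifying the scores of the exons" with comparison results, the sketch below adds to each ab initio exon score a bonus proportional to how much of the exon is covered by similarity hits projected onto the query. The additive formula and the `weight` parameter are assumptions for this example, not the scoring scheme of any particular program.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def rescore_exons(exons, hits, weight=1.0):
    """Add a similarity bonus to each (start, end, score) exon.

    hits are (start, end, hit_score) intervals on the query sequence.
    Each hit contributes its score times the fraction of the exon it covers.
    """
    rescored = []
    for start, end, score in exons:
        bonus = sum(hit_score * overlap(start, end, h_start, h_end) / (end - start)
                    for h_start, h_end, hit_score in hits)
        rescored.append((start, end, score + weight * bonus))
    return rescored


exons = [(100, 200, 2.0), (300, 360, 2.5)]
hits = [(150, 250, 4.0)]
print(rescore_exons(exons, hits))  # [(100, 200, 4.0), (300, 360, 2.5)]
```

The first exon is half covered by the hit and gains 2.0; the second has no support and keeps its ab initio score.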
SGP-2
[Figure: SGP-2 pipeline. Labels: tblastx; HSPs; HSP projections; query sequence; geneid exons; SGP exons.]
Gene prediction programs predict
a large number of genes.
Almost every mouse gene has
a human orthologue counterpart.
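[Figure: filtering of gene predictions, TwinScan vs. SGP. Recoverable values: total 48462 (TwinScan), 47055 (SGP); novel 17562 (TwinScan), 21942 (SGP). Further filter labels: multiexonic; long, no low complexity; human ts / human sgp; intron aligned. Remaining values as extracted, pairing not recoverable: 10987, 3171, 12158, 4543, 954, 2983, 317, 637, 1931, 1052.]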
Orthologous human and mouse genes
have conserved exonic structure.
  - 85% of the orthologous pairs have an identical number of exons
  - 91% of the orthologous exons have identical length
  - 99.5% of the orthologous exons have identical phase
  - There are a few cases of intron insertion/deletion (22)
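The three conservation measures (exon number, exon length, exon phase) can be computed for one orthologous pair as sketched below. The sketch assumes exons are given as coding lengths in transcript order and are paired by position. That pairing is an assumption made here; the slides do not describe how exons were matched.

```python
def exon_phases(exon_lengths):
    """Phase at the start of each exon: coding bases before it, modulo 3."""
    phases, total = [], 0
    for length in exon_lengths:
        phases.append(total % 3)
        total += length
    return phases


def compare_structure(human_exons, mouse_exons):
    """Compare exon number, lengths and phases of two orthologous genes."""
    same_number = len(human_exons) == len(mouse_exons)
    if not same_number:
        return {"same_exon_number": False, "identical_length": 0, "identical_phase": 0}
    identical_length = sum(h == m for h, m in zip(human_exons, mouse_exons))
    identical_phase = sum(h == m for h, m in zip(exon_phases(human_exons),
                                                 exon_phases(mouse_exons)))
    return {"same_exon_number": True,
            "identical_length": identical_length,
            "identical_phase": identical_phase}


# Hypothetical pair: the second exon differs by one codon (3 bases).
print(compare_structure([120, 87, 204], [120, 90, 204]))
```

In this example two of three exons have identical length but all three have identical phase, since a length change of a whole codon leaves the reading frame untouched. That is consistent with phase being the most conserved of the three measures.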

Summary


Genes are complex structures which are difficult
to predict with the required level of accuracy/
confidence
Different approaches to gene finding improve
accuracy/confidence of the predictions:





Ab initio: GenScan
Ab initio modified by BLAST homologies:
GenomeScan
Homology guided: GeneWise
Gene prediction and sequence alignment separately:
Rosetta
Ab initio with similarity to known proteins: SGP-2
Thank you for your
attention!