PPT - Bioinformatics.ca

Lecture 5.1:
Genome Annotation
Francis Ouellette
Associate Professor, UBC Biotechnology Laboratory
Director, UBC Bioinformatics Centre (UBiC)
[email protected]
Outline
• What we have
• What we need
• How we do it
• Where we go next
Genomes
Year   Sequence                                   Number of base pairs
___________________________________________________________
1971   First published DNA sequence               12
1977   PhiX174                                    5,375
1982   Lambda                                     48,502
1992   Saccharomyces cerevisiae Chromosome III    316,613
1995   Haemophilus influenzae                     1,830,138
1996   Saccharomyces cerevisiae                   12,068,000
1998   C. elegans                                 97,000,000
2000   D. melanogaster                            120,000,000
2001   H. sapiens (draft)                         2,600,000,000
2003   H. sapiens                                 2,850,000,000
What else?
• Genome sequences by themselves are pretty
useless.
• The functions of many processes reside in 3D
proteins
• Need to know where the protein coding
sequences are, and what they do
• Proteins are not everything
• All of this becomes “the parts list”, from which
all biology will be understood.
Challenges in building the “Parts List”
• Finding genes involves computational
methods as well as some experimental
validation
• Computational methods are often inadequate,
and often generate erroneous ‘gene’
sequences which:
– Are missing exons
– Have incorrect exons
– Over predict genes
– Are missing the 5’ and 3’ UTRs
Assumptions we make:
• Reductionist approach still works, albeit we
are now becoming more and more
“systems biologists”
• Evolution drives everything, and will be the
way we figure things out. Or said in another
way:
– Evolutionary relationships and comparisons will be
essential in our efforts to solve and understand
things
How we got started:
• GenBank database was populated by
common genes:
– rRNA, tRNA
– Globin
– Histone
– ATPases
– Actin
– Etc …
Things we are looking to annotate?
• CDS
• mRNA
• Alternative RNA
• Promoter and Poly-A Signal
• Pseudogenes
• ncRNA
[Figure: eukaryotic gene structure and expression. DNA: Promoter, “Exon 1” | “Intron 1” | “Exon 2” | “Intron 2” | “Exon 3” | “Intron 3” | “Exon 4”, Poly-A Signal. Transcription (nucleus) gives the primary transcript, with each intron bounded by GU ... AG. Splicing gives the mature mRNA with cap and polyA, which moves to the cytoplasm for translation between Start and Stop.]
Pseudogenes
• As many as 20-30% of all gene predictions from
genomic sequence could be pseudogenes
• Non-functional copy of a gene
– Processed pseudogene
  • Retro-transposon derived
  • No 5’ promoters
  • No introns
  • Often includes polyA tail
– Non-processed pseudogene
  • Gene duplication derived
– Both include events that make the gene non-functional
  • Frameshift
  • Stop codons
• We assume pseudogenes have no function, but we
really don’t know!
Noncoding RNA (ncRNA)
• ncRNA represent 98% of all transcripts in a
mammalian cell
• ncRNA have not been taken into account in gene
counts, which are based on:
– cDNA
– ORF computational prediction
– Comparative genomics looking at ORFs
• ncRNA can be:
– Structural
– Catalytic
– Regulatory
Noncoding RNA (ncRNA)
• tRNA – transfer RNA: involved in translation
• rRNA – ribosomal RNA: structural component
of ribosome, where translation takes place
• snoRNA – small nucleolar RNA:
functional/catalytic in RNA maturation
• Antisense RNA: gene silencing
Rfam
• Covariance model searches are extremely compute intensive. A small model
(like tRNA) can search a sequence database at a rate of around 300 bases/sec.
The compute time scales roughly to the 4th power of the length of the RNA, so
larger models quickly become infeasible without significant compute resources.
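To make the scaling above concrete, here is a rough sketch that extrapolates search time from the quoted tRNA rate. It assumes the 4th-power relationship holds exactly. The reference model length of 72 nt for tRNA and the function names are illustrative assumptions, not values from the slide.

```python
def estimated_rate(model_len, ref_len=72, ref_rate=300.0):
    """Estimated search rate in bases/sec for a model of length model_len.

    Assumes compute time grows with the 4th power of model length, so the
    rate drops by (model_len / ref_len) ** 4 relative to the reference.
    ref_len=72 (a typical tRNA length) is an assumption for illustration;
    ref_rate=300 is the figure quoted for a small model.
    """
    return ref_rate / (model_len / ref_len) ** 4


def estimated_hours(db_bases, model_len):
    """Estimated wall-clock hours to scan db_bases bases with one model."""
    return db_bases / estimated_rate(model_len) / 3600.0


if __name__ == "__main__":
    for length in (72, 144, 288):
        print(length, estimated_rate(length), estimated_hours(1_000_000, length))
```

Under these assumptions, doubling the model length cuts the rate sixteen-fold, which is why long RNA families are expensive to search.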
Protein coding genes in prokaryotes,
and simple eukaryotes
• Use ORF finder
http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
• Simple ATG/Stop
• Simple link to FASTA formatted files and
BLAST.
• Problems:
– In frame Methionine
– Small protein
• Solution: comparative genomics
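The "simple ATG/Stop" idea can be sketched as a six-frame scan. This is a minimal illustration, not the NCBI ORF finder's implementation: the function names and the `min_codons` parameter are hypothetical, and the scan always takes the first ATG in a frame, which is exactly the in-frame methionine problem listed above.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, start, end, orf) for each ATG...stop ORF in six frames.

    start and end are 0-based, end-exclusive offsets on the scanned strand.
    min_codons counts the start and stop codons. The first ATG in a frame
    is always taken as the start, which may not be the true start.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGCC"))
```

Raising `min_codons` suppresses spurious short ORFs, but it also discards genuine small proteins, the second problem on the slide.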
BLAST
• Seeks high-scoring segment pairs (HSP)
– pair of sequences that can be aligned without gaps
– when aligned, have maximal aggregate score
(score cannot be improved by extension or trimming)
– score must be above score threshold S
• Public Search engines
– WWW search form
http://www.ncbi.nlm.nih.gov/BLAST
– Unix command line
blastall -p progname -d db -i query > outfile
• Making your own search space
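The HSP definition above (ungapped, maximal score, above a threshold S) can be illustrated on a single fixed ungapped alignment. This sketch is not how BLAST finds HSPs, since BLAST seeds on short words and extends them. It only shows what "cannot be improved by extension or trimming" means. The +1/-3 scoring and the function names are assumptions.

```python
def best_ungapped_segment(a, b, match=1, mismatch=-3):
    """Highest-scoring ungapped segment of two equal-length sequences.

    The sequences are compared position by position (one diagonal, no gaps).
    Returns (score, start, end) with end exclusive. The returned segment is
    maximal: extending or trimming it cannot raise the score.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    best = (0, 0, 0)
    run_score, run_start = 0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if run_score <= 0:
            # A non-positive prefix can only hurt, so start a new segment here.
            run_score, run_start = 0, i
        run_score += match if x == y else mismatch
        if run_score > best[0]:
            best = (run_score, run_start, i + 1)
    return best


def is_hsp(a, b, threshold):
    """True if the best ungapped segment scores above the threshold S."""
    return best_ungapped_segment(a, b)[0] > threshold


if __name__ == "__main__":
    print(best_ungapped_segment("ACGTACGTTT", "TTGTACGAAA"))
```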
So many matrices...
• Triple-PAM strategy (Altschul, 1991)
– PAM 40: Short alignments, highly similar
  • tblastn against ESTs
– PAM 120
– PAM 250: Longer, weaker local alignments
  • Looking in the twilight zone
• BLOSUM (Henikoff, 1993)
– BLOSUM 90: Short alignments, highly similar
– BLOSUM 62: Most effective in detecting known
members of a protein family
  • Standard on NCBI server – works in most cases
– BLOSUM 30: Longer, weaker local alignments
Ab initio gene identification
• Goals
– Identify coding exons
– Seek gene structure information
– Get a protein sequence for further analysis
• Relevance
– Characterization of anonymous DNA genomic
sequences
– Works on all DNA sequences
Gene-Finding Strategies
Genomic Sequence
Content-Based
Bulk properties of sequence:
• Open reading frames
• Codon usage
• Repeat periodicity
• Compositional complexity
Site-Based
Absolute properties of sequence:
• Consensus sequences
• Donor and acceptor splice sites
• Transcription factor binding sites
• Polyadenylation signals
• “Right” ATG start
• Stop codons out-of-context
Comparative
Inferences based on sequence homology:
• Protein sequence with similarity to translated product of query
• Modular structure of proteins usually precludes finding complete gene
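As a small illustration of the content-based and site-based signals listed above, the sketch below computes codon usage and G+C content (bulk properties) and locates a consensus motif such as the donor dinucleotide (a site property). The helper names are hypothetical, and real gene finders combine many such signals statistically rather than using them in isolation.

```python
from collections import Counter


def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq, motif):
    """0-based positions at which a consensus motif occurs, e.g. 'GT' or 'AG'."""
    seq, motif = seq.upper(), motif.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


if __name__ == "__main__":
    print(codon_usage("ATGGCCGCCTAA"))
    print(gc_content("ATGC"))
    print(find_sites("AAGTAAGT", "GT"))
```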
Gene-Finding Methods
Genomic Sequence
Rule-Based
Cutoff method:
• Criteria applied sequentially to identify possible exons
• Rank or eliminate candidates from consideration based on
pre-determined cutoff at each step
Neural Network
Composite method:
• Criteria applied in parallel
• Training sets used to optimize performance
• Weight scores in order of importance
Evaluation Statistics
[Figure: actual vs. predicted coding regions along a sequence, with each region labelled TP, FP, TN, or FN]
Sensitivity: Fraction of actual coding regions that are correctly
predicted as coding
Specificity: Fraction of the prediction that is actually correct
Correlation Coefficient: Combined measure of sensitivity and specificity,
ranging from –1 (always wrong) to +1 (always right)
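The slide defines these measures in words only. The sketch below uses the formulas commonly applied in gene-prediction benchmarks, which match those descriptions: Sn = TP/(TP+FN), Sp = TP/(TP+FP), and a Matthews-style correlation coefficient. Treat the exact formulas and the function name as assumptions. Note that "specificity" here is the fraction of predictions that are correct, which other fields call precision.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    tp, fp, tn, fn are counts of true/false positives and negatives,
    for example per nucleotide. Specificity follows the gene-prediction
    convention TP / (TP + FP).
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # CC is undefined when any marginal total is zero.
    cc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return sn, sp, cc


if __name__ == "__main__":
    print(evaluate(tp=80, fp=20, tn=880, fn=20))
```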
Relative Performance
                      Claverie 1997               Rogic 2000
                      Sn (%)   Sp (%)   CC        CC
Individual Exons
  MZEF                78       86       0.79      -
  HEXON               71       65       0.64      -
  SorFind             42       47       0.62      -
  GRAIL II            51       57       0.47      -
Gene Structure
  GENSCAN             78       81       0.86      0.91
  FGENES              73       78       0.74      0.83
  GRAIL II/Gap        51       52       0.66      -
  GeneParser          35       40       0.54      -
  HMMgene             -        -        -         0.91
What works best when?
• Genome survey (draft) data:
expect only a single exon in any given stretch of contiguous sequence
– BLASTN vs. dbEST (3’ UTR)
– BLASTX vs. nr (protein CDS)
• Finished data:
large contigs are available, providing context
– GENSCAN
– HMMgene
What you need
• Compute the prediction
• Confirm with biological sequences (also
with computational tools)
• Integrate all of this
• Annotate (decorate) genome (often via a
GUI: Graphical User Interface)
• Validate
• Re-annotate/Update
• Check it twice
• Submit to NCBI RefSeq
What you need
• Compute the prediction: Pegasys
• Confirm with biological sequences (also
with computational tools): Pegasys
• Annotate (decorate) genome (often via a
GUI: Graphical User Interface): Apollo
Some of the things available:
• EnsEMBL (EBI)
• Sequin (NCBI)
• PseudoCAP (SFU)
• GMOD (CSHL)
• Pegasys (UBiC)
• Apollo (EBI/Berkeley)
ENSEMBL
Features of Pegasys
• Flexible architecture
– Can manage different types of analyses with the
same software system
• One gene or a whole genome
• Prokaryotic or eukaryotic
• Plant or mammal
– We can create new workflows without creating
new software
• Workflows encode protocols which can be distributed
and used for comparison of methodologies or for
reproducibility
Features of Pegasys
• Extensibility and modularity
– Easily add new or modified tools to the system
– Bioinformatics changes fast!
Features of Pegasys
• Provides a way to integrate results from
heterogeneous sources
– Want to look at results from a collection of
analyses simultaneously
– Gives the scientist a global picture of all the
evidence pointing to a biological feature
Components of Pegasys
The zoo project
• NIH Intramural Sequencing Centre
– Eric Green
• Orthologous regions of multiple
vertebrates for interesting human
targets
• Facilitates creation of
computational tools for comparative
genomics
• Target 6: 1.7Mb on human
21q22.11
– 11 species
– ~12Mb
– 22 refseq genes
– 13 with unknown function
– 7 predicted coding sequence
[Figure: Ensembl view of chr21; source: ensembl.org]
Zoo alignments to target6
[Figure: similarity (-log(e-value)) of each species' alignment to target 6, shown against the Refseq coding exons. Tracks: baboon, cow, pig, cat, dog, mouse, rat, chicken, Danio]
Target6 alignments compressed
[Figure: cumulative similarity track shown against the Refseq coding exons]
Cumulative similarity = \sum_{i} \left[ -\log(E) \cdot \log(d(i)) \right]
E = BLAST expect value
i = species
d = evolutionary distance (mya)
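The cumulative similarity score can be computed directly from per-species BLAST results. The sketch below is a minimal illustration: the slide does not state the base of the logarithms, so base 10 is assumed for both, and the function name and input layout are hypothetical.

```python
import math


def cumulative_similarity(hits):
    """Sum of -log10(E) * log10(d) over species.

    hits is an iterable of (e_value, distance_mya) pairs, one per species.
    Base-10 logarithms are an assumption. An E-value of exactly 0 cannot
    be scored and raises ValueError from math.log10.
    """
    return sum(-math.log10(e) * math.log10(d) for e, d in hits)


if __name__ == "__main__":
    # Hypothetical values: two species with E-values 1e-10 and 1e-5,
    # at 100 and 1000 million years of divergence.
    print(cumulative_similarity([(1e-10, 100.0), (1e-5, 1000.0)]))
```

Weighting by log distance means a hit in a distant species contributes more than an equally significant hit in a close relative, where conservation is expected by default.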
Example output – GAME XML
(Genome Annotation Markup Elements XML)
• Input to Apollo
– Genome editor created by Berkeley
Drosophila group and Ensembl
– Simultaneously view heterogeneous
computational evidence
– Manually create and/or edit annotations
Apollo
• Apollo is a collaborative project between the Berkeley
Drosophila Genome Project (www.bdgp.org) and
Ensembl (www.ensembl.org). The collaboration was
set up to create a tool to initially annotate fly but
which would also be able to annotate and browse any
large eukaryotic genome.
There is a sister developers' website at
www.fruitfly.org/annot/apollo to download the fly
specific Apollo annotation tool.
• All the code is open source and freely downloadable.
Features of Apollo include:
• Zoomable and scrollable feature display down to sequence level,
optimized for display of large regions of genome.
• User configurable feature types (colour, appearance, size, order,
score threshold)
• Can connect directly to the Ensembl web site for the latest
human genome annotation
• Reads/writes GFF format
• Searchable for feature names or sequence string
• Ability to select features and sort by different feature attributes
• All features are linked out to their source database web sites
(ensembl, swissprot, embl, unigene etc)
• Display of genomic sequence and any associated start and stop
codons
• Prints postscript output
• Display is reversible, allowing easy interpretation of reverse strand
features.
[Figure: diagram relating Genome Centers, DDBJ/EMBL/GenBank, RefSeq, M&P, and the Genome Browser Pipeline to users and bioinformaticians]
PeGASys
[Figure: diagram relating Genome Centers, DDBJ/EMBL/GenBank, M&P, RefSeq, TPA, and the Genome Browser Pipeline to users and bioinformaticians]
GenBank Features
-10_signal
-35_signal
3'clip
3'UTR
5'clip
5'UTR
attenuator
CAAT_signal
CDS
conflict
C_region
D-loop
D_segment
enhancer
exon
GC_signal
gene
iDNA
intron
J_segment
LTR
mat_peptide
misc_binding
misc_difference
misc_feature
misc_recomb
misc_RNA
misc_signal
misc_structure
modified_base
mRNA
N_region
old_sequence
polyA_signal
polyA_site
precursor_RNA
primer_bind
prim_transcript
promoter
protein_bind
RBS
repeat_region
repeat_unit
rep_origin
rRNA
satellite
scRNA
sig_peptide
snoRNA
snRNA
S_region
stem_loop
STS
TATA_signal
terminator
transit_peptide
tRNA
unsure
variation
V_region
V_segment
GenBank Features: the important ones
[The slide repeats the full feature list shown under “GenBank Features”, with the important features highlighted; the highlighting did not survive in this transcript.]
GenBank Features: the abundant one
[The slide repeats the full feature list shown under “GenBank Features”, with the most abundant feature highlighted; the highlighting did not survive in this transcript.]
Gene Prediction Caveats
• Predictions are of protein coding regions
– Do not detect non-coding areas (5’ and 3’ UTR)
– Non-coding RNA genes are missed
• Predictions are for “typical” genes
– Must predict a beginning and an end
– Partial or multiple genes are often missed
– Training sets may be biased
– Methods are sensitive to G+C content
– Weighting of factors may be inordinately biased
Moving along
• Sequencing technology led genomics, and to some
extent bioinformatics
• ESTs complicated things, and were the beginning of
specialized ‘methods’ or ‘functional’ divisions in
GenBank.
• Yeast chromosomes and bacterial chromosomes
rapidly exposed our obvious ineptitude at genome
annotation, and these genomes were simple!
• A controlled vocabulary was necessary, albeit slow to
be created: Gene Ontology
Genome annotation problems:
• Assembling the genome
• Analysis & interpretation
• Lack of consistency from gene to gene
• Lack of consistency from person to person
• Lack of controlled vocabulary
• Parts we don’t know
• Bacteria vs mammals
• Graphical user interface
• Dimensions
• Updates and maintenance
Some comments about the genome
• “Finished” February 15, 2001
• Finished April 25, 2003
• Still not fully understood and definitely not finished.
• We are still in the genomic era.
• To get a full “parts list”, we need, as a community, to
develop a system to rigorously find all of the parts of
the human genome:
– Genes
– Protein coding sequences
– Non coding RNAs (ncRNA)
– Identify and understand regulatory sequences
– Many other cool things we don’t know about!
The ideal annotation of “MyGene”
[Figure: “MyGene” linked to each of the following]
• All clones
• All SNPs
• Promoter(s)
• All mRNAs
• All proteins
• All structures
• All protein modifications
• Ontologies
• Interactions (complexes,
pathways, networks)
• Expression (where and
when, and how much)
• Evolutionary relationships
Things we will need to integrate in the
future:
• Better Gene predictions
• Haplotypes to map complex diseases
• Micro-array/gene expression data
• Protein-protein interaction data
• SAGE data
• GFP (Green Fluorescent Protein)
• Human-base (LocusLink)
• Integration!
Concluding remarks
• Trust but verify
• Sites differ in assembly of contigs, which
affects gene prediction tools
• Use a couple of tools, but also confirm with
biological sequences
• Lots of data to work with, look for your
regions, and see what the various tools
generate and what various viewers have
already computed
Some Web resources:
• UBiC: http://bioinformatics.ubc.ca
• CBW: http://bioinformatics.ca
• NCBI: http://www.ncbi.nlm.nih.gov
• Apollo: http://www.ensembl.org/apollo
• Open Bioinformatics Foundation:
http://open-bio.org
• The only web resource you really need:
http://www.google.ca
Open Source
• Essential for us to exist: it provides the code we use
and adapt, and lets us do the science we want to do. Millions
of lines of code exist; here are some examples:
– (BLAST)
– (NCBI toolkit)
– Apollo
– Perl and PHP
– Bio-*
– BIND software
– Pegasys
Committed to making the world’s
scientific and medical literature
a public resource
Acknowledgements
http://bioinformatics.ubc.ca
Ouellette Laboratory @ UBiC
– Stefanie Butland
– Graeme Campbell
– Jeff Druce
– Joanne Fox
– David He
– Yong Huang
– John Ling
– Scott McMillan
– Dianne Moore
– Gerald Quon
– Jessica Sawkins
– Sohrab Shah
– Julie Stitt
– Anna Wilkinson
– Tao Xu
– Mack Yuen
– Grace Zheng
------------------------- Collaborators -------------------------
BCCA - GSC: Ian Bosdet, Jackie Schein, Rob Holt
CGDN / U of Toronto: Christopher Pearson
CMMT: David Arenillas, Rebecca Devon, Michael Hayden, Danielle Kemmer, Blair Leavitt, Wyeth Wasserman
NIH - NISC: Gerry Bouffard, Eric Green, Pamela Thomas
UBC: Jenny Bryan, Juergen Kast, Holger Hoos, Wilf Jefferies, Jim Kronstad, Alan Mackworth, Sanja Rogic
CMMT Systems group: Jonathan Falkowski, Miroslav Hatas
May 2, 2003
Maya and Pascale,
January 2004