Genome Annotation: From Sequence to Biology


Ashley Bateman & Andrew Tritt
Genetics 677
Prof. Ahna Skop
Spring 2009
Introduction
-over 450 organisms have been completely sequenced since 1995, and many more have working drafts
-361 prokaryotes, 28 archaea, 20 protists, 8 plants, 15 fungi, 26 mammals, and 21 “other” (Wikipedia)
List of Sequenced Organisms
Genome Sequencing
454
Sanger
Solexa
Sanger Sequencing
454 Sequencing: Sequencing by synthesis
Reads ~200 bp
1-fix DNA strands to beads in water-in-oil emulsion
2-DNA amplified by PCR
3-use the pyrophosphate (PPi) released when a nucleotide is incorporated to determine identity of added base
http://www.nature.com/nrmicro/journal/vaop/ncurrent/images/nrmicro1901-f3.jpg
High Throughput Sanger Sequencing
~900 bp read
-DNA of interest is inserted into a plasmid and sequenced using primers for the plasmid
Solexa Sequencing
~26-50 bp reads
-newest sequencing technology --> cheaper and faster
-small reads present problems if dealing with repetitive sequence
http://seqanswers.com/forums/showthread.php?t=21
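Why short reads struggle with repeats can be shown with a small sketch. The toy genome, the repeat, and the function name `unique_fraction` below are all made up for illustration; the idea is only that a read shorter than a repeat cannot be placed at a single location.

```python
from collections import Counter

def unique_fraction(genome, read_len):
    """Fraction of read start positions whose read occurs exactly once in the genome."""
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)

# Hypothetical toy genome: a 10 bp repeat placed twice between made-up flanking sequence.
REPEAT = "GATTACAGGT"
GENOME = "CCGTAAGC" + REPEAT + "TTGACCGA" + REPEAT + "AGGCTTAC"

# Compare how many reads are uniquely placeable at different read lengths.
for read_len in (4, 8, 12):
    print(read_len, round(unique_fraction(GENOME, read_len), 2))
```

Reads that fall entirely inside the repeat match both copies, so they cannot be assigned to one position without extra information such as paired ends or longer reads.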
Genome Annotation
The process of taking the DNA sequence produced by genome-sequencing projects, and adding layers of analysis/interpretation to understand its biological significance in a larger context
Genome Annotation: A multistep process
3 general levels of annotation:
-1 Nucleotide-level (where)
-2 Protein-level (what)
-3 Process-level (how)
Stein, 2001.
Nucleotide-level Annotation: Mapping
-“…identify the punctuation marks…”
-Identification and placement of known landmarks into the genome (genes, genetic markers, etc.)
-Connects the pre-genomic literature with post-genomic research
Nucleotide-level Annotation: Finding Genomic Landmarks
-short sequences: PCR-based genetic markers (ID with e-PCR program)
-long sequences: RFLPs (ID with BLASTN, etc.)
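The idea behind locating a PCR-based marker in a genome can be sketched as an in-silico PCR search. This is a simplified illustration, not the e-PCR program itself: it assumes exact primer matches on the forward strand, and the function names, primers, and toy genome are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def in_silico_pcr(genome, fwd, rev, max_product=1000):
    """Return (start, end) spans where fwd and the reverse complement of rev
    flank a product no longer than max_product (end is exclusive)."""
    rev_site = revcomp(rev)  # how the reverse primer appears on the forward strand
    hits = []
    start = genome.find(fwd)
    while start != -1:
        end = genome.find(rev_site, start + len(fwd))
        if end != -1 and end + len(rev_site) - start <= max_product:
            hits.append((start, end + len(rev_site)))
        start = genome.find(fwd, start + 1)
    return hits

# Hypothetical marker: forward primer ACGTAC, reverse primer TCATGG.
toy_genome = "TTTT" + "ACGTAC" + "GGGGGGGG" + "CCATGA" + "AAAA"
print(in_silico_pcr(toy_genome, "ACGTAC", "TCATGG"))
```

A real tool would also tolerate mismatches and check both strands; those steps are left out here to keep the sketch short.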
Nucleotide-level Annotation: Gene Finding
Prokaryotes: ID ORFs
Eukaryotes: Sophisticated software needed (gene prediction)
-overlapping ORFs
-signal-to-noise ratio
-splicing
-unclear exon/intron delineations
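For the prokaryotic case, identifying ORFs can be sketched as a scan for a start codon followed in-frame by a stop codon. This is a minimal illustration under stated assumptions: forward strand only, ATG as the only start codon, and a hypothetical `min_codons` cutoff. It ignores the eukaryotic complications listed above.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Scan the three forward reading frames and return (start, end) spans,
    where start is the A of ATG and end is just past the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open an ORF at the first ATG in this frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)

toy = "CCATGAAACCCGGGTAACC"
print(find_orfs(toy))  # one ORF: ATG AAA CCC GGG TAA
```

A fuller version would also scan the reverse complement to cover all six reading frames.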
Gene Prediction Software
-use algorithms that contain sensors to identify specific sequence features
- neural networks
- rule-based system
- hidden Markov model
-sequence similarity to known CDS
-BLAST
-cDNA
-ESTs
Ab initio: gene prediction without use of prior knowledge about similarities to other genes
Hidden Markov Models
-a set of states with transition and emission probabilities
-genes in a sequence predicted by finding most probable path
[Figure: HMM with EXON and INTRON states]
EXON emission probabilities: A: 0.2, C: 0.3, G: 0.3, T: 0.2
INTRON emission probabilities: A: 0.25, C: 0.25, G: 0.25, T: 0.25
Transition probabilities shown in the figure: 0.85, 1.0, 0.05, 0.10, 0.05, 0.95
Example :
DNA Sequence : AGTTCGAATCGATGCTAAGACGA
Possible Path : EEEEIIIIIIIIIIIIIIEEEEE
Most probable path: EEEIIIIIIIIIIIIIIIIIEEE
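Finding the most probable path is usually done with the Viterbi algorithm. The sketch below uses the emission probabilities from the slide. How the six transition values attach to arrows is an assumption, since the figure is not available: start to exon 1.0, exon to exon 0.85, exon to intron 0.10, exon to end 0.05, intron to intron 0.95, intron to exon 0.05. Under that assumption the result need not match the path shown on the slide.

```python
import math

EMIT = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},      # exon emissions (from the slide)
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # intron emissions (from the slide)
}
# ASSUMED assignment of the slide's transition values to arrows.
START = {"E": 1.0, "I": 0.0}
TRANS = {"E": {"E": 0.85, "I": 0.10}, "I": {"E": 0.05, "I": 0.95}}
END = {"E": 0.05, "I": 0.0}

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def path_log_prob(seq, path):
    """Log probability of emitting seq along one given state path."""
    lp = log(START[path[0]]) + log(EMIT[path[0]][seq[0]])
    for i in range(1, len(seq)):
        lp += log(TRANS[path[i - 1]][path[i]]) + log(EMIT[path[i]][seq[i]])
    return lp + log(END[path[-1]])

def viterbi(seq):
    """Return (most probable state path, its log probability)."""
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in EMIT}
    back = []  # back[k][s] = best predecessor of state s at position k + 1
    for base in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in EMIT:
            best = max(EMIT, key=lambda r: prev[r] + log(TRANS[r][s]))
            score[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            ptr[s] = best
        back.append(ptr)
    last = max(EMIT, key=lambda s: score[s] + log(END[s]))
    best_lp = score[last] + log(END[last])
    path = [last]
    for ptr in reversed(back):  # trace back from the last position to the first
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), best_lp

SEQ = "AGTTCGAATCGATGCTAAGACGA"
print(viterbi(SEQ))
```

Working in log space avoids numerical underflow on long sequences; `path_log_prob` can score any candidate path, such as the "possible path" above, for comparison with the Viterbi result.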
Sequence Similarity
-currently the most powerful tool for detecting CDS
-Problems exist:
-Fragmentary ESTs
-Repetitive cDNA sequences
-Ortholog-paralog problem
-Incomplete data
ab initio predictions + similarity data = more powerful model
Nucleotide-level Annotation: non-coding RNAs and regulatory regions
-include tRNAs, rRNAs, snRNAs, nRNAs
-transcription factor binding sites
-largely unknown; active area of bioinformatics research
Nucleotide-level Annotation: non-coding RNAs and regulatory regions
-red and blue boxes represent unknown positions of motifs
-Gibbs Motif Sampler (1) and MEME infer models for motifs and identify motif locations within sequences
(1) Lawrence et al. 1993, Thompson et al. 2007
Nucleotide-level Annotation: Repetitive Elements & Segmental Duplications
Repetitive Elements:
-account for a large proportion of genome size variation
-important to (generally) exclude these from later assembly process
-problematic for next-gen sequencing technologies
Segmental Duplications:
-paralogs exist throughout many genomes
Nucleotide-level Annotation: Mapping Variation
-SNPs are important for population genetics and association mapping
AAGTCGATGCTAGCGCTACTAGCTAGGCTCGATGTT
AAGTCGATGCTAGCGCTACTAGCTAGGCTAGATGTT
AAGTCGATGCTAGCCCTACTAGCTAGGCTCGATGTT
AAGTCGATGCTAGCGCTACTAGCTAGGCTAGATGTT
AAGTCGATGCTAGCCCTACTAGCTAGGCTTGATGTT
AAGTCGATGCTAGCGCTACTAGCTAGGCTCGATGTT
SNPs
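Spotting the SNPs in the alignment above amounts to finding columns where the sequences disagree. A minimal sketch, assuming the six sequences are already aligned and of equal length (the function name `find_snps` is made up for illustration):

```python
from collections import Counter

ALIGNMENT = [
    "AAGTCGATGCTAGCGCTACTAGCTAGGCTCGATGTT",
    "AAGTCGATGCTAGCGCTACTAGCTAGGCTAGATGTT",
    "AAGTCGATGCTAGCCCTACTAGCTAGGCTCGATGTT",
    "AAGTCGATGCTAGCGCTACTAGCTAGGCTAGATGTT",
    "AAGTCGATGCTAGCCCTACTAGCTAGGCTTGATGTT",
    "AAGTCGATGCTAGCGCTACTAGCTAGGCTCGATGTT",
]

def find_snps(alignment):
    """Map 0-based column index to base counts, for columns with more than one base."""
    snps = {}
    for pos, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        if len(counts) > 1:
            snps[pos] = dict(counts)
    return snps

print(find_snps(ALIGNMENT))
```

For these six sequences the variable columns are at 0-based positions 14 (G or C) and 29 (C, A, or T).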
Protein-level Annotation
-Assign putative functions to proteins of an organism
-Classify proteins into families:
-using similarities to better-characterized proteins of other species (BLASTP)
-on the basis of functional domains, motifs, and folds
-Search against protein databases of functional domains (e.g. PFAM)
-InterPro: integration of several protein databases
-makes things much easier!
Process-level Annotation
-linking the genome to biological processes
-bench work required (e.g. microarrays, RNAi, etc.)
-classification scheme required: Gene Ontology (GO)
-standardized vocabulary for molecular function, biological process, and cellular component
-hierarchy of terms provides flexibility for new additions
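The hierarchy of terms can be pictured as a graph in which each term points to its more general parents, so annotating a gene with a specific term implies all of its ancestors. The sketch below is a toy graph loosely modeled on GO; the term names and edges are illustrative assumptions, not copied from the actual ontology.

```python
# Toy term graph (illustrative only): each term maps to its more general parents.
PARENTS = {
    "biological_process": [],
    "cellular process": ["biological_process"],
    "metabolic process": ["biological_process"],
    "cell cycle": ["cellular process"],
    "mitotic cell cycle": ["cell cycle"],
    "DNA metabolic process": ["metabolic process"],
    "DNA replication": ["DNA metabolic process", "cellular process"],
}

def ancestors(term):
    """Return every more general term reachable from term."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found

print(sorted(ancestors("mitotic cell cycle")))
```

Because a term may have more than one parent, the structure is a directed acyclic graph and not a strict tree; a new term is added by giving it a list of parents, without touching existing entries.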
Process-level Annotation
-hierarchical structure of GO terminology
Organizing Annotation Efforts
Several models:
- factory
- museum
- cottage industry
- party
Bioinformatics research in biomedical text mining to automate annotation process
Conclusion
A synthesis of biology and annotation must be developed…
…change is constant, databases are updated sometimes hourly…
…the experimental literature of the past must be tied with the genome annotations of the future!
Student Question
“The paper was mostly about predicting the number of genes and proteins in an organism. Why do we need to predict the number of genes and proteins in the cell? It appears that most studies identify genes based on phenotypes. For proteins, many methodologies exist for identifying protein function. I cannot see the purpose of this prediction--pardon my short sightedness.
Also, has a standardized format emerged in regard to the genome files?”
NCBI standardized format example