Identification of Coding Sequences

Bert Gold, Ph.D., F.A.C.M.G.
In Vitro Approaches
• Transcription
• Translation
• Linked Transcription-Translation
• Site-Specific Mutagenesis
• Promoter Fusions
Runoff Protocol and Controls
Ribosome Binding Sequences
In vitro translation
96-well (high-throughput) translation
Linked In Vitro Transcription-Translation
APC Protein Truncation Test
In Vivo Approaches
• Prokaryotic expression
– E. coli maxicells
– E. coli minicells
• Eukaryotic expression
– Yeast
• Overexpression
– Baculovirus
• Expression for Rapid Analysis
– X. laevis oocytes
• Expression in Mammalian Cells
– Transient Transfection
– Stable Transfection
– ES Cells
– Transgenic Mice
• Knock in
• Knock out
Background Definitions
Working Draft – A working draft sequence has
come to mean a genomic sequence before it is
finished. Working draft sequences contain multiple
gaps, underrepresented regions and misassemblies.
In addition, the error rate of working draft sequence is
higher than the 1 error in 10,000 bases that is standard
for finished sequence.
FASTA file – A common file format used for the
storage and transfer of sequence data. Each record is a
one-line header beginning with '>' followed by raw
DNA or protein sequence; it carries no feature
annotation.
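To make the FASTA definition concrete, here is a minimal parsing sketch. It assumes the usual layout (a header line starting with '>' followed by one or more wrapped sequence lines); the function name parse_fasta and the two example records are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"record{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped; concatenate them in order.
            records[current_id] += line.upper()
    return records


# Hypothetical two-record example (not real data).
example = """>seq1 hypothetical example record
ATGGCC
TAA
>seq2
atgaaatga
"""
parsed = parse_fasta(example.splitlines())
```

For real work an established parser (for example the one in Biopython) is a safer choice than a hand-rolled loop like this one.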
Sensor
An algorithm specialized to identify a
feature of a sequence, such as a
possible splice site.
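A very simple sensor can be sketched as a scan for the dinucleotides that flank most introns: GT at the donor (5') end and AG at the acceptor (3') end. This is only an illustration of the single-feature idea; the function names are hypothetical, and real sensors score the surrounding sequence context rather than reporting every match.

```python
from typing import Dict, List


def find_motif(seq: str, motif: str) -> List[int]:
    """Return the 0-based start position of every occurrence of motif."""
    seq = seq.upper()
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


def candidate_splice_sites(seq: str) -> Dict[str, List[int]]:
    """Report candidate donor (GT) and acceptor (AG) positions on one strand."""
    return {
        "donor": find_motif(seq, "GT"),
        "acceptor": find_motif(seq, "AG"),
    }


# Hypothetical test sequence (not real data).
sites = candidate_splice_sites("ATGGTAAGTCCAGGA")
```

Because GT and AG occur by chance very often, a sensor like this produces many false positives, which is why sensors are usually combined with other evidence.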
Neural Network
Neural networks are analytical
techniques modeled after the
(proposed) processes of learning in
cognitive systems and the neurological
functions of the brain. Neural networks
use a data ‘training set’ to build rules
that can make predictions or
classifications on data sets.
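The 'training set' idea can be sketched with the smallest possible network, a single artificial neuron (a perceptron). The example below is a toy under loud assumptions: the only input feature is the G+C fraction of a sequence, and the labels (1 for GC-rich, 0 for AT-rich) are invented for illustration. It is not a description of how GRAIL or any other published predictor works.

```python
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def train_perceptron(examples: List[Tuple[float, int]],
                     max_epochs: int = 1000) -> Tuple[float, float]:
    """Learn weight w and bias b so that (w * x + b > 0) predicts label 1."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in examples:
            predicted = 1 if w * x + b > 0 else 0
            error = label - predicted  # -1, 0 or +1
            if error:
                # Perceptron rule: nudge the weights toward the correct answer.
                w += error * x
                b += error
                mistakes += 1
        if mistakes == 0:
            break  # every training example is now classified correctly
    return w, b


def predict(w: float, b: float, seq: str) -> int:
    return 1 if w * gc_fraction(seq) + b > 0 else 0


# Hypothetical training set: (sequence, invented label).
training_seqs = [("ATATATATGC", 0), ("ATGCATGATA", 0),
                 ("GCGCGCGCAT", 1), ("GGCCGGCATA", 1)]
training_set = [(gc_fraction(s), label) for s, label in training_seqs]
w, b = train_perceptron(training_set)
```

After training, predict(w, b, seq) can be applied to sequences that were not in the training set; the quality of such predictions depends entirely on how representative the training set is.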
Rule-Based System
A type of computer algorithm that uses
an explicit set of rules to make
decisions.
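As a sketch of what 'an explicit set of rules' can look like, the open reading frame (ORF) finder below applies three hand-written rules: start at ATG, stop at the first in-frame stop codon, and discard anything shorter than a minimum length. The function name and the default threshold are hypothetical, and only the forward strand is scanned.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for forward-strand ORFs.

    start is the 0-based position of the ATG; end is exclusive and
    includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # Rule 1: an ORF opens at the first ATG.
            elif start is not None and codon in STOP_CODONS:
                # Rule 2: it closes at the first in-frame stop codon.
                # Rule 3: keep it only if it is long enough.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


# Hypothetical test sequence (not real data).
orfs = find_orfs("CCATGAAATTTTAGGG")
```

Rules like these work reasonably for intron-free genes; genes interrupted by introns need additional rules or a different kind of model.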
Hidden Markov Model
A type of computer algorithm that
represents a system as a set of discrete
states and transitions between those
states. Each transition has an
associated probability. Markov models
are ‘hidden’ when one or more of the
states cannot be directly observed.
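A toy example may help here. The sketch below defines a two-state hidden Markov model ('coding' and 'noncoding') and uses the Viterbi algorithm to recover the most probable hidden state for each base. All start, transition and emission probabilities are invented for illustration and are not taken from any real gene finder.

```python
import math
from typing import List

STATES = ("coding", "noncoding")
# Hypothetical parameters, chosen only to make the example readable.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}


def viterbi(seq: str) -> List[str]:
    """Return the most probable hidden-state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # score[s]: best log-probability of any path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("AATTGCGCGCTTAA")
```

Only the bases are observed; the coding/noncoding labels are the 'hidden' part that the algorithm infers.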
Ab Initio Gene Prediction
A class of software that attempts to
predict genes from sequence data
without the use of prior knowledge
about similarities to other genes.
In Silico Approaches
• Sensors
– Single Feature Predictors
• HEXON http://searchlauncher.bcm.tmc.edu:9331/gene-finder/Help/hexon.html
• MZEF http://sciclio.cshl.org/genefinder/
• Neural Networks
– GRAIL http://compbio.ornl.gov/Grail-1.3/
• Rule Based Systems
– GeneFinder under construction by [email protected]
• Hidden Markov Models
– GENSCAN http://genes.mit.edu/GENSCAN.html
– Genie http://www.fruitfly.org/seq_tools/genie.html
– Fgenes http://searchlauncher.bcm.tmc.edu:9331/gene-finder/Help/fgenes.html
– GeneMark.hmm http://genemark.biology.gatech.edu/GeneMark/
– HMMGene http://www.cbs.dtu.dk/services/HMMgene/
Ab Initio Methods
Comparative Genomics
dbEST
BLASTX
TAP and PASS
Evaluation of In Silico Approaches
Scheme for an Ab Initio Approach
Diagrammatic Evaluation of an In Silico Approach
Hidden Markov Model
A hidden Markov model explicitly models the
probabilities for the transition from one part of a
gene to another. In this model, used by the
GENSCAN algorithm, each circle or diamond
represents a functional unit in the gene. For example,
Einit is the initial exon and Eterm is the last. The
arrows represent the probability of a transition from
one part of a gene to another. The algorithm is
‘trained’ by running a set of known genes through the
model and adjusting the weights of each transition
to reflect realistic transition probabilities. Thereafter,
test sequence data can be run through the model one
base position at a time, and the model will read out the
probability of a gene being present at that position.
The states that occur below the dashed line
correspond to a gene on the reverse strand, and thus
are symmetric with those above the line. E, exon; I,
intron; UTR, untranslated region; pro, promoter.
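The 'training' step can be sketched in its simplest form: when the state path of each known gene is already annotated, transition probabilities can be estimated by counting how often one state follows another and normalizing. The state labels and the two training paths below are hypothetical, and this is a minimal sketch of the counting idea, not the actual GENSCAN training procedure.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def estimate_transitions(paths: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next state | current state) from annotated state paths."""
    counts = defaultdict(Counter)
    for path in paths:
        # Count each observed transition between consecutive states.
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


# Hypothetical annotated genes, one label per gene feature (not real data).
training_paths = [
    ["pro", "5UTR", "Einit", "I", "E", "I", "Eterm", "3UTR"],
    ["pro", "5UTR", "Einit", "I", "Eterm", "3UTR"],
]
probs = estimate_transitions(training_paths)
```

With these two toy genes, an intron is followed by a terminal exon in two of three cases and by an internal exon in the remaining one, so probs["I"] assigns 2/3 and 1/3 to those transitions. A larger and more representative training set gives more realistic weights.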
Evaluating Ab Initio Gene Predictions