Transcript Lecture
What do you with a whole genome sequence?
Translate it into all 6 reading frames……
Identify all of the stop codons..…
And the start codons……
Can then identify all Open Reading Frames (ORFs)
But are all real genes?
Three major prokayotic gene modelers:
Generation uses predominantly 6-mer statistics to recognize
coding regions; it uses a proximity rule-based start call with
ATG and GTG as potential starts.
Glimmer uses interpolated Markov models (IMMs) to identify
the coding regions; it uses ATG, GTG, and TTG as potential
starts.
Critica uses blastn to produce alignments from the entire
dataset and derives dicodon statistics to recognize coding
sequences. It uses an SD sensor with ATG, GTG, and TTG as
potential starts.
Now what?
BLAST genes:
To assign functions based on similarity with known genes
BLAST
Basic Local Alignment Search Tool
finds regions of local similarity between sequences
>my favorite gene
Atgtcgctagctagctsctagctag
Database of many
gene sequences
GenBank is one
example
Answers the questions—
Is there a match?
And how good is it?
What are the genes doing?
Function is assigned based on degree of similarity of an
already characterized gene in the database
2 potential problems with this approach
Transitive catastrophe
Gene A
Assigned function
based on mutant
phenotype or
biochemical
characterization of
protein product
Gene B
From genome
sequence:
70% identity to
gene A
Gene C
From genome
sequence:
60% identity to
gene B
Gene D
From genome
sequence:
70% identity to
gene C
But--Gene D has only 20% identity to gene A!
Would like to propagate function only to orthologous
genes
Homolog– genes sharing a common origin
note: two genes are homologs or they or not
no such thing as %homology or “more homologous”
Two main kinds of homologs
Orthologs-genes orginating from a single ancestral gene
in the last common ancestor of the compared genomes
Paralogs-genes related via duplication
X,Y,Z are
genes in the
same family
A, B, C are
three species
Two more complicated cases:
Xenologs-genes orginating from a HGT of an ortholog in
a distant lineage
Pseudoparalogs- homologous genes that appear to
paralogs in a single genome analysis but have arisen due
to a combination of vertical and lateral descent
How to identify orthologs:
One way: Reciprocal BLAST analysis
>Genome A gene1
AGTGCATGTCCC
Database:
>Genome A gene 2
Genome B
TGTGCGTAGTCCAAA
AND
>Genome B gene1
GGTTTTTACA
Database:
>Genome B gene 2
Genome A
AAACCTCTCTGA
ASK: are two genes each other’s Best BLAST hit?
Can be confounded by lineage specific gene loss
What if there is nothing at all similar in the database?
20%
Call it a “hypothetical”
gene
If it has a match but
that is to another
hypothetical gene?
4% 4%
2%
1%
4%
1%
2%
1%
32%
Conserved
Hypothetical
“conserved hypothetical”
25%
Hypothetical
4%
DNA Replication & Repair
Energy Metabolism
Lipid Metabolism
Amino Acid Metabolism
Carbohydrate Metabolism
Cofactor Metabolism
1%
Nucleotide Metabolism
Transcription
Translation
Transport
Unassigned