Finding genes

Genomics and bioinformatics summary
1. Gene finding: computer searches, cDNAs, ESTs
2. Microarrays
3. Use BLAST to find homologous sequences
4. Multiple sequence alignments (MSAs)
5. Trees quantify sequence and evolutionary relationships
6. Protein sequences are evolutionary clocks
7. Some public databases and protein sequence analysis tools
Finding genes -- computer searches
Computer searches locate most genes in prokaryotes, Archaea,
and yeast, but only ~1/3 of human genes are identified correctly.
Criteria
Protein start, stop signals, splicing signals . . .
Codon bias
Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast . . .)
Some hard problems: small genes, post-translational modifications,
unique genes, spliced genes, alternative splicing, gene
rearrangements (e.g. IgGs) . . .
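The start/stop criterion above can be illustrated with a minimal open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: it looks at the three forward reading frames, ignores splicing and codon bias, and the function name `find_orfs`, the `min_codons` cutoff, and the example sequence are hypothetical.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons counts codons from ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

example = "CCATGGCTAAAGGTTAACC"  # made-up sequence
print(find_orfs(example))  # [(2, 17, 2)]
```

A scan like this already shows why small genes and spliced genes are hard: short ORFs are indistinguishable from chance, and an intron can interrupt the frame.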
Finding genes -- cDNA synthesis
Synthesizing “cDNA” (complementary DNA)
1. Extract RNA.
2. Hybridize poly(T) primer.
3. Synthesize DNA strand 1 using reverse transcriptase.
4. Fragment RNA strand using RNase H.
5. Synthesize DNA strand 2 using DNA polymerase.
Sequences of random cDNAs provide ESTs (Expressed Sequence Tags).
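At the sequence level, the two synthesis steps above amount to taking reverse complements. The sketch below assumes a made-up mRNA with a short poly(A) tail; the function names are hypothetical, and the enzymology (priming, RNase H nicking) is not modeled.

```python
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def first_strand(mrna):
    """DNA strand 1, written 5'->3': the reverse complement of the mRNA."""
    return "".join(RNA_TO_DNA_COMPLEMENT[b] for b in reversed(mrna.upper()))

def second_strand(strand1):
    """DNA strand 2, written 5'->3': the reverse complement of strand 1."""
    return "".join(DNA_COMPLEMENT[b] for b in reversed(strand1.upper()))

mrna = "AUGGCCUAAAAAA"        # made-up mRNA ending in a poly(A) tail
s1 = first_strand(mrna)       # begins with the poly(T) primer region
s2 = second_strand(s1)        # same sequence as the mRNA, with T for U
print(s1)  # TTTTTTAGGCCAT
print(s2)  # ATGGCCTAAAAAA
```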
Microarrays quantify expressed genes by hybridization
1. Label cDNAs with red fluorophore in one condition and green fluorophore in another, reference condition.
2. Mix red and green DNA and hybridize to a “microarray”.
Red genes = enriched in reference
Yellow genes (green + red) = expressed at similar levels in both conditions
Green genes = enriched in experiment
Each spot is a different synthetic oligonucleotide complementary to a specific gene.
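Two-color spot intensities are commonly summarized as a log2 ratio of the two channels. The sketch below is illustrative only: the intensities, the pseudocount, and the two-fold threshold are assumptions, and which dye labels which sample depends on the experiment design.

```python
import math

def log2_ratio(experiment, reference, pseudocount=1.0):
    """log2(experiment / reference); the pseudocount avoids division by zero."""
    return math.log2((experiment + pseudocount) / (reference + pseudocount))

def classify(ratio, threshold=1.0):
    """Call a spot enriched if it differs by at least 2**threshold fold."""
    if ratio >= threshold:
        return "enriched in experiment"
    if ratio <= -threshold:
        return "enriched in reference"
    return "similar in both"

# Hypothetical (experiment, reference) intensities for three spots.
spots = {"gene1": (7.0, 1.0), "gene2": (3.0, 3.0), "gene3": (1.0, 15.0)}
for name, (exp, ref) in spots.items():
    r = log2_ratio(exp, ref)
    print(name, round(r, 2), classify(r))
```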
“Cluster analysis” identifies patterns of gene expression
[Figure: expression heat map; axes: Genes, Conditions]
1. Similar patterns of expression are placed next to each other. Groups of genes with similar patterns form a hierarchical “tree”. For example, the two major branches of the tree comprise activated (left, green) or repressed (right, red) genes.
2. Genes with similar expression patterns (e.g. A-E) often function together.
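One simple way to build such a tree is agglomerative clustering on a correlation-based distance. The sketch below uses average linkage with distance 1 - r (Pearson correlation), which is one common choice and not necessarily the method behind the slide. The gene names and expression values are made up, and profiles with no variation across conditions would need special handling (division by zero).

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster(profiles):
    """Average-linkage clustering; returns the tree as nested tuples."""
    # Each cluster is (member names, subtree).
    clusters = [((name,), name) for name in profiles]

    def dist(a, b):
        pairs = [(m, n) for m in a[0] for n in b[0]]
        total = sum(1 - pearson(profiles[m], profiles[n]) for m, n in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        merged = (a[0] + b[0], (a[1], b[1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][1]

profiles = {  # hypothetical expression across four conditions
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [1.1, 2.1, 2.9, 4.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [3.9, 3.2, 1.8, 1.1],
}
tree = cluster(profiles)
print(tree)  # (('geneA', 'geneB'), ('geneC', 'geneD'))
```

The two top-level branches separate the rising profiles from the falling ones, mirroring the activated/repressed split described above.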
“Tiling” microarrays can find transcribed sequences
Microarray coding capacity: ~16 M bases
Each spot has a different synthetic oligonucleotide complementary to a different segment of the genome (e.g. every 100 bp). Spots that hybridize reveal transcribed regions.
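Turning hybridizing spots into transcribed regions can be sketched as merging runs of adjacent positive tiles. The signal values, the threshold, and the assumption that tile i starts at genome position i × 100 are all hypothetical.

```python
def transcribed_regions(signals, tile_step=100, threshold=1.0):
    """Merge consecutive tiles with signal >= threshold into (start, end) regions.

    signals[i] is the intensity of the probe for the tile starting at
    i * tile_step; end coordinates are exclusive.
    """
    regions = []
    start = None
    for i, s in enumerate(signals):
        if s >= threshold and start is None:
            start = i * tile_step              # a transcribed run begins
        elif s < threshold and start is not None:
            regions.append((start, i * tile_step))  # the run ends here
            start = None
    if start is not None:                      # run extends to the last tile
        regions.append((start, len(signals) * tile_step))
    return regions

signals = [0.1, 2.3, 1.8, 0.2, 0.0, 3.1]  # made-up intensities
print(transcribed_regions(signals))  # [(100, 300), (500, 600)]
```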
Find similar sequences (homologs) with BLAST
The most related human protein identified by a BLAST search of the human genome using the sequence of M.
tuberculosis PknB Ser/Thr protein kinase is . . . ELKL motif kinase 1. Query = the part of the PknB sequence that
matches ELKL-1. Subject = ELKL-1. Expect = expectation value = the number of hits of this quality expected by
chance in a database of this size (5e-24 = 5 × 10^-24; is this a big number or small?). Identities = # of exact amino
acid matches in the alignment. Positives = # of identities plus conservative changes, as defined by the residues that
tend to replace each other in homologous proteins. NP_004945.2 = sequence ID for ELKL-1.
>ref|NP_004945.2| ELKL motif kinase 1 [Homo sapiens]
Length = 691
Score = 108 bits (270), Expect = 5e-24
Identities = 87/296 (29%), Positives = 135/296 (45%), Gaps = 21/296 (7%)
Query: 11  YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA 70
           Y L + +G G  ++V LAR +   ++VAVK++        S    FR E +    LNHP
Sbjct: 20  YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN 78
Query: 71  IVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSH 130
           IV +++  E E       Y+VMEY  G  + D +   G M  K A         A+ + H
Sbjct: 79  IVKLFEVIETEKTL----YLVMEYASGGEVFDYLVAHGRMKEKEARAKFRQIVSAVQYCH 134
Query: 131 QNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARG 190
           Q  I+HRD+K  N+++ A   +K+ DFG +      GN +       G+  Y +PE  +G
Sbjct: 135 QKFIVHRDLKAENLLLDADMNIKIADFGFSNEFT-FGNKLD---TFCGSPPYAAPELFQG 190
Query: 191 DSVDA-RSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHE-GLSADLD 248
              D    DV+SLG +LY +++G  PF G +      + +RE  +    R    +S D +
Sbjct: 191 KKYDGPEVDVWSLGVILYTLVSGSLPFDGQN-----LKELRERVLRGKYRIPFYMSTDCE 245
Query: 249 AVVLKALAKNPENRYQTAAEMRADLVRVHNGEPPEAPKV-----LTDAERTSLLSS 299
            ++ K L  NP  R      M+   + V + +    P V       D  RT L+ S
Sbjct: 246 NLLKKFLILNPSKRGTLEQIMKDRWMNVGHEDDELKPYVEPLPDYKDPRRTELMVS 301
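The Identities and Gaps figures in the report can be recomputed from the aligned strings. The sketch below applies a hypothetical helper, `alignment_stats`, to the first 60-column block of the alignment above (Query 11-70, Sbjct 20-78). Counting Positives would additionally need a substitution matrix and is not attempted here.

```python
def alignment_stats(query, subject):
    """Return (identities, gap columns, alignment length) for two aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(
        1 for q, s in zip(query, subject) if q == s and q != "-"
    )
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)

# First block of the PknB / ELKL motif kinase 1 alignment.
query = "YELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPA"
subject = "YRLLKTIGKGNFAKVKLARHILTGKEVAVKIIDKTQLNSSSLQKLFR-EVRIMKVLNHPN"

identities, gaps, length = alignment_stats(query, subject)
print(identities, gaps, length)  # 20 1 60
print(f"{100 * identities / length:.0f}% identity in this block")
```

Summing the same counts over all five blocks gives the totals on the Identities and Gaps lines of the report.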
Ser/Thr Protein kinases diverge rapidly
Multiple Sequence Alignment (MSA) of the N-terminal ~90
residues of M. tuberculosis PknB (bottom) and Ser/Thr protein
kinases of known structure. The histogram at the bottom
shows % identity at each position. Only a few residues are
absolutely conserved (functional sites!). The MSA defines the
beginning of the kinase domain. Insertions often occur in
loops.
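A per-position identity histogram like the one described can be sketched by counting the most common residue in each alignment column. The four short aligned sequences below are made up for illustration and are not the kinase sequences from the figure; treating a gap as a non-match is an assumption.

```python
from collections import Counter

def column_identity(alignment):
    """Percent of sequences sharing the most common residue in each column.

    Gaps ('-') never count as the shared residue, but they do count in
    the denominator, so gapped columns score lower.
    """
    result = []
    for col in zip(*alignment):
        residues = [c for c in col if c != "-"]
        if not residues:
            result.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        result.append(100.0 * top_count / len(col))
    return result

toy_msa = ["GKGNFA", "GFGGMS", "GSG-FG", "GEGSYG"]  # hypothetical alignment
scores = column_identity(toy_msa)
print(scores)  # [100.0, 25.0, 100.0, 25.0, 50.0, 50.0]
print([i for i, s in enumerate(scores) if s == 100.0])  # [0, 2]
```

Columns scoring 100% are the candidates for absolutely conserved functional sites.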
Histones evolve slowly
Tree
MSA = Multiple Sequence Alignment
Core H3 proteins (that have the same function) are nearly
identical in eukaryotes (left). Archaeal H3s and specialized
H3 proteins that bind at centromeres show much more
divergence (bottom sequences and tree branches, right).
Protein sequences are evolutionary clocks
Assuming that organisms diverged from a common ancestor and sequence changes accumulate at constant rates, the number of changes in homologous proteins gives information about the time that each sequence has been evolving independently.
[Figure: average rate of change of proteins of different function, ranging from slow to fast.]
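Under the constant-rate assumption, a rough divergence time follows from the fraction of differing sites. The sketch below uses the standard Poisson correction for multiple substitutions, d = -ln(1 - p), and t = d / (2r), where the factor 2 reflects that both lineages accumulate changes. The rate used in the example is a placeholder, not a measured value for any protein.

```python
import math

def poisson_distance(fraction_different):
    """Estimated substitutions per site, correcting for multiple hits."""
    return -math.log(1.0 - fraction_different)

def divergence_time(fraction_different, rate_per_site_per_year):
    """Years since two sequences shared a common ancestor (constant rate)."""
    d = poisson_distance(fraction_different)
    return d / (2.0 * rate_per_site_per_year)

# Hypothetical: 10% of aligned residues differ; assumed rate of
# 1e-9 substitutions per site per year.
t = divergence_time(0.10, 1e-9)
print(f"{t:.3g} years")
```

A slowly evolving protein (small r) is a better clock for ancient divergences; a fast one saturates quickly but resolves recent events.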
Tree of life (Sequences = biological clocks)
A tree derived by clustering sequences of a typical protein family (pterin-4a-hydroxylase) recapitulates the tree of life. Evolutionary relationships are seen at the molecular level in virtually every shared protein and RNA!
Some web sites for bioinformatics
Nucleic acid sequences
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide
Protein sequences
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Protein
Structure Coordinates: Protein Data Bank
http://www.rcsb.org/pdb/
Programs
BLAST sequence similarity calculation
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST bacterial genomes
http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi
PHD secondary structure predictor and motif search
http://www.embl-heidelberg.de/predictprotein/predictprotein.html
PHYRE fold predictor
http://www.sbg.bio.ic.ac.uk/~phyre/
Multicoil: Coiled coil prediction
http://multicoil.lcs.mit.edu/cgi-bin/multicoil/
Many nucleic acid and protein sequence-analysis tools
http://au.expasy.org/
Predict transmembrane helices
http://www.cbs.dtu.dk/services/TMHMM-2.0/
Predict signal sequences
http://www.cbs.dtu.dk/services/SignalP/
Genomics and bioinformatics summary
1. Gene finding: computer searches, cDNAs, ESTs
2. Microarrays
3. Use BLAST to find homologous sequences
4. Multiple sequence alignments (MSAs)
5. Trees quantify sequence and evolutionary relationships
6. Protein sequences are evolutionary clocks
7. Lots of public databases and protein sequence analysis tools
