Automated sequencing machines,
particularly those made by PE Applied
Biosystems, use 4 colors, so they can
read all 4 bases at once.
All the Genes?
• Any human gene can now be found in the
genome by similarity searching with over
95% certainty.
• However, the sequence still has many
gaps
– unlikely to find an uninterrupted genomic
segment for any gene
– still can’t identify pseudogenes with certainty
• This will improve as more sequence data
accumulates
Finding Genes in Genome Sequence is Not Easy
• About 2% of human DNA encodes
functional genes.
• Genes are interspersed among long
stretches of non-coding DNA.
• Repeats, pseudo-genes, and introns
confound matters
Impact on Bioinformatics
• Genomics produces high-throughput, high-quality data, and bioinformatics provides
the analysis and interpretation of these
massive data sets.
• It is impossible to separate genomics
laboratory technologies from the
computational tools required for data
analysis.
Six basic questions about genomes
[1] how is a genome sequenced?
[2] when is the project finished?
[3] sequence one individual or many?
[4] what information is in the DNA?
[5] how many genes are in the genome?
[6] how can whole genomes be compared?
[1] Genome projects: sequencing strategies
Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble them. A contig
is a set of overlapping clones or sequences from which a sequence can be obtained. The
sequence may be draft or finished.
A contig is thus a chromosome map showing the locations of those regions of a
chromosome where contiguous DNA segments overlap. Contig maps are important
because they provide the ability to study a complete, and often large segment of the genome
by examining a series of overlapping clones which then provide an unbroken succession of
information about that region.
Scaffold: an ordered set of contigs placed on a chromosome.
Shotgun
An approach used to decode an organism's genome by shredding it into smaller
fragments of DNA which can be sequenced individually. The sequences of these
fragments are then ordered, based on overlaps in the genetic code, and finally
reassembled into the complete sequence. The 'whole genome shotgun' method is
applied to the entire genome all at once, while the 'hierarchical shotgun' method is
applied to large, overlapping DNA fragments of known location in the genome.
http://www.genome.gov/glossary.cfm
Whole Genome Shotgun Sequencing
• The genome is cut many times at random.
• Fragments are cloned into plasmids (2 – 10 Kbp) or cosmids (40 Kbp).
• About 500 bp is read from each end of a clone, giving forward-reverse linked reads a known distance apart.
ARACHNE:
Whole Genome Shotgun Assembly
1. Find overlapping reads
2. Merge good pairs of reads
into longer contigs
3. Link contigs to form
supercontigs
4. Derive consensus sequence
..ACGATTACAATAGGTT..
http://www-genome.wi.mit.edu/wga/
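The first steps listed above (find overlapping reads, merge them into longer contigs) can be illustrated with a toy greedy merger. This is only a sketch under simplifying assumptions: error-free reads, a single strand, no repeats, and no use of paired-read distances. It is not the ARACHNE algorithm, and the names overlap and greedy_assemble are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest suffix-prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:  # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)] + [merged]


reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
print(greedy_assemble(reads))
```

With these three hypothetical reads the merger rebuilds the consensus fragment shown on the slide. Real assemblers must also handle sequencing errors, reverse-complement reads and repeats, which is where the linked forward-reverse reads become important.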
[2] When is the project finished?
Get five to ten-fold coverage
Finished sequence: a clone insert is contiguously
sequenced to a high quality standard, with an error rate of
0.01%. There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several
regions separated by gaps. The true order and
orientation of the pieces may not be known.
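The coverage and error-rate figures above can be turned into quick arithmetic. The sketch below assumes reads land uniformly at random (the Lander-Waterman idealisation, under which the expected uncovered fraction is e^-c for coverage c), a genome of about 3 Gbp, and 500 bp reads. The genome size and the function names are assumptions for illustration, not values from the slides.

```python
import math


def coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Average fold coverage: total bases sequenced divided by genome length."""
    return num_reads * read_len / genome_len


def reads_needed(target_cov: float, read_len: int, genome_len: int) -> int:
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_cov * genome_len / read_len)


def fraction_uncovered(cov: float) -> float:
    """Expected fraction of bases hit by no read, assuming uniformly random reads."""
    return math.exp(-cov)


def phred_quality(error_rate: float) -> float:
    """Phred-scaled quality for a per-base error rate."""
    return -10 * math.log10(error_rate)


GENOME = 3_000_000_000  # assumed genome size (bp)
READ = 500              # read length (bp)

for c in (5, 10):
    print(c, reads_needed(c, READ, GENOME), fraction_uncovered(c))
print(phred_quality(0.0001))  # the 0.01% finished-sequence error rate
```

Under these assumptions, an error rate of 0.01% corresponds to a Phred quality of about 40, and even five-fold coverage is expected to leave under 1% of bases unsequenced.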
Repetitive DNA sequences: five classes
[1] Interspersed repeats: transposon-derived repeats
-- 45% of human genome; LTR, SINE, LINE
[2] Processed pseudogenes
[3] Simple sequence repeats
-- micro- and minisatellites
-- ACAAACT, 11 million times in a Drosophila
-- Human genome has 50,000 CA dinucleotide repeats
[4] Segmental duplications (about 5% of human genome)
[5] Tandem repeats (e.g. telomeres, centromeres)
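Simple sequence repeats such as the CA dinucleotide runs in class [3] can be located with a back-referencing regular expression. This is a minimal sketch that finds perfect tandem repeats only. The function name find_ssrs and its thresholds are illustrative assumptions, and dedicated tools such as RepeatMasker do far more.

```python
import re


def find_ssrs(seq: str, unit_len: int = 2, min_repeats: int = 4):
    """Return (start, unit, repeat_count) for perfect tandem runs of a unit_len motif."""
    # ([ACGT]{n}) captures one unit; \1{m,} requires at least m further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) > 1:  # skip homopolymer runs such as AAAAAAAA
            hits.append((m.start(), unit, len(m.group(0)) // unit_len))
    return hits


print(find_ssrs("GGTCACACACACATTG"))
```

The example sequence is made up. It contains one run of five CA units starting at position 3 (0-based).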
• LINE and SINE repeats. A LINE (long interspersed
nuclear element) encodes a reverse transcriptase (RT) and
perhaps other proteins. Mammalian genomes contain an
old LINE family, called LINE2, which apparently stopped
transposing before the mammalian radiation, and a
younger family, called L1 or LINE1, many of which were
inserted after the mammalian radiation (and are still being
inserted). A SINE (short interspersed nuclear element)
generally moves using RT from a LINE. Examples include
the MIR elements, which co-evolved with the LINE2
elements. Since the mammalian radiation, each lineage has
evolved its own SINE family. Primates have Alu elements
and mice have B1, B2, etc. The process of insertion of a
LINE or SINE into the genome causes a short sequence (7-21 bp for Alus) to be repeated, with one copy (in the same
orientation) at each end of the inserted sequence. Alus
have accumulated preferentially in GC-rich regions, L1s in
GC-poor regions.
What is the function of nongenic DNA?
Hypotheses:
• Nongenic DNA performs essential functions, such as
regulation of gene expression.
• Nongenic DNA is inert, genetically and physiologically.
Excess DNA is incidental and is called “junk DNA.”
• Nongenic DNA is a functional parasite or selfish DNA
(retrotransposons).
• Nongenic DNA has a structural function.
DNA classification
FUNCTIONAL (sequences that perform a function)
- Coding (translated into proteins)
- Non-coding (not translated)
* Transcribed (functions at the RNA level: ribosomal subunits)
* Not transcribed (functions at the DNA level: intron,
promoter, enhancer, etc.)
NON-FUNCTIONAL (sequences that perform no function: “junk DNA”)
Gene-finding algorithms
Homology-based searches (“extrinsic”)
Rely on previously identified genes
Algorithm-based searches (“intrinsic”)
Investigate nucleotide composition, open reading frames, and other intrinsic
properties of genomic DNA
DNA (with introns) → RNA → mature RNA → protein
Homology-based searching: compare DNA
to expressed genes (ESTs)
Algorithm-based searching: compare DNA in exons
(unique codon usage) to introns (unique splice sites)
and to noncoding DNA. Identify open reading frames (ORFs).
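Identifying open reading frames is the simplest of the intrinsic signals to demonstrate. The sketch below scans all six frames for stretches that begin with ATG and end at the first in-frame stop codon. It ignores introns entirely, so it suits prokaryotic or unspliced sequence only. The function names and the min_codons threshold are assumptions for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, start, end, orf) for ATG...stop ORFs in all six frames.

    start and end are 0-based, end-exclusive positions on the strand being
    scanned; the codon count includes the stop codon.
    """
    results = []
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return results


print(find_orfs("CCATGAAATTTTAGGG"))
```

In practice the minimum length would be set much higher (hundreds of bases), since short ORFs occur by chance in any sequence.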
[6] how can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or
protein in one genome against another
-- We looked at TaxPlot and COG for bacterial (and for
some eukaryotic) genomes
Orthologue & Paralogue
• Orthologue: homologous genes in different
organisms that descend from a single ancestral gene and usually retain the same function.
• Paralogue: homologous genes in the same
organism that originated by gene duplication.
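One common heuristic for proposing orthologue pairs from an all-against-all BLAST of two genomes is the reciprocal best hit: gene a in genome A and gene b in genome B are paired only if each is the other's top-scoring hit. The slides do not name this method; it is offered here as an illustrative assumption. The hit tuples below are invented, and parsing real BLAST output is left out.

```python
def best_hits(hits):
    """hits: iterable of (query, subject, score). Return {query: top-scoring subject}."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd, rev = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical scores: A2 hits B2 best, but B2 prefers A1, so A2 gets no pair.
a_vs_b = [("A1", "B1", 200), ("A1", "B2", 150), ("A2", "B2", 90)]
b_vs_a = [("B1", "A1", 198), ("B2", "A1", 140), ("B2", "A2", 95)]
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

A gene left without a reciprocal partner, like A2 here, may be a paralogue or a lineage-specific gene. The heuristic cannot tell those cases apart on its own.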
Figure: Gene A and Gene B in Species 1 and Species 2, with the copies diverging.
Comparative Genomics
Using ACT
The Artemis Comparison Tool
Artemis
• Artemis is a free DNA sequence viewer and
annotation tool that allows visualization of
sequence features and the results of analyses
within the context of the sequence, and its six-frame translation.
• http://www.sanger.ac.uk/Software/Artemis/
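A six-frame translation simply translates the sequence in each of the three forward reading frames and each of the three frames on the reverse complement. The following is a minimal sketch using the standard genetic code. It is not how Artemis itself is implemented, and the frame labels (+1 to -3) are a naming assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame(seq: str) -> dict[str, str]:
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse strand)."""
    seq = seq.upper()
    rc = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame("ATGGCCTAA").items():
    print(name, protein)
```

Stop codons are shown as '*', so long stretches without '*' in one frame hint at a possible coding region.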
Artemis comparison tool ACT
• Based on Artemis and written in Java.
• Allows visualisation of two or more
sequences and a comparison file.
• The comparison file can be BLASTn or
tBLASTx.
• Retains all the functionality of Artemis.
Running ACT
Sequence 1 and Sequence 2 → BLASTn or tBLASTx → MSPcrunch → Reformat
DNA sequence analysis tools: RepeatMasker, Blastn, Fasta, BlastP, Blastx, gene finders, Halfwise, Pfam, Prosite, Psort, SignalP, TMHMM, tRNA scan
Features identified: repeats, promoters, genes, pseudogenes, rRNA, tRNA
The Annotation Process
DNA SEQUENCE → ANALYSIS SOFTWARE → Annotator → Useful Information
DNA in Artemis
AT content
Forward
translations
Reverse
Translations
DNA and amino
acids
Gene structure
• In trypanosomatids
– Polycistronic structure
– Genes occur on a single strand at a time.
– Inflection points
– No splicing
Trypanosome gene structure
Gene structure in malaria
• Splicing
• No polycistronic units
• Can have small exons
• Low complexity regions
AT content
• Coding regions have higher GC content in
AT rich genomes
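AT content is usually plotted in a sliding window along the sequence, so that dips in AT (rises in GC) stand out as candidate coding regions. A minimal sketch follows. The window and step sizes are arbitrary choices for illustration, and the function names are made up.

```python
def at_content(seq: str) -> float:
    """Fraction of bases that are A or T (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq) if seq else 0.0


def sliding_at(seq: str, window: int, step: int):
    """Return (window_start, AT fraction) for each full window along seq."""
    return [
        (i, at_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence: AT-rich flank followed by a GC-rich stretch.
for start, at in sliding_at("AAAATTTTGGCCGGCC", window=8, step=4):
    print(start, at)
```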
CODON USAGE
• Codon bias is different for each organism.
• Base composition is constrained in coding regions
but not in non-coding regions.
• The codon usage of any particular gene can
influence its expression.
Codon usage
• All organisms have a preferred set of
codons.
Valine codon usage (fraction of each codon)
Codon    Malaria    Trypanosoma
GUU      0.41       0.28
GUC      0.06       0.19
GUA      0.42       0.14
GUG      0.11       0.39
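Fractions like these come from counting how often each synonymous codon is used within one amino acid's codon family. The sketch below counts the four valine codons in a coding sequence and picks the preferred codon from the slide's figures. The function names and the toy coding sequence are assumptions for illustration.

```python
from collections import Counter

VALINE = ("GUU", "GUC", "GUA", "GUG")

# Fractions copied from the slide.
SLIDE_USAGE = {
    "Malaria": {"GUU": 0.41, "GUC": 0.06, "GUA": 0.42, "GUG": 0.11},
    "Trypanosoma": {"GUU": 0.28, "GUC": 0.19, "GUA": 0.14, "GUG": 0.39},
}


def codon_fractions(cds: str, family=VALINE) -> dict[str, float]:
    """Fraction of each codon in `family`, counted in frame from the first base of cds."""
    rna = cds.upper().replace("T", "U")
    counts = Counter(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts[c] for c in family)
    return {c: (counts[c] / total if total else 0.0) for c in family}


def preferred_codon(usage: dict[str, float]) -> str:
    """Codon with the highest fraction."""
    return max(usage, key=usage.get)


print(codon_fractions("GTTGTAGTTGTGAAA"))  # toy CDS: GTT GTA GTT GTG AAA
for organism, usage in SLIDE_USAGE.items():
    print(organism, preferred_codon(usage))
```

On the slide's numbers, the preferred valine codon is GUA in malaria and GUG in Trypanosoma, in line with the AT-rich malaria genome favouring A or U in the third position.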
Codon Usage
• http://www.kazusa.or.jp/codon/
Codon Usage in Artemis
Forward
frames
Reverse
frames
GC frame plot
• Plots the third position GC content of each
frame of a DNA sequence.
• In coding DNA the GC content of the 3rd
base is often higher.
• Good prediction of coding in malaria and
trypanosomes.
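The idea behind a GC frame plot can be sketched as follows: for each window along the sequence, compute the GC fraction of every third base separately for the three forward frames. This is an illustrative approximation, not the Artemis implementation. The window and step values are assumptions, and the step is kept a multiple of 3 so that frame numbering stays consistent between windows.

```python
def gc3_by_frame(seq: str, window: int = 120, step: int = 30):
    """Return (window_start, (gc3_frame0, gc3_frame1, gc3_frame2)) per window.

    Frame f takes the third codon positions f+2, f+5, ... within the window.
    """
    seq = seq.upper()
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        win = seq[start:start + window]
        row = []
        for frame in range(3):
            thirds = win[frame + 2::3]  # third base of each codon in this frame
            gc = sum(base in "GC" for base in thirds)
            row.append(gc / len(thirds) if thirds else 0.0)
        rows.append((start, tuple(row)))
    return rows


# Toy sequence whose third codon position in frame 0 is always G.
print(gc3_by_frame("AAG" * 10, window=30, step=30))
```

In a real plot, the frame whose third-position GC content rises well above the other two over a long stretch is the likely coding frame.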
Genefinding programs
• Genefinding software packages use hidden
Markov models.
• Predict coding, intergenic and intron
sequences
• Need to be trained on a specific organism.
• Never perfect!
Phat
Cawley et al. (2001) Mol. Biochem. Parasitol. 118: 167
http://www.stat.berkeley.edu/users/scawley/Phat/
• Based on a generalised hidden Markov
model (GHMM)
• Free, and easy to install and run.
• Good at predicting multi-exon genes, but
in some cases misses genes altogether
and also overpredicts.
What is an HMM?
• A statistical model that represents a gene.
• Similar to a “weight matrix” that can
recognise gaps and treat them in a
systematic way.
• Has different “states” that represent
introns, exons and intergenic regions.
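A toy example may make the idea of "states" concrete. The sketch below uses only two states, non-coding (N) and coding (C), and finds the most probable state path for a sequence with the Viterbi algorithm. All probabilities are invented for illustration (an AT-rich background with GC-richer coding regions). Real gene finders use many more states and parameters trained on the organism.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {  # made-up base frequencies per state
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},
}


def viterbi(seq: str) -> str:
    """Most probable state path for seq (A/C/G/T only), computed in log space."""
    seq = seq.upper()
    if not seq:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best previous state when in state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace back from the last position
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print("ATATATATGCGCGCGCGC")
print(viterbi("ATATATATGCGCGCGCGC"))
```

With these parameters the path switches from N to C exactly where the sequence changes from AT-rich to GC-rich, because one state change is cheaper than emitting ten unlikely bases.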
GlimmerM
Salzberg et al. (1999) Genomics 59: 24-31
• Adaptation of the prokaryotic genefinder
Glimmer.
Delcher et al. (1999) NAR 27: 4636-4641
• Based on an interpolated HMM (IHMM).
• Only uses short chains of bases (Markov
chains) to generate probabilities.
• Trained identically to Phat.
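The role of short Markov chains can be shown with a fixed-order model: the probability of each base is estimated from the k bases before it, using counts from training sequences. This sketch is deliberately simpler than GlimmerM. It uses a single order k with pseudocounts and does not interpolate between orders. The function names and the training data are assumptions for illustration.

```python
import math
from collections import defaultdict


def train_markov(seqs, k: int = 2, pseudo: float = 1.0):
    """Estimate P(next base | previous k bases) from training sequences."""
    counts = defaultdict(lambda: dict.fromkeys("ACGT", pseudo))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            context, nxt = s[i - k:i], s[i]
            if nxt in "ACGT":
                counts[context][nxt] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {base: n / total for base, n in c.items()}
    return model


def log_prob(seq: str, model, k: int = 2) -> float:
    """Log-probability of seq under the model; unseen contexts or bases score 0.25."""
    seq = seq.upper()
    lp = 0.0
    for i in range(k, len(seq)):
        probs = model.get(seq[i - k:i], {})
        lp += math.log(probs.get(seq[i], 0.25))
    return lp


model = train_markov(["ACACACAC"], k=1)  # toy training set
print(log_prob("ACAC", model, k=1), log_prob("AGAG", model, k=1))
```

A gene finder would train one such model on known coding sequence and another on non-coding sequence, then compare the two scores for each candidate region.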
GlimmerM
• Underpredicts splicing.
• Hardly ever misses a gene
completely.
• Does overpredict.
• Free with licence.
Homology Data
• Coding regions are more conserved than
non-coding regions due to selective pressure.
• Comparing all possible translations against
all known proteins will give clues to known
genes.
• Blastx
The Gene Prediction Process
DNA SEQUENCE and ESTs → ANALYSIS SOFTWARE (FASTA, BlastX, Phat, GlimmerM, DNA plots) → Annotator → Good Gene Models
T. brucei vs L. major (cont.)
T. brucei vs T. cruzi
L. major has a break in synteny that is
conserved in T. brucei and T. cruzi.
Chromosomes shown: T. cruzi chr3, T. brucei chr1, T. brucei chr6, L. major chr12
The ACT Display
Display elements: genome1, genome2 and genome3 sequence panels; zoom scroll bar; filter scroll bar; BLAST HSPs
ACT
• Designed for looking at complete bacterial
genomes.
Knowlesi contigs ↔ (tblastx) ↔ Falciparum chr 3 ↔ (tblastx) ↔ Yoelii contigs (TIGR)
Software
• www.sanger.ac.uk/Software/Artemis
• www.sanger.ac.uk/Software/ACT
• www.genome.nghri.nih.gov/blastall
• www.cgr.ki.se/cgr/goups/sonnhammer/MSPcrunch.html