Genome Annotation - Virginia Commonwealth University

Genome Annotation
BBSI
July 14, 2005
Rita Shiang
Genome Annotation

Identification of important components in
genomic DNA
What is a Gene?



Fundamental unit of heredity
DNA involved in producing a polypeptide; it
includes regions preceding and following the
coding region (leader and trailer) as well as
intervening sequences (introns)
Entire DNA sequence including exons, introns,
and noncoding transcription-control regions
What Components are Important in
Protein Coding Genes?



Sequences that initiate transcription
Sequences that process hnRNA to mRNA
Signals important in translation
TATA Box
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.30.
Other Promoters

Initiator consensus
– 5’ Py Py A(+1) N T/A Py Py Py
  – N = A, T, G or C
  – Py = pyrimidine = C or T
GC-rich sequences
– Stretch of 20-50 GC nucleotides ~100 bp upstream of start site (CpG not common in genome)
– Housekeeping genes
– Multiple initiation sites
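The initiator consensus above can be scanned for with a simple pattern match. The following is a minimal sketch, assuming the sequence is given as DNA (T rather than U); the function name find_initiators is hypothetical and not taken from any annotation package.

```python
import re

# Initiator consensus from the slide: Py Py A(+1) N T/A Py Py Py
# Py = C or T, N = any base. The lookahead lets overlapping matches be reported.
INR_PATTERN = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT][CT]))")

def find_initiators(dna):
    """Return (index_of_+1_A, matched_sequence) pairs, using 0-based indices."""
    dna = dna.upper()
    # The +1 adenine is the third base of the 8-base consensus.
    return [(m.start() + 2, m.group(1)) for m in INR_PATTERN.finditer(dna)]

print(find_initiators("GGGTCAGTCCCGG"))
```

A match only marks a candidate start site; a short degenerate consensus like this one will also occur by chance in long sequences.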
Polyadenylation & Cleavage




Addition of a string of As to mRNAs
Polyadenylation signal AAUAAA found before
cleavage site
GU or UU rich region ~50 bp from the cleavage
site
Stabilizes mRNA transcripts
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.23.
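Locating the AAUAAA polyadenylation signal is a plain substring search. A minimal sketch, assuming the input may be written as either RNA or DNA; find_polya_signals is a hypothetical helper name.

```python
def find_polya_signals(seq):
    """Return 0-based start positions of every AAUAAA in the sequence."""
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA spelling
    positions = []
    start = rna.find("AAUAAA")
    while start != -1:
        positions.append(start)
        start = rna.find("AAUAAA", start + 1)  # step by 1 to allow overlaps
    return positions

print(find_polya_signals("CCAAUAAAGGAAUAAA"))
```

The hexamer alone is not sufficient evidence of a cleavage site; the downstream GU- or UU-rich region noted above would also need to be checked.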
Splicing
Electron micrograph of adenovirus DNA and hexon gene mRNA
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.13.
Splice Reaction
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.15.
Splice Sites
Lodish et al., Molecular Cell Biology, 2000, Fig. 11.14.
Additional Splice Sites
Consensus Py7NCAG-G(exon)AG – GUAAGU 98.12%
Nonconsensus
GC
U12 introns
AC
PuUAUCCUPy 0.76%
Other rare sequences
1%
Py = C or U
Pu = A or G
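The splice-site classes above can be told apart by the first and last two bases of an intron. A minimal sketch, assuming the intron sequence has already been extracted; classify_intron and its labels are hypothetical, and the AT-AC label is only a candidate call, since intron type is not determined by terminal dinucleotides alone.

```python
def classify_intron(intron):
    """Classify an intron by its terminal dinucleotides (DNA or RNA input)."""
    s = intron.upper().replace("U", "T")
    ends = (s[:2], s[-2:])
    if ends == ("GT", "AG"):
        return "GT-AG (consensus)"
    if ends == ("GC", "AG"):
        return "GC-AG (nonconsensus)"
    if ends == ("AT", "AC"):
        return "AT-AC (U12-type candidate)"
    return "other"

print(classify_intron("GUAAGUCCCUUUUUUUCAG"))
```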
Translation Signals



5’ Cap structure directs ribosomal binding
AUG codes for methionine. The first AUG in a
transcript is where translation starts
Open reading frame (ORF)
– Stretch of sequence that codes for amino acids before a stop codon
Translation stop codons UAG, UAA, UGA
Capping of 5’ RNA with 7-methylguanylate (m7G)
Lodish et al, Molecular Cell Biology, 2000, Fig. 11.8.
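The translation signals above (first AUG, in-frame codons, stop codon) are enough to sketch an ORF finder. This is a simplified illustration that assumes translation starts at the first AUG, as stated on the slide; first_orf is a hypothetical name.

```python
STOP_CODONS = {"UAG", "UAA", "UGA"}

def first_orf(mrna):
    """Return the ORF from the first AUG through the first in-frame stop codon.

    Returns None if there is no AUG or no in-frame stop codon.
    """
    rna = mrna.upper().replace("T", "U")
    start = rna.find("AUG")
    if start == -1:
        return None
    # Step through the sequence one codon at a time in the AUG's frame.
    for i in range(start, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            return rna[start:i + 3]
    return None

print(first_orf("GGAUGGCCUAAGG"))
```

Stop codons that fall out of frame are ignored, which is why the scan advances three bases at a time.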
Known Gene Components
Lodish et al., Molecular Cell Biology, 2000, Fig. 10.34.
Genome Annotation

What is in a genome besides protein coding
genes?
Repetitive DNA makes up at least
50% of the genome





Transposon-derived interspersed repeats
Inactive retroposed copies of genes (pseudogenes)
Simple short repeats
Segmental duplications
Blocks of tandemly repeated sequences
– Centromeres
– Telomeres
– Short arm of acrocentric chromosomes
– Ribosomal gene clusters
Non-protein coding genes or noncoding RNA (ncRNA)



tRNA genes
rRNA genes
snRNA genes
– Splicing
– Telomere maintenance
snoRNA genes
Other
– microRNA
Annotation of Genomic DNA


Identifying Protein Coding Genes
Placing the genes on the genome (where are
they?)
How Many Genes in the Genome?




Early on based on reassociation kinetics the
estimate was ~40,000
Walter Gilbert estimated ~100,000 based on
gene and genome size
70,000 – 80,000 based on an extrapolated
number of CpG islands
With the Human sequence the estimate is
30,000 – 40,000
Annotation of Genomic DNA Specifically for
Genes that Code for Proteins



Match genomic DNA to genes that have been
previously cloned and sequenced looking for
sequence similarity using BLAST programs
Predict genes using computer programs to
scan genomic DNA using known elements
Many strategies use a combination of both
methods
cDNA Library Construction
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.14
Lodish et al., Molecular Cell Biology, 2000, Fig. 7.15
Gene Annotation
Celera


Constructed gene models using sequence from
cDNAs
Used UniGene database
– Partitions GenBank sequences (mRNAs & ESTs) into a nonredundant set using 3’ UTRs
– 111,064 UniGene clusters for human
Gene Annotation
Celera cont.



Predicts gene boundaries by identifying overlapping
sets of EST and protein matches
Known full-length genes were annotated on the map
(matched w/50% of the length & >92% identity)
Clusters that did not match a full-length gene were evaluated using other references
– Conservation of genomic sequence between mouse & human
– Similarity between human & rodent transcripts
– Similarity to known proteins
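The full-length match rule above (50% of the length and >92% identity) can be written as a simple filter. A minimal sketch, assuming "50% of the length" means at least half of the known gene is covered by the alignment; the function and parameter names are hypothetical.

```python
def passes_match_criteria(aligned_length, gene_length, percent_identity,
                          min_coverage=0.50, min_identity=92.0):
    """Apply the slide's thresholds: coverage of at least 50%, identity above 92%."""
    coverage = aligned_length / gene_length
    return coverage >= min_coverage and percent_identity > min_identity

print(passes_match_criteria(600, 1000, 95.0))
```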
Validation


Validated by construction of known genes (RefSeq)
6.1% of RefSeq genes were not annotated by Otto (Celera's automated annotation system)
Gene Annotation - Human Genome
Sequencing Consortium

Start with Ensembl predicted genes
– ab initio predictions using Genscan
  – Based on probabilistic model of genome sequence composition and gene structure
– Confirm similarity to mRNAs, ESTs, protein motifs from all organisms
– Extend protein matches using GeneWise
  – Compares protein-based information to genomic sequence and allows for frameshifts and large introns
– Produces partial gene predictions
Consortium cont.

Merge Ensembl gene predictions w/ Genie predictions
– Genie identifies matches of mRNAs and ESTs
  – Employs hidden Markov models (HMMs) to extend matches using ab initio statistical methods
  – Links information from 5’ and 3’ ESTs from the same cDNA clone to complete a sequence from the ATG to the stop codon
  – Can generate alternatively spliced products (though only the longest was used in this build)
Merge results with genes in RefSeq, SWISSPROT and TrEMBL databases
Validation



Validate method by comparing to a new set of
known genes, a set of mouse cDNAs and
genes on Chromosome 22 (Finished
Sequence)
85% Sensitivity
13% spurious predictions
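The two validation figures above correspond to standard ratios. The slide does not say exactly how they were computed, so the sketch below assumes the usual definitions: sensitivity is the fraction of known genes that were found, and the spurious rate is the fraction of predictions that match no known gene. The counts in the example are illustrative only, not data from the study.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of known genes that the method predicted."""
    return true_positives / (true_positives + false_negatives)

def spurious_fraction(true_positives, false_positives):
    """Fraction of predictions that do not correspond to a known gene."""
    return false_positives / (true_positives + false_positives)

# Illustrative counts: 85 of 100 known genes found; 13 of 100 predictions spurious.
print(sensitivity(85, 15), spurious_fraction(87, 13))
```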
Factors Affecting Gene Annotation


Splice sites do not always conform to consensus
Noncoding exons are common
– An exon is what remains after introns are spliced out; the term does not imply coding sequence
– tRNAs are spliced but noncoding
– >35% of human genes have noncoding exons
– No statistical bias, so they are difficult to identify
Factors Affecting Gene Annotation
Cont.

Internal exons can be very small
– Avg. size of internal exons is ~130 bp
– ~65% of vertebrate exons are 68-208 bp
– >10% are <60 bp
– Exons < 10 bp have been identified
– Invected gene in Drosophila
  – One of four exons is 6 bp (GTCGAA)
  – Flanked by introns of 27.6 and 1.1 kb
  – Not correctly recognized by cDNA alignment software and creates a frameshift in the gene
– Exons of size 0
  – Resizing exons create an intermediate splice product
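A rough calculation suggests why a 6 bp exon is hard to place by alignment: a given 6-mer is expected to occur several times by chance within a 27.6 kb intron. The sketch below assumes a random sequence with equal, independent base frequencies, which real genomic DNA does not follow exactly; expected_chance_matches is a hypothetical helper.

```python
def expected_chance_matches(kmer_length, region_length):
    """Expected occurrences of one specific k-mer in a random region.

    Assumes each base is independent and equally likely (probability 1/4).
    """
    positions = max(region_length - kmer_length + 1, 0)
    return positions / 4 ** kmer_length

# 6 bp exon (e.g. GTCGAA) against the two flanking introns from the slide.
print(expected_chance_matches(6, 27_600))  # several chance matches expected
print(expected_chance_matches(6, 1_100))   # fewer than one expected
```

With multiple equally good placements in the larger intron, sequence identity alone cannot locate the exon, so splice-site context has to be used as well.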
Places to View Annotated Genomes




National Center for Biotechnology Information
(NCBI)
Ensembl
The Golden Path (UCSC Genome Browser)
Celera
Verification of Annotation in C. elegans by Experimentation



Complete genomic sequence
Small introns
Small intergenic regions
Results



11,984 cDNAs successfully cloned out of a prediction of 19,477
4,365 were not represented by cDNAs or ESTs
Failure of cloning could be due to:
– Wrongly predicted exons
– Very low expressing genes
– Not a real gene
Verification of intron/exon structures
Comparison of a Single Transcript
Greater than 50% of intron/exon structures need correcting?