Transcript Lecture_4

Genome annotation and search
for homologs
Genome of the week
• Discuss the diversity and features of
selected microbial genomes.
• Link to the paper describing the genome on
the MMG433 website.
Bacillus subtilis
• Gram-positive soil bacterium
• Genetically tractable, well-studied
• Developmental pathways (sporulation, genetic
competence)
• Industrial and agricultural importance
• 4.2 Mb genome (sequence completed 1997)
B. subtilis genome features
• 4,106 protein coding genes
• 10 rRNA operons
• Nearly 50% of the genome consists of paralogous
genes.
– 77 ABC transporter binding proteins
• 10 phage-like regions - horizontal transfer. These are
low-GC regions in the genome.
• 18 sigma factors - initiate transcription.
• 34 two-component regulatory systems.
Sequencing of genomes
• Hierarchical or contig based sequencing
– Clone smaller segments of the genome.
– Labor intensive, slow
– Not needed for sequencing microbial genomes
• Shotgun method
– Randomly clone and sequence 1.5-2 kb fragments of DNA to
5- to 10-fold coverage.
– Computationally intensive.
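The coverage figure above can be turned into rough arithmetic. The sketch below assumes coverage = (number of reads x read length) / genome size, and uses the Poisson (Lander-Waterman) approximation for the fraction of bases that no read covers. The 500 nt usable read length is an assumption for illustration and is not from the lecture; the 4.2 Mb genome size is the B. subtilis figure given earlier.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Smallest number of reads N such that N * L / G >= coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_fraction_unsequenced(coverage: float) -> float:
    """Poisson estimate of the probability that a base is covered by zero reads."""
    return math.exp(-coverage)


GENOME_SIZE = 4_200_000  # B. subtilis genome size from the lecture
READ_LENGTH = 500        # assumption: usable bases per sequencing read

for fold in (5, 10):
    n = reads_needed(GENOME_SIZE, READ_LENGTH, fold)
    gap = expected_fraction_unsequenced(fold)
    print(f"{fold}-fold coverage: {n} reads, expected unsequenced fraction {gap:.5f}")
```

Under these assumptions, moving from 5-fold to 10-fold coverage doubles the number of reads but shrinks the expected unsequenced fraction by a factor of about 150, which is one reason shotgun projects aim for the higher end of the range.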
Sequence assembly
• Focus of this week’s lab exercise
• Algorithms to align and edit multiple
sequences
• Phrap and Consed
• Sequencher (commercial) for lab.
Finding functional features in a
microbial genome.
• Genes
• rRNA operons, tRNAs - programs available
• Origin of replication - oriC - near dnaA gene
• Promoters
• Transcription terminators
• Horizontally transferred DNA
– GC content
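Using GC content to flag possible horizontally transferred DNA can be sketched as a sliding-window scan. This is a minimal illustration, not a published method: the window size, step size, and the 5 percentage point drop threshold are all assumptions chosen for the example, and `low_gc_windows` is a hypothetical helper name.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def low_gc_windows(genome: str, window: int = 5000, step: int = 1000,
                   drop: float = 0.05):
    """Return (start, end, gc) for windows whose GC fraction is at least
    `drop` below the genome-wide GC fraction. Coordinates are 0-based,
    end exclusive."""
    overall = gc_fraction(genome)
    flagged = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        gc = gc_fraction(chunk)
        if gc <= overall - drop:
            flagged.append((start, start + len(chunk), gc))
    return flagged


# Toy sequence: a GC-rich background with an AT-rich island in the middle.
toy = "GC" * 50 + "AT" * 50 + "GC" * 50
print(low_gc_windows(toy, window=100, step=100))
```

Flagged windows are only candidates. A low-GC region near phage-related genes is more convincing than a low-GC window on its own.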
Gene finding
• Easy relative to eukaryotic genomes
– No introns
– 80-90% of DNA encodes genes. 5% in eukaryotes.
• Find open reading frames (ORF scanning).
– Scan from start codons (mostly ATG, but not always) to stop
codons. Smallest ORFs kept are usually about 300 nt in length.
– Additional features. Good Shine-Dalgarno sequence
(ribosome binding site). AGGAGG. Not essential.
– Similarity matches to genes in other genomes.
– Effective way of searching for ORFs.
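The ORF scanning step described above can be sketched in a few lines. This is a simplified illustration and not how Glimmer or NCBI ORF finder work internally: it only accepts ATG as a start codon by default, keeps the first in-frame start (so the longest ORF per stop codon), ignores the Shine-Dalgarno sequence, and assumes the input contains only A, C, G, and T. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq: str, min_len: int = 300, starts=("ATG",)):
    """Return (start, end) for ORFs in the three frames of one strand.
    Coordinates are 0-based, end exclusive, and include the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def find_orfs(seq: str, min_len: int = 300):
    """Scan all six frames; return (strand, start, end) on forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    found = [("+", s, e) for s, e in scan_strand(seq, min_len)]
    for s, e in scan_strand(reverse_complement(seq), min_len):
        found.append(("-", n - e, n - s))
    return found


# Toy example with a lowered length cutoff so the short ORF is reported.
toy = "CC" + "ATG" + "GCA" * 10 + "TAA" + "CC"
print(find_orfs(toy, min_len=30))
```

The 300 nt default mirrors the minimum ORF size mentioned above. Lowering it finds more short genes but also many more spurious ORFs, which is why similarity matches to genes in other genomes are used as supporting evidence.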
Gene finding programs
• Genefinder, Grail, Glimmer (TIGR), etc.
• ORF finder from NCBI
– Will use in a future lab exercise and in the final
annotation project
Annotating genes
• How to assign preliminary functions to genes.
• Automated programs.
• Similarity searches
– BLAST and PSI-BLAST
– COGs, Pfam, CDD, other databases
– Only 50-75% of genes will have a predicted function.
Some have no known homologs in any other genome.
• Functional characterization (individual genes)
– Gene knockouts
– Overexpression
• In most cases computer annotation will only
be able to predict function - NOT assign
function.
– The biological function of many genes has not
been determined, even in model systems.
– As genomic characterization of gene function
continues - more and more computer generated
annotations will be correct.
• Molecular function - activity of a protein at
the molecular level.
– Examples would be ATPase, metal binding,
converting glucose-6-phosphate to fructose-6-phosphate.
• Biological function - cellular role of the
protein.
– Examples would be translation initiation,
adapting to environmental changes, glycolysis.
Homologs, orthologs, and
paralogs.
• Homologous genes are genes that share a
common evolutionary ancestor.
– Orthologs are genes found in different
organisms that arose from a common ancestor
– Paralogs are genes found in the same organism
that arose from a common ancestor.
Duplication could have occurred in the species
or earlier.
Using BLAST to predict gene
function.
• BLAST predicted protein sequence against
the non-redundant database.
• Determine best hits
• Automated annotation programs will often
assign the best hit function to the gene
being searched.
• Must manually confirm automated
annotations.
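The "determine best hits" step can be sketched for BLAST+ tabular output (the 12 default columns of -outfmt 6). This is a minimal illustration of what an automated pipeline might do before a person checks the result. It takes the highest bit score per query as the best hit, which is one reasonable choice among several, and `best_hits` is a hypothetical helper name. The sequence IDs and numbers in the sample are made up.

```python
import csv
import io

# The 12 default columns of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text: str) -> dict:
    """Return {query_id: row} keeping the hit with the highest bit score."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=FIELDS, delimiter="\t")
    for row in reader:
        if row["qseqid"].startswith("#"):
            continue  # skip comment lines (present in -outfmt 7 style files)
        row["pident"] = float(row["pident"])
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Hypothetical sample data, for illustration only.
sample = "\n".join([
    "orfA\thit1\t40.0\t200\t120\t0\t1\t200\t1\t200\t1e-30\t150",
    "orfA\thit2\t80.0\t210\t42\t0\t1\t210\t1\t210\t1e-90\t400",
    "orfB\thit3\t30.0\t90\t63\t0\t5\t94\t10\t99\t1e-5\t50",
])
for query, hit in best_hits(sample).items():
    print(query, hit["sseqid"], hit["pident"], hit["bitscore"])
```

The function name attached to the best hit is only a starting point. The annotation on that database entry may itself have been copied from another best hit, which is why manual confirmation matters.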
Assessment of BLAST output
• What is the level of identity and similarity of the
best hits?
– Higher identity - more likely the proteins have
similar functions.
• Does the area of similarity occur over the entire
protein? Or just part of the protein? (fig. 2.19)
– Often you will find hits to only part of your protein. A
GTP-binding domain for example.
• Have any of the best hits been characterized
experimentally?
– With so many microbial genomes sequenced, chances
are you will have to search extensively to find a hit that
has been characterized experimentally.
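The first two questions above (how much identity, and over how much of the protein) can be reduced to two numbers per hit. The sketch below is illustrative only: the 30% identity and 80% query coverage cutoffs are assumptions, not values from the lecture, and `assess_hit` is a hypothetical helper. Whether a hit has been characterized experimentally cannot be computed this way and still requires reading the literature.

```python
def assess_hit(pident: float, qstart: int, qend: int, query_len: int,
               min_identity: float = 30.0, min_coverage: float = 0.8) -> dict:
    """Summarize one BLAST hit by percent identity and query coverage.

    qstart and qend are 1-based, inclusive alignment coordinates on the query.
    The thresholds are illustrative assumptions, not established cutoffs."""
    coverage = (abs(qend - qstart) + 1) / query_len
    if pident < min_identity:
        verdict = "weak"
    elif coverage >= min_coverage:
        verdict = "full-length"
    else:
        verdict = "partial"  # e.g. a shared GTP-binding domain only
    return {"identity": pident, "query_coverage": coverage, "verdict": verdict}


# Alignment spans residues 1-300 of a 310-residue query.
print(assess_hit(45.0, 1, 300, 310))
# Alignment spans only residues 10-109 of a 400-residue query.
print(assess_hit(60.0, 10, 109, 400))
```

A "partial" verdict with high identity is the typical signature of a shared domain: it supports a molecular function such as GTP binding but says little about the biological role of the whole protein.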
Databases used in protein
function analysis.
• COGs - Clusters of Orthologous Groups - proteins
that are best hits against each other when
comparing two genomes.
• Pfam - Protein families - more likely to identify
conserved domains rather than full-length proteins
• TIGRfam - strives to find equivalogs - “proteins
that are conserved with respect to FUNCTION
since their last common ancestor”
• SMART - Simple Modular Architecture
Research Tool.
• PROSITE - Protein motifs
• CDD - Conserved Domain Database - linked
to BLAST; includes Pfam, SMART, and COGs.
• InterPro - A database that brings together
many of the above databases so that you can
search them all at once.
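The COG idea described earlier, proteins that are best hits against each other when comparing two genomes, corresponds to a reciprocal best hit test. The sketch below shows only that pairwise test. The real COG procedure involves more than this, and the gene names in the example are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b: dict, best_b_to_a: dict) -> list:
    """Return sorted (gene_a, gene_b) pairs where each gene is the other's
    best hit. Each input maps a gene in one genome to its best hit in the
    other genome."""
    return sorted(
        (a, b) for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )


# Hypothetical best-hit tables for two genomes, A and B.
a_to_b = {"A_gene1": "B_gene7", "A_gene2": "B_gene3", "A_gene3": "B_gene3"}
b_to_a = {"B_gene7": "A_gene1", "B_gene3": "A_gene2"}

print(reciprocal_best_hits(a_to_b, b_to_a))
```

In this example A_gene3 also hits B_gene3 but is not its best hit in return, so it is left out. A pattern like that often points to a paralog rather than an ortholog.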
Bottom line on databases
• Are useful tools in assigning possible
functions.
• Be careful about annotations
– Example: proteins in the same COG can be
orthologs that have evolved different functions.
– Many annotations are not backed up by
experimental data.
– Some databases are automated and have not been
checked for accuracy.
Examples YqeH and DnaA
Protein function
• Molecular function
– YqeH - GTPase
– DnaA - ATPase, DNA binding
• Biological function
– YqeH - Unknown
– DnaA - DNA replication initiation