Lecture_3_2005

Genome databases and webtools for genome analysis
• Become familiar with microbial genome databases
• Use some of the tools for analyzing genomes
• Visit sites used in lab exercise #2
Major components of NCBI
• GenBank
• PubMed
• Entrez
• BLAST
• Conserved Domain Database (CDD)
• Clusters of Orthologous Groups (COGs)
• OMIM
GenBank
• Database of DNA and protein sequences
• Searchable
• Caution: sequences are deposited by the community and are not curated for accuracy.
• RefSeq - curated, non-redundant reference sequences verified by NCBI.
Example of a GenBank record
BLAST
• Basic Local Alignment Search Tool
• Compares nucleotide or protein sequences against sequence databases
• Microbial-specific BLAST page
• Focus of a future lab
OMIM
• Online Mendelian Inheritance in Man.
• Database that links diseases and genes
TIGR
• Comprehensive microbial resource (CMR).
• Many genomes.
• Tools to analyze genomes.
SubtiList
• Website for B. subtilis genome.
• Features
– Annotated genes
– Gene region display
– Updated similarity searches for every protein
– BLAST and pattern search capabilities
– Links to journal articles and protein databases
RDP
• Ribosomal Database Project
• Curated at MSU (Michigan State University)
• Contains a compilation of ribosomal DNA (rRNA gene) sequences (currently over 100,000).
• A second database records the copy number of ribosomal RNA genes.
KEGG
• Kyoto Encyclopedia of Genes and Genomes
• Frequently updated database of gene content, metabolic pathways, etc.
• Excellent resource for reconstructing pathways in an organism of interest.
Genome sequencing and annotation
Week 2 reading assignments - pages 65-79, 110-122. Boxes 2.1, 2.2 and 2.3. Don’t worry about the details of HMMs.
Hughes Functional Genomics Review.
• Sequencing - dideoxy method for DNA sequencing.
• Methods for sequencing genomes.
• Methods for finding and annotating genes in microbial genomes.
Dideoxy sequencing (Sanger method)
• Developed by Frederick Sanger (for which he won his second Nobel Prize in 1980).
Two types of labeling
• Radioactive
– ³²P, ³⁵S
– Each dideoxy base is run in a separate reaction and a separate lane on the gel.
– No longer used
• Fluorescent
– Four different fluorophores, one for each base
– The four reactions can be mixed and run in a single lane.
– Chromatograms - GTSF
Cycle sequencing
Phred
• Method for automated quality assessment of DNA sequence traces. Parameters used:
– Variance in peak spacing in a 7-peak window
– Ratio of largest uncalled peak to smallest called peak in 7-peak and 3-peak windows.
– Number of bases between current base and nearest unresolved base.
• Phred score = -10 x log10(P), where P is the estimated probability that the base call is wrong.
• Phred scores of 20 or higher are considered good calls. Why?
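To see why 20 is used as the cutoff, it helps to turn scores back into error probabilities. The short Python sketch below assumes the standard definition (score = -10 times the base-10 log of P, where P is the probability that the base call is wrong). The function names are made up for illustration and are not part of the Phred program itself.

```python
import math

def phred_to_error_prob(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Tabulate a few common scores; accuracy is simply 1 - P(error).
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):.2f}%")
```

A score of 20 corresponds to an error probability of 0.01, i.e. about one wrong call per 100 bases (99% accuracy). Each additional 10 points lowers the error rate another tenfold.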
Sequencing of genomes
• Hierarchical or contig-based sequencing
– Clone smaller segments of the genome.
– Labor intensive, slow
– Not needed for sequencing microbial genomes
• Shotgun method
– Randomly clone and sequence 1.5-2 kb fragments of DNA to 5-10 fold coverage.
– Computationally intensive.
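The "5-10 fold coverage" figure can be made concrete with a little arithmetic. The sketch below treats coverage as (number of reads x read length) / genome size, and uses the usual Poisson (Lander-Waterman style) approximation that a fraction e^(-coverage) of bases is hit by no read. The 4.2 Mb genome size (roughly B. subtilis) and the 500 bp read length are assumptions chosen for illustration, not numbers from the lecture.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads required to reach the requested average coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

def expected_fraction_unsequenced(coverage):
    """Poisson approximation: chance that a given base is covered by no read."""
    return math.exp(-coverage)

GENOME_SIZE = 4_200_000   # assumed genome size in bp (about the size of B. subtilis)
READ_LENGTH = 500         # assumed usable read length in bp

for cov in (1, 5, 8, 10):
    n = reads_needed(GENOME_SIZE, READ_LENGTH, cov)
    missed = expected_fraction_unsequenced(cov) * GENOME_SIZE
    print(f"{cov}x coverage: {n} reads, about {missed:.0f} bp expected unsequenced")
```

Under these assumptions, low coverage leaves a large part of the genome unread, while 8-10 fold coverage leaves only a small number of bases uncovered. This is one way to see why random sequencing needs several-fold redundancy.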
Finding genes in a genome sequence
• What to look for?
• Glimmer - interpolated Markov model (IMM) algorithm for identifying genes (TIGR).
• ORF finder - NCBI.
• Most automated annotation engines have ORF finding capabilities.
• Gene finding is much more difficult in eukaryotic genomes.
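As a rough illustration of "what to look for", the Python sketch below scans all six reading frames for the simplest signal: an ATG start codon followed in frame by a stop codon. It is a naive ORF scan written for this example, not the algorithm used by Glimmer or NCBI's ORF finder. It also ignores alternative bacterial start codons such as GTG and TTG, ribosome binding sites, and codon statistics. The function names and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=2):
    """Return (strand, start, end, orf) tuples for ORFs in all six frames.

    Coordinates are 0-based and relative to the strand being scanned.
    The ORF runs from the first ATG in a frame to the next in-frame stop
    codon, inclusive; min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs

# Toy example sequence (made up for illustration).
for orf in find_orfs("CCATGAAATTTTAGGG"):
    print(orf)
```

On a real microbial genome a much larger `min_codons` would be used (for example, 100), since short ORFs occur frequently by chance. Eukaryotic genes are harder because introns interrupt the reading frame, so a simple start-to-stop scan like this one does not work.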