MBoMS
Genomics of Model Microbes
Lab 1: Genome databases
Genomics of Model Microbes
Instructor: Peg Riley
[email protected]
Section grade
20% lab notebook
20% homework
60% independent project
Model Microbes and the
Microbial Species Concept
• We are going to employ model microbes
and their genomes to begin to address
the microbial species concept
– What you learned from my presentation is
that such a study requires us to be able to
handle very large databases of sequences
and to compare them between individuals
within a species and between multiple
species
• Our goal is to provide a robust answer to
the question “Do microbes adhere to a
Molecular Databases
• The first step in our study is to learn how to find
the existing molecular databases
– In our case, microbial genomes
• Let’s start with an introduction to molecular
databases
– One of the most challenging tasks faced by molecular
biologists comes after the data has been collected
– Data recording, storage, analysis and interpretation
require the use of sophisticated software and hardware
– For example:
• GenBank holds over 130 billion bases
• Genome Project holds over 1,360 complete/incomplete
genomes
Biological databases
• In this portion of
the course, you will
learn how to
navigate
massive biological
datasets with ease
[Image source: www.bio-pro.de/imperia/ md/images/artikelgebun...]
What is a biological database?
• A large, organized body of
persistent data
– usually associated with
computerized software designed
to update, query, and retrieve
components of the data stored
within the system
Biological database
– A simple database might be a single file
containing many records, each of which
includes the same set of information
• For example, a record associated with a
nucleotide sequence database typically
contains information such as contact
name, the input sequence with a
description of the type of molecule, the
scientific name of the source organism
from which it was isolated, and often,
literature citations associated with the
sequence
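The idea of a single file of records that all share the same fields can be sketched in a few lines of Python. This is only an illustration: the class name `SequenceRecord`, the field names, and the two sample records are hypothetical, and real sequence databases use richer formats than a tab-delimited file.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SequenceRecord:
    """One record; every record carries the same set of fields (hypothetical names)."""
    contact: str
    sequence: str
    molecule_type: str
    organism: str
    citation: str


def write_records(records, handle):
    """Write all records to one tab-delimited file with a header row."""
    names = [f.name for f in fields(SequenceRecord)]
    writer = csv.DictWriter(handle, fieldnames=names, delimiter="\t")
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


def read_records(handle):
    """Read the records back from the tab-delimited file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [SequenceRecord(**row) for row in reader]


# Made-up sample data, used only to show the round trip.
records = [
    SequenceRecord("A. Researcher", "atgagcacag", "mRNA",
                   "Macaca mulatta", "hypothetical citation 1"),
    SequenceRecord("B. Researcher", "atggctaaag", "DNA",
                   "Escherichia coli", "hypothetical citation 2"),
]

buffer = io.StringIO()          # stands in for a file on disk
write_records(records, buffer)
buffer.seek(0)
loaded = read_records(buffer)
print(loaded[0].organism)
```

An in-memory buffer is used instead of a real file so the sketch runs anywhere; replacing it with `open(...)` would give the "single file" database described above.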
GenBank Accession
LOCUS       NM_001042765            1406 bp    mRNA    linear   PRI 11-FEB-2007
DEFINITION  Macaca mulatta alcohol dehydrogenase 1A (class I),
ACCESSION   NM_001042765 XM_001106395
VERSION     NM_001042765.1  GI:112363121
KEYWORDS    .
SOURCE      Macaca mulatta (rhesus monkey)
  ORGANISM  Macaca mulatta
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Cercopithecidae; Cercopithecinae; Macaca.
REFERENCE   1  (bases 1 to 1406)
  AUTHORS   Cheung,B., Holmes,R.S., Easteal,S. and Beacham,I.R.
  TITLE     Evolution of class I alcohol dehydrogenase genes in catarrhine
            primates: gene conversion, substitution rates, and gene
            regulation
  JOURNAL   Mol. Biol. Evol. 16 (1), 23-36 (1999)
   PUBMED   10331249
FEATURES             Location/Qualifiers
     source          1..1406
                     /organism="Macaca mulatta"
                     /mol_type="mRNA"
                     /db_xref="taxon:9544"
                     /chromosome="5"
                     /map="5"
     gene            1..1406
                     /gene="ADH1A"
                     /note="alcohol dehydrogenase 1A (class I),polypeptide"
                     /db_xref="GeneID:707682"
     CDS             24..1151
                     /gene="ADH1A"
                     /EC_number="1.1.1.1"
                     /note="alchohol dehydrogenase 1A"
                     /codon_start=1
                     /product="class I alcohol dehydrogenase, alpha subunit"
                     /protein_id="NP_001036230.1"
                     /db_xref="GI:112363122"
                     /db_xref="GeneID:707682"
                     /translation="MSTAGKVIKCKAAVLWEVMKPFSIEDVEVAPPKAYEVRIKMVTV
                     GICGTDDHVVSGTMVTPLPVILGHEAAGIVESVGEGVTTVEPGDKVIPLALPQCGKCR
                     ICKTPERNYCLKNDVSNPRGTLQDGTSRFTCRGKPIHHFLGVSTFSQYTVVDENAVAK
                     IDAASPMEKVCLIGCGFSTGYGSAVKVAKVTPGSTCAVFGLGGVGLSAVMGCKAAGAA
                     RIIAVDINKDKFAKAKELGATECINPQDYKKPIQEVLKEMTDGGVDFSFEVIGRLDTM
                     MASLLCCHEACGTSVIVGVPPDSQNLSINPMLLLTGRTWKGAVYGGFKSKEDIPKLVA
                     DFMAKKFSLDALITHVLPFEKINEGFDLLRSGKSIRTILTF"
     polyA_site      1406
                     /gene="ADH1A"
ORIGIN
        1 tgcagagaag accagaaacc aacatgagca cagcaggaaa agtaatcaaa tgcaaagcag
       61 ctgtgctatg ggaggtaatg aaaccctttt ccattgagga tgtggaggtt gcacctccta
      121 aggcttatga agttcgcatt aagatggtga ctgtaggaat ctgtggcaca gatgaccacg
      181 tggttagtgg taccatggtg accccacttc ctgtgatttt aggccatgag gcagccggca
      241 ttgtggagag tgttggagaa ggggtgacta cagtcgaacc aggtgataaa gtcatcccac
      301 tcgctcttcc tcagtgtgga aaatgcagaa tttgtaaaac cccggaaagg aactactgct
      361 tgaaaaacga tgtgagcaat cctcggggga ccctgcagga tggcaccagc aggttcacct
      421 gcagggggaa gcccatccac cacttcctcg gtgtcagcac cttctcccaa tacacggtgg
      481 tggatgagaa tgcagtagcc aaaattgacg cagcctcacc catggagaaa gtctgcctta
      541 ttggctgtgg attttcaact ggttatgggt ctgcagtcaa agttgccaag gtcaccccag
      601 gctctacctg tgctgtgttt ggcctgggag gggtcggcct atctgctgtt atgggctgta
      661 aagcagctgg agcagccaga atcattgcgg tggacatcaa caaggacaaa tttgcaaagg
      721 ccaaagagtt gggtgccact gaatgcatca accctcaaga ctacaagaaa cccatccagg
      781 aggtgctaaa ggaaatgact gatggaggtg tggatttttc gtttgaagtc atcggtcggc
      841 ttgataccat gatggcttcc ctgttatgtt gtcatgaggc atgtggcaca agcgtcatcg
      901 taggggtacc tcctgattcc cagaacctct caataaaccc tatgctgcta ctgactggac
      961 gcacctggaa gggggctgtt tatggtggct ttaagagtaa agaagatatc ccaaaacttg
     1021 tggctgattt tatggctaag aagttttcac tggatgcttt aataacccat gttttacctt
     1081 ttgaaaaaat aaatgaaggc tttgacctgc ttcgctctgg gaaaagtatc cgtaccatcc
     1141 tgaccttttg aaacactaga gatgccttcc cttgtacgca gtcttcaggc ctcctctacc
     1201 ctacatgatc tggagcaaca gctgggaaat atcataattc tgctcttcag agatgtattc
     1261 aataaattac acatgggggc tttccaaaga aatggcaaat tgatgggaaa ttatttttca
     1321 agcaaaaatt taaaattcaa gtgagaatta aataaagtgt tgaacatcag ctggggaatt
     1381 gagccaataa actgttcttc tcaacc
//
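The CDS feature of this record gives its location as 24..1151. As a small worked example, the sketch below turns such a simple location string into a length in bases, codons, and amino acid residues. The function names are hypothetical, only plain `start..end` locations are handled (no joins or complements), and the residue count assumes that the last codon of the CDS is a stop codon.

```python
import re


def parse_location(location):
    """Parse a simple GenBank location such as '24..1151' into (start, end)."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def cds_summary(location):
    """Length of a CDS in bases, codons, and residues.

    Assumes the final codon is a stop codon, so residues = codons - 1.
    """
    start, end = parse_location(location)
    n_bases = end - start + 1          # GenBank coordinates are 1-based, inclusive
    if n_bases % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    n_codons = n_bases // 3
    return {"bases": n_bases, "codons": n_codons, "residues": n_codons - 1}


print(cds_summary("24..1151"))
# {'bases': 1128, 'codons': 376, 'residues': 375}
```

Under that assumption the result, 375 residues, can be checked against the length of the /translation string in the record.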
Biological database
– For researchers to benefit from the
data stored in a database, two
additional requirements must be
met:
• easy access to the information
• a method for extracting only that
information needed to answer a
specific biological question
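A minimal sketch of what "extracting only the information needed" can look like, using Python's built-in `sqlite3` module with an in-memory database. The table layout, the helper `accessions_for`, and the two `HYPO_` rows are made up for illustration; only the first row corresponds to the GenBank record shown earlier.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "accession TEXT PRIMARY KEY, organism TEXT, "
    "molecule_type TEXT, length_bp INTEGER)"
)

rows = [
    ("NM_001042765", "Macaca mulatta", "mRNA", 1406),
    ("HYPO_000001", "Escherichia coli", "DNA", 900),    # hypothetical record
    ("HYPO_000002", "Macaca mulatta", "DNA", 2500),     # hypothetical record
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?)", rows)


def accessions_for(connection, organism, molecule_type):
    """Return only the accession and length of records matching the question."""
    cursor = connection.execute(
        "SELECT accession, length_bp FROM records "
        "WHERE organism = ? AND molecule_type = ? ORDER BY accession",
        (organism, molecule_type),
    )
    return cursor.fetchall()


hits = accessions_for(conn, "Macaca mulatta", "mRNA")
print(hits)
# [('NM_001042765', 1406)]
```

The query returns two fields from one matching record rather than the whole table, which is the second requirement listed above.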
There is an enormous diversity
of molecular databases
DNA sequences
Genome sequences
[Image source: www.sanger.ac.uk/.../ 050304_bfragilis-300.jpg]
Protein sequences
[Image source: www.accelrys.com/.../ images/blast_sm.jpg]
RFLP data
[Image source: biologi.uio.no/.../ Images/RFLP.FIG3.gif]
SNPs
Single nucleotide polymorphism
1: rs33873737 [Mus musculus]
GCTTTCCACAGAAGCTTGATGACTT[A/G]AATTCTATCCTTAGAACCCACTTAA
2: rs33871597 [Mus musculus]
TGTAGGCTTTCTGTTGTTTTTGTTG[C/T]TATTGAAGACCAGCCTTAGTCCTTA
3: rs33871077 [Mus musculus]
CATTTCAAATGTTATCCTCTTTCCA[C/T]GACCCCCCCAAACACCCTATCCCAT
4: rs31664163 [Mus musculus]
TGTAATTGGTAAACTTGTAATTTTT[A/T]AAGGAAGATTTGTATATTTTCCCCT
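Each SNP entry above is written as a left flanking sequence, the two alleles in brackets, and a right flanking sequence. The sketch below splits an entry into those parts and labels the substitution as a transition (purine to purine or pyrimidine to pyrimidine) or a transversion. The function names are hypothetical, and only the two-allele, single-base form shown on the slide is handled.

```python
import re

# left flank, [allele1/allele2], right flank
SNP_PATTERN = re.compile(r"^([ACGT]+)\[([ACGT])/([ACGT])\]([ACGT]+)$")
PURINES = {"A", "G"}


def parse_snp(text):
    """Split 'FLANK[X/Y]FLANK' into (left flank, (X, Y), right flank)."""
    match = SNP_PATTERN.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a two-allele SNP entry: {text!r}")
    left, allele1, allele2, right = match.groups()
    return left, (allele1, allele2), right


def substitution_class(allele1, allele2):
    """Transition if both alleles are purines or both are pyrimidines."""
    same_class = (allele1 in PURINES) == (allele2 in PURINES)
    return "transition" if same_class else "transversion"


# Two of the entries from the slide.
entries = {
    "rs33873737": "GCTTTCCACAGAAGCTTGATGACTT[A/G]AATTCTATCCTTAGAACCCACTTAA",
    "rs31664163": "TGTAATTGGTAAACTTGTAATTTTT[A/T]AAGGAAGATTTGTATATTTTCCCCT",
}

for snp_id, text in entries.items():
    left, alleles, right = parse_snp(text)
    print(snp_id, alleles, substitution_class(*alleles))
```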
Molecular structure
[Image source: www.cs.dartmouth.edu/.../ proj/insulinLabels.gif]
Microarrays
[Image source: www.bio-aurum.com]
Gene networks
[Image source: www.ehponline.org/txg/ members/2004/6758/fig3.gif]
Exercise 1
• Introduction to NCBI
– Open NCBI http://www.ncbi.nlm.nih.gov/
– Go to the NCBI introduction
http://www.ncbi.nlm.nih.gov/About/index.html
– Glance at the 7 sections and provide answers to the
tasks and questions listed below
1. NCBI at a glance - what is the mission of NCBI?
2. A science primer - briefly describe each of the science primers
3. Databases and tools - create a list of the databases and related
resources described
4. Human genome resources - what sorts of things can you learn at
the HGR site?
5. Model organisms guide - how many model organisms are there?
6. Outreach and education - what tutorials are offered?
7. News - what is the latest news at NCBI?
Exercise 2
• Use Connexions web-based learning
tools to explore NCBI and other data
repositories
– Bio 533 Bioinformatics by Susan Cates
http://cnx.org/content/col10152/latest/
• Go to the Database lab and complete each of
the 4 sections
1. NCBI orientation
2. Entrez, problems 1-12
3. Protein data bank, problems 1-12
4. Tour of bioinformatics sites, problems 1-2
Exercise 3
• Use NCBI to learn about microbial
genome databases
– Click on “All databases”
– Click on “Genome”
• Become familiar with the genome resources
offered by NCBI
• How many complete genomes are now available?
• Pick one genome you are surprised to see and
explain why
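The exercise itself uses the NCBI web pages. For anyone who later wants to repeat a search from a script, the sketch below only builds a search URL and does not contact NCBI. It assumes the standard E-utilities `esearch.fcgi` endpoint; the function name `esearch_url`, the database name "genome", and the search term are illustrative choices and not part of the exercise.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("genome", "Escherichia coli[Organism]")
print(url)
```

Pasting the printed URL into a browser is one way to see what the service returns before writing any code that depends on it.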
Exercise 4
– Click on “New Microbial Genome Resources”
• Become familiar with the contents
• How many complete microbial genomes are there?
• Do you see a bias in what genomes are available?
– What is gMap?
• What does this tool provide?
• When might it be useful?
• Use Escherichia to view the output of gMap and summarize your
findings
» How many genomes are compared?
» What does hit coverage limit mean?
Exercise 4 cont.
– What is genome ProtMap?
• What does this tool provide?
• When might this feature be useful?
– How many bacterial protein clusters
are known?
• Use COG0539 in ProtMap and summarize
what you find
Exercise 5
• Independent Projects
• You will now be assigned a pair of microbial
species
– Research the basic biology of both species
– Provide a one page summary of what you learn
• Find one genome for each species
– Provide a summary description of these genomes
• Include size, number of proteins encoded, base composition,
etc.
• Determine if there are multiple genomes for each
species
– Provide a one page description of how many genomes
are available, how they differ and how they are similar
• Hint: you may have to do some genome comparisons using the
tools you just learned about!
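Base composition, one of the items requested in the genome summary, is usually reported as counts of A, C, G, and T or as the GC fraction. A minimal sketch follows; the function names are hypothetical, the test sequence is a toy string rather than a real genome, and characters other than A, C, G, and T (such as N) are simply ignored.

```python
from collections import Counter


def base_composition(sequence):
    """Count A, C, G, and T in a sequence (case-insensitive)."""
    counts = Counter(sequence.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_content(sequence):
    """Fraction of counted bases that are G or C."""
    composition = base_composition(sequence)
    total = sum(composition.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (composition["G"] + composition["C"]) / total


toy_sequence = "ATGCATGC"   # toy example, not taken from any genome
print(base_composition(toy_sequence))
print(gc_content(toy_sequence))
# 0.5
```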
Model Microbial Species

Bacterial Species               Number of Complete Genomes
Acinetobacter baumannii         3
Bacillus anthracis              3
Bacillus cereus                 4
Buchnera aphidicola             4
Burkholderia cenocepacia        3
Burkholderia mallei             4
Burkholderia pseudomallei       4
Campylobacter jejuni            5
Chlamydia trachomatis           3
Chlamydophila pneumoniae        4
Clostridium botulinum           6
Clostridium perfringens         3
Corynebacterium glutamicum      3
Ehrlichia ruminantium           3
Escherichia coli                13
Francisella tularensis          7
Haemophilus influenzae          4
Helicobacter pylori             3
Legionella pneumophila          4
Methanococcus maripaludis       4
Mycobacterium tuberculosis      4
Neisseria meningitidis          4
Prochlorococcus marinus         12
Pseudomonas aeruginosa          3
Pseudomonas putida              4
Pseudomonas syringae            3
Rhodobacter sphaeroides         3
Rhodopseudomonas palustris      5
Salmonella enterica             6
Shewanella baltica              3
Staphylococcus aureus           14
Streptococcus agalactiae        3
Streptococcus pneumoniae        4
Streptococcus pyogenes          12
Streptococcus thermophilus      3
Xanthomonas campestris          3
Xylella fastidiosa              3
Yersinia pestis                 7
Yersinia pseudotuberculosis     3
Lab Notebook
• You will create a lab notebook to keep track of
your laboratory exercises and assignments
• For each exercise you should start a new page
and put the exercise “title”, your name and the
date
– Include enough information about the exercise so that
someone who picks up your notebook could repeat
your investigations
• In other words, just labeling the parts 1, 2, etc. won’t do it
– Explain what you are doing and how you do it
• Provide detailed information about the steps you take
– Summarize each exercise with a paragraph about the
most important things that you learned
• We will examine your notebook at each class