Integrated Microbial Genomes

An introduction to metalloenzymes and
biotechnological approaches to studying them
12.755 L10
Urease is an enzyme that catalyzes the hydrolysis of urea into
carbon dioxide and ammonia. The reaction occurs as follows:
(NH2)2CO + H2O → CO2 + 2NH3
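As a quick sanity check on the stoichiometry above, the atoms on each side of the urease reaction can be tallied. This is only an illustrative sketch: the atom counts are typed in by hand from the formulas, and the helper name `total_atoms` is made up for this example.

```python
from collections import Counter

# Atom counts per molecule, written out by hand from the formulas in the reaction.
UREA = Counter({"N": 2, "H": 4, "C": 1, "O": 1})   # (NH2)2CO
WATER = Counter({"H": 2, "O": 1})                  # H2O
CO2 = Counter({"C": 1, "O": 2})                    # CO2
AMMONIA = Counter({"N": 1, "H": 3})                # NH3


def total_atoms(side):
    """Sum atom counts over (coefficient, molecule) pairs for one side of a reaction."""
    total = Counter()
    for coeff, molecule in side:
        for element, count in molecule.items():
            total[element] += coeff * count
    return total


reactants = total_atoms([(1, UREA), (1, WATER)])
products = total_atoms([(1, CO2), (2, AMMONIA)])

# The reaction is balanced when both sides contain the same atoms.
print(reactants == products)
```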
Aconitase
Outline
• Introduction – global BGC to cellular physiology to metalloenzyme and molecular
• Categories of metalloprotein and metalloenzyme functions
• The code: amino acids
• The Genomic Firehose
• Bioinformatic terminology
• Integrated Microbial Genomics Portal
Roles of metal in biology
(From Bioinorganic Chemistry, Lippard and Berg)
Metalloprotein Functions
• Dioxygen Transport
  – Hemoglobin-myoglobin family
  – Hemocyanins
  – Hemerythrins
• Electron Transfer (e.g. nitrogen fixation)
• Structural Roles (zinc fingers)
Metalloenzyme Functions (Note: Metalloenzymes are metalloproteins that perform a catalytic function)
• Hydrolytic Enzymes (Carbonic Anhydrases)
• Two Electron Redox Enzymes (Nitrate Reductase, oxidation of hydrocarbons by P-450)
• Multielectron Pair Redox Enzymes (Cytochrome c, PSII, Nitrogenase)
• Rearrangements (Vitamin B12)
Metalloenzymes in Photosynthesis (From Raven 2000)
Metalloenzymes in carbon fixation
Metalloenzymes in Nitrogen Utilization
Metalloenzymes in the Nitrogen Biogeochemical Cycle
Key enzyme in the nitrification reaction:
ammonia (NH3) → hydroxylamine (NH2OH) → nitrite (NO2-)
Found in ammonia-oxidizing bacteria (AOB) but not the
more abundant ammonia-oxidizing archaea (AOA)
24 hemes (irons) per molecule!
What does nature actually use in the oceans
if this enzyme is not present?
How does a particular amino acid sequence create the function of a
metalloprotein or the activity of a metalloenzyme?
“The sequence itself is not informative; it must be analyzed by comparative methods
against existing databases to develop hypothesis concerning relatives and function.”
Terminology for comparing sequences:
• Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
• Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.
• Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
• Homology: Similarity attributed to descent from a common ancestor. NOTE: it is binary; sequences either are homologous or they are not. Something cannot be “highly homologous”.
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
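To make the difference between identity and conservation concrete, here is a minimal sketch that scores two already-aligned protein sequences. The residue groups in `GROUPS` are an illustrative assumption chosen for this example; BLAST itself decides which substitutions count as "positives" from a substitution matrix such as BLOSUM, not from fixed groups. The function name `compare_aligned` is hypothetical.

```python
# Illustrative physico-chemical groups (an assumption for this sketch;
# BLAST scores positions with a substitution matrix instead).
GROUPS = [
    set("AVLIMC"),  # small / hydrophobic
    set("FWY"),     # aromatic
    set("ST"),      # hydroxyl
    set("NQ"),      # amide
    set("DE"),      # acidic
    set("KRH"),     # basic
    set("G"),
    set("P"),
]


def same_group(a, b):
    """True if residues a and b fall in the same illustrative group."""
    return any(a in group and b in group for group in GROUPS)


def compare_aligned(seq1, seq2):
    """Return (percent identity, percent identical-or-conserved).

    Both inputs must be rows of the same alignment (equal length, '-' for gaps).
    Columns containing a gap are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b)
        for a, b in zip(seq1.upper(), seq2.upper())
        if a != "-" and b != "-"
    ]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    identical = sum(a == b for a, b in columns)
    conserved = sum(a != b and same_group(a, b) for a, b in columns)
    n = len(columns)
    return 100.0 * identical / n, 100.0 * (identical + conserved) / n


# Made-up toy alignment: M/M and V/V are identical; K/R, L/I and D/E are conserved.
identity, similarity = compare_aligned("MKVLD", "MRVIE")
print(identity, similarity)
```

Neither number says anything about homology by itself; as the note above stresses, homology is a yes-or-no inference about common ancestry, for which these scores are only evidence.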
BLAST: Basic Local Alignment Search Tool
• E-value: Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
• In the limit of sufficiently large sequence lengths m and n, the statistics of HSP scores are characterized by two parameters, K and lambda. Most simply, the expected number of HSPs with score at least S is given by the formula
  E = K m n e^{-\lambda S}
  The parameters K and lambda can be thought of simply as natural scales for the search space size and the scoring system respectively.
• We call this the E-value for the score S. This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score.
• Raw Score: The score of an alignment, S, calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (see PAM, BLOSUM). Gap scores are typically calculated as the sum of G, the gap opening penalty, and L, the gap extension penalty. For a gap of length n, the gap cost would be G + Ln. The choice of gap costs G and L is empirical, but it is customary to choose a high value for G (10-15) and a low value for L (1-2).
• HSP: High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.
Sources:
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html
• http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
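The two formulas quoted above, the expected number of HSPs E = K m n e^(-lambda S) and the gap cost G + Ln, can be tried out with a few lines of Python. This is a sketch only: the values of K and lambda used below are made-up placeholders (real values depend on the scoring matrix and gap costs), and the default G and L are simply picked from the ranges quoted in the text.

```python
import math


def expected_hsps(score, m, n, K, lam):
    """E-value for raw score S: E = K * m * n * exp(-lambda * S).

    m and n are the lengths of the two sequences (or query and database);
    K and lam are the statistical parameters described in the text.
    """
    return K * m * n * math.exp(-lam * score)


def gap_cost(length, G=11, L=1):
    """Gap cost G + L*n for a gap of length n (G opens the gap, L extends it)."""
    return G + L * length


# Placeholder parameters, for illustration only.
K, lam = 0.1, 0.3

e_short = expected_hsps(50, m=100, n=1000, K=K, lam=lam)
e_long = expected_hsps(50, m=200, n=1000, K=K, lam=lam)
print(e_long / e_short)   # doubling one sequence length doubles E

print(gap_cost(3))        # one opening penalty plus three extension penalties
```

The exponential term is why E falls off so quickly with score: each additional unit of score multiplies E by the same factor e^(-lambda).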
Program  Description
blastp   Compares an amino acid query sequence against a protein sequence database.
blastn   Compares a nucleotide query sequence against a nucleotide sequence database.
blastx   Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn  Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx  Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.
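The table above amounts to a lookup on two questions: what kind of sequence is the query, and what kind of sequences are in the database? A small sketch of that decision follows; the function name `choose_blast_program` and the type labels "protein" and "nucleotide" are invented for this example and are not part of BLAST.

```python
# (query type, database type) -> program, taken from the table above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type, compare_translations=False):
    """Pick a BLAST program name for the given query and database types.

    Set compare_translations=True to compare six-frame translations of a
    nucleotide query against six-frame translations of a nucleotide database.
    """
    if compare_translations:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and a nucleotide database")
        return "tblastx"
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


print(choose_blast_program("nucleotide", "nucleotide"))   # DNA query, DNA database
print(choose_blast_program("protein", "nucleotide"))      # protein query, DNA database
```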
There are many metalloenzymes, often carrying out crucial cellular
biochemical (and biogeochemical) processes
Enzymes containing metals:
• Superoxide dismutase
• Urease
• Aconitase
• Zinc finger proteins
• Carbonic anhydrase
• Alkaline phosphatase
• DNA polymerase
• Nitrate Reductase
• Multi-copper oxidase
• uvrA (ultraviolet resistance gene)
• Ferredoxin
• Nitrogenase
• Many more…
There are also many proteins and enzymes that are involved in
metal processes (uptake, storage, insertion, transformations, etc.).
Integrated Microbial Genomics
Joint Genome Institute, Department of Energy
The U.S. Department of Energy (DOE) Office of Science supports innovative,
high-impact, peer-reviewed biological science to seek solutions to difficult DOE
mission challenges. These challenges include finding alternative sources of energy,
understanding biological carbon cycling as it relates to global climate change, and
cleaning up environmental wastes.
• Cleanup of toxic-waste sites worldwide.
• Production of novel therapeutic and preventive agents and pathways.
• Energy generation and development of renewable energy sources (e.g., methane and hydrogen).
• Production of chemical catalysts, reagents, and enzymes to improve efficiency of industrial processes.
• Management of environmental carbon dioxide, which is related to climate change.
• Detection of disease-causing organisms and monitoring of the safety of food and water supplies.
• Use of genetically altered bacteria as living sensors (biosensors) to detect harmful chemicals in soil, air, or water.
• Understanding of specialized systems used by microbial cells to live in natural environments with other cells.
http://microbialgenomics.energy.gov/index.shtml
The Integrated Microbial Genomes (IMG) system serves as a community resource
for comparative analysis and annotation of all publicly available genomes from three
domains of life, in a uniquely integrated context.
Go To: http://img.jgi.doe.gov/
Compile list of Organisms
IMG Carts
• Carts are needed since IMG resets your session’s cache when you leave the site.
• Carts are an easy way to save a list of:
  – Organisms (e.g., all cyanobacteria)
  – Genes (e.g., you have a list of genes that code for superoxide dismutase in 16 different organisms)
  – Functions (e.g., you have a list of the most popular metalloenzymes in the form of COG, Pfam, TIGRfam, or EC#)
• Saved as tab delimited text files
Organism Cart (cyanobac)
taxon_oid  Genome Name                                Sequencing Status  Domain    Genes  GC Perc    Bases
641228474  Acaryochloris marina MBIC11017             Finished           Bacteria   8488     0.47  8361599
637000006  Anabaena variabilis ATCC 29413             Finished           Bacteria   5764     0.41  7068601
638341074  Crocosphaera watsonii WH 8501              Draft              Bacteria   6004     0.37  6238156
640612201  Cyanothece sp. CCY 0110                    Draft              Bacteria   6520     0.37  5880532
637000121  Gloeobacter violaceus PCC 7421             Finished           Bacteria   4488     0.62  4659019
640963043  Leptolyngbya valderiana BDU 20041          Draft              Bacteria     12     0.53    89264
639857035  Lyngbya sp. PCC 8106                       Draft              Bacteria   6185     0.41  7037511
639857037  Nodularia spumigena CCY9414                Draft              Bacteria   4904     0.41  5316258
638341137  Nostoc punctiforme PCC 73102               Draft              Bacteria   7818     0.41  9020037
637000199  Nostoc sp. PCC 7120                        Finished           Bacteria   6217     0.41  7211789
640069321  Prochlorococcus marinus AS9601             Finished           Bacteria   1982     0.31  1669886
640753041  Prochlorococcus marinus MIT 9215           Finished           Bacteria   2056     0.31  1738790
640069322  Prochlorococcus marinus MIT 9301           Finished           Bacteria   1963     0.31  1641879
640069323  Prochlorococcus marinus MIT 9303           Finished           Bacteria   3127      0.5  2682675
637000210  Prochlorococcus marinus MIT 9312           Finished           Bacteria   1856     0.31  1709204
637000211  Prochlorococcus marinus MIT 9313           Finished           Bacteria   2345     0.51  2410873
640069324  Prochlorococcus marinus MIT 9515           Finished           Bacteria   1964     0.31  1704176
640069325  Prochlorococcus marinus NATL1A             Finished           Bacteria   2247     0.35  1864731
637000212  Prochlorococcus marinus NATL2A             Finished           Bacteria   1942     0.35  1842899
637000213  Prochlorococcus marinus marinus CCMP1375   Finished           Bacteria   1932     0.36  1751080
637000214  Prochlorococcus marinus pastoris CCMP1986  Finished           Bacteria   1765     0.31  1657990
641228501  Prochlorococcus marinus str. MIT 9211      Finished           Bacteria   1901     0.38  1688963
637000307  Synechococcus elongatus PCC 6301           Finished           Bacteria   2584     0.55  2696255
637000308  Synechococcus elongatus PCC 7942           Finished           Bacteria   2715     0.55  2742269
639857006  Synechococcus sp. BL107                    Draft              Bacteria   2553     0.54  2283377
637000309  Synechococcus sp. CC9311                   Finished           Bacteria   2945     0.52  2606748
637000310  Synechococcus sp. CC9605                   Finished           Bacteria   2756     0.59  2510659
637000311  Synechococcus sp. CC9902                   Finished           Bacteria   2358     0.54  2234828
637000312  Synechococcus sp. JA-2-3Ba(2-13)           Finished           Bacteria   2938     0.58  3046680
637000313  Synechococcus sp. JA-3-3Ab                 Finished           Bacteria   2891      0.6  2932766
640427148  Synechococcus sp. RCC307                   Finished           Bacteria   2583     0.61  2224914
639857007  Synechococcus sp. RS9916                   Draft              Bacteria   3009      0.6  2664465
638341213  Synechococcus sp. RS9917                   Draft              Bacteria   2820     0.64  2579542
638341214  Synechococcus sp. WH 5701                  Draft              Bacteria   3401     0.65  3043834
640427149  Synechococcus sp. WH 7803                  Finished           Bacteria   2586      0.6  2366980
638341215  Synechococcus sp. WH 7805                  Draft              Bacteria   2938     0.58  2620367
637000314  Synechococcus sp. WH 8102                  Finished           Bacteria   2586     0.59  2434428
637000315  Synechocystis sp. PCC 6803                 Finished           Bacteria   3626     0.47  3947019
637000320  Thermosynechococcus elongatus BP-1         Finished           Bacteria   2554     0.54  2593857
637000329  Trichodesmium erythraeum IMS101            Finished           Bacteria   5124     0.34  7750108
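Because carts are saved as tab-delimited text, they are easy to read back into a script. The sketch below parses a three-row excerpt of the organism cart with Python's standard `csv` module. The exact header names in a real IMG export are an assumption here (they are copied from the column labels on the slide), and `read_cart` is a hypothetical helper name.

```python
import csv
import io

# Three rows from the organism cart above, written as tab-delimited text.
# The header names are assumed to match the column labels shown on the slide.
CART_TEXT = (
    "taxon_oid\tGenome Name\tSequencing Status\tDomain\tGenes\tGC Perc\tBases\n"
    "641228474\tAcaryochloris marina MBIC11017\tFinished\tBacteria\t8488\t0.47\t8361599\n"
    "638341074\tCrocosphaera watsonii WH 8501\tDraft\tBacteria\t6004\t0.37\t6238156\n"
    "637000314\tSynechococcus sp. WH 8102\tFinished\tBacteria\t2586\t0.59\t2434428\n"
)


def read_cart(text):
    """Parse tab-delimited cart text into a list of dicts with numeric columns converted."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["Genes"] = int(row["Genes"])
        row["GC Perc"] = float(row["GC Perc"])
        row["Bases"] = int(row["Bases"])
        rows.append(row)
    return rows


genomes = read_cart(CART_TEXT)

# Names of the finished genomes in the excerpt.
finished = [g["Genome Name"] for g in genomes if g["Sequencing Status"] == "Finished"]
print(finished)

# Gene density (genes per million bases) for each genome.
for g in genomes:
    print(g["Genome Name"], round(g["Genes"] / (g["Bases"] / 1e6)))
```

To read a saved cart file instead of the inline excerpt, the same `csv.DictReader` call can be given an open file handle.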
Gene cart (Cu/Zn superoxide dismutase)
gene_oid   Locus Tag       Gene Symbol  Product Name                                AA Seq Length  Genome
641254312  AM1_5239        sodCC        copper/zinc superoxide dismutase            196            Acaryochloris marina MBIC11017
637459373  glr1981         -            similar to superoxide dismutase             233            Gloeobacter violaceus PCC 7421
637459565  glr2170         -            similar to superoxide dismutase             191            Gloeobacter violaceus PCC 7421
640015250  L8106_24545     -            superoxide dismutase                        201            Lyngbya sp. PCC 8106
639885006  BL107_14050     -            putative superoxide dismutase               198            Synechococcus sp. BL107
638115359  sync_1771       sodC         Copper/zinc superoxide dismutase            175            Synechococcus sp. CC9311
637776096  Syncc9605_1507  -            superoxide dismutase precursor (Cu-Zn)      178            Synechococcus sp. CC9605
637771156  Syncc9902_0982  -            putative superoxide dismutase               175            Synechococcus sp. CC9902
640545246  SynRCC307_0325  sodC         Superoxide dismutase [Cu-Zn] (EC:1.15.1.1)  175            Synechococcus sp. RCC307
639889548  RS9916_26849    -            superoxide dismutase precursor (Cu-Zn)      177            Synechococcus sp. RS9916
640543304  SynWH7803_0951  sodC         Superoxide dismutase [Cu-Zn] (EC:1.15.1.1)  174            Synechococcus sp. WH 7803
639020551  WH7805_01302    -            putative superoxide dismutase               174            Synechococcus sp. WH 7805
Function Cart (metalloenzymes)
func_id    func_name
COG0619    ABC-type cobalt transport system, permease component CbiQ and related transporters
COG1122    ABC-type cobalt transport system, ATPase component
COG1930    ABC-type cobalt transport system, periplasmic component
COG2032    Cu/Zn superoxide dismutase
COG2140    Thermophilic glucose-6-phosphate isomerase and related metalloenzymes
COG3227    Zinc metalloprotease (elastase)
COG4097    Predicted ferric reductase
COG4300    Predicted permease, cadmium resistance protein
pfam01676  Metalloenzyme
pfam01794  Ferric_reduct
pfam02022  Integrase_Zn
pfam02361  CbiQ
pfam02553  CbiN
pfam02742  Fe_dep_repr_C
pfam03596  Cad
In Class Exercise on IMG:
http://img.jgi.doe.gov
• Load genomes
  – Go to “FIND GENOMES”
  – Click “VIEW PHYLOGENETICALLY”
  – Click “CLEAR ALL” to unselect all genomes
  – Click “ALL” after Cyanobacteria listings to select all Cyanobacterial genomes
  – Click “SAVE SELECTIONS” to choose only these selected Cyanobacterial genomes. Note that at the top it should now say 40 genomes selected.
• Gene Search for Superoxide Dismutase, using “FIND GENES” function
  – By “GENE SEARCH”: type in superoxide dismutase and hit search. Note that this will only return genes that have been “annotated” as a superoxide dismutase by a previous computer or human annotator. Go ahead and grab the sequence for Synechococcus strain WH8102’s nickel superoxide dismutase by clicking on the 474bp and copying the sequence to the clipboard (highlight the area and hit control-C). Note that this is the DNA sequence.
  – Click the “FIND GENES” tab and then the “BLAST” tab: paste the nickel superoxide dismutase sequence into the open box.
  – Choose BLASTn for nucleotide (DNA) search
  – Set the cutoff value to 1e-2 (less stringent).
  – Note that the best hit is where you got the sequence from.
  – Repeat, but now with the amino acid sequence instead of the DNA sequence
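The last step swaps the DNA sequence for its amino acid sequence. IMG supplies the protein sequence directly, but the relationship between the two can be sketched with a minimal translator using the standard genetic code. The function name `translate` and the short test sequence are made up for illustration; the sketch reads a single frame, stops at the first stop codon, and does not handle alternative bacterial start codons.

```python
# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA sequence (frame 1) into protein, stopping at the first stop codon."""
    dna = "".join(dna.split()).upper()   # drop whitespace and line breaks from a pasted sequence
    if len(dna) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    protein = []
    for pos in range(0, len(dna), 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")   # X for codons with ambiguous bases
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example: ATG GCC TGG TAA
print(translate("ATGGCCTGGTAA"))
```

A 474 bp gene is 158 codons, so if the final codon is a stop codon the protein would be 157 amino acids long. Note that a protein query can only be searched against a protein database (blastp) or a translated nucleotide database (tblastn), not with BLASTn.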
Blast results
Sequences producing significant alignments:                          Score (bits)  E-Value
637000314.NC_005070 Synechococcus sp. WH 8102, complete genome.      940           0.0
637000310.NC_007516 Synechococcus sp. CC9605, complete genome.       389           e-105
639857006.NZ_AATZ01000003 Synechococcus sp. BL107, unfinished se...  311           3e-82
637000311.NC_007513 Synechococcus sp. CC9902, complete genome.       287           4e-75
637000309.NC_008319 Synechococcus sp. CC9311, complete genome.       208           3e-51
640069323.NC_008820 Prochlorococcus marinus str. MIT 9303, compl...  168           3e-39
637000211.NC_005071 Prochlorococcus marinus str. MIT 9313, compl...  153           2e-34
640963030.NZ_ABCS01000039 Plesiocystis pacifica SIR-1, unfinishe...  68            7e-09
637000213.NC_005042 Prochlorococcus marinus subsp. marinus str. ...  54            1e-04
641228501.NC_009976 Prochlorococcus marinus str. MIT 9211, compl...  48            0.007
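The effect of the E-value cutoff can be seen by filtering this hit list at different thresholds. The sketch below uses the identifiers, bit scores, and E-values from the results above. One detail is an assumption about the output format: the E-value printed as "e-105" is read as 1e-105. The names `parse_evalue` and `filter_hits` are hypothetical.

```python
def parse_evalue(text):
    """Convert an E-value string to a float, reading a bare 'e-105' as 1e-105."""
    text = text.strip()
    if text.lower().startswith("e"):
        text = "1" + text
    return float(text)


# (hit identifier, bit score, E-value as printed) from the results above.
HITS = [
    ("637000314.NC_005070", 940, "0.0"),
    ("637000310.NC_007516", 389, "e-105"),
    ("639857006.NZ_AATZ01000003", 311, "3e-82"),
    ("637000311.NC_007513", 287, "4e-75"),
    ("637000309.NC_008319", 208, "3e-51"),
    ("640069323.NC_008820", 168, "3e-39"),
    ("637000211.NC_005071", 153, "2e-34"),
    ("640963030.NZ_ABCS01000039", 68, "7e-09"),
    ("637000213.NC_005042", 54, "1e-04"),
    ("641228501.NC_009976", 48, "0.007"),
]


def filter_hits(hits, cutoff=1e-2):
    """Keep hits whose E-value is at or below the cutoff."""
    kept = []
    for name, bits, evalue_text in hits:
        evalue = parse_evalue(evalue_text)
        if evalue <= cutoff:
            kept.append((name, bits, evalue))
    return kept


print(len(filter_hits(HITS, cutoff=1e-2)))   # the exercise's less stringent cutoff
print(len(filter_hits(HITS, cutoff=1e-5)))   # a more stringent cutoff drops the weakest hits
```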
The COG database: new developments in phylogenetic classification of proteins from complete
genomes
Roman L. Tatusov, Darren A. Natale, Igor V. Garkavtsev, Tatiana A. Tatusova, Uma T. Shankavaram,
Bachoti S. Rao, Boris Kiryutin, Michael Y. Galperin, Natalie D. Fedorova, and Eugene V. Koonin
National Center for Biotechnology Information, National Library of Medicine, National Institutes
of Health, Bethesda, MD 20894, USA
The database of Clusters of Orthologous Groups of
proteins (COGs), which represents an attempt on a
phylogenetic classification of the proteins encoded in
complete genomes, currently consists of 2791 COGs
including 45 350 proteins from 30 genomes of bacteria,
archaea and the yeast Saccharomyces cerevisiae
(http://www.ncbi.nlm.nih.gov/COG). In addition, a
supplement to the COGs is available, in which proteins
encoded in the genomes of two multicellular eukaryotes,
the nematode Caenorhabditis elegans and the fruit fly
Drosophila melanogaster, and shared with bacteria
and/or archaea were included. The new features added
to the COG database include information pages with
structural and functional details on each COG and
literature references, improvements of the COGNITOR
program that is used to fit new proteins into the COGs,
and classification of genomes and COGs constructed by
using principal component analysis.
Growth dynamics of the COG set with the increase of number of included genomes.
The circles show the sequence of genome inclusion according to the actual order of
sequencing, and the smooth line shows the mean of 10^6 random permutations of the
genome order. The colored area indicates the range between the maximal and
minimal value for each point (number of genomes) in 10^6 random permutations.
Nucleic Acids Res. 2001 January 1; 29(1): 22–28.
End for today