BioInformatics at FSU - whose job is it and why it needs

Special Topics BSC4933/5936:
An Introduction to Bioinformatics.
Florida State University
The Department of Biological Science
www.bio.fsu.edu
BioInformatics Databases
Steven M. Thompson
Florida State University School of
Computational Science (SCS)
NCBI’s
Entrez
But first some of my definitions, lots of overlap —
Biocomputing and computational biology are synonyms and
describe the use of computers and computational techniques
to analyze any type of a biological system, from individual
molecules to organisms to overall ecology.
Bioinformatics describes using computational techniques to
access, analyze, and interpret the biological information in
any type of biological database.
Sequence analysis is the study of molecular sequence data for
the purpose of inferring the function, interactions, evolution,
and perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes
(the total DNA content of an organism) within the same and/or
across different genomes.
Proteomics is the subdivision of genomics concerned with
analyzing the complete protein complement, i.e. the proteome,
of organisms, both within and between different organisms.
One way to think about the field —
The Reverse Biochemistry Analogy.
Biochemists no longer have to begin a research project by
isolating and purifying massive amounts of a protein from
its native organism in order to characterize a particular
gene product. Rather, now scientists can amplify a
section of some genome based on its similarity to other
genomes, sequence that piece of DNA and, using
sequence analysis tools, infer all sorts of functional,
evolutionary, and, perhaps, structural insight into that
stretch of DNA!
The computer and molecular databases are a
necessary, integral part of this entire process.
The exponential growth of molecular sequence
databases & cpu power —
Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
doubling time ~ one year
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
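As a rough check on the "doubling time ~ one year" figure, the sketch below estimates the average doubling time from the 1982 and 2003 base-pair totals in the GenBank table. It assumes steady exponential growth over the whole period, which is a simplification, and the function name is only illustrative.

```python
import math


def doubling_time(n_start, n_end, years):
    """Average doubling time, assuming N(t) = N0 * 2 ** (t / T)."""
    return years * math.log(2) / math.log(n_end / n_start)


# Base-pair totals taken from the table: first and last rows.
bp_1982 = 680_338
bp_2003 = 36_553_368_485

t_double = doubling_time(bp_1982, bp_2003, 2003 - 1982)
print(f"average doubling time: {t_double:.2f} years")
```

Averaged over the full 21 years the estimate comes out a little over one year; individual years in the table grow faster or slower than that.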
Database Growth (cont.) —
The Human Genome Project and numerous smaller
genome projects have kept the data coming at alarming
rates. As of December 2004, almost 240 complete
genomes are publicly available for analysis, not
counting all the virus and viroid genomes available.
The International Human Genome Sequencing
Consortium announced the completion of the "Working
Draft" of the human genome in June 2000.
Independently, that same month, the private company
Celera Genomics announced that it had completed the
first “Assembly” of the human genome. The two groups'
articles were published in mid-February 2001 in the journals
Nature and Science.
Some neat stuff from the papers —
We, Homo sapiens, aren’t nearly as special as
we had hoped we were. Of the 3.2 billion
base pairs in our DNA:
Traditional textbook estimates of the number of genes
were often in the 100,000 range; it turns out we’ve only
got about twice as many as a fruit fly, between 25,000 and
35,000!
The protein-coding region of the genome is only about
1% or so; a bunch of the remainder is ‘jumping’
‘selfish DNA,’ much of which may be involved in
regulation and control.
Some 100-200 genes appeared to have been transferred from an ancestral
bacterial genome to an ancestral vertebrate genome!
(Later shown to be not true by more extensive analyses, and to
be due to gene loss rather than transfer.)
What are sequence databases?
These databases are an organized way to store the tremendous
amount of sequence information accumulating worldwide. Most have
their own specific format. An ‘alphabet soup’ of three major database
organizations around the world are responsible for maintaining most
of this data. They largely ‘mirror’ one another and share accession
codes, but NOT proper identifier names:
North America: the National Center for Biotechnology Information (NCBI),
a division of the National Library of Medicine (NLM), at the National
Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown
University’s National Biomedical Research Foundation (NBRF) Protein
Information Resource (PIR) & NRL_3D (Naval Research Lab
sequences of known three-dimensional structure).
Europe: the European Molecular Biology Laboratory (EMBL), the
European Bioinformatics Institute (EBI), and the Swiss Institute of
Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy), all help
maintain the EMBL Nucleotide Sequence Database, and the SWISS-PROT
& TrEMBL amino acid sequence databases.
Asia: The National Institute of Genetics (NIG) supports the Center for
Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).
A little history —
Developments that affect software and the end user —
The first well recognized sequence database was Dr. Margaret Dayhoff’s
hardbound Atlas of Protein Sequence and Structure begun in the
mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980.
They are all attempts at establishing an organized, reliable,
comprehensive and openly available library of genetic sequences.
Databases have long-since outgrown a hardbound atlas. They have
become huge and have evolved through many changes with many more
yet to come.
Changes in format over the years are a major source of grief for software
designers and program users. Each program needs to be able to
recognize particular aspects of the sequence files; whenever they
change it throws a wrench in the works. NCBI’s ASN.1 format and its
Entrez interface attempt to circumvent some of these frustrations.
However, database format is much debated as many bioinformaticians
argue for relational or object-oriented standards. Unfortunately, until all
biologists and computer scientists worldwide agree on one standard and
all software is (re)written to that standard, neither of which is likely to
happen very quickly, format issues will remain probably the most
confusing and troubling aspect of working with primary sequence data.
So what are these databases like?
Just what are primary sequences?
(Central Dogma: DNA —> RNA —> protein)
Primary refers to one dimension — all of the ‘symbol’ information
written in sequential order necessary to specify a particular
biological molecular entity, be it polypeptide or nucleotide.
The symbols are the one letter codes for all of the biological
nitrogenous bases and amino acid residues and their ambiguity
codes. Biological carbohydrates, lipids, and structural and
functional information are not sequence data. Not even DNA
translations in a DNA database!
However, much of this feature and bibliographic type information
is available in the reference documentation sections associated
with primary sequences in the databases.
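To make the idea of one-letter symbols and ambiguity codes concrete, here is a minimal sketch that guesses whether a string is nucleotide or protein sequence from its alphabet. The symbol sets are the commonly used IUPAC one-letter codes; the function name and the decision rule are illustrative assumptions, not part of any database specification.

```python
# Nucleotide codes: the bases plus their IUPAC ambiguity codes.
NUCLEOTIDE = set("ACGTU" "RYSWKMBDHVN")
# Amino acid codes: the twenty standard residues plus B, Z and X.
AMINO_ACID = set("ACDEFGHIKLMNPQRSTVWY" "BZX")


def guess_type(seq):
    """Return 'nucleotide', 'protein' or 'unknown' for a raw sequence string."""
    # Ignore whitespace and the position numbers found in flat-file entries.
    symbols = set(seq.upper()) - set(" \t\n0123456789")
    if not symbols:
        raise ValueError("no sequence symbols found")
    if symbols <= NUCLEOTIDE:
        return "nucleotide"
    if symbols <= AMINO_ACID:
        return "protein"
    return "unknown"


print(guess_type("acgggtttgc cgccagaaca"))
print(guess_type("MGKEKTHINI VVIGHVDSGK"))
```

Because the two alphabets overlap, a short peptide made only of letters such as A, C, G and T would be reported as nucleotide; real tools usually also look at composition or at the annotation.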
Content & Organization —
Sequence database installations are commonly a complex
ASCII/Binary mix, usually not relational or Object Oriented (but
proprietary ones often are). They’ll contain several very long
text files each containing different types of information all
related to particular sequences, such as all of the sequences
themselves, versus all of the title lines, or all of the reference
sections. Binary files often help ‘glue together’ all of these
other files by providing indexing functions.
Software is usually required to successfully interact with these
databases and access is most easily handled through various
software packages and interfaces, either on the World Wide
Web or otherwise.
More organization stuff —
Nucleic acid sequence databases (and TrEMBL) are split into
subdivisions based on taxonomy (historical rankings — the Fungi
warning!). PIR is split into subdivisions based on level of
annotation. TrEMBL sequences are merged into SWISS-PROT
as they receive increased levels of annotation.
Nucleic Acid DB’s:
  GenBank/EMBL/DDBJ
    all Taxonomic categories + HTC’s, HTG’s, & STS’s
    “Tags”: EST’s, GSS’s
Amino Acid DB’s:
  SWISS-PROT
  TrEMBL
  PIR: PIR1, PIR2, PIR3, PIR4
  NRL_3D
  GenPept
Parts and problems —
All sequence databases contain these elements:
Name: LOCUS, ENTRY, ID all are unique identifiers
Definition: A brief, one-line, textual sequence description.
Accession Number: A constant data identifier.
Source and taxonomy information.
Complete literature references.
Comments and keywords.
The all important FEATURE table!
A summary or checksum line.
The sequence itself.
But:
Each major database as well as each major suite of software tools
that you are likely to use has its own distinct format requirements.
This can be a huge problem and an enormous time sink, even with
helpful tools such as Don Gilbert’s ReadSeq. Therefore, becoming
familiar with some of the common formats is a big help. Look for key
features of each type of entry:
GenBank and GenPept format —
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//
Look for “LOCUS,” “FEATURES,” “ORIGIN,” the sequence itself, and then “//.”
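The landmarks listed for GenBank format are enough to pull the basics out of an entry. The sketch below is a deliberately minimal reader, assuming one entry per string and keyword lines that start in the first column; the function name and the abbreviated example record are illustrative, and the feature table is ignored entirely.

```python
def parse_genbank(text):
    """Extract locus name, accession and sequence from one GenBank-style entry."""
    record = {"locus": None, "accession": None, "sequence": ""}
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_origin:
            # Sequence lines: a position number, then blocks of ten bases.
            chunks.extend(part for part in line.split() if part.isalpha())
        elif line.startswith("LOCUS"):
            record["locus"] = line.split()[1]
        elif line.startswith("ACCESSION"):
            record["accession"] = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    record["sequence"] = "".join(chunks)
    return record


# Abbreviated entry for illustration only; not a complete record.
EXAMPLE = """\
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
ACCESSION   X03558
ORIGIN
        1 acgggtttgc cgccagaaca
//
"""

rec = parse_genbank(EXAMPLE)
print(rec["locus"], rec["accession"], rec["sequence"])
```

For real work an established parser is the safer choice, since actual entries have many more line types and edge cases than this sketch handles.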
EMBL and SWISS-PROT format —
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//
Look for “ID,” “FT,” “SQ,” the sequence, and then “//.”
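Because every EMBL or SWISS-PROT line starts with a two-letter code, a reader can simply group lines by that code. The following is a minimal sketch under that assumption; the function name and the abbreviated example entry are illustrative, and continuation logic (for example joining multi-line CC blocks) is left out.

```python
from collections import defaultdict


def parse_swissprot(text):
    """Group the lines of one entry by two-letter code; return (fields, sequence)."""
    fields = defaultdict(list)
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-entry marker
            break
        code, content = line[:2], line[5:].strip()
        if in_sequence:
            # Sequence lines are indented blocks of ten residues.
            seq_parts.append(line.replace(" ", ""))
        elif code == "SQ":
            fields["SQ"].append(content)
            in_sequence = True
        elif code.strip():
            fields[code].append(content)
    return dict(fields), "".join(seq_parts)


# Abbreviated entry for illustration only; not a complete record.
EXAMPLE = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK
//
"""

fields, seq = parse_swissprot(EXAMPLE)
print(fields["ID"][0].split()[0], fields["AC"], seq)
```

Keeping each code's lines in a list preserves their order, so repeated codes such as DR or FT can be processed one line at a time afterwards.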
PIR CODATA and NBRF formats —
ENTRY        EFHU1 #type complete
             iProClass View of EFHU1
TITLE        translation elongation factor eEF-1 alpha-1 chain - human
(Annotation abridged here)
FEATURE
   1-223             #domain eEF-1 alpha domain I, GTP-binding #status
                     predicted #label EF1\
   8-156             #domain translation elongation factor Tu homology
                     #label ETU\
   14-21             #region nucleotide-binding motif A (P-loop)\
   153-156           #region GTP-binding NKXD motif\
   245-330           #domain eEF-1 alpha domain II, tRNA-binding
                     #status predicted #label EF2\
   332-462           #domain eEF-1 alpha domain III, tRNA-binding
                     #status predicted #label EF3\
   36,55,79,165,318  #modified_site N6,N6,N6-trimethyllysine (Lys)
                     #status predicted\
   301,374           #binding_site glycerylphosphorylethanolamine
                     (Glu) (covalent) #status predicted
SUMMARY      #length 462 #molecular_weight 50141
SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K
Look for “ENTRY” and “SEQUENCE” with numbers for CODATA;
>P1;EFHU1
pir1:efhu1 => EFHU1
MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG
KGSFKYAWVL DKLKAERERG ITIDISLWKF ETSKYYVTII DAPGHRDFIK
NMITGTSQAD CAVLIVAAGV GEFEAGISKN GQTREHALLA YTLGVKQLIV
GVNKMDSTEP PYSQKRYEEI VKEVSTYIKK IGYNPDTVAF VPISGWNGDN
MLEPSANMPW FKGWKVTRKD GNASGTTLLE ALDCILPPTR PTDKPLRLPL
QDVYKIGGIG TVPVGRVETG VLKPGMVVTF APVNVTTEVK SVEMHHEALS
EALPGDNVGF NVKNVSVKDV RRGNVAGDSK NDPPMEAAGF TAQVIILNHP
GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
VTKSAQKAQK AK*
C;P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
C;N;Alternate names: translation elongation factor Tu
C;Species: Homo sapiens (man)
C;Date: 30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change 19-Jan-2001
C;Accession: B24977; A25409; A29946; A32863; I37339
C;R;Rao, T.R.; Slobin, L.I. . . .
“>P1;” name, then definition line, then sequence, then annotation “C;” for NBRF protein format.
Pearson FastA format —
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK
Look for “>” name, start of definition line. Only one annotation line allowed!
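FastA is simple enough that a reader fits in a few lines: a “>” line starts a record, its first word is the name, the rest of that line is the single annotation line, and everything up to the next “>” is sequence. The sketch below follows that description; the function name is illustrative and the example is a shortened version of the entry shown.

```python
def read_fasta(text):
    """Return a list of (name, description, sequence) tuples from FastA text."""
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:           # close the previous record
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)            # sequence may span many lines
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


EXAMPLE = """\
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGK
STTTGHLIYK
"""

records = read_fasta(EXAMPLE)
print(records)
```

The same function handles files with many entries, since each new “>” line closes the record before it.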
GCG single sequence format —
!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
EFHU1 Length: 462 January 14, 2002 19:49 Type: P Check: 5308 ..
  1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451 VTKSAQKAQK AK
Look for “!!” sequence type, then annotation, then sequence
identifier name on the checksum line, then the sequence itself.
!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

 Name: a49171   Len: 425  Check:  537  Weight: 1.00
 Name: e70827   Len: 577  Check:   21  Weight: 1.00
 Name: g83052   Len: 718  Check: 9535  Weight: 1.00
 Name: f70556   Len: 534  Check: 3494  Weight: 1.00
 Name: t17237   Len: 229  Check: 9552  Weight: 1.00
 Name: s65758   Len: 735  Check:  111  Weight: 1.00
 Name: a46241   Len: 274  Check: 3514  Weight: 1.00

//
……………
The other GCG formats — but these hold
more than one sequence at a time.
!!RICH_SEQUENCE 1.0
..
{
name          ef1a_giala
descrip       PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type          PROTEIN
longname      /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID   Q08046
checksum      7342
offset        23
creation-date 07/11/2001 16:51:19
strand        1
comments      …………….
This is SeqLab’s native format.
Specialized ‘sequence’ -type DB’s —
Databases that contain special types of sequence
information, such as patterns, motifs, and profiles.
These include: REBASE, EPD, PROSITE, BLOCKS,
ProDom, Pfam . . . .
Databases that contain multiple sequence entries
aligned, e.g. RDP and ALN.
Databases that contain families of sequences ordered
functionally, structurally, or phylogenetically, e.g.
iProClass and HOVERGEN.
Databases of species specific sequences, e.g. the HIV
Database and the Giardia lamblia Genome Project.
And on and on . . . . See Amos Bairoch’s excellent links
page: http://us.expasy.org/alinks.html and the
wonderful Ensembl genome project at
http://www.ensembl.org/ that tries to tie it all together.
What about other types of biological databases?
Three dimensional structure databases:
the Protein Data Bank and Rutgers Nucleic Acid Database.
These databases contain all of the 3D atomic coordinate data
necessary to define the tertiary shape of a particular biological
molecule. The data is usually experimentally derived, either by
X-ray crystallography or with NMR, but sometimes it is a
hypothetical model. In all cases the source of the structure and
its resolution are clearly indicated.
Secondary structure boundaries, sequence data, and reference
information are often associated with the coordinate data, but it
is the 3D data that really matters, not the annotation.
Molecular visualization or modeling software is required to
interact with the data. It has little meaning on its own. See
Molecules to Go at http://molbio.info.nih.gov/cgi-bin/pdb/ .
Other types of Biological DB’s —
Still more; these can be considered ‘non-molecular’:
Genomic linkage mapping databases for most large genome projects
(w/ pointers to sequences) — H. sapiens, Mus, Drosophila, C.
elegans, Saccharomyces, Arabidopsis, E. coli, . . . .
Reference Databases (also w/ pointers to sequences): e.g.
OMIM — Online Mendelian Inheritance in Man
PubMed/MedLine — over 11 million citations from more than 4
thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s
GenomeNet KEGG (the Kyoto Encyclopedia of Genes and
Genomes).
Population studies data — which strains, where, etc.
And then databases that many biocomputing people don’t even usually
consider:
e.g. GIS/GPS/remote sensing data, medical records, census counts,
mortality and birth rates . . . .
So how do you access and manipulate all this data?
Often on the InterNet over the World Wide Web:
Site                         URL (Uniform Resource Locator)             Content
Nat’l Center Biotech' Info'  http://www.ncbi.nlm.nih.gov/               databases/analysis/software
PIR/NBRF                     http://www-nbrf.georgetown.edu/            protein sequence database
IUBIO Biology Archive        http://iubio.bio.indiana.edu/              database/software archive
Univ. of Montreal            http://megasun.bch.umontreal.ca/           database/software archive
Japan's GenomeNet            http://www.genome.ad.jp/                   databases/analysis/software
European Mol' Bio' Lab'      http://www.embl-heidelberg.de/             databases/analysis/software
European Bioinformatics      http://www.ebi.ac.uk/                      databases/analysis/software
The Sanger Institute         http://www.sanger.ac.uk/                   databases/analysis/software
Univ. of Geneva BioWeb       http://www.expasy.ch/                      databases/analysis/software
ProteinDataBank              http://www.rcsb.org/pdb/                   3D mol' structure database
Molecules to Go              http://molbio.info.nih.gov/cgi-bin/pdb/    3D protein/nuc' visualization
The Genome DataBase          http://www.gdb.org/                        The Human Genome Project
Stanford Genomics            http://genome-www.stanford.edu/            various genome projects
Inst. for Genomic Res’rch    http://www.tigr.org/                       esp. microbial genome projects
HIV Sequence Database        http://hiv-web.lanl.gov/                   HIV epidemiology seq' DB
The Tree of Life             http://tolweb.org/tree/phylogeny.html      overview of all phylogeny
Ribosomal Database Proj’     http://rdp.cme.msu.edu/index.jsp           databases/analysis/software
PUMA2 at Argonne             http://compbio.mcs.anl.gov/puma2/cgi-bin/  metabolic reconstruction
Harvard Bio' Laboratories    http://golgi.harvard.edu/                  nice bioinformatics links list
With a World Wide Web browser and tools like NCBI’s Entrez & EMBL’s SRS
Advantage: Can access the very latest updates. It’s fun and
very fast. It can be very powerful and efficient, if you know
what you’re doing.
Disadvantage: Can be very inefficient, if you don’t know what
you’re doing. Also format hassles, and . . . very easy to get
lost and/or distracted in cyberspace!
Additionally problems sometimes arise with the Net, like bad
connections. So what are some of the alternatives . . . ?
Desktop software solutions — public domain programs are
available, but . . . complicated to install, configure, and maintain.
User must be pretty computer savvy. So,
commercial software packages are available, e.g. Sequencher,
MacVector, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per machine, and Internet
and/or CD database access all complicate matters!
Therefore, server-based solutions — we’re talking
UNIX server computers here.
Again public domain programs exist. But now a VERY
cooperative systems manager needs to install, configure, and
maintain the system. Therefore a commercial package, e.g.
the Wisconsin Package, is often used to simplify matters.
One commercial license fee for an entire institution and very fast,
convenient database access on local server disks.
Connections from any networked terminal or workstation
anywhere!
Within the GCG suite, LookUp is an SRS derivative used to find a
sequence of interest from local GCG server databases.
Advantage: Search output is a legitimate GCG list file, appropriate
input to other GCG programs; no need to reformat — all GCG.
Disadvantage: DB’s only as new as administrator maintains them.
The Genetics Computer Group —
the Wisconsin Package for Sequence Analysis.
Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept.
at the University of Wisconsin, Madison, then a private
company for over 10 years, then acquired by the Oxford
Molecular Group U.K., and now owned by Pharmacopeia
U.S.A. under the new name Accelrys, Inc.
The suite contains almost 150 programs designed to work in
a "toolbox" fashion. Several simple programs used in
succession can lead to sophisticated results.
Also 'internal compatibility,' i.e. once you learn to use one
program, all programs can be run similarly, and, the
output from many programs can be used as input for
other programs.
Used all over the world by more than 30,000 scientists at
over 530 institutions in 35 countries, so learning it here
will most likely be useful anywhere else you may end up.
To answer the always perplexing GCG question — “What
sequence(s)? . . . .”
Specifying sequences, GCG style;
in order of increasing power and complexity:
The sequence is in a local GCG format single sequence file in your UNIX
account. (GCG Reformat and all From & To programs)
The sequence is in a local GCG database in which case you ‘point’ to it by
using any of the GCG database logical names. A colon, “:,” always sets
the logical name apart from either an accession number or a proper
identifier name or a wildcard expression and they are case insensitive.
The sequence is in a GCG format multiple sequence file, either an MSF
(multiple sequence format) file or an RSF (rich sequence format) file. To
specify sequences contained in a GCG multiple sequence file, supply the
file name followed by a pair of braces, “{},” containing the sequence
specification, e.g. a wildcard — {*}.
Finally, the most powerful method of specifying sequences is in a GCG “list”
file. It is merely a list of other sequence specifications and can even
contain other list files within it. The convention to use a GCG list file in a
program is to precede it with an at sign, “@.” Furthermore, one can
supply attribute information within list files to specify something special
about the sequence.
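The four ways of specifying sequences can be told apart by their punctuation alone. The sketch below classifies a specification string using only the rules stated above: a leading “@” for list files, trailing braces for multiple sequence files, and a colon for database logical names. The function name and return shape are illustrative assumptions and are not part of the Wisconsin Package itself.

```python
import re


def classify_spec(spec):
    """Classify a GCG-style sequence specification.

    Returns (kind, source, selector), where kind is one of
    'list', 'multiple', 'database' or 'file'.
    """
    spec = spec.strip()
    if spec.startswith("@"):
        return ("list", spec[1:], None)
    braces = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if braces:
        # File name followed by {sequence specification}, e.g. a wildcard.
        return ("multiple", braces.group(1), braces.group(2))
    if ":" in spec:
        logical, name = spec.split(":", 1)
        # Logical names and entry names are case insensitive, so normalize.
        return ("database", logical.upper(), name.upper())
    return ("file", spec, None)


for example in ("my-special.pep", "SwissProt:EfTu_Ecoli",
                "Ef1a-Tu.msf{*}", "@another.list"):
    print(example, "->", classify_spec(example))
```

A file path that itself contains a colon would be misread as a database reference by this sketch, so it is only a reading aid for the rules above.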
Logical terms for the Wisconsin Package —
Sequence databases, nucleic acids:
GENBANKPLUS, GBP     all of GenBank plus EST and GSS subdivisions
GENBANK, GB          all of GenBank except EST and GSS subdivisions
BACTERIAL, BA        GenBank bacterial subdivision
EST                  GenBank EST (Expressed Sequence Tags) subdivision
GSS                  GenBank GSS (Genome Survey Sequences) subdivision
HTC                  GenBank High Throughput cDNA
HTG                  GenBank High Throughput Genomic
INVERTEBRATE, IN     GenBank invertebrate subdivision
OTHERMAMM, OM        GenBank other mammalian subdivision
OTHERVERT, OV        GenBank other vertebrate subdivision
PATENT, PAT          GenBank patent subdivision
PHAGE, PH            GenBank phage subdivision
PLANT, PL            GenBank plant subdivision
PRIMATE, PR          GenBank primate subdivision
RODENT, RO           GenBank rodent subdivision
STS                  GenBank (sequence tagged sites) subdivision
SYNTHETIC, SY        GenBank synthetic subdivision
TAGS                 GenBank EST and GSS subdivisions
UNANNOTATED, UN      GenBank unannotated subdivision
VIRAL, VI            GenBank viral subdivision

Sequence databases, amino acids:
GENPEPT, GP          GenBank CDS translations
SWISSPROTPLUS, SWP   all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW        all of Swiss-Prot (fully annotated)
SPTREMBL, SPT        Swiss-Prot preliminary EMBL translations
PIR, P               all of PIR Protein
PROTEIN, PIR1        PIR fully annotated subdivision
PIR2                 PIR preliminary subdivision
PIR3                 PIR unverified subdivision
PIR4                 PIR unencoded subdivision
NRL_3D, NRL          PDB 3D protein sequences

General data files:
GENMOREDATA          path to GCG optional data files
GENRUNDATA           path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.
The List File Format —
An example GCG list file of many elongation
factor 1a and Tu sequences follows. As with all GCG
data files, two periods separate
documentation from data.
..
my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list
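The list file layout described here (documentation, a “..” divider, then one sequence specification per line with optional attributes such as begin: and end:) can be read with a short sketch like the one below. The function name is illustrative, and treating the first line that ends in “..” as the divider is an assumption made for this example rather than a statement of GCG's exact rule.

```python
def parse_list_file(text):
    """Return [(specification, attributes)] from GCG-style list file text."""
    lines = text.splitlines()
    # Skip the documentation: everything up to and including the ".." divider.
    for i, line in enumerate(lines):
        if line.strip().endswith(".."):
            lines = lines[i + 1:]
            break
    entries = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        spec, attrs = tokens[0], {}
        for token in tokens[1:]:
            key, sep, value = token.partition(":")
            if sep:                        # keep only key:value attributes
                attrs[key.lower()] = value
        entries.append((spec, attrs))
    return entries


EXAMPLE = """\
An example list of elongation factor sequences.
..
my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
@another.list
"""

entries = parse_list_file(EXAMPLE)
for spec, attrs in entries:
    print(spec, attrs)
```

Only the first token on each line is taken as the specification, so a database reference such as SwissProt:EfTu_Ecoli is not mistaken for an attribute even though it contains a colon.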
The ‘way’ SeqLab works!
Conclusions —
There’s a bewildering assortment of different
databases and ways to access and manipulate the
information within them. The key is to learn how to
use that information in the most efficient manner. A
comprehensive sequence analysis software suite,
such as the Wisconsin Package, expedites the
chore, putting a large assortment of tools all under
one organizational model with one user interface.
FOR EVEN MORE INFO...
Contact me ([email protected]) for specific
bioinformatics assistance and/or collaboration.