BioInformatics at FSU - whose job is it and why it needs
An Introduction to Bioinformatics.
CSE, Marmara University
mimoza.marmara.edu.tr/~m.sakalli/cse546
Oct/12/09
Source http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt
Terminology
Bioinformatics: using computational techniques to access, analyze, and
interpret biological information; it includes tool building. Biocomputing and
computational biology are used as synonyms.
Sequence analysis is the study of molecular sequence data.
Genomics analyzes the context of genes or complete genomes.
Proteomics is the subdivision of genomics concerned with analyzing the
protein complement, i.e. the proteome.
The Human Genome Project and numerous other sequencing efforts are producing data at alarming rates.
Homo sapiens has about 3.2 billion base pairs. Early estimates of the number of genes were
in the 100,000 range, but the count turns out to be only about twice that of a fruit fly: between 25,000
and 35,000!
The protein-coding region of the genome is only about 1% or so; much of the
remainder is ‘jumping’ or ‘selfish’ DNA, of which much may be involved in regulation
and control.
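The figures above can be combined into a rough back-of-the-envelope calculation. The sketch below uses only the approximate numbers quoted in the text (3.2 billion bp, about 1% coding, 25,000 to 35,000 genes); the variable and function names are illustrative, and the result is an order-of-magnitude average, not a measured value.

```python
# Approximate figures quoted in the text above.
GENOME_BP = 3_200_000_000        # about 3.2 billion base pairs
CODING_PERCENT = 1               # "about 1% or so" is protein coding
GENE_RANGE = (25_000, 35_000)    # estimated number of genes

# Integer arithmetic keeps the estimate exact for these round numbers.
coding_bp = GENOME_BP * CODING_PERCENT // 100


def coding_bp_per_gene(n_genes):
    """Average protein-coding bases per gene for a given gene count."""
    return coding_bp / n_genes


print(f"coding bases: ~{coding_bp:,}")
for n_genes in GENE_RANGE:
    print(f"{n_genes:,} genes: ~{coding_bp_per_gene(n_genes):,.0f} coding bp per gene")
```

On these assumptions the coding fraction is on the order of tens of millions of bases, or roughly a thousand coding bases per gene.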
Three major databases, each with its own specific format. They are mirrored among each other
and share accession codes, but NOT identifier names:
1) National Center for Biotechnology Information (NCBI), part of the National Library of
Medicine (NLM) at the NIH (GenBank and GenPept).
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Georgetown University’s National Biomedical Research Foundation (NBRF) Protein
Identification Resource (PIR), and the Naval Research Lab’s NRL_3D database of sequences with known three-dimensional structure.
http://www-nbrf.georgetown.edu/
http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html
2) European Molecular Biology Laboratory (EMBL)
http://www.ebi.ac.uk/embl/index.html, http://www.embl-heidelberg.de/
European Bioinformatics Institute (EBI)
http://www.ebi.ac.uk/
Swiss Institute of Bioinformatics (SIB), Expert Protein Analysis System (ExPASy)
http://www.expasy.ch/, http://www.expasy.org/links.html
Nucleotide sequence database and amino acid sequence databases
http://expasy.cbr.nrc.ca/sprot/
3) The National Institute of Genetics, DNA Data Bank of Japan (DDBJ).
http://www.ddbj.nig.ac.jp/
Atlas of Protein Sequence and Structure: The first well recognized protein
sequence database, mid sixties, by Dr. Margaret Dayhoff.
DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at
establishing an organized, reliable, comprehensive and openly available library of
genetic sequences.
Each program needs to recognize particular aspects of the sequence files, so program
flexibility is a headache. NCBI’s ASN.1 format and its Entrez interface attempt
to reduce these problems.
Unfortunately, sequence formats have no counterpart to the IEEE working groups or the Internet Engineering Task Force’s RFCs,
so format issues are the most confusing and troubling aspect of working with primary
sequence data.
Sequence database installations are commonly a complex ASCII/binary mix, often proprietary, and
neither relational nor object-oriented.
They contain several very long text files, each holding a different type of information, all
related to particular sequences.
Software is usually required to interact with these databases, for example Don Gilbert’s ReadSeq,
a reformatting program for DNA and protein sequences that accepts single or
multiple inputs in 18 different formats and converts them to a specified format.
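To make the idea of reformatting concrete, here is a minimal sketch of one such conversion: turning the numbered, space-separated sequence lines of a GenBank-style ORIGIN block into a FASTA record. This is not ReadSeq's implementation; the function name `origin_to_fasta` and its parameters are hypothetical, chosen only for illustration.

```python
def origin_to_fasta(name, origin_lines, width=60):
    """Convert GenBank-style ORIGIN lines into a FASTA-formatted string.

    Each ORIGIN line looks like '        1 acgggtttgc cgccagaaca ...':
    a position number followed by blocks of residues.
    """
    residues = []
    for line in origin_lines:
        # Keep letters only, which drops the position numbers and the spaces.
        residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues)

    # Re-wrap the sequence at a fixed width under a '>' header line.
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped) + "\n"


# Two short lines in ORIGIN layout (first 30 bases of an example record).
lines = [
    "        1 acgggtttgc cgccagaaca",
    "       21 caggtgtcgt",
]
print(origin_to_fasta("example", lines, width=20))
```

A real converter also has to carry over the annotation (definition, accession, features), which is where most of the format-specific effort goes.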
http://www.molecularevolution.org/
AWTY (Are We There Yet?) is a system for graphically exploring convergence of Markov Chain
Monte Carlo (MCMC) chains in Bayesian phylogenetic inference (Nylander et al. 2008).
FigTree to graphically view phylogenetic trees.
Clustal W (Thompson et al. 1994) performs global multiple sequence alignment. It uses a progressive
alignment algorithm with affine gap penalties and a guide tree based on sequence similarity to
align DNA or amino acid sequences. The affine gap cost model penalizes insertions and
deletions using a linear function in which one term is length independent and the other is
length dependent: Gap penalty = Gapopen + Len * Gapextend. Reviews comparing
multiple alignment algorithms include Hickson et al. 2000, Thompson et al. 1999, and McClure
et al. 1994. Morrison and Ellis (1997) discuss the effects of nucleotide sequence alignment
on the estimation of phylogenetic hypotheses. The current version is Clustal W2 (Larkin et al.
2007). The program is also available with a graphical user interface, Clustal X.
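The affine gap formula quoted above can be written out directly. The sketch below implements Gap penalty = Gapopen + Len * Gapextend exactly as stated in the text; the function name and the numeric values (10.0 and 0.5) are illustrative assumptions, not Clustal W's actual defaults or code.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Affine gap cost: a length-independent opening term plus a
    length-dependent extension term, as in the formula above."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + length * gap_extend


# Illustrative parameter values only.
GAP_OPEN, GAP_EXTEND = 10.0, 0.5

one_long_gap = affine_gap_penalty(5, GAP_OPEN, GAP_EXTEND)
five_short_gaps = 5 * affine_gap_penalty(1, GAP_OPEN, GAP_EXTEND)
print(one_long_gap, five_short_gaps)
```

With these values a single gap of five positions costs far less than five separate one-position gaps, which is the point of the model: opening a gap is expensive, extending it is cheap.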
BEAST (Bayesian Evolutionary Analysis Sampling Trees), together with BEAUti, is for evolutionary inference from
molecular sequences; it is by Andrew Rambaut and Alexei Drummond (Drummond et al. 2002; 2005;
2006).
FASTA compares pairs of protein or DNA sequences, and also compares a single protein or DNA
sequence to a database or library. It is fast and is available as a local or remote service.
GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs phylogenetic searches on
aligned nucleotide datasets using the maximum likelihood criterion.
MAFFT implements FFT to optimize protein alignments based on physical properties of the amino
acids (Katoh et al., 2002; 2005). The program uses progressive alignment followed by
refinement, also known as iterative alignment.
All sequence databases contain (in their own format):
Name (Genetic identifiers): LOCUS, ENTRY, ID
Definition: A brief, one-line, textual sequence description.
Accession Number: A constant data identifier.
Source and classification (taxonomy) information.
Complete literature references.
Comments and keywords.
The all important FEATURE table!
A summary or checksum line.
The sequence itself.
GenBank and GenPept format
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
ACCESSION   X03558
VERSION     X03558.1  GI:31097
KEYWORDS    elongation factor; elongation factor 1.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1506)
  AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
  TITLE     The primary structure of the alpha subunit of human elongation……
  JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
  MEDLINE   86136120
FEATURES             Location/Qualifiers
     source          1..1506
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     CDS             54..1442
                     /note="EF-1 alpha (aa 1-463)"
                     /codon_start=1
                     /protein_id="CAA27245.1"
                     /db_xref="GI:31098"
                     /db_xref="SWISS-PROT:P04720"
                     /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                     EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                     ……VTKSAQKAQKAK"
BASE COUNT      412 a    337 c    387 g    370 t
ORIGIN
        1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
       61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
     1501 aactgt
//
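The fields of the GenBank record above cross-check one another, which is a useful sanity test when reading such files by program. The sketch below takes the numbers as they appear in the record (LOCUS length 1506, BASE COUNT 412/337/387/370, CDS location 54..1442) and verifies that they agree. The helper `cds_length` is hypothetical and only handles a plain start..end location, not joins or complements.

```python
import re


def cds_length(location):
    """Length in bases of a simple 'start..end' feature location."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError("only plain start..end locations are handled here")
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1  # positions are 1-based and inclusive


# Values copied from the record shown above.
locus_length = 1506
base_count = {"a": 412, "c": 337, "g": 387, "t": 370}
cds_bases = cds_length("54..1442")

# BASE COUNT should add up to the sequence length on the LOCUS line.
print("base count matches LOCUS length:", sum(base_count.values()) == locus_length)

# A coding region read from codon_start=1 should be a whole number of codons.
print("CDS bases:", cds_bases, "codons:", cds_bases // 3, "remainder:", cds_bases % 3)
```

The CDS works out to 463 codons, matching the "aa 1-463" note; the EMBL/SWISSPROT entry lists 462 AA, so the final codon is presumably the stop codon.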
EMBL and SWISSPROT format
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DT   13-AUG-1987 (Rel. 05, Created)……
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
GN   EEF1A1 OR EEF1A OR EF1A.
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606, 9913, 9986;
RN   [1]
RP   SEQUENCE FROM N.A.
RC   SPECIES=Human;
RX   MEDLINE=86136120; PubMed=3512269;
RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
RL   Eur. J. Biochem. 155:167-171(1986).……
CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
CC       BIOSYNTHESIS.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
CC       EF-TU/EF-1A SUBFAMILY……
DR   EMBL; X03558; CAA27245.1; -……
DR   PIR; S18054; EFRB1……
DR   HSSP; Q01698; 1TUI……
DR   InterPro; IPR004160; GTP_EFTU_D3.
DR   Pfam; PF00009; GTP_EFTU; 1……
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
FT   MOD_RES      36     36       METHYLATION (TRI-).
FT   MOD_RES      55     55       METHYLATION (DI-).
FT   MOD_RES      79     79       METHYLATION (TRI-).
FT   MOD_RES     165    165       METHYLATION (DI-).
FT   MOD_RES     318    318       METHYLATION (TRI-).
FT   BINDING     301    301       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   BINDING     374    374       ETHANOLAMINE-PHOSPHOGLYCEROL.
FT   CONFLICT     83     83       S -> A (IN REF. 2).
FT   CONFLICT    232    232       L -> V (IN REF. 3).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
     DKLKAERERG …… VTKSAQKAQK AK
//
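Because every line of the EMBL/SWISSPROT entry above begins with a two-letter line code, the format is easy to read by program. The following is a minimal sketch, assuming the usual layout of a two-letter code, three spaces, and data starting in column 6, with sequence lines indented under SQ. The function name `parse_line_codes` is hypothetical, and the sketch ignores the internal structure of individual line types.

```python
from collections import defaultdict


def parse_line_codes(text):
    """Group the lines of one EMBL/SWISSPROT-style entry by line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):
            break  # '//' terminates the entry
        code, data = line[:2], line[5:].rstrip()
        if code.strip():
            fields[code].append(data)
        else:
            # Indented lines after SQ carry the sequence in blocks of ten.
            fields["sequence"].append(data.replace(" ", ""))
    return dict(fields)


# A few lines taken from the entry shown above.
sample = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
SQ   SEQUENCE   462 AA;  50141 MW;  D465615545AF686A CRC64;
     MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
//
"""

record = parse_line_codes(sample)
print(record["AC"])
print(" ".join(record["DE"]))
print(record["sequence"][0])
```

Repeated codes such as DE, CC, and FT are continuation lines of one logical field, so a fuller parser would join or further split them depending on the line type.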
PIR/NBRF format
ENTRY            EFHU1 #type complete
                 iProClass View of EFHU1
TITLE            translation elongation factor eEF-1 alpha-1 chain - human
ALTERNATE_NAMES  translation elongation factor Tu
ORGANISM         #formal_name Homo sapiens #common_name man
                 #cross-references taxon:9606
DATE             30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change…..
ACCESSIONS       B24977; A25409; A29946; A32863; I37339
REFERENCE        A93610
   #authors      Rao, T.R.; Slobin, L.I.
   #journal      Nucleic Acids Res. (1986) 14:2409
   #title        Structure of the amino-terminal end of mammalian elongation…
   #accession    B24977
      ##molecule_type mRNA
      ##residues 1-82,'A',84-94 ##label RAO
      ##cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1;
                 PID:g31110…….
GENETICS
   #gene         GDB:EEF1A1; EEF1A; EF1A
      ##cross-references GDB:118791; OMIM:130590
   #map_position 6q14-6q14
   #introns      48/3; 108/3; 207/3; 258/1; 343/3; 422/1
CLASSIFICATION   SF003007
   #superfamily  translation elongation factor Tu; translation elongation
                 factor Tu homology
KEYWORDS         GTP binding; methylated amino acid; nucleotide binding;
                 P-loop; phosphoprotein; protein biosynthesis; RNA binding
FEATURE
   1-223             #domain eEF-1 alpha domain I, GTP-binding #status
                     predicted #label EF1\
   8-156             #domain translation elongation factor Tu homology
                     #label ETU\
   14-21             #region nucleotide-binding motif A (P-loop)\
   153-156           #region GTP-binding NKXD motif\
   245-330           #domain eEF-1 alpha domain II, tRNA-binding
                     #status predicted #label EF2\
   332-462           #domain eEF-1 alpha domain III, tRNA-binding
                     #status predicted #label EF3\
   36,55,79,165,318  #modified_site N6,N6,N6-trimethyllysine (Lys)
                     #status predicted\
   301,374           #binding_site glycerylphosphorylethanolamine
                     (Glu) (covalent) #status predicted
SUMMARY          #length 462 #molecular_weight 50141
SEQUENCE
                5        10        15        20        25        30
      1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
     31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
     61 D K L K A E R E R …... Q K A Q K A K
Examples of DBs with specialized types of sequences
Almost all the links: the Ensembl human genome project at http://www.ensembl.org/
Patterns, motifs, and profiles: REBASE, EPD, PROSITE.
Aligned multiple sequence entries: RDP and ALN.
Functionally, structurally, or phylogenetically ordered: iProClass and the HOVERGEN vertebrate gene db.
HIV Database, and the Giardia lamblia Genome Project.
3D structure: atomic coordinate data is necessary to define the tertiary shape of a particular biological
molecule. Protein Data Bank (PDB) and the Rutgers Nucleic Acid Database.
MolBio: molecular visualization with special software.
Genomic linkage mapping databases for H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces,
Arabidopsis, E. coli.
OMIM — Online Mendelian Inheritance in Man
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto
Encyclopedia of Genes and Genomes).
Check the links given below..
There’s a bewildering assortment of different databases and ways to access and manipulate the
information within them. The key is to learn how to use that information in the most efficient manner.
For example: Given a novel genome sequence, find all genes and p-genes.
I want to design "sequence capture" probes for the exons of 40 genes that cause RP.
Obtain the exonic sequence with at least 100 nt of flanking sequence, and 1000 nt of the promoter
upstream of the transcription start.
I propose a new way to find disease-causing mutations in humans. I want to only look in genes that have
regions that are 1) highly conserved across species, 2) have known functional protein domains (ex.
transmembrane domains), and 3) have mRNA secondary structure. Is this a good idea?
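For the probe-design question above, most of the work is looking up exon and transcription-start coordinates in a database, but the coordinate arithmetic itself is simple. The sketch below is only that arithmetic, under stated assumptions: coordinates are 1-based and inclusive, strand is given as "+" or "-", and the function names and example coordinates are hypothetical.

```python
def flanked_exon(start, end, flank=100, seq_length=None):
    """Widen a 1-based inclusive exon interval by `flank` nt on each side,
    clamped to the ends of the reference sequence."""
    new_start = max(1, start - flank)
    new_end = end + flank
    if seq_length is not None:
        new_end = min(seq_length, new_end)
    return new_start, new_end


def promoter_region(tss, strand, length=1000):
    """Interval covering `length` nt immediately upstream of the
    transcription start site (TSS); upstream depends on the strand."""
    if strand == "+":
        return max(1, tss - length), tss - 1
    if strand == "-":
        return tss + 1, tss + length
    raise ValueError("strand must be '+' or '-'")


# Hypothetical coordinates, for illustration only.
print(flanked_exon(500, 650))       # exon plus 100 nt on each side
print(flanked_exon(50, 120))        # clamped at the start of the sequence
print(promoter_region(5000, "+"))   # upstream on the forward strand
print(promoter_region(5000, "-"))   # upstream on the reverse strand
```

The resulting intervals would then be used to fetch sequence from whichever genome database holds the annotation.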
1859: Charles Darwin’s The Origin of Species
Basic Mendelian Genetics
Mendel’s laws
independent assortment
independent segregation
mitosis and meiosis
dominant/recessive and pedigrees (the graphs of phenotype)
alleles
Basic molecular genetics
DNA
RNA
proteins
Central Dogma
genes and gene structure
cells and chromosomes
Principles of Genetics, Tamarin
Pearson FastA format (first record below) and GCG single sequence format (second record, beginning with !!AA_SEQUENCE):
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK
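The FastA record above has the simplest structure of all the formats shown: a header line starting with ">" followed by sequence lines. A minimal reader can be sketched as follows; the function name `parse_fasta` is illustrative, and the example input is only the header and first two sequence lines of the record above, so the printed length is that of the truncated excerpt, not the full protein.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated excerpt of the record above (first 100 residues only).
example = """>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

The same function handles files with many records, since each ">" line starts a new one.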
!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding #status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding #status predicted
<EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) #status
predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent)
#status predicted
EFHU1  Length: 462  January 14, 2002 19:49  Type: P  Check: 5308  ..

       1  MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
     351  GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
     401  IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
     451  VTKSAQKAQK AK
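The "Check:" value on the GCG header line is a checksum over the sequence, an instance of the summary or checksum line listed earlier among the common record fields. A commonly described form of the GCG checksum weights each character by a position counter that cycles from 1 to 57 and keeps the total modulo 10000. The sketch below implements that description; it has not been verified here against the value 5308 shown in the record, and the function name is illustrative.

```python
def gcg_checksum(sequence):
    """GCG-style checksum: position-weighted sum of character codes,
    with weights cycling 1..57, reduced modulo 10000."""
    total = 0
    for index, residue in enumerate(sequence):
        weight = (index % 57) + 1
        total += weight * ord(residue.upper())
    return total % 10000


# Tiny inputs that can be checked by hand: 'A' is 65 and 'C' is 67.
print(gcg_checksum("A"))    # 1 * 65
print(gcg_checksum("ac"))   # 1 * 65 + 2 * 67, case-insensitive
```

A program reading a GCG file can recompute this value from the sequence lines, with spaces and position numbers removed, and compare it with the header to detect accidental edits or truncation.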