BioInformatics (1)

What is Life All About :
Self-compiling & self-assembling
Complementary surfaces
Watson-Crick base pair
(Nature April 25, 1953)
Life Science vs Computing
Where do parasites come from?
(computer & biological viral codes)
Over $12 billion/year
on computer viruses
LoveBug
20 M dead (worse than
black plague & 1918 Flu)
AIDS - HIV-1
Polymerase drug resistance mutations
Set dirtemp = fso.GetSpecialFolder(2)
M41L, D67N, T69D, L210W, T215Y, H208Y
Set c = fso.GetFile(WScript.ScriptFullName)
PISPIETVPV KLKPGMDGPK VKQWPLTEEK
c.Copy(dirsystem&"\MSKernel32.vbs")
IKALIEICAE LEKDGKISKI GPVNPYDTPV
c.Copy(dirwin&"\Win32DLL.vbs")
c.Copy(dirsystem&"\LOVE-LETTER-FOR-YOU.TXT.vbs")
regruns()
FAIKKKNSDK WRKLVDFREL NKRTQDFCEV
html()
spreadtoemail()
listadriv()
Exciting Life ??
Concept         Computers         Organisms
Instructions    Program           Genome
Bits            0,1               a,c,g,t
Stable memory   ROM, disk, tape   DNA
Active memory   RAM               RNA
Processing      CPU/Compiler      enzyme/Ribosome
Editing         Editor            tRNA
Environment     Sockets, people   Water, salts, heat
I/O             AD/DA             proteins
Monomer         Minerals          Nucleotide
Polymer         chip              DNA, RNA, protein
Replication     Cut/Paste         DNA replication
Sensor/In       scanner           Chem/photo receptor
Elements of RNA-based life: C, H, N, O, P
Useful for many species:
Na, K, Fe, Cl, Ca, Mg, Mo, Mn, S, Se, Cu, Ni, Co, Si
The Four Nucleosides of DNA
A nucleoside is a sugar, here deoxyribose, plus a base
dA = deoxyadenosine, etc.
dA
dG
PURINES
dC
dT
PYRIMIDINES
BASES
Adenine
Thymine
Guanine
Cytosine
Uracil
Base Pairing
The monomeric units of nucleic acids are called nucleotides.
A nucleotide is a phosphate, a sugar, and a purine or a pyrimidine base.
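Watson-Crick pairing (A with T, G with C) means one strand determines the other. A minimal sketch of that rule follows; the function name `reverse_complement` is illustrative, and the input is the first ten bases of a DNA sequence written 5' to 3'.

```python
# Watson-Crick pairing: A<->T and G<->C (upper or lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("gtcgttcaac"))
```

Applying the function twice returns the original strand, which is a quick sanity check on the pairing table.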
Chromosomes
Genome and gene
Entity   Definition                         Molecular mechanisms
Genome   Unit of information transmission   DNA replication
Gene     Unit of information expression     Transcription to RNA, translation to protein

(A gene is a specific sequence of nucleotide bases that carries the information required for constructing a protein.)
Nucleic acid and proteins
Macromolecule   Backbone               Repeating unit                      Length        Role
Nucleic acid
  DNA           Phosphodiester bonds   Deoxyribonucleotides (A, C, G, T)   10^3-10^8     Genome
  RNA           Phosphodiester bonds   Ribonucleotides (A, C, G, U)        10^3-10^5     Genome
                                                                           10^3-10^4     Messenger
                                                                           10^2-10^3     Gene product
Protein         Peptide bonds          Amino acids                         10^2-10^3     Gene product

Amino acids: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
Proteins are structural components of cells, tissues, and enzymes.
Nucleotide codes
Code  Meaning              Code  Meaning
A     Adenine              W     Weak (A or T)
G     Guanine              S     Strong (G or C)
C     Cytosine             M     Amino (A or C)
T     Thymine              K     Keto (G or T)
U     Uracil               B     Not A (G or C or T)
R     Purine (A or G)      H     Not G (A or C or T)
Y     Pyrimidine (C or T)  D     Not C (A or G or T)
N     Any nucleotide       V     Not T (A or G or C)
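The ambiguity codes can be read as sets of allowed bases. The sketch below encodes the table that way and checks a DNA sequence against a pattern; the names `IUPAC` and `matches` are illustrative, and U is left out because the check is written for DNA.

```python
# Each code maps to the set of DNA bases it stands for (from the table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "GC", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(pattern: str, seq: str) -> bool:
    """True if every base of seq is allowed by the code at the same position."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), seq.upper()))

print(matches("RYN", "ACT"))  # A is a purine, C a pyrimidine, N accepts anything
print(matches("RYN", "CCT"))  # C is not a purine
```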
Amino acid codes
3-letter  1-letter  Name
Ala       A         Alanine
Arg       R         Arginine
Asn       N         Asparagine
Asp       D         Aspartic acid
Cys       C         Cysteine
Gln       Q         Glutamine
Glu       E         Glutamic acid
Gly       G         Glycine
His       H         Histidine
Ile       I         Isoleucine
Leu       L         Leucine
Lys       K         Lysine
Met       M         Methionine
Phe       F         Phenylalanine
Pro       P         Proline
Ser       S         Serine
Thr       T         Threonine
Trp       W         Tryptophan
Tyr       Y         Tyrosine
Val       V         Valine
Asx       B         Asn or Asp
Glx       Z         Gln or Glu
Sec       U         Selenocysteine
Unk       X         Unknown
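Converting between the two notations is a table lookup. The sketch below is one way to do it; the name `to_one_letter` is illustrative, and the input is assumed to be whitespace-separated three-letter codes.

```python
# Three-letter to one-letter amino acid codes, as listed in the table above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C", "Gln": "Q",
    "Glu": "E", "Gly": "G", "His": "H", "Ile": "I", "Leu": "L", "Lys": "K",
    "Met": "M", "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V", "Asx": "B", "Glx": "Z", "Sec": "U", "Unk": "X",
}

def to_one_letter(residues: str) -> str:
    """Convert e.g. 'Met Arg Ala' to 'MRA'; unrecognized codes become 'X'."""
    return "".join(THREE_TO_ONE.get(r.capitalize(), "X") for r in residues.split())

print(to_one_letter("Met Arg Ala Trp Leu"))
```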
Standard Genetic Code
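The standard genetic code maps each of the 64 codons to an amino acid or a stop signal. The sketch below builds the table from the usual compact 64-letter string (base order T, C, A, G; `*` marks a stop codon) and translates a reading frame. The names `CODON_TABLE` and `translate` are illustrative, and the example DNA strings are made up.

```python
BASES = "TCAG"
# Amino acids for codons TTT, TTC, TTA, TTG, TCT, ... GGG in that order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(seq: str) -> str:
    """Translate DNA (or RNA) in frame 1; incomplete trailing codons are dropped."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

print(translate("ATGGCCTAA"))        # Met, Ala, stop
print(translate("atgcgtgcctggctg"))  # Met, Arg, Ala, Trp, Leu
```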
Schematic illustration of a plant cell
(Home for DNA)
History of structure determination for nucleic acids and proteins
Technology development
1949  Edman degradation
1954  Isomorphous replacement
1962  Restriction enzyme
1972  DNA cloning
1975  DNA sequencing
1984  Pulsed-field gel electrophoresis
1985  Polymerase chain reaction
1987  YAC vector
1988  Human Genome Project
1993  DNA chip

Structure determination
      α-helix model
1953  DNA double helix model
      Insulin primary structure
1960  Myoglobin tertiary structure
1965  tRNA(Ala) primary structure
1973  tRNA(Phe) tertiary structure
1977  φX174 complete genome
1979  Z-DNA by single crystal diffraction
1986  Protein structure by 2D NMR
1995  H. influenzae complete genome
Human chromosomes: idiograms
X-linked recessive disorder. The inheritance pattern is shown for
a recessive gene on the chromosome X, designated in bold.
Individuals shown: Male XY (normal), Male XY (normal), Male XY (affected); Female XX (normal), Female XX (normal), Female XX (normal).
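The inheritance pattern can be enumerated as a small Punnett square. The sketch below assumes a carrier mother and an unaffected father, which the slide text does not state; the allele labels `X` (normal) and `x` (recessive) and the function name `offspring` are illustrative.

```python
from collections import Counter
from itertools import product

def offspring(mother=("X", "x"), father=("X", "Y")):
    """Return phenotype -> probability for one child (assumed genotypes)."""
    counts = Counter()
    for from_mother, from_father in product(mother, father):
        if from_father == "Y":
            # Sons have a single X, so one recessive allele is enough.
            label = "male, affected" if from_mother == "x" else "male, normal"
        else:
            alleles = {from_mother, from_father}
            if alleles == {"x"}:
                label = "female, affected"
            elif "x" in alleles:
                label = "female, normal (carrier)"
            else:
                label = "female, normal"
        counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

for label, p in offspring().items():
    print(f"{label}: {p:.2f}")
```

Under these assumptions every daughter is phenotypically normal, while a son has an even chance of being affected.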
Reductionistic and synthetic approaches in biology
Biological System (Organism)
Reductionistic Approach (Experiments)
Synthetic Approach (Bioinformatics)
Building Blocks (Genes/Molecules)
Basic principles in physics, chemistry and biology.
                    Physics                Chemistry   Biology
                    Matter                 Compound    Organism
                    Elementary particles   Elements    Genes
Principles known?   Yes                    Yes         No
[Figure: growth of data, 1965-2000. X axis: Year. Y axis: Amount (x1000), from 0.001 to 100 000 in powers of ten. Series: MEDLINE records, MEDLINE G5 MeSH, Transistors / chip, DNA sequences, Mapped human genes, 3-D structures.]
The addresses for the major databases
Database     Organization                                             Address
MEDLINE      National Library of Medicine                             www.nlm.nih.gov
GenBank      National Center for Biotechnology Information            www.ncbi.nlm.nih.gov
EMBL         European Bioinformatics Institute                        www.ebi.ac.uk
DDBJ         National Institute of Genetics, Japan                    www.ddbj.nig.ac.jp
SWISS-PROT   Swiss Institute of Bioinformatics                        www.expasy.ch
PIR          National Biomedical Research Foundation                  www-nbrf.georgetown.edu
PRF          Protein Research Foundation, Japan                       www.prf.or.jp
PDB          Research Collaboratory for Structural Bioinformatics     www.rcsb.org
CSD          Cambridge Crystallographic Data Centre                   www.ccdc.cam.ac.uk
New generation of molecular biology databases
Information                            Database        Address
Compounds and reactions                LIGAND          www.genome.ad.jp/dbget/ligand.html
                                       AAindex         www.genome.ad.jp/dbget/aaindex.html
Protein families and sequence motifs   PROSITE         www.expasy.ch/sprot/prosite.html
Protein families and sequence motifs   Blocks          www.blocks.fhcrc.org/
Protein families and sequence motifs   PRINTS          www.biochem.ucl.ac.uk/bsm/dbbrowser/PRINTS/
Protein families and sequence motifs   Pfam            www.sanger.ac.uk/Pfam/, pfam.wustl.edu/
Protein families and sequence motifs   ProDom          protein.toulouse.inra.fr/prodom.html
3D fold classifications                SCOP            scop.mrc-lmb.cam.ac.uk/scop/
3D fold classifications                CATH            www.biochem.ucl.ac.uk/bsm/cath/
Orthologous genes                      COG             www.ncbi.nlm.nih.gov/COG/
Orthologous genes                      KEGG            www.genome.ad.jp/kegg/
Biochemical pathways                   KEGG            www.genome.ad.jp/kegg/
Biochemical pathways                   WIT             www.mcs.anl.gov/WIT2/
Biochemical pathways                   EcoCyc          ecocyc.PangeaSystems.com/ecocyc/
Biochemical pathways                   UM-BBD          www.labmed.umn.edu/umbbd/
Genome diversity                       NCBI Taxonomy   www.ncbi.nlm.nih.gov/Taxonomy/
Genome diversity                       OMIM            www.ncbi.nlm.nih.gov/Omim/
Example of sequence database entry for Genbank
LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryote; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
………………………….
3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
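The numbers in the entry can be cross-checked with a little arithmetic: the base counts should sum to the sequence length, and the CDS span should be a whole number of codons. The sketch below does both; the helper name `span_length` is illustrative, and it assumes the last codon of the CDS is a stop codon that is not part of the protein.

```python
def span_length(location: str) -> int:
    """Length of a simple 'start..end' location (1-based, inclusive).

    A leading '<' or trailing '>' (partial feature) is ignored.
    """
    start, end = (int(part.strip("<>")) for part in location.split(".."))
    return end - start + 1

# BASE COUNT line of the entry.
base_count = {"a": 1170, "c": 1078, "g": 956, "t": 797}
total_bases = sum(base_count.values())

# CDS feature of the entry.
cds_length = span_length("1188..2954")
codons = cds_length // 3
residues = codons - 1  # assumption: the final codon is a stop codon

print(total_bases, cds_length, codons, residues)
```

Under that assumption the CDS encodes 588 residues, the same length the SWISS-PROT entry for DECA_DROME reports.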
Example of sequence database entry for SWISS-PROT
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84(1987).
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
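In the SWISS-PROT flat-file layout each annotation line starts with a two-letter line code (ID, AC, DT, and so on) followed by its value. A minimal sketch of grouping lines by code follows; the function name `parse_line_codes` is illustrative, the sample is a few lines copied from the entry above, and real parsers handle many more cases (continuation lines, the sequence block).

```python
from collections import defaultdict

SAMPLE = """\
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
//
"""

def parse_line_codes(text: str) -> dict:
    """Group the values of an entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code = line[:2]
        # Skip the terminator and lines without a code (e.g. sequence data).
        if code == "//" or not code.strip():
            continue
        fields[code].append(line[2:].strip())
    return dict(fields)

entry = parse_line_codes(SAMPLE)
print(entry["AC"], len(entry["DT"]))
```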
Functional classification of E. coli genes according to Monica Riley
I. Intermediary metabolism
   A. Degradation
   B. Central intermediary metabolism
   C. Respiration (aerobic and anaerobic)
   D. Fermentation
   E. ATP-proton motive force interconversions
   F. Broad regulatory functions
II. Biosynthesis of small molecules
   A. Amino acids
   B. Nucleotides
   C. Sugars and sugar molecules
   D. Cofactors, prosthetic groups, electron carriers
   E. Fatty acids and lipids
   F. Polyamines
III. Macromolecule metabolism
   A. Synthesis and modification
   B. Degradation of macromolecules
IV. Cell structure
   A. Membrane components
   B. Murein sacculus
   C. Surface polysaccharides and antigens
   D. Surface structures
V. Cellular processes
   A. Transport/binding proteins
   B. Cell division
   C. Chemotaxis and mobility
   D. Protein secretion
   E. Osmotic adaptions
VI. Other functions
   A. Cryptic genes
   B. Phage-related functions and prophages
   C. Colicin-related functions
   D. Plasmid-related functions
   E. Drug/analog sensitivity
   F. Radiation sensitivity
   G. DNA sites
   H. Adaptations to atypical conditions
The Protein Folding Problem
(Sequence → 3D Structure)
1. Protein folding is thermodynamically determined
(Anfinsen's thermodynamic principle)
Protein + Environment
2. Protein folding is a reaction involving other interacting molecules
(Principle of molecular interactions)
Protein + Chaperonins + …
Central Paradigm
Bioinformatics : A Long Journey
(How far are we away from knowing the God ??)
Sequence to exon 80% [Laub 98]
Exons to gene (without cDNA or homolog) ~30% [Laub 98]
Gene to regulation ~10% [Hughes 00]
Regulated gene to protein sequence 98% [Gesteland ]
Sequence to secondary-structure (α, β, c) 77% [CASP]
Secondary-structure to 3D structure 25% [CASP]
3D structure to ligand specificity ~10% [Johnson 99]
Expected overall accuracy ≈ 0.8 × 0.3 × 0.1 × 0.98 × 0.77 × 0.25 × 0.1 ≈ 0.0005?
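The overall figure is simply the product of the per-step accuracies, which treats the steps as independent. A short sketch of that arithmetic, using the percentages listed above (the dictionary keys are shorthand labels, not terms from the cited sources):

```python
from math import prod

# Per-step accuracies as quoted on the slide.
steps = {
    "sequence -> exon": 0.80,
    "exons -> gene": 0.30,
    "gene -> regulation": 0.10,
    "regulated gene -> protein sequence": 0.98,
    "sequence -> secondary structure": 0.77,
    "secondary structure -> 3D structure": 0.25,
    "3D structure -> ligand specificity": 0.10,
}

overall = prod(steps.values())
print(f"overall accuracy = {overall:.5f}")  # about 0.0005
```

Because the product is dominated by its weakest factors, improving any one of the 10% steps by a factor of two doubles the overall figure.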
Our Focus in Bioinformatics
Perturbation
Dynamic Response
Environment
Medication
Genetic Engineering
Gene Expression
Protein Expression
Virtual Cell
Analysis
BioChip
DataBase
Genotype/Phenotype
Biology
Molecular Biology
Biochemistry
Genetics
Symbolic
Algorithms/
Computing
Genome Sequencing