Sequence - BIOTEC - Biotechnology Center TU Dresden
Introduction
based on
Chapter 1
Lesk, Introduction to Bioinformatics
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Contents
Molecular biology primer
The role of computer science
Phylogeny
Sequence Searching
Protein structure
Clinical implications
Read chapter 1
By Michael Schroeder, Biotec,
26 June 2000: Draft of human
genome sequenced!
1953: Watson and Crick discover the structure of DNA
2000: Draft of human genome is published
“The most wondrous map ever produced by human kind”
“One of the most significant scientific landmarks of all
time, comparable with the invention of the wheel or the
splitting of the atom”
High-throughput biomedicine
Microarrays
Measure activity of thousands of genes at the same time
Example:
Cancer
Compare activity with and without drug treatment
Result: Hundreds of candidate drug targets
RNAi (Nobel Prize 2006, Fire and Mello)
Knock-down genes and observe effect
Example:
Infectious diseases
Which proteins orchestrate entry into cell?
Result: Hundreds of candidate proteins
Atomic force microscopes (Nobel Prize laureate Binnig)
Pull protein out of membrane and measure force
Example:
Eye diseases resulting from misfolding
Result: Hundreds of candidate residues
Drug Discovery
[Figure: R&D spending ($ billion) and new drugs per year, plotted by year, 1960 to 1995]
Challenge: Longer time to market, fewer drugs,
exploding costs
Approach: Use of compound libraries and high-throughput screening
HTS and Bioinformatics
High-throughput technologies have completely
changed the work of biomedical researchers
Challenge: Interpret (often large) results of screens
Approach: Before running secondary assays use
bioinformatics and IT to assemble all possible
information
Good News
[Figure: Number of PubMed abstracts by year, 1960 to 2000, y-axis 0 to 14,000,000]
>1.000.000 Sequences
>16.000.000 Articles
>30.000 3D Structures
>700 DBs/Tools
[Figure: Number of data sources in the Molecular Biology Database List at Nucleic Acids Research, by year, 2000 to 2005, y-axis 0 to 800]
Bad News: Data != Knowledge
How to analyse data, how to integrate data?
Computer science to the rescue…
Example: computer science
is key for sequencing
Human genome is a string of length 3.200.000.000
Shotgun sequencing: Break multiple copies of string
into shorter substrings
Example:
shotgunsequencing shotgunsequencing
shotgunsequencing
cing en encing equ gun ing ns otgu seq
sequ sh sho shot tg uenc un
Computing problem: Assemble strings
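The assembly problem above can be sketched with a simple greedy strategy: repeatedly merge the two fragments with the largest suffix/prefix overlap. This is only an illustration under strong assumptions (error-free fragments, no long repeats); the function names and the example fragments are hypothetical and are not taken from the slides, and real assemblers are considerably more involved.

```python
def overlap(a, b):
    """Return the length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedily merge the pair of fragments with the largest overlap."""
    # Drop duplicates and fragments that are fully contained in another one.
    unique = sorted(set(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Join the best pair, writing the overlapping part only once.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# Hypothetical fragments of the string "shotgunsequencing".
print(greedy_assemble(["shotgu", "tgunseq", "seque", "quencing"]))
```

With very short or repetitive fragments several merges can tie, and the greedy choice may then reconstruct the wrong string, which is why longer reads and higher coverage help.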
Computer science key
for sequencing
sh
sho
shot
otgu
tg
gun
un
ns
seq
sequ
equ
uenc
encing
en
cing
ing
QUESTION: How can you handle
long repetitive sequences?
Heeeeelllllllllllooooooo
QUESTION: Why was a draft
announced? When was the final
version ready?
Yersinia pestis
Arabidopsis thaliana
Buchnera sp. APS
Caenorhabditis elegans
Campylobacter jejuni
Helicobacter pylori
rat
Chlamydia pneumoniae
Mycobacterium leprae
Rickettsia prowazekii
mouse
Aquifex aeolicus
Vibrio cholerae
Drosophila melanogaster
Neisseria meningitidis Z2491
Plasmodium falciparum
Saccharomyces cerevisiae
Salmonella enterica
Archaeoglobus fulgidus
Borrelia burgdorferi
Bacillus subtilis
Mycobacterium tuberculosis
Escherichia coli
Thermoplasma acidophilum
Pseudomonas aeruginosa
Ureaplasma urealyticum
Thermotoga maritima
Xylella fastidiosa
Breakthrough of the year 2000
Next quest:
Sequencing a genome for $1000
Quantity and quality of data lead to
ambitious goals
Understand integrative aspects of the biology of
organisms
Interrelate sequence, three-dimensional structure,
interactions, function of proteins, nucleic acids and
protein-nucleic acid complexes
Travel in time
backward (deduce events in evolutionary history) and
forward (deliberate modification of biological systems)
Applications in medicine, agriculture, and other
scientific fields
Scenario
New virus (e.g. SARS) and goal to develop treatment
Scientists isolate genetic material of virus
Screen genome for relationships with previously studied viruses [10]
From virus’ DNA they compute the proteins it produces [1]
Compute proteins’ three-dimensional structure and thereby obtain
clues about their functions
Screen for similar protein sequences with known structure [15]
If any are found
Then interpret difference (homology modelling) [25]
Else predict structure from sequence [55]
Identify or design small molecule blocking relevant active sites of the
protein [50]
Design antibodies to neutralize the virus [50]
Index of problem difficulty:
<30: solution exists already,
>30: we cannot solve this (yet)
Life in Time and Space
Life
A biological organism is a naturally-occurring, self-reproducing
device that effects controlled manipulations of matter, energy and
information
Time
Species evolve through
natural mutation,
recombination of genes in sexual reproduction, or
direct gene transfer
Read the past in contemporary genomes
Space
Species occupy local ecosystems
Species are composed of organisms
Organisms are composed of cells
Cells are composed of molecules
DNA – the molecule of life
http://www.ornl.gov/hgmis
Proteins
20 naturally occurring amino acids in proteins
Non-polar
G glycine, A alanine, P proline, V valine
I isoleucine, L leucine, F phenylalanine, M methionine
Polar
S serine, C cysteine, T threonine, N asparagine
Q glutamine, H histidine, Y tyrosine, W tryptophan
Charged
D aspartic acid, E glutamic acid, K lysine, R arginine
Other classification
H,F,Y,W are aromatic and play role in membrane proteins
Distinguish
atg = adenine-thymine-guanine and
ATG = Alanine-Threonine-Glycine
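The classification above fits naturally into a small lookup table. The sketch below uses the grouping exactly as listed on the slide; the names `AMINO_ACID_CLASS` and `classify` are hypothetical and chosen only for illustration.

```python
# One-letter codes grouped as on the slide.
CLASSES = {
    "non-polar": "GAPVILFM",
    "polar": "SCTNQHYW",
    "charged": "DEKR",
}

# Invert the grouping: one-letter code -> class name.
AMINO_ACID_CLASS = {
    letter: name for name, letters in CLASSES.items() for letter in letters
}


def classify(protein):
    """Return the class of every residue in an upper-case protein sequence."""
    return [AMINO_ACID_CLASS[residue] for residue in protein]


# Upper case denotes amino acids: ATG = Alanine-Threonine-Glycine.
print(classify("ATG"))
```

Keeping nucleotides in lower case and amino acids in upper case, as the slide suggests, avoids confusing the codon atg with the tripeptide ATG.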
The genetic code
First                                                       Third
Position                 Second Position                    Position
(5’ end)    T          C          A          G              (3’ end)
T           TTT Phe    TCT Ser    TAT Tyr    TGT Cys        T
            TTC Phe    TCC Ser    TAC Tyr    TGC Cys        C
            TTA Leu    TCA Ser    TAA Stop   TGA Stop       A
            TTG Leu    TCG Ser    TAG Stop   TGG Trp        G
C           CTT Leu    CCT Pro    CAT His    CGT Arg        T
            CTC Leu    CCC Pro    CAC His    CGC Arg        C
            CTA Leu    CCA Pro    CAA Gln    CGA Arg        A
            CTG Leu    CCG Pro    CAG Gln    CGG Arg        G
A           ATT Ile    ACT Thr    AAT Asn    AGT Ser        T
            ATC Ile    ACC Thr    AAC Asn    AGC Ser        C
            ATA Ile    ACA Thr    AAA Lys    AGA Arg        A
            ATG Met*   ACG Thr    AAG Lys    AGG Arg        G
G           GTT Val    GCT Ala    GAT Asp    GGT Gly        T
            GTC Val    GCC Ala    GAC Asp    GGC Gly        C
            GTA Val    GCA Ala    GAA Glu    GGA Gly        A
            GTG Val    GCG Ala    GAG Glu    GGG Gly        G
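The genetic code is a lookup table from codons to amino acids, so translation is easy to sketch in a few lines. The example below is a minimal illustration, assuming a DNA coding sequence read in frame from its first base; `CODON_TABLE` and `translate` are hypothetical names, and the table is built by listing the amino acids in the order first, second, third position over the bases T, C, A, G.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes in table order; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon until a stop codon or the end is reached."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# atg gct taa -> Met, Ala, Stop
print(translate("atggcttaa"))
```

The sketch ignores reading-frame detection, introns, and alternative genetic codes, all of which matter for real genomes.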
Protein Structure
DNA:
Nucleotides are very similar
and hence the structure of
DNA is very uniform
Proteins:
Great variety in three-dimensional conformation to
support diverse structures
and functions
If heated, protein “unfolds” to
biologically-inactive
structure; in normal
conditions protein folds
Paradox
Translation from DNA sequence to amino acid
sequence
is very simple to describe,
but requires immensely complicated machinery
(ribosome, tRNA)
The folding of the protein sequence into its three-dimensional structure
is very difficult to describe
But occurs spontaneously
Central Dogma
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function
Observables and Data Archives
Databases in molecular biology cover
Nucleic acid and protein sequences,
Macromolecular structures and functions
Archival databanks of biological information
DNA and protein sequences including annotations
Nucleic acid and protein structures including annotations
Protein expression patterns
Derived Databases
Sequence motifs (“signatures” of protein families)
Mutations and variants in DNA and protein sequences
Classification or relationships (e.g. hierarchy of structures)
Bibliographic databases (PubMed with 17M abstracts)
Collections
of links to web sites
of databases
What is Bioinformatics
Bioinformatics is the marriage of biology and
information technology
Bioinformatics is an integrated multidisciplinary
field
Covers computational tools and methods for
managing, analysing and manipulating sets
of biological data
Disciplines include:
biochemistry, genetics, structural biology, artificial
intelligence, machine learning, software
engineering, statistics, database theory,
information visualisation, algorithm design
Bioinformatics
Has three components
Creation of databases
Development of algorithms to analyse data
Use of these tools for analysing biological data
Databases: Types of Queries 1/2
1. Given a sequence (fragment), find sequences in
the database that are similar to it
2. Given a protein structure (or fragment), find
protein structures in the database that are similar to it
3. Given sequence of a protein of unknown structure,
find structures in the database that adopt similar three-dimensional structures
4. Given a protein structure, find sequences in the
database that correspond to similar structures.
Databases: Given sequence, find structure
3. Given sequence of a protein of unknown structure, find
structures in the database that adopt similar three-dimensional
structures.
But How?
Easy: Find similar sequences with known structure!
But: There might be similar structures, whose sequence is not
similar!
4. Given a protein structure, find sequences in the database
that correspond to similar structures.
But How?
Easy: Find similar structures and hence sequences
But: There are so many more sequences with unknown structure
that the above method will have only very limited success
1 and 2 are solved, 3 and 4 are active fields of research
Databases: Types of Queries 2/2
E.g. for which proteins of known structure involved in
disease of disrupted purine biosynthesis in humans,
are there related proteins in yeast?
Solution: Virtual databases that provide transparent
access to a number of underlying data sources and
query and analysis tools
Databases: Curation and Quality
Problems:
Given that there are primary and secondary
databases,
how to control updates,
how to propagate change,
how to maintain consistency?
Contents (experimental results, annotations,
supplementary information) all have their own sources
of error
Older data were limited by older techniques
Databases: Annotation
Experimental data (e.g. raw DNA sequence) needs to be
enriched with annotations
Source of data
Investigators responsible
Relevant publication
Feature tables (e.g. coding regions)
Problems:
(often) lack of controlled and coherent vocabulary
Computer parseable
Automated annotation needed
SwissProt = ca. 130.000 annotated sequences
TrEMBL = ca. 850.000 unannotated sequences
Maintenance of annotations (what if an error is detected?)
Computers and Computer Science
Relevant areas:
Artificial Intelligence
Machine Learning
Neural networks, rule-based learning
Data mining
Association rules
Software Engineering
Design, implementation,
testing of software
Programming
Object-oriented: C++,
Java
Imperative: C, Modula,
Pascal, Cobol, Fortran
Logic: Prolog
Functional: ML
Scripting: Perl, Python
Statistics
Database theory
Design and maintenance of
databases
How to index sequences,
time series, 3D structures
Information Visualisation
Graph drawing, diagrams,
cartoons, 3D graphics
Algorithm design
Complexity of algorithms
Efficient data structures
Programming
We will use Python
Scripting language
Supports string processing well
Widely used in bioinformatics
Biological Classification and
Nomenclature
In the 18th century, Linnaeus, a Swedish naturalist,
classified living things according to a hierarchy:
Kingdom, Phylum, Class, Order, Family, Genus,
Species
Generally only genus and species are used for
identification
Homo sapiens
Drosophila melanogaster
Bos taurus
Linnaeus’ classification based on observed
similarity
Largely reflects biological ancestry
Classification of Humans and Fruit Flies
           Human        Fruit fly
Kingdom:   Animalia     Animalia
Phylum:    Chordata     Arthropoda
Class:     Mammalia     Insecta
Order:     Primates     Diptera
Family:    Hominidae    Drosophilidae
Genus:     Homo         Drosophila
Species:   sapiens      melanogaster
Homology = derived from common ancestor
Characteristics derived from a common ancestor
are called homologous
E.g. eagle’s wing and human’s arm
Other apparently similar characteristics may have
arisen independently by convergent evolution
E.g. eagle’s wing and bee’s wing. The most common
ancestor of eagles and bees did not have wings
Homologous characters may diverge functionally
E.g. bones in the human middle ear and jaws of primitive fish
Sequence analysis and Homology
Sequence analysis gives unambiguous evidence
for relationship of species
For higher organisms sequence analysis and the
classical tools of comparative anatomy,
palaeontology, and embryology are often consistent
For microorganisms there are problems
Classical methods: how to describe features
Sequence analysis: lateral gene transfer
Domains of Life
Ribosomal RNA is present in all organisms
Based on 16S ribosomal RNAs, life is divided into three domains
Bacteria
No nucleus (prokaryote)
E.g. Mycobacterium tuberculosis and E. coli
Archaea
No nucleus (prokaryote)
few organisms living in hostile environments (thermophiles, halophiles,
sulphur reducers, methanogens)
Eukarya
Has a nucleus contained in membrane
Nucleus contains chromosomes
Internal compartments called organelles for specialised biological
processes
Area outside nucleus and organelles called cytoplasm
E.g. yeast and human beings
Eukaryotic cell
Domains of Life
Example: Use of sequences to
determine phylogenetic relationships
Use ExPASy (www.expasy.ch/cgi-bin/sprot-search-ful) to
search for pancreatic ribonuclease for
horse (Equus caballus),
minke whale (Balaenoptera acutorostrata),
red kangaroo (Macropus rufus)
sp|P00674|RNP_HORSE Ribonuclease pancreatic
(EC 3.1.27.5) (RNase 1) (RNase A) - Equus
caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
Use sequence alignment to determine evolutionary relationship
Sequence alignment
Global match: align all of one with all of the other
sequence (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
Local match: find region in one sequence that
matches the other (mismatches, insertions, deletions;
ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
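A global alignment like the first example can be computed by dynamic programming. The sketch below follows the classic Needleman-Wunsch scheme with a simple scoring of +1 for a match and -1 for a mismatch or gap; these scores and the name `global_align` are assumptions made for illustration, and real tools use substitution matrices and more refined gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = global_align("And.so,.from.hour", "And.then,.from.hour")
print(total)
print(row_a)
print(row_b)
```

A local alignment uses the same table but never lets a cell drop below zero and starts the traceback at the highest-scoring cell, so unmatched ends are ignored.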
Sequence alignment
Motif search:
find matches of short sequence in long sequence
Option:
perfect,
1 mismatch,
mismatches+gaps+insertions+deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
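The "perfect" and "1 mismatch" options can be sketched by sliding the motif along the text and counting mismatching positions in each window. The function name `find_motif` and its parameter `max_mismatches` are hypothetical; gaps, insertions and deletions are not handled by this simple version.

```python
def find_motif(text, motif, max_mismatches=0):
    """Return all start positions where motif matches text with few mismatches."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


text = "for the watch to babble and to talk is most tolerable"
print(find_motif(text, "match"))                    # perfect matches only
print(find_motif(text, "match", max_mismatches=1))  # allows "watch"
```

Allowing more mismatches finds more distant hits but also more chance matches, which is the usual trade-off between sensitivity and selectivity.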
Sequence alignment
Multiple sequence alignment
No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.
Example: Multiple alignment
Use sequence alignment to determine evolutionary
relationship…
Example: horse, whale and kangaroo
Expected: horse and whale are placental mammals,
kangaroo is a marsupial
Multiple alignment with CLUSTAL-W
(www.ebi.ac.uk/clustalw)
FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
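FASTA is simple enough to parse by hand: a line starting with ">" opens a new record, and all following lines up to the next ">" are joined into one sequence. The sketch below assumes each header occupies a single line; `parse_fasta` is a hypothetical helper name, and dedicated libraries offer more robust parsers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of every record.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 a short made-up record
KESPAMKF
ERQH
>seq2 another made-up record
RESPAMKF
"""
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence))
```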
Multiple Alignment with ClustalW
(www.ebi.ac.uk/clustalw)
CLUSTAL W (1.82) multiple sequence alignment
sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                      *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *
sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******
sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     *  *
Example: Number of Aligned Residues
Horse and Minke whale: 95
Minke whale and Red kangaroo: 82
Horse and Red kangaroo: 75
Conclusion: Horse and whale share the most
identical residues
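Counts like these can be obtained directly from two rows of a multiple alignment by comparing them column by column. The sketch below counts columns in which both rows carry the same residue and ignores gap columns; `count_identities` is a hypothetical name, and the short rows used here are the last block of the ClustalW alignment of the three ribonucleases.

```python
def count_identities(row_a, row_b):
    """Count alignment columns where both rows hold the same residue (not a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# Last alignment block: horse, minke whale, red kangaroo.
horse, whale, kangaroo = "DASVEVST", "DNSV----", "DAYV----"
print(count_identities(horse, whale))
print(count_identities(horse, kangaroo))
print(count_identities(whale, kangaroo))
```

Summing the counts over all blocks of the alignment gives the number of identical residues for each pair of species.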
Example: Elephant and Mammoth
Mitochondrial cytochrome b from
Siberian woolly mammoth
(Mammuthus primigenius)
preserved in arctic permafrost
African elephant (Loxodonta africana)
Indian elephant (Elephas maximus)
African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA
MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
*** ** ***:**:**********************************************
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
************************************************************
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
**************************************:*********************
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
:********:***********************************************:**
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
******************************************************:*****
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
**:*************************: *** **********:***************
IILAFLPIAGVIENYLIK 378
IILAFLPIAGMIENYLIK 378
IILAFLPIAGMIENYLIK 378
**********:*******
Example: Elephant and Mammoth
Mammoth and African elephant have 10 mismatches,
mammoth and Indian elephant 14.
Significant?
Similarity and Homology
Important difference:
Similarity is the measurement of resemblance of
sequences
Homology: common ancestor
Similarity is gradual, homology is either true or false
Similarity = now, homology = past events
Homology is only very rarely directly observed (e.g. lab
population, clinical study of viral infection)
Homology is inferred from sequence similarity
Example: Homology/Similarity
The assertion that the cytochrome b sequences are
homologues means that there is a common ancestor
BUT:
1. Maybe cytochrome b functionally requires so many
conserved residues and will hence occur in many species (in
fact, this is not the case here)
2. Maybe cytochrome b has to function this way in elephant-like
species, but in fact started out from different ancestors (i.e.
convergent evolution)
3. Maybe mammoth and African elephant have fewer
mismatches only because Indian elephant’s DNA mutated faster
4. Maybe all of them acquired cytochrome b through a virus
(horizontal gene transfer)
Example: Conclusion
Classical methods confirm that for pancreatic
ribonuclease inferring homology from similarity is
justified
But whether mammoths are closer to
African or Indian elephants is too close to call
Problems with inferring phylogeny from gene and
protein sequences
Wide range of variation (possibly below statistical
significance)
Different rates of evolution for different branches of the
evolutionary tree
Inferring Phylogenies
with SINES and LINES
Requirements:
‘all-or-none’ character
Irreversible appearance
Solution:
SINES and LINES (Short and Long Interspersed
Nuclear Elements)
Repetitive, non-coding sequences in eukaryotic
genomes
>30% in human genome, >50% in some plants
SINES = 70-500 base pairs long, up to 10^6 copies
LINES up to 7000 base pairs, up to 10^5 copies
They enter genome by reverse transcription of RNA
A practical example:
Fatherhood
The picture shows a Southern
blot of DNA from different family
members, probed using a minisatellite.
You can work out which of F1
and F2 is the father of child C,
by observing which bands they
have in common.
(Reproduced from "Essential
Medical Genetics" by M.Connor
and M.Ferguson-Smith, with
permission from Blackwell
Science).
Why SINES are useful in phylogeny
Either present or absent
Inserted at random in non-coding portion of genome
i.e. SINE has no important function so that convergent
evolution can be excluded
Presence of a SINE in two species and absence in a third implies
that first two species are more closely related
SINE insertion appears to be irreversible
Temporal order
Presence of a SINE in two species and absence in a third implies
that ancestor of first two species is younger than ancestor of all
three
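The presence/absence argument can be sketched as a count of informative insertions: for three species, an insertion found in exactly two of them supports grouping those two. The data below are entirely hypothetical (made-up locus names), and `sine_support` is an illustrative helper, not an established method name.

```python
from itertools import combinations


def sine_support(presence):
    """Count, for each species pair, the loci present in exactly those two species."""
    species = sorted(presence)
    support = {pair: 0 for pair in combinations(species, 2)}
    all_loci = set().union(*presence.values())
    for locus in all_loci:
        carriers = tuple(s for s in species if locus in presence[s])
        if len(carriers) == 2:  # shared by two species, absent in the rest
            support[carriers] += 1
    return support


# Hypothetical SINE loci observed in three species.
presence = {
    "whale": {"L0", "L1", "L2", "L3", "L4"},
    "hippo": {"L0", "L1", "L2", "L3", "L4"},
    "cow": {"L0", "L7"},
}
for pair, count in sine_support(presence).items():
    print(pair, count)
```

Loci present in all three species (here L0) or in only one (here L7) say nothing about which two are closer, so they are not counted.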
Example revisited
What is the closest land-based relative of the whales?
Classical palaeontology
links Cetacea (whales, dolphins, porpoises) with Artiodactyla
(including e.g. cattle)
Belief that Cetaceans diverged before Artiodactyla split into
suborders
Suiformes (e.g. pigs),
Tylopoda (e.g. camels, llamas),
Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffe)
Sequence comparison results
Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen,
and others
Closest relatives of whales are hippopotamuses (They share 4
SINES)
These two are closest to Ruminantia
Searching for Similar
Sequences with PSI-Blast
Any search method for
sequences should be
Sensitive: also pick up distant
relationships
Selective: reported relationships
are true
Example: database with (among
others) 1000 globin sequences
Globin family (oxygen transport) of
proteins occurs in many species
Proteins have same function and
structure
But there are pairs of members of
the family sharing less than 10%
identical residues
[Figure: Sequence database containing 1000 globin sequences, overlapping with 900 search results]
True positives: 700 out of 900 are really globins
False positives: 200 out of 900 are not globins
False negatives: 300 out of 1000 are not found
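The globin example can be turned into two numbers that quantify "sensitive" and "selective". The sketch below assumes the usual definitions: sensitivity is the fraction of true family members that are found, and selectivity (often called precision) is the fraction of reported hits that are true. The function name `search_quality` is hypothetical.

```python
def search_quality(true_positives, false_positives, false_negatives):
    """Return (sensitivity, selectivity) of a database search."""
    # Fraction of all real family members that the search found.
    sensitivity = true_positives / (true_positives + false_negatives)
    # Fraction of the reported hits that really belong to the family.
    selectivity = true_positives / (true_positives + false_positives)
    return sensitivity, selectivity


# Globin example: 700 true positives, 200 false positives, 300 false negatives.
sensitivity, selectivity = search_quality(700, 200, 300)
print(f"sensitivity = {sensitivity:.2f}, selectivity = {selectivity:.2f}")
```

Loosening the search threshold usually raises sensitivity at the cost of selectivity, and tightening it does the opposite.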
Searching for Distant Relationships
with PSI-BLAST
How can we find distant relationships without
increasing the false positives?
PSI-BLAST:
Position-Specific Iterated Basic Local Alignment
Search Tool
Identifies patterns within the sequences
Score via intermediaries may be better than score from
direct comparison
[Figure: A and B are 50% identical, B and C are 50% identical, but A and C are only 10% identical]
PSI-BLAST Example
Human PAX-6 gene (SwissProt ID P26367) has
homologues in many different species
PSI-Blast at NCBI site www.ncbi.nlm.nih.gov
Result
BLASTP 2.2.6 [Apr-09-2003]
RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6
(Oculorhombin) (Aniridia, type II protein).
(422 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences; 486,132,453 total letters
Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold
Sequences producing significant alignments: Score (bits), E Value
gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a; Paired box h...  781  0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo...  780  0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg...  778  0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus]  776  0.0
gi|7305369|ref|NP_038655.1| paired box gene 6; small eye; Dickie's sm...  776  0.0
gi|383296|prf||1902328A PAX6 gene  775  0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b; Paired box h...  775  0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus]  773  0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus]  770  0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis]  768  0.0
…
Introduction to Protein Structure
Proteins play a variety of roles:
Structural (viral coat proteins, horny outer layer of
human and animal skin, cytoskeleton)
Catalysis of chemical reactions (enzymes)
Transport and Storage (e.g. haemoglobin)
Regulation (e.g. hormones)
Receptor and signal transduction
Genetic transcription
Recognition (cell adhesion molecules)
Antibodies and other proteins of the immune system
Proteins
Are large molecules
Only small part – the active site – is functional
Evolve by structural changes produced by mutations
in the amino acid sequence
Ca. 40000 protein structures are now known
Can be obtained by X-ray crystallography or nuclear
magnetic resonance (NMR)
Structure of Proteins
Backbone and sidechain
Residue i-1, Residue i, Residue i+1
   Si-1   Si     Si+1      Sidechain (variable)
   |      |      |
…N-Cα-C-N-Cα-C-N-Cα-C-…    Mainchain (constant)
      ||     ||     ||
      O      O      O
Polypeptide chain folds into a curve in space
Common structural feature
Alpha-helix
Beta-sheet
Hierarchy of Architecture
Primary structure: Amino acid sequence
Secondary structure: Helices, sheets, loops,
hydrogen-bonding pattern of main chain
Tertiary structure: Assembly and interactions of
helices, sheets, etc.
Quaternary structure: Assembly of monomers
Evolution can merge proteins
Five enzymes in E. coli that catalyze successive steps
in biosynthesis of aromatic amino acids correspond to
one protein in Aspergillus nidulans
Globins form tetramers in mammalian haemoglobin
and dimers in ark clam Scapharca inaequivalvis
Protein Structure
Triosephosphate isomerase from Bacillus stearothermophilus
Highly efficient enzyme appearing in most species
Hierarchy of Architecture: supersecondary structure
Alpha-helix hairpin
Beta hairpin
Beta-alpha-beta unit
Hierarchy of Architecture
Supersecondary structures:
Alpha-helix hairpin
Beta hairpin
Beta-alpha-beta unit
Domains:
Compact unit, single chain, independent stability
Modular proteins:
Multi-domain
Copies of related domains or “mix-and-match”
Classification of Protein Structure
All Alpha: mostly alpha helices
All Beta: mostly beta sheets
Alpha+Beta: Helices and sheets in different parts of
the molecule, no beta-alpha-beta units
Alpha/Beta: Helices and sheets assembled from
beta-alpha-beta units
Alpha/Beta linear
Alpha/Beta barrel
Little or no secondary structure
SCOP: Structural Classification of Proteins
top
CLASS
All alpha (218)
All Beta (144)
Alpha+Beta (279)
Alpha/Beta (136)
FOLD
Trypsin-like serine proteases (1)
Immunoglobulin-like (23)
SUPERFAMILY = evolutionarily related, similar structure,
not necessarily similar sequence
Transglutaminase (1)
Immunoglobulin (6)
FAMILY = set of domains with similar sequence
C1 set domains (antibody constant)
V set domains (antibody variable)
Pymol
Engrailed homeodomain (1enh)
Transcription factor important in development
Used to study protein folding
Utrophin calmodulin homology
domain (1bhd)
Actin binding
Closely related to dystrophin,
whose lack causes muscular
dystrophies (weak muscles)
Cytochrome c, rice (1ccr)
Electron transport across
mitochondrial membrane
DNA-binding domain of HIN
recombinase (1hcr)
Fibronectin III domain (1fna)
Found on cell surface
Mannose-binding protein (1npl)
Barnase (1brn)
Cleaves RNA and is lethal if
intracellular and not inhibited by
barstar
TATA-box-binding protein (1cdw)
OB-domain from Lys-tRNA
synthetase (1bbw)
Scytalone dehydratase (3std)
Alcohol dehydrogenase, NAD-binding domain (1ee2)
Breakdown of alcohol into
simpler compounds
Adenylate kinase (3adk)
Energy production
Chemotaxis receptor
methyltransferase (1af7)
Thiamine phosphate
synthase (2tps)
Pancreatic spasmolytic
polypeptide (2psp)
Protein Structure Prediction and
Engineering
If sequence of amino acids contains enough information to
specify three-dimensional structure of proteins, it should be
possible to devise algorithm for prediction
Secondary structure prediction: Which segments of the
sequence are helices, which strands?
Fold recognition: Given
a library of known structures with their sequences and
a sequence with unknown structure,
can we find the structure that is most similar?
Homology modelling
Given two homologous sequences, one with and one without
known structure: if more than 50% of the residues are identical, the
known structure can serve as a model
Critical Assessment of Structure
Prediction (CASP)
Chicken lysozyme          KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES
Chicken lysozyme          TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin  TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD
Chicken lysozyme          DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL
Baboon alpha-lactalbumin  I--KGIDYWIAHKALC-TEKL-EQWL--CE-K
Clinical Implications
Fast and reliable diagnosis of disease and risk:
With symptoms
In advance of appearance (e.g. Huntington)
In utero (e.g. cystic fibrosis: mutation in cystic fibrosis transmembrane
conductance regulator (CFTR), which is a chloride ion channel)
Genetic counselling
Customized treatment
E.g. childhood leukaemia is treated with toxic drug 6-mercaptopurine.
Small fraction of patients used to die as they lack enzyme thiopurine
methyltransferase.
Identify drug targets
½ are receptors, ¼ are enzymes, ¼ are hormones
7% have unknown targets
Gene therapy
Replace defective genes or supply gene products (insulin for diabetes
and Blood Factor VIII for haemophilia)
However: Most diseases do not have a single genetic cause!
Quick check
By now you should
Have read chapter 1
Know the main data sources (sequence and structure)
Know the role that bioinformatics plays
Understand the difference between homology and similarity
Understand what sequence comparison and alignment are
Understand how they can be useful for phylogenetic studies
Understand primary, secondary, tertiary structure
Be able to assess the assumptions made and the quality of data