Sequence - BIOTEC - Biotechnology Center TU Dresden


Introduction
based on
Chapter 1
Lesk, Introduction to Bioinformatics
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Contents
 Molecular biology primer
 The role of computer science
 Phylogeny
 Sequence Searching
 Protein structure
 Clinical implications
 Read chapter 1
By Michael Schroeder, Biotec,
26 June 2000: Draft of Human
genome announced!
 1953: Watson and Crick discover the structure of DNA
 2000: Draft of human genome is announced (published in February 2001)
 “The most wondrous map ever produced by human kind”
 “One of the most significant scientific landmarks of all
time, comparable with the invention of the wheel or the
splitting of the atom”
High-throughput biomedicine
 Microarrays
 Measure activity of thousands of genes at the same time
 Example:
 Cancer
 Compare activity with and without drug treatment
 Result: Hundreds of candidate drug targets
 RNAi (Nobel Prize 2006, Fire and Mello)
 Knock-down genes and observe effect
 Example:
 Infectious diseases
 Which proteins orchestrate entry into cell?
 Result: Hundreds of candidate proteins
 Atomic force microscopes (co-invented by Nobel laureate Binnig)
 Pull protein out of membrane and measure force
 Example:
 Eye diseases resulting from misfolding
 Result: Hundreds of candidate residues
Drug Discovery
[Figure: new drugs per year and R&D spending ($ billion) plotted against year, 1960 to 1995]
 Challenge: Longer time to market, fewer drugs,
exploding costs
 Approach: Use of compound libraries and high-throughput screening
HTS and Bioinformatics
 High-throughput technologies have completely
changed the work of biomedical researchers
 Challenge: Interpret (often large) results of screens
 Approach: Before running secondary assays use
bioinformatics and IT to assemble all possible
information
Good News
[Figure: number of PubMed abstracts per year, 1960 to 2000]
[Figure: number of data sources in the Molecular Biology Database List at Nucleic Acids Research, 2000 to 2005]
>16,000,000 Articles
>1,000,000 Sequences
>30,000 3D Structures
>700 DBs/Tools
Bad News: Data != Knowledge
 How to analyse data, how to integrate data?
 Computer science to the rescue…
Example: computer science
is key for sequencing
 Human genome is a string of length 3,200,000,000
 Shotgun sequencing: Break multiple copies of string
into shorter substrings
 Example:
 shotgunsequencing shotgunsequencing
shotgunsequencing
 cing en encing equ gun ing ns otgu seq
sequ sh sho shot tg uenc un
 Computing problem: Assemble strings
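The assembly problem above can be sketched in Python as a greedy merge: repeatedly join the two fragments with the longest suffix-prefix overlap. This is only an illustration under simplifying assumptions (error-free fragments, no long repeats). The function names and the four example fragments are invented for this sketch, and real assemblers use far more elaborate methods.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Merge fragments greedily by largest overlap (illustrative only)."""
    frags = sorted(set(fragments))
    # Fragments fully contained in another fragment add no information.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    if not frags:
        return ""
    while len(frags) > 1:
        best = (-1, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0]


# Hypothetical fragments of the slide's example string
print(greedy_assemble(["shotgun", "gunseq", "sequenc", "encing"]))
```

Very short fragments such as "un" or "en" are contained in longer ones and are dropped here. With long repetitive sequences the largest overlap is no longer unique, which is where this greedy idea breaks down.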
Computer science key
for sequencing
shotgunsequencing
sh
sho
shot
  otgu
   tg
    gun
     un
      ns
       seq
       sequ
        equ
          uenc
           encing
           en
             cing
              ing
QUESTION: How can you handle
long repetitive sequences?
Heeeeelllllllllllooooooo
QUESTION: Why was a draft
announced? When was the final
version ready?
Yersinia pestis
Arabidopsis thaliana
Buchnera sp. APS
Caenorhabditis elegans
Campylobacter jejuni
Helicobacter pylori
rat
Chlamydia pneumoniae
Mycobacterium leprae
Rickettsia prowazekii
mouse
Aquifex aeolicus
Vibrio cholerae
Drosophila melanogaster
Neisseria meningitidis Z2491
Plasmodium falciparum
Saccharomyces cerevisiae
Salmonella enterica
Archaeoglobus fulgidus
Borrelia burgdorferi
Bacillus subtilis
Mycobacterium tuberculosis
Escherichia coli
Thermoplasma acidophilum
Pseudomonas aeruginosa
Ureaplasma urealyticum
Thermotoga maritima
Xylella fastidiosa
Breakthrough of the year 2000
Next quest:
Sequencing a genome for $1000
Quantity and quality of data lead to
ambitious goals
 Understand integrative aspects of the biology of
organisms
 Interrelate sequence, three-dimensional structure,
interactions, function of proteins, nucleic acids and
protein-nucleic acid complexes
 Travel in time
 backward (deduce events in evolutionary history) and
 forward (deliberate modification of biological systems)
 Applications in medicine, agriculture, and other
scientific fields
Scenario
 New virus (e.g. SARS) and goal to develop treatment
 Scientists isolate genetic material of virus
 Screen genome for relationships with previously studied viruses [10]
 From virus’ DNA they compute the proteins it produces [1]
 Compute proteins’ three-dimensional structure and thereby obtain
clues about their functions
 Screen for similar protein sequences with known structure [15]
 If any are found
 Then interpret difference (homology modelling) [25]
 Else predict structure from sequence [55]
 Identify or design small molecule blocking relevant active sites of the
protein [50]
 Design antibodies to neutralize the virus [50]
 Index of problem difficulty:
 <30: solution exists already,
 >30: we cannot solve this (yet)
Life in Time and Space
 Life
 A biological organism is a naturally-occurring, self-reproducing
device that effects controlled manipulations of matter, energy and
information
 Time
 Species evolve through
 natural mutation,
 recombination of genes in sexual reproduction, or
 direct gene transfer
 Read the past in contemporary genomes
 Space
 Species occupy local ecosystems
 Species are composed of organisms
 Organisms are composed of cells
 Cells are composed of molecules
DNA – the molecule of life
http://www.ornl.gov/hgmis
Proteins
 20 naturally occurring amino acids in proteins
 Non-polar
 G glycine, A alanine, P proline, V valine
 I isoleucine, L leucine, F phenylalanine, M methionine
 Polar
 S serine, C cysteine, T threonine, N asparagine
 Q glutamine, H histidine, Y tyrosine, W tryptophan
 Charged
 D aspartic acid, E glutamic acid, K lysine, R arginine
 Other classification
 H,F,Y,W are aromatic and play role in membrane proteins
 Distinguish
 atg = adenine-thymine-guanine and
 ATG = Alanine-Threonine-Glycine
The genetic code
First position (5’ end) on the left, second position across the top, third position (3’ end) on the right
        T           C           A           G
T    TTT Phe     TCT Ser     TAT Tyr     TGT Cys     T
     TTC Phe     TCC Ser     TAC Tyr     TGC Cys     C
     TTA Leu     TCA Ser     TAA Stop    TGA Stop    A
     TTG Leu     TCG Ser     TAG Stop    TGG Trp     G
C    CTT Leu     CCT Pro     CAT His     CGT Arg     T
     CTC Leu     CCC Pro     CAC His     CGC Arg     C
     CTA Leu     CCA Pro     CAA Gln     CGA Arg     A
     CTG Leu     CCG Pro     CAG Gln     CGG Arg     G
A    ATT Ile     ACT Thr     AAT Asn     AGT Ser     T
     ATC Ile     ACC Thr     AAC Asn     AGC Ser     C
     ATA Ile     ACA Thr     AAA Lys     AGA Arg     A
     ATG Met*    ACG Thr     AAG Lys     AGG Arg     G
G    GTT Val     GCT Ala     GAT Asp     GGT Gly     T
     GTC Val     GCC Ala     GAC Asp     GGC Gly     C
     GTA Val     GCA Ala     GAA Glu     GGA Gly     A
     GTG Val     GCG Ala     GAG Glu     GGG Gly     G
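The genetic code can be written down as a small lookup table. The sketch below builds the standard code from a compact 64-letter string (bases ordered T, C, A, G, one-letter amino acid codes, "*" for Stop) and translates a DNA string codon by codon. The names CODON_TABLE and translate are chosen for this illustration; the sketch reads only one reading frame and stops at the first Stop codon.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = Stop)
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA (first reading frame) until a Stop codon or the end."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("atggcttggtaa"))  # lower case = DNA, output = amino acids
```

Note how the input follows the lower-case convention for nucleotides and the output uses upper-case amino acid letters.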
Protein Structure
 DNA:
 Nucleotides are very similar
and hence the structure of
DNA is very uniform
 Proteins:
 Great variety in three-dimensional conformation to
support diverse structures
and functions
 If heated, a protein “unfolds” to a
biologically inactive
structure; under normal
conditions the protein folds
Paradox
 Translation from DNA sequence to amino acid
sequence
 is very simple to describe,
 but requires immensely complicated machinery
(ribosome, tRNA)
 The folding of the protein sequence into its three-dimensional structure
 is very difficult to describe,
 but occurs spontaneously
Central Dogma
DNA sequence determines protein sequence
Protein sequence determines protein structure
Protein structure determines protein function
Observables and Data Archives
 Databases in molecular biology cover
 Nucleic acid and protein sequences,
 Macromolecular structures and functions
 Archival databanks of biological information
 DNA and protein sequences including annotations
 Nucleic acid and protein structures including annotations
 Protein expression patterns
 Derived Databases
 Sequence motifs (“signatures” of protein families)
 Mutations and variants in DNA and protein sequences
 Classification or relationships (e.g. hierarchy of structures)
 Bibliographic databases (PubMed with 17M abstracts)
 Collections
 of links to web sites
 of databases
What is Bioinformatics
 Bioinformatics is the marriage of biology and
information technology
 Bioinformatics is an integrated multidisciplinary
field
 Covers computational tools and methods for
managing, analysing and manipulating sets
of biological data
 Disciplines include:
 biochemistry, genetics, structural biology, artificial
intelligence, machine learning, software
engineering, statistics, database theory,
information visualisation, algorithm design
Bioinformatics
 Has three components
 Creation of databases
 Development of algorithms to analyse data
 Use of these tools for analysing biological data
Databases: Types of Queries 1/2
 1. Given a sequence (fragment), find sequences in
the database that are similar to it
 2. Given a protein structure (or fragment), find
protein structures in the database that are similar to it
 3. Given the sequence of a protein of unknown structure,
find structures in the database that adopt similar three-dimensional structures
 4. Given a protein structure, find sequences in the
database that correspond to similar structures.
Databases: Given sequence, find structure
 3. Given sequence of a protein of unknown structure, find
structures in the database that adopt similar three-dimensional
structures.
But How?
 Easy: Find similar sequences with known structure!
 But: There might be similar structures, whose sequence is not
similar!
 4. Given a protein structure, find sequences in the database
that correspond to similar structures.
But How?
 Easy: Find similar structures and hence sequences
 But: There are so many more sequences with unknown structure
that the above method will have only very limited success
 1 and 2 are solved, 3 and 4 are active fields of research
Databases: Types of Queries 2/2
 E.g. for which proteins of known structure involved in
disease of disrupted purine biosynthesis in humans,
are there related proteins in yeast?
 Solution: Virtual databases that provide transparent
access to a number of underlying data sources and
query and analysis tools
Databases: Curation and Quality
 Problems:
 Given that there are primary and secondary
databases,
how to control updates,
how to propagate change,
how to maintain consistency?
 Contents (experimental results, annotations,
supplementary information) all have their own sources
of error
 Older data were limited by older techniques
Databases: Annotation
 Experimental data (e.g. raw DNA sequence) needs to be
enriched with annotations
 Source of data
 Investigators responsible
 Relevant publication
 Feature tables (e.g. coding regions)
 Problems:
 (often) lack of controlled and coherent vocabulary
 Computer-parseable
 Automated annotation needed
 SwissProt = ca. 130.000 annotated sequences
 TrEMBL = ca. 850.000 unannotated sequences
 Maintenance of annotations (what if an error is detected?)
Computers and Computer Science
 Relevant areas:
 Artificial Intelligence
 Machine Learning
 Neural networks, rule-based learning
 Data mining
 Association rules
 Software Engineering
 Design, implementation, testing of software
 Programming
 Object-oriented: C++, Java
 Imperative: C, Modula, Pascal, Cobol, Fortran
 Logic: Prolog
 Functional: ML
 Scripting: Perl, Python
 Statistics
 Database theory
 Design and maintenance of databases
 How to index sequences, time series, 3D structures
 Information Visualisation
 Graph drawing, diagrams, cartoons, 3D graphics
 Algorithm design
 Complexity of algorithms
 Efficient data structures
Programming
 We will use Python
 Scripting language
 Supports string processing well
 Widely used in bioinformatics
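As a small taste of Python's string processing, the following sketch computes the reverse complement and the GC content of a DNA string. The function names are chosen for this illustration only; the string methods used (str.maketrans, translate, count, slicing) are part of standard Python.

```python
# Translation table mapping each base to its complement (both cases)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


print(reverse_complement("ATGC"))
print(gc_content("ATGC"))
```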
Biological Classification and
Nomenclature
 In the 18th century, Linnaeus, a Swedish naturalist,
classified living things according to a hierarchy:
Kingdom, Phylum, Class, Order, Family, Genus,
Species
 Generally only genus and species are used for
identification
 Homo sapiens
 Drosophila melanogaster
 Bos taurus
 Linnaeus’ classification is based on observed
similarity
 Largely reflects biological ancestry
Classification of Humans and Fruit Flies
              Human        Fruit fly
 Kingdom:    Animalia     Animalia
 Phylum:     Chordata     Arthropoda
 Class:      Mammalia     Insecta
 Order:      Primates     Diptera
 Family:     Hominidae    Drosophilidae
 Genus:      Homo         Drosophila
 Species:    sapiens      melanogaster
Homology = derived from common ancestor
 Characteristics derived from a common ancestor
are called homologous
 E.g. eagle’s wing and human’s arm
 Other apparently similar characteristics may have
arisen independently by convergent evolution
 E.g. eagle’s wing and bee’s wing. The most recent common
ancestor of eagles and bees did not have wings
 Homologous characters may diverge functionally
 E.g. bones in the human middle ear and jaws of primitive fish
Sequence analysis and Homology
 Sequence analysis gives unambiguous evidence
for relationship of species
 For higher organisms sequence analysis and the
classical tools of comparative anatomy,
palaeontology, and embryology are often consistent
 For microorganisms there are problems
 Classical methods: how to describe features
 Sequence analysis: lateral gene transfer
Domains of Life
 Ribosomal RNA is present in all organisms
 Based on 16S (small-subunit) ribosomal RNAs life is divided into
 Bacteria
 No nucleus (prokaryote)
 E.g. Mycobacterium tuberculosis and E. coli
 Archaea
 No nucleus (prokaryote)
 few organisms living in hostile environments (thermophiles, halophiles,
sulphur reducers, methanogens)
 Eukarya
 Has a nucleus contained in membrane
 Nucleus contains chromosomes
 Internal compartments called organelles for specialised biological
processes
 Area outside nucleus and organelles called cytoplasm
 E.g. yeast and human beings
Eukaryotic cell
Domains of Life
Example: Use of sequences to
determine phylogenetic relationships
 Use ExPASy (www.expasy.ch/cgi-bin/sprot-search-ful) to
search for pancreatic ribonuclease for
 horse (Equus caballus),
 minke whale (Balaenoptera acutorostrata),
 red kangaroo (Macropus rufus)
 sp|P00674|RNP_HORSE Ribonuclease pancreatic
(EC 3.1.27.5) (RNase 1) (RNase A) - Equus
caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
 Use sequence alignment to determine evolutionary relationship
Sequence alignment
 Global match: align all of one with all of the other
sequence (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
 Local match: find region in one sequence that
matches the other (mismatches, insertions, deletions;
ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
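The difference between global and local matching can be illustrated with a small dynamic-programming sketch. With local=False it computes a global alignment score in the style of Needleman-Wunsch; with local=True it computes a local score in the style of Smith-Waterman, where scores never drop below zero and the best cell anywhere counts. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions made for this example, not values from the lecture.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b, global (default) or local."""
    cols = len(b) + 1
    # First row: aligning the empty prefix of a with prefixes of b
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


print(alignment_score("TTACGTTT", "ACGT"))              # global: ends are penalised
print(alignment_score("TTACGTTT", "ACGT", local=True))  # local: ends are ignored
```

Only scores are returned here; recovering the aligned strings themselves would require a traceback through the table.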
Sequence alignment
 Motif search:
 find matches of short sequence in long sequence
 Option: perfect, 1 mismatch, or mismatches+gaps+insertions+deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
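The "perfect" and "1 mismatch" options can be sketched by sliding the motif along the text and counting mismatching characters in each window. The function name find_motif and its parameter are chosen for this illustration; gaps, insertions and deletions are not handled in this simple version.

```python
def find_motif(text, motif, max_mismatches=0):
    """Start positions where motif matches text with at most max_mismatches."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


line = "for the watch to babble and to talk is most tolerable"
print(find_motif(line, "match"))                    # perfect matches only
print(find_motif(line, "match", max_mismatches=1))  # allows "watch"
```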
Sequence alignment
Multiple sequence alignment
No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.
Example: Multiple alignment
 Use sequence alignment to determine evolutionary
relationship…
 Example: horse, whale and kangaroo
 Expected: horse and whale are placental mammals,
kangaroo is a marsupial
 Multiple alignment with CLUSTAL-W
(www.ebi.ac.uk/clustalw)
FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
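Reading FASTA records takes only a few lines of Python. The sketch below assumes each record starts with a single header line beginning with ">" followed by one or more sequence lines; the function name parse_fasta and the tiny example input are made up for this illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 demo\nKESP\nAMKF\n>seq2\nRESP\n"
for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```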
Multiple Alignment with ClustalW
(www.ebi.ac.uk/clustalw)
CLUSTAL W (1.82) multiple sequence alignment
sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                     *:** **:*****: :......*** ** *.**.* ***:***:**. *.*:* *
sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::******* ******
sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     * *
Example: Number of Aligned Residues
 Horse and Minke whale: 95
 Minke whale and Red kangaroo: 82
 Horse and Red kangaroo: 75
 Conclusion: Horse and whale share the most
identical residues
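Counts like these can be obtained by comparing two rows of an alignment column by column. The sketch below counts columns where both rows carry the same residue and neither is a gap; the function name is chosen for this illustration. The example uses only the first eight columns of the horse and whale rows, so it does not reproduce the totals quoted above.

```python
def identical_positions(row_a, row_b, gap="-"):
    """Number of alignment columns with the same residue in both rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)


# First eight alignment columns of horse and Minke whale
print(identical_positions("KESPAMKF", "RESPAMKF"))
```

Applying the function to the full concatenated rows of the ClustalW output would give the pairwise counts for all three species.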
Example: Elephant and Mammoth
 Mitochondrial cytochrome b from
 Siberian woolly mammoth
(Mammuthus primigenius)
preserved in arctic permafrost
 African elephant (Loxodonta africana)
 Indian elephant (Elephas maximus)
African elephant: sp|P24958|CYB_LOXAF (first row)
Mammoth: sp|P92658|CYB_MAMPR (second row)
Indian elephant: sp|O47885|CYB_ELEMA (third row)
MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
*** ** ***:**:**********************************************
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
************************************************************
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
**************************************:*********************
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
:********:***********************************************:**
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
******************************************************:*****
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
**:*************************: *** **********:***************
IILAFLPIAGVIENYLIK 378
IILAFLPIAGMIENYLIK 378
IILAFLPIAGMIENYLIK 378
**********:*******
Example: Elephant and Mammoth
 Mammoth and African elephant have 10 mismatches,
 mammoth and Indian elephant 14.
 Significant?
Similarity and Homology
 Important difference:
 Similarity is the measurement of resemblance of
sequences
 Homology: common ancestor
 Similarity is gradual, homology is either true or false
 Similarity = now, homology = past events
 Homology is only very rarely directly observed (e.g. lab
population, clinical study of viral infection)
 Homology is inferred from sequence similarity
Example: Homology/Similarity
 The assertion that the cytochrome b sequences are
homologues means that there is a common ancestor
 BUT:
 1. Maybe cytochrome b functionally requires so many
conserved residues that similar sequences will occur in many species
(in fact, this is not the case here)
 2. Maybe cytochrome b has to function this way in elephant-like
species, but in fact started out from different ancestors (i.e.
convergent evolution)
 3. Maybe mammoth and African elephant have fewer
mismatches only because Indian elephant’s DNA mutated faster
 4. Maybe all of them acquired cytochrome b through a virus
(horizontal gene transfer)
Example: Conclusion
 Classical methods confirm that for pancreatic
ribonuclease inferring homology from similarity is
justified
 But whether mammoths are closer to
African or Indian elephants is too close to call
 Problems with inferring phylogeny from gene and
protein sequences
 Wide range of variation (possibly below statistical
significance)
 Different rates of evolution for different branches of the
evolutionary tree
Inferring Phylogenies
with SINES and LINES
 Requirements:
 ‘all-or-none’ character
 Irreversible appearance
 Solution:
 SINES and LINES (Short and Long Interspersed
Nuclear Elements)
 Repetitive, non-coding sequences in eukaryotic
genomes
 >30% in human genome, >50% in some plants
 SINES = 70-500 base pairs long, up to 10^6 copies
 LINES up to 7000 base pairs, up to 10^5 copies
 They enter genome by reverse transcription of RNA
A practical example:
Fatherhood
 The picture shows a Southern
blot of DNA from different family
members, probed using a minisatellite.
 You can work out which of F1
and F2 is the father of child C,
by observing which bands they
have in common.
 (Reproduced from "Essential
Medical Genetics" by M.Connor
and M.Ferguson-Smith, with
permission from Blackwell
Science).
Why SINES are useful in phylogeny
 Either present or absent
 Inserted at random in non-coding portion of genome
 i.e. SINE has no important function so that convergent
evolution can be excluded
 Presence of a SINE in two species and absence in a third implies
that first two species are more closely related
 SINE insertion appears to be irreversible
 Temporal order
 Presence of a SINE in two species and absence in a third implies
that ancestor of first two species is younger than ancestor of all
three
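The presence/absence reasoning above can be sketched with Python sets: each species gets the set of SINE insertion sites found in its genome, and the pair sharing the most insertions is taken as the most closely related. The species names, site labels and function name below are hypothetical and serve only to illustrate the idea.

```python
from itertools import combinations


def shared_insertions(presence):
    """Number of SINE insertion sites shared by each pair of species."""
    return {
        (a, b): len(presence[a] & presence[b])
        for a, b in combinations(sorted(presence), 2)
    }


# Hypothetical data: site "s0" is an old insertion found in all three species
presence = {
    "whale": {"s0", "s1", "s2", "s3", "s4"},
    "hippo": {"s0", "s1", "s2", "s3", "s4"},
    "cow": {"s0"},
}
counts = shared_insertions(presence)
closest = max(counts, key=counts.get)
print(counts)
print(closest)
```

Because insertion is treated as irreversible, a site shared by only two of the three species points to a common ancestor of those two that is younger than the ancestor of all three.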
Example revisited
 What is the closest land-based relative of the whales?
 Classical palaeontology
 links Cetacea (whales, dolphins, porpoises) with Artiodactyla
(including e.g. cattle)
 Belief that Cetaceans diverged before Artiodactyla split into
suborders
 Suiformes (e.g. pigs),
 Tylopoda (e.g. camels, llamas),
 Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffe)
 Sequence comparison results
 Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen,
and others
 Closest relatives of whales are hippopotamuses (They share 4
SINES)
 These two are closest to Ruminantia
Searching for Similar
Sequences with PSI-Blast
 Any search method for
sequences should be
 Sensitive: also pick up distant
relationships
 Selective: reported relationships
are true
 Example: database with (among
others) 1000 globin sequences
 Globin family (oxygen transport) of
proteins occurs in many species
 Proteins have same function and
structure
 But there are pairs of members of
the family sharing less than 10%
identical residues
[Figure: sequence database containing 1000 globin sequences and a search returning 900 results]
 True positives: 700 out of 900 search results are really globins
 False positives: 200 out of 900 are not globins
 False negatives: 300 out of 1000 are not found
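The numbers of the globin example translate directly into two simple ratios. In the sketch below, sensitivity is the fraction of all globins that the search finds, and selectivity is the fraction of reported hits that really are globins. The function name and the exact definitions are this sketch's reading of "sensitive" and "selective", stated here as an assumption.

```python
def search_quality(true_positives, false_positives, false_negatives):
    """Return (sensitivity, selectivity) of a database search."""
    sensitivity = true_positives / (true_positives + false_negatives)
    selectivity = true_positives / (true_positives + false_positives)
    return sensitivity, selectivity


# Globin example: 700 true positives, 200 false positives, 300 false negatives
sensitivity, selectivity = search_quality(700, 200, 300)
print(f"sensitivity = {sensitivity:.2f}")   # 700 of 1000 globins found
print(f"selectivity = {selectivity:.2f}")   # 700 of 900 hits are globins
```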
Searching for Distant Relationships
with PSI-BLAST
 How can we find distant relationships without
increasing the false positives?
 PSI-BLAST:
 Position-Specific Iterated Basic Local Alignment
Search Tool
 Identifies patterns within the sequences
 Score via intermediaries may be better than score from
direct comparison
A and B: 50% identical
B and C: 50% identical
A and C: only 10% identical
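The idea that an intermediary can reveal a distant relationship can be sketched as an iterated search over pairwise identities: every newly found sequence is used as a query in the next round. This is a deliberately simplified illustration and not how PSI-BLAST works internally (PSI-BLAST iterates with a profile built from the hits). The function names, the threshold of 30% and the data are assumptions for this example.

```python
def direct_hits(query, identity, threshold):
    """Sequences whose identity with the query reaches the threshold."""
    hits = set()
    for (a, b), percent in identity.items():
        if percent >= threshold and query in (a, b):
            hits.add(b if a == query else a)
    return hits


def iterated_search(query, identity, threshold):
    """Repeat the search from every new hit until nothing new is found."""
    found = {query}
    frontier = [query]
    while frontier:
        current = frontier.pop()
        for hit in direct_hits(current, identity, threshold):
            if hit not in found:
                found.add(hit)
                frontier.append(hit)
    found.discard(query)
    return found


identity = {("A", "B"): 50, ("B", "C"): 50, ("A", "C"): 10}
print(direct_hits("A", identity, threshold=30))      # finds only B
print(iterated_search("A", identity, threshold=30))  # also reaches C via B
```

The price of iterating is that one false hit can pull in further unrelated sequences, so selectivity has to be watched in every round.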
PSI-BLAST Example
 Human PAX-6 gene (SwissProt ID P26367) has
homologues in many different species
 PSI-Blast at NCBI site www.ncbi.nlm.nih.gov
Result
BLASTP 2.2.6 [Apr-09-2003]
RID: 1062117117-16602-2157828.BLASTQ3
Query= gi|6174889|sp|P26367|PAX6_HUMAN Paired box protein Pax-6
(Oculorhombin) (Aniridia, type II protein).
(422 letters)
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF
1,509,571 sequences; 486,132,453 total letters
Results of PSI-Blast iteration 1
Sequences with E-value BETTER than threshold
Sequences producing significant alignments:
                                                                          Score   E
                                                                          (bits)  Value
gi|4505615|ref|NP_000271.1| paired box gene 6 isoform a; Paired box h...  781     0.0
gi|189353|gb|AAA59962.1| oculorhombin >gi|189354|gb|AAA59963.1| oculo...  780     0.0
gi|6981334|ref|NP_037133.1| paired box homeotic gene 6 [Rattus norveg...  778     0.0
gi|26389393|dbj|BAC25729.1| unnamed protein product [Mus musculus]        776     0.0
gi|7305369|ref|NP_038655.1| paired box gene 6; small eye; Dickie's sm...  776     0.0
gi|383296|prf||1902328A PAX6 gene                                         775     0.0
gi|4580424|ref|NP_001595.2| paired box gene 6 isoform b; Paired box h...  775     0.0
gi|18138028|emb|CAC80516.1| paired box protein [Mus musculus]             773     0.0
gi|2576237|dbj|BAA23004.1| PAX6 protein [Gallus gallus]                   770     0.0
gi|27469846|gb|AAH41712.1| Similar to paired box gene 6 [Xenopus laevis]  768     0.0
…
Introduction to Protein Structure
 Proteins play a variety of roles:
 Structural (viral coat proteins, horny outer layer of
human and animal skin, cytoskeleton)
 Catalysis of chemical reactions (enzymes)
 Transport and Storage (e.g. haemoglobin)
 Regulation (e.g. hormones)
 Receptor and signal transduction
 Genetic transcription
 Recognition (cell adhesion molecules)
 Antibodies and other proteins of the immune system
Proteins
 Are large molecules
 Only small part – the active site – is functional
 Evolve by structural changes produced by mutations
in the amino acid sequence
 Ca. 40000 protein structures are now known
 Can be obtained by X-ray crystallography or nuclear
magnetic resonance (NMR)
Structure of Proteins
 Backbone and sidechain
 Residue i-1, Residue i, Residue i+1
   Si-1   Si     Si+1        Sidechain (variable)
   |      |      |
…N-Cα-C-N-Cα-C-N-Cα-C-…     Mainchain (constant)
      ||     ||     ||
      O      O      O
 Polypeptide chain folds into a curve in space
 Common structural feature
 Alpha-helix
 Beta-sheet
Hierarchy of Architecture
 Primary structure: Amino acid sequence
 Secondary structure: Helices, sheets, loops,
hydrogen-bonding pattern of main chain
 Tertiary structure: Assembly and interactions of
helices, sheets, etc.
 Quaternary structure: Assembly of monomers
 Evolution can merge proteins
 Five enzymes in E. coli that catalyze successive steps
in biosynthesis of aromatic amino acids correspond to
one protein in Aspergillus nidulans
 Globins form tetramers in mammalian haemoglobin
and dimers in ark clam Scapharca inaequivalvis
Protein Structure
Triosephosphate isomerase from Bacillus stearothermophilus
Highly efficient enzyme appearing in most species
Hierarchy of Architecture: supersecondary structure
 Alpha-helix hairpin
 Beta hairpin
 Beta-alpha-beta unit
Hierarchy of Architecture
 Supersecondary structures:
 Alpha-helix hairpin
 Beta hairpin
 Beta-alpha-beta unit
 Domains:
 Compact unit, single chain, independent stability
 Modular proteins:
 Multi-domain
 Copies of related domains or “mix-and-match”
Classification of Protein Structure
 All Alpha: mostly alpha helices
 All Beta: mostly beta sheets
 Alpha+Beta: Helices and sheets in different parts of
the molecule, no beta-alpha-beta units
 Alpha/Beta: Helices and sheets assembled from
beta-alpha-beta units
 Alpha/Beta linear
 Alpha/Beta barrel
 Little or no secondary structure
SCOP: Structural Classification of Proteins
top
CLASS
All alpha (218)
All Beta (144)
Alpha+Beta (279)
Alpha/Beta (136)
FOLD
Trypsin-like serine proteases (1)
Immunoglobulin-like (23)
SUPERFAMILY
=evolutionary related, similar structure,
not necessarily similar sequence
Transglutaminase (1)
Immunoglobulin (6)
FAMILY = set of domains with similar sequence
C1 set domains (antibody constant)
V set domains (antibody variable)
Pymol
Engrailed homeodomain (1enh)
Transcription factor important in development
Used to study protein folding
Utrophin calmodulin homology domain (1bhd)
Actin binding
Closely related to dystrophin, whose lack causes muscular dystrophies (weak muscles)
Cytochrome c, rice (1ccr)
Electron transport across mitochondrial membrane
DNA-binding domain of HIN recombinase (1hcr)
Fibronectin III domain (1fna)
Found on cell surface
Mannose-binding protein (1npl)
Barnase (1brn)
Cleaves RNA and is lethal if intracellular and not inhibited by barstar
TATA-box-binding protein (1cdw)
OB-domain from Lys-tRNA synthetase (1bbw)
Scytalone dehydratase (3std)
Alcohol dehydrogenase, NAD-binding domain (1ee2)
Breakdown of alcohol into simpler compounds
Adenylate kinase (3adk)
Energy production
Chemotaxis receptor methyltransferase (1af7)
Thiamine phosphate synthase (2tps)
Pancreatic spasmolytic polypeptide (2psp)
Protein Structure Prediction and
Engineering
 If sequence of amino acids contains enough information to
specify three-dimensional structure of proteins, it should be
possible to devise algorithm for prediction
 Secondary structure prediction: Which segments of the
sequence are helices, which strands?
 Fold recognition: Given
 a library of known structures with their sequences and
 a sequence with unknown structure,
 can we find the structure that is most similar?
 Homology modelling
 Given two homologous sequences, one with and one without a known
structure: if more than 50% of the residues are identical, the known
structure can serve as a model
Critical Assessment of Structure
Prediction (CASP)
Chicken lysozyme          KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES

Chicken lysozyme          TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin  TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD

Chicken lysozyme          DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL
Baboon alpha-lactalbumin  I--KGIDYWIAHKALC-TEKL-EQWL--CE-K
Clinical Implications
 Fast and reliable diagnosis of disease and risk:
 With symptoms
 In advance of appearance (e.g. Huntington)
 In utero (e.g. cystic fibrosis: mutation in cystic fibrosis transmembrane
conductance regulator (CFTR), which is a chloride ion channel)
 Genetic counselling
 Customized treatment
 E.g. childhood leukaemia is treated with toxic drug 6-mercaptopurine.
Small fraction of patients used to die as they lack enzyme thiopurine
methyltransferase.
 Identify drug targets
 ½ are receptors, ¼ are enzymes, ¼ are hormones
 7% have unknown targets
 Gene therapy
 Replace defective genes or supply gene products (insulin for diabetes
and Blood Factor VIII for haemophilia)
 However: Most diseases do not have a single genetic cause!
Quick check
 By now you should
 Have read chapter 1
 Know the main data sources (sequence and structure)
 Know the role that bioinformatics plays
 Understand the difference between homology and similarity
 Understand what sequence comparison and alignment are
 Understand how they can be useful for phylogenetic studies
 Understand primary, secondary, tertiary structure
 Be able to assess the assumptions made and the quality of data