Sequence - BIOTEC - Biotechnology Center TU Dresden
Introduction
based on
Chapter 1
Lesk, Introduction to Bioinformatics
Michael Schroeder
BioTechnological Center
TU Dresden
Biotec
Contents
• Molecular biology primer
• The role of computer science
• Phylogeny
• Sequence Searching
• Protein structure
• Clinical implications
• Read chapter 1
By Michael Schroeder, Biotec,
26 June 2000: Draft of human genome announced!
• 1953: Watson and Crick discover the structure of DNA
• 2000: Draft of human genome is announced
• “The most wondrous map ever produced by humankind”
• “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”
High-throughput biomedicine
• Microarrays
  • Measure activity of thousands of genes at the same time
  • Example: Cancer
    • Compare activity with and without drug treatment
  • Result: Hundreds of candidate drug targets
• RNAi (Nobel Prize 2006, Fire and Mello)
  • Knock down genes and observe effect
  • Example: Infectious diseases
    • Which proteins orchestrate entry into cell?
  • Result: Hundreds of candidate proteins
• Atomic force microscopes (co-invented by Nobel laureate Binnig)
  • Pull protein out of membrane and measure force
  • Example: Eye diseases resulting from misfolding
  • Result: Hundreds of candidate residues
Drug Discovery
• Challenge: Longer time to market, fewer drugs, exploding costs
• Approach: Use of compound libraries and high-throughput screening
HTS and Bioinformatics
• High-throughput technologies have completely changed the work of biomedical researchers
• Challenge: Interpret (often large) results of screens
• Approach: Before running secondary assays use bioinformatics and IT to assemble all possible information
Good News
[Figure: Number of PubMed abstracts per year, 1960-2010: >16,000,000 articles. Also: >1,000,000 sequences, >30,000 3D structures.]
[Figure: Number of data sources per year, 2000-2005, from the Molecular Biology Database List at Nucleic Acids Research: >700 DBs/tools.]
Bad News: Data != Knowledge
• How to analyse data, how to integrate data?
• Computer science to the rescue…
Example: computer science is key for sequencing
• Human genome is a string of length 3,200,000,000
• Shotgun sequencing: Break multiple copies of the string into shorter substrings
• Example:
  • shotgunsequencing shotgunsequencing shotgunsequencing
  • cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un
• Computing problem: Assemble strings
Computer science key for sequencing
shotgunsequencing
sh
sho
shot
  otgu
   tg
    gun
     un
      ns
       seq
       sequ
        equ
          uenc
           encing
           en
             cing
              ing
QUESTION: How can you handle long repetitive sequences, e.g. Heeeeelllllllllllooooooo?
QUESTION: Why was a draft announced? When was the final version ready?
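The assembly problem can be sketched in Python as a greedy merge: repeatedly join the two fragments with the longest suffix/prefix overlap. This is only an illustrative sketch, not how production assemblers work; the function names `overlap` and `greedy_assemble` are made up for this example, and the greedy strategy is an assumption chosen for brevity.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedily merge fragments by largest overlap (illustrative sketch only)."""
    # Drop duplicates and fragments fully contained in another fragment.
    frags = sorted(set(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: return the remaining contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical fragments with unambiguous overlaps.
print(greedy_assemble(["shotg", "tgunse", "nsequ", "quencing"]))
```

With very short fragments such as "en" or "un", or with long repeats, several merges tie and the greedy choice can go wrong, which is the point of the first question above.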
[Figure: organisms with sequenced genomes]
Yersinia pestis, Arabidopsis thaliana, Buchnera sp. APS, Caenorhabditis elegans, Campylobacter jejuni, Helicobacter pylori, rat, Chlamydia pneumoniae, Mycobacterium leprae, Rickettsia prowazekii, mouse, Aquifex aeolicus, Vibrio cholerae, Drosophila melanogaster, Neisseria meningitidis Z2491, Plasmodium falciparum, Saccharomyces cerevisiae, Salmonella enterica, Archaeoglobus fulgidus, Borrelia burgdorferi, Bacillus subtilis, Mycobacterium tuberculosis, Escherichia coli, Thermoplasma acidophilum, Pseudomonas aeruginosa, Ureaplasma urealyticum, Thermotoga maritima, Xylella fastidiosa
Breakthrough of the year 2000
Next quest:
Sequencing a genome for $1000
Quantity and quality of data lead to
ambitious goals
• Understand integrative aspects of the biology of organisms
• Interrelate sequence, three-dimensional structure, interactions, function of proteins, nucleic acids and protein-nucleic acid complexes
• Travel in time
  • backward (deduce events in evolutionary history) and
  • forward (deliberate modification of biological systems)
• Applications in medicine, agriculture, and other scientific fields
Scenario
• New virus (e.g. SARS) and goal to develop treatment
• Scientists isolate genetic material of virus
• Screen genome for relationships with previously studied viruses [10]
• From the virus’ DNA they compute the proteins it produces [1]
• Compute the proteins’ three-dimensional structure and thereby obtain clues about their functions
  • Screen for similar protein sequences with known structure [15]
  • If any are found
    • Then interpret difference (homology modelling) [25]
    • Else predict structure from sequence [55]
• Identify or design small molecule blocking relevant active sites of the protein [50]
• Design antibodies to neutralize the virus [50]
• Index of problem difficulty (numbers in brackets):
  • <30: solution exists already,
  • >30: we cannot solve this (yet)
Life in Time and Space
• Life
  • A biological organism is a naturally-occurring, self-reproducing device that effects controlled manipulations of matter, energy and information
• Time
  • Species evolve through
    • natural mutation,
    • recombination of genes in sexual reproduction, or
    • direct gene transfer
  • Read the past in contemporary genomes
• Space
  • Species occupy local ecosystems
  • Species are composed of organisms
  • Organisms are composed of cells
  • Cells are composed of molecules
DNA – the molecule of life
http://www.ornl.gov/hgmis
Proteins
• 20 naturally occurring amino acids in proteins
  • Non-polar
    • G glycine, A alanine, P proline, V valine
    • I isoleucine, L leucine, F phenylalanine, M methionine
  • Polar
    • S serine, C cysteine, T threonine, N asparagine
    • Q glutamine, H histidine, Y tyrosine, W tryptophan
  • Charged
    • D aspartic acid, E glutamic acid, K lysine, R arginine
• Other classification
  • H, F, Y, W are aromatic and play a role in membrane proteins
• Distinguish
  • atg = adenine-thymine-guanine and
  • ATG = Alanine-Threonine-Glycine
The genetic code
First position  Second position                              Third position
(5' end)        T          C          A          G           (3' end)
T               TTT Phe    TCT Ser    TAT Tyr    TGT Cys     T
                TTC Phe    TCC Ser    TAC Tyr    TGC Cys     C
                TTA Leu    TCA Ser    TAA Stop   TGA Stop    A
                TTG Leu    TCG Ser    TAG Stop   TGG Trp     G
C               CTT Leu    CCT Pro    CAT His    CGT Arg     T
                CTC Leu    CCC Pro    CAC His    CGC Arg     C
                CTA Leu    CCA Pro    CAA Gln    CGA Arg     A
                CTG Leu    CCG Pro    CAG Gln    CGG Arg     G
A               ATT Ile    ACT Thr    AAT Asn    AGT Ser     T
                ATC Ile    ACC Thr    AAC Asn    AGC Ser     C
                ATA Ile    ACA Thr    AAA Lys    AGA Arg     A
                ATG Met*   ACG Thr    AAG Lys    AGG Arg     G
G               GTT Val    GCT Ala    GAT Asp    GGT Gly     T
                GTC Val    GCC Ala    GAC Asp    GGC Gly     C
                GTA Val    GCA Ala    GAA Glu    GGA Gly     A
                GTG Val    GCG Ala    GAG Glu    GGG Gly     G
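The codon table can be turned into a small translation routine. The sketch below builds the standard table from a 64-letter string in T, C, A, G order and translates a DNA string in reading frame 0 until the first stop codon. The names `CODON_TABLE` and `translate` are chosen for this example only, and stopping at the first stop codon is a simplifying assumption.

```python
from itertools import product

BASES = "TCAG"
# One amino acid letter per codon, in the order TTT, TTC, TTA, TTG, TCT, ...
# '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA (frame 0) into one-letter amino acids up to the first stop."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Lowercase nucleotides in, uppercase amino acids out: atg codes for M (Met).
print(translate("atggcttaa"))
```

Note how the lowercase/uppercase convention from the previous slide shows up here: the input `atg` is a codon of three nucleotides, while the output letter `M` is a single amino acid.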
Protein Structure
• DNA:
  • Nucleotides are very similar and hence the structure of DNA is very uniform
• Proteins:
  • Great variety in three-dimensional conformation to support diverse structures and functions
  • If heated, a protein “unfolds” to a biologically inactive structure; under normal conditions the protein folds
Paradox
• Translation from DNA sequence to amino acid sequence
  • is very simple to describe,
  • but requires immensely complicated machinery (ribosome, tRNA)
• The folding of the protein sequence into its three-dimensional structure
  • is very difficult to describe,
  • but occurs spontaneously
Central Dogma
• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
Observables and Data Archives
• Databases in molecular biology cover
  • Nucleic acid and protein sequences,
  • Macromolecular structures and functions
• Archival databanks of biological information
  • DNA and protein sequences including annotations
  • Nucleic acid and protein structures including annotations
  • Protein expression patterns
• Derived Databases
  • Sequence motifs (“signatures” of protein families)
  • Mutations and variants in DNA and protein sequences
  • Classification or relationships (e.g. hierarchy of structures)
• Bibliographic databases (PubMed with 17M abstracts)
• Collections
  • of links to web sites
  • of databases
What is Bioinformatics
• Bioinformatics is the marriage of biology and information technology
• Bioinformatics is an integrated multidisciplinary field
• Covers computational tools and methods for managing, analysing and manipulating sets of biological data
• Disciplines include:
  • biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design
Bioinformatics
• Has three components
  • Creation of databases
  • Development of algorithms to analyse data
  • Use of these tools for analysing biological data
Databases: Types of Queries 1/2
• 1. Given a sequence (fragment), find sequences in the database that are similar to it
• 2. Given a protein structure (or fragment), find protein structures in the database that are similar to it
• 3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures
• 4. Given a protein structure, find sequences in the database that correspond to similar structures.
Databases: Given sequence, find structure
• 3. Given the sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures. But how?
  • Easy: Find similar sequences with known structure!
  • But: There might be similar structures whose sequence is not similar!
• 4. Given a protein structure, find sequences in the database that correspond to similar structures. But how?
  • Easy: Find similar structures and hence sequences
  • But: There are so many more sequences with unknown structure that the above method will have only very limited success
• 1 and 2 are solved, 3 and 4 are active fields of research
Databases: Types of Queries 2/2
• E.g. for which proteins of known structure involved in diseases of purine biosynthesis in humans are there related proteins in yeast?
• Solution: Virtual databases that provide transparent access to a number of underlying data sources and query and analysis tools
Databases: Curation and Quality
• Problems:
  • Given that there are primary and secondary databases,
    • how to control updates,
    • how to propagate change,
    • how to maintain consistency?
  • Contents (experimental results, annotations, supplementary information) all have their own sources of error
  • Older data were limited by older techniques
Databases: Annotation
• Experimental data (e.g. raw DNA sequence) needs to be enriched with annotations
  • Source of data
  • Investigators responsible
  • Relevant publication
  • Feature tables (e.g. coding regions)
• Problems:
  • (Often) lack of controlled and coherent vocabulary
  • Annotations should be computer-parseable
  • Automated annotation needed
    • SwissProt = ca. 540,000 annotated sequences
    • TrEMBL = ca. 40 million unannotated sequences
  • Maintenance of annotations (what if an error is detected?)
Computers and Computer Science
• Relevant areas:
  • Artificial Intelligence
    • Machine Learning: neural networks, rule-based learning
    • Data mining: association rules
  • Software Engineering
    • Design, implementation, testing of software
  • Programming
    • Object-oriented: C++, Java
    • Imperative: C, Modula, Pascal, Cobol, Fortran
    • Logic: Prolog
    • Functional: ML
    • Scripting: Perl, Python
  • Statistics
  • Database theory
    • Design and maintenance of databases
    • How to index sequences, time series, 3D structures
  • Information Visualisation
    • Graph drawing, diagrams, cartoons, 3D graphics
  • Algorithm design
    • Complexity of algorithms
    • Efficient data structures
Programming
• We will use Python
  • Scripting language
  • Supports string processing well
  • Widely used in bioinformatics
Biological Classification and
Nomenclature
• Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species
• Generally only genus and species are used for identification
  • Homo sapiens
  • Drosophila melanogaster
  • Bos taurus
• Linnaeus’ classification is based on observed similarity
• It widely reflects biological ancestry
Classification of Humans and Fruit Flies
            Human       Fruit fly
Kingdom:    Animalia    Animalia
Phylum:     Chordata    Arthropoda
Class:      Mammalia    Insecta
Order:      Primates    Diptera
Family:     Hominidae   Drosophilidae
Genus:      Homo        Drosophila
Species:    sapiens     melanogaster
Homology = derived from common ancestor
• Characteristics derived from a common ancestor are called homologous
  • E.g. eagle’s wing and human’s arm
• Other apparently similar characteristics may have arisen independently by convergent evolution
  • E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings
• Homologous characters may diverge functionally
  • E.g. bones in the human middle ear and jaws of primitive fish
Sequence analysis and Homology
• Sequence analysis gives unambiguous evidence for relationship of species
• For higher organisms sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent
• For microorganisms there are problems
  • Classical methods: how to describe features
  • Sequence analysis: lateral gene transfer
Domains of Life
• Ribosomal RNA is present in all organisms
• Based on 16S (small-subunit) ribosomal RNAs, life is divided into three domains
  • Bacteria
    • No nucleus (prokaryote)
    • E.g. Mycobacterium tuberculosis and E. coli
  • Archaea
    • No nucleus (prokaryote)
    • Few organisms, living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens)
  • Eukarya
    • Has a nucleus enclosed in a membrane
    • Nucleus contains chromosomes
    • Internal compartments called organelles for specialised biological processes
    • Area outside nucleus and organelles called cytoplasm
    • E.g. yeast and human beings
Eukaryotic cell
Domains of Life
Example: Use of sequences to
determine phylogenetic relationships
Use ExPASy (www.expasy.ch) to search for
pancreatic ribonuclease for
horse (Equus caballus),
minke whale (Balaenoptera acutorostrata),
red kangaroo (Macropus rufus)
>sp|P00674|RNP_HORSE Ribonuclease pancreatic
(EC 3.1.27.5) (RNase 1) (RNase A) - Equus
caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTF
VHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKY
PNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST
Use sequence alignment to determine evolutionary relationship
Sequence alignment
1. Global match: align all of one with all of the other sequence (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
2. Local match: find region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    |||||||||    |||||||||||||   |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
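A global match like the first example can be computed by dynamic programming. The sketch below follows the Needleman-Wunsch scheme with an assumed scoring of +1 for a match, -1 for a mismatch and -1 per gap position; these values and the function name `global_align` are illustrative choices, not parameters given in the lecture.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


s, top, bottom = global_align("we.ripe.and.ripe", "we.rot.and.rot")
print(top)
print(bottom)
```

A local match differs mainly in that cell scores are not allowed to drop below zero and the traceback starts from the highest-scoring cell, so unmatched ends cost nothing.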
Sequence alignment
3. Motif search: find matches of short sequence in long sequence
Options: perfect match, 1 mismatch, mismatches+gaps+insertions+deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
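The "perfect" and "1 mismatch" options can be sketched with a sliding window that counts mismatching positions. The function name `find_motif` and the `max_mismatches` parameter are hypothetical, and gaps, insertions and deletions are deliberately not handled in this sketch.

```python
def find_motif(text, motif, max_mismatches=0):
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


text = "for the watch to babble and to talk is most tolerable"
print(find_motif(text, "match"))                    # perfect matches only
print(find_motif(text, "match", max_mismatches=1))  # allow one mismatch
```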
Sequence alignment
4. Multiple sequence alignment
No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.
Example: Multiple alignment
Use sequence alignment to determine evolutionary
relationship…
Example: horse, whale and kangaroo
Expected: horse and whale are placental mammals,
kangaroo is marsupial
Multiple alignment with CLUSTAL-W
(http://www.genome.jp/tools/clustalw)
multiple sequence alignment computer program
main parameters: gap opening/extension penalty
FASTA format
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5)
(RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5)
(RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke
whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5)
(RNase 1) (RNase A) - Macropus rufus (Red kangaroo)
(Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
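Reading FASTA records takes only a few lines of Python. The sketch below assumes the usual convention that each record has a single header line starting with `>` followed by one or more sequence lines (the headers above are wrapped only for display). The function name `parse_fasta` is made up for this example.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {h: "".join(parts) for h, parts in records.items()}


# Toy input with shortened, hypothetical records.
example = """>seq1 demo
KESPAM
KFERQH
>seq2
RESPAM
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```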
Multiple Alignment with ClustalW
(http://www.genome.jp/tools/clustalw)
CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                      *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     *  *
Example: Number of Aligned Residues
Horse and Minke whale:         95
Minke whale and Red kangaroo:  82
Horse and Red kangaroo:        75
Conclusion: Horse and whale share the most identical residues
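Counting identical residues between two rows of an alignment is a one-line comparison once the rows are available as equal-length strings. The helper name `count_identities` is hypothetical; columns where both rows contain a gap are not counted, which is an assumption about how such counts are usually made.

```python
def count_identities(row_a, row_b):
    """Count aligned columns where both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")


# Last block of the ribonuclease alignment (horse, whale, kangaroo).
horse, whale, kangaroo = "DASVEVST", "DNSV----", "DAYV----"
print(count_identities(horse, whale))
print(count_identities(whale, kangaroo))
```

Applying the same function to the concatenated rows of the full alignment gives the pairwise totals for whole sequences.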
New Example: Elephant and Mammoth
Mitochondrial cytochrome b from
Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost
African elephant (Loxodonta africana)
Indian elephant (Elephas maximus)
Q: To which one is the mammoth more closely related?
African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA

sp|P24958|CYB_LOXAF  MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
sp|P92658|CYB_MAMPR  MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
sp|O47885|CYB_ELEMA  MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
                     *** ** ***:**:**********************************************

sp|P24958|CYB_LOXAF  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
sp|P92658|CYB_MAMPR  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
sp|O47885|CYB_ELEMA  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
                     ************************************************************

sp|P24958|CYB_LOXAF  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
sp|P92658|CYB_MAMPR  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
sp|O47885|CYB_ELEMA  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
                     **************************************:*********************

sp|P24958|CYB_LOXAF  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
sp|P92658|CYB_MAMPR  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
sp|O47885|CYB_ELEMA  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
                     :********:***********************************************:**

sp|P24958|CYB_LOXAF  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
sp|P92658|CYB_MAMPR  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
sp|O47885|CYB_ELEMA  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
                     ******************************************************:*****

sp|P24958|CYB_LOXAF  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
sp|P92658|CYB_MAMPR  LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
sp|O47885|CYB_ELEMA  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
                     **:*************************: *** **********:***************

sp|P24958|CYB_LOXAF  IILAFLPIAGVIENYLIK 378
sp|P92658|CYB_MAMPR  IILAFLPIAGMIENYLIK 378
sp|O47885|CYB_ELEMA  IILAFLPIAGMIENYLIK 378
                     **********:*******
Example: Elephant and Mammoth
Mammoth and African elephant have 10 mismatches, mammoth and Indian elephant 14.
Significant?
Q1: Can we tell from these sequences alone that they are closely related?
Q2: Differences are small. Do they come from selection, random noise or drift?
Strategies are needed for judging the significance of differences and similarities
Excursion: Similarity
and Homology
Important difference:
Similarity is the measurement of resemblance of
sequences
Homology: common ancestor
Similarity is gradual, homology is either true or false
Similarity = now, homology = past events
Homology is only very rarely directly observed (e.g. lab
population, clinical study of viral infection)
Homology is inferred from sequence similarity
Example: Homology/Similarity
The assertion that the cytochrome b sequences are homologues means that there is a common ancestor
BUT:
1. Maybe cytochrome b functionally requires so many conserved residues that similar sequences will occur in many species (in fact, this is not the case here)
2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution)
3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster
4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)
The mammoth and elephant sequences are homologues. Are the ribonuclease sequences also homologues? There the difference is much bigger
Examples: Conclusion
Classical methods confirm that for pancreatic ribonuclease (horse, whale, kangaroo) inferring homology from similarity is justified
But whether mammoths are closer to African or Indian elephants is too close to call (non-significant)
Problems with inferring phylogeny from gene and protein sequence comparison
Wide range of variation (possibly below statistical significance)
Different rates of evolution for different branches of the evolutionary tree
Even if there is a relationship: which sequence came first?
Inferring Phylogenies with SINES and LINES
Phylogeneticist’s dream of features:
‘all-or-none’ character
Irreversible appearance
Solution:
SINES and LINES (Short and Long Interspersed Nuclear Elements)
Repetitive, non-coding sequences in eukaryotic genomes
>30% in human genome, >50% in some plants
SINES = 70-500 base pairs long, up to 10^6 copies
LINES up to 7000 base pairs, up to 10^5 copies
They enter genome by reverse transcription of RNA
A practical example:
Fatherhood
The picture shows a Southern
blot of DNA from different
family members, probed using
a mini-satellite.
You can work out which of F1
and F2 is the father of child C,
by observing which bands they
have in common.
(Reproduced from "Essential Medical Genetics" by M.Connor and
M.Ferguson-Smith, with permission from Blackwell Science.)
Why SINES are useful in phylogeny
Either present or absent
Inserted at random in non-coding portion of genome
i.e. SINE has no important function so that convergent
evolution can be excluded
Presence of a SINE in two species and absence in a third implies
that first two species are more closely related
SINE insertion appears to be irreversible
Temporal order
Presence of a SINE in two species and absence in a third implies
that ancestor of first two species is younger than ancestor of all
three
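Because a SINE is either present or absent at a locus, the reasoning above reduces to set operations. The sketch below finds the species pair sharing the most insertions. The locus names and the presence/absence data are invented purely for illustration, and the function names are hypothetical.

```python
from itertools import combinations


def shared_insertions(presence):
    """Map each species pair to the set of SINE loci present in both."""
    return {
        (a, b): presence[a] & presence[b]
        for a, b in combinations(sorted(presence), 2)
    }


def closest_pair(presence):
    """Return the species pair sharing the largest number of SINE loci."""
    shared = shared_insertions(presence)
    return max(shared, key=lambda pair: len(shared[pair]))


# Hypothetical presence/absence data, not real measurements.
presence = {
    "whale": {"locus1", "locus2", "locus3", "locus4", "locus5"},
    "hippo": {"locus1", "locus2", "locus3", "locus4", "locus5"},
    "cow": {"locus5"},
    "pig": set(),
}
print(closest_pair(presence))
```

In this made-up data, a locus present in whale, hippo and cow but absent in pig would date from an ancestor older than the one that whale and hippo share exclusively, matching the temporal-order argument above.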
Example revisited
Q: What is the closest land-based relative of the whales?
Classical palaeontology
links Cetacea (whales, dolphins, porpoises) with Artiodactyla
(including e.g. cattle)
Belief that Cetaceans diverged before Artiodactyla split into
suborder of
Suiformes (e.g. pigs),
Tylopoda (e.g. camels, llamas),
Ruminantia (e.g. deer, cattle, goats, sheep, antelopes, giraffe)
Example revisited
Sequence comparison results
Based on mitochondrial DNA, pancreatic ribonuclease, fibrinogen,
and others
Closest relatives of whales are hippopotamuses (share 4
SINES)
These two are closest to Ruminantia
Searching for Similar
Sequences with PSI-Blast
Any search method for sequences should be
Sensitive: pick up distant relationships
Selective: reported relationships are true
Example: database with (among others) 1000 globin sequences
Globin family (oxygen transport) of proteins occurs in many species
Proteins have same function and structure
But there are pairs of members of the family sharing less than 10% identical residues
[Figure: Sequence database containing 1000 globin sequences; a search returns 900 results]
True positives: 700 out of 900 are really globins
False positives: 200 out of 900 are not globins
False negatives: 300 out of 1000 are not found
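The numbers in the globin example translate directly into the two quality measures. In this sketch, sensitivity is taken as the fraction of true globins that were found and selectivity as the fraction of reported hits that are true globins; the function name `search_quality` is made up for this example.

```python
def search_quality(true_pos, false_pos, false_neg):
    """Return (sensitivity, selectivity) of a database search."""
    sensitivity = true_pos / (true_pos + false_neg)   # found / all real members
    selectivity = true_pos / (true_pos + false_pos)   # correct / all reported
    return sensitivity, selectivity


# Globin example: 700 true positives, 200 false positives, 300 false negatives.
sens, sel = search_quality(700, 200, 300)
print(f"sensitivity = {sens:.2f}, selectivity = {sel:.2f}")
```

For the example this gives a sensitivity of 0.70 (700 of 1000 globins found) and a selectivity of about 0.78 (700 of 900 hits correct).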
Searching for Distant Relationships
with PSI-BLAST
How can we find distant relationships without increasing the false positives?
PSI-BLAST: Position-Specific Iterated Basic Local Alignment Search Tool
Identifies conserved patterns within the sequences
Improves sensitivity and specificity
Score via intermediaries may be better than score from direct comparison
[Figure: A and B are 50% identical, B and C are 50% identical, but A and C are only 10% identical]
PSI-BLAST Example
Human PAX-6 gene (SwissProt ID P26367) has homologues in many different species (human, Drosophila, etc.)
Transcription factor (TF) for eye development
Mutations in:
Human: no or deformed iris
Drosophila: no eyes; when expressed in wing or leg, ectopic eyes form
PSI-BLAST at NCBI site (www.ncbi.nlm.nih.gov)
Result
Result
• Description of sequence
• Max score: linked to data that show where sequences match
• Total score: includes scores from non-contiguous portions of the subject sequence that match the query
• Query coverage
• Identity: highest percentage of identical residues over the aligned segments
• E-value
• Accession number: linked to GenBank record
Result
BLASTP 2.2.28+
RID: 6D2U321501N
Database: All non-redundant GenBank CDS
translations+PDB+SwissProt+PIR+PRF excluding environmental samples
from WGS projects
33,121,465 sequences; 11,555,699,950 total letters
Query= gi|6174889|sp|P26367.2|PAX6_HUMAN RecName: Full=Paired box protein
Pax-6; AltName: Full=Aniridia type II protein; AltName:
Full=Oculorhombin
Length=422
                                                                   Score   E
Sequences producing significant alignments:                        (Bits)  Value
ref|NP_000271.1| paired box protein Pax-6 isoform a [Homo sap...   870     0.0
ref|XP_004264012.1| PREDICTED: paired box protein Pax-6 isofo...   869     0.0
ref|XP_003910122.1| PREDICTED: paired box protein Pax-6 isofo...   869     0.0
ref|XP_004683008.1| PREDICTED: paired box protein Pax-6 isofo...   869     0.0
ref|XP_005064880.1| PREDICTED: paired box protein Pax-6 isofo...   868     0.0
ref|NP_001035735.1| paired box protein Pax-6 [Bos taurus] >re...   868     0.0
gb|AAA59962.1| oculorhombin [Homo sapiens]                         868     0.0
ref|NP_037133.1| paired box protein Pax-6 [Rattus norvegicus]...   868     0.0
gb|EAW68233.1| paired box gene 6 (aniridia, keratitis), isofo...   869     0.0
...
Introduction to Protein Structure
Proteins play a variety of roles:
Structural (viral coat proteins, horny outer layer of
human and animal skin, cytoskeleton)
Catalysis of chemical reactions (enzymes)
Transport and Storage (e.g. haemoglobin)
Regulation (e.g. hormones)
Receptor and signal transduction
Genetic transcription
Recognition (cell adhesion molecules)
Antibodies and other proteins of the immune system
Proteins
Are large molecules
Only small part – the active site – is functional
Evolve by structural changes produced by mutations
in the amino acid sequence
Ca. 21,000 human protein structures are now known
Overall 90,000 protein structures in PDB
Can be obtained by X-ray crystallography or nuclear magnetic resonance (NMR)
Structure of Proteins
Backbone and side chain
Residue i-1, Residue i, Residue i+1
   Si-1   Si     Si+1      Side chain (variable)
   |      |      |
…N-Cα-C-N-Cα-C-N-Cα-C-…    Main chain (constant)
      ||     ||     ||
      O      O      O
Polypeptide chain folds into a curve in space
Common structural features
Alpha-helix
Beta-sheet
Turns and Loops
Hierarchy of Architecture
Primary structure: Amino acid sequence
Secondary structure: Helices, sheets, loops,
hydrogen-bonding pattern of main chain
Tertiary structure: Assembly and interactions of
helices, sheets, etc.
Quaternary structure: Assembly of monomers
Evolution can merge proteins
E.g.: 5 enzymes in E. coli = 1 protein in the fungus Aspergillus nidulans
catalyze successive steps in biosynthesis of aromatic amino acids
E.g.: Globins form tetramers in mammalian haemoglobin and dimers in the ark clam Scapharca inaequivalvis
Protein Structure
Triosephosphate isomerase from Bacillus stearothermophilus
Converts DHAP to GAP in glycolysis
Highly efficient enzyme appearing in most species
Extra layer of Architecture:
supersecondary structure
Alpha-helix hairpin
Beta hairpin
Beta-alpha-beta unit
= Patterns of interaction
between helices and sheets
Hierarchy of Architecture
Supersecondary structures:
Alpha-helix hairpin
Beta hairpin
Beta-alpha-beta unit
Domains:
Compact unit, single chain, independent stability
Modular proteins:
Multi-domain
Copies of related domains or “mix-and-match”
Classification of Protein Structure
All Alpha: mostly alpha helices
All Beta: mostly beta sheets
Alpha+Beta: Helices and sheets in different parts of
the molecule, no beta-alpha-beta units
Alpha/Beta: Helices and sheets assembled from
beta-alpha-beta units
Alpha/Beta linear
Alpha/Beta barrel
Little or no secondary structure
SCOP: Structural Classification of Proteins
[Figure: SCOP hierarchy, from the top]
CLASS: All alpha (284), All Beta (174), Alpha+Beta (376), Alpha/Beta (147)
FOLD: Trypsin-like serine proteases (1), Immunoglobulin-like (23)
SUPERFAMILY (= evolutionarily related, similar structure, not necessarily similar sequence): Transglutaminase (1), Immunoglobulin (6)
FAMILY (= set of domains with similar sequence): C1 set domains (antibody constant), V set domains (antibody variable)
Pymol
Engrailed homeodomain (1enh)
Transcription factor important in development
Used to study protein folding
Utrophin calponin homology domain (1bhd)
Actin binding
Closely related to dystrophin, whose lack causes muscular dystrophies (weak muscles)
Cytochrome c, rice (1ccr)
Electron transport across mitochondrial membrane
DNA-binding domain of HIN recombinase (1hcr)
Engrailed homeodomain (1enh)
Fibronectin III domain (1fna)
Found on cell surface
Mannose-binding protein (1npl)
Barnase (1brn)
Cleaves RNA and is lethal if intracellular and not inhibited by barstar
TATA-box-binding protein (1cdw)
OB-domain from Lys-tRNA synthetase (1bbw)
Scytalone dehydratase (3std)
Alcohol dehydrogenase, NAD-binding domain (1ee2)
Breakdown of alcohol into simpler compounds
Adenylate kinase (3adk)
Energy production
Chemotaxis receptor
methyltransferase (1af7)
Thiamine phosphate
synthase (2tps)
Pancreatic spasmolytic
polypeptide (2psp)
Protein Structure Prediction and
Engineering
If sequence of amino acids contains enough information to
specify three-dimensional structure of proteins, it should be
possible to devise algorithm for prediction
Secondary structure prediction: Which segments of the
sequence are helices, which strands?
Fold recognition: Given
library of known structures with their sequences and
a sequence with unknown structure,
can we find the structure that is most similar
Homology modelling
Given two homologous sequences, one with and one without a known structure.
If between 30 and 50% of the residues are identical, the known structure can serve as a model
Critical Assessment of Structure Prediction (CASP)
Chicken lysozyme          KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGS
Baboon alpha-lactalbumin  KQFTKCELSQNLY--DIDGYGRIALPELICTMFHTSGYDTQAIVEND-ES

Chicken lysozyme          TDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVS
Baboon alpha-lactalbumin  TEYGLFQISNALWCKSSQSPQSRNICDITCDKFLDDDITDDIMCAKKILD

Chicken lysozyme          DGN-GMNAWVAWRNRCKGTDVQA-WIRGCRL
Baboon alpha-lactalbumin  I--KGIDYWIAHKALC-TEKL-EQWL--CE-K
Clinical Implications of
Sequencing
Fast and reliable diagnosis of disease and risk:
Easy diagnosis (with symptoms)
In advance of appearance (e.g. Huntington)
In utero diagnosis (e.g. cystic fibrosis: thick secretions in lung)
Genetic counselling
Customized treatment (predict response to therapy/side effects)
E.g. childhood leukaemia is treated with toxic drug 6-mercaptopurine.
Small fraction of patients used to die as they lack enzyme thiopurine
methyltransferase.
Identify drug targets
Nowadays targets are: ½ receptors, ¼ enzymes, ¼ hormones
7% have unknown targets
Gene therapy
Replace defective genes or supply gene products (insulin for diabetes
and Blood Factor VIII for haemophilia)
However: Most diseases do not have a single genetic cause!
Quick check
By now you should
Have read chapter 1
Know the main data sources (sequence and structure)
Know the role that bioinformatics plays
Understand the difference between homology and similarity
Understand what sequence comparison and alignment are
Understand how they can be useful for phylogenetic studies
Understand primary, secondary, tertiary structure
Be able to assess the assumptions made and the quality of data