Transcript PPT
bioinformatics
course layout
introduction
molecular biology
biotechnology
bioMEMS
bioinformatics
bio-modeling
cells and e-cells
transcription and regulation
cell communication
neural networks
dna computing
fractals and patterns
the birds and the bees ….. and ants
book
Introduction to Computational Molecular Biology
introduction
DNA
DNA
central dogma
definitions
Informatics
the science of information management
Bioinformatics
the science of biological information management
what is bioinformatics?
Bioinformatics is Multidisciplinary
interdisciplinary
Genomics
Drug Design
Molecular
Biology
Phylogenetics
Computer
Science
Math
Statistics
Structural
Biology
increasing levels of complexity
Metabolome (metabolic pathways)
Proteome (proteins)
Transcriptome (RNA)
Genome (DNA)
growth of biological databases
GenBank base-pair growth
[Figure: bar chart of GenBank size in millions of base pairs, 1982 to 1999, rising from 1 million to 3,841 million]
Source: GenBank
growth of biological databases
3D structures growth
http://www.rcsb.org/pdb/holdings.html
DNA/RNA
symbol  meaning        explanation
G       G              Guanine
A       A              Adenine
T       T              Thymine
C       C              Cytosine
R       A or G         puRine
Y       C or T         pYrimidine
N       A, C, G or T   aNy base
U       U              Uracil
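The symbol list above can be written as a small lookup table. This is only a minimal sketch: the names IUPAC, matches and pattern_matches are made up for illustration, and only the symbols shown on the slide are included.

```python
# IUPAC nucleotide symbols from the slide, mapped to the bases they stand for.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "N": "ACGT",  # aNy base
}


def matches(symbol: str, base: str) -> bool:
    """Return True if a concrete base is allowed by an IUPAC symbol."""
    return base.upper() in IUPAC[symbol.upper()]


def pattern_matches(pattern: str, sequence: str) -> bool:
    """Check an equal-length sequence against a pattern, symbol by symbol."""
    return len(pattern) == len(sequence) and all(
        matches(p, b) for p, b in zip(pattern, sequence)
    )


print(pattern_matches("RAATTY", "GAATTC"))  # True
```

The same table is what lets degenerate recognition sites such as r'AATT_y be matched against a concrete sequence.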
definitions of bioinformatics
some definitions
use of computers to catalog and organize molecular life science
information into meaningful entities.
subset of computational biology
Methods to analyse, store, search, retrieve and represent
biological data by computers /in computers
massive amounts of data: databases
extracting information and knowledge from "raw" data
for most bioscientists, all they need in bioinformatics is sequence
analysis
what does it do?
bioinformatics is not just the storage of data in a
computer.
bioinformatics is the use of computers to test a
biological hypothesis prior to performing the
experiment in the laboratory.
bioinformatics is the design of software programs
that analyse data.
bioinformatics databases
nucleotide and protein sequences
protein structures
all sorts of functional data related to genes, proteins and
their regulation, interactions etc.
curated and non-curated databases
some goals
sequence searching and sequence alignments
looking at properties that can be analyzed/predicted
from sequence data
protein structures and their analysis
structural classification
visualisation of macromolecules
”system-wide” understanding of the biology of a given
organism
genomes and their annotation
complete genomes of many organisms are available
seeing ”parts lists” of everything an organism needs and
figuring out how they work together
annotation: looking at the DNA sequence
genomes and their annotation
gene finding is not always straightforward
problem: rare gene products, for which you cannot find
corresponding mRNA or protein sequences in databanks
additional complication: alternative splicing, many
transcripts per gene
genomes and their annotation
if you intend to analyze or just use data from a databank
it is useful to know both the goals and the reality of their
annotation level
inconsistencies, missing data
even well-annotated databanks provide only a fraction
of all biologically relevant information relevant to a gene
or a molecule (compared to literature)
annotation: a vision
databank content: all knowledge on functions of a gene
product
add structural information
insights in structure-function relationships
add data on expression patterns and regulation
understanding cell differentiation and other big questions
in biology on molecular level
current -omics
metabolomics
“…to identify, measure and interpret the complex time-related
concentration, activity and flux of metabolites in cells, tissues,
and other bio-samples such as blood, urine, and saliva.”
systems biology
Integrated view of biology at
multiple levels
Generation of quantitative,
predictive models of the
behavior of biological systems,
such as organisms
bioinformatics
in
short
very short
common
genes?
what is bioinformatics?
Application of information technology to the storage,
management and analysis of biological information
Facilitated by the use of computers
what is bioinformatics?
Sequence analysis
Geneticists/molecular biologists analyse genome
sequence information to understand disease processes
Molecular modeling
Crystallographers/biochemists design drugs using
computer-aided tools
Phylogeny/evolution
Geneticists obtain information about the evolution of
organisms by looking for similarities in gene sequences
Ecology and population studies
Bioinformatics is used to handle large amounts of data
obtained in population studies
Medical informatics
Personalised medicine
sequence analysis: overview
Sequencing project
management
Nucleotide
sequence
analysis
Sequence
database
browsing
Sequence
entry
Manual
sequence
entry
Nucleotide sequence file
Search for protein
coding regions
Search databases for
similar sequences
Design further experiments
Restriction mapping
PCR planning
coding
non-coding
Protein
sequence
analysis
Translate
into protein
Search databases for
similar sequences
Sequence comparison
Search for
known motifs
RNA structure
prediction
Multiple sequence analysis
Create a multiple
sequence alignment
Edit the alignment
Format the alignment
for publication
Molecular
phylogeny
Protein family
analysis
Protein sequence file
Search for
known motifs
Predict
secondary
structure
Sequence
comparison
Predict
tertiary
structure
gene
sequencing
Automated chemical sequencing methods allow rapid generation
of large data banks of gene sequences
database
similarity
searching
Sequences producing significant alignments:                          Score (bits)  E Value
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]      112           7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...   106           5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...   69            7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]      30            0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]      29            1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]      29            1.5
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)
Query: 2
QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
+ PWG+ RV
G
G GV
VLDTGI T H D
R
+ +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233
Query: 51
PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
P
D NGHGTH AG I + +
GVA + ++
+G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288
The BLAST program has been written to allow rapid comparison of
a new gene sequence with the 100s of 1000s of gene sequences
in data bases
sequence comparison
Gene sequences can be aligned to see similarities
between gene from different sources
768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
||
||
|| | | ||| | |||| |||||
||| |||
87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG
.
.
.
.
.
814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
| | |
| |||||| |
|||| | || |
|
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG
.
.
.
.
.
864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
||| | ||| || || |||
|
||||||||| ||
|||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
813
135
863
172
913
216
restriction mapping
[Figure: restriction map with a position axis marked 50, 100, 150, 200, 250; cut positions are not recoverable]
Enzyme     No. of sites  Recognition site
AceIII     1             CAGCTCnnnnnnn'nnn...
AluI       2             AG'CT
AlwI       1             GGATCnnnn'n_
ApoI       2             r'AATT_y
BanII      1             G_rGCy'C
BfaI       2             C'TA_G
BfiI       1             ACTGGG
BsaXI      1             ACnnnnnCTCC
BsgI       1             GTGCAGnnnnnnnnnnn...
BsiHKAI    1             G_wGCw'C
Bsp1286I   1             G_dGCh'C
BsrI       2             ACTG_Gn'
BsrFI      1             r'CCGG_y
CjeI       2             CCAnnnnnnGTnnnnnn...
CviJI      4             rG'Cy
CviRI      1             TG'CA
DdeI       2             C'TnA_G
DpnI       2             GA'TC
EcoRI      1             G'AATT_C
HinfI      2             G'AnT_C
MaeIII     1             'GTnAC_
MnlI       1             CCTCnnnnnn_n'
MseI       2             T'TA_A
MspI       1             C'CG_G
NdeI       1             CA'TA_TG
Sau3AI     2             'GATC_
SstI       1             G_AGCT'C
TfiI       2             G'AwT_C
Tsp45I     1             'GTsAC_
Tsp509I    3             'AATT_
TspRI      1             CAGTGnn'
Genes can be analysed to detect gene sequences that
can be cleaved with restriction enzymes
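Detecting cleavable sites comes down to searching the sequence for each recognition site. The sketch below is illustrative only: it uses five of the non-degenerate sites from the enzyme list (cut markers removed), the function names are invented, and the demo sequence is hypothetical rather than taken from the slides. All five sites are palindromic, so scanning one strand is enough here.

```python
# Recognition sites copied from the enzyme list, cut markers removed.
SITES = {
    "AluI": "AGCT",
    "DpnI": "GATC",
    "EcoRI": "GAATTC",
    "MseI": "TTAA",
    "MspI": "CCGG",
}


def find_sites(sequence: str, site: str) -> list[int]:
    """Return 1-based start positions of every (possibly overlapping) site."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)
        start = sequence.find(site, start + 1)
    return positions


def restriction_map(sequence: str) -> dict[str, list[int]]:
    """Map each enzyme name to the positions where its site occurs."""
    return {name: find_sites(sequence, site) for name, site in SITES.items()}


# Hypothetical demo sequence, not from the slides.
demo = "AAGAATTCGGAGCTTTAACCGGGATC"
for enzyme, hits in restriction_map(demo).items():
    print(enzyme, hits)
```

Degenerate sites (those containing r, y, w, n and so on) would need symbol-by-symbol matching instead of a plain substring search.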
PCR primer design
OPTIMAL primer length                 --> 20
MINIMUM primer length                 --> 18
MAXIMUM primer length                 --> 22
OPTIMAL primer melting temperature    --> 60.000
MINIMUM acceptable melting temp       --> 57.000
MAXIMUM acceptable melting temp       --> 63.000
MINIMUM acceptable primer GC%         --> 20.000
MAXIMUM acceptable primer GC%         --> 80.000
Salt concentration (mM)               --> 50.000
DNA concentration (nM)                --> 50.000
MAX no. unknown bases (Ns) allowed    --> 0
MAX acceptable self-complementarity   --> 12
MAXIMUM 3' end self-complementarity   --> 8
GC clamp how many 3' bases            --> 0
Oligonucleotides for use in the polymerase chain
reaction can be designed using computer based programs
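A few of the parameters listed for primer design (length, GC% and melting temperature windows) can be checked with very little code. The sketch below is an assumption-laden illustration: the function names are invented, the melting temperature uses the rough Wallace rule (2 degrees per A/T, 4 degrees per G/C) rather than whatever model the program on the slide uses, and salt concentration, DNA concentration and self-complementarity are ignored. The demo primer is hypothetical.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> float:
    """Rough melting temperature: 2 degrees C per A/T, 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2.0 * at + 4.0 * gc


def primer_ok(primer: str,
              min_len: int = 18, max_len: int = 22,
              min_gc: float = 20.0, max_gc: float = 80.0,
              min_tm: float = 57.0, max_tm: float = 63.0) -> bool:
    """Apply the length, GC% and Tm windows from the parameter list."""
    return (min_len <= len(primer) <= max_len
            and min_gc <= gc_percent(primer) <= max_gc
            and min_tm <= wallace_tm(primer) <= max_tm)


candidate = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer
print(gc_percent(candidate), wallace_tm(candidate), primer_ok(candidate))
```

A real design tool would also screen for self-complementarity and 3' end complementarity, which this sketch leaves out.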
multiple sequence alignment
Sequences of proteins from different organisms can be aligned to see
similarities and differences
Alignment formatted using MacBoxshade
phylogeny inference
E.coli
C.botulinum
C.cadavers
C.butyricum
B.subtilis
Analysis of sequences allows evolutionary
relationships to be determined
Phylogenetic tree constructed using the Phylip package
B.cereus
large scale bioinformatics: genome projects
Mapping
Identifying the location of clones and markers on the chromosome by
genetic linkage analysis and physical mapping
Sequencing
Assembling clone sequence reads into large (eventually complete)
genome sequences
Gene discovery
Identifying coding regions in genomic DNA by database searching and
other
Function assignment
Using database searches, pattern searches, protein family analysis and
structure prediction to assign a function to each predicted gene
Data mining
Searching for relationships and correlations in the information
Genome comparison
Comparing different complete genomes to infer evolutionary history
and genome rearrangements
genomics
introduction to DNA microarrays
massive data sets from simultaneous expression levels of
thousands of genes
impossible to grasp directly by the human mind
methods are needed for finding meaningful results and
patterns from the bulk of data
DNA microarray bioinformatics
data manipulation: normalization etc.
data clustering
genes which behave in a similar fashion
sample classification by profiles of predictive genes (e.g.
cancer typing)
data mining:
finding interpretation to clustering results
example: recognition of regulatory factor binding sites in
coexpressed genes
basis of molecular biology
hierarchy of relationships:
genome
gene 1
gene 2
gene 3
gene X
protein 1
protein 2
protein 3
protein X
function 1
function 2
function 3
function X
genome size
Organism         Genome size (bp)   Genes
FERN             160,000,000,000
LUNGFISH         139,000,000,000
SALAMANDER       81,300,000,000
NEWT             20,600,000,000
ONION            18,000,000,000
GORILLA          3,523,200,000
MOUSE            3,454,200,000
HUMAN            3,400,000,000      31,000
Drosophila       137,000,000        13,500
C. elegans       96,000,000         19,000
Yeast            12,000,000         6,315
E. coli          5,000,000          5,361
smallest Genome  ??????
genomics
comparative genomics
whole-genome analyses
evolution studies
analyses of components in a ”complete” system
functional genomics = inferring functions from data
expression patterns, gene regulation
sequence comparisons, homologue relationships
studies of gene variation, altered phenotypes
DNA array technology
Array Type              Spot Density (per cm2)  Probe    Target  Labeling
Nylon Macroarrays       < 100                   cDNA     RNA     Radioactive
Nylon Microarrays       < 5000                  cDNA     mRNA    Radioactive/Fluorescent
Glass Microarrays       < 10,000                cDNA     mRNA    Fluorescent
Oligonucleotide Chips   < 250,000               oligo's  mRNA    Fluorescent
spotting robot
microarray expression analysis
microarray
photolithography
array
terminology
70 mer vs 40 mer Attachment
70 mer
40 mer
Target
NH2
NH2
NH2
microarray
microarray
data
gene
expression
analysis
control mouse
a stressed mouse
RNA extraction
target labeling
image analysis
determination of expression levels
DNA
micro-array
genomics
The application of high-throughput automated
technologies to molecular biology.
The experimental study of complete genomes.
genomics
technologies
Automated DNA sequencing
Automated annotation of sequences
DNA microarrays
gene expression (measure RNA levels)
single nucleotide polymorphisms (SNPs)
Protein chips (SELDI, etc.)
Protein-protein interactions
cDNA spotted microarrays
Affymetrix gene chips
microarray data analysis
Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
Discovery of common sequences in co-regulated genes
Meta-studies using data from multiple experiments
microarray data analysis
impact on bioinformatics
Genomics produces high-throughput, high-quality data,
and bioinformatics provides the analysis and
interpretation of these massive data sets.
It is impossible to separate genomics laboratory
technologies from the computational tools required for
data analysis.
proteomics
what is proteomics?
"The analysis of the entire protein complement expressed
by a genome, or by a cell or tissue type."
Two most related technologies
2-D electrophoresis: separation of complex protein
mixtures
Mass spectrometry: Identification and structure analysis
Wasinger VC et al. Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis 16 (1995) 1090-1094
transcription
genomic DNA
Structure
Regulation
Information
Computers cannot determine which of these 3 roles DNA
plays solely based on sequence… (although we would all
like to believe they can)
introduction to proteomics
Definitions
Classical - restricted to large scale analysis of gene
products involving only proteins
Inclusive - combination of protein studies with analyses that
have genetic components such as mRNA, genomics, and
yeast two-hybrid
Don’t forget that the proteome is dynamic, changing to
reflect the environment that the cell is in
1 gene = 1 protein?
1 gene is no longer equal to one protein
1 gene = how many proteins?
why proteomics?
Annotation of genomes, i.e. functional annotation
Genome + proteome = annotation
Protein Function
Protein Post-Translational Modification
Protein Localization and Compartmentalization
Protein-Protein Interactions
Protein Expression Studies
Differential gene expression is not the answer
types of proteomics
Protein Expression
Quantitative study of protein expression between
samples that differ by some variable
Structural Proteomics
Goal is to map out the 3-D structure of proteins and
protein complexes
Functional Proteomics
To study protein-protein interactions, 3-D structures,
cellular localization and PTMs in order to understand the
physiological function of the whole proteome.
introduction to proteomics
composition of the proteome depends on cell type,
developmental phase and conditions
proteome analyses are still struggling to solve the ”basic
proteome” of different cells and tissues or limited
changes under changing conditions or during processes
current methods can only ”see” the most abundant
proteins
proteomics
expression proteomics = differential proteomics = 2D-PE +
MS
interaction proteomics
functional proteomics = systematic perturbation or
functional inactivation of proteins in a given environment
structural proteomics
proteomics experiments
typically a combination of 2D protein electrophoresis
and mass spectrometry
labour-intensive, not really ”high-throughput” methods
more efficient ”protein array” methods are emerging
bioinformatics in proteomics
structural proteomics
High-throughput determination of the 3D structure of
proteins
Goal: to be able to determine or predict the structure of
every protein.
Direct determination - X-ray crystallography and nuclear
magnetic resonance (NMR).
Prediction
Comparative modeling
Threading/Fold recognition
Ab initio
why structural proteomics?
To study proteins in their active conformation.
Study protein:drug interactions
Protein engineering
Proteins that show little or no similarity at the primary
sequence level can have strikingly similar structures.
an
example
FtsZ - protein required for cell division in prokaryotes,
mitochondria, and chloroplasts.
Tubulin - structural component of microtubules important for intracellular trafficking and cell division.
FtsZ and Tubulin have limited sequence similarity and
would not be identified as homologous proteins by
sequence analysis.
homologues
FtsZ and tubulin have little
similarity at the amino acid
sequence level
Burns, R., Nature 391:121-123
Picture from E. Nogales
are FtsZ and tubulin homologues?
Yes!
Proteins that have conserved secondary structure can
be derived from a common ancestor even if the primary
sequence has diverged to the point that no similarity is
detected.
structure is function
protein structure
Imaging Experimental X-ray diffraction data
Predicting structure in silico from sequence
X-ray crystallography
Make crystals of your protein
0.3-1.0mm in size
Proteins must be in an ordered, repeating pattern.
X-ray beam is aimed at crystal and data is collected.
Structure is determined from the diffraction data.
X-ray crystallography
http://www-structure.llnl.gov/Xray/101index.html
crystals
Schmid, M. Trends in Microbiology, 10:s27-s31.
X-ray crystallography
Protein must crystallize.
Need large amounts (good expression)
Soluble (many proteins aren’t, membrane proteins).
Need to have access to an X-ray beam.
Solving the structure is computationally intensive.
Time - can take several months to years to solve a
structure
Efforts to shorten this time are underway to make this
technique high-throughput.
general process for proteomics research
Image Analysis
Gel hotel
Spot picker
2-D Gel
Digester
Spotter
MS
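general process for proteomics research
Adapted from the web page of Professor Juang Rong-Huay (莊榮輝), Department of Microbiology and Biochemistry, National Taiwan University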
protein microarray
arrayIT TM
protein microarray
arrayIT TM
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
what can protein microarrays do?
1. Protein / protein interaction
2. Enzyme / substrate interaction (transient)
3. Protein / small molecule interaction
4. Protein / lipid interaction
5. Protein / glycan interaction
6. Protein / Ab interaction
1. G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
2. H.Zhu et al, 2001 Science 293:2101
3. Ziauddin J and Sabatini DM, 2001 Nature 411:107
protein microarrays (Antibody arrays)
the real world
The true spot quality compared to a real experiment
mobility of protein in an electric field
Mobility: Electrolytic molecules move in an electric field
Mobility ~ ([Electric field (mV)] x [Net charge of molecule]) / [Friction between molecules and matrix]
2-dim electrophoresis
2-D gel electrophoresis
First dimension
denaturing iso-electric focusing
separation according to the pI
Second dimension
SDS-PAGE (Sodium Dodecyl Sulfate
Poly-Acrylamide Gel Electrophoresis)
Separation according to the
molecular weight
pI is the iso-electric point of the protein
MS analysis
Digest to peptide fragment
result
example
mass
spectrometry
Ion source: substance to ion gas
Mass analysis: according to mass/charge (m/z)
Detection: femtomole -attomole
Ion source
Ion separator
detector
principle of mass spec
[Figure: schematic. A peptide mixture embedded in light absorbing chemicals (matrix) is hit by a pulsed UV or IR laser (3-4 ns) in vacuum. The resulting cloud of protonated peptide molecules is accelerated by a strong electric field (V acc) into the Time Of Flight tube and reaches the detector.]
principle of mass spec
Linear Time Of Flight tube
ion source
detector
time of flight
Reflector Time Of Flight tube
ion source
detector
reflector
time of flight
typical result
Nuclear Magnetic Resonance Spectroscopy (NMR)
Can perform in solution.
No need for crystallization
Can only analyze proteins that are <300aa.
Many proteins are much larger.
Can’t analyze multi-subunit complexes
Proteins must be stable.
structure
modeling
Comparative modeling
Modeling the structure of a protein that has a high degree
of sequence identity with a protein of known structure
Must be >30% identity to have reliable structure
Threading/fold recognition
Uses known fold structures to predict folds in primary
sequence.
Ab initio
Predicting structure from primary sequence data
Usually not as robust, computationally intensive
sequence alignment
sequence alignment
©Ken Howard, Scientific American, July 2000
sequence
GATCAAACATTAAACATCCTGAGATCCAAAGGTAAGAGATCTAGCCACAGGGAGTGCTGGGGATTCGGGTCCTGGTGATCTTCACATGCTGACATAGCTCAG
CCCTTTTTGGCCCTGGCTTTGTCCTGTTGTGGGCTTTCCCATCTGCAACCCATGCTCCTGGGCCATTTTCCTATGGGCCAGGGAAAACAAGATGGGGTGAAGGC
ACCCTTACATTTAGGGGCAAGACCTAGTACTCAGAAGGATTCAGAAACTGAAATAGCTGGGTGATACCACACAGGTGCTAGGGATAAGGGGCCTTGAGCCAT
GGACCATGGGAACTACAAAGCTGAAGGAGCTGCTGCCTCAGCAGAACCAGCGCTTGAATTTGTTCTTTCAGAACCTCAGTCTCTTCCTCTGAAAAATGGGTGTG
TTGTGTATCCCACATTCCCAAGTCAGCCATGGGACCAAATGTGAGCGTGTGGGTTTTGCCTCCTGAGAAACTCAGGGGAGCAGAATGCTACAGTGGGTGAATT
GGATTCTTTCAGAGAGCCCACCCTGTTTCCCACATCAGCCAGAAGGCTCAAAACCCTGAAGAGCTTTCTGAACTTTGAGGTGCCCAAAGCTTCAGGGCTGTAT
GGGAAGCACCTGAGGTCCAAGTCCGTTTACAAGAATTTTGTTTTTTGGTTTACAGCTGCTTGGCCGGTCCAAGGAGCAGGTTTGGGTCCTGTGCTCCACAGACCT
AAGGGTTACCTTAGAGCTTATGGGAGAGCATTGTGTGTGGACAGTGGACAGTGCCCTCTAGTGCTCAGTGTTAGCACTACATCCAGTTGCCCTCCACCAGTTTAT
GCTGCTGAGGAAGTCTTTCTTTTCCCAACAGCAGTGTCTCTCCCTCTCCCACCCCCTCTCCCTCTCCCTCCCCCCCTAGGTTATTTTTATTTTTACTGGTGTGTATGTGT
GTGAGTCTATGTCACATGTATGAGAGTGCTTGTGGAGACCAGAAGAGGGCATCAGAAGAGCCCCTAGAACTGGAGTATAGGTGGTTGTGAGCCACTTGTCATG
GGTGTTGGGAACCAAACTCAGGTTCTCTGGAAGAACAACAAGCTCCCTTATCATATAAGCCATCTCTAAATCCAGGACATTTTTTTTTTTTTTTTTGAGATTTAGAGATTC
AAGGAGGAGGAACAATAGGAGGAAGAAGGGGACAGAATAAGGCCAACAAAATGACCAAGGAGGTATAGGCACTTGAAGCCAAACCTAAGTACCTGAGT
TCAATCCCTGGGACCCACATGATGGAAAGATGGAATCGATCCCCAAAAGTTATCTTCTGATCCCTATATGCACACACTTGAGGATGGACAGACAAAGAGACA
GACACACAAACACACACAAATGTAACTGAAAAAGAAACCTCTATGGGGACATCGCCTTCTTGGAGAGGCTCTGTTGCCCCTCATCCTAGTGAACAAACAACT
CCTACTCCCTGCCAGAGTATCCTACCCTTGGATTCAAAATGGTCTCAGAGGACACACCGGGTGGGCTCTGTCGCTGGGATCTTGCATAACCAATGCCCATAA
GCCTGGCAAAGGTGGCGATGAGACGATAAGGTCAGGGACATGACCGCAGAAGAGGAGTGGGGACGCGATGAGTGGGAGGAGCTTCTAAATTATCCATC
AGCACAAGCTGTCAGTGGCCCCAGCCATGAATAAATGTATAGGGGGAAAGGCAGGAGCCTTGGGGTCGAGGAAAACAGGTAGGGTATAAAAAGGGCAC
GCAAGGGACCAAGTCCAGCATCCTAGAGTCCAGATTCCAAACTGCTCAGAGTCCTGTGGACAGATCACTGCTTGGCAATGGCTACAGGTAAGCATGCGCA
AATCCCGCTGGGTGTGGTTTGGGACCCAGGGCCCCTGAAGATGGATCTGAGGCTTCTAATGTGAGTGCGTTCCAACTTCTGCCATGTTGGGAATACTCTGGGTC
CCTATGGGGATTGGGAGAGATCGGCCATTGCTCCCAGGTTTCTCCTGCCCTCCTGTCTCTCTCTAGACTCTCGGACCTCCTGGCTCCTGACCGTCAGCCTGCTC
TGCCTGCTCTGGCCTCAGGAGGCTAGTGCTTTTCCCGCCATGCCCTTGTCCAGTCTGTTTTCTAATGCTGTGCTCCGAGCCCAGCACCTGCACCAGCTGGCTGC
TGACACCTACAAAGAGTTCGTAAGTTCCCCAGAGATGGGTGCCCGTTTGTGGAAGCAGGAAGGGGCAGGTCCTACCCCATACTCCTGGCCCCAGGGAAG
GTCAATGGAGGGGAAATTATGGGGTAGGGGAATCTTAGCCAATGCTGTACCATAGTAATGATGGTGACGAGACACAAGCTGGTCCCTCAGTGACCACCCTTC
TTCCAGGAGCGTGCCTACATTCCCGAGGGACAGCGCTATTCCATTCAGAATGCCCAGGCTGCTTTCTGCTTCTCAGAGACCATCCCGGCCCCCACAGGCAA
GGAGGAGGCCCAGCAGAGAACCGTGAGTAGTCCCAGGCCTTGTCTGCACAAATCCTCGTTTCCCTCCATGCAGCCCTAACTGCACTCCAGGCCAGGGAC
CAGCTCCTCCCTGAAGCTGGGGTAACCTGGGAGTCCCAGGCAGAGGTCACTAGGCAATACACTAACCCCAGCCCTTTTTTTCCCCCCTCAGGACATGGAATT
GCTTCGCTTCTCGCTGCTGCTCATCCAGTCATGGCTGGGGCCCGTGCAGTTCCTCAGCAGGATTTTCACCAACAGCCTGATGTTCGGCACCTCGGACCGTGTC
TATGAGAAACTGAAGGACCTGGAAGAGGGCATCCAGGCTCTGATGCAGGTGAGGATGGACTAGCCTGGGGTTATGCCTGGAGCCTAGGTGGGGCTCACTG
TCCTCTGTTTTACCGGTCAGCCCTTAGACCCTTGAGAAGGCTTCTTCTTCTTCATTTTCCTTTATGAAGCCTCCAGGCTTTTCCTTCGGTCCTGGGGTGGAGGGAGGC
ACAGCTCCCGAGTCTCCTGCCCTTCTTTCCCACGACAGGAGCTGGAAGATGGCAGCCCCCGTGTTGGGCAGATCCTCAAGCAAACCTATGACAAGTTTGAC
GCCAACATGCGCAGCGACGACGCGCTGCTCAAAAACTATGGGCTGCTCTCCTGCTTCAAGAAGGACCTGCACAAAGCGGAGACCTACCTGCGGGTCATG
AAGTGTCGCCGCTTTGTGGAAAGCAGCTGTGCCTTCTAGCCACTCACCAGTGTCTCTGCTGCACTCTCCTGTGCCTCCCTGCCCCCTGGCAACTGCCACCCC
GCGCTTTGTCCTAATAAAATTAAGATGCATCATATCACCCGGCTAGAGGTCTTTCTGTTATGGGATGGAGCAGTTGTGTCAATCTTGTTCCTGGAAGCCTGCGAGAA
sequence alignment: why?
Early in the days of protein and gene sequence analysis,
it was discovered that the sequences from related
proteins or genes were similar, in the sense that one
could align the sequences so that many corresponding
residues match.
This discovery was very important: strong similarity
between two genes is a strong argument for their
homology. Bioinformatics is based on it.
sequence alignment: why?
Terminology:
Homology means that two (or more) sequences have a
common ancestor. This is a statement about
evolutionary history.
Similarity simply means that two sequences are
similar, by some criterion. It does not refer to any
historical process, just to a comparison of the
sequences by some method. It is a logically weaker
statement.
However, in bioinformatics these two terms are often
confused and used interchangeably. The reason is
probably that significant similarity is such a strong
argument for homology.
two protein alignment
many genes have a common ancestor
The basis for comparison of proteins and genes using the
similarity of their sequences is that the proteins or genes
are related by evolution; they have a common ancestor.
Random mutations in the sequences accumulate over
time, so that proteins or genes that have a common
ancestor far back in time are not as similar as proteins or
genes that diverged from each other more recently.
Analysis of evolutionary relationships between protein or
gene sequences depends critically on sequence
alignments.
definition of sequence alignment
Sequence alignment is the procedure of comparing
two (pairwise alignment) or more (multiple alignment)
sequences by searching for a series of individual characters
or patterns that are in the same order in the sequences.
There are two types of alignment: local and global. In
global alignment, an attempt is made to align the entire
sequence. If two sequences have approximately the
same length and are quite similar, they are suitable for
global alignment.
Local alignment concentrates on finding stretches of
sequence with a high level of similarity.
definition of sequence alignment
Global alignment
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Local alignment
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
interpretation of sequence alignment
Sequence alignment is useful for discovering structural,
functional and evolutionary information.
Sequences that are very much alike may have similar
secondary and 3D structure, similar function and likely a
common ancestral sequence. It is extremely unlikely that
such sequences obtained similarity by chance. For DNA
molecules with n nucleotides this probability is very low:
$P = 4^{-n}$. For proteins the probability is even lower:
$P = 20^{-n}$, where n is the number of amino acid residues
Large scale genome studies revealed existence of
horizontal transfer of genes and other sequences
between species, which may cause similarity between
some sequences in very distant species
methods of sequence alignment
Dot matrix analysis
The dynamic programming (DP) algorithm
Word or k-tuple methods
dot matrix analysis
A dot matrix analysis is a method for comparing two
sequences to look for possible alignment (Gibbs and
McIntyre 1970)
One sequence (A) is listed across the top of the matrix
and the other (B) is listed down the left side
Starting from the first character in B, one moves across
the page keeping in the first row and placing a dot in
every column where the character in A is the same
The process is continued until all possible comparisons
between A and B are made
Any region of similarity is revealed by a diagonal row of
dots
Isolated dots not on diagonal represent random
matches
dot matrix analysis
Detection of matching regions can be improved by
filtering out random matches and this can be achieved
by using a sliding window
This means that instead of comparing single sequence
positions, several positions are compared at the same time
and a dot is printed only if a certain minimal number of
matches occur
Dot matrix analysis can also be used to find direct and
inverted repeats within the sequences
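The dot matrix procedure, including the sliding window filter, can be sketched in a few lines. This is a minimal illustration, not the tool used for the plots that follow: the function names and the parameters window and min_matches are made up here, and the demo sequence is hypothetical.

```python
def dot_matrix(a: str, b: str, window: int = 1, min_matches: int = 1) -> list[list[bool]]:
    """Rows follow sequence B, columns follow sequence A.

    A dot at (i, j) means that at least `min_matches` of the `window`
    positions starting at b[i] and a[j] are identical.
    """
    rows = len(b) - window + 1
    cols = len(a) - window + 1
    return [
        [
            sum(a[j + k] == b[i + k] for k in range(window)) >= min_matches
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def show(matrix: list[list[bool]]) -> None:
    """Print the matrix with '*' for a dot and '.' for no dot."""
    for row in matrix:
        print("".join("*" if dot else "." for dot in row))


# Hypothetical demo: a sequence against itself, window of 3, all 3 must match.
show(dot_matrix("GATTACA", "GATTACA", window=3, min_matches=3))
```

With window=1 the same call also shows isolated off-diagonal dots (random matches); raising window and min_matches removes them and leaves the diagonal.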
dot matrix analysis: two identical sequences
Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
dot matrix analysis: two very different sequences
Nucleic Acids Dot Plots of genes Adh1 and G6pd in the
mouse
dot matrix analysis: two similar sequences
Nucleic Acids Dot Plots of genes Adh1 from the mouse
and rat (25 MY)
dynamic programming
back
to
the
basics
DNA/RNA sequences: strings composed of an alphabet
of 4 letters
Protein sequences: alphabet of 20 letters
why
do
we
do
it?
Identify a gene
Find clues to gene function (ortholog?)
Find other organisms with this gene (homology)
Gather info for an evolutionary model
…
alignment
alignment is the basis for finding similarity
Pairwise alignment = dynamic programming
Multiple alignment: protein families and functional
domains
Multiple alignment is "impossible" for lots of sequences
Another heuristic - progressive pairwise alignment
an
example
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
possible alignment
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
alignment
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
three elements
Perfect matches
Mismatches
Insertions & deletions (indel)
significant similarity in an alignment
Ho: the current alignment is a result of random line-up
(the 2 sequences are unrelated)
Ha: the sequences diverge from a common ancestor
(related)
Test statistic: Ymax = length of the longest running perfect
match subsequence
exact matching subsequences
In DNA alignment, the matching probability is
$p_{match} = p_a^2 + p_g^2 + p_c^2 + p_t^2$
where $p_a$, $p_g$, $p_c$, $p_t$ are the frequencies of the four bases
Under Ho, the lengths of exact match subsequences Y should follow
a geometric distribution
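As a small illustration of the test statistic, the sketch below computes the matching probability from base frequencies and the longest exact-match run in the example alignment shown earlier. The function names are invented and equal base frequencies are an assumption. Only the chance that a run starting at one fixed position reaches a given length is shown (p to the power of that length under the geometric model); the distribution of the maximum over all start positions is not derived here.

```python
def match_probability(freqs: dict[str, float]) -> float:
    """Sum of squared base frequencies: chance that two random bases agree."""
    return sum(p * p for p in freqs.values())


def longest_exact_run(row1: str, row2: str) -> int:
    """Length of the longest run of identical, non-gap alignment columns."""
    best = current = 0
    for x, y in zip(row1, row2):
        if x == y and x != "-":
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


# Assumption: all four bases are equally frequent.
p = match_probability({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
y_max = longest_exact_run("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
print(p, y_max, p ** y_max)
```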
well-matching subsequences
Evolution may cause small differences to even
sequences with a reasonably recent common ancestor.
We consider Ymax to be the longest subseq with up to k
mismatches.
Y follows a hyper-geometric distribution
P-value: exact/simulated/approximate (independence
among Y does not hold any more)
choosing alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
scoring
Score
Used to determine quality of match and basis for the
selection of matches. Scores are relative.
Expectation value
An estimate of the likelihood that a given hit is due to pure
chance, given the size of the database; should be as low
as possible. E.V.'s are absolute. A high score and a low E.V.
indicate a true hit.
Sequence identity (%) (or Similarity)
Number of matched residues divided by total length of
probe
scoring rule
Example Score =
(# matches) – (# mismatches) – (# indels) x 2
examples
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
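The example scoring rule (+1 per match, -1 per mismatch, -2 per indel) is easy to apply column by column. A minimal sketch, with score_alignment as an invented name; it assumes both rows are already padded with gaps to the same length.

```python
def score_alignment(row1: str, row2: str,
                    match: int = 1, mismatch: int = -1, indel: int = -2) -> int:
    """Score two aligned rows of equal length, one column at a time."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += indel      # a gap in either row counts as one indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # 3
```

Because the score is a sum over columns, it can be built up incrementally, which is what dynamic programming relies on.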
edit
distance
The edit distance between two sequences is the “cost”
of the “cheapest” set of edit operations needed to
transform one sequence into the other
Computing the edit distance between two sequences is
almost equivalent to finding the alignment that minimizes
the distance
$d(s_1, s_2) = \max_{\text{alignments of } s_1 \text{ and } s_2} \mathrm{score}(\text{alignment})$
computing edit distance
How can we compute the edit distance??
If |s| = n and |t| = m, there are more than
$\binom{m+n}{m}$
alignments
2 sequences each of length 1000: > 10^600
The additive form of the score allows us to use
dynamic programming to compute the edit distance
efficiently
r e c u r s i v e
a r g u m e n t
Suppose we have two sequences:
s[1..n+1] and t[1..m+1]
The best alignment must be in one of three cases:
1. Last position is (s[n+1],t[m +1] )
2. Last position is (s[n +1],-)
3. Last position is (-, t[m +1] )
Case 1: $d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$
Case 2: $d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$
Case 3: $d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$
recursive
argument
Define the notation:
$V[i, j] = d(s[1..i], t[1..j])$
Using the recursive argument, we get the following
recurrence for V:
$V[i+1, j+1] = \max \begin{cases} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{cases}$
recursive
argument
Of course, we also need to handle the base cases in the
recursion:
$V[0, 0] = 0$
$V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -)$
$V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$
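The recurrence and base cases can be sketched as follows. The pairwise score function is called sigma here and uses the earlier example rule (+1 match, -1 mismatch, -2 indel); both the name and the weights are assumptions for illustration. Python strings are 0-based, so s[i] in the code plays the role of s[i+1] in the recurrence.

```python
def sigma(a, b):
    """Example pairwise score: +1 match, -1 mismatch, -2 against a gap."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_matrix(s, t):
    """Fill V so that V[i][j] is the best score for s[:i] aligned with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):  # base case: prefix of s against the empty string
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):  # base case: empty string against prefix of t
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),      # last column pairs two letters
                V[i][j + 1] + sigma(s[i], "-"),   # last column is (s letter, gap)
                V[i + 1][j] + sigma("-", t[j]),   # last column is (gap, t letter)
            )
    return V


V = fill_matrix("AAAC", "AGC")
for row in V:
    print(row)
```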
dynamic programming algorithm
          A    G    C
     0    1    2    3
  0  0
A 1
A 2
A 3
C 4
We fill the matrix using the recurrence rule
dynamic programming algorithm
          A    G    C
     0    1    2    3
  0  0   -2   -4   -6
A 1 -2    1   -1   -3
A 2 -4   -1    0   -2
A 3 -6   -3   -2   -1
C 4 -8   -5   -4   -1
Conclusion: d(AAAC,AGC) = -1
interpretation of pointers
Insertion of S2(j) into S1
Deletion of S1(i) from S1
Match or Substitution
reconstructing the best alignment
We now trace back the path that corresponds to the best
alignment
AAAC
AG-C
          A    G    C
     0    1    2    3
  0  0   -2   -4   -6
A 1 -2    1   -1   -3
A 2 -4   -1    0   -2
A 3 -6   -3   -2   -1
C 4 -8   -5   -4   -1
reconstructing the best alignment
Sometimes, more than one alignment has the best score
AAAC
A-GC
          A    G    C
     0    1    2    3
  0  0   -2   -4   -6
A 1 -2    1   -1   -3
A 2 -4   -1    0   -2
A 3 -6   -3   -2   -1
C 4 -8   -5   -4   -1
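A sketch of the traceback that enumerates every optimal alignment instead of storing pointers: at each cell it re-checks which of the three cases reproduces the cell's value. The scoring weights and all function names are assumptions carried over from the example scoring rule.

```python
def sigma(a, b):
    """Example pairwise score: +1 match, -1 mismatch, -2 against a gap."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_matrix(s, t):
    """V[i][j] = best score for aligning s[:i] with t[:j]."""
    V = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], "-")
    for j in range(1, len(t) + 1):
        V[0][j] = V[0][j - 1] + sigma("-", t[j - 1])
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                V[i - 1][j] + sigma(s[i - 1], "-"),
                V[i][j - 1] + sigma("-", t[j - 1]),
            )
    return V


def tracebacks(s, t, V, i=None, j=None):
    """Yield every optimal alignment as a pair (aligned_s, aligned_t)."""
    if i is None:
        i, j = len(s), len(t)
    if i == 0 and j == 0:
        yield "", ""
        return
    # Match or substitution
    if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
        for a, b in tracebacks(s, t, V, i - 1, j - 1):
            yield a + s[i - 1], b + t[j - 1]
    # Letter of s against a gap
    if i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
        for a, b in tracebacks(s, t, V, i - 1, j):
            yield a + s[i - 1], b + "-"
    # Gap against a letter of t
    if j > 0 and V[i][j] == V[i][j - 1] + sigma("-", t[j - 1]):
        for a, b in tracebacks(s, t, V, i, j - 1):
            yield a + "-", b + t[j - 1]


V = fill_matrix("AAAC", "AGC")
results = list(tracebacks("AAAC", "AGC", V))
for top, bottom in results:
    print(top, bottom)
```

The recursion visits only cells on optimal paths, so each individual alignment is recovered in O(m+n) steps.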
complexity
Space: O(mn)
Time: O(mn)
Filling the matrix O(mn)
Backtrace O(m+n)
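The O(mn) space is needed for the backtrace. If only the score is required, keeping two rows of the matrix is enough. The sketch below assumes the example scoring rule from earlier; the function name is hypothetical.

```python
def score_two_rows(s, t, match=1, mismatch=-1, indel=-2):
    """Best alignment score using O(len(t)) memory (no traceback possible)."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]


print(score_two_rows("AAAC", "AGC"))
```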
other
scoring
schemes
Needleman and Wunsch: 1 for identical amino acid, 0
otherwise
Dayhoff PAM scoring matrix: variations include BLOSUM
matrices (Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci.
89, 10915-10919).
…
Different Gap Cost Function
scoring matrix for protein sequences
substitution “log odds” matrix BLOSUM 62
Henikoff and Henikoff (1992; PNAS 89:10915-10919)
PAM 250 matrix
( M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 5).
multiple sequence alignment
Often a probe sequence will yield many hits in a search.
Then we want to know which are the residues and
positions that are common to all or most of the probe
and match sequences
In multiple sequence alignment, all similar sequences
can be compared in one single figure or table. The basic
idea is that the sequences are aligned on top of each
other, so that a coordinate system is set up, where each
row is the sequence for one protein, and each column is
the 'same' position in each sequence.
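A minimal sketch of this coordinate system: each string is one row, and zipping the rows yields the columns. The toy alignment below is made up for illustration, and taking the most common non-gap character per column is just one simple way to define a consensus.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Most common non-gap character in each column of an alignment."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    out = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        out.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(out)


# Hypothetical toy alignment, not taken from the slides.
rows = ["ACGT-A", "ACGTTA", "AC-TTA", "TCGTTA"]
print(consensus(rows))
```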
an example
name of homologous domains
consensus
position of residue
residues and position common to most homologs
cellulose-binding domain of cellobiohydrolase
why multiple sequence alignment?
Identify consensus segments
Hence the most conserved sites and residues
Use for construction of phylogenies
Convert similarity to distance
www.ch.embnet.org/software/ClustalW.html
Of genes, strains, organisms, species, life
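One simple way to convert similarity to distance is the fraction of differing positions between two aligned rows (often called the p-distance). This is an assumed choice for illustration; the slides do not say which conversion is used.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing positions among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(rows):
    """Pairwise p-distances for all rows of an alignment."""
    return [[p_distance(a, b) for b in rows] for a in rows]


# Hypothetical toy alignment.
for line in distance_matrix(["ACGT", "ACGA", "AC-T"]):
    print(line)
```

A distance matrix of this kind is the usual input to distance-based tree-building methods.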
sequence logo
This shows the conserved residues as larger characters, where the total
height of a column is proportional to how conserved that position is.
Technically, the height is proportional to the information content of the
position.
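The information content of one alignment column can be sketched as the maximum possible entropy minus the observed entropy. The sketch assumes a DNA alphabet of size 4 by default and ignores gaps and small-sample corrections.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=4):
    """Bits of information in one column: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


print(information_content("AAAA"))  # fully conserved column
print(information_content("ACGT"))  # all four bases equally frequent
```

For protein alignments the same calculation would use alphabet_size=20.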
sample multiple alignment
constructing the tree of life
Bacteria
A. aeolicus
T. maritima
Eukarya
Archaea
Black tree: distribution of 8-mers. Red tree: sequence alignment.
with k-mers (16S RNA, 35 organisms)
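A sketch of comparing sequences by their k-mer distributions without any alignment. The Euclidean distance between frequency vectors is an assumption; the slide does not state which distance was used for the black tree.

```python
import math
from collections import Counter


def kmer_distribution(seq, k):
    """Relative frequency of every k-mer occurring in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def kmer_distance(a, b, k=8):
    """Euclidean distance between the k-mer frequency vectors of a and b."""
    pa, pb = kmer_distribution(a, k), kmer_distribution(b, k)
    keys = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(x, 0.0) - pb.get(x, 0.0)) ** 2 for x in keys))


print(kmer_distance("ACGTACGTAC", "ACGTACGTTC", k=2))
```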
databases of multiple alignments
Pfam: Protein families database of alignments and HMMs
www.cgr.ki.se
PRINTS, multiple motifs consisting of ungapped, aligned
segments of sequences, which serve as fingerprints for a
protein family
www.bioinf.man.ac.uk
BLOCKS, multiple motifs of ungapped, locally aligned
segments created automatically
fhcrc.org
s o f t w a r e
manual alignment- software
GDE- The Genetic Data Environment (UNIX)
CINEMA- Java applet available from:
http://www.biochem.ucl.ac.uk
Seqapp/Seqpup- Mac/PC/UNIX available from:
http://iubio.bio.indiana.edu
SeAl for Macintosh, available from:
http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
BioEdit for PC, available from:
http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html
B L A S T
Search a sequence database for fragments similar to
the query sequence
1. Compile a list of high-scoring short words shared by
the query sequence and the database;
2. Scan the database for “hits”
3. Expand the “hits” to MSP (maximum segment pair = a
pair of equal-length/no-gap segments with the highest
alignment score)
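The three steps can be sketched as a toy seed-and-extend search. This is a simplified illustration, not the BLAST implementation: it seeds on exact words only, scores +1/-1, and stops extending once the score falls more than `drop` below the best seen. All names and parameters are assumptions.

```python
from collections import defaultdict


def build_word_index(subject, w):
    """Step 1 (simplified): index every length-w word of the subject by position."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index


def extend(query, subject, q, s, step, match, mismatch, drop):
    """Step 3: walk from (q, s) in direction step; return (best gain, length at best)."""
    gain = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject) and best - gain <= drop:
        gain += match if query[q] == subject[s] else mismatch
        length += 1
        if gain > best:
            best, best_len = gain, length
        q += step
        s += step
    return best, best_len


def best_segment_pair(query, subject, w=3, match=1, mismatch=-1, drop=2):
    """Return (score, query start, subject start, length) of the best ungapped pair."""
    index = build_word_index(subject, w)
    best = None
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):  # step 2: scan for word hits
            right, rlen = extend(query, subject, q + w, s + w, +1, match, mismatch, drop)
            left, llen = extend(query, subject, q - 1, s - 1, -1, match, mismatch, drop)
            hit = (w * match + left + right, q - llen, s - llen, w + llen + rlen)
            if best is None or hit > best:
                best = hit
    return best


print(best_segment_pair("CCACGTACGCC", "TTTACGTACGTTT"))
```

Seeding on words keeps the search from examining every pair of positions, which is where the time saving of the heuristic comes from.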
B L A S T
Altschul et al. (1990) "Basic local alignment search tool."
J. Mol. Biol. 215:403-410.
Variations of BLAST designed for specific purposes
http://www.ncbi.nlm.nih.gov/BLAST/
similarity searching the databanks
What is similar to my sequence?
Searching gets harder as the databases get bigger and quality degrades
Tools:
BLAST and FASTA = time saving heuristics
(approximate)
Statistics + informed judgement of the biologist
read out
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
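A small sketch that pulls the summary numbers out of a read out like the one above. The regular expressions assume this older plain-text layout; other BLAST output formats would need different patterns.

```python
import re


def parse_hit_summary(text):
    """Extract score, E-value, identity and gap fraction from a hit summary."""
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    expect = re.search(r"Expect\s*=\s*([\d.eE+-]+)", text)
    ident = re.search(r"Identities\s*=\s*(\d+)/(\d+)", text)
    gaps = re.search(r"Gaps\s*=\s*(\d+)/(\d+)", text)
    return {
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "expect": float(expect.group(1)),
        "identity": int(ident.group(1)) / int(ident.group(2)),
        "gap_fraction": int(gaps.group(1)) / int(gaps.group(2)),
    }


summary = parse_hit_summary(
    "Score = 272 bits (137), Expect = 4e-71\n"
    "Identities = 258/297 (86%), Gaps = 1/297 (0%)"
)
print(summary)
```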
structure - function relationships
Can we predict the function of protein molecules from
their sequence?
sequence > structure > function
Conserved functional domains = motifs
Prediction of some simple 3-D structures (α-helix, β-sheet,
membrane spanning, etc.)
protein domains
(from ProDom database)
DNA sequencing
Automated sequencers > 40 KB per day
500 bp reads must be assembled into complete genes
errors especially insertions and deletions
error rate is highest at the ends where we want to overlap
the reads
vector sequences must be removed from ends
Faster sequencing relies on better software
overlapping deletions vs. shotgun approaches: TIGR
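The core step in assembling reads is finding where the end of one read overlaps the start of another. The sketch below assumes error-free reads and exact overlaps, which real data rarely offers given the error rates noted above; the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    n = overlap(a, b, min_len)
    return a + b[n:] if n else None


print(overlap("ACGTTGCA", "TGCAGGT"))
print(merge("ACGTTGCA", "TGCAGGT"))
```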
DNA
sequencing
finding genes in genome sequence is not easy
About 2% of human DNA encodes functional genes.
Genes are interspersed among long stretches of non-coding DNA.
Repeats, pseudo-genes, and introns confound matters
pattern finding tools
It is possible to use DNA sequence patterns to predict genes:
promoters
translational start and stop codons (ORFs)
intron splice sites
codon bias
Can also use similarity to known genes/ESTs
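As one example of such a pattern, open reading frames can be located by scanning each frame for a start codon followed by an in-frame stop codon. The sketch covers the forward strand only and uses the standard stop codons; the minimum length and function name are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) indices of ATG...stop ORFs in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons from ATG up to, but not including, the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


dna = "CCATGAAATGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Real gene finders combine this signal with the others listed above, since introns interrupt ORFs in eukaryotic genomic sequence.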
phylogenetics
Evolution = mutation of DNA (and protein) sequences
Can we define evolutionary relationships between
organisms by comparing DNA sequences?
is there one molecular clock?
phenetic vs. cladistic approaches
lots of methods and software, what is the "correct" analysis?
phylogenetics
software tools on the web
Many of the best tools are free over the Web
BLAST
ENTREZ/PUBMED
Protein motifs databases
Bioinformatics “service providers”
DoubleTwist™, Celera, BioNavigator™
Hodgepodge collection of other tools
PCR primer design
Pairwise and Multiple Alignment
PC programs
Macintosh and Windows applications
-Commercial Vector NTI™, MacVector™,
Sequencher™
- Freeware Phylip, Fasta, Clustal, etc.
OMIGA™,
Better graphics, easier to use
Can't access very large databases or perform
demanding calculations
Integration with web databases and computing services
Vector NTI
most important sequence databases
GenBank - maintained by the USA National Center for Biotechnology
Information (NCBI)
All biological sequences
www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
Genomes
www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome
Swiss-Prot - maintained by EMBL - European Bioinformatics
Institute (EBI)
Protein sequences
www.ebi.ac.uk/swissprot/
genome
project
the human genome project
The genome sequence is complete - almost!
Approximately 3.2 billion base pairs.
all the genes
Any human gene can now be found in the genome by
similarity searching with over 99% certainty.
However, the sequence still has many gaps
hard to find an uninterrupted genomic segment for any gene
still can’t identify pseudogenes with certainty
This will improve as more sequence data accumulates
Raw Genome Data:
example of the code
The next step is obviously to locate all of the genes and describe their
functions. This will probably take another 15-20 years!
inconsistency
Celera says that there are only ~34,000 genes,
so why are there ~60,000 human
genes on Affymetrix GeneChips?
Why does GenBank have 49,000
human gene coding sequences and
UniGene have 96,000 clusters of
unique human ESTs?
Clearly we are in desperate need of a
theoretical framework to go with all of this
data
http://www.celera.com/
implications for biomedicine
Physicians will use genetic information to diagnose and
treat disease.
Virtually all medical conditions have a genetic
component.
Faster drug development research
Individualized drugs
Gene therapy
All Biologists will use gene sequence information in their
daily work
the
equipment
meaning of the code …
e v o l u t i o n
how do genomes evolve?
Point mutations
Rearrangements
Recombination
Selection and Drift
how can you view the evolution?
Individual gene alignment view (usually proteins)
Dot plot or VISTA (local similarity) view
Synteny view
Composite (average) views
DNA dot plot view
Show one DNA along X-axis, second on Y-axis
For every position along both, score local similarity
Display 2-D plot of similarity in gray-scale
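The three steps can be sketched with a sliding window: each cell holds the fraction of matching bases between a window of the X sequence and a window of the Y sequence. The window size and the text "gray-scale" characters are illustrative assumptions.

```python
def dot_plot(x, y, window=3):
    """M[j][i] = fraction of matching bases in the windows starting at x[i] and y[j]."""
    cols = len(x) - window + 1
    rows = len(y) - window + 1
    return [
        [sum(x[i + k] == y[j + k] for k in range(window)) / window for i in range(cols)]
        for j in range(rows)
    ]


def render(matrix, shades=" .:#"):
    """Map similarity values in [0, 1] to characters of increasing darkness."""
    top = len(shades) - 1
    return "\n".join(
        "".join(shades[min(int(v * len(shades)), top)] for v in row) for row in matrix
    )


m = dot_plot("ACGTACGTAC", "ACGTACGTAC")
print(render(m))
```

Plotting a sequence against itself gives a solid main diagonal (the self-match), and repeats show up as parallel off-diagonal lines.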
dot plot example
self-match
tandem duplication
random dot plot
gene structure revealed by dot plot
promoter
conservation
synteny
view
Synteny definition: a contiguous region in another
genome that has more-or-less the same genes in the
same order.
The boundaries of what constitutes synteny are a bit
fuzzy… for example you probably wouldn’t say a region
isn’t syntenic if it is missing one gene out of many.
single inversion
insertion or deletion
double inversion
intra-chromosomal rearrangements
inter-chromosomal rearrangements
syntenic
scaling
These regions are perfectly syntenic, but on average the mouse has
shorter regions separating alignable conserved blocks.
limitations to synteny view
Provides only overview of arrangement, with no
information about the degree or areas of conservation.
As genomes become more distant synteny becomes
more chaotic, until (in the extreme) most blocks are one
gene long (e.g. flies vs. human).
In some cases, very deep synteny can be seen, most
dramatically in the Hox clusters.
Hox
cluster
composite or summary views
View comparative summaries
that encapsulate general
properties of the genome.
For example, G-C content
comparison:
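A minimal sketch of one such summary, G-C content computed over windows along a sequence. The window size and the toy sequence are illustrative assumptions.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_windows(seq, window, step=None):
    """G-C content of consecutive windows (non-overlapping unless step is given)."""
    step = step or window
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq) - window + 1, step)]


print(gc_windows("GGCCAATTGCGC", 4))
```

Running the same windows over two genomes gives profiles that can be compared side by side.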
phylogenetics and evolution