Transcript PPT

bioinformatics
c o u r s e













l a y o u t
introduction
molecular biology
biotechnology
bioMEMS
bioinformatics
bio-modeling
cells and e-cells
transcription and regulation
cell communication
neural networks
dna computing
fractals and patterns
the birds and the bees ….. and ants
book
Introduction to Computational Molecular Biology
introduction
DNA
DNA
central dogma
definitions
 Informatics
 the science of information management
 Bioinformatics
 the science of biological information management
what is bioinformatics?
Bioinformatics is Multidisciplinary
interdisciplinary
Genomics
Drug Design
Molecular Biology
Phylogenetics
Computer Science
Math
Statistics
Structural Biology
increasing levels of complexity
Metabolome (metabolic pathways)
Proteome (proteins)
Transcriptome (RNA)
Genome (DNA)
growth of biological databases
GenBank base-pair growth, 1982-1999 (millions of base pairs)
[Figure: bar chart of GenBank size by year, rising from about 1 million base pairs in 1982 to 3,841 million in 1999]
Source: GenBank
growth of biological databases
3D structures growth
http://www.rcsb.org/pdb/holdings.html
DNA/RNA
symbol   meaning        explanation
G        G              Guanine
A        A              Adenine
T        T              Thymine
C        C              Cytosine
R        A or G         puRine
Y        C or T         pYrimidine
N        A, C, G or T   aNy base
U        U              Uracil
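The symbol table above can be used directly when searching sequences. A minimal sketch, assuming only the eight symbols listed on the slide (the function names `iupac_to_regex` and `matches` are illustrative, not from the course):

```python
import re

# Symbols from the table above, mapped to the bases they stand for.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C", "U": "U",
    "R": "AG",    # puRine
    "Y": "CT",    # pYrimidine
    "N": "ACGT",  # aNy base
}

def iupac_to_regex(pattern):
    """Translate a pattern written with the symbols above into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        # Single bases stay literal; ambiguity codes become a character class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def matches(pattern, sequence):
    """True if the whole sequence is compatible with the ambiguous pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None

print(iupac_to_regex("RAATTY"))
print(matches("RAATTY", "GAATTC"))
```

Here `RAATTY` stands for any purine, then AATT, then any pyrimidine, so GAATTC is accepted while CAATTC is not.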
definitions of bioinformatics
some definitions
 use of computers to catalog and organize molecular life science
information into meaningful entities.
 subset of computational biology
 Methods to analyse, store, search, retrieve and represent
biological data by computers /in computers
 massive amounts of data: databases
 extracting information and knowledge from "raw" data
 for most bioscientists, all they need in bioinformatics is sequence
analysis
what does it do?
 bioinformatics is not just the storage of data in a
computer.
 bioinformatics is the use of computers to test a
biological hypothesis prior to performing the
experiment in the laboratory.
 bioinformatics is the design of software programs
that analyse data.
bioinformatics databases
 nucleotide and protein sequences
 protein structures
 all sorts of functional data related to genes, proteins and
their regulation, interactions etc.
 curated and non-curated databases
some goals
 sequence searching and sequence alignments
 looking at properties that can be analyzed/predicted
from sequence data
 protein structures and their analysis
 structural classification
 visualisation of macromolecules
 ”system-wide” understanding of the biology of a given
organism
genomes and their annotation
 complete genomes of many organisms are available
 seeing ”parts lists” of everything an organism needs and
figuring out how they work together
annotation: looking at the DNA sequence
genomes and their annotation
 gene finding is not always straightforward
 problem: rare gene products, for which you cannot find
corresponding mRNA or protein sequences in databanks
 additional complication: alternative splicing, many
transcripts per gene
genomes and their annotation
 if you intend to analyze or just use data from a databank
it is useful to know both the goals and the reality of their
annotation level
 inconsistencies, missing data
 even well-annotated databanks provide only a fraction
of all biologically relevant information relevant to a gene
or a molecule (compared to literature)
annotation: a vision
 databank content: all knowledge on functions of a gene
product
 add structural information 
 insights in structure-function relationships
 add data on expression patterns and regulation 
 understanding cell differentiation and other big questions
in biology on molecular level
current -omics
metabolomics
“…to identify, measure and interpret the complex time-related concentration, activity and flux of metabolites in
cells, tissues, and other bio-samples such as blood, urine,
and saliva.”
systems biology
 Integrated view of biology at
multiple levels
 Generation of quantitative,
predictive models of the
behavior of biological systems,
such as organisms
bioinformatics in short
very short
common genes?
what is bioinformatics?
 Application of information technology to the storage,
management and analysis of biological information
 Facilitated by the use of computers
what is bioinformatics?
 Sequence analysis
 Geneticists/molecular biologists analyse genome sequence information to understand disease processes
 Molecular modeling
 Crystallographers/biochemists design drugs using computer-aided tools
 Phylogeny/evolution
 Geneticists obtain information about the evolution of organisms by looking for similarities in gene sequences
 Ecology and population studies
 Bioinformatics is used to handle large amounts of data obtained in population studies
 Medical informatics
 Personalised medicine
sequence analysis: overview
[Flowchart]
Sequence entry: sequencing project management, sequence database browsing, manual sequence entry
Nucleotide sequence analysis (nucleotide sequence file): search for protein coding regions (coding / non-coding), search databases for similar sequences, design further experiments (restriction mapping, PCR planning), translate into protein, sequence comparison, search for known motifs, RNA structure prediction
Protein sequence analysis (protein sequence file): search databases for similar sequences, sequence comparison, search for known motifs, predict secondary structure, predict tertiary structure
Multiple sequence analysis: create a multiple sequence alignment, edit the alignment, format the alignment for publication, molecular phylogeny, protein family analysis
gene sequencing
Automated chemical sequencing methods allow rapid generation
of large data banks of gene sequences
database similarity searching
Sequences producing significant alignments:                        Score (bits)  E Value
gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]         112      7e-26
gi|603258 (U18795) Prb1p: vacuolar protease B [Saccharomyces ce...      106      5e-24
gnl|PID|e264388 (X59720) YCR045c, len:491 [Saccharomyces cerevi...       69      7e-13
gnl|PID|e239708 (Z71514) ORF YNL238w [Saccharomyces cerevisiae]          30      0.66
gnl|PID|e239572 (Z71603) ORF YNL327w [Saccharomyces cerevisiae]          29      1.1
gnl|PID|e239737 (Z71554) ORF YNL278w [Saccharomyces cerevisiae]          29      1.5

gnl|PID|e252316 (Z74911) ORF YOR003w [Saccharomyces cerevisiae]
Length = 478
Score = 112 bits (278), Expect = 7e-26
Identities = 85/259 (32%), Positives = 117/259 (44%), Gaps = 32/259 (12%)

Query:   2 QSVPWGISRVQAPAAHNRG---------LTGSGVKVAVLDTGIST-HPDLNIRGG-ASFV 50
           +  PWG+ RV        G           G GV   VLDTGI T H D   R    + +
Sbjct: 174 EEAPWGLHRVSHREKPKYGQDLEYLYEDAAGKGVTSYVLDTGIDTEHEDFEGRAEWGAVI 233

Query:  51 PGEPSTQDGNGHGTHVAGTIAALNNSIGVLGVAPSAELYXXXXXXXXXXXXXXXXXQGLE 110
           P      D NGHGTH AG I + +      GVA + ++                  +G+E
Sbjct: 234 PANDEASDLNGHGTHCAGIIGSKH-----FGVAKNTKIVAVKVLRSNGEGTVSDVIKGIE 288

The BLAST program allows rapid comparison of a new gene sequence
with the hundreds of thousands of gene sequences in databases
sequence comparison
 Gene sequences can be aligned to see similarities
between genes from different sources
768 TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG 813
    ||     ||   ||  |  | ||| |  |||| |||||    |||  |||
 87 TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG 135

814 AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG 863
    |  | |              |  |||||| |   ||||  | || |   |
136 AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG 172

864 AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT 913
    ||| | ||| || || |||   |   |||||||||  ||   |||||| |
173 AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT 216
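The row of bars between two aligned sequences marks the columns where both sequences carry the same base. A minimal sketch of how such a line can be produced, assuming the two rows are already aligned to equal length and that "." denotes a gap as in the listing above (the function name `match_line` is illustrative):

```python
def match_line(top, bottom, gap="."):
    """Return a string with '|' under every column where both rows agree.

    Gap columns never count as matches, even if both rows contain a gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )

top = "TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG"
bottom = "TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG"
line = match_line(top, bottom)
print(top)
print(line)
print(bottom)
print("identical columns:", line.count("|"))
```

Counting the bars gives the number of identical positions, which is the numerator of a percent-identity figure.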
restriction mapping
[Figure: restriction map of the sequence, position scale 50 to 250]
enzyme      no. of sites  recognition sequence
AceIII      1             CAGCTCnnnnnnn’nnn...
AluI        2             AG’CT
AlwI        1             GGATCnnnn’n_
ApoI        2             r’AATT_y
BanII       1             G_rGCy’C
BfaI        2             C’TA_G
BfiI        1             ACTGGG
BsaXI       1             ACnnnnnCTCC
BsgI        1             GTGCAGnnnnnnnnnnn...
BsiHKAI     1             G_wGCw’C
Bsp1286I    1             G_dGCh’C
BsrI        2             ACTG_Gn’
BsrFI       1             r’CCGG_y
CjeI        2             CCAnnnnnnGTnnnnnn...
CviJI       4             rG’Cy
CviRI       1             TG’CA
DdeI        2             C’TnA_G
DpnI        2             GA’TC
EcoRI       1             G’AATT_C
HinfI       2             G’AnT_C
MaeIII      1             ’GTnAC_
MnlI        1             CCTCnnnnnn_n’
MseI        2             T’TA_A
MspI        1             C’CG_G
NdeI        1             CA’TA_TG
Sau3AI      2             ’GATC_
SstI        1             G_AGCT’C
TfiI        2             G’AwT_C
Tsp45I      1             ’GTsAC_
Tsp509I     3             ’AATT_
TspRI       1             CAGTGnn’
 Gene sequences can be analysed to find the sites
that restriction enzymes can cleave
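Finding restriction sites amounts to searching a sequence for each enzyme's recognition sequence. A minimal sketch, assuming unambiguous recognition sequences only; the three sites are taken from the table on this slide with the cut markers removed, and the demo sequence is invented for illustration:

```python
def find_sites(sequence, site):
    """Return the 1-based start position of every occurrence of site (overlaps included)."""
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + 1)
        # Restart one base further on so overlapping sites are not missed.
        start = sequence.find(site, start + 1)
    return positions

# Recognition sequences as listed on the slide (cut-position marks dropped).
SITES = {"EcoRI": "GAATTC", "AluI": "AGCT", "DpnI": "GATC"}

demo = "TTGAATTCAGCTGATCAAGCTT"  # hypothetical test sequence
for enzyme, site in SITES.items():
    print(enzyme, find_sites(demo, site))
```

Enzymes whose sites contain ambiguity codes (r, y, w, n and so on) would need pattern matching rather than a plain substring search.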
PCR primer design
OPTIMAL primer length                  --> 20
MINIMUM primer length                  --> 18
MAXIMUM primer length                  --> 22
OPTIMAL primer melting temperature     --> 60.000
MINIMUM acceptable melting temp        --> 57.000
MAXIMUM acceptable melting temp        --> 63.000
MINIMUM acceptable primer GC%          --> 20.000
MAXIMUM acceptable primer GC%          --> 80.000
Salt concentration (mM)                --> 50.000
DNA concentration (nM)                 --> 50.000
MAX no. unknown bases (Ns) allowed     --> 0
MAX acceptable self-complementarity    --> 12
MAXIMUM 3' end self-complementarity    --> 8
GC clamp how many 3' bases             --> 0
Oligonucleotides for use in the polymerase chain
reaction can be designed using computer-based programs
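The parameter list above is essentially a set of filters applied to candidate primers. A rough sketch of three of those filters (length, GC% and melting temperature), using the default values shown on the slide. The melting temperature here is the simple Wallace rule (2 °C per A/T, 4 °C per G/C), which is an assumption made for illustration: the program on the slide takes salt and DNA concentration into account and so presumably uses a more detailed model. The self-complementarity and GC-clamp checks are not covered.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def wallace_tm(primer):
    """Wallace-rule melting temperature estimate in degrees Celsius."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def passes_filters(primer, min_len=18, max_len=22, min_gc=20.0, max_gc=80.0,
                   min_tm=57.0, max_tm=63.0, max_n=0):
    """Apply the length, unknown-base, GC% and Tm limits from the slide."""
    primer = primer.upper()
    return (min_len <= len(primer) <= max_len
            and primer.count("N") <= max_n
            and min_gc <= gc_percent(primer) <= max_gc
            and min_tm <= wallace_tm(primer) <= max_tm)

candidate = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer
print(len(candidate), gc_percent(candidate), wallace_tm(candidate))
print(passes_filters(candidate))
```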
multiple sequence alignment
Sequences of proteins from different organisms can be aligned to see
similarities and differences
Alignment formatted using MacBoxshade
phylogeny inference
E.coli
C.botulinum
C.cadavers
C.butyricum
B.subtilis
Analysis of sequences allows evolutionary
relationships to be determined
Phylogenetic tree constructed using the Phylip package
B.cereus
large scale bioinformatics: genome projects
 Mapping
Identifying the location of clones and markers on the chromosome by
genetic linkage analysis and physical mapping
 Sequencing
Assembling clone sequence reads into large (eventually complete)
genome sequences
 Gene discovery
Identifying coding regions in genomic DNA by database searching and
other
 Function assignment
Using database searches, pattern searches, protein family analysis and
structure prediction to assign a function to each predicted gene
Data mining
Searching for relationships and correlations in the information
 Genome comparison
Comparing different complete genomes to infer evolutionary history
and genome rearrangements
genomics
introduction to DNA microarrays
 massive data sets from simultaneous expression levels of
thousands of genes
 impossible to grasp directly by the human mind
 methods are needed for finding meaningful results and
patterns from the bulk of data
DNA microarray bioinformatics
 data manipulation: normalization etc.
 data clustering
 genes which behave in a similar fashion
 sample classification by profiles of predictive genes (e.g. cancer typing)
 data mining:
 finding interpretation to clustering results
 example: recognition of regulatory factor binding sites in coexpressed genes
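As a small illustration of the first two steps, the sketch below normalizes expression values as median-centred log2 ratios and then measures how similarly two genes behave with a Pearson correlation, one common basis for clustering. The choice of median centring and Pearson correlation is an assumption made for this example; the slides do not prescribe a particular method, and the numbers are invented.

```python
import math
from statistics import median

def log_ratios(sample, control):
    """log2(sample / control) for each gene; intensities must be positive."""
    return [math.log2(s / c) for s, c in zip(sample, control)]

def median_center(values):
    """Shift the values so that their median becomes zero."""
    m = median(values)
    return [v - m for v in values]

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical intensities for three genes on one array.
ratios = median_center(log_ratios([2.0, 4.0, 8.0], [1.0, 1.0, 1.0]))
print(ratios)

# Hypothetical profiles of two genes across three conditions.
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Genes whose profiles correlate strongly across conditions would end up in the same cluster.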
basis of molecular biology
 hierarchy of relationships:
genome
gene 1
gene 2
gene 3
gene X
protein 1
protein 2
protein 3
protein X
function 1
function 2
function 3
function X
genome size
organism         genome size (bp)    genes
FERN             160,000,000,000
LUNGFISH         139,000,000,000
SALAMANDER       81,300,000,000
NEWT             20,600,000,000
ONION            18,000,000,000
GORILLA          3,523,200,000
MOUSE            3,454,200,000
HUMAN            3,400,000,000       31,000
Drosophila       137,000,000         13,500
C. elegans       96,000,000          19,000
Yeast            12,000,000          6,315
E. coli          5,000,000           5,361
smallest genome  ??????
genomics
 comparative genomics
 whole-genome analyses
 evolution studies
 analyses of components in a ”complete” system
 functional genomics = inferring functions from data
 expression patterns, gene regulation
 sequence comparisons, homologue relationships
 studies of gene variation, altered phenotypes
genomics
 gene finding is not always straightforward
 problem: rare gene products, for which you cannot find
corresponding mRNA or protein sequences in databanks
 additional complication: alternative splicing, many
transcripts per gene
 even well-annotated databanks provide only a fraction
of all biologically relevant information relevant to a gene
or a molecule (compared to literature)
DNA array technology
Array Type              Spot Density (per cm2)  Probe    Target  Labeling
Nylon Macroarrays       < 100                   cDNA     RNA     Radioactive
Nylon Microarrays       < 5000                  cDNA     mRNA    Radioactive/Fluorescent
Glass Microarrays       < 10,000                cDNA     mRNA    Fluorescent
Oligonucleotide Chips   < 250,000               oligos   mRNA    Fluorescent
spotting robot
microarray expression analysis
microarray photolithography
array terminology
[Figure: 70 mer vs 40 mer attachment; labels: 70 mer, 40 mer, Target, NH2]
microarray
microarray data
gene expression analysis
[Figure: control mouse vs. a stressed mouse; steps: RNA extraction, target labeling, image analysis, determination of expression levels]
DNA micro-array
genomics
 The application of high-throughput automated
technologies to molecular biology.
 The experimental study of complete genomes.
genomics
technologies
 Automated DNA sequencing
 Automated annotation of sequences
 DNA microarrays
 gene expression (measure RNA levels)
 single nucleotide polymorphisms (SNPs)
 Protein chips (SELDI, etc.)
 Protein-protein interactions
cDNA spotted microarrays
Affymetrix gene chips
microarray data analysis
Clustering and pattern detection
Data mining and visualization
Controls and normalization of results
Statistical validation
Linkage between gene expression data and gene
sequence/function/metabolic pathways databases
 Discovery of common sequences in co-regulated genes
 Meta-studies using data from multiple experiments





microarray data analysis
impact on bioinformatics
 Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets.
 It is impossible to separate genomics laboratory
technologies from the computational tools required for
data analysis.
proteomics
what is proteomics?
“The analysis of the entire protein complement expressed
by a genome, or by a cell or tissue type.”
Two most related technologies
 2-D electrophoresis: separation of complex protein
mixtures
 Mass spectrometry: Identification and structure analysis
Wasinger VC et al Progress with gene-product mapping of the mollicutes: Mycoplasma genitalium. Electrophoresis
16 (1995) 1090-1094
transcription
g e n o m i c
D N A
 Structure
 Regulation
 Information
Computers cannot determine which of these 3 roles DNA
plays solely based on sequence… (although we would all
like to believe they can)
introduction to proteomics
 Definitions
 Classical - restricted to large-scale analysis of gene products involving only proteins
 Inclusive - combination of protein studies with analyses that have genetic components, such as mRNA, genomics, and yeast two-hybrid
 Don’t forget that the proteome is dynamic, changing to
reflect the environment that the cell is in
1 gene = 1 protein?
 1 gene is no longer equal to one protein
 1 gene = how many proteins?
why proteomics?
 Annotation of genomes, i.e. functional annotation
 Genome + proteome = annotation
 Protein Function
 Protein Post-Translational Modification
 Protein Localization and Compartmentalization
 Protein-Protein Interactions
 Protein Expression Studies
 Differential gene expression is not the answer
types of proteomics
Protein Expression
 Quantitative study of protein expression between samples that differ by some variable
Structural Proteomics
 Goal is to map out the 3-D structure of proteins and
protein complexes
Functional Proteomics
 To study protein-protein interactions, 3-D structures, cellular localization and PTMs in order to understand the physiological function of the whole proteome.
introduction to proteomics
 composition of the proteome depends on cell type,
developmental phase and conditions
 proteome analyses are still struggling to solve the ”basic
proteome” of different cells and tissues or limited
changes under changing conditions or during processes
 current methods can only ”see” the most abundant
proteins
proteomics
 expression proteomics = differential proteomics = 2D-PE + MS
 interaction proteomics
 functional proteomics = systematic perturbation or functional inactivation of proteins in a given environment
 structural proteomics
proteomics experiments
 typically a combination of 2D protein electrophoresis and mass spectrometry
 labour-intensive, not really ”high-throughput” methods
 more efficient ”protein array” methods are emerging
bioinformatics in proteomics
structural proteomics
 High-throughput determination of the 3D structure of
proteins
 Goal: to be able to determine or predict the structure of
every protein.
 Direct determination - X-ray crystallography and nuclear
magnetic resonance (NMR).
 Prediction
 Comparative modeling
 Threading/Fold recognition
 Ab initio
why structural proteomics?
 To study proteins in their active conformation.
 Study protein:drug interactions
 Protein engineering
 Proteins that show little or no similarity at the primary
sequence level can have strikingly similar structures.
an example
 FtsZ - protein required for cell division in prokaryotes,
mitochondria, and chloroplasts.
 Tubulin - structural component of microtubules important for intracellular trafficking and cell division.
 FtsZ and Tubulin have limited sequence similarity and
would not be identified as homologous proteins by
sequence analysis.
homologues
FtsZ and tubulin have little
similarity at the amino acid
sequence level
Burns, R., Nature 391:121-123
Picture from E. Nogales
are FtsZ and tubulin homologues?
 Yes!
Proteins that have conserved secondary structure can
be derived from a common ancestor even if the primary
sequence has diverged to the point that no similarity is
detected.
structure is function
protein structure
 Imaging Experimental X-ray diffraction data
 Predicting structure in silico from sequence
X-ray crystallography
 Make crystals of your protein
 0.3-1.0mm in size
 Proteins must be in an ordered, repeating pattern.
 X-ray beam is aimed at crystal and data is collected.
 Structure is determined from the diffraction data.
X-ray crystallography
http://www-structure.llnl.gov/Xray/101index.html
crystals
Schmid, M. Trends in Microbiology, 10:s27-s31.
X-ray crystallography
 Protein must crystallize.
 Need large amounts (good expression)
 Soluble (many proteins aren’t, membrane proteins).
 Need to have access to an X-ray beam.
 Solving the structure is computationally intensive.
 Time - can take several months to years to solve a
structure
 Efforts to shorten this time are underway to make this
technique high-throughput.
general process for proteomics research
Image Analysis
Gel hotel
Spot picker
2-D Gel
Digester
Spotter
MS
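general process for proteomics research
Source: web page of Professor 莊榮輝 (Juang Rong-Huay), Department of Microbiology and Biochemistry, National Taiwan University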
protein microarray
arrayIT TM
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
what can protein microarrays do?
1. Protein / protein interaction
2. Enzyme / substrate interaction (transient)
3. Protein / small molecule interaction
4. Protein / lipid interaction
5. Protein / glycan interaction
6. Protein / Ab interaction
1. G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
2. H.Zhu et al, 2001 Science 293:2101
3. Ziauddin J and Sabatini DM, 2001 Nature 411:107
protein microarrays (Antibody arrays)
the real world
The true spot quality compared to a real experiment
mobility of protein in an electric field
 Mobility: electrolytic molecules move in an electric field
Mobility ~ ([Electric field (mV)] x [Net charge of molecule]) / [Friction between molecules and matrix]
2-dim electrophoresis
2-D gel electrophoresis
First dimension
 denaturing iso-electric focusing
 separation according to the pI
Second dimension
 SDS-PAGE (Sodium Dodecyl Sulfate Poly-Acrylamide Gel
Electrophoresis)
 Separation according to the
molecular weight
pI is the iso-electric point of the protein
MS analysis
Digest to peptide fragment
result
example
mass spectrometry
 Ion source: substance to ion gas
 Mass analysis: according to mass/charge (m/z)
 Detection: femtomole -attomole
Ion source
Ion separator
detector
principle of mass spec
[Figure: a peptide mixture embedded in light-absorbing chemicals (matrix) is hit by a pulsed UV or IR laser (3-4 ns) in vacuum; the resulting cloud of protonated peptide molecules is accelerated by a strong electric field (V acc) into the Time Of Flight tube towards the detector]
principle of mass spec
[Figure: Linear Time Of Flight tube: ion source, time of flight, detector. Reflector Time Of Flight tube: ion source, reflector, time of flight, detector]
typical result
Nuclear Magnetic Resonance Spectroscopy (NMR)
 Can perform in solution.
 No need for crystallization
 Can only analyze proteins that are <300aa.
 Many proteins are much larger.
 Can’t analyze multi-subunit complexes
 Proteins must be stable.
structure modeling
 Comparative modeling
 Modeling the structure of a protein that has a high degree
of sequence identity with a protein of known structure
 Must be >30% identity to have reliable structure
 Threading/fold recognition
 Uses known fold structures to predict folds in primary
sequence.
 Ab initio
 Predicting structure from primary sequence data
 Usually not as robust, computationally intensive
sequence alignment
sequence alignment
©Ken Howard, Scientific American, July 2000
s e q u e n c e
GATCAAACATTAAACATCCTGAGATCCAAAGGTAAGAGATCTAGCCACAGGGAGTGCTGGGGATTCGGGTCCTGGTGATCTTCACATGCTGACATAGCTCAG
CCCTTTTTGGCCCTGGCTTTGTCCTGTTGTGGGCTTTCCCATCTGCAACCCATGCTCCTGGGCCATTTTCCTATGGGCCAGGGAAAACAAGATGGGGTGAAGGC
ACCCTTACATTTAGGGGCAAGACCTAGTACTCAGAAGGATTCAGAAACTGAAATAGCTGGGTGATACCACACAGGTGCTAGGGATAAGGGGCCTTGAGCCAT
GGACCATGGGAACTACAAAGCTGAAGGAGCTGCTGCCTCAGCAGAACCAGCGCTTGAATTTGTTCTTTCAGAACCTCAGTCTCTTCCTCTGAAAAATGGGTGTG
TTGTGTATCCCACATTCCCAAGTCAGCCATGGGACCAAATGTGAGCGTGTGGGTTTTGCCTCCTGAGAAACTCAGGGGAGCAGAATGCTACAGTGGGTGAATT
GGATTCTTTCAGAGAGCCCACCCTGTTTCCCACATCAGCCAGAAGGCTCAAAACCCTGAAGAGCTTTCTGAACTTTGAGGTGCCCAAAGCTTCAGGGCTGTAT
GGGAAGCACCTGAGGTCCAAGTCCGTTTACAAGAATTTTGTTTTTTGGTTTACAGCTGCTTGGCCGGTCCAAGGAGCAGGTTTGGGTCCTGTGCTCCACAGACCT
AAGGGTTACCTTAGAGCTTATGGGAGAGCATTGTGTGTGGACAGTGGACAGTGCCCTCTAGTGCTCAGTGTTAGCACTACATCCAGTTGCCCTCCACCAGTTTAT
GCTGCTGAGGAAGTCTTTCTTTTCCCAACAGCAGTGTCTCTCCCTCTCCCACCCCCTCTCCCTCTCCCTCCCCCCCTAGGTTATTTTTATTTTTACTGGTGTGTATGTGT
GTGAGTCTATGTCACATGTATGAGAGTGCTTGTGGAGACCAGAAGAGGGCATCAGAAGAGCCCCTAGAACTGGAGTATAGGTGGTTGTGAGCCACTTGTCATG
GGTGTTGGGAACCAAACTCAGGTTCTCTGGAAGAACAACAAGCTCCCTTATCATATAAGCCATCTCTAAATCCAGGACATTTTTTTTTTTTTTTTTGAGATTTAGAGATTC
AAGGAGGAGGAACAATAGGAGGAAGAAGGGGACAGAATAAGGCCAACAAAATGACCAAGGAGGTATAGGCACTTGAAGCCAAACCTAAGTACCTGAGT
TCAATCCCTGGGACCCACATGATGGAAAGATGGAATCGATCCCCAAAAGTTATCTTCTGATCCCTATATGCACACACTTGAGGATGGACAGACAAAGAGACA
GACACACAAACACACACAAATGTAACTGAAAAAGAAACCTCTATGGGGACATCGCCTTCTTGGAGAGGCTCTGTTGCCCCTCATCCTAGTGAACAAACAACT
CCTACTCCCTGCCAGAGTATCCTACCCTTGGATTCAAAATGGTCTCAGAGGACACACCGGGTGGGCTCTGTCGCTGGGATCTTGCATAACCAATGCCCATAA
GCCTGGCAAAGGTGGCGATGAGACGATAAGGTCAGGGACATGACCGCAGAAGAGGAGTGGGGACGCGATGAGTGGGAGGAGCTTCTAAATTATCCATC
AGCACAAGCTGTCAGTGGCCCCAGCCATGAATAAATGTATAGGGGGAAAGGCAGGAGCCTTGGGGTCGAGGAAAACAGGTAGGGTATAAAAAGGGCAC
GCAAGGGACCAAGTCCAGCATCCTAGAGTCCAGATTCCAAACTGCTCAGAGTCCTGTGGACAGATCACTGCTTGGCAATGGCTACAGGTAAGCATGCGCA
AATCCCGCTGGGTGTGGTTTGGGACCCAGGGCCCCTGAAGATGGATCTGAGGCTTCTAATGTGAGTGCGTTCCAACTTCTGCCATGTTGGGAATACTCTGGGTC
CCTATGGGGATTGGGAGAGATCGGCCATTGCTCCCAGGTTTCTCCTGCCCTCCTGTCTCTCTCTAGACTCTCGGACCTCCTGGCTCCTGACCGTCAGCCTGCTC
TGCCTGCTCTGGCCTCAGGAGGCTAGTGCTTTTCCCGCCATGCCCTTGTCCAGTCTGTTTTCTAATGCTGTGCTCCGAGCCCAGCACCTGCACCAGCTGGCTGC
TGACACCTACAAAGAGTTCGTAAGTTCCCCAGAGATGGGTGCCCGTTTGTGGAAGCAGGAAGGGGCAGGTCCTACCCCATACTCCTGGCCCCAGGGAAG
GTCAATGGAGGGGAAATTATGGGGTAGGGGAATCTTAGCCAATGCTGTACCATAGTAATGATGGTGACGAGACACAAGCTGGTCCCTCAGTGACCACCCTTC
TTCCAGGAGCGTGCCTACATTCCCGAGGGACAGCGCTATTCCATTCAGAATGCCCAGGCTGCTTTCTGCTTCTCAGAGACCATCCCGGCCCCCACAGGCAA
GGAGGAGGCCCAGCAGAGAACCGTGAGTAGTCCCAGGCCTTGTCTGCACAAATCCTCGTTTCCCTCCATGCAGCCCTAACTGCACTCCAGGCCAGGGAC
CAGCTCCTCCCTGAAGCTGGGGTAACCTGGGAGTCCCAGGCAGAGGTCACTAGGCAATACACTAACCCCAGCCCTTTTTTTCCCCCCTCAGGACATGGAATT
GCTTCGCTTCTCGCTGCTGCTCATCCAGTCATGGCTGGGGCCCGTGCAGTTCCTCAGCAGGATTTTCACCAACAGCCTGATGTTCGGCACCTCGGACCGTGTC
TATGAGAAACTGAAGGACCTGGAAGAGGGCATCCAGGCTCTGATGCAGGTGAGGATGGACTAGCCTGGGGTTATGCCTGGAGCCTAGGTGGGGCTCACTG
TCCTCTGTTTTACCGGTCAGCCCTTAGACCCTTGAGAAGGCTTCTTCTTCTTCATTTTCCTTTATGAAGCCTCCAGGCTTTTCCTTCGGTCCTGGGGTGGAGGGAGGC
ACAGCTCCCGAGTCTCCTGCCCTTCTTTCCCACGACAGGAGCTGGAAGATGGCAGCCCCCGTGTTGGGCAGATCCTCAAGCAAACCTATGACAAGTTTGAC
GCCAACATGCGCAGCGACGACGCGCTGCTCAAAAACTATGGGCTGCTCTCCTGCTTCAAGAAGGACCTGCACAAAGCGGAGACCTACCTGCGGGTCATG
AAGTGTCGCCGCTTTGTGGAAAGCAGCTGTGCCTTCTAGCCACTCACCAGTGTCTCTGCTGCACTCTCCTGTGCCTCCCTGCCCCCTGGCAACTGCCACCCC
GCGCTTTGTCCTAATAAAATTAAGATGCATCATATCACCCGGCTAGAGGTCTTTCTGTTATGGGATGGAGCAGTTGTGTCAATCTTGTTCCTGGAAGCCTGCGAGAA
sequence alignment: why?
 Early in the days of protein and gene sequence analysis,
it was discovered that the sequences from related
proteins or genes were similar, in the sense that one
could align the sequences so that many corresponding
residues match.
 This discovery was very important: strong similarity
between two genes is a strong argument for their
homology. Bioinformatics is based on it.
sequence alignment: why?
 Terminology:
 Homology means that two (or more) sequences have a
common ancestor. This is a statement about evolutionary
history.
 Similarity simply means that two sequences are
similar, by some criterion. It does not refer to any
historical process, just to a comparison of the
sequences by some method. It is a logically weaker
statement.
 However, in bioinformatics these two terms are often
confused and used interchangeably. The reason is
probably that significant similarity is such a strong
argument for homology.
two protein alignment
many genes have a common ancestor
 The basis for comparison of proteins and genes using the
similarity of their sequences is that the proteins or genes
are related by evolution; they have a common ancestor.
 Random mutations in the sequences accumulate over
time, so that proteins or genes that have a common
ancestor far back in time are not as similar as proteins or
genes that diverged from each other more recently.
 Analysis of evolutionary relationships between protein or
gene sequences depends critically on sequence
alignments.
definition of sequence alignment
 Sequence alignment is the procedure of comparing
two (pairwise alignment) or more (multiple alignment) sequences by
searching for a series of individual characters or patterns
that are in the same order in the sequences.
 There are two types of alignment: local and global. In
global alignment, an attempt is made to align the entire
sequence. If two sequences have approximately the
same length and are quite similar, they are suitable for
the global alignment.
 Local alignment concentrates on finding stretches of
sequences with high level of
definition of sequence alignment
Global alignment
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A
Local alignment
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
interpretation of sequence alignment
 Sequence alignment is useful for discovering structural,
functional and evolutionary information.
 Sequences that are very much alike may have similar
secondary and 3D structure, similar function and likely a
common ancestral sequence. It is extremely unlikely that
such sequences obtained similarity by chance. For DNA
molecules with n nucleotides the probability is very low:
$P = 4^{-n}$. For proteins the probability is even lower:
$P = 20^{-n}$, where n is the number of amino acid residues
 Large scale genome studies revealed existence of
horizontal transfer of genes and other sequences
between species, which may cause similarity between
some sequences in very distant species
methods of sequence alignment
 Dot matrix analysis
 The dynamic programming (DP) algorithm
 Word or k-tuple methods
dot matrix analysis
 A dot matrix analysis is a method for comparing two
sequences to look for possible alignment (Gibbs and
McIntyre 1970)
 One sequence (A) is listed across the top of the matrix
and the other (B) is listed down the left side
 Starting from the first character in B, one moves across
the page keeping in the first row and placing a dot in
every column where the character in A is the same
 The process is continued until all possible comparisons
between A and B are made
 Any region of similarity is revealed by a diagonal row of
dots
 Isolated dots not on diagonal represent random
matches
dot matrix analysis
 Detection of matching regions can be improved by
filtering out random matches and this can be achieved
by using a sliding window
 It means that instead of comparing a single sequence
position, several positions are compared at the same time
and a dot is printed only if a certain minimal number of
matches occur
 Dot matrix analysis can also be used to find direct and
inverted repeats within the sequences
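The dot matrix procedure with a sliding window can be sketched in a few lines. This is a minimal illustration under the conventions described above (sequence A across the top, sequence B down the side); the function name `dot_matrix` and its parameters `window` and `min_matches` are chosen here for the example.

```python
def dot_matrix(a, b, window=1, min_matches=1):
    """Return the dot plot as a list of strings, one per row.

    Rows follow sequence b, columns follow sequence a. A '*' is placed where
    the windows starting at that row and column share at least min_matches
    identical positions.
    """
    rows = []
    for i in range(len(b) - window + 1):
        row = []
        for j in range(len(a) - window + 1):
            hits = sum(1 for k in range(window) if a[j + k] == b[i + k])
            row.append("*" if hits >= min_matches else " ")
        rows.append("".join(row))
    return rows

# Single-position comparison: every identical base gives a dot.
for row in dot_matrix("GATTACA", "GATTACA"):
    print(row)

# Window of 3 with all 3 positions required to match: isolated dots disappear.
for row in dot_matrix("GATTACA", "GATTACA", window=3, min_matches=3):
    print(row)
```

With `window=1` the plot of a sequence against itself shows the main diagonal plus scattered random matches; raising the window and threshold keeps the diagonal and removes most of the scatter.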
dot matrix analysis: two identical sequences
Nucleic Acids Dot Plots - http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html
dot matrix analysis: two very different sequences
Nucleic Acids Dot Plots of genes Adh1 and G6pd in the
mouse
dot matrix analysis: two similar sequences
Nucleic Acids Dot Plots of genes Adh1 from the mouse
and rat (25 MY)
dynamic programming
back to the basics
 DNA/RNA sequences: strings composed of an alphabet
of 4 letters
 Protein sequences: alphabet of 20 letters
why do we do it?
 Identify a gene
 Find clues to gene function (ortholog?)
 Find other organisms with this gene (homology)
 Gather info for an evolutionary model
 …
a l i g n m e n t
 alignment is the basis for finding similarity
 Pairwise alignment = dynamic programming
 Multiple alignment: protein families and functional
domains
 Multiple alignment is "impossible" for lots of sequences
 Another heuristic - progressive pairwise alignment
an
example
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
possible alignment
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
alignment
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
three elements
 Perfect matches
 Mismatches
 Insertions & deletions (indel)
significant similarity in an alignment
 Ho: the current alignment is a result of random line-up
(the 2 sequences are unrelated)
 Ha: the sequences diverge from a common ancestor
(related)
 Test statistic: Ymax = length of the longest running perfect
match subsequence
exact matching subsequences
 In DNA alignment, the matching probability is
\[ p_{\text{match}} = p_a^2 + p_g^2 + p_c^2 + p_t^2 \]
 Under Ho, the lengths Y of exact-match subsequences should follow
a geometric distribution
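A short numerical sketch of these quantities. It assumes independent positions with fixed base frequencies; under that assumption the chance that a run of at least y consecutive matches starts at a given position is p_match to the power y, which is the tail of the geometric distribution. The function names are illustrative.

```python
def match_probability(freqs):
    """p_match: probability that two independent random bases are identical."""
    return sum(p * p for p in freqs.values())

def longest_exact_match(a, b):
    """Length of the longest run of identical positions in a fixed line-up of a and b."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

def prob_run_at_least(p_match, y):
    """Tail of the geometric distribution: P(run of y or more matches from a given start)."""
    return p_match ** y

uniform = {"a": 0.25, "g": 0.25, "c": 0.25, "t": 0.25}
p = match_probability(uniform)
y_max = longest_exact_match("GCGCATGG", "GCGAATGG")
print(p, y_max, prob_run_at_least(p, y_max))
```

With equal base frequencies p_match is 0.25, so long exact runs quickly become improbable under Ho.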
well-matching subsequences
 Evolution may cause small differences to even
sequences with a reasonably recent common ancestor.
 We consider Ymax to be the longest subsequence with up to k
mismatches.
 Y follows a hyper-geometric distribution
 P-value: exact/simulated/approximate (independence
among the Y no longer holds)
choosing alignments
There are many possible alignments
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
scoring
 Score
 Used to determine quality of match and basis for the selection of matches. Scores are relative.
 Expectation value
 An estimate of the likelihood that a given hit is due to pure chance, given the size of the database; should be as low as possible. E.V.’s are absolute. A high score and a low E.V. indicate a true hit.
 Sequence identity (%) (or Similarity)
 Number of matched residues divided by total length of probe
scoring rule
Example Score =
(# matches) – (# mismatches) – (# indels) x 2
examples
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score: (+1x13) + (-1x2) + (-2x4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score: (+1x5) + (-1x6) + (-2x12) = -25
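The scoring rule can be applied mechanically to any pair of aligned rows. A minimal sketch, assuming "-" marks an indel and using the example weights from the rule (+1 match, -1 mismatch, -2 indel); the function name `alignment_score` is illustrative.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, indel=-2):
    """Score an existing alignment column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += indel       # insertion or deletion
        elif a == b:
            score += match       # perfect match
        else:
            score += mismatch    # substitution
    return score

print(alignment_score("-GCGC-ATGGATTGAGCGA",
                      "TGCGCCATTGAT-GACC-A"))
```

For the first example alignment this counts 13 matches, 2 mismatches and 4 indels.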
edit
distance
 The edit distance between two sequences is the “cost”
of the “cheapest” set of edit operations needed to
transform one sequence into the other
 Computing the edit distance between two sequences is
almost equivalent to finding the alignment that minimizes
the distance (here, the alignment that maximizes the score)
$d(s_1, s_2) = \max_{\text{alignments of } s_1 \text{ and } s_2} \text{score}(\text{alignment})$
computing edit distance
 How can we compute the edit distance?
 If |s| = n and |t| = m, there are more than $\binom{m+n}{m}$
alignments
 2 sequences each of length 1000: > 10^600 alignments
 The additive form of the score allows us to use
dynamic programming to compute the edit distance
efficiently
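The size of the search space is easy to confirm with exact integer arithmetic. This sketch only evaluates the binomial lower bound quoted above; the function name is hypothetical.

```python
import math


def alignment_lower_bound(n, m):
    """Binomial coefficient C(m + n, m), a lower bound on the number of alignments."""
    return math.comb(n + m, m)


bound = alignment_lower_bound(1000, 1000)
print(len(str(bound)), "decimal digits")
```

Enumerating that many alignments is out of the question, which is what motivates the dynamic programming approach.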
r e c u r s i v e
a r g u m e n t
 Suppose we have two sequences:
s[1..n+1] and t[1..m+1]
The best alignment must be in one of three cases:
1. Last position is (s[n+1],t[m +1] )
2. Last position is (s[n +1],-)
3. Last position is (-, t[m +1] )
Case 1: $d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + \sigma(s[n+1], t[m+1])$
where $\sigma(a, b)$ is the score of aligning a with b, and - denotes a gap
Case 2: $d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + \sigma(s[n+1], -)$
Case 3: $d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + \sigma(-, t[m+1])$
recursive
argument
Define the notation:
$V[i, j] = d(s[1..i], t[1..j])$
 Using the recursive argument, we get the following
recurrence for V:
$V[i+1, j+1] = \max \left\{ \begin{array}{l} V[i, j] + \sigma(s[i+1], t[j+1]) \\ V[i, j+1] + \sigma(s[i+1], -) \\ V[i+1, j] + \sigma(-, t[j+1]) \end{array} \right.$
recursive
argument
 Of course, we also need to handle the base cases in the
recursion:
$V[0, 0] = 0$
$V[i+1, 0] = V[i, 0] + \sigma(s[i+1], -)$
$V[0, j+1] = V[0, j] + \sigma(-, t[j+1])$
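The recurrence and its base cases translate almost line by line into code. A minimal sketch, assuming the example scoring rule used earlier (+1 match, -1 mismatch, -2 indel); the names `sigma` and `fill_matrix` are hypothetical. Python strings are 0-based, so `s[i]` below plays the role of s[i+1] in the formulas.

```python
def sigma(a, b, match=1, mismatch=-1, indel=-2):
    """Score of one alignment column; '-' denotes a gap (assumed scoring rule)."""
    if a == "-" or b == "-":
        return indel
    return match if a == b else mismatch


def fill_matrix(s, t):
    """Return V with V[i][j] = best score for aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):                       # base case: first column
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):                       # base case: first row
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(
                V[i][j] + sigma(s[i], t[j]),       # case 1: pair the two letters
                V[i][j + 1] + sigma(s[i], "-"),    # case 2: letter of s against a gap
                V[i + 1][j] + sigma("-", t[j]),    # case 3: gap against letter of t
            )
    return V


V = fill_matrix("AAAC", "AGC")
for row in V:
    print(row)
```

The bottom-right entry of V is the score of the best global alignment.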
dynamic programming algorithm
We fill the matrix using the recurrence rule
           A    G    C
      0    1    2    3
  0
A 1
A 2
A 3
C 4
dynamic programming algorithm
           A    G    C
      0    1    2    3
  0   0   -2   -4   -6
A 1  -2    1   -1   -3
A 2  -4   -1    0   -2
A 3  -6   -3   -2   -1
C 4  -8   -5   -4   -1
Conclusion: d(AAAC,AGC) = -1
interpretation of pointers
 Insertion of S2(j) into S1
 Deletion of S1(i) from S1
 Match or Substitution
reconstructing the best alignment
 We now trace back the path that corresponds to the best
alignment
AAAC
AG-C
           A    G    C
      0    1    2    3
  0   0   -2   -4   -6
A 1  -2    1   -1   -3
A 2  -4   -1    0   -2
A 3  -6   -3   -2   -1
C 4  -8   -5   -4   -1
reconstructing the best alignment
 Sometimes, more than one alignment has the best score
AAAC
A-GC
           A    G    C
      0    1    2    3
  0   0   -2   -4   -6
A 1  -2    1   -1   -3
A 2  -4   -1    0   -2
A 3  -6   -3   -2   -1
C 4  -8   -5   -4   -1
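The traceback can be sketched as a walk from the bottom-right cell back to the origin, following every predecessor that explains the current value. This is an illustration under the same assumed scoring rule (+1 match, -1 mismatch, -2 indel); all function names are hypothetical.

```python
def sigma(a, b):
    """Assumed scoring rule: +1 match, -1 mismatch, -2 for a gap column."""
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def fill_matrix(s, t):
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        V[i + 1][0] = V[i][0] + sigma(s[i], "-")
    for j in range(m):
        V[0][j + 1] = V[0][j] + sigma("-", t[j])
    for i in range(n):
        for j in range(m):
            V[i + 1][j + 1] = max(V[i][j] + sigma(s[i], t[j]),
                                  V[i][j + 1] + sigma(s[i], "-"),
                                  V[i + 1][j] + sigma("-", t[j]))
    return V


def optimal_alignments(s, t):
    """Return (best score, list of all alignments that achieve it)."""
    V = fill_matrix(s, t)
    found = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            found.append((top, bottom))
            return
        # Diagonal pointer: match or substitution.
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, s[i - 1] + top, t[j - 1] + bottom)
        # Vertical pointer: letter of s against a gap.
        if i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], "-"):
            walk(i - 1, j, s[i - 1] + top, "-" + bottom)
        # Horizontal pointer: gap against a letter of t.
        if j > 0 and V[i][j] == V[i][j - 1] + sigma("-", t[j - 1]):
            walk(i, j - 1, "-" + top, t[j - 1] + bottom)

    walk(len(s), len(t), "", "")
    return V[len(s)][len(t)], found


score, alignments = optimal_alignments("AAAC", "AGC")
for top, bottom in alignments:
    print(top, bottom, sep="\n", end="\n\n")
```

For AAAC versus AGC this walk reports AG-C and A-GC and, under the assumed scores, also -AGC, all with score -1.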
complexity
 Space: O(mn)
 Time: O(mn)
 Filling the matrix: O(mn)
 Backtrace: O(m+n)
other
scoring
schemes
 Needleman and Wunsch: 1 for identical amino acid, 0
otherwise
 Dayhoff PAM scoring matrix: variations include BLOSUM
matrices (Henikoff and Henikoff 1992, Proc. Nat. Acad. Sci.
89, 10915-10919).
 …
 Different Gap Cost Function
scoring matrix for protein sequences
substitution “log odds” matrix BLOSUM 62
Henikoff and Henikoff (1992; PNAS 89:10915-10919)
PAM 250 matrix
(M.O. Dayhoff, ed., 1978, Atlas of Protein Sequence and Structure, Vol 5).
multiple sequence alignment
 Often a probe sequence will yield many hits in a search.
Then we want to know which are the residues and
positions that are common to all or most of the probe
and match sequences
 In multiple sequence alignment, all similar sequences
can be compared in one single figure or table. The basic
idea is that the sequences are aligned on top of each
other, so that a coordinate system is set up, where each
row is the sequence for one protein, and each column is
the 'same' position in each sequence.
an example
name of homologous domains
consensus
position of residue
residues and position common to most homologs
cellulose-binding domain of cellobiohydrolase
why multiple sequence alignment?
 Identify consensus segments
 Hence the most conserved sites and residues
 Use for construction of phylogenies
 Convert similarity to distance
www.ch.embnet.org/software/ClustalW.html
 Of genes, strains, organisms, species, life
sequence logo
This shows the conserved residues as larger characters, where the total
height of a column is proportional to how conserved that position is.
Technically, the height is proportional to the information content of the
position.
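The information content behind a logo can be sketched per alignment column. The formula used here, log2(alphabet size) minus the Shannon entropy of the column, omits any small-sample correction, and the gap-free toy alignment is an assumption for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) - Shannon entropy."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment, alphabet_size=4):
    """Per column, the letter heights: information content x letter frequency."""
    heights = []
    for column in zip(*alignment):
        info = column_information(column, alphabet_size)
        counts = Counter(column)
        heights.append({res: info * c / len(column) for res, c in counts.items()})
    return heights


# Hypothetical gap-free DNA alignment, one string per sequence.
toy = ["ACGT", "ACGA", "ACTC", "AGCG"]
for position, column in enumerate(logo_heights(toy), start=1):
    print(position, column)
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with all four bases equally frequent has height 0. For proteins, `alphabet_size=20` would be the corresponding setting.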
sample multiple alignment
constructing the tree of life
Bacteria
A. aeolicus
T. maritima
Eukarya
Archaea
Black tree: distribution of 8-mers. Red tree: sequence alignment.
with k-mers (16S rRNA, 35 organisms)
databases of multiple alignments
 Pfam: Protein families database of alignments and HMMs
 www.cgr.ki.se
 PRINTS, multiple motifs consisting of ungapped, aligned
segments of sequences, which serve as fingerprints for a
protein family
 www.bioinf.man.ac.uk
 BLOCKS, multiple motifs of ungapped, locally aligned
segments created automatically
 fhcrc.org
s o f t w a r e
manual alignment- software
 GDE- The Genetic Data Environment (UNIX)
 CINEMA- Java applet available from:
 http://www.biochem.ucl.ac.uk
 Seqapp/Seqpup- Mac/PC/UNIX available from:
 http://iubio.bio.indiana.edu
 SeAl for Macintosh, available from:
 http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
 BioEdit for PC, available from:
 http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html
B L A S T
 Search a sequence database for fragments similar to
the query sequence
1. Compile a list of high-scoring short words shared by
the query sequence and the database;
2. Scan the database for “hits”
3. Expand the “hits” to MSPs (maximum segment pair = a
pair of equal-length/no-gap segments with the highest
alignment score)
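The three steps can be mimicked with a toy seed-and-extend search. This is a deliberately simplified sketch, not the BLAST implementation: it seeds on exact words only, scores +1/-1, and stops an extension once the score falls `drop` below its best value. All names and parameter values are assumptions.

```python
from collections import defaultdict


def word_index(seq, w):
    """Map every word of length w to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def extend(query, subject, q, s, step, drop, match=1, mismatch=-1):
    """Ungapped extension in one direction; returns (best gain, columns used)."""
    score = best = length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score >= drop:
            break
        q += step
        s += step
    return best, best_length


def best_segment_pair(query, subject, w=3, drop=3):
    """Return (score, query start, subject start, length) or None if no hit."""
    index = word_index(subject, w)                      # step 1: word list
    best = None
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):      # step 2: scan for hits
            right, r_len = extend(query, subject, qi + w, si + w, 1, drop)
            left, l_len = extend(query, subject, qi - 1, si - 1, -1, drop)
            # Step 3: the seed itself scores w (w exact matches at +1 each).
            candidate = (w + left + right, qi - l_len, si - l_len, l_len + w + r_len)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best


print(best_segment_pair("ACGTTGCA", "TTACGTTGCATT"))
```

Indexing words first means only promising positions are ever extended, which is where the time saving over full dynamic programming comes from.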
B L A S T
 Altschul et al. (1990) "Basic local alignment search tool."
J. Mol. Biol. 215:403-410.
 Variations of BLAST designed for specific purposes
 http://www.ncbi.nlm.nih.gov/BLAST/
similarity searching the databanks
 What is similar to my sequence?
 Searching gets harder as the databases get bigger and quality degrades
 Tools:
BLAST and FASTA = time saving heuristics
(approximate)
 Statistics + informed judgement of the biologist
read out
>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus
Query: 17 aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
|||||||||||||||| | ||| | ||| || ||| | |||| ||||| |||||||||
Sbjct: 1 aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59
Query: 77 agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
|||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60 agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119
Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
|||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179
Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239
Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
|| || ||||| || ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
structure - function relationships
 Can we predict the function of protein molecules from
their sequence?
sequence > structure > function
 Conserved functional domains = motifs
 Prediction of some simple 3-D structures (α-helix, β-sheet,
membrane spanning, etc.)
protein domains
(from ProDom database)
DNA sequencing
 Automated sequencers > 40 KB per day
 500 bp reads must be assembled into complete genes
 errors especially insertions and deletions
 error rate is highest at the ends where we want to overlap
the reads
 vector sequences must be removed from ends
 Faster sequencing relies on better software
 overlapping deletions vs. shotgun approaches: TIGR
DNA
sequencing
finding genes in genome sequence is not easy
 About 2% of human DNA encodes functional genes.
 Genes are interspersed among long stretches of non-coding DNA.
 Repeats, pseudo-genes, and introns confound matters
pattern finding tools
 It is possible to use DNA sequence patterns to predict genes:
 promoters
 translational start and stop codons (ORFs)
 intron splice sites
 codon bias
 Can also use similarity to known genes/ESTs
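Of the patterns listed above, open reading frames are the simplest to search for. A minimal sketch, assuming the standard start codon ATG and stop codons TAA, TAG, and TGA, scanning only the forward strand; the function name and the minimum-length cutoff are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts
    every codon from ATG through the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # look for the next ORF
    return orfs


sequence = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```

A real gene finder would also scan the reverse strand and combine ORFs with splice-site and codon-bias evidence, since eukaryotic genes are interrupted by introns.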
phylogenetics
 Evolution = mutation of DNA (and protein) sequences
 Can we define evolutionary relationships between
organisms by comparing DNA sequences?
 is there one molecular clock?
 phenetic vs. cladistic approaches
 lots of methods and software, what is the "correct" analysis?
phylogenetics
software tools on the web
 Many of the best tools are free over the Web
 BLAST
 ENTREZ/PUBMED
 Protein motifs databases
 Bioinformatics “service providers”
 DoubleTwist™, Celera, BioNavigator™
 Hodgepodge collection of other tools
 PCR primer design
 Pairwise and Multiple Alignment
PC programs
 Macintosh and Windows applications
 - Commercial: Vector NTI™, MacVector™, OMIGA™,
Sequencher™
 - Freeware: Phylip, Fasta, Clustal, etc.
 Better graphics, easier to use
 Can't access very large databases or perform
demanding calculations
 Integration with web databases and computing services
Vector NTI
most important sequence databases
 GenBank - maintained by the US National Center for Biotechnology
Information (NCBI)
 All biological sequences
 www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
 Genomes
 www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Genome
 Swiss-Prot - maintained by the EMBL European Bioinformatics
Institute (EBI)
 Protein sequences
 www.ebi.ac.uk/swissprot/
genome
project
the human genome project
The genome sequence is complete - almost!
Approximately 3.2 billion base pairs.
all the genes
 Any human gene can now be found in the genome by
similarity searching with over 99% certainty.
 However, the sequence still has many gaps
 hard to find an uninterrupted genomic segment for any gene
 still can’t identify pseudogenes with certainty
 This will improve as more sequence data accumulates
Raw Genome Data:
example of the code
The next step is obviously to locate all of the genes and describe their
functions. This will probably take another 15-20 years!
inconsistency
 Celera says that there are only ~34,000 genes
 so why are there ~60,000 human
genes on Affymetrix GeneChips?
 Why does GenBank have 49,000
human gene coding sequences and
UniGene have 96,000 clusters of
unique human ESTs?
 Clearly we are in desperate need of a
theoretical framework to go with all of this
data
http://www.celera.com/
implications for biomedicine
 Physicians will use genetic information to diagnose and
treat disease.
 Virtually all medical conditions have a genetic
component.
 Faster drug development research
 Individualized drugs
 Gene therapy
 All Biologists will use gene sequence information in their
daily work
the
equipment
meaning of the code …
e v o l u t i o n
how do genomes evolve?
 Point mutations
 Rearrangements
 Recombination
 Selection and Drift
how can you view the evolution?
 Individual gene alignment view (usually proteins)
 Dot plot or VISTA (local similarity) view
 Synteny view
 Composite (average) views
DNA dot plot view
 Show one DNA along X-axis, second on Y-axis
 For every position along both, score local similarity
 Display 2-D plot of similarity in gray-scale
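The three steps above can be sketched in a few lines. Local similarity is taken here as the fraction of identical bases in a fixed window, and the gray scale is rendered with text characters; both choices, and the function names, are assumptions for illustration.

```python
def dot_plot(x_seq, y_seq, window=3):
    """matrix[y][x] = fraction of matching bases in the windows starting at x and y."""
    rows = []
    for y in range(len(y_seq) - window + 1):
        row = []
        for x in range(len(x_seq) - window + 1):
            matches = sum(x_seq[x + k] == y_seq[y + k] for k in range(window))
            row.append(matches / window)
        rows.append(row)
    return rows


def render(matrix, shades=" .:*#"):
    """Draw similarity 0..1 as increasingly dark characters."""
    top = len(shades) - 1
    return "\n".join("".join(shades[round(v * top)] for v in row) for row in matrix)


# Self-comparison of a sequence containing a tandem duplication of ACGT.
plot = dot_plot("ACGTACGT", "ACGTACGT", window=4)
print(render(plot))
```

The main diagonal is the self-match, and the short off-diagonal marks are the signature of the tandem duplication.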
dot plot example
self-match
tandem duplication
random dot plot
gene structure revealed by dot plot
promoter
conservation
synteny
view
 Synteny definition: a contiguous region in another
genome that has more-or-less the same genes in the
same order.
 The boundaries of what constitutes synteny are a bit
fuzzy… for example, you would probably still call a region
syntenic if it is missing one gene out of many.
single inversion
insertion or deletion
double inversion
intra-chromosomal rearrangements
inter-chromosomal rearrangements
syntenic
scaling
These regions are perfectly syntenic, but on average the mouse has
shorter regions separating alignable conserved blocks.
limitations to synteny view
 Provides only overview of arrangement, with no
information about the degree or areas of conservation.
 As genomes become more distant synteny becomes
more chaotic, until (in the extreme) most blocks are one
gene long (e.g. flies vs. human).
 In some cases, very deep synteny can be seen, most
dramatically in the Hox clusters.
Hox
cluster
composite or summary views
 View comparative summaries
that encapsulate general
properties of the genome.
 For example, G-C content
comparison:
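A G-C content summary can be computed per genome or along a sliding window. A minimal sketch with hypothetical function names; the window and step sizes are arbitrary choices for the example.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step):
    """G-C content of successive windows along the sequence."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


example = "AATTGGCC"
print(gc_content(example), windowed_gc(example, window=4, step=4))
```

Plotting the windowed values for two genomes side by side gives the kind of composite comparison described here.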
phylogenetics and evolution