Looking at Whole Genomes:
Frequency of Occurrence of
Oligonucleotides
Lecture I
Winter School on Modern Biophysics
National Taiwan University
December 16-18, 2002
HC Lee
Dept Physics & Dept Life Science
National Central University
The Book of Life
Millions of sequences
Sequenced genome data exploded after 1995
(GenBank: as of 2002 January 13)
CBL@NCU
The Human Genome
Humans have 23 pairs of chromosomes (24 types)
3 billion bps
Human genome first draft completed Feb 16, 2001
Sequencing of first working draft of
Human Genome published in 2001 February
Nature, 409, February 15, 860-921 (2001)
First working draft of Human Genome
Science, 291, February 16, 1304-1351 (2001)
Genome - Book of Life
written in four letters
Packaged pair of DNA strands
with double helix structure
DNA - a polymer of
nucleotides
Nucleotide – backbone +
bases
Four types of bases:
A, C, G, T (the four
letters)
Gene – coded sequence of
bases
Genome – set of all genes;
set of all chromosomes
Central Dogma
• Genome (DNA): genetic information
(genes)
• Ribosomes: translate genes (nucleotide
sequence, first transcribed to mRNA)
into proteins (amino acid sequence)
• Proteins: expression and function
New way to do
Life Science Research
• in vivo (in the living organism)
• in vitro (in the test tube)
• in silico (in the computer)
Frequency of occurrence
of oligonucleotides
A simple first look at
whole genomes
Oligo (or k-mer) Frequency
• Oligonucleotide (oligo): short sequence of
several nucleotides (k ~ 2-30 long); a k-mer
• There are 4^k different kinds of k-mers
• Frequencies of occurrence of all k-mers in a
sequence can be obtained by reading with a
“sliding window”
• Complete set of frequencies of k-mers
characterizes a DNA sequence
• Very fast to compute; scales linearly with sequence length
• For multiple sequences, scales with the number of sequences
• Related to alignment
Counting k-mers with
Sliding Window
N(GTTACCC) = N(GTTACCC) + 1
• Sum over all N(oligo) = Sequence (circular) length
• Sequence is represented by the set {N(oligo) | all oligos}
• Or: for each k, sequence represented by a 4^k-component vector
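The sliding-window count above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own program: the function name `count_kmers` and the example sequence are made up here, and the sequence is assumed to be at least k bases long.

```python
from collections import Counter

def count_kmers(seq, k, circular=True):
    """Count every k-mer seen by a width-k window sliding one base at a time.

    Hypothetical helper for illustration; assumes len(seq) >= k.
    """
    seq = seq.upper()
    n = len(seq)
    if circular:
        # Wrap around so that every position starts exactly one window.
        extended = seq + seq[:k - 1]
        starts = n
    else:
        extended = seq
        starts = n - k + 1
    counts = Counter()
    for i in range(starts):
        counts[extended[i:i + k]] += 1   # N(oligo) = N(oligo) + 1
    return counts

counts = count_kmers("GTTACCCGTTACCA", 3)
total = sum(counts.values())   # equals the sequence length for a circular read
```

For a circular sequence the counts sum to the sequence length, as stated on the slide; for a linear read they sum to length - k + 1.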
Frequency distribution of 6-mers
[Figure: number of oligos vs. frequency of oligo]
More about this in lecture II
“Portraits” of microbial
genomes
Making a portrait
• Divide a rectangle into 2^k by 2^k
cells, each cell corresponding to
one of the 4^k different kinds of k-mers
• Write in each cell the frequency of
the k-mer
• Color-code ranges of frequencies
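A portrait grid along these lines might be built as below. The slides do not say how k-mers are assigned to cells, so the nested-quadrant mapping used here (each base picks one of four quadrants, recursively) is an assumption, as are the names `QUADRANT`, `kmer_cell` and `portrait`.

```python
from collections import Counter

# Assumed base-to-quadrant map (row bit, column bit); not given in the lecture.
QUADRANT = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def kmer_cell(kmer):
    """Map a k-mer to (row, col) in a 2^k by 2^k grid, one bit per base."""
    row = col = 0
    for base in kmer:
        r, c = QUADRANT[base]
        row = 2 * row + r
        col = 2 * col + c
    return row, col

def portrait(seq, k):
    """Return a 2^k by 2^k list of lists holding the count of each k-mer."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for kmer, n in counts.items():
        if set(kmer) <= set(QUADRANT):   # skip windows with letters other than A, C, G, T
            row, col = kmer_cell(kmer)
            grid[row][col] = n
    return grid

grid = portrait("ACGTACGTTTGA", 2)
```

The color-coding step is then a matter of binning the values in `grid` into ranges and handing the result to any image or heat-map routine.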
Mycoplasma
genitalium
Length 0.58 Mb
G+C content 32%
Bacteria,
Firmicutes
Pathogen from the
human urogenital tract
Mycoplasma
pneumoniae
Length 0.816 Mb
G+C content 40%
Bacteria
Firmicutes
Parasite of the human
respiratory tract.
Borrelia
burgdorferi
Length 0.911 Mb
G+C content 30%
Bacteria
Spirochaetales
Causative agent of
Lyme disease (neurologic complications,
arthritis)
Rhizobium sp.
NGR234
Length 0.53 Mb
G+C content 59%
Bacteria
Proteobacteria
Representative
bacterium that fixes
nitrogen in symbiosis
with many plants.
Aquifex aeolicus
Length 1.55 Mb
G+C content 40%
Bacteria
Aquificales
Earliest diverging, and
most thermophilic
bacteria known. Can
grow on hydrogen,
oxygen, carbon dioxide.
Haemophilus
influenzae
Length 1.83 Mb
G+C content 38%
Bacteria
Proteobacteria
Blood-loving bacterium, named when it was
mistakenly believed to be the causative agent of influenza.
Methanococcus
jannaschii
Length 1.66 Mb
G+C content 31%
Archaea
Euryarchaeota
Anaerobic,
Methane-producing
hyperthermophile;
grows at > 200 atm
and an optimum temp.
of 85 degrees C.
Note: fractals
Helicobacter pylori
Length 1.67 Mb
G+C content 40%
Bacteria
Proteobacteria
Acid-loving causative
agent of chronic gastric
diseases
Note: fractals
Archaeoglobus
fulgidus
Length 2.18 Mb
G+C content 49%
Archaea,
Euryarchaeota
Hyperthermophilic
sulphur-reducer;
causes havoc by
souring oil wells.
Synechocystis sp.
PCC6803
Length 3.587Mb
G+C content 48%
Bacteria
Cyanobacteria
Unicellular
cyanobacterium
widely used for study
of oxygen-producing
photosynthesis
mechanism.
Exceptionally wide
distribution of the frequency of occurrence of
short oligos.
Phylogeny based on
alignment of homologous
sequences
Molecular Evolution & Phylogeny
• Organism represented by Genome
• A universal ancestor is believed to have existed
• Random mutation of DNA sequence leads
to divergence and new species
• Pressure from fitness causes conservation of
sequence
Phylogeny & Sequence similarity
• Because fitness exerts pressure on
functional sequences to be conserved, if the rate
of change induced by mutation is
assumed constant, then the dissimilarity
between two homologous sequences is
indicative of the time elapsed since they
diverged. Hence sequence similarity
can be used to study phylogeny.
• E.g. phylogeny based on 16S/18S rRNA
Sequence Alignment
• Most important method for studying sequence
homology
• Example – alignment of two sequences a and b
Seq a: TACCATCGCAAACAT--GG (length 17 b)
       x||||x|x|||x-|x--x|
Seq b: AACCACCACAAG-ACCTCG (length 18 b)
Consensus length 19, 10 matches (|), 6 mismatches (x),
1 single gap (-, SG), 1 extended gap (--, EG)
Score = matches – (SG + EG)*P – (L_EG – 1)*PE
(P: penalty for each gap, single or extended; PE: penalty for each additional
position of an extended gap of length L_EG; the arithmetic below corresponds to P = PE = 1)
Score = 10 – 2 – 1 = 7
Similarity = matches/total length = 10/19 = 53%
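As a check on the arithmetic, the score and similarity of this example can be recomputed from the two gapped strings. The sketch below assumes the gaps are written explicitly as "-", that P = PE = 1, and that each run of consecutive gap columns counts as one gap; the function name `score_alignment` is made up for illustration.

```python
import re

def score_alignment(a, b, P=1, PE=1):
    """Score two gapped, equal-length strings: matches minus gap penalties.

    P is charged once per gap (single or extended); PE is charged for each
    extra position of an extended gap. Both values are assumptions here.
    """
    assert len(a) == len(b), "aligned strings must have equal length"
    columns = list(zip(a, b))
    matches = sum(x == y and x != "-" for x, y in columns)
    mismatches = sum(x != y and "-" not in (x, y) for x, y in columns)
    # Mark gap columns, then measure each run of consecutive gap columns.
    gap_mask = "".join("-" if "-" in (x, y) else "." for x, y in columns)
    runs = [len(m.group()) for m in re.finditer(r"-+", gap_mask)]
    score = matches - len(runs) * P - sum(r - 1 for r in runs) * PE
    similarity = matches / len(a)
    return matches, mismatches, runs, score, similarity

seq_a = "TACCATCGCAAACAT--GG"
seq_b = "AACCACCACAAG-ACCTCG"
matches, mismatches, runs, score, similarity = score_alignment(seq_a, seq_b)
```

Here `runs` lists the gap lengths (one single gap and one extended gap of length 2), and `similarity` is matches divided by the consensus length of 19.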
Sequence Alignment (II)
• Result intuitive, evolution based
• Widely used in sequence analysis – homology
search, phylogeny, etc
• Parameter dependent – many alignments
possible (Needleman-Wunsch algorithm)
• DNA & proteins sequences
• Good software. E.g., BLAST, GCG,..
• Fast for length < 2000
• Costly for long and remotely related sequences;
NP-complete problem for multiple alignments
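To make the parameter dependence concrete, here is a minimal sketch of Needleman-Wunsch global alignment. It uses a simple linear gap penalty rather than the single/extended gap scheme of the previous slide, and the default scores (match = 1, mismatch = 0, gap = -1) are illustrative assumptions, not values from the lecture.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap            # prefix of a aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap            # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,       # gap in b
                          F[i][j - 1] + gap)       # gap in a
    # Trace back one optimal path from the bottom-right corner.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

score, top, bottom = needleman_wunsch("ACGT", "AGT")
```

Time and memory grow with the product of the two lengths, and changing `match`, `mismatch` or `gap` can change which alignment is returned, which is the parameter dependence noted above.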
The Ribosome
• E.g. phylogeny based on 16S/18S rRNA
– 16S (Prokaryotes): 1550 bases; 18S (Eukaryotes):
1800 bases
• Ribosomal enzyme
• Transcription & translation
• Among the most ancient and best conserved
biological machines
• In genome of EVERY organism
• Two subunits: 30S + 50S
• 30S (small subunit): 16S/18S + 20 proteins
• Translates mRNA
“Cartoon” of 16S rRNA
[Figure: E. coli 16S rRNA secondary structure, with regions labeled Head, Platform, Body and 3'm]
16S rRNA alignment tree
35 organisms: 19 bacteria, 9 archaea, 7 eukarya
[Figure: tree with three branches. Bacteria: E. coli, Bacillus, Aquifex, Herpetosiphon, Thermotoga. Archaea: Methanococcus, Archaeoglobus. Eukarya: Mouse, Homo sapiens, C. elegans]
Phylogeny based on
frequency of k-mers
Sequence distance based
on Oligo Frequency
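The slides do not spell out the distance formula. As one plausible stand-in, the sketch below takes the oligo similarity S of two sequences to be the fraction of k-mer counts they share, and the distance to be 1 - S. Both definitions and the function names are assumptions for illustration, and each sequence is assumed to be at least k bases long.

```python
from collections import Counter

def kmer_counts(seq, k):
    """k-mer counts from a linear sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def oligo_similarity(a, b, k):
    """Shared k-mer count divided by the larger total (assumed definition)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())   # Counter '&' keeps the minimum count per k-mer
    return shared / max(sum(ca.values()), sum(cb.values()))

def oligo_distance(a, b, k):
    """1 - similarity; 0 for identical k-mer content, 1 for no k-mer in common."""
    return 1.0 - oligo_similarity(a, b, k)

d = oligo_distance("TACCATCGCAAACATGG", "AACCACCACAAGACCTCG", 3)
```

No alignment is needed: a matrix of such pairwise distances can be passed directly to a standard tree-building method.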
16S/18S rRNA k-mer tree as a function of k
[Figure: trees with Bacteria, Archaea and Eukarya branches]
Oligo Frequency and sequence
alignment distances correlated
• If sequences evolve ONLY by
uncorrelated single mutations, then:
S = X^n (because the chance of any one base not
changing is X, and an n-mer survives only if all n of its bases are unchanged)
– X: alignment similarity
– S: oligo frequency similarity
– n: oligo length
• In practice, there is more than single mutation,
e.g., extended gaps. Then
S = X^{kn}
with k < 1. Empirically: k = 2/3.
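The single-mutation relation S = X^n can be checked with a small simulation. The sketch below is a simplification under stated assumptions: each base independently stays unchanged with probability X, and S is approximated by the fraction of length-n windows in which no base changed (chance reappearance of an oligo elsewhere is ignored). The function name and parameter values are illustrative only.

```python
import random

def surviving_fraction(length, X, n, seed=0):
    """Fraction of length-n windows left untouched by independent point mutations."""
    rng = random.Random(seed)
    # unchanged[i] is True when base i escaped mutation (probability X).
    unchanged = [rng.random() < X for _ in range(length)]
    run = 0       # length of the current run of unchanged bases
    intact = 0    # number of windows with all n bases unchanged
    for ok in unchanged:
        run = run + 1 if ok else 0
        if run >= n:
            intact += 1   # the window ending at this base is fully unchanged
    return intact / (length - n + 1)

X, n = 0.95, 9
simulated = surviving_fraction(200_000, X, n)
predicted = X ** n
```

With these settings `simulated` and `predicted` should agree closely. Extended gaps are not modeled here, so the sketch does not reproduce the smaller empirical exponent kn.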
Simulated Random Mutations
Oligo length = 9; S = X^9
[Figure: log S_oligo vs. log X_align]
Extended Gaps I
Extended Gaps II
Simulated Random Mutations with Extended Gaps
Oligo length = 9; S = X^6.3
h = 4, ng = 3, kth = 0.625
[Figure: log S_oligo vs. log X_align]
Tree of Life (35 organisms)
Oligo length = 9
h = 5, ng = 2.5, kth = 0.8, kex = 0.66
[Figure: log S_oligo vs. log X_align]
Comparison of 16S/18S rRNA Trees of Life (35 organisms)
• Similar topology
• Differences in detail
[Figure: oligo frequency tree and alignment tree; branches labeled Bacteria, Archaea and Eukarya, with Aquifex and Thermotoga marked. Black: oligo frequency; red: sequence alignment]
Oligo method is Robust
• Three tests (Bacteria and Archaea)
– Random truncation of 16S rRNA to 800
to 1200 bases
– Random inversion of 16S rRNA (splice,
reverse order and reconnect)
– Random concatenation of 23S, 16S and
5S rRNA sequences
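The three perturbations can be sketched as simple string operations. The details below are assumptions where the slide is silent: truncation keeps one contiguous stretch, inversion reverses the order of one randomly chosen segment (plain reversal, not the reverse complement), and concatenation joins the sequences in random order. All function names are hypothetical.

```python
import random

def random_truncate(seq, rng, min_len=800, max_len=1200):
    """Keep a random contiguous stretch of min_len to max_len bases."""
    hi = min(max_len, len(seq))
    lo = min(min_len, hi)
    target = rng.randint(lo, hi)
    start = rng.randint(0, len(seq) - target)
    return seq[start:start + target]

def random_invert(seq, rng):
    """Splice out a random segment, reverse its order, and reconnect."""
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]

def random_concatenate(parts, rng):
    """Join several sequences (e.g. 23S, 16S, 5S) in random order."""
    parts = list(parts)
    rng.shuffle(parts)
    return "".join(parts)

rng = random.Random(1)
toy_16s = "".join(rng.choice("ACGT") for _ in range(1550))   # stand-in for a real 16S sequence
truncated = random_truncate(toy_16s, rng)
inverted = random_invert(toy_16s, rng)
```

Inversion leaves most k-mers intact, since only windows that straddle the two cut points or lie inside the reversed segment change, which is consistent with the robustness claimed for the oligo method.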
[Figure: alignment tree and oligo tree; Thermotoga and Aquifex marked on both; Sulfolobus and Aeropyrum also labeled; scale bar 0.1]
16S rRNA Truncated
[Figure: alignment tree and oligo tree; Aquifex and Thermotoga marked on both; scale bar 0.1]
Mixed 5S+16S+23S rRNAs
[Figure: alignment tree and oligo tree; Aquifex and Thermotoga marked on both]
Towards a
Consensus Tree based
on whole genomes
Tree is sequence dependent
• Phylogenetic relations expressed by
genes are not universal
• A tree extracted from the 16S rRNA gene
differs – not always just in detail - from a
tree extracted from another well
conserved gene
• A consensus tree may be constructed
but depends on criteria that are
subjective
Can a Consensus Tree be
constructed from whole genomes?
• Also a subjective choice
• Genomes are vastly complex, hence the number of
possible combinations of criteria that can
be chosen for tree construction is huge
• Frequency of occurrence of
oligonucleotides has universal
characteristics across life forms (see next
lecture)
– Extremely frequent and extremely rare oligos
(EFEROs) characterize groups of organisms
“Consensus” tree
of 65 microbes
with complete
genomes
[Figure: topology of first-trial EFEROs tree from 6-mers; groups labeled Proteobacteria, Firmicutes, Archaea, Others]
SUMMARY
• Oligo frequency characterizes DNA seqs
• Oligo similarity is related to alignment similarity
• Oligo vs alignment gives a handle on mechanism
of generation of extended gaps
• Oligo method is robust to truncation and inversions
• May be developed into a tool for analysis and
comparison of very long sequences or complete
genomes
• (Preview lecture II): hints at how genomes grow
Lecture and Book
• Lecture by Paul Higgs
– online.itp.ucsb.edu/online/infobio01/higgs/
– see online.itp.ucsb.edu/online/infobio01/ for many lectures
• Book by Wen-Hsiong Li 李文雄
– “Molecular Evolution” (Sinauer Associates, 1997)
Some web sites on Molecular Evolution
• CMS Molecular Biology Resource
– www.unl.edu/stc-95/ResTools/cmshp.html
• Phylogeny - Molecular Evolution
– www.unl.edu/stc-95/ResTools/biotools/biotools2.html
• The Tree of Life Web Project
– tolweb.org/tree/phylogeny.html
• Web Resources in Molecular Evolution and Systematics
– darwin.eeb.uconn.edu/molecular-evolution.html
Some web sites on ClustalW
(tree drawer)
• On-line service
– www.ebi.ac.uk/clustalw/
– clustalw.genome.ad.jp/
• Software
– ftp-igbmc.u-strasbg.fr/pub/ClustalX/
– ftp-igbmc.u-strasbg.fr/pub/ClustalW/
Bacillus subtilis
Length 4.21 Mb
G+C content 40%
Bacteria
Firmicutes
Aerobic bacterium
commonly found in soil;
important source of
industrial enzymes.
Methanobacterium
thermoautotrophicum
Length 1.75 Mb
G+C content 49%
Archaea
Euryarchaeota
Anaerobic microorganism used as
representative of
methanogens.
Escherichia coli
Length 4.64 Mb
G+C content 50%
Bacteria
Proteobacteria
Parasitic human
pathogen of the
digestive tract.