PowerPoint file

Download Report

Transcript PowerPoint file

Gene Predictor
Date:20/11/2003
Implemented By: Zohar Idelson
Supervisor: Dr. Yizhar Lavner
Winter - Summer 2003
1
Genomic Signal Processing
• Genomic Signal Processing is a relatively new
•
field in Bioinformatics, in which signal
processing algorithms and methods are used
to study functional structures in the DNA.
An appropriate mapping of the DNA
sequence into one or more numerical
sequences, enables the use of many digital
signal processing tools.
DNA Segment
atgcggatttgccgtcgatgtc…
Gene
Predictor
DNA Segment
Gene
Gene
2
DNA Basics
• DNA in Eukaryotes is organized in chromosomes.
• The DNA in each chromosome can be read as a discrete signal to
•
•
•
•
•
{a,t,c,g}. (For example: atgatcccaaatggaca…).
In exons (protein-coding region), during the biological amino acids
building, those letters are read as triplets (codons). Every codon
signals which amino acid to build (there 20 aa).
There are 6 ways of translating DNA signal to codons signal, called
the reading frames (3 * 2 directions).
Every gene start with a start-codon and ends with a stop-codon. An
exon cannot consists of more than one stop-codon.
Non coding areas (majority usually) has a lot more random behavior
than genes. Most of the DNA is non coding.
Genes can be detected by some statistics regularities, like codon
usage, nucleotide usage, periodicity and data base comparison.
3
Organisms
• Classified into two types:
• Eukaryotes: contain a membrane-bound nucleus and
organelles (plants, animals, fungi,…)
• Prokaryotes: lack a true membrane-bound nucleus and
organelles (single-celled, includes bacteria)
• Not all single celled organisms are prokaryotes!
4
Cells
• Complex system enclosed in a
membrane
• Organisms are unicellular
(bacteria, baker’s yeast) or
multicellular
• Humans:
– 60 trillion cells
– 320 cell types
Example Animal Cell
www.ebi.ac.uk/microarray/ biology_intro.htm
5
DNA Basics – cont.
• DNA in Eukaryotes is organized in chromosomes.
6
Chromosomes
• In eukaryotes, nucleus
contains one or several
double stranded DNA
molecules orgainized as
chromosomes
• Humans:
– 22 Pairs of autosomes
– 1 pair sex chromosomes
Human Karyotype
http://avery.rutgers.edu/WSSP/StudentScholars/
Session8/Session8.html
7
www.biotec.or.th/Genome/whatGenome.html
8
What is DNA?
• DNA: Deoxyribonucleic Acid
• Single stranded molecule (oligomer,
polynucleotide) chain of nucleotides
• 4 different nucleotides:
–
–
–
–
Adenosine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
9
Nucleotide Bases
• Purines (A and G)
• Pyrimidines (C and T)
• Difference is in base structure
Image Source: www.ebi.ac.uk/microarray/ biology_intro.htm
10
DNA
11
12
Genome
• chromosomal DNA of an organism
• number of chromosomes and genome size varies
quite significantly from one organism to another
• Genome size and number of genes does not
necessarily determine organism complexity
13
Genome Comparison
ORGANISM
CHROMOSOMES
GENOME SIZE
GENES
Homo sapiens
(Humans)
23
3,200,000,000
~ 30,000
Mus musculus
(Mouse)
20
, 2600,000,000
~30,000
Drosophila
melanogaster
(Fruit Fly)
4
180,000,000
~18,000
Saccharomyces
cerevisiae (Yeast)
16
14,000,000
~6,000
Zea mays (Corn)
10
2,400,000,000
???
14
15
DNA Basics – cont.
• The DNA in each
chromosome can be
read as a discrete
signal to {a,t,c,g}.
(For example:
atgatcccaaatggaca…)
16
DNA Basics – cont.
• In genes (protein-coding region), during
the construction of proteins by amino
acids, these nucleotides (letters) are read
as triplets (codons). Every codon signals
one amino acid for the protein synthesis
(there are 20 aa).
17
DNA Basics – cont.
• There are 6 ways of translating DNA signal
to codons signal, called the reading
frames (3 * 2 directions).
…CATTGCCAGT…
18
DNA Basics – Cont.
…CATTGCCAGT…
Start: ATG
Stop: TAA, TGA, TAG
gene
Exon
Intron
Exon Intron
Exon
Exon
19
The Problem
• Given unannotated DNA, find the genes.
• In practice, find the exons and their RF.
• Smaller scale problem: given some
annotated DNA of a creature, find the
exons of unannotated DNA of the same
creature.
atgcggatttgccgtcgatgtc…
Gene
Predictor
Exon
Exon
20
Solution Scheme
• Solution scheme:
– Work in windows analysis.
– Find parameters that gives a good prediction in
annotated DNA (of the same organism). Learn how to
distinguish exons regions from non-exons regions.
– Extract those parameters from the unannotated DNA,
and use the discrimination rule in order to predict.
• Almost all methods shown here fit to this
scheme.
21
Creatures in the Project
C. elegans
S. cerevisiae
(yeast)
22
Existing Methods
• Many methods relies on the pseudo periodicity
of 3 in genes. For that we define:
– Ub is the binary indicator series for base B.
– UB is the STFT of ub.
• N, the window size, is in the hundreds. Exons size is in order
of 101…103 (in S. Cerevisiae).
• Overlapping windows.
– There exists a connection between the DFT in k =
N/3 frequency and nucleotides usage.
23
Calculating the DFT of a DNA sequence*
N 1
n 0
X (k )  DFT {x(n)}
  x ( n )e
N 1
i
N
U b     ub  n  e
 3  n 0
S(n)
N 1
i
2
nk
N
0  k  N 1
n 0
2
n
3
b  A, T , C , G
ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG…
uA(n) 1000010100000111001010100000001010000
uT(n) 01001000001…
uC(n) 0010001001…
uG(n) 000100001001…
1
N
DFT  
N
3
1
N
UA 

N
 3  1
N
UT   1
N
N
 3 
UC  
N
 3 1 U N
G 

N
*Silverman and Linsker 1986; Voss 1992
 3 
24
Spectrogram
Position(nucleotides)
A way for showing the
amplitude of UA, UC, UG
and UT together.
Linear Transform to
RGB.
Magnitude is
represented by
N/3
brightness
Finding exons visually:
bright horizontal lines,
usually in k = N/3
25
Spectrogram – cont.
DNA of C. Elegans chr. III versus totally random DNA
26
Power Spectrum
2
2
2
S= A +C +G + T
2
1
1
A = Ua(k),C = Uc(k) ...
N
N
k {0, 1, ... N/2}
Difference between
gene to non-gene
areas is in 1 order of
magnitude
Used for k = N/3
27
IIR Anti Notch Filtering
R 2  2 R cos z 1  z 2
A( z ) 
1
2 2
1  2 R cos z  R z
all-pass
1  A( z )
H ( z) 
Anti-notch
2
• IIR anti notch filter
aimed to find “peaks” of
a chosen frequency
28
Optimized Spectral Content
Measure (OSCM)
2
argmax
a,t,g
E{ aA  tT  gG }  E{ aAr  tTr  gGr }
std ( aA  tT  gG )  std ( aAr  tTr  gGr )
W = aA + cC + gG + tT
2
Find good coefficients (a,g,t)
for high differentiation
between exons and introns.
Ignoring C since of the linear
dependency in the rest.
Ar, Tr, Gr are generated
from random DNA
sequence, or Introns.
Performance:
29
OSCM Example
Direction
mistake
Good
forward
detection
Good
reverse
detection
30
OSCM Justification
• In genes, the 4 complex
•
•
variables A,T,C,G are not
all-random and tend to be
near a specific angle
(phase).
In introns, the values of
phase seems to be pure
random.
Those unique angles
enable us to detect
reading frame as well.
31
Distribution of the phase of the DFT at the freq
of 1/3 in the genes of S. Cerevisiae:
angular mean
= -1.3734
angular deviation = 0.7903
Distribution of arg(A)
angular mean
= -2.6862
angular deviation = 0.8416
Distribution of arg(C)
angular mean
= 2.7962
angular deviation = 0.5723
Distribution of arg(T)
angular mean
= 0.3556
angular deviation
= 0.4016
Distribution of arg(G)
Argument distributions for all experimental genes in all chromosomes in S. Cerevisiae
32
Distribution of the phase of the DFT at the freq
of 1/3 in the introns of S. Cerevisiae:
Distribution of arg(A)
Distribution of arg(T)
Distribution of arg(C)
Distribution of arg(G)
Argument distribution for non-coding regions in all chromosomes in S. Cerevisiae
33
Fourier Spectra and Position Asymmetry
i
N
U b    f  b,1 1  f  b, 2   e
3
2
3
 f  b, 3   e
i
2
3
f(b,i) is the frequency of the base b in the codon position i, i=1,2,3.
34
Genes versus Introns
Distribution of the DFT of T at 1/3 frequency
Distribution of the DFT of G at 1/3 frequency
(Data taken from S.Cerevisiae, chr. IV)
Introns and
intergenic spacers
Coding regions
genes and exons
Magnitude
small
LARGE
Phase
Randomly
distributed
Narrow
distribution
35
Finding Reading Frame (OSCM
Phase)
 = arg(aA + cC + gG + tT)
Reading
Frame
Color
1
Red
2
Green
3
Blue
•  Is concentrated around
1, 2 and 3 corresponding
to each reading frame.
• Lowering the variance of 
with the optimization:
aA  gG  tT
argmax E{
}
aA  gG  tT
a,t,g
• Transforming  to color.
• Deriving reading frame by
a simple look.
36
New Methods in This Project
• Linear prediction
• Classification by clustering (CC)
• Classification by compression ratios
37
Linear Prediction
n
x[n]   au A [k ]  cuC [k ]  guG [k ]  tuT [k ]
k 1
• Create a walk from the
indicator sequences
• For each window, find
LP coefficients. Look for
differences in
correlation by:
– Poles map
– Frequency response
– Prediction error
• No new findings in this
method.
38
Classification by Clustering
• Recall: DFT in k=N/3 frequency has a
strong correlation with genes locations
and reading frames (as shown in part A)
• Here we’ll attempt to use it in order to
discriminate exons from the rest, in a 6D
space
• Learning phase: clustering
• Classification phase: fuzzy KNN
39
Classification by Clustering
Clustering Stage: Example
From left to
right: C, G and
T.
S. Cerevisiae 5th
chromosome.
40
Classification by Clustering
RF
=1
(T,C,G) new
sample
RF =? 1
T
Exon?
RF
=1
C
G
RF =? 3
Max
-120°
RF
=1
Reading
frame (if
it’s an
exon)
RF =? 2
+120°
DFT(k=N/3)
DFT(k=N/3)
DFT(k=N/3)
uT
uC
uG
Indicator
Indicator
Indicator
DNA = … atcgtgactagc …
Start here
‫סף‬
41
Classification Rule
• Fuzzy KNN: create a fuzzy • Two methods for
membership function and
choose the one with the
highest score. Add fuzzy
clustering iteration to the
LBG algorithm.
classifying gene/nongene:
– Add genes and non-genes
scores, and max sum wins.
– Max centroid score wins.
• 2nd method used (better
performance). Scores
sums are used for
reading frame: max r.f.
wins.
42
Results
f_p
• Creature: S. Cerevisiae.
• Learning was done on the
5th chromosome.
• Parameters:
– K=7 and m=2 of fuzzy
KNN.
– True exon  50% exon.
– Thresh = 1.
• Total: only 4.6% of true
exons weren’t detected at
all.
f_n
rf_true
f_n_exons
# exons
# missed
1
0.1037
0.4524
0.9574
0.0882
102
9
2
0.0821
0.4735
0.9685
0.0472
381
18
3
0.0917
0.4618
0.9551
0.071
155
11
4
0.0821
0.4615
0.9654
0.029
725
21
6
0.1102
0.4247
0.9762
0.05
120
6
7
0.0821
0.4749
0.9647
0.0258
504
13
8
0.103
0.4716
0.9671
0.0456
263
12
9
0.1091
0.452
0.9476
0.04
200
8
10
0.1005
0.4719
0.9723
0.0293
341
10
11
0.0822
0.4816
0.9641
0.0703
327
23
12
0.0973
0.4759
0.9722
0.0514
486
25
13
0.0885
0.4682
0.9607
0.0365
438
16
14
0.1041
0.4597
0.9616
0.0397
378
15
15
0.0904
0.4644
0.9665
0.0311
514
16
16
0.0824
0.4744
0.9662
0.0452
442
20
Total
5376
223
43
CC - Example
44
CC - Improving
• Instead of deciding for each reading frame
separately and then decide which r.F.
“Won”, we can replicate the centroids for
the other reading frames and the
classification rule will determine [exon /
non-exon] + [reading frame], at the same
time. This suppose to cause a more fair
competition between the reading frames.
45
Classification by Compression
Rates
A T C G A T C G T A C G C A T G C A T G C A T G C A T G A A A A
29
18
1
60…
Nucleotides
(‘A’,’C’,’T’,’G’)
Codons (0..63)
• In forward coding, creating
3 different codon sequences.
• In classification of reverse
coding, first complementing all
the DNA, then treating it like
forward (and results will also
be reversed)
• In the end of this stage, we
have 6 codon seriates.
46
The Idea
• If we have a dictionary with the popular
words ( = codon sequences) in exons
which aren’t popular in non-exons then:
– Good compression will be achieved in exons
– Good compression will not be achieved in
introns
• So we need a good dictionary and a good
compressing algorithm
47
Building the Dictionary
• Aim: the output
• Add restriction on
•
•
•
•
dictionary is expected
to hold short popular
words in exons.
Using LZW algorithm.
Input: all exons of
learnt chromosome.
Initial dictionary: all
codons.
•
length of words to be
entered to the
dictionary.
Output I: dictionary
with words that
appeared in exons.
Output II: the code of
the exons by the
dictionary.
48
LZW: Encoding
1) Accum  first input letter
2) If dict.Find(accum) == false
1)
2)
3)
4)
Dict.Add(accum)
Code.Add(index)
Accum  accum(end)
Return to (2)
3) Else:
1) Index = dict.Findwhere(accum)
2) Accum.Add(next letter from input)
3) Return to (2)
49
Dictionary Pruning
• Output LZW dictionary is a tree (TRIE).
• Aim: keep the most popular words, but don’t
•
allow undesired redundancy.
Method:
– Go on every level of the tree (starting in max length
words) and take predefined number of popular words.
– Pass number of appearances (from output code) to
parents: pass the sum of all, OR pass the sum of
untaken. More variations: multiply by the entropy.
50
Using Entropy for Better Pruning
24*log(4)
= 48
40*log(1) = 0
[31 45 1]
6
[31 45 1 60]
[31 45 1]
6
[31 45 1 31]
6
[31 45 1 30]
6
40
[31 45 1 13]
[31 45 1 30]
20*(-1)*[5/6*log(5/6) + 2*1/24*log(1/24) +
1/16*log(1/16)] = 20*0.8513 = 17.0255
[31 45 1]
1
[31 45 1 60]
20
[31 45 1 31]
1
[31 45 1 30]
2
[31 45 1 13]
51
Compression Rates Classification
2. 6 codons
sequences for
the 6 different
reading frames
3. Compressing with genes
based dictionary
1. Input:
DNA of a
chromosome
and gene based
dictionary
8.
6 binary vectors
– the final
classification
7. Post Processing
4. 6 compress
rates vectors
5. Rf_wins = Argmax{compress_rate(rf),thresh)
Lowerthresh = Argmax{compress_rate(rf),lower-thresh)
Too_much_stops = 1 if window has more than 1 stop codon
6.
6 binary vectors
+ post
processing data
52
Post Processing
• Lower threshold technique: tag as true
every window that is between close
already-tagged windows, if value larger
than the lower threshold.
• Stop codons quantity in the window: more
than one => not an exon-window (which
is larger than analysis window size).
53
Compression Rates: Example
54
Stop Codons Usage
• 100,000b of 2nd
•
chromosome
1 where there is one
stop codon in the
window, at most
55
Post Processing: Stop-codon Usage
• Stop codon
usage cleans up
many potential
false positives,
without
damaging any
success measure
• Hence, a lower
principal
threshold can be
determined and
we’ll get better
performance
Without stop codon usage
56
Compression Rates: Results
• Learnt chromosome = 1st , window size = 100c, dictionary size = 1381 (32
codons, branching = 3)
• After choosing best configuration, going over all the chromosomes:
#
f_p
f_n
rf_true
f_n_exons
# exons
# miss
THRESH
2
0.10442
0.13809
0.93866
0.046875
384
18
0.457
3
0.10015
0.16098
0.92234
0.03871
155
6
0.457
4
0.084271
0.14014
0.93809
0.036986
730
27
0.457
5
0.090556
0.13763
0.92723
0.034483
261
9
0.457
6
0.13909
0.14274
0.92495
0.041667
120
5
0.457
7
0.12053
0.14733
0.93927
0.027723
505
14
0.457
8
0.15057
0.14538
0.93362
0.059925
267
16
0.457
9
0.13161
0.13816
0.92458
0.045
200
9
0.457
10
0.12222
0.12411
0.93447
0.03207
343
11
0.457
11
0.07833
0.14575
0.93712
0.069069
333
23
0.457
12
0.14106
0.13654
0.9405
0.064777
494
32
0.457
13
0.11051
0.14338
0.92814
0.040816
441
18
0.457
14
0.15044
0.15475
0.93434
0.026525
377
10
0.457
15
0.089995
0.14578
0.9357
0.044231
520
23
0.457
16
0.12039
0.13794
0.93657
0.033784
444
15
0.457
total
0.0423394
5574
236
57
Compression Rates: Improving
• Use non-exon dictionary, or prune exon-
dictionary considering non-exon common
words.
• Adaptive dictionary: when detecting an
exon, use its common words to update the
current dictionary.
58