Geometric Crossovers for Supervised Motif Discovery

Download Report

Transcript Geometric Crossovers for Supervised Motif Discovery

Geometric
Crossovers for
Supervised Motif
Discovery
Rolv Seehuus
NTNU
Motivation and Scope

Try out the applicability of the geometric framework, on a
supervised motif discovery problem


In practice, we test on a very easy problem



Compare its merits to a previously used operator.
that existing software can solve easily
Value as test case
Building block for more complex motif discovery
problems, that current algorithms can not solve
satisfactory
Motif Discovery




Has become a standard problem in bioinformatics
Given a set of sequences, figure out what is
special with it
…by eliciting motifs in the dataset
Differing by…



Motif model
Learning algorithms
Scoring functions
The Standard Approach

Do analysis of the positive set of sequences




…background distribution…
…information content…
…statistical significance…
Report motif
Motif discovery as a
classification problem






Always at least two datasets: The positive, and “the rest”
Choose a negative dataset
Report motifs best suited to discriminate
No need to learn a background model
The statistical significance of the motif can be given
Discriminative motif discovery has received increased
attention lately
Classification problem




Protein sequences, from the SwissProt database
Classified according to protein family (as
specified in the Prosite database)
Selected six families, that previously have been
shown to be hard to classify under similar
circumstances.
Some of the families can be said to have an
overrepresented motif as the ones we can train on
The Potential Negative Data Set



Huge, compared to the negative
Quite common in bioinformatics, and an interesting
problem to cope with in its own right
In field:
randomly generated sequences
 one set of randomly selected sequences
 random rearrangement of the positive sequences (data
not shown)


The “best practice” was to select the samples randomly
from the negative set each generation, so that their size
matches the positive set.
Motif Model


Twenty amino acids
Wildcard
C...C.C..C
DMEGACGGSCACSTCHVIVDP
Motif match, positive sequence
Operators on Motifs

Unit edit move as mutation
Mut(A) = {Insert, Delete or Replace a token}


Substring Swapping Crossover (for comparison)
Two-point Geometric Crossover
Geometric Crossover




Search space have a metric
Mutation is a move in search space
Crossover yield children found on the shortest
path between the parents in search space
Successfully applied to other problems
Geometric Crossover for Motifs




View motifs as sequences
Basic assumption: The edit distance is a good way
to move around in motif space
A crossover based on the edit distance, should
yield a good crossover for motif discovery
We (arbitrarily) choose unit costs for insertions,
deletions and substitutions
Sequence Alignment





Alignment: put spaces (-) in both sequences such as they
become of the same length
Seq1’= agcacac-a
Seq2’= a-cacacta
Score: 2
An Optimal alignment is an alignment with minimal
score
The score of the optimal alignment of two sequences
equals their edit distance
There often are multiple optimal alignments
Homologous Crossover
1.
2.
3.
4.
Pick an optimal alignment for two parent sequences
Generate a crossover mask as long as the alignment
Recombine as traditional crossover
Remove dashes from offspring
Mask =
1101100
Seq1’=
BANANASeq2’= -
Child1’=
BANANAS
Child2’=
ANANA
Experiments


Two crossovers with same parameters, and mutation only
Ten fold cross validation:
Partitioned datasets in ten pieces
 Trained on 9/10ths
 Tested the best motif on the remaining test set




Trained on randomly selected subset of SwissProt
Tested on entire SwissProt
Fitness: Scaled Pearson correlation of confusion matrix
Dynamic behavior during
evolution
Maximum Values
Max
Cytochrome





Include the following fragment of a highly
conserved motif: C…CH
Which geometric crossover find
While substring swapping finds: CH
Conservation of length keeps us in the correct
ballpark
CH representa local maximum for substring swap
Ferredoxin




Contains the following motif: C..C..C...C[PH]
Which Substring Swap finds
While Geometric Crossover don’t
Conservation of length keeps us from finding the correct
motif
Population Means
Means
Classification Performance
Medians, of 10 experiments, for each family
Name
Carbamoyl
Dna J
Cytochrome
Ferredoxin
HLH
EF Hand
SubSwap
Geometric
Mutation
Train Test Train Test Train Test
0.99 0.57 0.99 0.56 0.99 0.56
0.9 0.23 0.91 0.23 0.92 0.23
0.8
0.1 0.98 0.36 0.97 0.36
1 0.87 0.81 0.13 0.96 0.39
0.5 0.07 0.52 0.06 0.59 0.08
0.62 0.15 0.67 0.19 0.66 0.19
Classification Performance - II


Similar for all operators
Maybe a slight advantage, for the geometric
crossover if we have



A highly conserved motif exist
A “ballpark guess” on motif length
Surprisingly, mutation frequently outperforms the
other operators
Concluding remarks




The geometric operator is promising - need work
It is more length preserving than substring swap
The geometric operator need a good guess on motif
length
Edit move might not be optimal for motif discovery?


even though, it for some problems shows merit.
Our initial assumption imply an insertion/deletion equally
often as replacement in sequence data

we are WAY off on that parameter
Future Work




Synthetic data with known parameters
Include character classes and within motif gaps in
representation
Modules (composite motifs)
Expand to position weight matrixes