Logic-statistic modeling and analysis of biological sequence data: A

Download Report

Transcript Logic-statistic modeling and analysis of biological sequence data: A

Logic-statistic modeling and analysis
of biological sequence data:
A research agenda
Henning Christiansen
Roskilde University, Denmark
[email protected], http://www. ruc.dk/~henning
International Workshop on Abduction and Induction in AI and Bioinformatics
Aix-en-Provence, 15 sep 2007
Motivation and overall goal
Computational analysis of biological sequence data
traditionally based on
– HMM, SCFG, ad hoc techniques
– Each system has its particular type of models
A bottle-neck which cannot be remedied (only) by faster
and parallel computers
We want to promote the application of more expressive
and flexible models: logic-statistic methods a la
PRISM (Sato, Kameya)
I.e. stepping from regular and context-free languages to
Turing-complete language
To reach this goal, we will...
• Approach inherent computational problems
– optimizations by program analysis & transformations
– interface to existing and efficient software
• Develop biologically relevant test cases
– for biologist to learn how to use such models
– to have relevant test cases for the first part
Project setup
Funded by the NABIIT program under the Danish
Strategic Research Council, 2007–2011
Main academic partners:
– Roskilde U. Computer Science: H. Christiansen, J. Gallagher;
1 PhD student (still open!), postdocs
– Roskilde U., Biology: O. Skovgaard; 1 PhD student
– Aalborg U. Computer Science: M. Jaeger
Academic associates:
– Taisuke Sato, Tokyo Inst. of Techn.
– A. Krogh, Copenhagen Univ.
Industrial partners:
– Chr. Hansen, Denmark.
• Wordwide supplier of probiotic products for the dietary supplement
industry
– CLC bio
• Leading supplier of bioinformatics software
PRISM (Sato, Kameya) for sequence
analysis, introduced by a toy example
PRISM extends Prolog with discrete random variables
Includes machine learning and prediction methods:
– learn best probabilities to explain training data
– with learned prob’s, determine best answer to a query
Example: Loop structures - a non-context-free phenomenon
Assume a collection of sequences where loop structures
have been identified in the lab
Task: Build and train model so it can be used for prediction
Example model in PRISM
Assume (arbitrarily):
• ‘noise’ ≈ a 1. order Markov model
• ‘contact zone’ ≈ a 2. order
noise(F,S1,S2):- msw(moreNoise,YN), noise2(F,S1,S2,YN).
noise2(_,S,S,stop).
noise2(F,[B|S1],S2,continue):- msw(which(F),B), noise(B,S1,S2).
contact(K,F1,F2,S1,S2):- msw(moreContact,YN),
contact2(K,F1,F2,S1,S2,YN).
contact2([],_,_,S,S,stop).
contact2([B|K],F1,F2,[B|S1],S2,continue):msw(which(F1,F2),B), contact(K,F2,B,S1,S2).
contactCopy([],S,S).
contactCopy([B|K],[B|S1],S2):- contactCopy(K,S1,S2).
This is the entire model!
values(moreNoise,[stop, continue]).
values(moreContact,[stop,continue]).
values(which(_),[a,c,g,t]).
values(which(_,_),[a,c,g,t]).
sequence(...):sequence(K,S):noise(...),
noise(-,S,S1),
contact(K,....),
contact(K,-,-, S1,S2),
noise(....),
noise(-,S2,S3),
contactCopy(K,
contactCopy(K, ...),
S3,S4),
noise(...).
noise(-,S4,[]).
Training data:
sequence([c,c,g,g,g,t,c,g,c],[a,c,c,g,g,g,t,c,g,c,a,a,t,c,a,a,a,t,c,t,t,t,a,a,c,c,c,g,g,g,t,c,g,c,a,g,a,c,t,a,t,g,t,t,t,a,g,a,a,a,a,c,a,t]).
sequence(......, ......).
sequence(...., .....).
.......
Using a the trained model for prediction
?- viterbig(sequence(K,[t,a,t,a,g,c,g,c,t,a,t,a,g,c,g,c,t,a,t,a]))
K = [g,c,g,c]
The answer to the query with highest probability.
... plus a lot of other facilities
Our first serious application of PRISM:
Testing gene finders (MLDM 2007; with C.M.Dahmcke)
Problems: Test data expensive; available test data already
used for training gene finders; disagreement about what
is a gene, ...
Approach:
– Develop and train PRISM model with known, annotated data
– Use this to create artificial test data,
• i.e., sequences with annotations about where-are-the-genes
– Check if gene finder programs find the same genes
Results:
– Three different gene finders found too many and different
genes ;-(
Overview of the model (intergenic only)
GC-sparse
repeat1
GC-island
repeat2
GC-sparse
repeat3
Target predicate:
sequence(sequence-of-ACGT, GC-islands, repeats)
GC-islands: list of from-no–to-no
repeats: list of from-no–to-no with indication of:
type: simple, low-complexity, named,...
for named: selected from catalogue; which part; forward, backward, transposed,
backward+transposed
plus one detailed description of »mutation«:
[c,c,c,c,i,i,c,c,d,d,c,c,...]
(to suppress complexity in the model;
for training data generated by a best-match algorithm)
...
Implemented as a two-layer model
Top-level: GC-islands/GC-sparse, length 200 +
exponential decay
Underlying layer: Mix of repeaters and coloured
noise
Two-level structure implemented by our own
abstract datatype:
msw(RV: random-var, value, GC-islands, position)
- uses hidden msw’s to control GC-island/sparse
- each RV maintained in two versions (hidden)
- position, counter to produce annot. GC-islands
Lesson learned from gene finder experiment
• A nontrivial model can be organized in a reasonable way
by an experienced logic programmer
• Preprocessing to freeze one mutation set reduced
complexity of learning phase - general technique?
• Model could be trained in minutes from marked up
sequences of total 106 letters.
• With Prolog’s list repr. for sequences we needed 64-bit
architechture (sic!) and lo-o-o-o-ot of RAM
• PRISM is a very flexible tool for combining and varying
different models, inventing a little data structure etc., but
keeping a model with well-defined semantics
• We lacked, and would suggest to add to PRISM
– Distributions over integers (normal d. or “generic smooth”)
– Over-layered, and especially negative criteria
• (ouf: random variables become dependent)
Anticipated problems for prediction
with PRISM and possible solutions
• Storage consumption, the sequence as array +
PRISM’s explanation graphs (???)
• Execution time
– Systematic approach to pruning: Generalize known methods
for semantics-preserving program transformations to
semantics-approximating transformations
– Integrate with existing and efficient software
• Automatically and hidden??? If, e.g., analysis of PRISM
program says “this-looks-like-a-HMM”
• Clean interfaces ????
– Reduce complexity by splitting the sequence (-- how to
integrate this with a nice semantics?)
Biological problems considered
• Gene finding in health promoting bacteria
• Phylogenetic gene prediction
• Prediction of gene function and acquisition by
orthology
• (Gene finding for eukaryotic species)
Project hypotheses summarized
• Logic-statistic models, a la PRISM or similar,
have much higher expressibility and flexibility
than traditional models used for sequence
analysis (in formal as well as practical sense)
• If we can solve some of the computational
problems involved and learn how to use such
powerful modeling tools, there is a potential
for new discoveries in biology.
Thanks for your attention!
PS. We are seeking (desperately) a good PhD student for the
computational issues. Good salary and conditions offered!