Improving scalability in ILP incremental systems


Improving scalability
in ILP incremental systems
Marenglen Biba, Teresa Maria Altomare Basile,
Stefano Ferilli, Floriana Esposito
Inductive Logic Programming: Why do that?
Can learn sets of "if-then" rules that contain variables, called first-order Horn clauses (see the Prolog sketch below), e.g.:
 Ancestor(x,y) ← Parent(x,y)
 Ancestor(x,y) ← Parent(x,z) ∧ Ancestor(z,y)
thus
 Complex structural descriptions can be learned, such as:
 Molecular 2D and 3D structure
 Document structure as a set of logical components
 Protein fold descriptions
and
 Any other set of descriptions that needs to model complex relations among objects.
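As a toy illustration of this setting (the family facts and example atoms below are hypothetical, chosen only to make the Ancestor clauses above concrete), the learner is given ground background facts and ground examples and must induce the two clauses:

  % Background knowledge B: ground facts (hypothetical individuals).
  parent(ann, bob).
  parent(bob, carl).

  % Examples: E+ = {ancestor(ann, carl)},  E- = {ancestor(carl, ann)}.

  % A theory T the learner could induce from B and E:
  ancestor(X, Y) :- parent(X, Y).
  ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).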

ILP: Formal setting (Muggleton 1994)

Given:
 a set of examples E = E+ ∪ E−
 background knowledge B
Learn: a logic theory T such that
 Completeness: T ∪ B ⊨ E+
 Consistency: T ∪ B ∪ E− ⊭ □
Under the conditions:
 Necessity: B ⊭ E+
 Sufficiency: B ∪ T ⊨ E+
 Weak Consistency: B ∪ T ⊭ □
 Strong Consistency: B ∪ T ∪ E− ⊭ □
(□ denotes the empty clause, i.e., inconsistency.)
Neither Sufficiency nor Strong Consistency is required for systems that deal with noise.
How to check Completeness and Consistency: a theorem prover.
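A minimal Prolog sketch of these two checks, assuming B and T have already been loaded into the interpreter and the examples are kept as lists of ground atoms (predicate names here are hypothetical):

  :- use_module(library(lists)).   % member/2

  % Completeness: every positive example must be derivable from B ∪ T.
  complete(PosExamples) :-
      \+ ( member(E, PosExamples), \+ call(E) ).

  % Consistency: no negative example may be derivable from B ∪ T.
  consistent(NegExamples) :-
      \+ ( member(E, NegExamples), call(E) ).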
Example: Describing protein folds
  fold('Four-helical up-and-down bundle', P) :-
      helix(P, H1),
      length(H1, hi),
      position(P, H1, Pos),
      interval(1 < Pos < 3),
      adjacent(P, H1, H2),
      helix(P, H2).
 Reading of the clause: the protein P has fold class "Four-helical up-and-down bundle" if it contains a long helix H1 at a secondary-structure position between 1 and 3, and H1 is followed by a second helix H2.
 [Figure: a positive and a negative example of the protein fold "Four-helical up-and-down bundle"; 3-D arrangement of secondary structure units, shown for α-helices (cylinders) and β-sheets (arrows).]
Learning logic programs: proving theorems
 T, B and E are each logic programs. A logic program is a set of definite clauses, each having the form h ← b1,...,bn, where h and b1,...,bn are atoms. E+ and E− consist of ground clauses.
 Procedure Tuning(E: example; T: theory; B: background knowledge; M: historical memory)
     if E is a positive example AND ¬covers(T,E) then
         Generalize(T, E, B, M−)
     else if E is a negative example AND covers(T,E) then
         Specialize(T, E, B, M+)
     end if
     M := M ∪ {E}
 What we need is nothing more than a Prolog interpreter as theorem prover.
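Read operationally, the Tuning procedure is just a dispatch on the outcome of a coverage test. A minimal Prolog sketch, where generalize/5 and specialize/5 are assumed to be supplied by the learner, each example carries its sign, and coverage is simply provability against the loaded theory and background:

  % tune(+SignedExample, +Theory0, +Background, +Memory0, -Theory, -Memory)
  tune(ex(Sign, E), T0, B, M0, T, [ex(Sign, E)|M0]) :-
      (   Sign = pos, \+ covers(T0, E) ->
          generalize(T0, E, B, M0, T)    % positive example not covered: generalize
      ;   Sign = neg, covers(T0, E) ->
          specialize(T0, E, B, M0, T)    % negative example covered: specialize
      ;   T = T0                         % example already handled correctly
      ).

  % With a Prolog interpreter as theorem prover, covering a ground example
  % means proving its atom (theory and background assumed already loaded).
  covers(_Theory, E) :- call(E).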
Learning logic programs: doing it incrementally
 How to learn: batch or incremental?
 Batch: learning always starts from an empty theory, and refinement of the theory continues until all the examples are explained. Whenever a problematic example appears, everything starts again from the beginning, without reusing the previously generated theory. Since everything must be learned again, the computational effort is considerable.
 Incremental: why learn from scratch? What if we use the theory learned so far and try to refine it? We could save computational resources by learning in an incremental fashion.
 INTHELEX (INcremental THEory Learner from EXamples) is an ILP system that adopts incremental refinement of theories.
 INTHELEX adopts multistrategy learning, can start from scratch, and learns several concepts simultaneously. It is a closed-loop learning system.
ILP applications: What has successfully been done
 Relational Data Mining and Knowledge Discovery for:
Drug Design/Toxicology,
SAR (Structure Activity Relationship),
Protein Structure Prediction,
Functional Genomics,
Systems Biology
 Learning rules for early diagnosis of rheumatic diseases
 Learning qualitative models of dynamic systems
 Applications in finite element mesh design.
 Applications in Natural Language Processing
 Document Classification and Understanding
ILP future applications: one of the challenges
 Computational complexity remains a major challenge for practical applications of ILP. For example:
In Systems Biology, scientific knowledge discovery is made possible by the power of ILP, but millions of records containing 3-D molecule descriptions must be processed, leading to extremely expensive computational processes.
 The huge hypothesis space to be searched makes these tasks NP-hard, hence the need for scalable ILP systems able to process large quantities of data within reasonable time limits.
ILP scalability issue: what has been done
 Subsampling methods focus on a smaller set of examples for processing very large datasets (Srinivasan 99).
 Local coverage tests within a learning-from-interpretations setting, where independence assumptions among examples are reasonable and coverage of each example can be tested without considering the entire background knowledge (De Raedt 99).
 Ad-hoc theta-subsumption techniques to speed up the coverage test have also been investigated (Di Mauro 2003).
 However, the NP-hard nature of problems with a huge hypothesis space to search motivates further efforts toward scalable ILP systems able to process very large datasets.
 How large are "large" datasets? Do we need scalability?
 Imagine millions of 3-D protein fold descriptions to be learned from…
ILP systems: their drawbacks
 Almost all ILP systems are internal-memory systems, in that they employ Prolog interpreters for proving theorems in main memory. Unfortunately these interpreters are not efficient: they load all the examples into main memory and, whenever the theory must be refined, they try to prove goals by scanning the entire main module, where thousands of examples may be present.
 Prolog interpreters do not maintain any information gathered during the theorem-proving process. What does that mean? No clause knows which examples it covers, so whenever a clause is refined an overall check of the entire memory must be done.
 What if every clause knew which examples to check when a coverage test has to be done?
 Prolog interpreters also do not maintain any information useful for better understanding the learning process, or for trying other directions in the search for hypotheses.
 What if we kept the successive versions of the clauses, with related statistics, and used them whenever the search process must be redirected?
Improving scalability: First Step
 Basic idea: eliminate main-memory handling of terms. Instead of loading them into main memory, use external storage that provides fast access and eliminates the sequential nature of the Prolog module.
 To make it fast, what to choose: integrate a commercial RDBMS, or implement a tailored solution for external term handling?
 Berkeley DB is an Open Source project maintained by Sleepycat. It is a database library that provides a powerful API for data management. It is an "embedded" data engine, in that the libraries execute in the address space of the host application, so no inter-process communication is required, as it would be with an external RDBMS (MySQL or SQL Server).
 The API is available in various languages: C, C++, Java, Python, Prolog, etc.
 The underlying structures are hash tables, B-trees, record-number-based storage and queues.
 Portable: 16-, 32- and 64-bit machines, Windows, Unix, Linux and any other OS. The abstraction layer is about 2,500 lines of C code, so any OS is possible.
 No query language, only function calls for data retrieval, thus no parsing of queries.
 Indexing, page cache management, transactions and logging, locking and Concurrent Access Methods are provided.
 Adopted by HP, Sun, Google, Motorola and Amazon.com for their embedded systems. (The library is only about 300 KB but can handle up to 256 terabytes of data.)
Improving scalability: First step
 The interface we exploit uses the Concurrent Access Methods product of Berkeley
DB. This means that multiple processes can open the same database, but
transactions and disaster recovery are not supported. The environment and the
database files are ordinary Berkeley DB entities that use a custom hash function.
 We developed a tailored embedded solution using the SICStus Prolog interface to the Berkeley DB toolset to support persistent storage of Prolog terms. The idea is to obtain behavior similar to the built-in Prolog predicates such as assert/1, retract/1 and clause/2.
 DB-specification:
  speclist = [spec1, ..., specM]
  spec     = functor(argspec1, ..., argspecN)
  argspec  = + | -
 where functor is a Prolog atom. A spec F(argspec1, ..., argspecN) is applicable to any ground term with principal functor F/N.
Improving scalability: First step
 The functors and the indexing specifications of the terms to be stored have to be given when the database is created.
 It is possible to index on parts of the term other than just the functor and first argument.
 Changes affect the database immediately.
 The database can store variables (other DBMSs cannot).
 Example: storing and fetching examples in Prolog
 db-spec: examples(+,-,-,-,-,-,-), indexed on the argument marked with "+":
  db_create(examples(+,-,-,-,-,-,-), MyDb)
  db_store(MyDb, examples(KeyExample, Concept, Head, Body, Constraints, Symbolic, Numeric))
  db_fetch(MyDb, examples(KeyExample, _, _, _, _, _, _))
  db_iterator(MyDb, examples(KeyExample, _, _, _, _, _, _))
  db_iterator_next(MyDb), db_open(MyDb), db_close(MyDb)
 All these operators call C-language API functions of Berkeley DB.
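As a sketch built only from the operators listed above (the concrete field values are hypothetical and the actual interface arities may differ), storing an example once and later fetching it by key would look roughly like:

  % Store one ground example term, then retrieve it through the index on its key.
  store_then_fetch(Key, Fetched) :-
      db_create(examples(+,-,-,-,-,-,-), MyDb),
      db_store(MyDb, examples(Key, doc_class, head_atom, body_atoms, [], [], [])),
      Fetched = examples(Key, _, _, _, _, _, _),
      db_fetch(MyDb, Fetched),    % only the "+" argument is needed to locate the term
      db_close(MyDb).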
Improving scalability: Second step
 Basic idea: having eliminated Prolog's main-memory handling, we can link clauses to examples.
 Implement a relational schema that keeps links between clauses and the examples they cover.
 What is this useful for? When a clause is specialized, only the examples previously covered by that clause need to be checked. Loading all the examples and running all the coverage tests, which in Prolog corresponds to proving all these goals, is replaced by a fast external check of only the examples related to the clause.
 When a clause is generalized it is useless to check all the past positive examples: the positive examples previously covered will continue to be covered, since the clause has been generalized. Through our schema, only the negative examples of the same concept (clauses and examples are also indexed on their concept) are checked, saving the unnecessary checks that a Prolog interpreter would perform by scanning the entire main memory again.
 The expensive coverage test is done only the first time an example enters the database. Subsequently it is sufficient to check a hash table with the keys of the clauses and examples.
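A minimal sketch of how the link can be exploited (predicate and key names are hypothetical; still_covers/2 stands for the actual coverage test): the link table is just a set of (clause key, concept, example key) tuples, and specialization consults it instead of re-proving goals over the whole example set.

  % covers_link(ClauseKey, Concept, ExampleKey): clause ClauseKey currently
  % covers example ExampleKey; both are also indexed on their concept.

  % After specializing a clause, re-test only the examples it used to cover.
  recheck_after_specialization(ClauseKey, Concept, StillCovered) :-
      findall(ExKey,
              ( covers_link(ClauseKey, Concept, ExKey),
                still_covers(ClauseKey, ExKey)
              ),
              StillCovered).

  % After generalizing a clause, previously covered positives stay covered, so
  % only the negative examples of the same concept need to be re-checked.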
Improving scalability: Is it all?
 What if we maintained versions of the theory step after step? At a certain moment of the search in the hypothesis space, if we wanted to backtrack, how could we do it without knowing what has happened to the theory (as in the Prolog case) and how it has evolved over time?
 What if we could know which example caused a certain refinement? In a Prolog setting this is not possible. With our solution, instead, we can adopt partial-memory techniques, discriminating the examples on the basis of the kind of modifications they have caused to the theory during the learning process.
 To better understand the learning process we can maintain statistics about the learning task, such as the number of refinements caused by an example, the number of refinements of a clause, refinement times, and other valuable information that a Prolog setting cannot provide.
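For instance (all names below are hypothetical), each refinement could be recorded as a ground fact next to the stored clauses, so the learner can later ask how often an example or a clause has triggered a change:

  % refinement(ClauseKey, Version, ExampleKey, Kind, Time): clause ClauseKey was
  % refined into Version because of ExampleKey; Kind is generalization or specialization.

  % Number of refinements caused by a given example.
  refinements_caused_by(ExampleKey, N) :-
      findall(C, refinement(C, _, ExampleKey, _, _), Cs),
      length(Cs, N).

  % Number of times a given clause has been refined.
  refinement_count(ClauseKey, N) :-
      findall(V, refinement(ClauseKey, V, _, _, _), Vs),
      length(Vs, N).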
Relational schema: Link it
Experiments 1
Document understanding domain: learn theories to classify documents and to recognize their logical components, such as abstract, title, page number, affiliation.
Experiments between the two scalable versions of INTHELEX that were developed.
Experiments 2
Same document understanding domain as in Experiments 1.
Experiments between the initial version and the final version of INTHELEX.
Results interpretation
 External storage of terms and fast access through function-call techniques resulted in significant savings in processing time.
 Linking clauses and examples significantly improved the efficiency of the learning process.
Future work
 Investigate the use of the information preserved in the database, such as the versions of the theory, for backtracking strategies.
 Investigate the statistics about the examples for implementing partial-memory learning methods.
Thank you for your attention
Marenglen Biba, PhD Student
[email protected]