RILA: Pattern Finder for Relational Data

Supervised Rule Induction
for Relational Data
Mahmut Uludağ
Supervisor: Prof. Dr. Mehmet R. Tolun
Ph.D. Jury Presentation
Eastern Mediterranean University
Computer Engineering Department
February 25, 2005
Outline
- Introduction
- ILA and ILA-2 algorithms
- Overview of the RILA system
- Query generation
- Optimistic estimate pruning
- Rule selection strategies
- Experiments and results
- Conclusion

Motivation for relational data mining

- Traditional work in statistics and knowledge discovery assumes data instances form a single table
- It is not always practical to represent complex objects in one single table
- RDBMSs are widely used
  - Efficient management of data
  - Indexing and query services, transaction and security support
  - Can store complex data
  - Data mining without transferring data to a new location

[Diagram: example relational schema with Author, Paper, and Cites tables]
Previous work – ILP-based algorithms

- Prolog is the main language used to represent objects and the relations between the objects
- Incremental learning, incorporation of background knowledge
- Initial research: deterministic rules
- Recent research: statistical learning
- Main obstacle to widespread acceptance: dependency on a Prolog server
- DMax: a modern ILP-based data mining system
  - Client-server architecture: Java client, ILProlog server
  - Example output rule: [figure omitted] (Source: www.pharmadm.com)
Previous work – relational data mining framework (Knobbe et al., 1999)

- Client-server architecture
- Selection graphs
- Algorithm to translate selection graphs into SQL
- MRDTL and MRDTL-2 algorithms, Iowa State University
- M.Sc. study at METU, Serkan Toprak, 2004

[Diagram: example selection graph with parent, child, and toy nodes; condition age > 30]
Previous work – graph mining

- Typical inputs are labelled graphs
- Efficient tools for describing objects and the way they are connected
- Subgraph isomorphism
- Scalability problems
  - Avoid loading complete graph data into main memory; partitioning
- Nearly equivalent formalisms: graphs ≈ database tables ≈ Prolog statements
ILA

- Levelwise search: construct hypotheses in order of increasing number of conditions (at first, building hypotheses with one condition, then building hypotheses with two conditions, and so on)
- Finds the smallest completely accurate rule set that represents the training data
ILA-2

- Noise-tolerant evaluation function:
  score(hypothesis) = tp - pf * fn
  - tp is the number of true positive examples
  - fn is the number of false negative examples
  - pf stands for penalty factor, a user-defined minimum for the proportion of tp to fn
  - Not sensitive to the distribution of false values
- Multiple rule selection after a learning loop → redundant rules
- Implemented by modifying the source code of the C4.5 algorithm; some features inherited from C4.5
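The ILA-2 evaluation function above is simple enough to state as code. A minimal sketch (the function and example counts here are illustrative, not taken from the RILA implementation):

```python
# ILA-2 noise-tolerant evaluation: score = tp - pf * fn,
# where pf penalizes the false negatives a hypothesis covers.
def score(tp: int, fn: int, pf: float) -> float:
    """tp: true positives, fn: false negatives, pf: penalty factor."""
    return tp - pf * fn

# A hypothesis covering 10 positives and 2 negatives, with pf = 2:
print(score(10, 2, 2))  # 6
```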
RILA

What is new compared to ILA and ILA-2:
- Architecture
- Performance
- Internal representation of rules
- Construction of hypotheses
RILA – what is new

- Select-late rule selection strategy, as an alternative to the select-early strategy
- An efficient implementation: hypotheses can be refined by adding new conditions; they do not need to be generated from scratch in each learning loop
- Optimistic estimate pruning (beam search)
- Normalized hypothesis evaluation function
Architecture of the system

[Diagram: the discovery system (traversing the relational schema, hypothesis construction, conversion to SQL, rule selection, pruning) sends SQL and metadata queries to the DBMS through a JDBC driver and receives result sets and metadata; its outputs are hypotheses and rules]
How are tables visited?

[Diagram: target table Gene (Geneid, Essential, Chromosome, Localization); junction table Interaction (Geneid1, Geneid2, Type, Expression corr); Composition (Geneid, Phenotype, Class, Motif, Function, Complex)]

- First level – stops at the junction table
- Extension levels – extend complex hypotheses only by using attributes from the other side of the junction relation

Example hypotheses that can be generated:
- If a gene has a relation r then its class is c
- If a gene has a property p and relation r then its class is c
- If a gene has a relation r to a gene having property p then its class is c
Internal representation of an example rule

IF gene1.Composition.Class = 'Nucleases' AND
   Interaction.Type = 'Genetic' AND
   gene2.Composition.Complex = 'Intracellular transport'
THEN gene1.Localization = 'extracellular…'

[Diagram: the rule as a graph of Gene, Composition, and Interaction nodes connected by join conditions such as composition1.id = gene1.id, interaction.id1 = gene1.id, interaction.id2 = gene2.id, and composition2.id = gene2.id]
Query generation

- SQL template for building size-one hypotheses
- Numeric attributes
- Refinement of hypotheses
- How is a hypothesis represented in SQL?
- How is a hypothesis extended by a condition from the other side of a many-to-many relation?
SQL template for building size-one hypotheses

select attr, count(distinct targetTable.pk)
from covered, path.getTable_list()
where path.getJoins() and
      targetTable.classAttr = currentClass and
      covered.id = targetTable.pk and
      covered.mark = 0
group by attr
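To make the template concrete, here is a hypothetical sketch of how such a query string could be assembled. The parameter names (`table_list`, `joins`, `class_attr`, and the sample schema values) are placeholders standing in for what RILA derives from the schema path; they are not the system's actual API:

```python
# Fill the size-one hypothesis template: count distinct target-table keys
# per candidate attribute value, restricted to uncovered examples.
def size_one_sql(attr, target_table, pk, class_attr, current_class,
                 table_list, joins):
    where = " and ".join(
        joins + [
            f"{target_table}.{class_attr} = '{current_class}'",
            f"covered.id = {target_table}.{pk}",
            "covered.mark = 0",
        ]
    )
    return (f"select {attr}, count(distinct {target_table}.{pk}) "
            f"from covered, {', '.join(table_list)} "
            f"where {where} "
            f"group by {attr}")

sql = size_one_sql("COMPOSITION.CLASS", "GENE", "GENEID", "LOCALIZATION",
                   "ER", ["GENE", "COMPOSITION"],
                   ["COMPOSITION.GENEID = GENE.GENEID"])
print(sql)
```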
Numeric attributes

- Discretization results are stored in a temporary table
- Columns: table_name, attribute_name, interval_name, min_value, max_value
- SQL fragment:
  disc.table_name = 'table' and
  disc.attribute_name = 'attr' and
  attr > disc.min_val and
  attr < disc.max_val
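As an illustration of populating that temporary table, here is a hedged sketch using simple equal-width binning; the actual discretization method RILA applies may differ, and the table/attribute names are only examples:

```python
# Produce rows shaped like the temporary discretization table:
# (table_name, attribute_name, interval_name, min_value, max_value).
def discretize(table, attr, values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [
        (table, attr, f"bin{i}", lo + i * width, lo + (i + 1) * width)
        for i in range(bins)
    ]

rows = discretize("INTERACTION", "EXPR", [0.026, 0.2, 0.513], 2)
for r in rows:
    print(r)
```

Each row can then be matched against a numeric column with the SQL fragment shown above.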
Refinement of hypotheses

select attr, count(distinct targetTable.pk)
from covered, table_list, hypothesis.table_list()
where targetAttr = currentClass and
      join_list and
      hypothesis.join_list() and
      covered.id = targetTable.pk and
      covered.mark = 0
group by attr
How is a hypothesis extended by a condition from the other side of a many-to-many relation?

select GENE_B.CHROMOSOME, count(distinct GENE.GENEID)
from COMPOSITION, GENE, GENE GENE_B, INTERACTION
where INTERACTION.GENEID2 = GENE_B.GENEID and
      INTERACTION.GENEID1 = GENE.GENEID and
      INTERACTION.EXPR > 0.026 and
      INTERACTION.EXPR < 0.513 and
      COMPOSITION.PHENOTYPE = 'Auxotrophies' and
      COMPOSITION.GENEID = GENE.GENEID and
      GENE.LOCALIZATION = 'ER'
group by GENE_B.CHROMOSOME
Optimistic estimate pruning

- Avoid working on hypotheses that are unlikely to result in satisfactory rules
- F-measure criterion to assess hypotheses:
  2 * recall * precision / (recall + precision)
- Two types of pruning:
  - Extend only the n best hypotheses (beam search)
  - Minimum required F value for a hypothesis to take a place in the hypothesis pool (similar to minimum support pruning)
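The pool-admission check above can be sketched directly from the F-measure formula; the recall/precision values in the example are illustrative:

```python
# F-measure of a hypothesis; a hypothesis enters the pool only if this
# value reaches the user-set minimum f.
def f_measure(recall: float, precision: float) -> float:
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def keep(recall: float, precision: float, f_min: float) -> bool:
    return f_measure(recall, precision) >= f_min

print(keep(0.5, 0.5, 0.01))  # True
print(keep(0.0, 0.0, 0.01))  # False
```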
Rule selection strategies

- Select-early strategy
- Why do we need another strategy?
- Select-late strategy
Learning algorithm when using the select early strategy

[Flowchart: start with size = 1; if size is 1, build initial hypotheses, otherwise extend current hypotheses; select p rule(s); if any rules were selected, mark the covered objects and, unless all examples are covered, repeat; if no rules were selected and size is smaller than m, increment size and repeat; otherwise end]
Example training data to demonstrate the case where the select late strategy performs better than the select early strategy

Attribute A | Attribute B | Attribute C | Class
a1          | b1          | c1          | A
a1          | b1          | c2          | A
a2          | b2          | c3          | A
a3          | b2          | c3          | A
a4          | b1          | c3          | B
a5          | b1          | c3          | B
a1          | b2          | c4          | B
a1          | b2          | c5          | B
Learning algorithm when using the select late strategy

[Flowchart: start; build the initial hypothesis set; while size < max size, increment size and extend the hypothesis set; then select rules and end]
Rule selection algorithm when using the select late strategy

[Flowchart: start; select the hypothesis with the highest score; if the score is positive:
- Mark the examples covered by this hypothesis
- If no positive examples are covered, return
- Recalculate the score using the effective cover
- If the new score is higher than the score of the next hypothesis, or the score of this hypothesis was previously reduced more than l times, assert the hypothesis as a new rule; otherwise undo the markings and set the score to the newly calculated value
Repeat until all examples are covered or no remaining hypothesis has a positive score]
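The select-late loop above can be sketched in simplified form. This is not RILA's implementation: the re-scoring here is just the size of the effective cover, the recursion limit l is omitted, and hypotheses are plain (name, covered-example-ids, score) tuples:

```python
import heapq

# Greedy select-late sketch: hypotheses are scored up front, then the best
# one is re-scored against the examples it still covers (its "effective
# cover") before being asserted as a rule.
def select_late(hypotheses, examples):
    # max-heap via negated scores
    pool = [(-s, name, cover) for name, cover, s in hypotheses]
    heapq.heapify(pool)
    covered, rules = set(), []
    while pool and covered != examples:
        neg, name, cover = heapq.heappop(pool)
        effective = cover - covered        # examples not covered yet
        if not effective:
            continue                       # no positive examples left
        new_score = len(effective)         # stand-in re-scoring
        next_best = -pool[0][0] if pool else float("-inf")
        if new_score >= next_best:
            rules.append(name)             # assert as a new rule
            covered |= effective
        else:
            # score dropped below the next hypothesis: push back re-scored
            heapq.heappush(pool, (-new_score, name, cover))
    return rules

rules = select_late(
    [("r1", {1, 2, 3}, 3), ("r2", {3, 4}, 2)], {1, 2, 3, 4})
print(rules)  # ['r1', 'r2']
```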
Experiments

- Summary of the parameters
- The genes data set
- The mutagenesis data set
Summary of the parameters

- Parameters applicable to both the select-late and the select-early strategies:
  - pf is a user-defined minimum for the proportion of true positives to false negatives
  - m is the maximum size for hypotheses
- Parameter applicable only to the select-early strategy:
  - p is the maximum number of hypotheses that can be selected as new rules after a search iteration
- Parameter applicable only to the select-late strategy:
  - l is the limit on rule selection recursion
- Optimistic estimate pruning parameters:
  - f is the minimum acceptable F-measure value
  - n is the maximum number of hypotheses that can be extended at each level during the candidate rule generation phase of the mining process
The ‘genes’ dataset of KDD Cup 2001

[Schema: GENE (GENEID, Essential, Chromosome, Localization), 862 rows; COMPOSITION (GENEID, Class, Complex, Phenotype, Motif, Function), 4346 rows; INTERACTION (GENEID1, GENEID2, Type, Expression corr), 910 rows – a junction table encoding a many-to-many relation between genes]
Test results for the localization attribute using the select early strategy, pf=2, m=3

                          | f=0.0   | f=0.001 | f=0.001 | f=0.01 | f=0.01
                          | n=10000 | n=500   | n=500   | n=1    | n=1
                          | p=1     | p=1     | p=5     | p=1    | p=5
training time (seconds)   | 117     | 61      | 67      | 67     | 37
number of rules           | 90      | 90      | 161     | 78     | 122
number of conditions      | 133     | 133     | 233     | 103    | 144
training set coverage (%) | 60.56   | 60.56   | 70.19   | 59.16  | 65.43
training set accuracy (%) | 95      | 95      | 96      | 95     | 95
test set accuracy (%)     | 83.19   | 83.19   | 83.61   | 84.37  | 85.78
test set coverage (%)     | 59.32   | 59.32   | 62.47   | 58.79  | 60.89
Test results for the localization attribute using the select late strategy (pf=2, m=3, l=0, f=0.01)

n                         | 1     | 2     | 3     | 4     | 5
training time (seconds)   | 37    | 53    | 69    | 72    | 73
number of rules           | 126   | 140   | 147   | 150   | 157
number of conditions      | 152   | 188   | 206   | 213   | 229
training set coverage (%) | 67.17 | 70.30 | 71.46 | 72.16 | 72.97
training set accuracy (%) | 94    | 94    | 93    | 93    | 93
test set accuracy (%)     | 82.85 | 81.48 | 81.63 | 81.53 | 81.12
test set coverage (%)     | 62.73 | 63.78 | 64.30 | 65.35 | 65.35
Test results for the localization attribute using the select late strategy (pf=2, m=3, l=100, f=0.01)

n                         | 1     | 2     | 3     | 4     | 5
training time (seconds)   | 65    | 69    | 69    | 89    | 102
number of rules           | 126   | 139   | 144   | 147   | 153
number of conditions      | 154   | 188   | 200   | 205   | 216
training set coverage (%) | 65.55 | 68.68 | 69.37 | 69.84 | 70.65
training set accuracy (%) | 96    | 96    | 96    | 96    | 96
test set accuracy (%)     | 84.96 | 83.91 | 83.91 | 83.62 | 83.27
test set coverage (%)     | 59.32 | 60.37 | 60.37 | 60.89 | 61.15
Why did we not get better results on the genes data set?

- Cup winner's accuracy: 72.1%
- MRDTL: 76.1% accuracy
- Serkan: 59.5% accuracy
- RILA best accuracy: 85.8% with 60.9% coverage
- RILA best coverage: 65.3% with 81.5% accuracy
- Missing values? No
- Default class selection? No
- Deteriorated performance when the number of class values is high
  - Distribution of false values among classes is not taken into account
  - Problem when the numbers of examples in different classes are not evenly distributed
Example data with unevenly distributed classes:

Attribute1 | Attribute2 | Class
1          | 5          | pink
1          | 5          | pink
1          | 5          | pink
1          | 5          | yellow
2          | 5          | yellow
2          | 6          | yellow
1          | 6          | blue
2          | 7          | blue
3          | 7          | blue
Schema of the mutagenesis database

[Schema: MOLECULE (Molecule_id, Log_mut, Logp, Lugmo, Ind1, Inda, Label); ATOM (ATOM_ID, Molecule_id, Element, Type, Charge); BOND (ATOM_ID1, ATOM_ID2, Type)]
Cross validation test results using the select early strategy on the mutagenesis data for different p* values

p              | 1     | 2     | 3     | 4     | 5     | 6     | 7
time (seconds) | 305   | 260   | 225   | 215   | 215   | 215   | 210
# rules        | 103   | 103   | 103   | 104   | 104   | 104   | 104
# conditions   | 120   | 120   | 120   | 122   | 121   | 121   | 121
accuracy (%)   | 98.82 | 97.06 | 97.06 | 97.06 | 97.06 | 97.06 | 97.06
coverage (%)   | 89.89 | 90.43 | 90.43 | 90.43 | 90.43 | 90.43 | 90.43

*maximum number of rules selected each time the rule selection step is executed
Cross validation test results using the select early strategy and OEP on the mutagenesis data for different n values

n              | 1     | 2     | 3     | 10    | 15    | 20    | 30    | 40
time (seconds) | 134   | 144   | 159   | 221   | 325   | 397   | 311   | 311
# rules        | 118   | 120   | 120   | 142   | 123   | 119   | 104   | 103
# conditions   | 165   | 171   | 171   | 210   | 166   | 161   | 123   | 120
accuracy (%)   | 98.26 | 98.26 | 98.26 | 98.83 | 98.25 | 98.25 | 98.82 | 98.82
coverage (%)   | 91.49 | 91.49 | 91.49 | 90.95 | 90.96 | 90.96 | 89.89 | 89.89
Cross validation test results using the select late strategy on the mutagenesis data (p=1, f=0.01, l=0)

n              | 1     | 2     | 3     | 10
time (seconds) | 76    | 123   | 169   | 486
# rules        | 122   | 137   | 145   | 151
# conditions   | 165   | 204   | 224   | 248
accuracy (%)   | 96.49 | 94.77 | 93.68 | 93.68
coverage (%)   | 90.96 | 91.49 | 92.55 | 92.55
Cross validation test results using the select late strategy on the mutagenesis data (p=1, f=0.01, l=10)

n              | 1     | 2     | 3     | 10
time (seconds) | 84    | 135   | 183   | 490
# rules        | 105   | 107   | 109   | 112
# conditions   | 131   | 137   | 145   | 158
accuracy (%)   | 98.26 | 98.26 | 98.24 | 98.25
coverage (%)   | 91.49 | 91.49 | 90.43 | 90.96
Cross validation test results using the select late strategy on the mutagenesis data (p=1, f=0.01, l=100)

n              | 1     | 2     | 3     | 10
time (seconds) | 80    | 130   | 178   | 483
# rules        | 103   | 104   | 106   | 109
# conditions   | 127   | 131   | 137   | 153
accuracy (%)   | 98.26 | 98.26 | 98.26 | 98.26
coverage (%)   | 91.49 | 91.49 | 91.49 | 91.49
Comparison to other results on mutagenesis data

- The best results by RILA (Table 2 and Table 5):
  - accuracy 98.26%
  - coverage 91.49%
- The best result reported in (Atramentov et al., 2003):
  - accuracy 87.5%
- The best result reported by the originators of the data set (King et al., 1996):
  - accuracy 89.4% (number of correct predictions divided by the number of predictions)
Conclusion

- A new relational rule learning algorithm has been developed with two different rule selection strategies
- Several techniques are used to achieve reasonable performance: refinement of hypotheses, pruning
- The results on the mutagenesis data are better than other results cited in the literature
- Compared to traditional algorithms, there is no need to move relational data to another location; scalability, practicality
- The techniques employed can be used to develop relational versions of other traditional learning algorithms

Thanks!
FOIL, a set-covering approach

- (Cameron-Jones and Quinlan, 1994)
- Begins with the most general theory
- Repeatedly adds a clause to the theory that covers some of the positive examples and few negative examples
- Covered examples are removed
- Continues until the theory covers all positive examples
Previous work – unsupervised algorithms

- WARMR [Dehaspe et al., 1998] finds relational association rules (query extensions)
  - Input: a Prolog database
  - A specification in the WARMODE language limits the format of possible query extensions
- SUBDUE [Cook and Holder, 1994] discovers substructures in a graph
  - Output: the substructure selected at each iteration as the best to compress the graph
- PRM [Getoor et al., 2002] reinterprets Bayesian networks in a relational setting
  - Captures the probabilistic dependence between the attributes of interrelated objects
  - Link analysis
- Models generated by some unsupervised learning algorithms can be used for prediction tasks: WARMR and PRM, but not SUBDUE
Relational rule induction

- A schema graph represents the structure of the data
  - tables = nodes
  - foreign keys = edges
- Multiple tables can represent several objects and relations between the objects
- Users should select the tables that represent the objects they are interested in
An example relational rule

IF Composition.Class = 'ATPases' AND
   Composition.Complex = 'Intracellular transport'
THEN Gene.Localization = 'extracellular…'

[Diagram: Gene (Geneid, Essential, Chromosome, Localization) linked to Composition (Geneid, Phenotype, Class, Motif, Function, Complex)]
Many-to-many relations

- Junction tables
  - Between different classes
  - Between objects of the same class
- Recursive queries are needed to extract data

[Diagram: a junction table between different classes, and a junction table between objects of the same class]
Example rule having a many-to-many relation

IF gene1.Composition.Class = 'Nucleases' AND
   Interaction.Type = 'Genetic' AND
   gene2.Composition.Complex = 'Intracellular transport'
THEN gene1.Localization = 'extracellular…'

[Schema: Gene (Geneid, Essential, Chromosome, Localization); Composition (Geneid, Phenotype, Class, Motif, Function, Complex); Interaction (Geneid1, Geneid2, Type, Expression corr)]
Performance

- Dynamic programming: refinement of hypotheses
- Pruning
  - Minimum support pruning
  - Optimistic estimate pruning
- Avoiding redundant hypotheses
- Smart data structures
Tabular representation of the links in the example rule

IF gene1.Composition.Class = 'Nucleases' AND
   Interaction.Type = 'Genetic' AND
   gene2.Composition.Complex = 'Intracellular transport'
THEN gene1.Localization = 'extracellular…'

Conditions:
  composition.class = nucleases
  interaction.type = Genetic
  composition.complex = intracellular transport

Paths:
  composition1.id = gene1.id
  interaction.id1 = gene1.id (gene1.id = interaction.geneid1)
  interaction.id2 = gene2.id
  composition2.id = gene2.id
Building size one hypotheses

Vector buildSizeOneHypotheses(String class, String tableName, Path path) {
    For each column in the selected table {
        If (column is not the target attribute, not a primary key, and not a foreign key) {
            Check whether the table is the target table
            Check whether the column is numeric
            Select the proper SQL template and generate SQL(path)
            result set = execute generated SQL
            hypotheses += hypotheses generated using the result set
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            hypotheses += buildSizeOneHypotheses(class, linked table name, updated path)
    }
    return hypotheses
}
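The traversal logic of the pseudocode above can be rendered as a short runnable sketch. The schema and column maps here are a toy stand-in for the metadata RILA reads from the DBMS (target and key columns are assumed to be already excluded), not the system's actual data structures:

```python
# Toy adjacency map of foreign-key links and candidate columns.
SCHEMA = {
    "GENE": ["COMPOSITION", "INTERACTION"],
    "COMPOSITION": ["GENE"],
    "INTERACTION": ["GENE"],
}
COLUMNS = {
    "GENE": ["ESSENTIAL", "CHROMOSOME"],
    "COMPOSITION": ["CLASS", "COMPLEX"],
    "INTERACTION": ["TYPE"],
}

# Visit the current table's columns, then recurse through foreign-key
# links, skipping tables already on the current path (cycle check).
def build_size_one(table, path=()):
    hypotheses = [(table, col) for col in COLUMNS[table]]
    for linked in SCHEMA[table]:
        if linked not in path:
            hypotheses += build_size_one(linked, path + (table,))
    return hypotheses

print(build_size_one("GENE"))
```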
Refinement of hypotheses

Vector extendHypotheses(String class, Vector currentHypotheses, String tableName, Path path) {
    For each column in the current table {
        If (column is not the target attribute, not a primary key, and not a foreign key) {
            For each hypothesis in the current hypothesis set {
                If the hypothesis does not include the current feature {
                    Check whether the table is the target table
                    Check whether the column is numeric
                    Select the proper SQL template and generate SQL(path, hypothesis)
                    result set = execute generated SQL
                    extended hypotheses += hypotheses generated using the result set
                }
            }
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            extended hypotheses += extendHypotheses(class, currentHypotheses, linkedTableName, updated path)
    }
    return extended hypotheses
}
Relational rule induction

- Basic components are the same as in propositional rule induction algorithms:
  - Hypothesis construction
  - Rule selection
- Interpretation of the relational schema:
  - Traversal of the schema
  - Detection and handling of cycles
- Communication with the RDBMS:
  - Expressing internal hypotheses in SQL
  - Understanding results returned from the RDBMS
Relational rule induction

- Target table: contains the primary key for the main objects being analyzed

[Diagrams: the Author/Paper/Cites schema, and the GENE/COMPOSITION/INTERACTION schema with GENE (GENEID, Essential, Chromosome, Localization), COMPOSITION (GENEID, Class, Complex, Phenotype, Motif, Function), and INTERACTION (GENEID1, GENEID2, Type, Expression)]
How is a hypothesis represented in SQL?

Rule:
INTERACTION.EXPR = (0.026658-0.513329] AND
COMPOSITION.PHENOTYPE = Auxotrophies

SQL:
INTERACTION.GENEID1 = GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR <= 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID = GENE.GENEID and
GENE.LOCALIZATION = 'ER'
Many-to-many relations

- Search algorithm
- Recognition of junction tables
- Extension of hypotheses having a feature from a junction table
- Automatic conversion of hypotheses to SQL using table aliases
ILA evaluation function

[Figure: illustration of how the ILA evaluation function's performance decreases when the numbers of examples per class are unevenly distributed; attributes A and B against class C]