RILA: Pattern Finder for Relational Data
Supervised Rule Induction
for Relational Data
Mahmut Uludağ
Supervisor: Prof. Dr. Mehmet R. Tolun
Ph.D. Jury Presentation
Eastern Mediterranean University
Computer Engineering Department
February 25, 2005
Outline
Introduction
ILA and ILA-2 algorithms
Overview of the RILA system
Query generation
Optimistic estimate pruning
Rule selection strategies
Experiments and results
Conclusion
Motivation for relational data mining
Traditional work in statistics and knowledge discovery assumes data instances form a single table
Not always practical to represent complex objects in a single table
RDBMS are widely used
Efficient management of data
Indexing and query services, transaction and security support
Can store complex data
Data mining without transferring data to a new location
(Slide figure: an example multi-table schema with Author, Paper, and Cites tables)
Previous work – ILP based algorithms
Prolog is the main language to represent objects and relations between the objects
Incremental learning, incorporation of background knowledge
Initial research: deterministic rules
Recent research: statistical learning
Main obstacle to widespread acceptance: dependency on a Prolog server
DMax: a modern ILP based data mining system
Client-server architecture: Java client, ILProlog server
(Example output rule shown on the slide; source: www.pharmadm.com)
Previous work – relational data mining framework (Knobbe et al., 1999)
Client-server architecture
Selection graphs
Algorithm to translate selection graphs into SQL
MRDTL and MRDTL-2 algorithms, Iowa State University
M.Sc. study at METU, Serkan Toprak, 2004
(Slide figure: an example selection graph with parent, child, and toy nodes and the condition age>30)
Previous work – graph mining
Typical inputs are labelled graphs
Efficient tools for describing objects and the way they are connected
Subgraph isomorphism
Scalability problems
Avoid loading complete graph data into the main memory; partitioning
Nearly equivalent formalisms:
Graphs ≈ Database tables ≈ Prolog statements
ILA
Levelwise search: constructs hypotheses in order of increasing number of conditions (first hypotheses with one condition, then hypotheses with two conditions, and so on)
Finds the smallest, completely accurate rule set that represents the training data
ILA-2
Noise-tolerant evaluation function
score(hypothesis) = tp - pf * fn
tp is the number of true positive examples
fn is the number of false negative examples
pf stands for penalty factor, a user-defined minimum for the proportion
of tp to fn
not sensitive to the distribution of false values
Multiple rule selection after a learning loop (can produce redundant rules)
Implemented by modifying the source code of the C4.5
algorithm; some features inherited from C4.5
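The evaluation function above can be written out in a couple of lines. This is a minimal sketch; the function name and the example numbers are illustrative, not taken from the ILA-2 source.

```python
# Sketch of the ILA-2 evaluation function described above; the function
# name is illustrative, not ILA-2's actual code.

def ila2_score(tp: int, fn: int, pf: float) -> float:
    """Reward true positives, penalize false negatives by the penalty factor."""
    return tp - pf * fn

# A hypothesis covering 10 true positives and 2 false negatives, with pf = 2:
print(ila2_score(10, 2, 2.0))  # 6.0
```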
RILA
What is new when compared to ILA and ILA-2
Architecture
Performance
Internal representation of rules
Construction of hypotheses
RILA – what is new
Select late rule selection strategy, as an alternative to the select early strategy
An efficient implementation: hypotheses can be refined by adding new conditions; they do not need to be generated from scratch in each learning loop
Optimistic estimate pruning (beam search)
Normalized hypothesis evaluation function
Architecture of the system
(Slide figure: the discovery system exchanges SQL and metadata queries with the DBMS through a JDBC driver, receiving result sets and metadata; hypotheses go in, rules come out)
Discovery system:
•Traversing relational schema
•Hypothesis construction
•Conversions to SQL
•Rule selection
•Pruning
How are tables visited?
(Slide figure: target table Gene (Geneid, Essential, Chromosome, Localization); Composition (Geneid, Phenotype, Class, Motif, Function, Complex); junction table Interaction (Geneid1, Geneid2, Type, Expression corr))
First level – stops in the junction table
Extension levels – extend complex hypotheses only by using attributes from the other side of the junction relation
Example hypotheses that can be generated:
-If a gene has a relation r then its class is c
-If a gene has a property p and relation r then its class is c
-If a gene has a relation r to a gene having property p then its class is c
Internal representation of an example rule
(Slide figure: the rule is stored as a graph of Gene, Composition, and Interaction nodes linked by join conditions)
Conditions: Class=‘Nucleases’, type=‘Genetic’, Complex=‘Intracellular transport’, Localization=‘extracellular…’
Joins: composition1.id=gene1.id, interaction.id1=gene1.id, interaction.id2=gene2.id, composition2.id=gene2.id
IF gene1.Composition.Class = ‘Nucleases’ AND
Interaction.Type = ‘Genetic’ AND
gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Query generation
SQL template for building size one hypotheses
Numeric attributes
Refinement of hypotheses
How is a hypothesis represented in SQL?
How is a hypothesis extended by a condition from the other side of a many-to-many relation?
SQL template for building size one
hypotheses
Select attr, count(distinct targetTable.pk)
from covered, path.getTable_list()
where path.getJoins() and
targetTable.classAttr = currentClass and
covered.id = targetTable.pk and
covered.mark=0
group by attr
Numeric attributes
Discretization results are stored in a
temporary table
Columns:
table_name, attribute_name,
interval_name, min_value, max_value
SQL:
disc.table_name = ‘table’ and
disc.attribute_name = ‘attr’ and
attr > disc.min_val and
attr < disc.max_val
Refinement of hypotheses
Select attr, count(distinct targetTable.pk)
from covered, table_list, hypothesis.table_list()
where targetAttr = currentClass and
join_list and
hypothesis.join_list() and
covered.id = targetTable.pk and
covered.mark=0
group by attr;
How is a hypothesis extended by a condition from the other side of a many-to-many relation?
Select GENE_B.CHROMOSOME, count (distinct
GENE.GENEID) from COMPOSITION, GENE, GENE
GENE_B, INTERACTION where
INTERACTION.GENEID2=GENE_B.GENEID and
INTERACTION.GENEID1=GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR < 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID=GENE.GENEID and
GENE.LOCALIZATION = 'ER'
group by GENE_B.CHROMOSOME
Optimistic estimate pruning
Avoid working on hypotheses which are unlikely to result
in satisfactory rules
F-measure criterion to assess hypotheses
2 * recall * precision / ( recall + precision )
Two types of pruning
Extend only n best hypotheses (beam search)
Minimum required F value for a hypothesis to be kept in the hypothesis pool (similar to minimum support pruning)
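The two pruning steps above can be combined in a short sketch. The hypothesis representation (simple name/F-value pairs) and the helper names are illustrative, not RILA's actual data structures.

```python
# Sketch of the two pruning types described above: drop hypotheses below a
# minimum F-measure, then keep only the n best (beam search).

def f_measure(recall: float, precision: float) -> float:
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def prune(hypotheses, n, f_min):
    kept = [(h, f) for h, f in hypotheses if f >= f_min]  # minimum-F pruning
    kept.sort(key=lambda hf: hf[1], reverse=True)         # best first
    return kept[:n]                                       # beam of width n

pool = [("h1", f_measure(0.9, 0.7)), ("h2", f_measure(0.01, 0.5)),
        ("h3", f_measure(0.5, 0.5))]
print(prune(pool, n=2, f_min=0.1))
```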
Rule selection strategies
Select early strategy
Why do we need another strategy?
Select late strategy
Learning algorithm when using the select early strategy
(Slide flowchart, as steps:)
1. size = 1
2. If size is 1, build initial hypotheses; otherwise extend current hypotheses
3. Select p rule(s)
4. If no rules were selected: if size is smaller than m, increment size and go to step 2; otherwise end
5. If rules were selected, mark covered objects
6. If all examples are covered, end; otherwise go to step 2
Example training data to demonstrate the case where the select late
strategy performs better than the select early strategy
Attribute A | Attribute B | Attribute C | Class
a1          | b1          | c1          | A
a1          | b1          | c2          | A
a2          | b2          | c3          | A
a3          | b2          | c3          | A
a4          | b1          | c3          | B
a5          | b1          | c3          | B
a1          | b2          | c4          | B
a1          | b2          | c5          | B
Learning algorithm when using the select late strategy
(Slide flowchart, as steps:)
1. Build initial hypothesis set
2. While size < max size: increment size and extend the hypothesis set
3. Select rules
4. End
Rule selection algorithm when using the select late strategy
(Slide flowchart, as steps:)
1. Select the hypothesis with the highest score; if the score is not positive, end
2. Mark the examples covered by this hypothesis
3. If no positive examples are covered, return
4. Recalculate the score using the effective cover
5. If the new score is higher than the score of the next hypothesis, or the score of this hypothesis was previously reduced more than l times, assert the hypothesis as a new rule; otherwise undo the markings and set the score to the newly calculated value
6. If all examples are covered, end; otherwise go to step 1
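The selection loop above can be approximated in code. This is a simplified sketch: hypotheses are plain (score, cover) pairs, the recalculated score is just the positive effective cover, and the bookkeeping for the recursion limit l is my own simplification rather than RILA's implementation.

```python
# Simplified sketch of the select-late rule selection loop described above.

def select_late(hypotheses, positives, l=0):
    rules, covered = [], set()
    reductions = [0] * len(hypotheses)
    pool = sorted(enumerate(hypotheses), key=lambda x: x[1][0], reverse=True)
    while pool and not positives <= covered:
        i, (score, cover) = pool[0]
        if score <= 0:                      # best score not positive: stop
            break
        effective = cover - covered         # examples not yet covered
        if not effective & positives:       # covers no new positives: drop
            pool.pop(0)
            continue
        new_score = len(effective & positives)  # score on the effective cover
        next_score = pool[1][1][0] if len(pool) > 1 else float("-inf")
        if new_score >= next_score or reductions[i] > l:
            rules.append(i)                 # assert hypothesis as a new rule
            covered |= effective
            pool.pop(0)
        else:                               # undo markings, retry with reduced score
            reductions[i] += 1
            pool[0] = (i, (new_score, cover))
            pool.sort(key=lambda x: x[1][0], reverse=True)
    return rules

# Two overlapping hypotheses over positive examples {1, 2, 3}:
print(select_late([(3, {1, 2, 3}), (2, {2, 3})], {1, 2, 3}))  # [0]
```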
Experiments
Summary of the parameters
The genes data set
The mutagenesis data set
Summary of the parameters
Parameters applicable both to the select late and to the select
early strategies
pf is a user-defined minimum for the proportion of the true
positives to the false negatives
m is the maximum size for hypotheses
Parameter applicable only for the select early strategy
p is the maximum number of hypotheses that can be selected
as new rules after a search iteration
Parameter applicable only for the select late strategy
l is the limit on rule selection recursion
Optimistic estimate pruning parameters
f is the minimum acceptable F-measure value
n is the maximum number of hypotheses that can be extended at each level during the candidate rule generation phase of the mining process
The ‘genes’ dataset of KDD Cup 2001
(Slide schema diagram:)
GENE (GENEID, Essential, Chromosome, Localization) – 862 rows
COMPOSITION (GENEID, Class, Complex, Phenotype, Motif, Function) – 4346 rows
INTERACTION (GENEID1, GENEID2, Type, Expression corr) – 910 rows
INTERACTION is a junction table: many-to-many relation between genes
Test results for the localization attribute using
the select early strategy, pf=2, m=3
                           | f=0.0   | f=0.001 | f=0.001 | f=0.01 | f=0.01
                           | n=10000 | n=500   | n=500   | n=1    | n=1
                           | p=1     | p=1     | p=5     | p=1    | p=5
training time (seconds)    | 117     | 61      | 67      | 67     | 37
number of rules            | 90      | 90      | 161     | 78     | 122
number of conditions       | 133     | 133     | 233     | 103    | 144
training set coverage (%)  | 60.56   | 60.56   | 70.19   | 59.16  | 65.43
training set accuracy (%)  | 95      | 95      | 96      | 95     | 95
test set accuracy (%)      | 83.19   | 83.19   | 83.61   | 84.37  | 85.78
test set coverage (%)      | 59.32   | 59.32   | 62.47   | 58.79  | 60.89
Test results for the localization attribute
using the select late strategy
n                          | 1     | 2     | 3     | 4     | 5
training time (seconds)    | 37    | 53    | 69    | 72    | 73
number of rules            | 126   | 140   | 147   | 150   | 157
number of conditions       | 152   | 188   | 206   | 213   | 229
training set coverage (%)  | 67.17 | 70.30 | 71.46 | 72.16 | 72.97
training set accuracy (%)  | 94    | 94    | 93    | 93    | 93
test set accuracy (%)      | 82.85 | 81.48 | 81.63 | 81.53 | 81.12
test set coverage (%)      | 62.73 | 63.78 | 64.30 | 65.35 | 65.35
pf=2, m=3, l=0, f=0.01
Test results for the localization attribute
using the select late strategy
n                          | 1     | 2     | 3     | 4     | 5
training time (seconds)    | 65    | 69    | 69    | 89    | 102
number of rules            | 126   | 139   | 144   | 147   | 153
number of conditions       | 154   | 188   | 200   | 205   | 216
training set coverage (%)  | 65.55 | 68.68 | 69.37 | 69.84 | 70.65
training set accuracy (%)  | 96    | 96    | 96    | 96    | 96
test set accuracy (%)      | 84.96 | 83.91 | 83.91 | 83.62 | 83.27
test set coverage (%)      | 59.32 | 60.37 | 60.37 | 60.89 | 61.15
pf=2, m=3, l=100, f=0.01
Why didn't we get better results on the genes data set?
Cup winner’s accuracy: 72.1%
MRDTL: 76.1% accuracy
Serkan Toprak: 59.5% accuracy
RILA best accuracy: 85.8%, with 60.9% coverage
RILA best coverage: 65.3%, with 81.5% accuracy
Missing values? No
Default class selection? No
Performance deteriorates when the number of class values is high
Distribution of false values among classes is not taken into account
This is a problem when the numbers of examples in different classes are not evenly distributed
Attribute1 | Attribute2 | Class
1          | 5          | pink
1          | 5          | pink
1          | 5          | pink
1          | 5          | yellow
2          | 5          | yellow
2          | 6          | yellow
1          | 6          | blue
2          | 7          | blue
3          | 7          | blue
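A quick computation on this toy table illustrates the point: the score tp - pf*fn counts how many non-class examples a hypothesis covers, but not how they are spread over the other classes. The data follows the table above; the pf value and function name are illustrative.

```python
# The toy table above, used to show that the score ignores the class
# distribution of the misclassified examples.

rows = [
    (1, 5, "pink"), (1, 5, "pink"), (1, 5, "pink"),
    (1, 5, "yellow"), (2, 5, "yellow"), (2, 6, "yellow"),
    (1, 6, "blue"), (2, 7, "blue"), (3, 7, "blue"),
]

def score(attr, value, target, pf=1.0):
    tp = sum(1 for r in rows if r[attr] == value and r[2] == target)
    fn = sum(1 for r in rows if r[attr] == value and r[2] != target)
    return tp - pf * fn

# "Attribute1 = 1 -> pink" misclassifies one yellow and one blue example;
# the score would be identical if both errors came from a single class.
print(score(0, 1, "pink"))  # 1.0
```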
Schema of the mutagenesis database
(Slide schema diagram:)
MOLECULE (Molecule_id, Log_mut, Logp, Lugmo, Ind1, Inda, Label)
ATOM (ATOM_ID, Molecule_id, Element, Type, Charge)
BOND (ATOM_ID1, ATOM_ID2, Type)
Cross validation test results using the select early strategy on the
mutagenesis data for different p* values
p                 | 1     | 2     | 3     | 4     | 5     | 6     | 7
time (seconds)    | 305   | 260   | 225   | 215   | 215   | 215   | 210
# rules           | 103   | 103   | 103   | 104   | 104   | 104   | 104
# conditions      | 120   | 120   | 120   | 122   | 121   | 121   | 121
accuracy (%)      | 98.82 | 97.06 | 97.06 | 97.06 | 97.06 | 97.06 | 97.06
coverage (%)      | 89.89 | 90.43 | 90.43 | 90.43 | 90.43 | 90.43 | 90.43
*maximum number of rules selected each time the rule selection step is executed
Cross validation test results using the select early strategy and OEP on
the mutagenesis data for different n values
n                 | 1     | 2     | 3     | 10    | 15    | 20    | 30    | 40
time (seconds)    | 134   | 144   | 159   | 221   | 325   | 397   | 311   | 311
# rules           | 118   | 120   | 120   | 142   | 123   | 119   | 104   | 103
# conditions      | 165   | 171   | 171   | 210   | 166   | 161   | 123   | 120
accuracy (%)      | 98.26 | 98.26 | 98.26 | 98.83 | 98.25 | 98.25 | 98.82 | 98.82
coverage (%)      | 91.49 | 91.49 | 91.49 | 90.95 | 90.96 | 90.96 | 89.89 | 89.89
Cross validation test results using the select late strategy on the
mutagenesis data
n                 | 1     | 2     | 3     | 10
time (seconds)    | 76    | 123   | 169   | 486
# rules           | 122   | 137   | 145   | 151
# conditions      | 165   | 204   | 224   | 248
accuracy (%)      | 96.49 | 94.77 | 93.68 | 93.68
coverage (%)      | 90.96 | 91.49 | 92.55 | 92.55
p=1, f=0.01, l=0
Cross validation test results using the select late strategy
on the mutagenesis data
n                 | 1     | 2     | 3     | 10
time (seconds)    | 84    | 135   | 183   | 490
# rules           | 105   | 107   | 109   | 112
# conditions      | 131   | 137   | 145   | 158
accuracy (%)      | 98.26 | 98.26 | 98.24 | 98.25
coverage (%)      | 91.49 | 91.49 | 90.43 | 90.96
p=1, f=0.01, l=10
Cross validation test results using the select late strategy on the
mutagenesis data
n                 | 1     | 2     | 3     | 10
time (seconds)    | 80    | 130   | 178   | 483
# rules           | 103   | 104   | 106   | 109
# conditions      | 127   | 131   | 137   | 153
accuracy (%)      | 98.26 | 98.26 | 98.26 | 98.26
coverage (%)      | 91.49 | 91.49 | 91.49 | 91.49
p=1, f=0.01, l=100
Comparison to others' results on the mutagenesis data
The best results by RILA (Table 2 and Table 5)
accuracy 98.26%
coverage 91.49%
The best result reported in (Atramentov et al., 2003)
accuracy 87.5%
The best result reported by the originators of the data set (King et al., 1996)
accuracy 89.4% (number of correct predictions divided by the number of predictions)
Conclusion
A new relational rule learning algorithm has been developed, with two different rule selection strategies
Several techniques are used to achieve reasonable performance: refinement of hypotheses, pruning
The results on the mutagenesis data are better than the other results cited in the literature
Compared to traditional algorithms, there is no need to move relational data to another location: scalability, practicality
The techniques employed can be used to develop relational versions of other traditional learning algorithms
Thanks!
FOIL, a set-covering approach
[Cameron-Jones and Quinlan, 1994]
Begins with the most general theory
Repeatedly adds a clause to the theory that covers some of the positive examples and few negative examples
Covered examples are removed
Continues until the theory covers all positive examples
Previous work – unsupervised
algorithms
WARMR [Dehaspe et al., 1998] finds relational association rules (query extensions)
Input – Prolog database
Specification in the WARMODE language, limits the format of possible query extensions
SUBDUE [Cook and Holder, 1994] discovers substructures in a graph
Output – the substructure selected at each iteration as the best to compress the graph
PRM [Getoor et al., 2002] reinterprets Bayesian networks in a relational setting
Captures the probabilistic dependence between the attributes of interrelated objects
Link analysis
Models generated by some unsupervised learning algorithms can be used for
prediction tasks; WARMR, PRM, not SUBDUE
Relational rule induction
Schema graph represents the structure of the data: tables = nodes, foreign keys = edges
Multiple tables can represent several objects and relations between the objects
Users should select tables that represent the objects they are interested in
An example relational rule
(Slide schema: Gene (Geneid, Essential, Chromosome, Localization); Composition (Geneid, Phenotype, Class, Motif, Function, Complex))
IF Composition.Class = ‘ATPases’ AND
Composition.Complex = ‘Intracellular transport’
THEN Gene.Localization = extracellular..
Many-to-many relations
Junction tables: between different classes, or between objects of the same class
Recursive queries are needed to extract data when the junction table relates objects of the same class
(Slide figure: one junction table between different classes, and one between objects of the same class)
Example rule having a many-to-many relation
(Slide schema: Gene (Geneid, Essential, Chromosome, Localization); Composition (Geneid, Phenotype, Class, Motif, Function, Complex); Interaction (Geneid1, Geneid2, Type, Expression corr))
IF gene1.Composition.Class = ‘Nucleases’ AND
Interaction.Type = ‘Genetic’ AND
gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Performance
Dynamic programming: refinement of hypotheses
Pruning: minimum support pruning, optimistic estimate pruning
Avoid redundant hypotheses
Smart data structures
Tabular representation of the links in the example rule
Conditions:
composition.class = nucleases
interaction.type = Genetic
composition.complex = intracellular transport
Paths:
composition1.id = gene1.id
interaction.id1 = gene1.id
composition2.id = gene2.id
interaction.id2 = gene2.id
gene1.id = interaction.geneid1
IF gene1.Composition.Class = ‘Nucleases’ AND
Interaction.Type = ‘Genetic’ AND
gene2.Composition.Complex = ‘Intracellular transport’
THEN gene1.Localization = extracellular…
Building size one hypotheses
Vector buildSizeOneHypotheses(String class, String tableName, Path path) {
    For each column in the selected table {
        If (column is not the target attribute and not a primary key and not a foreign key) {
            Check whether the table is the target table
            Check whether the column is numeric
            Select the proper SQL template and generate SQL(path)
            result set = execute generated SQL
            hypotheses += generated hypotheses using the result set
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            hypotheses += buildSizeOneHypotheses(class, linked table name, updated path)
    }
    return the hypotheses
}
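The pseudocode above boils down to a depth-first walk of the schema graph. A runnable sketch follows; the schema dict stands in for the system's JDBC metadata queries, and (table, column) pairs stand in for the hypotheses built from SQL result sets.

```python
# Runnable sketch of the recursive traversal in the pseudocode above.
# The schema and hypothesis representations are illustrative stand-ins.

SCHEMA = {  # table -> (candidate columns, tables linked by a foreign key)
    "GENE": (["ESSENTIAL", "CHROMOSOME"], ["COMPOSITION", "INTERACTION"]),
    "COMPOSITION": (["CLASS", "COMPLEX"], ["GENE"]),
    "INTERACTION": (["TYPE"], ["GENE"]),
}

def build_size_one_hypotheses(table, path=()):
    columns, links = SCHEMA[table]
    hypotheses = [(table, col) for col in columns]  # one hypothesis per column
    for linked in links:
        if linked not in path:  # do not revisit a table on the current path
            hypotheses += build_size_one_hypotheses(linked, path + (table,))
    return hypotheses

print(build_size_one_hypotheses("GENE"))
```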
Refinement of hypotheses
Vector extendHypotheses(String class, Vector currentHypotheses, String tableName, Path path) {
    For each column in the current table {
        If (column is not the target attribute and not a primary key and not a foreign key) {
            For each hypothesis in the current hypothesis set {
                If the hypothesis does not include the current feature {
                    Check whether the table is the target table
                    Check whether the column is numeric
                    Select the proper SQL template and generate SQL(path, hypothesis)
                    result set = execute generated SQL
                    extended hypotheses += generated hypotheses using the result set
                }
            }
        }
    }
    For each table linked by a foreign key relation {
        If the linked table was not visited before (check the path)
            extended hypotheses += extendHypotheses(class, currentHypotheses, linkedTableName, updated path)
    }
    return extended hypotheses;
}
Relational rule induction
Basic components are the same as in propositional rule induction
algorithms
Hypothesis construction
Rule selection
Interpretation of relational schema
Traversal of schema
Detection and handling of cycles
Communication with RDBMS
Expressing internal hypotheses in SQL
Understanding results returned from the RDBMS
Relational rule induction
Target table: its primary key identifies the main objects being analyzed
(Slide figures: example schemas – Author, Paper, Cites; and GENE (GENEID, Essential, Chromosome, Localization), COMPOSITION (GENEID, Class, Complex, Phenotype, Motif, Function), INTERACTION (GENEID1, GENEID2, Type, Expression))
How is a hypothesis represented in SQL?
Rule:
INTERACTION.EXPR = (0.026658-0.513329] AND
COMPOSITION.PHENOTYPE = Auxotrophies
SQL:
INTERACTION.GENEID1=GENE.GENEID and
INTERACTION.EXPR > 0.026 and
INTERACTION.EXPR <= 0.513 and
COMPOSITION.PHENOTYPE = 'Auxotrophies' and
COMPOSITION.GENEID=GENE.GENEID and
GENE.LOCALIZATION = 'ER'
Many-to-many relations
Search Algorithm
Recognition of junction tables
Extension of hypotheses having a
feature from a junction table
Automatic conversion of hypotheses to
SQL using table aliases
ILA evaluation function
(Slide figure: a table with columns Attribute A, Attribute B, and Class C, showing how the ILA evaluation function's performance decreases when the numbers of class examples are unevenly distributed)