Searching for Credible Relations in Machine Learning

Doctoral Dissertation
Vedrana Vidulin
Supervisor: prof. dr. Matjaž Gams
Co-supervisor: prof. dr. Bogdan Filipič
Ljubljana, 3 February 2012
Introduction
• Task: domain analysis of complex domains
• Problem:
– When DM methods construct models on complex domains, the
models often contain parts (relations) that are less credible from
the perspective of a human analyst.
– Less-credible parts can:
• Lead to wrong conclusions about the most important relations in the
domain
• Undermine the user’s trust in DM methods (Stumpf et al., 2009).
• Proposed solution: a new method that algorithmically
combines human understanding and raw computing power to
extract credible relations, i.e., relations that are supported by
data and meaningful to the human.
An Example
• A decision-tree model is constructed:
– With the J48 algorithm in Weka,
– From a data set that represents the impact of the R&D sector
on the economic welfare of a country
• 167 examples: Countries
• 37 attributes: R&D sector
• Class: Economic welfare

Country | GERD per capita (PPP$) | Researchers per million inhabitants (HC) | … | Sector investing the most in R&D | GNI per capita
Armenia | 7.6 | 1,660 | … | Government | low
Latvia | 37.1 | 2,455 | … | Government | middle
Japan | 813.7 | 6,227 | … | Business enterprise | high
… | … | … | … | … | …
An Example (2)
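GERD per capita (PPP$) <= 105.5
|   Sector employing the most researchers = Higher education: middle (49.0/20.92)
|   Sector employing the most researchers = N/A: middle (42.87/13.15)
|   Sector employing the most researchers = Government
|   |   GERD per capita (PPP$) <= 10.8: low (12.58/0.39)
|   |   GERD per capita (PPP$) > 10.8: middle (10.29/4.29)
|   Sector employing the most researchers = Business enterprise: middle (5.0)
GERD per capita (PPP$) > 105.5
|   Sector investing the most in R&D = N/A: middle (16.7/8.77)
|   Sector investing the most in R&D = Government: high (6.57/1.28)
|   Sector investing the most in R&D = Business enterprise: high (24.0/1.0)
|   Sector investing the most in R&D = Higher education: high (0.0)
|   Sector investing the most in R&D = Private non-profit: high (0.0)
|   Sector investing the most in R&D = Abroad: high (0.0)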
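Read as code, the tree on this slide amounts to a few nested conditionals. The sketch below is one reading of the tree, not output from Weka; the function name `predict_welfare` and its argument names are illustrative, and the sample inputs are hypothetical.

```python
def predict_welfare(gerd_per_capita, sector_employing, sector_investing):
    """Follow the branches of the example tree and return the predicted class.

    Function and argument names are illustrative assumptions.
    """
    if gerd_per_capita <= 105.5:
        if sector_employing == "Government":
            # Only this branch splits on GERD per capita a second time.
            return "low" if gerd_per_capita <= 10.8 else "middle"
        # Higher education, N/A and Business enterprise all lead to "middle".
        return "middle"
    if sector_investing == "N/A":
        return "middle"
    # Every remaining value of the investing sector leads to "high".
    return "high"


# Hypothetical inputs, chosen only to exercise three different leaves.
print(predict_welfare(7.6, "Government", "Government"))
print(predict_welfare(50.0, "Higher education", "N/A"))
print(predict_welfare(813.7, "Business enterprise", "Business enterprise"))
```

Written this way, it is easy to see that three of the "high" leaves cover no training examples (0.0), which is the kind of detail an analyst may find less credible.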
Outline
• Definition of credible relation
• Human-Machine Data Mining (HMDM) method
• Experimental evaluation
• Conclusions and contributions
Credible Relation
• Relation: a pattern that connects a set of attributes, which
describe the properties of a concept underlying the data,
with a class/target attribute that represents the concept.
• Credible relation: a relation that is both meaningful and of high quality:
– Meaning: a subjective criterion assigned by the human
based on common sense, informal knowledge about
the domain, and the observed frequency and stability of the relation.
– Quality: an objective criterion that indicates support by
the selected quality measures.
• Credible model: a model composed only of credible relations.
How to Establish Credible Relations?
Suppose a relation is composed of
attributes A1 and A2.
Re-examine the relation’s credibility by:
1) Removing attributes A1 and A2
from the data set
2) Adding attributes A1 and A2 to the empty attribute set (∅)
If the relation is supported by evidence, add it
to the list of candidates for credible relations.
The HMDM Algorithm
Repeat
    Create several models (e.g., trees)
    Choose the most interesting models
    For each interesting model
        Examine the credibility of relations in the model
        by adding and removing attributes from the data set
        Merge candidate relations with the output list of credible relations
Until no new interesting relations are found
The HMDM Algorithm (2)
HMDM (data set)
REPEAT
    Select DM method
    Select parameters and their ranges, define constraints
    Perform INITIAL_DM, creating a list of models LM
    FOR each interesting model M from LM, re-examine M:
        REPEAT
            Perform any of the following: {
                ADD_ATTRIBUTES
                REMOVE_ATTRIBUTES
                Expand credibility indicator }
            Evaluate the results with several quality measures and for meaning
        UNTIL no more interesting relations are found in the search space near the initial model
        Store credible relations and integrate conclusions
    END FOR
UNTIL no more new interesting relations are found anywhere in the data set
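The control flow of the pseudocode can be sketched in Python. This is only a skeleton under stated assumptions: the steps that HMDM leaves to the analyst (choosing interesting models, judging meaning) are passed in as plain callables, and the names `hmdm`, `build_models`, `is_interesting`, `reexamine` and `max_rounds` are hypothetical, not taken from the author's program.

```python
def hmdm(build_models, is_interesting, reexamine, max_rounds=10):
    """Skeleton of the HMDM outer loop (illustrative only).

    build_models()    -> list of models (stands in for INITIAL_DM)
    is_interesting(m) -> bool (stands in for the analyst's choice)
    reexamine(m)      -> relations judged credible after adding and
                         removing attributes (stands in for the inner loop)
    """
    credible = []
    for _ in range(max_rounds):                  # REPEAT
        found_new = False
        for model in build_models():
            if not is_interesting(model):
                continue
            for relation in reexamine(model):
                if relation not in credible:     # merge into the output list
                    credible.append(relation)
                    found_new = True
        if not found_new:                        # UNTIL nothing new is found
            break
    return credible


# Toy run: "models" are attribute tuples; multi-attribute models are "interesting".
models = [("A1", "A2"), ("A3",), ("A1", "A2")]
result = hmdm(
    build_models=lambda: models,
    is_interesting=lambda m: len(m) > 1,
    reexamine=lambda m: [" & ".join(m)],
)
print(result)
```

The `max_rounds` cap is a safeguard added for the sketch; the pseudocode itself stops only when no new interesting relations are found.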
HMDM: ADD_ATTRIBUTES
ATTRIBUTES
A1 | A2 | A3 | C
1 | 1 | 0 | 1
1 | 1 | 0 | 1
1 | 0 | 1 | 0
0 | 1 | 1 | 0
1 | 1 | 0 | 1
0 | 0 | 1 | 0
1 | 0 | 0 | 0
Model: J48 trees
Quality: Accuracy (%)
NO ATTRIBUTES
    A1 | 71.43
        A2 | 100
    A2 | 85.71
        …
Candidates for credible relations
A1 & A2: combination
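The accuracies on this slide can be checked with a short sketch that uses the seven toy examples shown here. It does not run J48: as a stand-in (an assumption of this sketch), examples are grouped by their values on the selected attributes and each group predicts its majority class, scored on the training data. The name `resubstitution_accuracy` is illustrative.

```python
from collections import Counter, defaultdict

# Toy data from the slide: attributes A1, A2, A3 and class C (seven examples).
DATA = {
    "A1": [1, 1, 1, 0, 1, 0, 1],
    "A2": [1, 1, 0, 1, 1, 0, 0],
    "A3": [0, 0, 1, 1, 0, 1, 0],
    "C":  [1, 1, 0, 0, 1, 0, 0],
}


def resubstitution_accuracy(data, attributes, target="C"):
    """Training-set accuracy (%) of a majority-vote lookup on `attributes`.

    Stand-in for a decision tree (assumption, not J48).
    """
    groups = defaultdict(list)
    for i, label in enumerate(data[target]):
        key = tuple(data[a][i] for a in attributes)
        groups[key].append(label)
    # Each group contributes the count of its most common class.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
    return round(100.0 * correct / len(data[target]), 2)


# ADD_ATTRIBUTES: start from the empty attribute set and add attributes.
for subset in (["A1"], ["A2"], ["A1", "A2"]):
    print(subset, resubstitution_accuracy(DATA, subset))
```

Neither attribute alone classifies all examples correctly, but together they do (here C equals A1 AND A2), which is why the pair is listed as a combination.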
HMDM: REMOVE_ATTRIBUTES
ATTRIBUTES
A1 | A2 | A3 | C
1 | 0 | 1 | 1
0 | 1 | 0 | 0
0 | 1 | 0 | 0
1 | 0 | 1 | 1
1 | 0 | 1 | 1
1 | 1 | 1 | 1
1 | 1 | 1 | 1
Model: J48 trees
Quality: Accuracy (%)
ALL ATTRIBUTES | 100
    A3 | 100
        A1 | 71.43
    …
Candidates for credible relations
A1 || A3: redundancy
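The removal step can be checked the same way on the seven toy examples from this slide. As before, a majority-vote lookup scored on the training data stands in for J48 (an assumption of this sketch), and the name `accuracy_after_removal` is illustrative.

```python
from collections import Counter, defaultdict

# Toy data from the slide: attributes A1, A2, A3 and class C (seven examples).
DATA = {
    "A1": [1, 0, 0, 1, 1, 1, 1],
    "A2": [0, 1, 1, 0, 0, 1, 1],
    "A3": [1, 0, 0, 1, 1, 1, 1],
    "C":  [1, 0, 0, 1, 1, 1, 1],
}


def accuracy_after_removal(data, removed, target="C"):
    """Training-set accuracy (%) using every attribute except those in `removed`.

    Majority-vote lookup as a stand-in for a decision tree (assumption).
    """
    kept = [a for a in data if a != target and a not in removed]
    groups = defaultdict(list)
    for i, label in enumerate(data[target]):
        groups[tuple(data[a][i] for a in kept)].append(label)
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
    return round(100.0 * correct / len(data[target]), 2)


# REMOVE_ATTRIBUTES: start from all attributes and remove them step by step.
for removed in ([], ["A3"], ["A1"], ["A3", "A1"]):
    print(removed, accuracy_after_removal(DATA, removed))
```

Removing A1 or A3 alone leaves the accuracy at 100, while removing both drops it to 71.43; the two attributes carry the same information, which is what the slide marks as redundancy.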
Type-Credibility Scheme
• Three levels of credibility:
1. Frequent and stable relations
• Often appear in models
• When added improve quality
• When removed reduce quality
2. Frequent and less-stable relations
• Often appear in models
• When added sometimes improve quality and sometimes not
• When removed sometimes reduce quality and sometimes not
3. Not supported by evidence
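One possible way to encode the three levels is sketched below. The slide gives no numeric thresholds, so everything quantitative here is an assumption: `model_share` (fraction of models containing the relation), the `min_share` cut-off for "often", and the rule that an infrequent relation falls into level 3. The function name `credibility_level` is also hypothetical.

```python
def credibility_level(model_share, add_deltas, remove_deltas, min_share=0.5):
    """Map evidence about a relation to level 1, 2 or 3 (illustrative rule).

    add_deltas:    quality changes observed when the relation's attributes are added
    remove_deltas: quality changes observed when they are removed
    """
    frequent = model_share >= min_share            # "often appear in models" (assumed cut-off)
    adds_help = [d > 0 for d in add_deltas]        # adding improved quality
    removals_hurt = [d < 0 for d in remove_deltas] # removing reduced quality
    always = bool(adds_help) and bool(removals_hurt) and all(adds_help) and all(removals_hurt)
    sometimes = any(adds_help) or any(removals_hurt)
    if frequent and always:
        return 1   # frequent and stable
    if frequent and sometimes:
        return 2   # frequent and less stable
    return 3       # not supported by evidence


print(credibility_level(0.8, [2.0, 1.5], [-3.0, -0.5]))
print(credibility_level(0.8, [2.0, 0.0], [-3.0, 0.0]))
print(credibility_level(0.1, [2.0], [-1.0]))
```

In HMDM the final judgment also depends on meaning as assessed by the analyst, which a rule like this does not capture.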
Quality Measures
• The decision trees are evaluated according to:
– Accuracy
– Corrected class probability estimate (CCPE)
– Kappa
• The regression trees are evaluated according to:
– Correlation coefficient
– Relative absolute accuracy (RAA)
• In addition, trees are evaluated according to q∆, the total
change in quality caused by adding and removing attributes:
q∆ = ACC∆ + CCPE∆ + Kappa∆
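As a worked illustration of q∆, the sketch below sums the per-measure changes between a model before and after an attribute is added or removed. The numbers are made up, and expressing all three measures on a 0 to 1 scale is an assumption of this sketch; the slide does not state the scale. The name `q_delta` is illustrative.

```python
def q_delta(before, after, measures=("accuracy", "ccpe", "kappa")):
    """Total change in quality: the sum of the changes in each measure."""
    return sum(after[m] - before[m] for m in measures)


# Hypothetical quality values before and after adding an attribute.
before = {"accuracy": 0.70, "ccpe": 0.60, "kappa": 0.40}
after = {"accuracy": 0.75, "ccpe": 0.62, "kappa": 0.50}

# 0.05 + 0.02 + 0.10
print(round(q_delta(before, after), 2))
```

A positive value means the change improved the model overall; a negative value after removal indicates that the removed attributes carried useful information.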
Experimental Evaluation
• Performed on three domains:
1. Research and development (R&D)
2. Higher education
3. Automatic web genre identification
R&D Domain: Remove Attributes Graph
GERD-PC || GERD-GDP
RES-HC || RES-FTE
APP-NON-RES
Domains
• Higher education
– Goal: An analysis of the impact of higher education sector
on economic welfare of a country
– DM methods: J48 and M5P trees
– Data: 60 attributes; 167 examples: countries; class: GNI per
capita
• Automatic web genre identification
– Goal: Improve predictive performance by eliminating less-credible
relations from J48 decision-tree models
– Data: 500 attributes: words; 1,539 examples: web pages;
class: 20 genres
R&D and Higher Education Domains: Credible Relations
R&D
• First level: increase the level of investment in R&D sector
• Second level:
– Increase the number of patents
– Increase the number of researchers
– Develop business enterprise sector as the key leader in R&D activities
Higher education
• First level: stimulate participation in higher education and improve
student exchange programs
• Second level:
– Increase the level of investment in all levels of education (“low”)
– Increase number of graduates in science programs (“middle”)
– Attract more foreign students (“middle”)
Evaluation
Accuracy (%)
Data
J48
HI-EDU
R&D
Correlation coefficient
HMDM
Data
M5P
71.86
HI-EDU
0.681
63.47
R&D
0.722
HMDM
0.787
Data: Genres
F-Measure | J48 | HMDM
Micro-AVG | 0.280 | 0.370
Macro-AVG | 0.284 | 0.377
• User study with 22 participants:
– 64% of participants did not recognize less-credible relations in a
single model
– When presented with credible models, all participants accepted them
as better
Conclusions
• A novel method, Human-Machine Data Mining (HMDM),
was designed that combines human understanding and
raw computing power to extract credible relations from
data.
• The HMDM method was evaluated on three complex
domains, showing that:
– the method is able to find important relations in data
– credible models are of higher quality than the models
constructed by automatic DM methods
– humans accept credible models
Contributions
• The main contributions:
– A new method, Human-Machine Data Mining (HMDM), was
designed for extracting credible relations from data
– The CCPE statistical measure, originally conceived for
classification rules, was extended to decision trees
– Interactive explanation structures, in the form of added-attributes
and removed-attributes graphs, were designed to
facilitate the extraction of credible relations
• Additional contributions:
– A computer program was developed to support the HMDM
method
– Three real-life domains were analyzed