
Understanding of data using
Computational Intelligence methods
Włodzisław Duch
Dept. of Informatics,
Nicholas Copernicus University,
Toruń, Poland
http://www.phys.uni.torun.pl/~duch
IEA/AIE Cairns, 17-20.06.2002
What am I going to say
• Data and CI
• What we hope for.
• Forms of understanding.
• Visualization.
• Prototypes.
• Logical rules.
• Some knowledge discovered.
• Expert system for psychometry.
• Conclusions, or why am I saying this?
Types of Data
• Data was precious! Now it is overwhelming ...
• Statistical data – clean, numerical, controlled
experiments, vector space model.
• Relational data – marketing, finances.
• Textual data – Web, NLP, search.
• Complex structures – chemistry, economics.
• Sequence data – bioinformatics.
• Multimedia data – images, video.
• Signals – dynamic data, biosignals.
• AI data – logical problems, games, behavior …
Computational Intelligence
Diagram: Computational Intelligence turns Data => Knowledge; it embraces neural networks, evolutionary algorithms, fuzzy logic, soft computing, pattern recognition, expert systems, machine learning, probabilistic methods, visualization and multivariate statistics, with Artificial Intelligence as one of its parts.
CI & AI definition
• Computational Intelligence is concerned with solving problems that are effectively non-algorithmic. This corresponds to all cognitive processes, including low-level ones (perception).
• Artificial Intelligence is a part of CI concerned with solving effectively non-algorithmic problems that require systematic reasoning and symbolic knowledge representation. Roughly this corresponds to high-level cognitive processes.
Turning data into knowledge
What should CI methods do?
• Provide descriptive and predictive nonparametric models of data.
• Allow one to classify, approximate, associate, correlate and complete patterns.
• Allow discovery of new categories and interesting patterns.
• Help to visualize multi-dimensional relationships among data samples.
• Allow understanding of the data in some way.
• Facilitate creation of expert systems and reasoning.
Forms of useful knowledge
AI/Machine Learning camp:
Neural nets are black boxes.
Unacceptable! Symbolic rules forever.
But ... knowledge accessible to humans is in:
• symbols,
• similarity to prototypes,
• images, visual representations.
What type of explanation is satisfactory?
Interesting question for cognitive scientists.
Different answers in different fields.
Data understanding
• Humans remember examples of each
category and refer to such examples –
as similarity-based or nearest-neighbor methods do.
• Humans create prototypes out of many
examples – as Gaussian classifiers,
RBF networks, neurofuzzy systems do.
• Logical rules are the highest form of
summarization of knowledge.
Types of explanation:
• visualization-based: maps, diagrams, relations ...
• exemplar-based: prototypes and similarity;
• logic-based: symbols and rules.
Visualization: dendrograms
All projections (cuboids) on 2D subspaces are
identical, dendrograms do not show the structure.
Normal and malignant lymphocytes.
Visualization: 2D projections
All projections (cuboids) on 2D subspaces are
identical, dendrograms do not show the structure.
3-bit parity + all 5-bit combinations.
Visualization: MDS mapping
Results of pure MDS mapping + centers of
hierarchical clusters connected.
3-bit parity + all 5-bit combinations.
Visualization: 3D projections
Only age is continuous, other values are binary
Fine Needle Aspirate of Breast Lesions, red=malignant, green=benign
A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 354: 1518-1521
Visualization: MDS mappings
Try to preserve all distances in a 2D nonlinear mapping.
MDS for large sets using LVQ + relative mapping.
Prototype-based rules
C-rules (Crisp), are a special case of F-rules (fuzzy rules).
F-rules (fuzzy rules) are a special case of P-rules (Prototype).
P-rules have the form:
IF P = arg minR D(X,R) THEN Class(X) = Class(P)
D(X,R) is a dissimilarity (distance) function, determining
decision borders around prototype P.
P-rules are easy to interpret!
IF X = You are most similar to the P = Superman
THEN You are in the Super-league.
IF X = You are most similar to the P = Weakling
THEN You are in the Failed-league.
“Similar” may involve different features or D(X,P).
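A minimal sketch of such a P-rule classifier (the prototypes, weights and dissimilarity function below are illustrative, not taken from the slides):

```python
import numpy as np

def p_rule_classify(x, prototypes, labels, dist):
    """P-rule: assign X the class of the prototype P that minimizes D(X,P)."""
    d = [dist(x, p) for p in prototypes]
    return labels[int(np.argmin(d))]

# Weighted Euclidean dissimilarity; weights and prototypes are made up.
w = np.array([1.0, 0.5])
def weighted_euclid(x, p):
    return float(np.sum(w * (x - p) ** 2))

prototypes = np.array([[0.0, 0.0], [5.0, 5.0]])   # P = "Weakling", P = "Superman"
labels = ["Failed-league", "Super-league"]
print(p_rule_classify(np.array([4.2, 4.8]), prototypes, labels, weighted_euclid))
```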
P-rules
Euclidean distance leads to Gaussian fuzzy membership functions with product as the T-norm.
D(X,P) = Σi d(Xi,Pi) = Σi Wi (Xi - Pi)²

μP(X) = exp[-D(X,P)] = exp[-Σi d(Xi,Pi)] = Πi exp[-Wi (Xi - Pi)²] = Πi μi(Xi,Pi)

Manhattan function => μ(X;P) = exp(-Σi |Xi - Pi|)
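A small sketch of this factorization (weights and points are illustrative): the Gaussian membership computed from the weighted Euclidean distance equals the product of per-feature factors.

```python
import numpy as np

def membership(x, p, w):
    """mu_P(X) = exp(-D(X,P)) with D = sum_i w_i (x_i - p_i)^2;
    equal to the product of per-feature Gaussian factors."""
    d = np.sum(w * (x - p) ** 2)
    mu_global = np.exp(-d)
    mu_factors = np.exp(-w * (x - p) ** 2)
    assert np.isclose(mu_global, np.prod(mu_factors))
    return float(mu_global)

print(membership(np.array([1.0, 2.0]), np.array([0.5, 2.5]), np.array([1.0, 2.0])))
```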
Various distance functions lead to different MFs.
Ex. data-dependent distance functions, for symbolic data:

DVDM(X,Y) = Σi Σj | p(Cj|Xi) - p(Cj|Yi) |

DPDF(X,Y) = Σi Σj | p(Xi|Cj) - p(Yi|Cj) |
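A sketch of the VDM idea for symbolic features, with p(Cj|Xi) estimated from value counts in a toy dataset (the data and the simple count-based estimates are assumptions):

```python
from collections import Counter, defaultdict

def fit_vdm(X, y):
    """Estimate p(class | feature value) for every feature position."""
    classes = sorted(set(y))
    tables = []
    for i in range(len(X[0])):
        counts = defaultdict(Counter)            # value -> class counts
        for row, c in zip(X, y):
            counts[row[i]][c] += 1
        tables.append({v: {c: cnt[c] / sum(cnt.values()) for c in classes}
                       for v, cnt in counts.items()})
    return tables, classes

def vdm(x, z, tables, classes):
    """D_VDM(x,z) = sum_i sum_j |p(C_j|x_i) - p(C_j|z_i)|."""
    return sum(abs(tables[i].get(x[i], {}).get(c, 0.0) -
                   tables[i].get(z[i], {}).get(c, 0.0))
               for i in range(len(x)) for c in classes)

X = ["act", "acg", "ttg", "tgg"]; y = ["+", "+", "-", "-"]
tables, classes = fit_vdm(X, y)
print(vdm("act", "ttg", tables, classes))
```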

Crisp P-rules
New distance functions from info theory => interesting MFs.
Membership functions => new distance functions, with a local D(X,R) for each cluster.
Crisp logic rules: use the L∞ norm:
D(X,P) = ||X-P||∞ = maxi Wi |Xi - Pi|
D(X,P) = const => rectangular contours.
L∞ (Chebyshev) distance with threshold θP:
IF D(X,P) ≤ θP THEN C(X) = C(P)
is equivalent to a conjunctive crisp rule
IF X1 ∈ [P1 - θP/W1, P1 + θP/W1] ∧ … ∧ XN ∈ [PN - θP/WN, PN + θP/WN] THEN C(X) = C(P)
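A small numerical check of this equivalence (the prototype, weights and threshold θP are made up):

```python
import numpy as np

P = np.array([2.0, -1.0, 0.5])        # prototype (illustrative)
W = np.array([1.0, 2.0, 0.5])         # feature scaling weights
theta = 1.0                           # threshold theta_P

def chebyshev_rule(x):
    """IF max_i W_i |x_i - P_i| <= theta THEN class of P."""
    return np.max(W * np.abs(x - P)) <= theta

def interval_rule(x):
    """Equivalent conjunction: x_i in [P_i - theta/W_i, P_i + theta/W_i] for all i."""
    lo, hi = P - theta / W, P + theta / W
    return bool(np.all((x >= lo) & (x <= hi)))

rng = np.random.default_rng(0)
for x in rng.uniform(-4, 4, size=(1000, 3)):
    assert chebyshev_rule(x) == interval_rule(x)
print("Chebyshev rule and interval conjunction agree on 1000 random points.")
```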
Decision borders
D(P,X)=const and decision borders D(P,X)=D(Q,X).
Euclidean distance from 3
prototypes, one per class.
Minkowski (α=20) distance from 3 prototypes.
P-rules for Wine
L distance (crisp rules):
15 prototypes kept,
5 errors,
f2, f8, f10 removed
Euclidean distance:
11 prototypes kept,
7 errors
Manhattan distance:
6prototypes kept,
4 errors,
f2 removed
Many other solutions.
Prototypes: SV & clusters.
Complex objects
The vector space concept is not sufficient for complex objects; a common set of features is meaningless.
AI: complex objects, states, subproblems.
General approach: it is sufficient to evaluate the similarity D(Oi,Oj).
Compare Oi, Oj: define a transformation built from elementary operators Ŵk,
T(Oi) = Πk Ŵk Oi = Oj
Elementary operators Ŵk, e.g. substring substitutions.
Many transformations T connecting a pair of objects Oi and Oj exist.
Cost of a transformation = sum of the Ŵk costs.
Similarity: lowest transformation cost.
Bioinformatics: sophisticated similarity functions for sequences.
Dynamic programming finds similarities in reasonable time.
Use adaptive costs and general framework for SBM methods.
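A minimal sketch of this idea for strings: the dissimilarity is the lowest total cost of elementary substitutions, insertions and deletions, found by dynamic programming (the unit costs are placeholders that could be made adaptive):

```python
def transformation_cost(a, b, sub_cost=1.0, indel_cost=1.0):
    """Lowest-cost transformation of string a into string b using elementary
    substitutions, insertions and deletions (weighted Levenshtein distance,
    computed by dynamic programming)."""
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel_cost
    for j in range(1, m + 1):
        D[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_cost)
            D[i][j] = min(sub, D[i - 1][j] + indel_cost, D[i][j - 1] + indel_cost)
    return D[n][m]

print(transformation_cost("tactagcaat", "tactcgcat"))
```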
Promoters
DNA strings, 57 nucleotides, 53 + and 53 - samples
tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt
Euclidean distance: symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4.
PDF distance: symbolic s = a, c, t, g replaced by p(s|+).
Connection of CI with AI
AI/CI division is harmful for science!
GOFAI: operators, state transformations and search
techniques are basic tools in AI solving problems requiring
systematic reasoning.
CI methods may provide useful heuristics for AI and define
metric relations between states, problems or complex objects.
Example: combinatorial productivity in AI systems and FSM.
Later: decision tree for complex structures.
Electric circuit example
Answering questions in complex domains requires reasoning.
Qualitative behavior of electric circuit:
7 variables, but Ohm’s law V=I*R, or Kirchhoff’s law Vt=V1+V2.
Train a NeuroFuzzy system on Ohm’s and Kirchhoff’s laws.
Without solving equations; answer questions of the type:
If R2 grows, R1 & Vt are constant, what will happen with the
current I and voltages V1, V2 ?
(taken from the PDP book, McClelland, Rumelhart, Hinton)
Electric circuit search
AI: create search tree, CI: provide guiding intuition.
Any law of the form A=B*C or A=B+C, e.g. V=I*R, has 13 true facts and 14 false facts, and may be learned by a NF system.
Geometrical representation: + increasing, - decreasing, 0 constant.
Find a combination of Vt, Rt, I, V1, V2, R1, R2 for which all 5 constraints are fulfilled; this holds for 111 cases out of 3^7 = 2187.

F(Vt, Rt, I, V1, V2, R1, R2) = Π(i=1..5) Fi(Ai, Bi, Ci)

Search and check if X can be +, 0 or -; the laws are not satisfied if F(Vt=0, Rt, I, V1, V2, R1=0, R2=+) = 0.
Heuristic search
If R2 grows, R1 & Vt are constant, what will happen
with the current I and voltages V1, V2 ?
We know that:
R2 =+, R1 =0, Vt =0,
V1=?, V2=?, Rt=?, I =?
Take V1=+ and check if:
F(Vt=0, Rt=?, I=?, V1=+, V2=?, R1=0, R2=+) >0
Since F() > 0 for all of V1 = +, 0 and -, take a variable that leads to a unique answer: Rt.
A single search path solves the problem.
Useful also in approximate reasoning, where only some conditions are fulfilled.
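A sketch of this qualitative reasoning in code, assuming the five constraints implied by the circuit (Vt = V1 + V2, Rt = R1 + R2, and Ohm's law for Vt, V1 and V2); for changes of positive quantities a product follows the same sign table as a sum, which is why each law has the 13-true-fact table mentioned above:

```python
from itertools import product

SIGNS = ('+', '0', '-')

def qual_add_ok(a, b, c):
    """True if sign a is consistent with a = b + c in sign arithmetic.
    For positive quantities the change of a product (d(BC) = C dB + B dC)
    follows the same table, so V = I*R is treated the same way."""
    if b == '0' and c == '0':
        return a == '0'
    if b == c:
        return a == b
    if b == '0':
        return a == c
    if c == '0':
        return a == b
    return True          # opposite signs: the sum may be +, 0 or -

def circuit_ok(vt, rt, i, v1, v2, r1, r2):
    """The five qualitative constraints assumed for the two-resistor circuit."""
    return (qual_add_ok(vt, v1, v2) and    # Vt = V1 + V2 (Kirchhoff)
            qual_add_ok(rt, r1, r2) and    # Rt = R1 + R2
            qual_add_ok(vt, i, rt) and     # Vt = I * Rt (Ohm)
            qual_add_ok(v1, i, r1) and     # V1 = I * R1
            qual_add_ok(v2, i, r2))        # V2 = I * R2

valid = [s for s in product(SIGNS, repeat=7) if circuit_ok(*s)]
print(len(valid))        # the slide reports 111 of 3**7 = 2187 combinations

# The query: R2 grows, R1 and Vt are constant -- what happens to I, V1, V2?
answers = [s for s in valid if s[0] == '0' and s[5] == '0' and s[6] == '+']
print(answers)           # expected single answer: Rt=+, I=-, V1=-, V2=+
```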
Logical rules
Crisp logic rules: for continuous x use linguistic variables (predicate functions):
sk(x) ≡ True[Xk ≤ x ≤ X'k], for example:
small(x)  = True{x | x < 1}
medium(x) = True{x | x ∈ [1,2]}
large(x)  = True{x | x > 2}
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X)
THEN (X is a Brownie)
ELSE IF ... ELSE ...
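A small sketch of the linguistic variables above as predicate functions, and the Brownie rule built from them (has-hat and has-beard are assumed boolean attributes):

```python
def small(x):  return x < 1
def medium(x): return 1 <= x <= 2
def large(x):  return x > 2

def is_brownie(height, has_hat, has_beard):
    """IF small-height(X) AND has-hat(X) AND has-beard(X) THEN X is a Brownie."""
    return small(height) and has_hat and has_beard

print(is_brownie(0.6, True, True))    # True
print(is_brownie(1.8, True, False))   # False
```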
Crisp logic decisions
Crisp logic is based on rectangular
membership functions:
True/False values jump from 0 to 1.
Step functions are used for
partitioning of the feature space.
Very simple hyper-rectangular
decision borders.
Severe limitation on the expressive
power of crisp logical rules!
DT decision borders
Decision trees lead to specific decision borders.
SSV tree on Wine data, proline + flavanoids content
Decision tree forests: many decision trees of similar
accuracy, but different selectivity and specificity.
Logical rules - advantages
Logical rules, if simple enough, are preferable.
• Rules may expose limitations of black box
solutions.
• Only relevant features are used in rules.
• Rules may sometimes be more accurate than
NN and other CI methods.
• Overfitting is easy to control; rules usually have a small number of parameters.
• Rules forever !?
A logical rule about logical rules is:
IF the number of rules is relatively small
AND the accuracy is sufficiently high
THEN rules may be an optimal choice.
Logical rules - limitations
Logical rules are preferred but ...
• Only one class is predicted, p(Ci|X,M) = 0 or 1; such a black-and-white picture may be inappropriate in many applications.
• Discontinuous cost functions allow only non-gradient optimization.
• Sets of rules are unstable: small change in the
dataset leads to a large change in structure of
complex sets of rules.
• Reliable crisp rules may reject some cases as
unclassified.
• Interpretation of crisp rules may be misleading.
• Fuzzy rules are not so comprehensible.
Rules - choices
Simplicity vs. accuracy.
Confidence vs. rejection rate.

Confusion matrix (rows: true class; columns: predicted +, predicted -, rejected):

p(true, predicted) = | p++  p+-  p+r |  (row sum p+)
                     | p-+  p--  p-r |  (row sum p-)

p++ is a hit; p-+ a false alarm; p+- a miss.

Accuracy (overall)   A(M) = p++ + p--
Error rate           L(M) = p-+ + p+-
Rejection rate       R(M) = p+r + p-r = 1 - L(M) - A(M)
Sensitivity          S+(M) = p+|+ = p++ / p+
Specificity          S-(M) = p-|- = p-- / p-
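A minimal sketch computing these measures from a joint confusion matrix with a reject option (the probabilities are illustrative):

```python
import numpy as np

# Rows: true +, true -; columns: predicted +, predicted -, rejected.
P = np.array([[0.40, 0.05, 0.05],    # p++, p+-, p+r
              [0.03, 0.42, 0.05]])   # p-+, p--, p-r

accuracy    = P[0, 0] + P[1, 1]      # A(M) = p++ + p--
error       = P[1, 0] + P[0, 1]      # L(M) = p-+ + p+-
rejection   = P[0, 2] + P[1, 2]      # R(M) = 1 - A(M) - L(M)
sensitivity = P[0, 0] / P[0].sum()   # S+ = p++ / p+
specificity = P[1, 1] / P[1].sum()   # S- = p-- / p-

print(accuracy, error, rejection, sensitivity, specificity)
assert np.isclose(rejection, 1 - accuracy - error)
```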
Neural networks and rules
Diagram: an MLP estimating ~p(MI|X) ≈ 0.7 for Myocardial Infarction; input and output weights connect inputs such as Sex, Age, Smoking, Pain Intensity, Pain Duration and ECG ST Elevation (example values -1, 65, 1, 5, 3, 1).
Knowledge from networks
Simplify networks: force most weights to 0,
quantize remaining parameters, be constructive!
• Regularization: mathematical technique
improving predictive abilities of the network.
• Result: MLP2LN neural networks that are
equivalent to logical rules.
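The kind of regularization meant here can be sketched as an extra penalty added to the usual MLP error; the specific form below (weight decay plus a term pushing the remaining weights towards ±1) is my reading of the MLP2LN idea, with illustrative lambda values:

```python
import numpy as np

def logical_penalty(w, lambda1=1e-3, lambda2=1e-2):
    """Extra cost added to the MLP error function:
    lambda1 * sum(w^2)                  -- standard weight decay (weights -> 0)
    lambda2 * sum(w^2 (w-1)^2 (w+1)^2)  -- pushes remaining weights to +/-1,
    so the trained network acts like a network of logical (threshold) units."""
    return (lambda1 * np.sum(w ** 2)
            + lambda2 * np.sum(w ** 2 * (w - 1) ** 2 * (w + 1) ** 2))

w = np.array([0.02, 0.97, -1.05, 0.4])
print(logical_penalty(w))
```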
MLP2LN
Converts MLP neural networks into a network
performing logical operations (LN).
Diagram: input layer; aggregation units (better features); linguistic units (windows, filters); rule units (threshold logic); output layer with one node per class.
Learning dynamics
Decision regions shown every 200 training epochs in x3, x4
coordinates; borders are optimally placed with wide margins.
Neurofuzzy systems
Fuzzy: mx0,1 (no/yes) replaced by a degree
mx[0,1]. Triangular, trapezoidal, Gaussian ... MF.
M.f-s in many
dimensions:
Feature Space Mapping (FSM) neurofuzzy system.
Neural adaptation, estimation of probability density
distribution (PDF) using single hidden layer network
(RBF-like) with nodes realizing separable functions:
G  X ; P    Gi  X i ; Pi 
i 1
Heterogeneous systems
Homogeneous systems: one type of “building blocks”, same
type of decision borders.
Ex: neural networks, SVMs, decision trees, kNNs ….
Committees combine many models together, but lead to
complex models that are difficult to understand.
Discovering the simplest class structures (the right inductive bias) requires heterogeneous adaptive systems (HAS).
Ockham's razor: simpler systems are better.
HAS examples:
NN with many types of neuron transfer functions.
k-NN with different distance functions.
DT with different types of test criteria.
GhostMiner Philosophy
GhostMiner, data mining tools from our lab.
http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building and
knowledge discovery from model use =>
GhostMiner Developer & GhostMiner Analyzer.
• There is no free lunch – provide different types of tools for knowledge discovery.
Decision tree, neural, neurofuzzy, similarity-based,
committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model building and evaluation, organizing it into projects.
Recurrence of breast cancer
Data from: Institute of Oncology, University
Medical Center, Ljubljana, Yugoslavia.
286 cases, 201 no recurrence (70.3%),
85 recurrence cases (29.7%)
no-recurrence-events, 40-49, premeno, 25-29,
0-2, ?, 2, left, right_low, yes
9 nominal features:
age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.
Recurrence of breast cancer
Data from: Institute of Oncology, University
Medical Center, Ljubljana, Yugoslavia.
Many systems used, 65-78% accuracy reported.
Single rule:
IF (nodes-involved ∉ [0,2]) AND (degree-malignant = 3)
THEN recurrence, ELSE no-recurrence
76.2% accuracy, only trivial knowledge in the data:
“Highly malignant breast cancer involving many
nodes is likely to strike back.”
Recurrence - comparison.
Method                    10xCV accuracy
MLP2LN, 1 rule            76.2
SSV DT, stable rules      75.7 ± 1.0
k-NN, k=10, Canberra      74.1 ± 1.2
MLP + backprop            73.5 ± 9.4 (Zarndt)
CART DT                   71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes       71.7 ± 6.8
Naive Bayes               69.3 ± 10.0 (Zarndt)
Other decision trees      < 70.0
Breast cancer diagnosis.
Data from University of Wisconsin Hospital,
Madison, collected by dr. W.H. Wolberg.
699 cases, 9 cell features quantized from 1 to 10:
clump thickness, uniformity of cell size,
uniformity of cell shape, marginal adhesion,
single epithelial cell size, bare nuclei,
bland chromatin, normal nucleoli, mitoses.
Task: distinguish benign from malignant cases.
Breast cancer rules.
Data from University of Wisconsin Hospital,
Madison, collected by dr. W.H. Wolberg.
Simplest rule from MLP2LN, large regularization:
IF uniformity of cell size < 3
THEN benign
ELSE malignant
Sensitivity=0.97, Specificity=0.85
More complex solutions (3 rules) give in 10CV:
Sensitivity =0.95, Specificity=0.96, Accuracy=0.96
Breast cancer comparison.
Method                       10xCV accuracy
k-NN, k=3, Manh              97.0 ± 2.1 (GM)
FSM, neurofuzzy              96.9 ± 1.4 (GM)
Fisher LDA                   96.8
MLP + backprop               96.7 (Ster, Dobnikar)
LVQ                          96.6 (Ster, Dobnikar)
IncNet (neural)              96.4 ± 2.1 (GM)
Naive Bayes                  96.4
SSV DT, 3 crisp rules        96.0 ± 2.9 (GM)
LDA (linear discriminant)    96.0
Various decision trees       93.5 - 95.6
SSV HAS Wisconsin
Heterogeneous decision tree that searches not only for logical
rules but also for prototype-based rules.
Single P-rule gives simplest known description of this data:
IF ||X - R303|| < 20.27 THEN malignant
ELSE benign
18 errors, 97.4% accuracy. Good prototype for malignant!
Simple thresholds, that’s what MDs like the most!
Best L1O error: 98.3% (FSM); best 10CV around 97.5% (Naïve Bayes + kernel, SVM).
C4.5 gives 94.7 ± 2.0%, SSV without distances 96.4 ± 2.1%.
Several simple rules of similar accuracy in CV tests exist.
Melanoma skin cancer
• Collected in the Outpatient Center of Dermatology in Rzeszów, Poland.
• Four types of melanoma: benign, blue, suspicious, or malignant.
• 250 cases, with almost equal class distribution.
• Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5).
• TDS (Total Dermatoscopy Score) - a single index.
• Goal: hardware scanner for preliminary diagnosis.
Melanoma results
Method                        Rules   Training %    Test %
MLP2LN, crisp rules           4       98.0 (all)    100
SSV Tree, crisp rules         4       97.5 ± 0.3    100
FSM, rectangular f.           7       95.5 ± 1.0    100
kNN + prototype selection     13      97.5 ± 0.0    100
FSM, Gaussian f.              15      93.7 ± 1.0    95 ± 3.6
kNN k=1, Manh, 2 features     --      97.4 ± 0.3    100
LERS, rough rules             21      --            96.2
Antibiotic activity of
pyrimidine compounds.
Pyrimidines: which compound
has stronger antibiotic activity?
Common template, substitutions
added at 3 positions, R3, R4 and R5.
27 features taken into account: polarity, size,
hydrogen-bond donor or acceptor, pi-donor or
acceptor, polarizability, sigma effect.
Pairs of chemicals (54 features) are compared: which one has higher activity?
2788 cases, 5-fold cross-validation tests.
Antibiotic activity - results.
Pyrimidines: which compound
has stronger antibiotic activity?
Mean Spearman's rank correlation coefficient
used:
-1< rs < +1
Method                     Rank correlation
FSM, 41 Gaussian rules     0.77 ± 0.03
Golem (ILP)                0.68
Linear regression          0.65
CART (decision tree)       0.50
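For reference, a minimal sketch of computing the Spearman rank correlation between true and predicted activities (uses scipy; the arrays are made up):

```python
from scipy.stats import spearmanr

true_activity      = [3.1, 2.4, 5.0, 1.2, 4.4]
predicted_activity = [2.9, 2.0, 4.1, 1.5, 4.6]

rs, _ = spearmanr(true_activity, predicted_activity)   # -1 <= rs <= +1
print(rs)
```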
Thyroid screening.
Garavan Institute,
Sydney, Australia
15 binary, 6 continuous
Training: 93+191+3488
Validate: 73+177+3178
• Determine important clinical factors.
• Calculate the probability of each diagnosis.
Diagram: clinical findings (age, sex, …, TSH, T4U, T3, TT4, TBG) feed hidden units that produce the final diagnoses: normal, hypothyroid, hyperthyroid.
Thyroid – some results.
Accuracy of diagnoses obtained with different systems.
Method                      Rules/Features   Training %   Test %
MLP2LN optimized            4/6              99.9         99.36
CART/SSV Decision Trees     3/5              99.8         99.33
Best Backprop MLP           -/21             100          98.5
Naïve Bayes                 -/-              97.0         96.1
k-nearest neighbors         -/-              -            93.8
Psychometry
MMPI (Minnesota Multiphasic Personality
Inventory) psychometric test.
Printed forms are scanned or computerized
version of the test is used.
• Raw data: 550 questions, ex:
I am getting tired quickly: Yes - Don’t know - No
• Results are combined into 10 clinical scales and
4 validity scales using fixed coefficients.
• Each scale measures tendencies towards
hypochondria, schizophrenia, psychopathic
deviations, depression, hysteria, paranoia etc.
Psychometry
• There is no simple correlation between single
values and final diagnosis.
• Results are displayed in the form of a histogram,
called a ‘psychogram’. Interpretation depends
on the experience and skill of an expert, takes
into account correlations between peaks.
Goal: an expert system providing evaluation and
interpretation of MMPI tests at an expert level.
Problem: agreement between experts only 70% of
the time; alternative diagnosis and personality
changes over time are important.
Psychometric data
1600 cases for women, the same number for men.
27 classes:
norm, psychopathic, schizophrenia, paranoia,
neurosis, mania, simulation, alcoholism, drug
addiction, criminal tendencies, abnormal
behavior due to ...
Extraction of logical rules: 14 scales = features.
Define linguistic variables and use FSM,
MLP2LN, SSV - giving about 2-3 rules/class.
Psychometric data
Method    Data   N. rules   Accuracy   +Gx%
C 4.5     ♀      55         93.0       93.7
C 4.5     ♂      61         92.5       93.1
FSM       ♀      69         95.4       97.6
FSM       ♂      98         95.9       96.9
10-CV for FSM is 82-85%, for C4.5 is 79-84%.
Input uncertainty +Gx around 1.5% (best ROC)
improves FSM results to 90-92%.
Psychometric Expert
Probabilities for different classes.
For greater uncertainties more
classes are predicted.
Fitting the rules to the conditions:
typically 3-5 conditions per rule,
Gaussian distributions around
measured values that fall into the
rule interval are shown in green.
Verbal interpretation of each
case, rule and scale dependent.
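A sketch of how such class probabilities could be obtained for a crisp interval rule under Gaussian input uncertainty: the probability that the rule fires is the Gaussian mass falling into the rule's intervals (the rule, scales and the uncertainty Gx below are illustrative):

```python
from math import erf, sqrt

def prob_in_interval(x, sx, lo, hi):
    """P(value in [lo,hi]) for a Gaussian around the measured x with std sx."""
    cdf = lambda t: 0.5 * (1.0 + erf((t - x) / (sx * sqrt(2.0))))
    return cdf(hi) - cdf(lo)

def rule_probability(measured, stds, intervals):
    """Probability that all rule conditions hold, assuming independent features."""
    p = 1.0
    for x, sx, (lo, hi) in zip(measured, stds, intervals):
        p *= prob_in_interval(x, sx, lo, hi)
    return p

# A rule with 3 conditions on psychometric scales (all values made up).
measured  = [72.0, 58.0, 81.0]
stds      = [1.5, 1.5, 1.5]            # input uncertainty Gx ~ 1.5
intervals = [(70, 90), (50, 65), (75, 100)]
print(rule_probability(measured, stds, intervals))
```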
Visualization
Probability of classes versus
input uncertainty.
Detailed input probabilities
around the measured values
vs. change in the single scale;
changes over time define the
‘patient’s trajectory’.
Interactive multidimensional
scaling: zooming on the new
case to inspect its similarity to
other cases.
Conclusions
Data understanding is a challenging problem.
• Classification rules are frequently only the first step and
may not be the best solution.
• Visualization is always helpful.
• P-rules may be competitive if complex decision borders
are required, providing different types of rules.
• Understanding of complex objects is possible, although difficult, using adaptive costs and a distance defined as the least expensive transformation (cf. action principles in physics).
• Great applications are coming!
Challenges
Fully automatic universal data analysis systems:
press the button and wait for the truth …
• Discovery of theories rather than data models
• Integration with image/signal analysis
• Integration with reasoning in complex domains
• Combining expert systems with neural networks
• …
We are slowly getting there.
More & more computational intelligence tools
(including our own) are available.