Metody Inteligencji Obliczeniowej

Download Report

Transcript Metody Inteligencji Obliczeniowej

Meta-Learning and learning in
highly non-separable cases
Włodzisław Duch
Department of Informatics,
Nicolaus Copernicus University, Toruń, Poland
Google: W. Duch
CIMMACS’07
Current projects
• Learning data with inherent complex logic,
general theory of CI and meta-learning.
• I Do Care Project: infant lab for developing perfect babies,
and other neuroengineering projects – many jobs,
searching for students and project leaders!
• Understanding real brains, breaking neural code, brain
stem model, priming in cortex, generative disease.
• Brain-inspired cognitive architectures, avatars with
artificial minds, emotions, creativity & hi-level cognition.
• Neurocognitive inspirations in natural language
processing, large scale semantic memories, word games.
• Interactive art projects.
Plan
• Problems with Computational intelligence (CI)
•
•
•
•
•
•
•
•
What can we learn?
Why solid foundations are needed.
Similarity based framework.
Transformations and heterogeneous systems.
Meta-learning.
Hard problems.
Beyond pattern recognition.
Scaling up intelligent systems to human level
competence?
What is Computational Intelligence?
The Field of Interest of the Society shall be the theory, design,
application, and development of biologically and linguistically motivated
computational paradigms emphasizing neural networks, connectionist
systems, genetic algorithms, evolutionary programming, fuzzy systems,
and hybrid intelligent systems in which these paradigms are contained.
Artificial Intelligence (AI) was established in 1956!
AI Magazine 2005, Alan Mackworth:
In AI's youth, we worked hard to establish our paradigm by vigorously
attacking and excluding apparent pretenders to the throne of intelligence,
pretenders such as pattern recognition, behaviorism, neural networks,
and even probability theory. Now that we are established, such
ideological purity is no longer a concern. We are more catholic, focusing
on problems, not on hammers. Given that we do have a comprehensive
toolbox, issues of architecture and integration emerge as central.
CI definition
Computational Intelligence. An International Journal (1984)
+ 10 other journals with “Computational Intelligence”,
D. Poole, A. Mackworth & R. Goebel,
Computational Intelligence - A Logical Approach.
(OUP 1998), GOFAI book, logic and reasoning.
CI should:
• be problem-oriented, not method oriented;
• cover all that CI community is doing now, and is likely to do in future;
• include AI – they also think they are CI ...
CI: science of solving (effectively) non-algorithmizable
problems.
Problem-oriented definition, firmly anchored in computer sci/engineering.
AI: focused problems requiring higher-level cognition, the rest of CI is
more focused on problems related to perception/action/control.
The future of computational intelligence ...
Wisdom in computers?
What can we learn?
Good part of CI is about learning.
What can we learn?
Neural networks are universal approximators and evolutionary
algorithms solve global optimization problems – so everything
can be learned? Not quite ...
Duda, Hart & Stork, Ch. 9, No Free Lunch + Ugly Duckling Theorems:
• Uniformly averaged over all target functions the expected error for all
learning algorithms [predictions by economists] is the same.
• Averaged over all target functions no learning algorithm yields
generalization error that is superior to any other.
• There is no problem-independent or “best” set of features.
“Experience with a broad range of techniques is the best insurance for
solving arbitrary new classification problems.”
What is there to learn?
Brains ... what is in EEG? What happens in the brain?
Industry: what happens?
Genetics, proteins ...
Data mining packages
GhostMiner, data mining tools from our lab + Fujitsu:
http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building (hackers) and knowledge
discovery, from model use (lamers) => GM Developer & Analyzer
• No free lunch => provide different type of tools for knowledge
•
discovery: decision tree, neural, neurofuzzy, similarity-based, SVM,
committees, tools for visualization of data.
Support the process of knowledge discovery/model building and
evaluating, organizing it into projects.
• Many other interesting DM packages of this sort exists:
Weka, Yale, Orange, Knime ...
168 packages on the-data-mine.com list!
• We are building Intemi, completely new tools.
Are we really
so good?
Surprise!
Almost nothing
can be learned
using such tools!
What DM packages do?
Hundreds of components ... transforming, visualizing ...
Visual “knowledge flow” to
link components, or script
languages (XML) to define
complex experiments.
Yale 3.3: type
# components
Data preprocessing
74
Experiment operations
35
Learning methods
114
Metaoptimization schemes
17
Postprocessing
5
Performance validation
14
Visualization, presentation, plugin extensions ...
Why solid foundations are needed
Hundreds of components ... thousands of combinations ...
Our treasure box is full! We can publish forever!
But what would we really like to have?
Press the button and wait for the truth!
Computer power is with us, meta-learning should
find all interesting data models
= sequences of transformations/procedures.
Many considerations: optimal cost solutions, various costs of using
feature subsets; models that are simple & easy to understand; various
representation of knowledge: crisp, fuzzy or prototype rules,
visualization, confidence in predictions ...
Computational
learning
approach:
let there be
light!
Making things easy: principles
Principles: information compression
Neural information processing in perception and cognition: information
compression, or algorithmic complexity.
In computing: minimum length (message, description) encoding.
Wolff (2006): all cognition and computation as compression!
Analysis and production of natural language, fuzzy pattern recognition,
probabilistic reasoning and unsupervised inductive learning.
Talks about multiple alignment, unification and search, but
so far only models for sequential data and 1D alignment.
Information compression: encoding new information
in terms of old has been used to define the measure of
syntactic and semantic information (Duch, Jankowski
1994); based on the size of the minimal graph
representing a given data structure or knowledge-base
specification, thus it goes beyond alignment.
Graphs of consistent concepts
Brains learn new
concepts in terms
of old; use large
semantic network
and add new
concepts linking
them to the
known.
Disambiguate
concepts by
spreading
activation and
selecting those
that are
consistent with
already active
subnetworks.
Similarity-based framework
(Dis)similarity:
• more general than feature-based description,
• no need for vector spaces (structured objects),
• more general than fuzzy approach (F-rules are reduced to P-rules),
• includes nearest neighbor algorithms, MLPs, RBFs, separable
function networks, SVMs, kernel methods and many others!
Similarity-Based Methods (SBMs) are organized in a framework:
p(Ci|X;M) posterior classification probability or y(X;M) approximators,
models M are parameterized in increasingly sophisticated way.
A systematic search (greedy, beam, evolutionary) in the space of all
SBM models is used to select optimal combination of parameters and
procedures, opening different types of optimization channels, trying to
discover appropriate bias for a given problem.
Results: several candidate models are created, even very limited version
gives best results in 7 out of 12 Stalog problems.
SBM framework components
•
•
•
•
•
•
•
•
•
Pre-processing: objects O => features X, or (diss)similarities D(O,O’).
Calculation of similarity between features d(xi,yi) and objects D(X,Y).
Reference (or prototype) vector R selection/creation/optimization.
Weighted influence of reference vectors G(D(Ri,X)), i=1..k.
Functions/procedures to estimate p(C|X;M) or y(X;M).
Cost functions E[DT;M] and model selection/validation procedures.
Optimization procedures for the whole model Ma.
Search control procedures to create more complex models Ma+1.
Creation of ensembles of (local, competent) models.
• M={X(O), d(.,.), D(.,.), k, G(D), {R}, {pi(R)}, E[.], K(.), S(.,.)}, where:
• S(Ci,Cj) is a matrix evaluating similarity of the classes;
a vector of observed probabilities pi(X) instead of hard labels.
The kNN model p(Ci|X;kNN) = p(Ci|X;k,D(.),{DT});
the RBF model: p(Ci|X;RBF) = p(Ci|X;D(.),G(D),{R}),
MLP, SVM and many other models may all be “re-discovered”.
Meta-learning in SBM scheme
k-NN 67.5/76.6%
67.5/76.6%
+selection,
67.5/76.6 %
+d(x,y);
Canberra 89.9/90.7 %
+k opt; 67.5/76.6 %
+ si =(0,0,1,0,1,1);
71.6/64.4 %
+d(x,y) + si=(1,0,1,0.6,0.9,1);
Canberra 74.6/72.9 %
sel. or opt k;
+d(x,y) + selection;
Canberra 89.9/90.7 %
Start from kNN, k=1, all data & features, Euclidean distance, end with a
model that is a novel combination of procedures and parameterizations.
What NN components really do?
Vector mappings from the input space to hidden space(s) and to the
output space + adapt parameters to improve cost functions.
Hidden-Output mapping done by MLPs:
T = {Xi}
H = {hj(T)}
training data, N-dimensional.
X image in the hidden space, j =1 .. NH-dim.
... more transformations in hidden layers
Y = {yk(H )}
X image in the output space, k =1 .. NC-dim.
ANN goal:
data image H in the last hidden space should be linearly separable;
internal representations will determine network generalization.
But we never look at these representations!
Learning trajectories
• Take weights Wi from iterations i =1..K;
PCA on Wi covariance matrix usually captures 95-98% variance,
so error function in 2D shows realistic learning trajectories.
M. Kordos
& W. Duch
Instead of local minima large flat valleys are seen – why?
Data far from decision borders has almost no influence, the main reduction
of MSE is achieved by increasing ||W||, sharpening sigmoidal functions.
Transformation-based framework
Extend SBM adding fine granulation of methods and relations between
them to enable meta-learning by search in the model space.
For example, the first transformation (layer) may be a:
•
•
•
•
PCA network, with each node computing principal component.
LDA network, each node computes LDA direction (including FDA).
ICA network, nodes computing independent components.
KL, or Kullback-Leibler network with orthogonal or non-orthogonal
components; max. of mutual information is a special case
• c2 and other statistical tests for dependency to aggregate features.
• Factor analysis network, computing common and unique factors.
• Matching pursuit network for signal decomposition.
Evolving Transformation Systems (Goldfarb 1990-2006), unified
paradigm for inductive learning and structural representations.
Heterogeneous systems
Problems requiring different scales (multiresolution).
2-class problems, two situations:
C1 inside the sphere, C2 outside.
MLP: at least N+1 hyperplanes, O(N2) parameters.
RBF: 1 Gaussian, O(N) parameters.
C1 in the corner defined by (1,1 ... 1) hyperplane, C2 outside.
MLP: 1 hyperplane, O(N) parameters.
RBF: many Gaussians, O(N2) parameters, poor approx.
Combination: needs both hyperplane and hypersphere!
Logical rule: IF x1>0 & x2>0 THEN C1 Else C2
is not represented properly neither by MLP nor RBF!
Different types of functions in one model, first step beyond inspirations
from single neurons => heterogeneous models.
Heterogeneous everything
Homogenous systems: one type of “building blocks”, same type of
decision borders, ex: neural networks, SVMs, decision trees, kNNs
Committees combine many models together, but lead to complex
models that are difficult to understand.
Ockham razor: simpler systems are better.
Discovering simplest class structures, inductive bias of the data,
requires Heterogeneous Adaptive Systems (HAS).
HAS examples:
NN with different types of neuron transfer functions.
k-NN with different distance functions for each prototype.
Decision Trees with different types of test criteria.
1. Start from large networks, use regularization to prune.
2. Construct network adding nodes selected from a candidate pool.
3. Use very flexible functions, force them to specialize.
Taxonomy of NN activation functions
Taxonomy of NN output functions
Perceptron: implements logical rule x>q for x with Gaussian uncertainty.
Taxonomy - TF
HAS decision trees
Decision trees select the best feature/threshold value for univariate
and multivariate trees:
X i  qk or T  X; W,qk   Wi X i  qk
i
Decision borders: hyperplanes.
Introducing tests based on La Minkovsky metric.
T  X; R,q R   X  R a   X i  Ri
1/ a
 qR
i
For L2 spherical decision border are produced.
For L∞ rectangular border are produced.
Many choices, for example Fisher Linear Discrimination decision trees.
For large databases first clusterize data to get candidate references R.
SSV HAS DT example
SSV HAS tree in GhostMiner 3.0, Wisconsin breast cancer (UCI)
699 cases, 9 features (cell parameters, 1..10)
Classes: benign 458 (65.5%) & malignant 241 (34.5%).
Single rule gives simplest known description of this data:
IF ||X-R303|| < 20.27 then malignant
else benign
coming most often in 10xCV
97.4% accuracy; good prototype for malignant case!
Gives simple thresholds, that’s what MDs like the most!
Best 10CV around
97.5±1.8% (Naïve Bayes + kernel, or SVM)
SSV without distances: 96.4±2.1%
C 4.5 gives
94.7±2.0%
Several simple rules of similar accuracy but different specificity or
sensitivity may be created using HAS DT.
Need to select or weight features and select good prototypes.
More meta-learning
Meta-learning: learning how to learn, replace experts who search for
best models making a lot of experiments.
Search space of models is too large to explore it exhaustively, design
system architecture to support knowledge-based search.
•
•
•
•
Abstract view, uniform I/O, uniform results management.
Directed acyclic graphs (DAG) of boxes representing scheme
placeholders and particular models, interconnected through I/O.
Configuration level for meta-schemes, expanded at runtime level.
An exercise in software engineering for data mining!
Intemi, Intelligent Miner
Meta-schemes: templates with placeholders
•
•
•
•
•
May be nested; the role decided by the input/output types.
Machine learning generators based on meta-schemes.
Granulation level allows to create novel methods.
Complexity control: Length + log(time)
A unified meta-parameters description, defining the range of
sensible values and the type of the parameter changes.
Advanced meta-learning
• Finding the correlations of occurring different items in most
•
•
•
•
•
accurate results, and different machine structures of similar
behavior in an area of the model space.
Finding new successful complex structures and converting
them into meta-schemes (which we call meta abstraction) by
replacing proper substructures by placeholders.
Extracting meta-rules, describing interesting search directions.
Depositing the knowledge they gain in a reusable meta-knowledge
repository (for meta-learning experience exchange between
different meta-learners).
A uniform representation of the meta-knowledge should enable
extending expert knowledge, adjusting the prior knowledge
according to performed tests, etc.
Beyond transformations & feature spaces: actively search for info.
Intemi software incorporating these ideas coming “soon” ...
How much can we learn?
Linearly separable or almost separable problems are relatively
simple – deform or add dimensions to make data separable.
How to define “slightly non-separable”?
There is only separable and the vast realm of the rest.
Difficult case: complex logic
For n bits there are 2n nodes; in extreme cases such as parity all
neighbors are from the wrong class, so localized networks will fail.
Achieving linear separability without special architecture may be
impossible.
Projection on 111 ... 111 gives clusters with 0, 1, 2 ... n bits.
Easy and difficult problems
Linear separation: good goal if simple topological
deformation of decision borders is sufficient.
Linear separation of such data is possible in higher dimensional
spaces; this is frequently the case in pattern recognition problems.
RBF/MLP networks with one hidden layer solve such problems.
Difficult problems: disjoint clusters, complex logic.
Continuous deformation is not sufficient; networks with localized
functions need exponentially large number of nodes.
Boolean functions: for n bits there are K=2n binary vectors that can be
represented as vertices of n-dimensional hypercube.
Each Boolean function is identified by K bits.
BoolF(Bi) = 0 or 1 for i=1..K, leads to the 2K Boolean functions.
Ex: n=2 functions, vectors {00,01,10,11},
Boolean functions {0000, 0001 ... 1111}, ex. 0001 = AND, 0110 = OR,
each function is identified by number from 0 to 15 = 2K-1.
Boolean functions
n=2, 16 functions, 12 separable, 4 not separable.
n=3, 256 f, 104 separable (41%), 152 not separable.
n=4, 64K=65536, only 1880 separable (3%)
n=5, 4G, but << 1% separable ... bad news!
Existing methods may learn some non-separable functions,
but most functions cannot be learned !
Example: n-bit parity problem; many papers in top journals.
No off-the-shelf systems are able to solve such problems.
For all parity problems SVM is below base rate!
Such problems are solved only by special neural architectures or
special classifiers – if the type of function is known.
 n 
But parity is still trivial ... solved by y  cos    bi 
 i 1 
RBF for XOR
Is RBF solution with 2 hidden Gaussians nodes possible?
Typical architecture: 2 input – 2 Gaussians – 1 linear output, ML training
50% errors, but there is perfect separation - not a linear separation!
Network knows the answer, but cannot say it ...
Single Gaussian output node may solve the problem.
Output weights provide reference hyperplanes (red and green
lines), not the separating hyperplanes like in case of MLP.
3-bit parity
For RBF parity problems are difficult; 8 nodes solution:
1) Output activity;
2) reduced output,
summing activity of 4
nodes.
3) Hidden 8D space
activity, near ends of
coordinate versors.
4) Parallel coordinate
representation.
8 nodes solution has zero generalization, 50% errors in L1O.
3-bit parity in 2D and 3D
Output is mixed, errors are at base level (50%), but in the
hidden space ...
Conclusion: separability in the hidden space is perhaps too much to
desire ... inspection of clusters is sufficient for perfect classification;
add second Gaussian layer to capture this activity;
train second RBF on the data (stacking), reducing number of clusters.
Goal of learning
If simple topological deformation of decision borders is sufficient
linear separation is possible in higher dimensional spaces,
“flattening” non-linear decision borders; this is frequently the case
in pattern recognition problems.
RBF/MLP networks with one hidden layer solve the problem.
For complex logic this is not sufficient; networks with localized
functions need exponentially large number of nodes.
Such situations arise in AI reasoning problems, real perception, object
recognition, text analysis, bioinformatics ...
Linear separation is too difficult, set an easier goal.
Linear separation: projection on 2 half-lines in the kernel space:
line y=WX, with y<0 for class – and y>0 for class +.
Simplest extension: separation into k-intervals.
For parity: find direction W with minimum # of intervals, y=W.X
3D case
3-bit functions: X=[b1b2b3], from [0,0,0] to [1,1,1]
f(b1,b2,b3) and f(b1,b2,b3) are symmetric (color change)
8 cube vertices, 28=256 Boolean functions.
0 to 8 red vertices: 1, 8, 28, 56, 70, 56, 28, 8, 1 functions.
For arbitrary direction W index projection W.X gives:
k=1 in 2 cases, all 8 vectors in 1 cluster (all black or all white)
k=2 in 14 cases, 8 vectors in 2 clusters (linearly separable)
k=3 in 42 cases, clusters B R B or W R W
k=4 in 70 cases, clusters R W R W or W R W R
Symmetrically, k=5-8 for 70, 42, 14, 2.
Most logical functions have 4 or 5-separable projections.
Learning = find best projection for each function.
Number of k=1 to 4-separable functions is: 2, 102, 126 and 26
126 of all functions may be learned using 3-separability.
4D case
4-bit functions: X=[b1b2b3b4], from [0,0,0,0] to [1,1,1,1]
16 cube vertices, 216=65636=64K functions.
Random initialization of a single perceptron has 39.2% chance of
creating 8 or 9 clusters for the 4-bit data.
Learning optimal directions W finds:
k=1 in 2 cases, all 16 vectors in 1 cluster (all black or all white)
k=2 in 2.9% cases (or 1880), 16 vectors in 2 clusters (linearly sep)
k=3 in 22% of all cases, clusters B R B or W R W
k=4 in 45% of all cases, clusters R W R W or W R W R
k=5 in 29% of all cases.
Hypothesis: for n-bits highest k=n+1 ?
For 5-bits there are 32 vertices and already 232=4G=4.3.109 functions.
Most are 5-separable, less than 1% is linearly separable!
Biological justification
•
•
•
•
•
•
•
Cortical columns may learn to respond to stimuli with complex logic
resonating in different way.
The second column will learn without problems that such different
reactions have the same meaning: inputs xi and training targets yj.
are same => Hebbian learning DWij ~ xi yj => identical weights.
Effect: same line y=W.X projection, but inhibition turns off one
perceptron when the other is active.
Simplest solution: oscillators based on combination of two neurons
s(W.X-b) – s(W.X-b’) give localized projections!
We have used them in MLP2LN architecture for extraction of logical
rules from data.
Note: k-sep. learning is not a multistep output neuron, targets are
not known, same class vectors may appear in different intervals!
We need to learn how to find intervals and how to assign them to
classes; new algorithms are needed to learn it!
Network solution
Can one learn a simplest model for arbitrary Boolean function?
2-separable (linearly separable) problems are easy;
non separable problems may be broken into k-separable, k>2.
s(by+q1)
X1
X2
y=W.X
X3
X4
Blue: sigmoidal
neurons with threshold,
brown – linear neurons.
+
1

1
s(by+q2)
+
1
+
1
+
1
+
1
+
1

1
s(by+q4)
Neural architecture for
k=4 intervals, or
4-separable problems.
k-sep learning
Try to find lowest k with good solution:
• Assume k=2 (linear separability), try to find a good solution;
•
MSE error criterion
E  W,q     y  X; W   C  X  
2
X
• if k=2 is not sufficient, try k=3; two possibilities are C+,C,C+ and
C, C+, C this requires only one interval for the middle class;
• if k<4 is not sufficient, try k=4; two possibilities are C+, C, C+, C
and C, C+, C, C+ this requires one closed and one open interval.
Network solution  to minimization of specific cost function.
E  W, 1 , 2     y  X; W   C  X    1  1  C  X   y  X; W 
2
X
X
2  C  X  y  X; W 
X
First term = MSE, second penalty for “impure” clusters, third term =
reward for the large clusters.
A better solution?
What is needed to learn data with complex logic?
• cluster non-local areas in the X space, use W.X
• capture local clusters after transformation, use G(W.Xq)
SVMs fail because the number of directions W that should be
considered grows exponentially with the size of the problem n.
What will solve it?
1. A class of constructive neural network solution with G(W.Xq)
functions with special training algorithms.
2. Maximize the leave-one-out error after projection: take localized
function G, count in a soft way cases from the same class as X.
Q  W    G  W   X  X '    2  CX , CX'   1
X
X'
Projection may be done directly to 1D, 2D or higher.
Examples: parity, monks.
Parity n=9
Simple gradient learning; quality index shown below.
Learning hard functions
Training almost perfect for parity, with linear growth in the number of
vectors for k-sep. solution created by the constructive neural algorithm.
Real data
Simple data – similar results, but much simpler models.
Its good to have wide interests ...
You know,
Christopher,
I am fed up
with honey
Mental models
Easy reasoning A=>B, B=>C, so A=>C
• All mammals suck milk.
• Humans are mammals.
• => Humans suck milk.
... but almost no-one can draw conclusion from:
• All academics are scientist.
• No wise men is an academic.
• What can we say about wise men and scientists?
Surprisingly only ~10% of students get it right.
No simulations explaining why some mental models are difficult?
Scaling-up to human level ...
Recent discussions:
• Roadmap to human level intelligence + session – WCCI 2006/08
• Cognitive Systems, IJCNN panel 2007; AGI conference 2008 ...
• Books: Challenges to CI (with J Mandziuk); Roadmap (with J Taylor);
Meta-learning (with Jankowski/Grabczewski);
neurocognitive informatics.
Neuromorphic, mesoscopic, hybrid neuro-symbolic architectures?
Designs for artificial brains, blueprints for billion neuron cortex models,
scalable neuromorphic approaches.
What is the role of CI in brain-like computing systems?
Is human-style creativity using CI possible?
We are not far from cognitive/affective architectures, artificial minds
with human characteristics integrating perception, affect and cognition,
large-scale semantic memories, implementing control/attention.
Work like a horse
but never loose your enthusiasm!
Summary
• CI is a branch of science dealing with problems for which effective
•
•
•
•
•
•
•
algorithms do not exist; it includes AI, machine learning and all the rest.
Similarity-based framework enables meta-learning as search in the model
space, heterogeneous systems add fine granularity.
Transformation-based learning extends that, formalizing component-based
approach to DM, automating discovery of interesting models.
Many known and new learning methods result from such framework, but
neurocognitive inspirations extend it much further.
No off-shelf classifiers are able to learn difficult Boolean functions.
Visualization of activity of the hidden neurons shows that frequently perfect
but non-separable solutions are found despite base-rate outputs.
Linear separability is not the best goal of learning, other targets that allow
for easy handling of final non-linearities should be defined.
Simplest extension is to isolate non-linearity in form of k intervals, breaking
the non-separable class of problems into k-separable classes.
Many interesting new ideas arise from this line of thinking.
Thank
you
for
lending
your
ears
...
Google: W. Duch => Papers & presentations
See also http:www.e-nns.org