
A very naïve introduction to a few methods of Machine Learning
What students should remember about their projects
1. You have to understand a method in order to verify the correct behavior of your tool.
2. You have to explain the method that you use in order to get a good evaluation or a paper published.
3. There is little value in being only a "blind user" of an ML tool.
4. Blind use can lead to errors and completely nonsensical behavior of your tool. It can be dangerous.
5. We found that some university tools sometimes give bad solutions.
6. We have to check the commercial tools as well.
MACHINE LEARNING
• Orange is a component-based data mining and machine learning software suite with a visual programming front end for exploratory data analysis and visualization, and Python libraries for scripting.
• It has components for:
  • data preprocessing,
  • feature scoring and filtering,
  • modeling,
  • model evaluation,
  • and exploration techniques,
  and is implemented in C++ and Python.
• Orange is maintained and developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, University of Ljubljana, Slovenia.
Orange System
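For students who prefer scripting to the visual canvas, here is a minimal sketch of the Orange scripting library (assuming Orange 3 is installed as the Python package `Orange`; the bundled "iris" dataset and the tree learner are just illustrative choices, not part of these slides):

```python
# Minimal sketch, assuming the Orange 3 scripting library ("pip install orange3").
import Orange

data = Orange.data.Table("iris")                 # a bundled example dataset
learner = Orange.classification.TreeLearner()    # a decision-tree learner
model = learner(data)                            # train on the whole table

predictions = model(data)                        # predicted class indices
accuracy = (predictions == data.Y).mean()        # compare with the true labels
print("training accuracy:", accuracy)
```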
• To illustrate the method of machine learning, I have made up the following story.
• Example problem statement:
  • Creatures are being separated into two categories: creatures from the Moon and creatures from Mars.
  • Sixteen thousand Moonians and Martians together landed on Earth at a stadium.
  • Not everyone understands their languages.
  • To communicate with them easily, they have to be separated.
  • The people at the stadium gates understand neither language, so they take pictures that will let them determine which creatures they are.
  • We have met only six of these creatures in the past: three Moonians and three Martians.
  • Using their features, I will create a machine that will be able to create decision boundaries to separate Martians from Moonians. Figure 6 shows six pictures of these creatures.
  • Based on their features, I am going to separate them into Martians and Moonians.
  • I used a Karnaugh Map (K-map) to find a function that would let me separate the Martians from the Moonians.
Principle of Most Methods for Machine Learning
1. You have to understand the relation between logic circuits and Machine Learning.
2. You have to understand the generalization from 2-valued logic to n-valued logic.
3. You have to appreciate the general format of Machine Learning with a Supervisor and its uses.
4. You have to learn how to formulate problems for Supervised Machine Learning.
5. Learn how to use easy tools such as Orange.
Our goal is to combine ideas from traditional machine learning
and modern machine learning to create superior classifiers!
Traditional Machine Learning
• Apply the Occam Razor principle.
• Simple is better if it explains the data equally well.

Modern Machine Learning
• Predict where the unknown zeros and ones are located.
• Learn for the best prediction of zeros and ones.
• But still use the Occam Razor.
Visualization
[Figure: a K-map and a 3-D Euclidean cube over variables a, b, c; the corner a=1,b=1,c=1 is labeled "beautiful" and the corner a=0,b=0,c=0 is labeled "ugly". A second chart illustrates the generalization from binary to four-valued attributes, with each axis running over the values 0 to 3.]
Data are represented as a table or a truth table. We illustrate them as a Karnaugh Map, and next as a three-dimensional Euclidean space.
Can we separate the ones from the zeros using only one variable, like b=1 in the K-map or b=0.5 in the Euclidean space on the right?
[Figure: the 3-D cube for the K-map on the left, with the corner a=1,b=1,c=1 labeled "beautiful" (1) and the corner a=0,b=0,c=0 labeled "ugly" (0); the cutting hyperplane b=0.5 (b=1 in the K-map) is drawn as a one-dimensional cut; the corners a=1,b=0,c=0 and a=1,b=0,c=1 are also marked.]
• This slide illustrates the 3-D Euclidean space for the K-map on the left.
• The values 0 and 1 are shown.
• The minterms at Hamming Distance 1 from the 0 and from the 1 are shown.
• The minterms close to the 1 are shown in blue; they have Hamming Distance 1.
• The minterms close to the 0 are shown in white.
• The cutting hyperplane b=0.5 is shown; it does not separate all blue minterms from all white minterms.
• The white minterm (point) above the hyperplane is an error, and the blue point below the hyperplane is an error.
Conclusions about the previous slide
• We can separate the single 1 from the single 0, which are our data samples, using a single Boolean function: a=1, b=1, or c=1.
• This would however mean that minterms (cells) at Hamming Distance 1 from the 1 would have value 0, and minterms at Hamming Distance 1 from the 0 would have value 1.
• This would be contrary to the assumption (Vapnik) that ones are close to ones and zeros are close to zeros; or that people similar to ugly people are ugly and people similar to beautiful people are beautiful.
• Thus, the Occam Razor is satisfied by the solutions a=1, b=1, or c=1, but the statistical assumption of Vapnik will not be satisfied.
• If we want to satisfy Vapnik, we need to use the separation function from the next slide.
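A small sketch of this conflict. It assumes, as the next slide derives, that the statistically expected labeling is the majority function ("ones close to ones"); each of the three one-literal Occam solutions then misclassifies two of the eight minterms:

```python
# Sketch: compare the three one-literal Occam solutions with the majority
# function, which encodes the "ones close to ones" (Vapnik-style) assumption.
from itertools import product

majority = lambda a, b, c: int(a + b + c >= 2)
solutions = {"f = a": lambda a, b, c: a,
             "f = b": lambda a, b, c: b,
             "f = c": lambda a, b, c: c}

for name, f in solutions.items():
    errors = sum(f(*m) != majority(*m) for m in product([0, 1], repeat=3))
    print(name, "disagrees with majority on", errors, "of 8 minterms")
```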
This slide shows that, based on Hamming Distance and assuming that the numbers of 0s and 1s are equal, we get the majority function to separate ugly people from beautiful people.
[K-maps: the cell a=b=c=1 is labeled "beautiful" (1) and the cell a=b=c=0 is labeled "ugly" (0); every other cell is marked "HD1 one" or "HD1 zero" according to the nearer labeled cell. Treating HD-i ones as ones and HD-i zeros as zeros fills the map with the majority function.]
If Majority(a,b,c) = 1 then beautiful, otherwise ugly.
The next slide illustrates that we can create a cut with a higher-dimensional hyperplane to separate the ones from the zeros of this majority function.
[Figure: 3-D cube over a, b, c with the majority-function values at the corners; a hyperplane (a three-dimensional cut) separates the "perhaps beautiful" side near a=1,b=1,c=1 from the "perhaps ugly" side near a=0,b=0,c=0.]
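A one-line check of this geometric picture; the threshold 1.5 is the usual choice for a plane a+b+c = 1.5 that realizes the majority function:

```python
# Sketch: the majority function coincides with the half-space a + b + c > 1.5.
from itertools import product

for a, b, c in product([0, 1], repeat=3):
    assert (a + b + c >= 2) == (a + b + c > 1.5)
print("majority(a,b,c) equals the cut a+b+c > 1.5 on all 8 minterms")
```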
Suppose now that we found another beautiful person with attribute values a=1, b=1, c=0. Assuming that the number of ugly people is approximately the same as the number of beautiful people, and using Hamming Distance, we get the following map:
[K-map: the cells a=1,b=1,c=1 and a=1,b=1,c=0 are labeled "beautiful" (1) and the cell a=0,b=0,c=0 is labeled "ugly" (0); the remaining cells are filled as follows:]
• HD=1 from a 0 and HD=2 from a 1, so we assume 0 here.
• HD=2 from a 0 and HD=1 from a 1, so we assume 1 here.
• HD=1 from a 0 and HD=1 from a 1, so we assume "–" here (don't know).
Now the solution is a + b.
[Figure: 3-D cube with the points a=1,b=1,c=1 and a=1,b=1,c=0 marked "beautiful" and a=0,b=0,c=0 marked "ugly"; the separating plane is a two-dimensional cut.]
However, if we assume that the probability of zeros is 6/8 and the probability of ones is 2/8, then the good solution is the one shown on the right, which corresponds to f = ab.
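A small sketch of why the assumed class prior matters here; it simply counts how many of the eight minterms each candidate formula labels as one:

```python
# Sketch: how many of the 8 minterms each candidate formula labels as 1.
from itertools import product

candidates = {"f = a + b (OR)": lambda a, b, c: a or b,
              "f = ab (AND)":  lambda a, b, c: a and b}

for name, f in candidates.items():
    ones = sum(f(*m) for m in product([0, 1], repeat=3))
    print(name, "labels", ones, "of 8 minterms as one")
# With an assumed prior of 2/8 ones, f = ab matches the expected count of ones,
# while f = a + b labels 6 of 8 minterms as one.
```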
It is useful to be able to visualize data and methods in more dimensions to create new algorithms
• Beyond two dimensions it is still easy to draw a 3-dimensional space.
• This is not possible for more dimensions.
• But we can draw Marquand Charts for a few values of logic and a few dimensions (3-5).
• Then we can derive formulas and create methods that we cannot check visually on figures.
• But we can analyze the quality of their behavior on real Machine Learning data.
1. A so-so student project is to use existing tools and data
2. A better student project is to invent a new tool or find/create new ML data
3. The best is to find a new topic and create a new method (rare)
Conclusion on visualization
1. K-maps are good for many Boolean variables.
2. Marquand charts are good for 4-6 ternary variables or mixed variables with a small number of values.
3. Euclidean spaces, two- or three-dimensional, are good for illustrating the geometrical properties of separating planes.
4. You should mix all these types of representations to explain algorithms and illustrate data.
DECISION RULES EXTRACTION
WHAT CAN I LEARN FROM THESE DATA?
1. Assume that I know nothing about what is hidden under the don't cares.
2. Assuming that there is only one beautiful person, as below, is statistically wrong.
3. But if I know that 7/8 of the people are ugly, this is a good choice.
[K-map: one cell labeled "beautiful" (1) and one labeled "ugly" (0); all other cells are filled with 0, i.e. "all others ugly", which is not very reasonable!]
Try to apply the Occam Razor: simple separation rules.
• Rule: if c=1 then beautiful, otherwise ugly.
• This is not very good, because ones should be close to ones and zeros close to zeros.
• [K-map annotations:] We predict a 1 here, but it is at HD 1 from a 0 and at HD 2 from a 1; we predict a 0 here, but it is at HD 1 from a 1 and at HD 2 from a 0.
Conclusion: there are three ways to satisfy the Occam Razor with the same cost (a=1, b=1, or c=1), but all of them are bad from the point of view of predicting zeros and ones.
Now let us try to predict the zeros and ones.
• If I have no additional information, I assume that zeros and ones occur in equal numbers.
• If I have no additional information, I assume that zeros are close to zeros and ones are close to ones.
• Close = small Hamming Distance.
Now let us see what will happen if I add more ones and zeros to the training data.
[K-map: the training data now contain two "ugly" cells (0) and two "beautiful" cells (1). For every empty cell we count its labeled neighbors, e.g. "HD1 zero and HD1 one: no winner", "HD2 one and HD1 zero: zero wins", "HD1 one and HD2 zero: one wins". Treating HD-i ones as ones and HD-i zeros as zeros and filling the cells according to the winners gives the map 0, 0, 0, –, 1, 1, –, 1 (ties remain don't cares).]
Now still use the Occam Razor: if a=1 then beautiful, otherwise ugly.
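A sketch of this filling rule: for each unknown cell, look at the labeled cells at the smallest Hamming Distance; the majority among them decides, and a tie stays a don't care. Since the exact training cells of this slide's map are not reproduced here, the demo reuses the earlier example (ones at 111 and 110, a zero at 000), for which the fill is consistent with the solution a + b:

```python
# Sketch of the "according to winners" rule: for each unknown cell, look at the
# labeled cells at the smallest Hamming Distance; the majority among them
# decides, and a tie is left as a don't care ("-").
from itertools import product

def hd(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def fill_by_nearest_voting(labeled, n_vars=3):
    result = dict(labeled)
    for cell in product([0, 1], repeat=n_vars):
        if cell in labeled:
            continue
        nearest = min(hd(m, cell) for m in labeled)
        votes = [v for m, v in labeled.items() if hd(m, cell) == nearest]
        ones, zeros = votes.count(1), votes.count(0)
        result[cell] = 1 if ones > zeros else 0 if zeros > ones else "-"
    return result

# Illustration on the earlier example: ones at 111 and 110, a zero at 000.
train = {(1, 1, 1): 1, (1, 1, 0): 1, (0, 0, 0): 0}
for cell, value in sorted(fill_by_nearest_voting(train).items()):
    print(cell, "->", value)
```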
This method can be improved in many ways
1. If you know the percentages of ones and zeros, you can create better rules for creating the HD-i zeros and ones.
2. If you know the correlations of variable values, you can create better rules for creating the HD-i zeros and ones.
You can create new Machine Learning methods by creating more complex rules for creating HD-i zeros and HD-i ones.
• CONCLUSION:
• Methods should be created that repeat the calculation probabilistically, many times.
• These methods should make some statistical assumptions about the numbers of zeros and ones and their locations.
• See the next slides.
Probabilistic SOP repeated many times
Comparison of two approaches to Machine Learning, traditional and modern
• We can use classical logic synthesis tools to solve the new problem.
• The tool should create optimal or good solutions, but make random choices if there are many solutions of the same cost.
• Suppose we selected to run the tool a total of 18 times.
• We run an SOP minimizer that gives us the solution F=c seven times, the solution F=b five times, and the solution F=a six times.
• For every cell we add the number of ones from each solution – see Figures B and C.
[K-map panels "For a", "For b", "For c" accumulate the ones contributed by the 6 runs of F=a, the 5 runs of F=b, and the 7 runs of F=c. Adding them per cell and comparing the total number of ones against the total number of zeros (out of 18 runs) gives:]

Minterm (a,b,c)   Total ones     Total zeros   Winner
000               0              18            zero wins (18 > 0)
001               7              11            zero wins (11 > 7)
010               5              13            zero wins (13 > 5)
100               6              12            zero wins (12 > 6)
011               7+5 = 12       6             one wins  (12 > 6)
101               7+6 = 13       5             one wins  (13 > 5)
110               5+6 = 11       7             one wins  (11 > 7)
111               7+5+6 = 18     0             one wins  (18 > 0)

The ones win exactly on the majority minterms, so the learned function is Majority(a,b,c).
Comparison of two approaches to Machine Learning, traditional and modern
• Running the probabilistic tool many times, we use voting for every cell (minterm) and select a 0 or a 1 for it.
• Next we minimize the new function.
• In this case the function is the majority.
• We minimize it and we have a solution that satisfies both Occam and Vapnik.
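A sketch of this per-cell voting, using the run counts from the example above (7 runs returned F=c, 5 returned F=b, 6 returned F=a); it reproduces the table and yields the majority function:

```python
# Sketch: per-cell voting over 18 randomized SOP runs; a cell becomes 1 when
# the accumulated ones outvote the zeros.
from itertools import product

runs = [("c", 7), ("b", 5), ("a", 6)]
total = sum(n for _, n in runs)                  # 18 runs in total

for a, b, c in product([0, 1], repeat=3):
    ones = sum(n for var, n in runs if {"a": a, "b": b, "c": c}[var] == 1)
    zeros = total - ones
    verdict = 1 if ones > zeros else 0 if zeros > ones else "-"
    print((a, b, c), "ones =", ones, "zeros =", zeros, "->", verdict)
```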
SUPPORT VECTOR MACHINES
If we run an SVM on the function on the left, we should get the majority function on the right, shown in red.
• We found that several tools, like Matlab and Orange, will not correctly minimize/classify small functions like this one.
• This shows that SVM should be replaced with something else in these cases.
• Real data are like this.
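A minimal sketch of the intended check (assuming scikit-learn). The slide's figure is not reproduced here, so as an assumption we use the two training samples a=b=c=0 (ugly, 0) and a=b=c=1 (beautiful, 1); a maximum-margin linear SVM should then reproduce the majority function on all eight minterms:

```python
# Sketch, assuming scikit-learn: a linear SVM trained on two opposite corners
# of the cube should place its separating plane at a+b+c = 1.5, i.e. majority.
from itertools import product
from sklearn.svm import SVC

X_train = [[0, 0, 0], [1, 1, 1]]
y_train = [0, 1]

svm = SVC(kernel="linear", C=1e6)       # large C: hard-margin behaviour
svm.fit(X_train, y_train)

for m in product([0, 1], repeat=3):
    pred = svm.predict([list(m)])[0]
    print(m, "->", pred, "(majority =", int(sum(m) >= 2), ")")
```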
The concept of statistical learning, another explanation
• These are the learning data; we select randomly 2 out of the eight samples for learning.
• [One map is filled in assuming the Occam Razor.]
• [Another map is filled in] assuming that:
  1. there are half zeros and half ones,
  2. zeros are close to zeros and ones are close to ones.
K-NN (k-Nearest Neighbors)
This is again a voting method. We want to learn the value of this cell (the point marked with a star).
• If the smallest neighborhood is taken, only the closest point counts, and the star point should be violet.
• If the neighborhood k=6 is taken, the star point should be yellow, as there are four yellow points and two violet points in this neighborhood.
• In the k-NN algorithm with k=1 we look only at the smallest neighborhood, so the star point is classified as violet.
If there are more ones in the neighborhood, the cell should be a one. If there is an equal number of ones and zeros, we look at a larger neighborhood.
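A minimal k-NN sketch matching the slide; the coordinates below are made up for illustration (the real figure is not reproduced here), chosen so that k=1 classifies the star as violet and k=6 classifies it as yellow:

```python
# Hand-rolled k-NN by Euclidean distance with majority voting (illustrative data).
from collections import Counter

points = [((1.0, 1.0), "violet"), ((1.5, 2.5), "violet"),
          ((2.5, 1.5), "yellow"), ((3.0, 2.0), "yellow"),
          ((2.0, 3.0), "yellow"), ((3.5, 3.0), "yellow")]
star = (1.2, 1.2)                      # the cell whose value we want to learn

def knn(query, k):
    dist = lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    nearest = sorted(points, key=lambda pc: dist(pc[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn(star, 1))   # violet  (only the closest point counts)
print(knn(star, 6))   # yellow  (four yellows vs two violets)
```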
VARIANT WITHOUT PRUNING
We want to learn the value of this cell.
• When HD=1 we have the same number of ones and zeros (one each), so there is a tie.
• When HD=2 we have the same number of ones and zeros (three each), so there is a tie.
• When HD=3 we have three zeros and five ones, so the ones win and we put a 1 into the cell in question.
[K-map: the result of learning; here all other decisions can be made based only on HD=1.]
• Observe that we learned the function only in one point (minterm).
• We have no formula for the learned function here.
• Not good for interpretability.
• There is no pruning in this method.
• All positive samples remain positive and negative samples remain negative (the zeros and ones of the examples are preserved).
This can be generalized to a Vapnik-like theory.

VARIANT WITH PRUNING
We want to learn the value of this cell.
• When HD=1 we have the same number of ones and zeros (one each), so there is a tie.
• When HD=2 we have the same number of ones and zeros (three each), so there is a tie.
• When HD=3 we have three zeros and five ones, so the ones win and we put a 1 into the cell in question.
• There is pruning in this method.
• Three zeros (negative samples) become positive (in general, the zeros and ones of the examples are NOT preserved).
[K-map: the learned function.]
• Observe that the learned function violates the values of three zeros.
• Three zeros from the data were treated as noise.
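A generic sketch of the expanding-neighborhood rule used in both variants: grow the Hamming-distance radius until the ones and zeros around the unknown cell are no longer tied. Whether the counts are taken per distance shell or cumulatively is a detail of the figure; the sketch uses cumulative counts, and the training map is a small illustrative stand-in, not the slide's actual data:

```python
# Sketch: expand the Hamming-distance radius until the vote is no longer a tie.
def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def learn_cell(cell, labeled, n_vars):
    for radius in range(1, n_vars + 1):
        votes = [v for m, v in labeled.items() if hamming(m, cell) <= radius]
        ones, zeros = votes.count(1), votes.count(0)
        if ones != zeros:
            return 1 if ones > zeros else 0
    return None                                  # still a tie: leave a don't care

# Tiny illustrative example on 3 variables: ones at 111 and 110, a zero at 000.
labeled = {(1, 1, 1): 1, (1, 1, 0): 1, (0, 0, 0): 0}
print(learn_cell((0, 1, 1), labeled, 3))         # 1: the ones win at radius 1
```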
What will happen if we use an SOP method for the same function?
• SOP gives the same result in this case as k-Nearest Neighbor.
• Check the Decision Tree.
• If many methods give the same solution (or part of it), it gets higher confirmation.
What did we learn about Machine Learning in the most general terms?
• Learning is a replacement of the unknown with the known.
• Learning is a replacement of don't cares (don't knows) with cares.
[Figures: data before learning and data after learning, with one result based on the Occam Razor, one based on Vapnik's theory, and one based on Vapnik's theory AND the Occam Razor.]
Classification Accuracy
Classification error and confusion matrix
• 706 samples were high but were predicted as low.
• 10 samples were low but were predicted as high.
• 10 + 706 = 716 samples were incorrectly classified.
• 2804 samples were low and 706 were high; 3500 samples in total were predicted as low.
• [Confusion matrices are shown for the SVM and Naïve Bayes models.]
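A short sketch of how the accuracy follows from these counts; the two remaining matrix entries are derived from the slide's totals (2804 low, 706 high, 3500 predicted as low), not read off the slide directly:

```python
# Sketch: fill in the confusion matrix from the counts above and compute accuracy.
confusion = {
    ("low",  "low"):  2804 - 10,   # 2794 low samples correctly predicted as low
    ("low",  "high"): 10,          # low predicted as high
    ("high", "low"):  706,         # high predicted as low
    ("high", "high"): 706 - 706,   # no high sample predicted as high
}

total = sum(confusion.values())                                   # 3510 samples
errors = confusion[("low", "high")] + confusion[("high", "low")]  # 716 errors
print("accuracy:", (total - errors) / total)                      # about 0.80
```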
[Slides: the experiment and the models in Orange, using various Machine Learning techniques (Random Forest among them); critical attributes.]
An 11-year-old girl classified data for invasive species in Oregon. She used Orange and Excel (Random Forest).
What was discussed so far
• We covered the following Machine Learning methods so far:
1. Genetic Algorithms and Genetic Programming.
2. Decision Trees.
3. SOP, DNF, or Rule Induction.
4. Random Forests (briefly).
5. k-NN (nearest neighbors).
6. Probabilistic SOP (every method above can be extended to its probabilistic counterpart).
7. Methods can be expanded to multi-valued logic and fuzzy logic.
8. Vapnik-Chervonenkis theory versus the Occam Razor: many methods can be modified according to Vapnik's principles.
9. Machine Learning methods can be combined with one another. Think how, and give examples.
Questions and Problems
1. How to visualize the space of all possible examples?
2. Karnaugh Maps, Marquand Charts, and Euclidean Spaces: compare them; how useful are they?
3. What is the role of visualization in Machine Learning? Give examples.
4. Decision rule extraction based on Hamming Distance or other neighborhood measures.
5. Give an example of a conflict between the Occam Razor and Vapnik's Theory.
6. The Orange tool: what to do with it?
7. Probabilistic SOP versus Occam versus Vapnik.
8. Hamming Distance versus Occam versus Vapnik.
9. Support Vector Machines versus Vapnik.
10. Explain k-NN.
11. How can the k-NN method be generalized? Give examples.
12. Classification Accuracy and the Confusion Matrix. Give your own examples.
13. Decision Trees and Random Forests versus Vapnik's Theory. Give your ideas.