Machine Learning Topics and Weka Software



Computational Intelligence in
Biomedical and Health Care Informatics
HCA 590 (Topics in Health Sciences)
Rohit Kate
Machine Learning: Some Topics
and Weka Software
Learning Curves
• Train the classifier with increasing numbers of training
examples and plot accuracy vs. size of training set
• Helps to answer:
– Has maximum accuracy nearly been reached, or will more
training examples help?
– Is one technique better when training data is limited?
• Most learners eventually converge to the maximum accuracy
given sufficient training examples
[Figure: learning curves for Method 1 and Method 2. x-axis: # training examples; y-axis: test accuracy (0 to 100%), with the maximum accuracy marked.]
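A learning curve of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the slide's figure: the synthetic one-feature data, the nearest-centroid learner, and all function names are assumptions made for the example.

```python
import random

def make_example(rng):
    """Draw one synthetic example: class 0 is centered at 0.0, class 1 at 2.0."""
    label = rng.randint(0, 1)
    return rng.gauss(2.0 * label, 1.0), label

def train_centroids(examples):
    """Toy nearest-centroid learner: store the mean feature value per class."""
    sums, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for x, y in examples:
        sums[y] += x
        counts[y] += 1
    # A class with no training examples yet falls back to a centroid of 0.0.
    return {c: (sums[c] / counts[c] if counts[c] else 0.0) for c in (0, 1)}

def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def accuracy(centroids, examples):
    return sum(predict(centroids, x) == y for x, y in examples) / len(examples)

def learning_curve(sizes, seed=0, n_test=2000):
    """Train on the first n examples for each n and score on a fixed test set."""
    rng = random.Random(seed)
    train_pool = [make_example(rng) for _ in range(max(sizes))]
    test_set = [make_example(rng) for _ in range(n_test)]
    return [(n, accuracy(train_centroids(train_pool[:n]), test_set)) for n in sizes]

curve = learning_curve([2, 5, 10, 50, 200, 1000])
for n, acc in curve:
    print(f"{n:5d} training examples -> test accuracy {acc:.3f}")
```

Plotting the printed pairs (training-set size on the x-axis, test accuracy on the y-axis) gives the learning curve. On this synthetic data the classes overlap, so accuracy levels off below 100%.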
Comparing Learning Curves
• The gap between two learning curves usually has a “banana shape”
• Often a better picture emerges if learning
curves are compared “horizontally” instead of
“vertically”
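[Figure: learning curves for Method 1 and Method 2. x-axis: # training examples (300 and 600 marked); y-axis: test accuracy (85% and 100% marked), with the maximum accuracy marked. Annotation: Method 1 can achieve 85% accuracy with half the training data needed by Method 2!]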
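The "horizontal" comparison asks how many training examples each method needs to reach a fixed accuracy. The sketch below is one way to read that off two curves. The helper `examples_needed` is hypothetical, linear interpolation between measured points is an assumption, and the curve points are invented so that they reproduce the 300 vs. 600 example above.

```python
def examples_needed(curve, target):
    """Return the (interpolated) training-set size at which `curve` first
    reaches `target` accuracy, or None if it never does.

    `curve` is a list of (num_training_examples, test_accuracy) pairs,
    sorted by training-set size.
    """
    prev_n, prev_acc = None, None
    for n, acc in curve:
        if acc >= target:
            if prev_n is None:
                return float(n)  # target already met at the smallest size
            # Linear interpolation between the two surrounding points.
            frac = (target - prev_acc) / (acc - prev_acc)
            return prev_n + frac * (n - prev_n)
        prev_n, prev_acc = n, acc
    return None

# Invented curve points, for illustration only (not measured results).
method_1 = [(100, 0.70), (200, 0.80), (400, 0.90), (800, 0.93)]
method_2 = [(100, 0.55), (200, 0.65), (400, 0.75), (800, 0.95)]

needed_1 = examples_needed(method_1, 0.85)
needed_2 = examples_needed(method_2, 0.85)
print(f"Method 1 needs about {needed_1:.0f} examples, Method 2 about {needed_2:.0f}")
```

A "vertical" comparison at 800 examples would show only a small accuracy difference between these invented curves, while the horizontal view shows a factor of two in the data required to reach 85%.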
Datasets
• Datasets are important for empirically
evaluating machine learning techniques
• It is important to test techniques on a variety of
domains. Testing on 20+ datasets is
common.
• A variety of datasets are freely available:
– UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– KDD Cup (large data sets for data mining)
http://www.kdnuggets.com/datasets/kddcup.html
Which is the Best Machine
Learning Technique?
• There is no single machine learning technique that
performs better than every other technique on every
dataset
• One can always come up with a dataset on which a
particular machine learning technique will do
miserably
– Flip its predictions and call them the correct answers
• As such, there is no basis for preferring one label over
another when classifying a never-before-seen test
example, even after seeing a lot of training data
– Its label is unknown, so it could be anything!
• Hence every machine learning technique makes some
assumptions (“bias”) that help it generalize from
training data to test data
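The "flip its predictions" argument can be made concrete with a small sketch. The two toy learners, the training points, and the unseen inputs below are all invented for illustration. The point is only that any deterministic classifier scores 0% on a test set labeled with the opposite of its own predictions.

```python
def majority_learner(train):
    """Toy learner: always predict the most common training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def threshold_learner(train):
    """Toy learner: predict 1 when x is above the mean training x, else 0."""
    mean_x = sum(x for x, _ in train) / len(train)
    return lambda x: int(x > mean_x)

def adversarial_test_set(classifier, inputs):
    """Label each unseen input with the opposite of the classifier's prediction."""
    return [(x, 1 - classifier(x)) for x in inputs]

def accuracy(classifier, examples):
    return sum(classifier(x) == y for x, y in examples) / len(examples)

# Invented binary-labeled training data and unseen inputs.
train = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.8, 1)]
unseen_inputs = [0.05, 0.3, 0.55, 0.7, 0.95]

results = {}
for learner in (majority_learner, threshold_learner):
    clf = learner(train)
    hard_set = adversarial_test_set(clf, unseen_inputs)
    results[learner.__name__] = accuracy(clf, hard_set)
    print(learner.__name__, "accuracy on its adversarial set:", results[learner.__name__])
```

Each learner gets its own adversarial dataset, and the training data rules neither of them out. That is why some assumption about how training data relates to test data is needed.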
Which is the Best Machine
Learning Technique?
• Depending upon how well the assumptions of a
machine learning technique hold in a given
dataset, some techniques perform better than
others
Assumptions:
– Naïve Bayes & Bayesian networks: Conditional
independence assumptions
– SVM & NN: A hyperplane can separate the
examples
– Decision Trees: Some feature values separate the
examples
Training Data
• Training data is critical for applying any machine
learning technique
• Obtaining it is often the most difficult part
– Availability of data, particularly medical data
– Obtaining correct labels is often done manually by
experts, which is expensive and labor-intensive
• As learning curves show, “more data is better
data”
– But it is expensive to get more training data
• Some approaches have been designed to
compensate for the lack of training data
Various Forms of Supervision
• If all the training data have correct labels, then it is
called supervised learning
• Some methods also utilize unlabeled training data in
addition to the labeled data; this is called semi-supervised learning
– Most learning methods can be extended to leverage
unlabeled training data
– Predict labels for the unlabeled examples, treat them as the
correct labels, and train again; iterate a few times
• Often helps as if by magic!
• Some methods, like clustering examples into groups,
learn completely unsupervised, but they are useful only
in limited situations
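The predict-then-retrain loop described for semi-supervised learning (often called self-training) can be sketched as follows. The one-feature nearest-centroid learner, the function names, and the data points are assumptions chosen to keep the example small. Any base learner could take its place.

```python
def train_centroids(labeled):
    """Toy nearest-centroid learner on 1-D features: mean x for each label."""
    by_label = {}
    for x, y in labeled:
        by_label.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_label.items()}

def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

def self_train(labeled, unlabeled, iterations=3):
    """Label the unlabeled pool with the current model, retrain, and repeat."""
    centroids = train_centroids(labeled)
    for _ in range(iterations):
        # Treat the model's own predictions as if they were correct labels.
        pseudo_labeled = [(x, predict(centroids, x)) for x in unlabeled]
        centroids = train_centroids(labeled + pseudo_labeled)
    return centroids

# Invented data: two labeled examples and a pool of unlabeled ones.
labeled = [(0.0, "neg"), (1.0, "pos")]
unlabeled = [-1.2, -0.8, -0.5, -0.1, 2.1, 2.4, 2.9, 3.3]

supervised = train_centroids(labeled)
semi = self_train(labeled, unlabeled)
print("supervised centroids:     ", supervised)
print("semi-supervised centroids:", semi)
```

With only the two labeled points, the decision boundary sits halfway between them. After self-training, the centroids move toward the two clusters in the unlabeled pool, so the boundary shifts into the gap between those clusters. Whether that shift helps depends on whether the clusters really correspond to the classes.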
Weka: The Most Well-Known
Machine Learning Software
• Freely available
• Includes several machine learning techniques
• Download from the website:
http://www.cs.waikato.ac.nz/ml/weka/
• A tutorial (only classification part):
http://prdownloads.sourceforge.net/weka/weka.ppt
ARFF Format for Data
• Once the data is in the ARFF format (attribute-relation file
format), you can play with several machine learning
techniques using Weka!
• See Weka tutorial slides 5 & 6
• More description of the ARFF format:
http://weka.wikispaces.com/ARFF+%28book+version%29
• Plain text file (use notepad etc. to open or create)
• Save with .arff extension
• See several examples:
http://repository.seasr.org/Datasets/UCI/arff/
• Comments after ‘%’ character
• Unknowns marked by ‘?’
• If the last attribute is nominal, then it is a classification task;
if it is numeric, then it is a regression task
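To make the format concrete, the sketch below embeds a tiny ARFF file as a Python string and reads it with a deliberately simplified parser. The dataset (relation name, attributes, and rows) is invented, and `parse_arff` and `task_type` are hypothetical helpers. The parser assumes comments occupy whole lines and does not handle quoted values, sparse data, or date attributes. Weka's own loader should be used for real files.

```python
ARFF_TEXT = """\
% Toy dataset invented for illustration
@relation patients

@attribute age numeric
@attribute smoker {yes,no}
@attribute diagnosis {healthy,sick}

@data
63,yes,sick
41,no,healthy
?,no,healthy
"""

def parse_arff(text):
    """Simplified ARFF reader: returns (attributes, rows)."""
    attributes, rows, in_data = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            # '?' marks an unknown value; represent it as None.
            rows.append([None if v.strip() == "?" else v.strip()
                         for v in line.split(",")])
        elif line.lower().startswith("@attribute"):
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows

def task_type(attributes):
    """Classify the task by the type of the last attribute."""
    last_type = attributes[-1][1]
    if last_type.startswith("{"):  # nominal: an explicit list of values
        return "classification"
    if last_type.lower() in ("numeric", "real", "integer"):
        return "regression"
    return "other"

attributes, rows = parse_arff(ARFF_TEXT)
print(attributes)
print(rows)
print("Task:", task_type(attributes))
```

Saving the text in `ARFF_TEXT` to a file with the .arff extension gives a file that can be opened in Weka. Changing the last attribute to `numeric` (and its values to numbers) would turn it into a regression task.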