MBG404_LS_07


Transcript MBG404_LS_07

Computational Biology
Classification
(some parts taken from
Introduction to Data Mining
by
Tan, Steinbach, Kumar)
Lecture Slides Week 10
MBG404 Overview
Processing
Pipelining
Generation
Data
Storage
Mining
Data Mining
• Data mining
– noun Digital Technology.
The process of collecting, searching through,
and analyzing a large amount of data in a database,
as to discover patterns or relationships:
e.g.: the use of data mining to detect fraud.
• Machine learning
a branch of artificial intelligence, concerning the construction and study
of systems that can learn from data. For example, a machine learning
system could be trained on email messages to learn to distinguish
between spam and non-spam messages. After learning, it can then be
used to classify new email messages into spam and non-spam folders.
Application in Biology
• Data exploration
– Microarrays
– Next generation sequencing
• Prediction
– microRNAs
– Protein secondary structure
Parametrization
Types of Attributes
• There are different types of attributes
– Nominal
• Examples: ID numbers, eye color, zip codes
– Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1–10), grades, height in {tall, medium, short}
– Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit
– Ratio
• Examples: temperature in Kelvin, length, time, counts
Data Type
• Distinctness (nominal)
– Equal or unequal
• Order (ordinal)
– >,<,>=,<=
• Addition (interval)
– +,-
• Multiplication (ratio)
– *,/
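The interval-vs-ratio distinction is easiest to see with temperature. A minimal pure-Python sketch (the temperature values are made up for illustration):

```python
# Interval scale (Celsius): differences are meaningful, ratios are not
c1, c2 = 10.0, 20.0
diff_c = c2 - c1        # a 10-degree difference is meaningful
ratio_c = c2 / c1       # 2.0, but 20 C is not "twice as hot" as 10 C

# Ratio scale (Kelvin): a true zero point makes ratios meaningful
k1, k2 = c1 + 273.15, c2 + 273.15
ratio_k = k2 / k1       # ~1.035, the physically meaningful ratio
print(diff_c, round(ratio_k, 3))
```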
Data Quality
• Missing data
• Noise
• False measurements
• Outliers
• Duplicate data
• Precision
• Bias
• Accuracy
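Precision and bias can be estimated from repeated measurements of a quantity whose true value is known. A small sketch using hypothetical measurement values:

```python
import statistics

# Repeated measurements of a quantity with a known true value (hypothetical data)
true_value = 100.0
measurements = [101.2, 100.8, 101.0, 100.9, 101.1]

mean = statistics.mean(measurements)
precision = statistics.stdev(measurements)  # spread of repeated measurements
bias = mean - true_value                    # systematic deviation from the truth
print(round(precision, 3), round(bias, 3))
```

Here the instrument is precise (small spread) but biased (reads about 1.0 too high); accuracy reflects both.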
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
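Two of these steps are easy to show concretely. A sketch of discretization and binarization, with made-up cut points and categories:

```python
# Discretization: map a continuous attribute onto ordinal bins (cut points are hypothetical)
def discretize(income_k):
    if income_k < 80:
        return "low"
    if income_k < 120:
        return "medium"
    return "high"

# Binarization: one-hot encode a nominal attribute
def binarize(marital_status, categories=("Single", "Married", "Divorced")):
    return [1 if marital_status == c else 0 for c in categories]

print(discretize(95), binarize("Married"))  # medium [0, 1, 0]
```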
Aggregation
[Figure: Variation of Precipitation in Australia: standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
Dimensionality Reduction: PCA
[Figure: data reconstructed from 10, 40, 80, 120, and 160 principal components; original dimensionality is 206]
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Similarity
• Euclidean distance
• Simple matching coefficient
– Jaccard coefficient
• Correlation
• Cosine similarity
• ...
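Three of these measures in a minimal pure-Python sketch (vectors are illustrative; the Jaccard coefficient here is for binary attributes):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def jaccard(p, q):
    # binary vectors: 1/1 matches over positions where either object has a 1
    m11 = sum(1 for pi, qi in zip(p, q) if pi == qi == 1)
    either = sum(1 for pi, qi in zip(p, q) if pi == 1 or qi == 1)
    return m11 / either

def cosine(p, q):
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm

print(euclidean([0, 0], [3, 4]))            # 5.0
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```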
Curse of Dimensionality
• When dimensionality increases, data become sparse; ways to cope:
– Feature selection
– Dimensionality reduction
– Principal component analysis
– Aggregation
– Mapping of data to a different space
Sampling
• Dividing samples into
– Training set
– Test set
• Optionally using only a subset of the samples in each set
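The split can be sketched in a few lines of pure Python (the 70/30 ratio and seed are arbitrary choices for illustration):

```python
import random

def train_test_split(samples, test_fraction=0.3, seed=42):
    """Shuffle the samples, then divide them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # training set, test set

train, test = train_test_split(list(range(10)))
print(len(train), len(test))  # 7 3
```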
Classification
• Examples with known classes (labels)
• Learn rules of how the attributes define the classes
• Classify unknown samples into the appropriate class
Classification Workflow
End Theory I
• 5 min Mindmapping
• 10 min Break
Practice I
Exploring Data (Irises)
• Download the file Iris.txt
• Follow along
Exploring Data
• Frequencies
• Percentiles
• Mean, Median
• Visualizations
Data Selection
• Selecting columns
• Filtering rows
Data Transformation
• Discretize
• Continuize
• Feature construction
Visualizations
End Practice I
• Break 15 min
Theory II
Classification Workflow
Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Training Set → Learning algorithm (Induction) → Learn Model → Model

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Test Set → Apply Model (Deduction) → predicted Class
Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund?
  Yes → NO
  No → MarSt?
    Single, Divorced → TaxInc?
      < 80K → NO
      > 80K → YES
    Married → NO
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General Procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled by the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
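The three cases above map directly onto a recursive function. A minimal sketch in pure Python, using toy records and hypothetical attribute tests loosely modeled on the Refund/Income example:

```python
def hunt(records, tests, default_class="No"):
    """Sketch of Hunt's recursion. records: (features, label) pairs;
    tests: list of splitting functions mapping features to a branch key."""
    if not records:                        # empty Dt: leaf with the default class
        return default_class
    labels = [label for _, label in records]
    if len(set(labels)) == 1:              # pure Dt: leaf labeled with that class
        return labels[0]
    if not tests:                          # no attribute tests left: majority class
        return max(set(labels), key=labels.count)
    test, rest = tests[0], tests[1:]
    branches = {}
    for rec in records:                    # split Dt by the attribute test
        branches.setdefault(test(rec[0]), []).append(rec)
    return {key: hunt(subset, rest, default_class) for key, subset in branches.items()}

# Toy records: (refund, income in K) -> cheat?
data = [(("Yes", 125), "No"), (("No", 95), "Yes"), (("No", 60), "No")]
tree = hunt(data, [lambda f: f[0], lambda f: "<80K" if f[1] < 80 else ">=80K"])
print(tree)  # {'Yes': 'No', 'No': {'>=80K': 'Yes', '<80K': 'No'}}
```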
(The slide repeats the training data table above; Dt marks the records reaching the node currently under consideration.)
Hunt’s Algorithm
Applied to the training data above, the tree grows in stages:

1. Start with a single leaf: Don’t Cheat
2. Split on Refund:
   Yes → Don’t Cheat
   No → Don’t Cheat
3. Refine the Refund = No branch with Marital Status:
   Single, Divorced → Cheat
   Married → Don’t Cheat
4. Refine the Single/Divorced branch with Taxable Income:
   < 80K → Don’t Cheat
   >= 80K → Cheat
How to Find the Best Split
Before splitting: the node holds N00 records of class C0 and N01 records of class C1, with impurity M0.
• Split A? produces node N1 (N10 of C0, N11 of C1; impurity M1) and node N2 (N20 of C0, N21 of C1; impurity M2); their weighted impurity is M12.
• Split B? produces node N3 (N30, N31; impurity M3) and node N4 (N40, N41; impurity M4); their weighted impurity is M34.
• Gain = M0 – M12 vs M0 – M34: choose the split with the larger gain.
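With GINI as the impurity measure M, the comparison looks like this (the class counts are hypothetical):

```python
def gini(counts):
    """GINI impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Parent node: 6 records of C0, 6 of C1 (hypothetical counts)
m0 = gini([6, 6])                                   # 0.5

# Split A -> children with counts [4, 2] and [2, 4]; weighted impurity M12
m12 = (6 / 12) * gini([4, 2]) + (6 / 12) * gini([2, 4])
# Split B -> children with counts [6, 1] and [0, 5]; weighted impurity M34
m34 = (7 / 12) * gini([6, 1]) + (5 / 12) * gini([0, 5])

gain_a, gain_b = m0 - m12, m0 - m34
print(round(gain_a, 3), round(gain_b, 3))  # split B has the larger gain
```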
Underfitting and Overfitting
• Underfitting: the model is too simple, so both training and test errors are large
• Overfitting: the model is too complex, so training error keeps falling while test error rises
Overfitting due to Noise
Decision boundary is distorted by noise point
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region:
– an insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Cost Matrix

C(i|j)                        PREDICTED CLASS
                              Class=Yes     Class=No
ACTUAL CLASS    Class=Yes     C(Yes|Yes)    C(No|Yes)
                Class=No      C(Yes|No)     C(No|No)

C(i|j): cost of misclassifying a class j example as class i
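The total cost of a model is the cost matrix applied cell-by-cell to its confusion counts. A sketch with made-up costs and counts, keyed as (predicted, actual) to match the C(i|j) convention:

```python
# C(i|j): cost of predicting class i when the actual class is j (hypothetical values)
cost = {("Yes", "Yes"): -1, ("No", "Yes"): 100,
        ("Yes", "No"): 1, ("No", "No"): 0}

# Confusion counts from a hypothetical model, same (predicted, actual) keys
counts = {("Yes", "Yes"): 150, ("No", "Yes"): 60,
          ("Yes", "No"): 40, ("No", "No"): 250}

total_cost = sum(cost[key] * counts[key] for key in cost)
print(total_cost)  # -150 + 6000 + 40 + 0 = 5890
```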
Cost-Sensitive Measures
With a = C(Yes|Yes), b = C(No|Yes), c = C(Yes|No), d = C(No|No) as the confusion-matrix counts:

Precision (p) = a / (a + c)
Recall (r) = a / (a + b)
F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

• Precision is biased towards C(Yes|Yes) & C(Yes|No)
• Recall is biased towards C(Yes|Yes) & C(No|Yes)
• F-measure is biased towards all except C(No|No)

Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
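A quick numeric check of these formulas, with hypothetical confusion-matrix counts, including the equivalence of the two F-measure forms:

```python
# Confusion-matrix cells: a = TP, b = FN, c = FP, d = TN (hypothetical counts)
a, b, c, d = 40, 10, 20, 30

precision = a / (a + c)                              # 40/60
recall = a / (a + b)                                 # 40/50
f_measure = 2 * recall * precision / (recall + precision)

# The closed form from the slide gives the same value: 2a / (2a + b + c)
assert abs(f_measure - 2 * a / (2 * a + b + c)) < 1e-12
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```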
Receiver Operating Characteristic (ROC) Curve
(TP,FP):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of the true
class
Using ROC for Model Comparison
• No model consistently outperforms the other
– M1 is better for small FPR
– M2 is better for large FPR
• Area Under the ROC Curve (AUC)
– Ideal: area = 1
– Random guess: area = 0.5
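Given a classifier's ROC points, the AUC can be approximated with the trapezoidal rule. A sketch with hypothetical (FPR, TPR) points:

```python
# ROC points (FPR, TPR) for a hypothetical classifier, sorted by FPR
roc = [(0.0, 0.0), (0.1, 0.5), (0.3, 0.8), (0.6, 0.95), (1.0, 1.0)]

# Area under the curve by the trapezoidal rule over consecutive points
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(roc, roc[1:]))
print(round(auc, 4))  # well above the 0.5 of random guessing
```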
End Theory II
• 5 min Mindmapping
• 10 min Break
Practice II
Learning
• Supervised learning (classification)
– Decision tree
– SVM
Classification
• Use the iris.txt file for classification
• Follow along as we classify
Classification
• Use the orangeexample file for classification
• We are interested in whether we can distinguish between miRNAs and random sequences with the selected features
• Try it yourself