Artificial Intelligence Project #3 : Analysis of Decision Tree Learning


Artificial Intelligence Project #3: Analysis of Decision Tree Learning Using WEKA
May 23, 2006
Introduction
Decision tree learning is a method for approximating discrete-valued target functions
The learned function is represented by a decision tree
Decision trees can also be re-represented as if-then rules to improve human readability
An Example of a Decision Tree
Decision Tree Representation (1/2)
Decision trees classify instances by sorting them down the tree from the root to some leaf node
Node: specifies a test of some attribute
Branch: corresponds to one of the possible values for this attribute
Decision Tree Representation (2/2)
Each path corresponds to a conjunction of attribute tests
Example instance: (Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong)
It follows the path (Outlook=Sunny ∧ Humidity=High), so it is classified as No
Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances
(Outlook=Sunny ∧ Humidity=Normal)
∨ (Outlook=Overcast)
∨ (Outlook=Rain ∧ Wind=Weak)
What is the merit of tree representation?
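As a minimal sketch of the representation described above, the PlayTennis tree can be written as nested dictionaries. The data layout and the function names (`classify`, `paths_to`, `as_rule`) are illustrative assumptions, not part of the project materials. The sketch sorts an instance from the root to a leaf and then reads off the disjunction of conjunctions for one class.

```python
# The tree from the slide as nested dicts (an assumed, illustrative layout).
# Internal node: {"attribute": name, "branches": {value: subtree}}
# Leaf: a class label string ("Yes" or "No").
TREE = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}


def classify(tree, instance):
    """Sort an instance down the tree from the root to a leaf."""
    while isinstance(tree, dict):
        value = instance[tree["attribute"]]
        tree = tree["branches"][value]
    return tree


def paths_to(tree, label, prefix=()):
    """Yield each root-to-leaf path (a conjunction of tests) ending in `label`."""
    if not isinstance(tree, dict):
        if tree == label:
            yield prefix
        return
    for value, subtree in tree["branches"].items():
        yield from paths_to(subtree, label, prefix + ((tree["attribute"], value),))


def as_rule(tree, label):
    """Render the disjunction of conjunctions that predicts `label`."""
    conjunctions = (
        "(" + " ∧ ".join(f"{attr}={val}" for attr, val in path) + ")"
        for path in paths_to(tree, label)
    )
    return " ∨ ".join(conjunctions)


instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(TREE, instance))  # No
print(as_rule(TREE, "Yes"))
```

Each path returned by `paths_to` is one if-then rule, which is why the tree and the rule set carry the same information. Note that Temperature is never tested, so it does not affect the result.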
Appropriate Problems for Decision Tree Learning
Instances are represented by attribute-value pairs
The target function has discrete output values
Disjunctive descriptions may be required
The training data may contain errors (both errors in the classification of the training examples and errors in the attribute values)
The training data may contain missing attribute values
Suitable for classification
Study
Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells, MH Cheok et al., Nature Genetics 35, 2003.
60 leukemia patients → Bone marrow samples → Affymetrix GeneChip arrays → Gene expression data
Gene Expression Data
# of data examples: 120 (60 before treatment, 60 after treatment)
# of genes measured: 12600 (Affymetrix HG-U95A array)
Task: classification between “before treatment” and “after treatment” based on gene expression pattern
Affymetrix GeneChip Arrays
Short oligonucleotides (oligos) are used to detect gene expression levels.
Each gene is probed by a set of short oligos.
Each gene expression level is summarized by
Signal: numerical value describing the abundance of mRNA
A/P call: denotes the statistical significance of the signal
Preprocessing
Remove the genes having more than 60 ‘A’ calls
# of genes: 12600 → 3190
Discretization of gene expression level
Criterion: median gene expression value of each sample
0 (low) and 1 (high)
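The two preprocessing steps can be sketched as follows, using only the Python standard library. The function names, the gene-by-sample list layout, and the tie rule (a value equal to the sample median is treated as low) are assumptions made for illustration. The slides do not specify them, nor whether the median is taken before or after the gene filter.

```python
from statistics import median


def filter_by_absent_calls(calls, max_absent=60):
    """Return indices of genes that have at most `max_absent` 'A' calls.

    calls[g][s] is the A/P call ('A' or 'P') of gene g in sample s.
    """
    return [g for g, row in enumerate(calls) if row.count("A") <= max_absent]


def discretize_by_sample_median(signal):
    """Binarize signal[g][s] against the median of sample s.

    1 (high) if the value is above that sample's median, else 0 (low).
    Ties count as low here, which is an assumption.
    """
    n_samples = len(signal[0])
    medians = [median(row[s] for row in signal) for s in range(n_samples)]
    return [
        [1 if row[s] > medians[s] else 0 for s in range(n_samples)]
        for row in signal
    ]


# Toy data: 3 genes x 4 samples (the real data is 12600 genes x 120 samples).
calls = [["P", "P", "A", "P"],
         ["A", "A", "A", "P"],
         ["P", "A", "A", "P"]]
signal = [[10, 200, 30, 5],
          [50, 100, 20, 7],
          [90, 300, 10, 6]]

kept = filter_by_absent_calls(calls, max_absent=2)
binary = discretize_by_sample_median(signal)
print(kept)
print(binary)
```

With the real data, `max_absent=60` removes genes called absent in more than half of the 120 samples.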
Gene Filtering
Using mutual information
$$I(G; C) = \sum_{G, C} P(G, C) \log \frac{P(G, C)}{P(G)\, P(C)}$$
Estimated probabilities were used.
# of genes: 3190 → 1000
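A minimal sketch of this filtering step estimates the probabilities by relative frequencies over the examples and ranks genes by I(G; C). The function names and the base-2 logarithm are assumptions, since the slides do not state the log base. Value pairs that never occur contribute nothing to the sum.

```python
from collections import Counter
from math import log2


def mutual_information(gene, labels):
    """Estimate I(G; C) in bits from paired discrete observations.

    Probabilities are relative frequencies. Only observed (g, c) pairs are
    summed, so unobserved pairs contribute zero.
    """
    n = len(gene)
    joint = Counter(zip(gene, labels))
    count_g = Counter(gene)
    count_c = Counter(labels)
    return sum(
        (n_gc / n) * log2((n_gc / n) / ((count_g[g] / n) * (count_c[c] / n)))
        for (g, c), n_gc in joint.items()
    )


def top_k_genes(data, labels, k):
    """Return the indices of the k genes with the highest estimated I(G; C).

    data[g] holds the discretized values of gene g across all examples.
    """
    scores = [mutual_information(row, labels) for row in data]
    return sorted(range(len(data)), key=lambda g: scores[g], reverse=True)[:k]


# Toy data: 2 genes x 4 examples, class 0 = after, 1 = before treatment.
labels = [0, 0, 1, 1]
data = [[0, 1, 0, 1],   # independent of the class
        [0, 0, 1, 1]]   # identical to the class
print(top_k_genes(data, labels, k=1))
```

On the project data, the equivalent call would be `top_k_genes(data, labels, k=1000)` over the 3190 discretized genes.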
Final dataset
# of attributes: 1001 (1000 genes plus one for the class)
Class: 0 (after treatment), 1 (before treatment)
# of data examples: 120
Final Dataset
[Figure: data matrix of 120 examples × 1000 genes]
Materials for the Project
Given
Preprocessed microarray data file: data2.txt
Downloadable
WEKA (http://www.cs.waikato.ac.nz/ml/weka/)
Analysis of Decision Tree Learning
Submission
Due date: June 15 (Thu.), 12:00 (noon)
Report: hard copy (301-419) and e-mail
Use ID3, J48, and another decision tree algorithm with learning parameters.
Show the experimental results of each algorithm. Except for ID3, try to find better performance by changing the learning parameters.
Analyze what makes the difference between the selected algorithms.
E-mail : [email protected]