Introduction to data mining
Peter van der Putten
Leiden Institute of Advanced Computer Science
Leiden University
[email protected]
Transcriptomics and Proteomics in Zebrafish workshop, Leiden
University
March 9, 2006
Presentation Outline
• Objective
– Present the basics of data mining
– Gain understanding of the potential for applying it in
the bioinformatics domain
Agenda Today
• Data mining definitions
• Before Starting to Mine….
• Descriptive Data Mining
– Dimension Reduction & Projection
– Clustering
– Association rules
• Predictive data mining concepts
– Classification and regression
– Bioinformatics applications
• Predictive data mining techniques
– Logistic Regression
– Nearest Neighbor
– Decision Trees
– Naive Bayes
– Neural Networks
• Evaluating predictive models
• Demonstration (optional)
The Promise….
• The Solution….
• NCBI Tools for data mining:
– Nucleotide sequence analysis
– Protein sequence analysis
– Structures
– Genome analysis
– Gene expression
• Data mining or not?
What is data mining?
Sources of (artificial) intelligence
• Reasoning versus learning
• Learning from data
– Patient data
– Genomics, proteomics
– Customer records
– Stock prices
– Piano music
– Criminal mug shots
– Websites
– Robot perceptions
– Etc.
Some working definitions….
• ‘Data Mining’ and ‘Knowledge Discovery in
Databases’ (KDD) are used interchangeably
• Data mining =
– The process of discovery of interesting, meaningful
and actionable patterns hidden in large amounts of
data
• Multidisciplinary field originating from artificial
intelligence, pattern recognition, statistics,
machine learning, bioinformatics, econometrics,
….
Some working definitions….
• Bioinformatics =
– Bioinformatics is the research, development, or
application of computational tools and approaches for
expanding the use of biological, medical, behavioral
or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data
[http://www.bisti.nih.gov/].
– Or, more pragmatically: bioinformatics or computational
biology is the use of techniques from applied
mathematics, informatics, statistics, and computer
science to solve biological problems [Wikipedia Nov
2005]
Bioinformatics and data mining
• From sequence to structure to function
• Genomics (DNA), Transcriptomics (RNA), Proteomics
(proteins), Metabolomics (metabolites)
• Pattern matching and search
• Sequence matching and alignment
• Structure prediction
– Predicting structure from sequence
– Protein secondary structure prediction
• Function prediction
– Predicting function from structure
– Protein localization
• Expression analysis
– Genes: micro array data analysis etc.
– Proteins
• Regulation analysis
Bioinformatics and data mining
• Classical medical and clinical studies
• Medical decision support tools
• Text mining on medical research literature (MEDLINE)
• Spectrometry, Imaging
• Systems biology and modeling biological systems
• Population biology & simulation
• Spin-off: biologically inspired computational learning
– Evolutionary algorithms, neural networks, artificial immune
systems
Genomic Microarrays – Case Study
• Problem:
– Leukemia (different types of Leukemia cells look very
similar)
– Given data for a number of samples (patients), can
we
• Accurately diagnose the disease?
• Predict outcome for given treatment?
• Recommend best treatment?
• Solution
– Data mining on micro-array data
Microarray data
• 50 most important genes
• Rows: genes
• Columns: samples / patients
Example: ALL/AML data
• 38 training patients, 34 test patients, ~ 7,000 patient attributes (micro
array gene data)
• 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid
Leukemia (AML)
• Use train data to build diagnostic model
ALL vs AML
Results on test data:
33/34 correct, 1 error may be mislabeled
The Knowledge Discovery Process
• Problem Objectives
• Data Mining Objectives & Design
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment, Application & Monitoring
Some working definitions….
• Concepts: kinds of things that can be learned
– Aim: intelligible and operational concept description
– Example: the relation between patient characteristics
and the probability to be diabetic
• Instances: the individual, independent examples of
a concept
– Example: a patient, candidate drug etc.
• Attributes: measuring aspects of an instance
– Example: age, weight, lab tests, microarray data etc.
Pattern or attribute space
Data mining tasks
• Descriptive data mining
– Matching & search: finding instances similar to x
– Clustering: discovering groups of similar instances
– Association rule extraction: if a & b then c
– Summarization: summarizing group descriptions
– Link detection: finding relationships
– …
• Predictive data mining
– Classification: classify an instance into a category
– Regression: estimate some continuous value
Before starting to mine….
• Pima Indians
Diabetes Data
– X = body mass
index
– Y = age
Before starting to mine….
• Attribute Selection
– This example: InfoGain by Attribute
– Keep the most important ones
Attributes as ranked in the InfoGain chart (axis 0.00–0.20):
Diastolic blood pressure (mm Hg)
Diabetes pedigree function
Number of times pregnant
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Age (years)
Body mass index (weight in kg/(height in m)^2)
Plasma glucose concentration at 2 hours in an oral
glucose tolerance test
Before starting to mine….
• Types of Attribute Selection
– Univariate versus multivariate (subset selection)
• The fact that attribute x is a strong univariate predictor does
not necessarily mean it will add predictive power to a set of
predictors already used by a model
– Filter versus wrapper
• Wrapper methods involve the subsequent learner (classifier
or other)
Dimension Reduction
• Projecting high dimensional data into a lower
dimension
– Principal Component Analysis
– Independent Component Analysis
– Fisher Mapping, Sammon’s Mapping etc.
– Multi Dimensional Scaling
– ….
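The first technique on this list can be sketched compactly: for 2-D data, the principal components follow directly from the 2x2 covariance matrix. A minimal plain-Python sketch on made-up toy points (real analyses would use a library; the data here are hypothetical, e.g. two correlated measurements):

```python
import math

# PCA sketch for 2-D data: eigen-decompose the 2x2 covariance matrix
# to find the direction of maximal variance (first principal component).
def pca_2d(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # closed-form eigenvalues of a symmetric 2x2 matrix
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    root = math.sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + root, tr / 2 - root      # l1 = largest eigenvalue
    # unit eigenvector for l1 = direction to project onto
    vx, vy = l1 - syy, sxy
    norm = math.hypot(vx, vy)
    return (l1, l2), (vx / norm, vy / norm)

# hypothetical, strongly correlated toy measurements
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
(l1, l2), direction = pca_2d(data)
```

On this toy data the first component carries over 90% of the variance, so projecting onto it reduces the data to one dimension with little loss.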
Data Mining Tasks: Clustering
• Clustering is the discovery of groups in a set of instances
• Groups are different, instances in a group are similar
• In 2 to 3 dimensional pattern space (e.g. age vs. weight) you could
just visualise the data and leave the recognition to a human end user
Data Mining Tasks: Clustering
• Clustering is the discovery of groups in a set of instances
• Groups are different, instances in a group are similar
• In 2 to 3 dimensional pattern space you could just visualise the data
and leave the recognition to a human end user
• In more than 3 dimensions this is not possible
Clustering Techniques
• Hierarchical algorithms
– Agglomerative
– Divisive
• Partition based clustering
– K-Means
– Self Organizing Maps / Kohonen Networks
• Probabilistic Model based
– Expectation Maximization / Mixture Models
Hierarchical clustering
• Agglomerative / Bottom up
– Start with single-instance clusters
– At each step, join the two closest clusters
– Methods to compute the distance between clusters x and y: single
linkage (distance between the closest points in clusters x and y),
average linkage (average distance between all points), complete
linkage (distance between the furthest points), centroid
– Distance measure: Euclidean, correlation etc.
• Divisive / Top Down
– Start with all data in one cluster
– Split into two clusters based on a distance measure / split utility
– Proceed recursively on each subset
• Both methods produce a dendrogram
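The agglomerative procedure fits in a few lines. A minimal single-linkage sketch on hypothetical one-dimensional values (a real tool would also record the merge heights to draw the dendrogram):

```python
# Agglomerative clustering sketch: single linkage, Euclidean distance,
# 1-D toy data. Merge the two closest clusters until k remain.
def single_linkage(points, k):
    clusters = [[p] for p in points]          # start: single-instance clusters
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                        # join the two closest clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

groups = single_linkage([1.0, 1.1, 1.2, 8.0, 8.1, 25.0], k=3)
```

With k=3 the three natural groups of this toy data are recovered.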
Levels of Clustering
Agglomerative
Divisive
Dunham, 2003
Hierarchical Clustering Example
• Clustering Microarray Gene Expression Data
– Gene expression measured using microarrays studied under variety of
conditions
– On budding yeast Saccharomyces cerevisiae
– Efficiently groups together genes of known similar function
•
Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P.,
Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro
Hierarchical Clustering Example
• Method
– Genes are the instances, samples the attributes!
– Agglomerative
– Distance measure = correlation
•
Data taken from: Cluster analysis and display of genome-wide expression patterns. Eisen, M., Spellman, P.,
Brown, P., and Botstein, D. (1998). PNAS, 95:14863-14868; Picture generated with J-Express Pro
Simple Clustering: K-means
• Pick a number (k) of cluster centers (at random)
– Cluster centers are sometimes called codes, and the
k codes a codebook
• Assign every item to its nearest cluster center
– F.i. using Euclidean distance
• Move each cluster center to the mean of its
assigned items
• Repeat until convergence
– change in cluster assignments less than a threshold
KDnuggets
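The steps above translate almost directly into code; a sketch on hypothetical 1-D data (two well-separated groups):

```python
import random

def kmeans(points, k, iters=100):
    """K-means sketch: pick k codes, assign, move to mean, repeat."""
    codes = random.sample(points, k)                 # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                             # assign to nearest code
            i = min(range(k), key=lambda i: (p - codes[i]) ** 2)
            clusters[i].append(p)
        # move each code to the mean of its assigned items
        new = [sum(c) / len(c) if c else codes[i]
               for i, c in enumerate(clusters)]
        if new == codes:                             # converged: no change
            break
        codes = new
    return sorted(codes)

random.seed(0)
centers = kmeans([1.0, 1.5, 0.5, 10.0, 10.5, 9.5], k=2)
```

On this toy data any random initialization converges to the two group means, 1.0 and 10.0.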
K-means example, step 1
Initially distribute the codes (k1, k2, k3) randomly in pattern space
K-means example, step 2
Assign each point to the closest code
K-means example, step 3
Move each code to the mean of all its assigned points
K-means example, step 2 (repeated)
Repeat the process – reassign the data points to the codes
Q: Which points are reassigned?
K-means example
Re-compute cluster means
K-means example
Move cluster centers to cluster means
KDnuggets
K-means clustering summary
Advantages
• Simple, understandable
• Items automatically assigned to clusters
Disadvantages
• Must pick the number of clusters beforehand
• All items are forced into a cluster
• Sensitive to outliers
Extensions
• Adaptive k-means
• K-medoids (based on the median instead of the mean)
– 1, 2, 3, 4, 100: average 22, median 3
Biological Example
• Clustering of yeast cell images
– Two clusters are found
– Left cluster primarily cells with thick capsule, right
cluster thin capsule
• caused by media, proxy for sick vs healthy
Self Organizing Maps
(Kohonen Maps)
• Claim to fame
– Simplified models of cortical maps in the brain
– Things that are near in the outside world link to
areas near in the cortex
– For a variety of modalities: touch, motor, …. up
to echolocation
– Nice visualization
• From a data mining perspective:
– SOMs are simple extensions of k-means clustering
– Codes are connected in a lattice
– In each iteration, codes neighboring the winning code
in the lattice are also allowed to move
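A minimal sketch of that extension, assuming a 1-D lattice of codes, a fixed learning rate, and a fixed neighbourhood radius (practical SOMs shrink both during training; the toy data here is uniform random):

```python
import random

# SOM sketch: like k-means, but codes live on a lattice and the
# winner's lattice neighbours also move toward each input.
def train_som(data, n_codes=5, epochs=50, lr=0.3, radius=1):
    codes = [random.random() for _ in range(n_codes)]   # 1-D lattice of codes
    for _ in range(epochs):
        for x in data:
            # winner = code closest to the input in data space
            w = min(range(n_codes), key=lambda i: (x - codes[i]) ** 2)
            for i in range(n_codes):
                # winner AND its lattice neighbours move toward x
                if abs(i - w) <= radius:
                    codes[i] += lr * (x - codes[i])
    return codes

random.seed(1)
codes = train_som([random.uniform(0, 1) for _ in range(200)])
```

Setting radius=0 removes the neighbourhood update and leaves an online variant of k-means, which is exactly the relation the slide describes.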
SOM
10x10 SOM trained on a Gaussian distribution
SOM example
Famous example:
Phonetic Typewriter
• SOM lattice below left is trained on spoken
letters, after convergence codes are labeled
• Creates a ‘phonotopic’ map
• Spoken word creates a sequence of labels
Famous example:
Phonetic Typewriter
• Criticism
– Topology preserving property is not used so why use SOMs and
not adaptive k-means for instance?
• K-means could also create a sequence
• This is true for most SOM applications!
– Is using clustering for classification optimal?
Bioinformatics Example
Clustering GPCRs
• Clustering G Protein Coupled Receptors (GPCRs)
[Samsanova et al, 2003, 2004]
• Important drug target, function often unknown
Association Rules Outline
• What are frequent item sets &
association rules?
• Quality measures
– support, confidence, lift
• How to find item sets efficiently?
– APRIORI
• How to generate association rules
from an item set?
• Biological examples
KDnuggets
Market Basket Example
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
• Frequent item set
• {MILK, BREAD}: frequency = 4
• Association rule
• {MILK, BREAD} => {EGGS}
• Frequency / importance = 2 (‘Support’)
• Quality = 50% (‘Confidence’)

Gene Expression Example
• What genes are expressed (‘active’) together?
• Interaction / regulation
• Similar function
ID   Expressed Genes in Sample
1    GENE1, GENE2, GENE5
2    GENE1, GENE3, GENE5
3    GENE2
4    GENE8, GENE9
5    GENE8, GENE9, GENE10
6    GENE2, GENE8
7    GENE9, GENE10
8    GENE2
9    GENE11
Association Rule Definitions
• Set of items: I = {I1, I2, …, Im}
• Transactions: D = {t1, t2, …, tn}, tj ⊆ I
• Itemset: {Ii1, Ii2, …, Iik} ⊆ I
• Support of an itemset: percentage of
transactions which contain that itemset.
• Large (Frequent) itemset: itemset whose
number of occurrences is above a threshold.
Dunham, 2003
Frequent Item Set Example
I = { Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread,PeanutButter} is 60%
Dunham, 2003
Association Rule Definitions
• Association Rule (AR): implication X => Y
where X, Y ⊆ I and X, Y disjoint
• Support of AR (s) X => Y: percentage of
transactions that contain X ∪ Y
• Confidence of AR (α) X => Y: ratio of the
number of transactions that contain X ∪ Y to
the number that contain X
Dunham, 2003
Association Rules Ex (cont’d)
Dunham, 2003
Association Rule Problem
• Given a set of items I = {I1, I2, …, Im} and a
database of transactions D = {t1, t2, …, tn}
where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the
Association Rule Problem is to identify all
association rules X => Y with a minimum
support and confidence.
• NOTE: Support of X => Y is the same as the
support of X ∪ Y.
Dunham, 2003
Association Rules Example
TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C
• Q: Given frequent set {A,B,E}, what
association rules have minsup = 2 and
minconf = 50% ?
A, B => E : conf = 2/4 = 50%
A, E => B : conf = 2/2 = 100%
B, E => A : conf = 2/2 = 100%
E => A, B : conf = 2/2 = 100%
Don’t qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 = 28% < 50%
__ => A, B, E : conf = 2/9 = 22% < 50%
KDnuggets
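The support and confidence figures in this example can be recomputed directly from the transaction table:

```python
# Transactions from the table above (TIDs 1-9), as Python sets.
transactions = [
    {'A', 'B', 'E'}, {'B', 'D'}, {'B', 'C'}, {'A', 'B', 'D'}, {'A', 'C'},
    {'B', 'C'}, {'A', 'C'}, {'A', 'B', 'C', 'E'}, {'A', 'B', 'C'},
]

def support(itemset):
    # number of transactions containing the whole itemset
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs):
    # fraction of transactions with lhs that also contain rhs
    return support(lhs | rhs) / support(lhs)
```

For instance, `confidence({'A', 'B'}, {'E'})` reproduces the 2/4 = 50% on the slide.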
Solution Association Rule Problem
• First, find all frequent itemsets with support >= minsup
– Exhaustive search won’t work
• Assume we have a set of m items: 2^m subsets!
– Exploit the subset property (APRIORI algorithm)
• For every frequent item set, derive rules with
confidence >= minconf
KDnuggets
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant)
• Idea: use one-item sets to generate two-item
sets, two-item sets to generate three-item sets, ..
– Subset Property: if {A, B} is a frequent item set, then
{A} and {B} have to be frequent item sets as well!
– In general: if X is a frequent k-item set, then all
(k-1)-item subsets of X are also frequent
Compute k-item sets by merging (k-1)-item sets
KDnuggets
An example
• Given: five three-item sets
(A B C), (A B D), (A C D), (A C E), (B C D)
• Candidate four-item sets:
(A B C D)
Q: OK?
A: yes, because all 3-item subsets are frequent
(A C D E)
Q: OK?
A: No, because (C D E) is not frequent
KDnuggets
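The join-and-prune step behind this example can be sketched as follows (item sets as sorted tuples; the join merges sets sharing a common prefix, the prune discards candidates with an infrequent subset):

```python
from itertools import combinations

def candidates(freq_k):
    """Apriori candidate generation: freq_k is a list of sorted,
    equal-length tuples representing the frequent k-item sets."""
    freq = set(freq_k)
    k = len(freq_k[0])
    out = []
    for a in freq_k:
        for b in freq_k:
            if a < b and a[:-1] == b[:-1]:          # join on common (k-1)-prefix
                cand = a + (b[-1],)
                # prune: every k-subset of the candidate must be frequent
                if all(s in freq for s in combinations(cand, k)):
                    out.append(cand)
    return out

three = [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'),
         ('A', 'C', 'E'), ('B', 'C', 'D')]
four = candidates(three)
```

This reproduces the slide's answer: (A B C D) survives because all its 3-item subsets are frequent, while (A C D E) is pruned because (A D E) is not.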
From Frequent Itemsets to
Association Rules
• Q: Given frequent set {A,B,E}, what are
possible association rules?
– A => B, E
– A, B => E
– A, E => B
– B => A, E
– B, E => A
– E => A, B
– __ => A, B, E (empty rule), or true => A, B, E
KDnuggets
Example:
Generating Rules from an Itemset
• Frequent itemset from golf data:
Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (with confidence):
If Humidity = Normal and Windy = False then Play = Yes (4/4)
If Humidity = Normal and Play = Yes then Windy = False (4/6)
If Windy = False and Play = Yes then Humidity = Normal (4/6)
If Humidity = Normal then Windy = False and Play = Yes (4/7)
If Windy = False then Humidity = Normal and Play = Yes (4/8)
If Play = Yes then Humidity = Normal and Windy = False (4/9)
If True then Humidity = Normal and Windy = False and Play = Yes (4/12)
KDnuggets
Example:
Generating Rules
• Rules with support > 1 and confidence = 100%:
     Association rule                                  Sup.  Conf.
1    Humidity=Normal, Windy=False => Play=Yes          4     100%
2    Temperature=Cool => Humidity=Normal               4     100%
3    Outlook=Overcast => Play=Yes                      4     100%
4    Temperature=Cool, Play=Yes => Humidity=Normal     3     100%
...  ...                                               ...   ...
58   Outlook=Sunny, Temperature=Hot => Humidity=High   2     100%
• In total: 3 rules with support four, 5 with support
three, and 50 with support two
KDnuggets
Biomedical Application
Head and Neck Cancer Example
1. ace27=0 fiveyr=alive 381 ==> tumorbefore=0 372 conf:(0.98)
2. gender=M ace27=0 467 ==> tumorbefore=0 455 conf:(0.97)
3. ace27=0 588 ==> tumorbefore=0 572 conf:(0.97)
4. tnm=T0N0M0 ace27=0 405 ==> tumorbefore=0 391 conf:(0.97)
5. loc=LOC7 tumorbefore=0 409 ==> tnm=T0N0M0 391 conf:(0.96)
6. loc=LOC7 442 ==> tnm=T0N0M0 422 conf:(0.95)
7. loc=LOC7 gender=M tumorbefore=0 374 ==> tnm=T0N0M0 357 conf:(0.95)
8. loc=LOC7 gender=M 406 ==> tnm=T0N0M0 387 conf:(0.95)
9. gender=M fiveyr=alive 633 ==> tumorbefore=0 595 conf:(0.94)
10. fiveyr=alive 778 ==> tumorbefore=0 726 conf:(0.93)
Bioinformatics Application
• The idea of association rules has been
customized for bioinformatics applications
• In biology it is often interesting to find frequent
structures rather than items
– For instance protein or other chemical structures
• Solution: Mining Frequent Patterns
– FSG (Kuramochi and Karypis, ICDM 2001)
– gSpan (Yan and Han, ICDM 2002)
– CloseGraph (Yan and Han, KDD 2002)
FSG: Mining Frequent Patterns
FSG Algorithm
for finding frequent subgraphs
Frequent Subgraph Examples
AIDS Data
• Compounds are active, inactive or moderately active (CA, CI, CM)
Predictive Subgraphs
• The three most discriminating sub-structures for
the PTC, AIDS, and Anthrax datasets
FSG References
• Frequent Sub-structure Based Approaches for Classifying
Chemical Compounds
Mukund Deshpande, Michihiro Kuramochi, and George Karypis
ICDM 2003
• An Efficient Algorithm for Discovering Frequent Subgraphs
Michihiro Kuramochi and George Karypis
IEEE TKDE
• Automated Approaches for Classifying Structures
Mukund Deshpande, Michihiro Kuramochi, and George Karypis
BIOKDD 2002
• Discovering Frequent Geometric Subgraphs
Michihiro Kuramochi and George Karypis
ICDM 2002
• Frequent Subgraph Discovery
Michihiro Kuramochi and George Karypis
1st IEEE Conference on Data Mining 2001
Recap
• Before Starting to Mine….
• Descriptive Data Mining
– Dimension Reduction & Projection
– Clustering
• Hierarchical clustering
• K-means
• Self organizing maps
– Association rules
• Frequent item sets
• Association Rules
• APRIORI
• Bio-informatics case: FSG for frequent subgraph discovery
• Next
– Predictive data mining
Data Mining Tasks: Classification
Goal of the classifier is to separate
classes on the basis of known
attributes
The classifier can be applied
to an instance with unknown
class
For instance, classes are
healthy (circle) and sick
(square); attributes are age
and weight
Data Preparation for Classification
• On attributes
– Attribute selection
– Attribute construction
• On attribute values
– Outlier removal / clipping
– Normalization
– Creating dummies
– Missing values imputation
– ….
Examples of Classification
Techniques
• Majority class vote
• Logistic Regression
• Nearest Neighbor
• Decision Trees, Decision Stumps
• Naive Bayes
• Neural Networks
• Genetic algorithms
• Artificial Immune Systems
Example classification algorithm:
Logistic Regression
• Linear regression
– For regression, not classification (outcome numeric, not symbolic
class)
– Predicted value is a linear combination of inputs
y = a·x1 + b·x2 + c
• Logistic regression
– Apply the logistic function to the linear regression formula
– Scales output between 0 and 1
– For binary classification use thresholding
y = 1 / (1 + e^-(a·x1 + b·x2 + c))
y > t => class c1
y <= t => class c2
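The formulas above as a sketch; the coefficients a, b and c are hypothetical here (in practice they are fitted to training data, e.g. by maximum likelihood):

```python
import math

def predict_proba(x1, x2, a, b, c):
    """Logistic regression score for two attributes x1, x2."""
    z = a * x1 + b * x2 + c               # linear regression part
    return 1.0 / (1.0 + math.exp(-z))     # logistic function: squashes to (0, 1)

def classify(x1, x2, a, b, c, threshold=0.5):
    # binary classification via thresholding
    return 1 if predict_proba(x1, x2, a, b, c) > threshold else 0
```

At z = 0 the score is exactly 0.5, which is why the threshold t = 0.5 corresponds to the linear decision boundary a·x1 + b·x2 + c = 0.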
Logistic Regression
in attribute space
Linear decision boundaries
can be represented well with
linear classifiers like logistic
regression
Logistic Regression
in attribute space
Non-linear decision
boundaries cannot be
represented well with linear
classifiers like logistic
regression
Well known example:
The XOR problem
Example classification algorithm:
Nearest Neighbour
• Data itself is the classification model, so no
model abstraction like a tree etc.
• For a given instance x, search the k instances
that are most similar to x
• Classify x as the most occurring class for the k
most similar instances
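A sketch of this procedure with Euclidean distance and a majority vote; the training patients and labels below are made up for illustration:

```python
from collections import Counter
import math

def knn_classify(train, x, k=3):
    """train: list of ((age, weight), label) pairs; x: (age, weight).
    The data itself is the model: no abstraction is built."""
    # k instances most similar to x (Euclidean distance)
    neighbours = sorted(train, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]     # majority class among the k nearest

# hypothetical patients: (age, weight) -> class
train = [((25, 60), 'healthy'), ((30, 65), 'healthy'), ((28, 62), 'healthy'),
         ((70, 95), 'sick'), ((65, 90), 'sick'), ((72, 98), 'sick')]
label = knn_classify(train, (27, 63), k=3)
```

In practice the attributes would be normalized first, otherwise the attribute with the largest scale dominates the distance.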
Nearest Neighbor in attribute space
Classification
= new instance
Any decision area possible
Condition: enough data
available
Example Classification Algorithm
Decision Trees
20000 patients
age > 67?
├─ yes: 1200 patients
│  weight > 85kg?
│  ├─ yes: 400 patients – Diabetic (50%)
│  └─ no: 800 patients – Diabetic (10%)
└─ no: 18800 patients
   gender = male? etc.
Building Trees:
Weather Data example

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

KDNuggets / Witten & Frank, 2000
Building Trees
• An internal node is a test
on an attribute.
• A branch represents an
outcome of the test, e.g.,
Color=red.
• A leaf node represents a
class label or class label
distribution.
• At each node, one
attribute is chosen to split
training examples into
distinct classes as much
as possible
• A new case is classified
by following a matching
path to a leaf node.

Outlook
├─ sunny: Humidity
│  ├─ high: No
│  └─ normal: Yes
├─ overcast: Yes
└─ rain: Windy
   ├─ true: No
   └─ false: Yes
KDNuggets / Witten & Frank, 2000
Split on what attribute?
• Which is the best attribute to split on?
– The one which will result in the smallest tree
– Heuristic: choose the attribute that produces best
separation of classes (the “purest” nodes)
• Popular impurity measure: information
– Measured in bits
– At a given node, how much more information do you
need to classify an instance correctly?
• What if at a given node all instances belong to one class?
• Strategy
– choose attribute that results in greatest information
gain
KDNuggets / Witten & Frank, 2000
Which attribute to select?
• Candidate: outlook attribute
• What is the info for the leaves?
– info([2,3]) = 0.971 bits
– info([4,0]) = 0 bits
– info([3,2]) = 0.971 bits
• Total: take the average weighted by the number of instances
– info([2,3], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 ×
0.971 = 0.693 bits
• What was the info before the split?
– info([9,5]) = 0.940 bits
• What is the gain for a split on outlook?
– gain(outlook) = 0.940 – 0.693 = 0.247 bits
Witten & Frank, 2000
Which attribute to select?
Gain(outlook) = 0.247, Gain(humidity) = 0.152,
Gain(windy) = 0.048, Gain(temperature) = 0.029
Witten & Frank, 2000
Continuing to split
gain("Humidity") = 0.971 bits
gain("Temperature") = 0.571 bits
gain("Windy") = 0.020 bits
KDNuggets / Witten & Frank, 2000
The final decision tree
• Note: not all leaves need to be pure; sometimes
identical instances have different classes
Splitting stops when data can’t be split any further
KDNuggets / Witten & Frank, 2000
Computing information
• Information is measured in bits
– When a leaf contains one class only, information is 0 (pure)
– When the number of instances is the same for all classes, information
reaches a maximum (impure)
• Measure: information value or entropy
entropy(p1, p2, …, pn) = –p1 log p1 – p2 log p2 – … – pn log pn
• Example (log base 2)
– info([2,3,4]) = –2/9 × log(2/9) – 3/9 × log(3/9) – 4/9 × log(4/9)
KDNuggets / Witten & Frank, 2000
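These formulas reproduce the outlook numbers from the earlier slide:

```python
from math import log2

def entropy(counts):
    """Information value in bits for a list of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

before = entropy([9, 5])                     # info before the split: 0.940 bits
leaves = [[2, 3], [4, 0], [3, 2]]            # sunny, overcast, rainy leaves
n = sum(sum(leaf) for leaf in leaves)
# average leaf entropy, weighted by the number of instances per leaf
after = sum(sum(leaf) / n * entropy(leaf) for leaf in leaves)
gain = before - after                        # gain(outlook) = 0.247 bits
```

Note the `if c > 0` guard: a pure leaf such as [4, 0] contributes 0 bits, matching the slide.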
Decision Trees in attribute space
Goal of the classifier is to separate
classes (circle, square) on the
basis of attributes age and
weight
Each line corresponds to a
split in the tree
Decision areas are ‘tiles’ in
attribute space
Example classification algorithm:
Naive Bayes
• Naive Bayes = Probabilistic Classifier based on
Bayes Rule
• Will produce probability for each target /
outcome class
• ‘Naive’ because it assumes independence
between attributes (uncorrelated)
Bayes’s rule
• Probability of event H given evidence E:
Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]
• A priori probability of H: Pr[H]
– Probability of the event before evidence is seen
• A posteriori probability of H: Pr[H | E]
– Probability of the event after evidence is seen
From Bayes’ “Essay towards solving a problem in the
doctrine of chances” (1763)
Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England
KDNuggets / Witten & Frank, 2000
Naïve Bayes for classification
• Classification learning: what’s the probability of the
class given an instance?
– Evidence E = instance
– Event H = class value for instance
• Naïve assumption: evidence splits into parts (i.e.
attributes) that are independent
Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × … × Pr[En | H] × Pr[H] / Pr[E]
KDNuggets / Witten & Frank, 2000
Weather data example
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?     (Evidence E)
Probability of class “yes”:
Pr[yes | E] = Pr[Outlook=Sunny | yes]
            × Pr[Temperature=Cool | yes]
            × Pr[Humidity=High | yes]
            × Pr[Windy=True | yes]
            × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
KDNuggets / Witten & Frank, 2000
Probabilities for weather data

Outlook    Yes  No     Temperature  Yes  No     Humidity  Yes  No     Windy  Yes  No     Play  Yes   No
Sunny      2    3      Hot          2    2      High      3    4      False  6    2            9     5
Overcast   4    0      Mild         4    2      Normal    6    1      True   3    3
Rainy      3    2      Cool         3    1
Sunny      2/9  3/5    Hot          2/9  2/5    High      3/9  4/5    False  6/9  2/5          9/14  5/14
Overcast   4/9  0/5    Mild         4/9  2/5    Normal    6/9  1/5    True   3/9  3/5
Rainy      3/9  2/5    Cool         3/9  1/5

(Counts are tallied from the 14-row weather data table shown earlier)
KDNuggets / Witten & Frank, 2000
Probabilities for weather data
• A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
Likelihood of the two classes:
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”) = 0.0206 / (0.0053 + 0.0206) = 0.795
KDNuggets / Witten & Frank, 2000
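The likelihood computation above, restated as code using the counts from the probability table:

```python
# Naive Bayes score for the new day (Sunny, Cool, High, True).
# Each factor is a conditional probability read off the counts table.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # likelihood of "yes"
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # likelihood of "no"

# Pr[E] cancels out: normalize the likelihoods into probabilities
P_yes = p_yes / (p_yes + p_no)
P_no  = p_no / (p_yes + p_no)
```

Note that dividing by Pr[E] is never needed explicitly: normalizing over the two classes has the same effect.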
Naïve Bayes: discussion
• Naïve Bayes works surprisingly well (even if
independence assumption is clearly violated)
• Why? Because classification doesn’t require
accurate probability estimates as long as
maximum probability is assigned to correct
class
• However: adding too many redundant attributes
will cause problems (e.g. identical attributes)
witten&eibe
Naive Bayes in attribute space
Classification
NB can model non-linear decision boundaries
Example classification algorithm:
Neural Networks
• Inspired by neuronal computation in the brain (McCullough & Pitts
1943 (!))
(network figure – input: e.g. customer attributes;
output: e.g. response)
• Input (attributes) is coded as activation on the input layer neurons;
activation feeds forward through a network of weighted links between
neurons and causes activations on the output neurons (for instance
diabetic yes/no)
• The algorithm learns to find the optimal weights using the training
instances and a general learning rule.
Neural Networks
• Example simple network (2 layers)
– Inputs: age, body_mass_index
– Weights: weight_age, weight_body_mass_index
– Output: probability of being diabetic
• Probability of being diabetic = f(age × weight_age +
body_mass_index × weight_body_mass_index)
Neural Networks in Pattern
Space
Classification
Simple network: only a line
available (why?) to separate
classes
Multilayer network:
any classification boundary
possible
Evaluating Classifiers
• Root mean squared error (rmse), Area Under
the ROC Curve (AUC), confusion matrices,
classification accuracy
– Accuracy = 78% on test set: 78% of classifications
were correct
• Hold out validation, n-fold cross validation, leave
one out validation
– Build a model on a training set, evaluate on a test set
– Hold out: single test set (e.g. one third of the data)
– n-fold cross validation
• Divide the data into n groups
• Perform n cycles, each cycle with a different fold as test set
– Leave one out
• Test set of one instance, cycle through all instances
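The n-fold bookkeeping can be sketched as follows (index handling only; building and scoring a model in each cycle is left out):

```python
def n_fold_indices(n_instances, n_folds):
    """Yield (train, test) index lists so that every instance
    appears in a test set exactly once."""
    folds = [[] for _ in range(n_folds)]
    for i in range(n_instances):
        folds[i % n_folds].append(i)          # deal instances round-robin
    for f in range(n_folds):
        test = folds[f]                       # one fold held out per cycle
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test

splits = list(n_fold_indices(10, 5))
```

Setting `n_folds` equal to the number of instances gives leave-one-out validation; in practice folds are usually also shuffled and stratified by class.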
Evaluating Classifiers
• Investigating the sources of error
– Bias-variance decomposition
– Informal definition
• Bias: error due to limitations of the model representation (e.g. a
linear classifier on a non-linear problem); even with infinite data
there will be bias
• Variance: error due to instability of the classifier over different
samples; error due to small sample sizes, overfitting
Example Results
Predicting Survival for Head & Neck Cancer
(Table: TNM Symbolic vs. TNM Numeric; average and standard
deviation (SD) of the classification accuracy for all classifiers)
Example Results Head and Neck Cancer:
Bias Variance Decomposition
• Quiz: What could be a strategy to improve these
models?
Agenda Today
• Data mining definitions
• Before Starting to Mine….
• Descriptive Data Mining
– Dimension Reduction & Projection
– Clustering
– Association rules
• Predictive data mining concepts
– Classification and regression
– Bioinformatics applications
• Predictive data mining techniques
– Logistic Regression
– Nearest Neighbor
– Decision Trees
– Naive Bayes
– Neural Networks
• Evaluating predictive models
• Demonstration (optional)
General Data Mining Resources
• Best KDD website & mailing lists: KDNuggets
website (www.kdnuggets.com - check the
bioinformatics sections)
• Most popular open source data mining tool:
WEKA
• Most important conference: KDD (see
www.kdd2006.com for instance; check BIOKDD,
bioinformatics data mining workshop)
• You can contact me at [email protected]
Demo: WEKA
Questions and Answers