Introduction to data mining
G. Marcou
Laboratoire d'infochimie, Université de Strasbourg, 4 rue Blaise Pascal, 67000 Strasbourg
Motivation of data mining
• Discover automatically useful information in large data repositories.
• Extract patterns from experience.
• Predict the outcome of future observations.
• Learning is defined by:
  - an Experience
  - a Set of Tasks
  - a Performance Measure
  If Experience increases, the Performance Measure on the Set of Tasks increases.
Organisation of data
• Datasets are organized as instances/attributes.
• Instances; synonyms: data points, entries, samples, …
• Attributes; synonyms: factors, variables, measures, ...
Nature of data
• Attributes can be:
  - Numeric: atom counts, e.g. O=1, Cl=4, N=6, S=3
  - Continuous: molecular surface
  - Nominal: molecule name, e.g. (1-methyl)(1,1,1-tributyl)azanium, tetrahexylammonium
  - Categorical: phase state, e.g. solid, amorphous, liquid, gas, ionized
  - Ordered: intestinal absorption, e.g. not absorbed, mildly absorbed, completely absorbed
  - Ranges: spectral domains, e.g. UV, visible, IR
  - Hierarchical: EC numbers, e.g.
      EC 1. Oxidoreductases
      EC 2. Transferases
      EC 3. Hydrolases
      EC 4. Lyases
      EC 5. Isomerases
      EC 6. Ligases
        EC 6.1 Forming Carbon-Oxygen Bonds
        EC 6.2 Forming Carbon-Sulfur Bonds
        EC 6.3 Forming Carbon-Nitrogen Bonds
        EC 6.4 Forming Carbon-Carbon Bonds
        EC 6.5 Forming Phosphoric Ester Bonds
        EC 6.6 Forming Nitrogen-Metal Bonds
Nature of learning
• Unsupervised learning
  - Clustering
  - Rules
• Supervised learning
  - Classification
  - Regression
• Other
  - Reinforcement learning
  - First order logic, e.g. ∀x, ∃y: xRy
Concept in data mining
• A Concept is the target function to be learned.
• A Concept is learned from instances (Instance 1, Instance 2, Instance 3, …), possibly spread over several databases (DB1, DB2), described as:
  - attribute-value pairs
  - relations
  - sequences
  - spatial data
Machine Learning and Statistics
• Data miner point of view
  - Any hypothesis compatible with the dataset is useful.
  - Search for all hypotheses compatible with the dataset.
  - Induction
• Statistician point of view
  - Datasets are the expression of underlying probability distributions.
  - Datasets validate or invalidate prior hypotheses.
  - Deduction
Validation in Data Mining
• Validation means that a model is built on a training set of data, then applied on a test set of data.
• Success and failure on the test set must be estimated.
• The estimate is supposed to be representative of any new situation.
• Every model must be validated.
Training/Test
• Split the dataset into two parts:
  - One part is the training set.
  - The other is the test set.
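A minimal sketch of such a split, assuming the dataset fits in a Python list (the 70/30 ratio and the function name are illustrative choices, not slide content):

```python
import random

def split_train_test(dataset, test_fraction=0.3, seed=42):
    """Shuffle a copy of the dataset, then cut it into training/test parts."""
    rng = random.Random(seed)
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

train, test = split_train_test(range(10))
print(len(train), len(test))  # 7 3
```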
Bootstrapping
• Draw N instances with replacement from the dataset.
• Create a training set with these instances.
• Use the dataset as the test set.
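A minimal sketch that follows the slide literally, using the full dataset as the test set (other bootstrap variants, e.g. out-of-bag testing, are not covered here):

```python
import random

def bootstrap_split(dataset, seed=0):
    """Draw N instances with replacement as the training set;
    the whole dataset serves as the test set, as described above."""
    rng = random.Random(seed)
    dataset = list(dataset)
    training = [rng.choice(dataset) for _ in range(len(dataset))]
    return training, dataset

train, test = bootstrap_split(range(10))
print(sorted(train))  # duplicates are expected in a bootstrap sample
```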
Cross-Validation
• Split the dataset into N subsets.
• Use each subset in turn as the test set, while all the others form the training set.
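A minimal sketch that yields the N training/test pairs (interleaved folds are an illustrative choice; shuffled or stratified folds are common variants):

```python
def cross_validation_folds(dataset, n_subsets=5):
    """Yield (training set, test set) pairs: each of the N subsets
    is used once as the test set, the others form the training set."""
    dataset = list(dataset)
    folds = [dataset[i::n_subsets] for i in range(n_subsets)]
    for i in range(n_subsets):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]

for train, test in cross_validation_folds(range(10), n_subsets=5):
    print(len(train), len(test))  # 8 2, five times
```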
Scrambling
• Reassign the classes to the instances at random.
• Success and failure are estimated on the scrambled data.
• The goal is to estimate how good a success measurement can be obtained by pure chance.
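A minimal sketch of the label reassignment (y-scrambling); the model is then trained and evaluated on the scrambled labels exactly as on the real ones:

```python
import random

def scramble_classes(classes, seed=0):
    """Return a randomly reassigned copy of the class labels.
    A model that still scores well on scrambled labels reflects pure chance."""
    rng = random.Random(seed)
    scrambled = list(classes)
    rng.shuffle(scrambled)
    return scrambled

print(scramble_classes(["active", "inactive", "active", "inactive"]))
```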
Clustering
• Search for an internal organization of the data.
• Optimize relations between instances relative to an objective function.
• Typical objective functions:
  - Separation
  - Coherence
  - Concept
  - Contiguity
  - Density
Cluster Evaluation
• Essential, because any dataset can be clustered, but not every clustering is meaningful.
• Evaluation can be:
  - Unsupervised
  - Supervised
  - Relative
Unsupervised Cluster evaluation
• Cohesion
• Separation
• Silhouette
• Proximity matrix
• Cophenetic correlation
• Clustering tendency: compare the p nearest-neighbor distances (NND) between instances (ωi) with the NND between random points (ui), e.g. through the Hopkins statistic:
  H = Σi ui / (Σi ωi + Σi ui)
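A minimal sketch of this measure for 2-D points, assuming random points drawn uniformly over the bounding box of the data; the identification with the Hopkins statistic follows the formula above:

```python
import math
import random

def hopkins_statistic(points, p=10, seed=0):
    """H = sum(ui) / (sum(wi) + sum(ui)), where wi are nearest-neighbor
    distances (NND) of p sampled instances within the data and ui the NND
    of p uniform random points to the data. H near 0.5: no structure;
    H near 1: strong clustering tendency."""
    rng = random.Random(seed)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    sample = rng.sample(points, p)
    w = [min(math.dist(s, q) for q in points if q is not s) for s in sample]
    randoms = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
               for _ in range(p)]
    u = [min(math.dist(r, q) for q in points) for r in randoms]
    return sum(u) / (sum(w) + sum(u))

pts = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)] \
    + [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(50)]
print(hopkins_statistic(pts))  # close to 1 for two tight clusters
```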
Supervised cluster evaluation
• Compare each cluster i against each class j, with Ni the number of members of cluster i, Nj the number of members of class j, and Nij the number of members of cluster i belonging to class j:
  Precision(i,j) = pij = Nij / Ni
  Recall(i,j) = Nij / Nj
• Example: Precision(3,1) and Recall(3,1) compare Cluster 3 with Class 1.
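A minimal sketch of these two measures, computed from parallel lists of cluster and class labels (function and variable names are illustrative):

```python
from collections import Counter

def cluster_precision_recall(cluster_ids, class_ids):
    """Precision(i,j) = N_ij / N_i and Recall(i,j) = N_ij / N_j, where
    N_i = size of cluster i, N_j = size of class j, and N_ij = members
    of cluster i belonging to class j."""
    n_i = Counter(cluster_ids)
    n_j = Counter(class_ids)
    n_ij = Counter(zip(cluster_ids, class_ids))
    precision = {(i, j): c / n_i[i] for (i, j), c in n_ij.items()}
    recall = {(i, j): c / n_j[j] for (i, j), c in n_ij.items()}
    return precision, recall

p, r = cluster_precision_recall([1, 1, 2, 2, 2], ["a", "b", "b", "b", "a"])
print(p[(2, "b")], r[(2, "b")])  # 2/3 of cluster 2 is class b; 2/3 of class b is in cluster 2
```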
Relative analysis
• Compare two clusterings.
• Supervised cluster analysis is a special case of relative analysis:
  the reference clustering is the set of classes.
• Count, over all pairs of instances:
  - N00: pairs in different clusters in both clusterings
  - N11: pairs in the same cluster in both clusterings
  - N01: pairs in different clusters in the first clustering and in the same cluster in the second
  - N10: pairs in the same cluster in the first clustering and in different clusters in the second
• Rand statistic: R = (N11 + N00) / (N00 + N01 + N10 + N11)
• Jaccard statistic: J = N11 / (N01 + N10 + N11)
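A minimal sketch that counts the four pair categories and evaluates both statistics (names are illustrative):

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Classify every instance pair by its status in two clusterings."""
    n00 = n01 = n10 = n11 = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            n11 += 1
        elif same_a:
            n10 += 1
        elif same_b:
            n01 += 1
        else:
            n00 += 1
    return n00, n01, n10, n11

n00, n01, n10, n11 = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
print((n11 + n00) / (n00 + n01 + n10 + n11))  # Rand statistic: 0.5
print(n11 / (n01 + n10 + n11))                # Jaccard statistic: 0.25
```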
A simple clustering algorithm: k-means
1. Select k points as centroids.
2. Form k clusters: each point is assigned to its closest centroid.
3. Reset each centroid to the (geometric) center of its cluster.
4. Repeat from step 2 until no change is observed.
5. Repeat from step 1 until stable average clusters are obtained.
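A minimal sketch of steps 1 to 4 for 2-D points (a single run; step 5 would repeat it with different initial centroids):

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on 2-D points, following steps 1 to 4 above."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                     # step 1
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # step 2: closest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [                                 # step 3: recenter
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centroids[j]
            for j, cl in enumerate(clusters)]
        if new_centroids == centroids:                    # step 4: converged
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)  # one centroid near (0, 0.5), one near (10, 10.5)
```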
Classification
• Definition
  - Assign one or several objects to predefined categories.
  - The target function maps a set of attributes x to a set of classes y.
• Learning scheme
  - Supervised learning
  - Attribute-value data
• Goal
  - Predict the outcome of future observations.
Probabilities basics
• Conditional probability: the probability of realization of event A knowing that B has occurred:
  P(A|B) = P(A∩B) / P(B)
• Independence of random events: A and B are independent if P(A∩B) = P(A)·P(B).
• The Bayes equation for independent events xi:
  P(A|x1,…,xn) = P(A)·Πi P(xi|A) / P(x1,…,xn)
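A toy numeric check of the conditional-probability definition (the counts are invented purely for illustration):

```python
# Toy dataset: 100 molecules; 25 are both soluble (event A) and aromatic
# (event B); 50 are aromatic overall.
p_A_and_B = 25 / 100
p_B = 50 / 100
p_A_given_B = p_A_and_B / p_B  # P(A|B) = P(A∩B) / P(B)
print(p_A_given_B)             # 0.5: half of the aromatic molecules are soluble
```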
Statistical approach to classification
• Estimate the probability of an instance {x1, x2} being of Class 1 or of Class 2.
[Figure: instances of Class 1 and Class 2 in the (x1, x2) plane.]
The Naive Bayes assumption
• The probability that an instance {x1, x2, …} belongs to class A is difficult to estimate directly: poor statistics.
• Consider the Bayes equation:
  P(A|x1,x2,…) = P(x1,x2,…|A)·P(A) / P(x1,x2,…)
  (Posterior Probability = Likelihood × Prior Probability / Evidence)
• With the naive assumption that {x1, x2, …} are independent:
  P(x1,x2,…|A) = Πi P(xi|A)
• The prior probability, the evidence and the likelihood have better estimates: good statistics.
The Naive Bayes Classifier
1. Estimate the prior probability, P(A), for each class.
2. Estimate the likelihood, P(x|A), of each attribute for each class.
3. For a new instance {x1, x2, …}, estimate the Bayes score for each class:
   Score(A) = P(A)·Πi P(xi|A)
4. Assign the instance to the class which possesses the highest score.
• The value of C can be optimized.
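A minimal sketch of steps 1 to 4 for nominal attributes; the Laplace smoothing constant alpha is an assumption of this sketch, added so that unseen attribute values do not zero out the score:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayes for nominal attributes, following steps 1 to 4 above."""

    def fit(self, X, y, alpha=1.0):
        self.alpha = alpha                  # Laplace smoothing (sketch choice)
        self.classes = sorted(set(y))
        self.prior = {c: y.count(c) / len(y) for c in self.classes}  # step 1
        self.counts = defaultdict(Counter)  # (class, attribute) -> value counts
        for row, c in zip(X, y):
            for i, v in enumerate(row):
                self.counts[(c, i)][v] += 1
        return self

    def likelihood(self, c, i, v):                                   # step 2
        cnt = self.counts[(c, i)]
        # Smooth over the observed values plus one slot for unseen ones.
        return (cnt[v] + self.alpha) / (sum(cnt.values()) + self.alpha * (len(cnt) + 1))

    def predict(self, row):
        def score(c):                       # step 3: Score(A) = P(A)·Π P(xi|A)
            s = self.prior[c]
            for i, v in enumerate(row):
                s *= self.likelihood(c, i, v)
            return s
        return max(self.classes, key=score)                          # step 4

X = [["solid", "high"], ["liquid", "low"], ["solid", "low"], ["liquid", "high"]]
y = ["active", "inactive", "active", "inactive"]
print(NaiveBayes().fit(X, y).predict(["solid", "high"]))  # active
```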
Success and failure
• For N instances and a given classifier, define for each class i:
• NTP(i): True Positives
  - Number of instances of class i correctly classified.
• NFP(i): False Positives
  - Number of instances incorrectly assigned to class i.
• NTN(i): True Negatives
  - Number of instances of other classes correctly classified.
• NFN(i): False Negatives
  - Number of instances of class i incorrectly assigned to other classes.
Confusion Matrix
• For N instances, K classes and a classifier
• Nij: the number of instances of class i classified as j

            Class1  Class2  ...  ClassK
  Class1    N11     N12     ...  N1K
  Class2    N21     N22     ...  N2K
  ...       ...     ...     ...  ...
  ClassK    NK1     NK2     ...  NKK
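A minimal sketch that builds the matrix from parallel lists of actual and predicted labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """N[i][j] = number of instances of class i classified as j."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(i, j)] for j in classes] for i in classes]

actual    = ["A", "A", "B", "B", "B"]
predicted = ["A", "B", "B", "B", "A"]
for row in confusion_matrix(actual, predicted, ["A", "B"]):
    print(row)  # [1, 1] then [1, 2]
```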
Classification Evaluation
• Global measures of success
  - Measures are estimated over all classes.
• Local measures of success
  - Measures are estimated for each class.
• Ranking success evaluation
  - Receiver Operating Characteristic curve (ROC): Recall plotted against 1-Specificity.
  - Area Under the ROC Curve (ROC AUC).
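A minimal sketch of the ROC AUC computed by sweeping the decision threshold over prediction scores (trapezoidal summation; tied scores are handled only approximately):

```python
def roc_auc(scores, labels):
    """ROC AUC: sweep the threshold down the sorted scores, accumulating
    trapezoids in the (1 - Specificity, Recall) plane."""
    pos = sum(labels)
    neg = len(labels) - pos
    auc, tp, fp, prev_tp, prev_fp = 0.0, 0, 0, 0, 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1  # one more true positive at this threshold
        else:
            fp += 1  # one more false positive at this threshold
        auc += (fp - prev_fp) / neg * (tp + prev_tp) / (2 * pos)
        prev_tp, prev_fp = tp, fp
    return auc

print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # 1.0: perfect ranking
```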
Losses and Risks
• Errors on different class predictions have different costs:
  What does it cost to mistakenly assign an instance of one class to another?
• Costs are collected in a cost matrix Cij, with zero cost on the diagonal;
  the matrix is in general asymmetric (Cij ≠ Cji):

            Class1  Class2  ...  ClassK
  Class1    0       C12     ...  C1K
  Class2    C21     0       ...  C2K
  ...       ...     ...     ...  ...
  ClassK    CK1     CK2     ...  0

• For two classes, performance under costs is summarized by the Probability Cost Function and the Normalized Expected Cost:
  PCF = p(+)·C(-|+) / (p(+)·C(-|+) + p(-)·C(+|-))
  Normalized Expected Cost = FNR·PCF + FPR·(1 - PCF)
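A minimal sketch based on the two-class formulas above; fnr (false negative rate, 1 - TPR) and fpr (false positive rate) are notational assumptions of this sketch:

```python
def normalized_expected_cost(fnr, fpr, pcf):
    """Normalized Expected Cost = FNR·PCF + FPR·(1 - PCF), the two-class
    quantity plotted on a cost curve (see the formulas above)."""
    return fnr * pcf + fpr * (1.0 - pcf)

# 10% missed positives, 20% false alarms, equal priors and costs (PCF = 0.5):
print(normalized_expected_cost(fnr=0.1, fpr=0.2, pcf=0.5))  # 0.15
```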
Cost Curve
• Plot of the Normalized Expected Cost against the Probability Cost Function.
[Figure: cost curves of the ideal classifier, the reject-all classifier, the accept-all classifier, an actual classifier, and the worst classifier, in the (Probability Cost Function, Normalized Expected Cost) plane.]
Conclusion
• Data mining extracts useful information from datasets.
• Clustering:
  - Unsupervised
  - Provides information about the data
• Classification:
  - Supervised
  - Builds models in order to predict the outcome of future observations
Multi-Linear Regression
• Fit a linear model, y = ax + b, to the data.
• The coefficients a and b are chosen to minimize the Sum of Squared Errors (SSE):
  SSE(a, b) = Σi (yi - a·xi - b)²
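The slides show only the single-variable case; a minimal closed-form sketch minimizing the SSE above:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a*x + b, minimizing the SSE."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x
    return a, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0): y = 2x + 1
```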