Slides for Lecture 1


Interacting with Data
Materials from a Course at Princeton University
-- Hu Yan
1
Outline
- Introduction to this course
- Introduction to Classification
- The Nearest Neighbor Algorithm
- Decision Tree Algorithm
- Conclusion and future talks

2
What is this course about?

- This course is about data!
  - How to get the most out of data and convert data into knowledge, information, or predictions.
- Examples of datasets:
  - Credit cards: every purchase you make is tracked and used to detect fraud, for marketing purposes, and to make predictions.
  - Security cameras: used for tracking (e.g., enforcing fines) or for finding criminals via facial recognition software.
  - Articles: articles indexed in multiple databases can be organized by topic, or the evolution of topics can be tracked over time.
- There are all kinds of data:
  - Text, images, transaction records, etc.
3
Tasks

- Make predictions or classifications
  - Classify whether or not customers will switch companies.
- Cluster or organize data
  - Cluster articles by topic.
  - Different from classification: we don't know the classes ahead of time.
- Find "simple" descriptions of complex objects
  - Find a simple description of faces.
- Identify what is typical and what is an outlier
  - Identify purchases that are typical or unusual for a given customer.
4
Perspective
- Related fields
  - Pattern recognition (from the 60s): primarily concerned with images.
  - Machine learning (from the 80s): a natural outgrowth of Artificial Intelligence (AI).
  - Data mining (from the 90s): developed to deal with vast amounts of data and discover "interesting patterns".
- This course is largely a mixture of statistics, machine learning, and data mining.
- Look at interacting with data:
  - Classification, clustering, regression, and dimensionality reduction.
5
Outline
- Introduction to this course
- Introduction to Classification
- The Nearest Neighbor Algorithm
- Decision Tree Algorithm
- Conclusion and future talks

6
Introduction to Classification
- Classifying objects from a data set based on a certain characteristic.
  - Binary classification: positive or negative.
- Classification learning algorithm
  - Input: a labeled data set.
  - Output: a classifier (predicts the label of unclassified input examples).
7
Example

- Classification criterion: any integer greater than 196 or less than 47 will be labeled negative, and positive otherwise.
8
Example

- A decimal integer is positive if the second and sixth most significant bits in its binary representation are set; it is negative otherwise.
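As a rough illustration, here is a minimal Python sketch of this labeling rule. It assumes the "second and sixth most significant bits" are counted from the leading bit of the integer's binary representation (the slide does not fix a bit width, so this reading is an assumption):

    def label(n: int) -> str:
        """Label n positive if the 2nd and 6th most significant bits
        of its binary representation are both set, negative otherwise."""
        bits = bin(n)[2:]              # e.g. 203 -> '11001011'
        if len(bits) < 6:              # fewer than 6 bits: both cannot be set
            return "negative"
        # bits[0] is the most significant bit, so index 1 is the 2nd MSB
        # and index 5 is the 6th MSB
        return "positive" if bits[1] == "1" and bits[5] == "1" else "negative"

    for n in (203, 197, 64):
        print(n, bin(n), label(n))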
9
Outline
- Introduction to this course
- Introduction to Classification
- The Nearest Neighbor Algorithm
- Decision Tree Algorithm
- Conclusion and future talks

10
The Nearest Neighbor Algorithm

- Training
  - There are m training examples.
  - Each training example is of the form (x_i, y_i), where x_i \in R^n and y_i \in {v_1, ..., v_s}.
  - Store all the training examples.
- Testing
  - Given a test point x, predict y_i where x_i is the closest training example to x.
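A minimal Python sketch of this rule, assuming Euclidean distance (the slide only says "closest", so the metric is an assumption):

    import math

    def nearest_neighbor(train, x):
        """train: list of (x_i, y_i) pairs with x_i a tuple of numbers.
        Returns the label of the stored example closest to x."""
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        # "Training" is just storing the examples; all the work happens here.
        _, label = min(train, key=lambda ex: dist(ex[0], x))
        return label

    train = [((1.0, 1.0), "+"), ((4.0, 4.0), "-"), ((1.5, 0.5), "+")]
    print(nearest_neighbor(train, (1.2, 0.9)))   # -> "+"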
11
The Nearest Neighbor Algorithm
- It is a kind of instance-based learning method.
- Often referred to as a "lazy" learning method.
  - Simply store the training examples; delay processing until a new instance must be classified.
  - In contrast, some methods construct a general, explicit description of the target function when the training examples are provided.
  - Advantage: instead of estimating the target function once for the entire space, estimate it locally and differently for each new instance.
  - Disadvantage: the cost of classifying new instances can be high.
12
k-Nearest Neighbor Algorithm
13
k-Nearest Neighbor Algorithm
- [Figure: positive and negative training examples in a two-dimensional space around a query point x_q]
  - 1-nearest neighbor classifies x_q as +
  - 5-nearest neighbor classifies x_q as -
14
k-Nearest Neighbor Algorithm



- Never forms an explicit general hypothesis f^ regarding the target function f.
- Simply computes the classification of each new query instance as needed.
- What is the implicit general function?
15
Distance-weighted k-Nearest Neighbor Algorithm

- An obvious refinement:
  - Weight the contribution of each of the k neighbors according to its distance to the query point (a sketch follows below).
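One common weighting scheme is 1/d^2; the slide only says "according to their distance", so the exact scheme here is an assumption. A minimal Python sketch:

    import math
    from collections import defaultdict

    def weighted_knn(train, x, k=5):
        """Distance-weighted k-NN: each of the k nearest stored examples
        votes for its label with weight 1/d^2."""
        def dist(a, b):
            return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

        neighbors = sorted(train, key=lambda ex: dist(ex[0], x))[:k]
        votes = defaultdict(float)
        for xi, yi in neighbors:
            d = dist(xi, x)
            if d == 0.0:                     # query coincides with a stored example
                return yi
            votes[yi] += 1.0 / (d ** 2)
        return max(votes, key=votes.get)

    train = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"), ((5, 5), "-"), ((6, 5), "-")]
    print(weighted_knn(train, (1.5, 1.5), k=3))   # -> "+"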
16
Curse of dimensionality
- Imagine instances described by 20 attributes, but only 2 are relevant to the target function.
- Curse of dimensionality: nearest neighbor is easily misled when the data is high-dimensional.
- One approach (sketched below):
  - Stretch the j-th axis by a weight z_j, where z_1, ..., z_n are chosen to minimize prediction error.
  - Use cross-validation to automatically choose the weights z_1, ..., z_n.
  - Note that setting z_j to zero eliminates this dimension altogether.
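A small sketch of the idea, assuming Euclidean distance and leave-one-out nearest-neighbor error as the cross-validation score (the slide does not pin down either choice):

    import math

    def weighted_dist(a, b, z):
        """Euclidean distance after stretching the j-th axis by weight z[j]."""
        return math.sqrt(sum((zj * (aj - bj)) ** 2 for aj, bj, zj in zip(a, b, z)))

    def loo_error(train, z):
        """Leave-one-out 1-NN error rate under axis weights z; one simple
        way to score a candidate weight vector."""
        errors = 0
        for i, (xi, yi) in enumerate(train):
            rest = train[:i] + train[i + 1:]
            _, pred = min(rest, key=lambda ex: weighted_dist(ex[0], xi, z))
            errors += (pred != yi)
        return errors / len(train)

    # The second attribute is noise; setting its weight to 0 removes it entirely.
    train = [((1, 9.7), "+"), ((2, 0.1), "+"), ((8, 5.2), "-"), ((9, 4.8), "-")]
    print(loo_error(train, (1, 1)), loo_error(train, (1, 0)))   # 0.5 0.0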
17
Outline
- Introduction to this course
- Introduction to Classification
- The Nearest Neighbor Algorithm
- Decision Tree Algorithm
  - Decision tree representation
  - ID3 learning algorithm
  - Entropy, Information gain
  - Overfitting
- Conclusion and future talks
18
Decision tree for PlayTennis
19
Decision tree representation
- Instances are represented by attribute-value pairs.
- Each internal node tests an attribute.
- Each branch corresponds to an attribute value.
- Each leaf node assigns a classification.
- In general, a decision tree represents a disjunction of conjunctions of constraints on the attribute values of the instances (see the example below).
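For example, the standard PlayTennis tree (this deck appears to follow the classic textbook example; the exact tree on the earlier slide is assumed) reads as:

    (Outlook = Sunny \wedge Humidity = Normal) \vee (Outlook = Overcast) \vee (Outlook = Rain \wedge Wind = Weak)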
20
Building a Decision tree
- ID3 (1986), C4.5 (1993)
- A top-down, greedy search through the space of possible decision trees.
- Main loop (a sketch in code follows below):
  - A <- the best decision attribute for the next node;
  - Assign A as the decision attribute for the node;
  - For each value of A, create a new branch of the node;
  - Sort the training examples to the leaf nodes;
  - If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes.
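A compact Python sketch of this loop (not the author's code), selecting attributes by information gain as defined on the following slides; the toy data set at the bottom is hypothetical, loosely modeled on PlayTennis:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def info_gain(examples, attr):
        """Gain(S, A) = Entropy(S) - sum over values v of |S_v|/|S| * Entropy(S_v)."""
        labels = [y for _, y in examples]
        gain = entropy(labels)
        for value in {x[attr] for x, _ in examples}:
            subset = [y for x, y in examples if x[attr] == value]
            gain -= (len(subset) / len(examples)) * entropy(subset)
        return gain

    def id3(examples, attrs):
        """examples: list of (attribute-dict, label); attrs: attribute names.
        Returns either a leaf label or a node (attr, {value: subtree})."""
        labels = [y for _, y in examples]
        if len(set(labels)) == 1:                    # perfectly classified: stop
            return labels[0]
        if not attrs:                                # no attributes left: majority label
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(examples, a))
        branches = {}
        for value in {x[best] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[best] == value]
            branches[value] = id3(subset, [a for a in attrs if a != best])
        return (best, branches)

    data = [
        ({"Outlook": "Sunny",    "Wind": "Weak"},   "No"),
        ({"Outlook": "Sunny",    "Wind": "Strong"}, "No"),
        ({"Outlook": "Overcast", "Wind": "Weak"},   "Yes"),
        ({"Outlook": "Rain",     "Wind": "Weak"},   "Yes"),
        ({"Outlook": "Rain",     "Wind": "Strong"}, "No"),
    ]
    print(id3(data, ["Outlook", "Wind"]))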
21
Entropy
- S is a sample of training examples.
- p_+ is the proportion of positive examples in S.
- p_- is the proportion of negative examples in S.
- Entropy measures the impurity of S:

    Entropy(S) = - \sum_{i=1}^{c} p_i \log_2 p_i

- Here, for two classes:

    Entropy(S) = - p_+ \log_2 p_+ - p_- \log_2 p_-

- Entropy([9+, 5-]) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940
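A quick check of the [9+, 5-] figure in Python:

    import math

    p_pos, p_neg = 9 / 14, 5 / 14
    print(round(-p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg), 3))   # 0.94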
22
Entropy function
    Entropy(S) = - p_+ \log_2 p_+ - p_- \log_2 p_-

- Entropy(S) = the expected number of bits needed to encode the class (+ or -) of a randomly drawn member of S (under the optimal, shortest-length code).
23
Information Gain
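The formula here is presumably the standard ID3 definition of information gain (it matches the Gain(S, A) values used on the following slides):

    Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)

where S_v is the subset of S for which attribute A has value v.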
24
Information Gain
- S is a collection of training-example days described by attributes, including Wind, which has the values Weak and Strong.
- S contains 14 examples, [9+, 5-].
- 6 of the positive and 2 of the negative examples have Wind = Weak; the remainder have Wind = Strong. (The gain is worked out below.)
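Working this through with the definition above reproduces the Gain(S, Wind) = 0.048 quoted two slides later:

    Entropy(S) = 0.940
    Entropy(S_Weak)   = Entropy([6+, 2-]) = -(6/8) \log_2 (6/8) - (2/8) \log_2 (2/8) = 0.811
    Entropy(S_Strong) = Entropy([3+, 3-]) = 1.000
    Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.000) \approx 0.048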
25
Training Examples
26
Which Attribute Is the Best Classifier
- Information gain is the measure used by ID3 to select the best attribute at each step in growing the tree.
- Example: the information gain of two attributes, Humidity and Wind, is computed to determine which is better for classifying the training examples.
27
An Illustrative Example
Gain(S, Outlook) = 0.245
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temp) = 0.029

Which attribute should be tested here?
28
Selecting the Next Attribute
S_sunny = {D1, D2, D8, D9, D11}
Gain(S_sunny, Humidity) = 0.970
Gain(S_sunny, Temp) = 0.570
Gain(S_sunny, Wind) = 0.019

[Figure: partially grown tree with Outlook at the root; the Sunny, Overcast, and Rain branches receive days {D1, D2, D8, D9, D11}, {D3, D7, D12, D13}, and {D4, D5, D6, D10, D14}, and the Humidity test under Sunny splits {D1, D2, D8} from {D9, D11}.]
29
Hypothesis Space Search by ID3
ID3 searches through the space of possible decision trees from simple to increasingly complex, guided by the information gain measure Gain(S, A).
30
Hypothesis Space Search by ID3
- ID3 searches a complete hypothesis space, but it searches that space incompletely.
- Outputs a single hypothesis.
- No backtracking: it converges to a locally optimal solution (maybe not the global optimum).
- Uses statistical properties of all the examples, so it is robust to noisy data.
- Inductive bias: a preference for short trees and for those that place high-information-gain attributes near the root.
31
Day   Temp   Humidity   Wind     Play
D1    Cool   High       Weak     No
D2    Cool   Normal     Weak     No
D3    Hot    High       Strong   No
D4    Hot    Normal     Weak     Yes
D5    Cool   Normal     Strong   Yes

Entropy(S) = -(2/5) \log_2 (2/5) - (3/5) \log_2 (3/5) = 0.971
Gain(S, Humidity) = 0.420
Gain(S, Wind) = Gain(S, Temp) = 0.020

[Figure: candidate trees over D1-D5 splitting on Humidity, Wind, and Temp, with the day subsets and their [+, -] counts shown on each branch; only the Humidity split produces a pure branch ({D1, D3}, both No).]
32
Overfitting in Decision Tree Learning
Consider the error of hypothesis h over
- the training data: error_train(h)
- the entire distribution D of data: error_D(h)

Hypothesis h \in H overfits the training data if there is an alternative hypothesis h' \in H such that

    error_train(h) < error_train(h')   AND   error_D(h) > error_D(h')
33
Overfitting in Decision Tree Learning
34
Avoiding Overfitting

- How can we avoid overfitting?
  - Stop growing before the tree reaches the point where it perfectly classifies the training data.
  - Grow the full tree, then post-prune (widely used).
- How do we select the best tree during pruning?
  - Split the data into a training set and a validation set.
  - Build the decision tree over the training data.
  - Measure performance over the separate validation data set.
- Two ways of pruning:
  - Reduced-error pruning
  - Rule post-pruning
35
Reduced Error Pruning
1. Split the data into a training set and a validation set.
2. Build the tree over the training data.
3. Repeatedly, for each decision node:
   - evaluate the impact on the validation set of pruning that node;
   - remove the one that most improves validation-set accuracy, by
     - removing the subtree rooted at that node, making it a leaf node, and
     - assigning it the most common classification of the training examples affiliated with that node.

(A simplified sketch in code follows below.)
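A simplified Python sketch, reusing the (attr, {value: subtree}) tree form from the ID3 sketch earlier. As a simplification, each subtree is evaluated against the validation examples that reach it, rather than re-scoring the whole tree after every candidate prune:

    from collections import Counter

    def classify(tree, x):
        """Classify the attribute-dict x with a tree that is either a
        leaf label or a node of the form (attr, {value: subtree})."""
        while isinstance(tree, tuple):
            attr, branches = tree
            tree = branches.get(x[attr])
            if tree is None:                 # unseen attribute value
                return None
        return tree

    def accuracy(tree, data):
        return sum(classify(tree, x) == y for x, y in data) / len(data)

    def majority_label(examples):
        return Counter(y for _, y in examples).most_common(1)[0][0]

    def reduced_error_prune(tree, train, validation):
        """Bottom-up: collapse a subtree to a leaf (the majority label of
        the training examples reaching it) whenever that does not hurt
        accuracy on the validation examples reaching it."""
        if not isinstance(tree, tuple):
            return tree
        attr, branches = tree
        pruned = {}
        for value, subtree in branches.items():
            tr = [(x, y) for x, y in train if x[attr] == value]
            va = [(x, y) for x, y in validation if x[attr] == value]
            pruned[value] = reduced_error_prune(subtree, tr or train, va)
        candidate = (attr, pruned)
        leaf = majority_label(train)
        if validation and accuracy(leaf, validation) >= accuracy(candidate, validation):
            return leaf
        return candidate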
36
Effect of Reduced-Error Pruning
37
Rule Post Pruning
1. Convert the tree to an equivalent set of rules (if-then expressions).
2. Prune (generalize) each rule by removing any precondition whose removal improves the rule's estimated accuracy.
3. Sort the pruned rules by their estimated accuracy (and use them in this order when classifying subsequent instances).

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes
...

(A sketch in code follows below.)
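A short Python sketch of steps 1 and 2, again using the (attr, {value: subtree}) tree form from the ID3 sketch. The slide does not specify how accuracy is estimated; here it is measured on a data set passed in (for example a validation set), which is an assumption:

    def tree_to_rules(tree, preconditions=()):
        """One (preconditions, label) rule per root-to-leaf path."""
        if not isinstance(tree, tuple):
            return [(list(preconditions), tree)]
        attr, branches = tree
        rules = []
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, preconditions + ((attr, value),)))
        return rules

    def rule_accuracy(preconds, label, data):
        """Accuracy of a single rule over the examples it covers."""
        covered = [(x, y) for x, y in data if all(x[a] == v for a, v in preconds)]
        if not covered:
            return 0.0
        return sum(y == label for _, y in covered) / len(covered)

    def prune_rule(preconds, label, data):
        """Greedily drop preconditions while estimated accuracy does not drop."""
        preconds = list(preconds)
        improved = True
        while improved and preconds:
            improved = False
            for p in list(preconds):
                shorter = [q for q in preconds if q != p]
                if rule_accuracy(shorter, label, data) >= rule_accuracy(preconds, label, data):
                    preconds, improved = shorter, True
                    break
        return preconds, label

Step 3 is then just sorting the pruned rules by rule_accuracy.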
38
Conclusion

- Interacting with data
  - How to get the most out of data and convert data into knowledge, information, or predictions.
  - Classification, clustering, regression, and dimensionality reduction.
- Classification
  - Categorize objects into particular classes based on their attributes.
- The Nearest Neighbor Algorithm
- Decision Tree Algorithm
39
Contents

- Classification
  - K-nearest-neighbor algorithm, Decision trees
  - Computational learning theory
  - Boosting, Support vector machines
- Clustering
  - K-means clustering, Agglomerative clustering
- Graphic Models (a marriage of probability theory and graph theory)
  - Naive Bayes classification, EM (Expectation-Maximization) algorithm
- Regression (predict a real-valued quantity based on observed data)
  - Linear regression, Logistic regression
- Dimensionality Reduction (reduce the representation of data)
  - PCA (Principal Components Analysis), Factor analysis
- Advanced Topics and Applications
40
Thank you !
41