CIS 730 (Introduction to Artificial Intelligence) Lecture


KDD Group Research Seminar
Fall 2001 – Presentation 8 – 11
Incremental Learning
Friday, November 16
James Plummer
[email protected]
Reference
Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997.
Laboratory for Knowledge Discovery in Databases (KDD)
Kansas State University
Department of Computing and Information Sciences
Outline
• Machine Learning
  – Extracting information from data
  – Forming concepts
• The Data
  – Arrangement of data
    • Attributes, labels, and instances
  – Categorization of data
  – Results
• MLJ (Machine Learning in Java)
  – Collection of machine learning algorithms
  – Current inducers
• Incremental Learning
  – Description of technique
  – Nearest-neighbor algorithm
  – Distance-weighted algorithm
• Advantages and Disadvantages
  – Gains and losses
Machine Learning
• Sometimes called data mining
• The process of extracting useful information from data
  – Marketing databases, medical databases, weather databases
    • Finding consumer purchase patterns
• Used to form concepts
  – Predictions
  – Classifications
  – Numeric answers
The Data
• Arrangement of Data
  – A piece of data is a set of attributes ai that make up an instance xj
    • Attributes can be considered evidence
  – Each instance has a label or category f(xj) (outcome value):
    $x_j = (a_1, a_2, a_3, \ldots, a_n); \quad f(x_j)$
  – A set of data is a set of instances
• Categorization
  – A set of instances is used as training data against which new query instances xq are compared
  – Calculate f^(xq) based on the training data
    • f^(xq) is the predicted value of the actual f(xq)
• Results (a sketch of this measure follows the list)
  – The number of correctly predicted values over the total number of query instances:
    $\hat{f}(x_q)_{\text{correct}} \; / \; f(x_q)_{\text{total}}$
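To make the arrangement and the accuracy measure concrete, here is a minimal Java sketch. The names (AccuracyDemo, Instance, accuracy) are illustrative assumptions, not MLJ's API; records require Java 16+.

```java
// Minimal sketch of the data model above: an instance xj is an array of
// attribute values ai plus its label f(xj). Names are illustrative only.
public class AccuracyDemo {
    record Instance(String[] attributes, String label) {}

    // Accuracy = correctly predicted labels / total number of query instances
    static double accuracy(String[] predicted, Instance[] queries) {
        int correct = 0;
        for (int i = 0; i < queries.length; i++) {
            if (predicted[i].equals(queries[i].label())) {
                correct++;
            }
        }
        return (double) correct / queries.length;
    }

    public static void main(String[] args) {
        // Query instances 6-8 from the weather data on the next slide
        Instance[] queries = {
            new Instance(new String[] {"Rain", "Cool", "Normal", "Strong"}, "Yes"),
            new Instance(new String[] {"Overcast", "Cool", "Normal", "Strong"}, "No"),
            new Instance(new String[] {"Sunny", "Mild", "High", "Weak"}, "Yes")
        };
        String[] predicted = {"Yes", "Yes", "Yes"}; // hypothetical f^(xq) values
        System.out.println("Accuracy: " + accuracy(predicted, queries)); // 2/3 correct
    }
}
```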
Data Example
• Predict the values of Examples 6, 7, and 8 given data Examples 1 through 5

Example | Outlook  | Air Temp | Humidity | Wind   | Play Tennis?
--------|----------|----------|----------|--------|-------------
1       | Sunny    | Hot      | High     | Weak   | No
2       | Sunny    | Hot      | High     | Strong | No
3       | Overcast | Hot      | High     | Weak   | Yes
4       | Rain     | Mild     | High     | Weak   | Yes
5       | Rain     | Cool     | Normal   | Weak   | Yes
6       | Rain     | Cool     | Normal   | Strong | Yes
7       | Overcast | Cool     | Normal   | Strong | No
8       | Sunny    | Mild     | High     | Weak   | Yes
MLJ (Machine Learning in Java)
• MLJ is a collection of learning algorithms
  – Inducers
    • Categorize data to learn concepts
• Currently in development
  – ID3
    • Uses decision trees
  – Naïve Bayes
    • Uses probabilistic calculations based on Bayes' theorem
  – C4.5
    • Uses decision trees with pruning techniques
• Incremental Learning
  – Uses comparison techniques
  – Soon to be added
Incremental Learning
• Instance-Based Learning
  – k-Nearest Neighbor
  – All instances correspond to points in an n-dimensional space
    • The distance between two instances is given by
      $d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2}$
      where ar(x) is the rth attribute of instance x
  – Given a query instance xq to be categorized, the k nearest neighbors are calculated
  – f^(xq) is assigned the most frequent value of the nearest k values f(xi) (see the sketch below):
    $\hat{f}(x_q) \leftarrow \arg\max \sum_{i=1}^{k} f(x_i)$ *
  – For k = 1, f^(xq) will be assigned f(xi), where xi is the closest instance in the space
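A minimal runnable sketch of the two formulas above, assuming numeric attributes; categorical attributes such as those in the weather table would first need a numeric encoding. The names (KnnDemo, Instance, classify) are illustrative assumptions, not MLJ's API.

```java
import java.util.*;

public class KnnDemo {
    record Instance(double[] a, String label) {}

    // d(xi, xj) = sqrt( sum over r of (ar(xi) - ar(xj))^2 )
    static double distance(double[] xi, double[] xj) {
        double sum = 0.0;
        for (int r = 0; r < xi.length; r++) {
            double diff = xi[r] - xj[r];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // f^(xq) = the most frequent label among the k nearest neighbors.
    // Sorting the whole training set is O(n log n); fine for a sketch.
    static String classify(List<Instance> training, double[] xq, int k) {
        List<Instance> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(t -> distance(t.a(), xq)));
        Map<String, Integer> votes = new HashMap<>();
        for (Instance neighbor : sorted.subList(0, k)) {
            votes.merge(neighbor.label(), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Instance> training = List.of(
            new Instance(new double[] {1.0, 1.0}, "No"),
            new Instance(new double[] {1.2, 0.9}, "No"),
            new Instance(new double[] {4.0, 4.2}, "Yes"),
            new Instance(new double[] {3.8, 4.0}, "Yes"),
            new Instance(new double[] {4.1, 3.9}, "Yes"));
        System.out.println(classify(training, new double[] {3.9, 4.1}, 3)); // Yes
    }
}
```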
Distance-Weighted Nearest Neighbor
• Same as k-Nearest Neighbor, except
  – The effect of each f(xi) on f^(xq) is weighted by the distance d(xq, xi) (sketch below):
    $\hat{f}(x_q) \leftarrow \arg\max \sum_{i=1}^{k} \frac{f(x_i)}{d(x_q, x_i)}$ *
  – In the case xq = xi, f^(xq) = f(xi)
  – Examine three cases for a two-dimensional instance space:
    • k = 1
    • k = 5
    • Weighted, k = 5
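Added to the KnnDemo sketch from the previous slide (and reusing its Instance record and distance method), the weighted vote might look as follows; this is a sketch under those assumptions, not MLJ's implementation. Each neighbor votes for its label with weight 1/d, and an exact match returns its label directly, which also avoids division by zero.

```java
// Distance-weighted variant: each of the k nearest neighbors votes for
// its label with weight 1 / d(xq, xi), so closer instances count more.
static String classifyWeighted(List<Instance> training, double[] xq, int k) {
    List<Instance> sorted = new ArrayList<>(training);
    sorted.sort(Comparator.comparingDouble(t -> distance(t.a(), xq)));
    Map<String, Double> votes = new HashMap<>();
    for (Instance neighbor : sorted.subList(0, k)) {
        double d = distance(neighbor.a(), xq);
        if (d == 0.0) {
            return neighbor.label(); // xq = xi, so f^(xq) = f(xi)
        }
        votes.merge(neighbor.label(), 1.0 / d, Double::sum);
    }
    return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
}
```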
Advantages and Disadvantages
• Gains of using k-Nearest Neighbor
  – Individual attributes can be weighted differently
  – d(xi, xq) can be changed to give the nearest xi a stronger or weaker effect on f^(xq)
  – Robust to noise in the training data
  – Very effective when provided a large set of training data
  – Flexible: f^(xq) can be calculated in many useful ways
  – Very small training time
• Losses
  – Not good when training data is insufficient
  – Not very effective if similar xi have dissimilar f(xi)
  – More computation time is needed to categorize new instances
References
• Mitchell, Tom M. Machine Learning. McGraw-Hill, 1997.
• Witten, Ian H., and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
* Equations reduced for simplicity.