Example: Data Mining for the NBA - The University of Texas at Dallas
Introduction to Biometrics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Information Relevant to the Project
September 21, 2005
Outline of the Unit
Project Information
Data Mining Techniques (may be useful for the project)
Project Information
In this assignment you will experiment with the following two tasks:
- Face recognition
- Pose recognition
For this experiment you will use the neural network package given in the code directory at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
For training and testing, you will use the face images that can be found in the trainset directory at
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
(OPTIONAL) You will also use k-nearest neighbor on the same dataset and compare its performance with the neural network.
Project Information (Continued)
You don't need to do a significant amount of coding for the
main part of the project,
but the OPTIONAL k-nearest neighbor part requires some
coding: you will use k-nearest neighbor in place of
back propagation in the code.
Training your datasets for Part 1 will take time. It is
recommended that you read the materials at the following
location thoroughly and start early.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Continued)
Face images
The image data can be found in the faces directory at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
This directory contains 20 subdirectories, one for each
person who volunteered for the photo shoot, named by
userid. Each of these directories contains several
versions of the face images.
Images follow this naming convention:
<userid>_<pose>_<expression>_<eyes>_<scale>.pgm
For further details see section 2 of the assignment document at
the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Continued)
How to view the face images
You will need View.java and View.class to view the images.
These two files will be placed on my web site by tomorrow.
To view the images, you can use the Java program View.java.
View.java handles a variety of image formats, including the
PGM format in which the face images are stored. While we
won't go into detail about View in this document, we will
quickly describe the basics you need to know to use it.
To start View, just specify on the command line:
java View ImageInput
- Here "ImageInput" corresponds to the image you want to
view
Project Information (Continued)
The neural network and image access code
C code is provided for a three-layer, fully connected
feed-forward neural network that uses the back propagation
algorithm to tune its weights, along with a top-level
program for training and recognition that serves as a
skeleton for you to modify.
The code is located in the code directory at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
For further details see section 4 of the assignment document
at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Continued)
Assignment
Copy straight_train.list, straight_test1.list, straight_test2.list,
all_train.list, all_test1.list, and all_test2.list into your home
directory to obtain the training and test set data for this
assignment from the trainset link at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
- Train with straight_train.list and test with
straight_test1.list and straight_test2.list, using the default
learning parameter settings (learning rate 0.3, momentum
0.3) for 75 epochs.
Project Information (Continued)
For further assistance see section 5-3 of the assignment
document at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Report your train/test performance and error as a function
of epochs. If you had stopped training when the
performance on test1 leveled off, what would the
performance have been on test2? And vice versa?
Implement a face recognizer; i.e. implement a neural net
which, when given an image as input, indicates who is in the
image. To do this, you will need to implement a different
output encoding. Describe your output encoding. (Hint: leave
learning rate and momentum at 0.3, and use 20 hidden units).
Project Information (Continued)
Implement a pose recognizer; i.e. implement a neural net
which, when given an image as input, indicates whether the
person in the image is looking straight ahead, up, to the left,
or to the right. You will also need to implement a different
output encoding for this task. Describe your output encoding.
(Hint: leave learning rate and momentum at 0.3, and use 6
hidden units).
Train with all_train.list and test with all_test1.list, all_test2.list
with the default learning parameter settings (learning rate 0.3,
momentum 0.3) for 100 epochs.
For further assistance see section 5-8 of the assignment
document at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Concluded)
Report your train/test performance and error as a
function of epochs. If you had stopped training when
the performance on test1 leveled off, what would the
performance have been on test2? And vice versa?
(OPTIONAL) Use k-nearest neighbor to do all
the tasks above. Here you will determine the best k
value.
Market Basket Analysis
What is Market Basket Analysis?
Example
Basic process
Association rules
Disassociation rules
Sequential time-series analysis
Challenges and Directions
Strengths/Weaknesses
What is Market Basket Analysis?
Market basket analysis is a collection of techniques that
discover rules such as which items are purchased together
It has roots in point-of-sale transactions, but has gone beyond
those applications
- E.g., who travels together, who is seen with whom, etc.
Market basket analysis is used as a starting point when transaction
data is available and we are not sure which patterns we are looking
for
- Find items that are purchased together
Essentially, market basket analysis produces association rules
Example
Person     Countries Visited
John       England, France
James      Germany, England, Switzerland
William    England, Austria
Mary       England, Austria, France
Jane       Switzerland, France
Co-Occurrence Table

              England  Switzerland  Germany  France  Austria
England          4          1          1        2       2
Switzerland      1          2          1        1       0
Germany          1          1          1        0       0
France           2          1          0        3       1
Austria          2          0          0        1       2
Example (Concluded)
England and France are more likely to be traveled to together than any
other two countries
Austria is never traveled to together with Germany or Switzerland
Germany is never traveled to together with Austria or France
Rule:
- If a person travels to France then he/she also travels to England
Support for this rule is 2 out of 5, i.e., 40%, since 2 trips
out of five support this rule
Confidence for this rule is 66% since two out of the three trips
that contain France also contain England
That is, the "if France then England" rule has support 40% and
confidence 66%
Challenge: how to automatically generate such rules
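The support and confidence arithmetic above can be checked with a short Python sketch; the helper names (`support`, `confidence`) are ours, not from the slides.

```python
# Travel data from the example above: each trip is a set of countries.
trips = [
    {"England", "France"},                  # John
    {"Germany", "England", "Switzerland"},  # James
    {"England", "Austria"},                 # William
    {"England", "Austria", "France"},       # Mary
    {"Switzerland", "France"},              # Jane
]

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    return both / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction also containing the consequent."""
    has_ante = [t for t in transactions if antecedent <= t]
    both = sum(1 for t in has_ante if consequent <= t)
    return both / len(has_ante)

print(support({"France"}, {"England"}, trips))     # 0.4 (2 of 5 trips)
print(confidence({"France"}, {"England"}, trips))  # 0.666... (2 of 3 France trips)
```

This reproduces the 40% support and 66% confidence stated for the "if France then England" rule.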
Basic Process
Choosing the right set of items
- Need to gather the right set of transaction data at the right
level of detail, ensuring data quality
Generating rules from the data
- Generate the co-occurrence matrix for single items
- Generate the co-occurrence matrix for 2 items and use it to find
rules with 2 items
- Generate the co-occurrence matrix for 3 items and use it to find
rules with 3 items; etc.
- The Apriori algorithm
Overcoming practical limits imposed by thousands of items
- Avoid combinatorial explosion
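The pair-counting step above can be sketched in Python; the variable names and the minimum-support threshold are illustrative, not from the slides.

```python
# Count 2-item co-occurrences across transactions, then keep only
# pairs whose support clears a minimum threshold (the Apriori idea
# of pruning infrequent candidates before growing larger itemsets).
from itertools import combinations
from collections import Counter

transactions = [
    {"England", "France"},
    {"Germany", "England", "Switzerland"},
    {"England", "Austria"},
    {"England", "Austria", "France"},
    {"Switzerland", "France"},
]

pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

min_support = 2  # keep pairs appearing in at least 2 transactions
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # {('England', 'France'): 2, ('Austria', 'England'): 2}
```

Only frequent pairs would be extended to 3-item candidates in the next pass, which is how Apriori avoids the combinatorial explosion.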
Association Rules
Rules that find associations in data
An example of an association rule is {x1, x2, x3} → x4, meaning that if x1,
x2, and x3 are purchased, x4 is also purchased
Association rules have confidence values
- Strong rules are rules with a confidence value above a threshold
A challenge is to improve the Apriori algorithm
- E.g., partition-based approach, sampling
Frequent-pattern growth method
- First a database projection is made of frequent items; then the
data is mined in main memory, building a data structure called a
frequent pattern tree
Multidimensional association rule mining
- Mining multidimensional transactional databases
Disassociation Rules
Similar to association rules, but with a negative connector
- E.g., if A and not B, then C
- Whenever A is present and B is not present, C is present
Follow a similar process but introduce a new set of items that are
inverses of the original items
Because of the extra items, there are performance problems
Much of the work focuses on association rules rather than
disassociation rules
Sequential Time Series Analysis
While market basket analysis focuses on associations that happen
at one time, time-series analysis focuses on associations over a
period
- E.g., after John goes to the bank he goes to church
Transactions must carry additional information such as time stamps or
sequencing information, plus identifying information that ties
different transactions to the same customer
Market basket analysis techniques have been extended to include
time-series analysis
Challenges and Directions
Performance improvements
Applying techniques for web mining including web content mining,
web structure mining and web usage mining
- HITS and LOGSOM algorithms
- Mining path-traversal algorithm
Finding associations in text
- Associations between words in a document or multiple
documents
- Text refining and transformations
Strengths/Weaknesses
Strengths
- Understanding of results
- Supports data mining without knowing exactly what you want to
find
- Works on variable length data
- Simple computations
Weaknesses
- Computationally expensive
- Determining the right data is difficult
Decision Trees
What is a Decision Tree?
Example
Some Requirements
Pruning Trees
Evaluating Trees
C4.5
Variations of Decision Tree methods
Strengths/Weaknesses
What is a Decision Tree?
It is used for classification
The tree is drawn with the root at the top and leaves at the bottom
A record enters at the root, and a test is applied to decide which
child path it takes
- Depends on the test that is applied
- Choose the test that best discriminates among the target classes
All records that end up at a given leaf are classified the same way
Essentially, decision trees produce classifiers
Example
Religion = X?
- Yes: 20 < Age < 50?
    - Yes: Class A
    - No: Class B
- No: Country = Y?
    - Yes: Class C
    - No: Class D

Class A: Highly suspicious
Class B: Suspicious
Class C: Mildly Suspicious
Class D: Non Suspicious
Example (Concluded)
Classification
- Example 1:
Religion = X
Age = 25
Country = Y
Class = Highly suspicious
- Example 2
Religion = Z (not equal to X)
Age = 28
Country = P (not equal to Y)
Class = Non suspicious
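The example tree above can be written directly as a classification function; this is a sketch, and the function name is ours.

```python
# The decision tree from the example, as nested tests: each branch
# mirrors one path from the root to a leaf.
def classify(religion, age, country):
    if religion == "X":
        if 20 < age < 50:
            return "Class A (Highly suspicious)"
        return "Class B (Suspicious)"
    if country == "Y":
        return "Class C (Mildly suspicious)"
    return "Class D (Non suspicious)"

print(classify("X", 25, "Y"))  # Example 1 -> Class A (Highly suspicious)
print(classify("Z", 28, "P"))  # Example 2 -> Class D (Non suspicious)
```

Note that once Religion = X holds, Country is never consulted, which is exactly why Example 1 lands in Class A despite Country = Y.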
Some Requirements
Some requirements have to be met for decision trees to work; they
are listed below
Attribute-value description
Pre-defined classes
Discrete classes
Sufficient data
Logical classification methods
Pruning Trees
Essentially, discard one or more subtrees and replace them with
leaves
There are various ways to prune trees
- Examples include:
Decide not to divide a set of samples further under some
condition, using a statistical test
Remove some of the tree (after the tree is constructed) using
selected accuracy criteria; called post-pruning
The well-known C4.5 algorithm uses post-pruning
Identify candidate subtrees and then determine how to
prune
Evaluating Sub-Trees
Candidate subtrees are selected for pruning
Next, determine the subtrees that will best classify the data
- Apply the error rate in selecting the best subtree
The best subtree has to be at the right level: pruned enough but not
over-pruned
The best subtree may not work in all cases; alternate subtrees
may be needed
Costs may also need to be taken into account; for example, what is the
cost of misclassifying data?
C4.5
C4.5 is Ross Quinlan's technique based on decision trees
It is a refinement of the 1986 technique called ID3
C4.5 is used in products such as Clementine
It uses the post-pruning approach, but with a specific technique to
estimate the predicted error rate
- Pessimistic pruning
The algorithm generates decision rules
- E.g., If Attribute 1 = A and Attribute 2 = B, then
Classification = Class C
Variation of Decision Tree Method: Multiple Variables
The most common decision tree methods are C4.5, CART, and CHAID
- All three algorithms use a test on a single variable to decide the
split
In some cases we may need to use more than one variable for the
test
- E.g., age and gender to make decisions
Algorithms have been developed to use multiple attributes
Strengths/Weaknesses
Strengths
- Understandable rules
- Classification without extensive computation
- Indicate which fields are important for classification
Weakness
- Error-prone with many classes
- Expensive to train
Neural Network
What is a neural network?
Example
Basic process
Types of Neural Networks
Back Propagation Learning
Key Issues in Learning
Neural networks for time series
Strengths/Weaknesses
What is a Neural Network?
Neural networks consist of basic units modeled on biological
neurons
Each unit combines its inputs into a single output
Main questions
- What are the units and how do they behave?
- How are they connected together?
Neural networks are trained using examples and then run
on data sets
Based on the training they receive, they carry out activities
Used for both directed and undirected data mining
Example
[Figure: a simple neural network whose inputs are Living space, Size of garage, Age of house, and Other features, and whose single output is the Appraised value]
Example (Continued)
Features describing a house
Feature        Description           Values
# apts         # dwelling units      1-3
Yr built       Year built            1900-2000
Plumb. fix     # plumbing fixtures   5-17
Heating type   Heating system type   coded as A or B
Base gar.      Basement garage       0-2
Att. Gar       Attached garage       0-228
Living         Total living space    714-4185
Deck           Deck space            0-738
Porch          Porch space           0-452
Recroom        Recreation space      0-672
Basement       Basement space        0-810
Example (Continued)
Training set example
Feature       Value
Sales Price   171,000
Months ago    4
Num apts      1
Yr built      1923
Plum. Fix     9
Heat Type     A
Base gar      0
Att. Gar      120
Living        1614
Deck          0
Porch         210
Recroom       0
Basement      175
Example (Continued)
Massaged Training set example
Feature       Original Value   Range of Values    Massaged Value
Sales Price   171,000          103,000-250,000    0.4626
Months ago    4                0-23               0.1739
Num apts      1                1-3                0.0000
Yr built      1923             1850-1986          0.5328
Plum. Fix     9                5-17               0.3333
Heat Type     A                A or B             1.0000
Base gar      0                0-2                0.0000
Att. Gar      120              0-228              0.5263
Living        1614             714-4185           0.2593
Deck          0                0-738              0.0000
Porch         210              0-452              0.4646
Recroom       0                0-672              0.0000
Basement      175              0-810              0.2160
Example (Concluded)
Computing the Appraised Value
Example: Sales price
- 103,000 is 0.0000 and 250,000 is 1.0000
- Therefore, 171,000 is 0.4626
Using some computation, the neural network computes the appraised
value to be $213,250
- The output of the neural network is between 0 and 1. To get the
appraised value of $213,250, the output would be 0.75 (the
massaged output)
After the neural network is trained with examples, it determines
the formula and then computes the appraised value
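The massaged values above come from min-max scaling into [0, 1]; a minimal sketch reproducing two rows of the table (the helper name `massage` is ours):

```python
# Min-max "massaging": map a raw feature value into [0, 1] using the
# feature's known range, as in the massaged training set table.
def massage(value, lo, hi):
    return (value - lo) / (hi - lo)

# Sales Price 171,000 in range 103,000-250,000
print(round(massage(171_000, 103_000, 250_000), 4))  # 0.4626
# Attached garage 120 in range 0-228
print(round(massage(120, 0, 228), 4))                # 0.5263
```

Both results match the Massaged Value column, confirming how the table was produced.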
Basic Process
Inputs are given to the network
A combination function combines the inputs into a single value,
usually a weighted summation
A transfer function calculates the output value from the result of the
combination function
The result is exactly one output, between 0 and 1
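These two steps can be sketched for a single unit, assuming a weighted sum as the combination function and a sigmoid as the transfer function; the weights below are arbitrary illustrative values.

```python
import math

# One unit: combination function (weighted sum) followed by a
# sigmoid transfer function, yielding an output in (0, 1).
def unit_output(inputs, weights, bias=0.0):
    combination = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-combination))  # sigmoid transfer

# Feed in massaged feature values with made-up weights.
print(unit_output([0.4626, 0.1739, 0.0], [0.5, -0.3, 0.8]))
```

Whatever the inputs, the sigmoid guarantees the single output stays between 0 and 1, matching the last bullet above.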
Types of Neural Network
Feed-forward neural network
- Processing propagates from inputs to outputs without any loops
- In a layered representation of a feed-forward network, there are no
links between nodes within a layer
- Hidden layers sit between the input and output layers
Recurrent neural network
- There are feedback links that form a circular path in the network
The most widely used network is the feed-forward network with a back
propagation learning algorithm
Back Propagation Learning
Training the network means setting the best weights on the inputs of
each unit
- Use weights for which the output is close to the desired output
Consists of the following steps
- The network gets a training example and, using the existing weights
in the network, calculates the output or outputs for the example
- Back propagation calculates the error by comparing the calculated
result with the actual result
- The error is fed back through the network and the weights are
adjusted to minimize the error
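These steps can be sketched for a single sigmoid output unit using the delta rule that back propagation applies at the output layer; this is a simplified sketch with made-up data, not the course's C code.

```python
import math

# One back-propagation step for a single sigmoid output unit:
# compute the output, derive the error signal, and nudge each
# weight in the direction that reduces the squared error.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(inputs, weights, target, learning_rate=0.3):
    output = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
    # delta = error * derivative of the sigmoid at the output
    delta = (target - output) * output * (1.0 - output)
    return [w + learning_rate * delta * x for w, x in zip(weights, inputs)]

weights = [0.1, -0.2]
for _ in range(100):  # repeated steps on one example
    weights = train_step([1.0, 0.5], weights, target=0.9)
print(sigmoid(1.0 * weights[0] + 0.5 * weights[1]))  # moves toward the 0.9 target
```

The learning rate 0.3 matches the default used in the project instructions earlier in this unit; a real network repeats this update for every weight in every layer.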
Key Issues
Choosing appropriate training sets
The training set needs to cover the full range of values
Number of inputs and number of outputs – need good coverage
Need computational power
Preparing the input data
Features with continuous and discrete values as well as categorical
values
Need to interpret the results
Neural Networks for Time Series Data
Data has a time series
- E.g., stock market values fluctuate
- Need data mining techniques to predict the next value in the
time series
Neural networks have been adapted to handle time series data
- The network is trained using the time series, starting with the
oldest point, then the second oldest point, etc.
- Feed-forward back propagation network
May take multiple time series as inputs
- E.g., Euro, Dollar, etc.
Heuristics may be used for feed-forward back propagation networks
Strengths/Weaknesses
Strengths
- Wide range of problems
- Good results in complicated domains
- Handle discrete, continuous, categorical values
- Many packages
Weaknesses
- Require inputs in the range 0 to 1
- Cannot explain results
- May produce a solution prematurely
Link Analysis
What is Link Analysis?
Example
Graph Theoretic Techniques
Some problems in graph theory
Examples
Link analysis and Visualization
Strengths/Weaknesses
What is Link Analysis?
Link analysis is based on graph theory
It makes links between events, activities, places, and objects
- E.g., linking physicians to pharmaceutical companies
Association rules also find relationships; however, using
graph-theoretic techniques one can find links not often found through
other techniques
Link analysis is not mature; there are few products
There is much interest in using link analysis for counter-terrorism
- Connecting the dots
Example
Example travel database given in Unit #10
A weighted graph where the edge weights are the number of trips
containing the countries represented by the nodes at either end
[Figure: weighted graph over the nodes England, France, Austria, Germany, and Switzerland; each edge weight is the number of shared trips from the co-occurrence table, e.g., England–France: 2]
Example (Continued)
Five calls link seven phone numbers
[Figure: call graph linking the numbers 703-3658, 781-5166, 703-4271, 703-3068, 703-3108, 781-6595, and 202-1212; edges are labeled with call durations (41 sec, 23 sec, 01 sec, 42 sec, 1 min 22 sec)]
Example (Continued)
Call Graph for 15 numbers and 19 calls
[Figure: call graph; edge labels give call durations in seconds]
Example (Continued)
Initial Call Graph: Some nodes labeled, short calls removed
[Figure: the call graph with short calls removed; nodes labeled F (Fax), I (Information), or U (Unknown)]
Example (Concluded)
Assigning Labels to Nodes
[Figure: the call graph with labels assigned to the nodes]
Those connected initially to a Fax are Fax (F)
Those connected to Information are Voice (V)
Those connected to both are Shared (S)
The rest are Unknown (U)
Graph Theoretic Techniques
Graphs have nodes and edges
The earliest problem in graph theory is the seven bridges of
Königsberg
Various types of graphs include directed graphs and undirected
graphs
In directed graphs the edges are one-directional; in undirected graphs
they are bidirectional
Some Problems in Graph Theory
Detecting cycles in graphs
- Whether a directed graph has cycles
- If no cycles, it is an acyclic graph
Pattern matching problems
- Find recurrent occurrences of subgraphs in a graph
Graph Reduction
- Reduce graphs and eliminate unnecessary nodes
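Cycle detection, the first problem listed above, can be sketched with a depth-first search; this is a standard approach, and the names below are ours.

```python
# Detect whether a directed graph has a cycle: DFS with a
# three-color marking (unvisited / on the current path / finished).
# A back edge to a node on the current path means a cycle exists.
def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge -> cycle
                return True
            if color.get(nxt, WHITE) == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)

print(has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}))  # True
print(has_cycle({"A": ["B"], "B": ["C"], "C": []}))     # False (acyclic)
```

A graph for which this returns False is an acyclic graph, as noted above.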
Link Analysis / Graph Reduction
[Figure: an original graph with nodes A through F is reduced to a smaller graph containing only the necessary nodes]
Connecting the Dots
Either a data miner automatically connects the nodes,
or a federated administrator combines the graphs, carries out
mining, and connects the nodes; connections are shown by
darkened lines
Link Analysis and Visualization
Link analysis is based on graphs and graphs can be visualized
With visualization one can detect patterns not usually detected in
tables or other kinds of data
Visualization can be applied to the data, and link analysis may then be
performed on the visualized concepts
Works in both directions:
- Visualization for link analysis
- Link analysis on visual data
Strengths/Weaknesses
Strengths
- Takes advantage of relationships
- Useful for visualization
Weaknesses
- Not applicable to many types of data
- Few tools
- May be difficult to implement
Cluster Detection
What is cluster detection?
Example
Similarity Measures
Hierarchical Clustering
Partitional Clustering
Incremental Clustering
Evaluating Clusters
Strengths/Weaknesses
What is Cluster Detection?
Clustering explores the interrelationships between samples
Example: customers in Cluster 1 purchased cars costing more than
50K, and customers in Cluster 2 purchased cars costing more than
100K
Various representations
- Represent as clusters of points in an n-dimensional space
- Represent graphically using nodes in a clustering tree
- Use some logical expression on sample attributes
Usually the samples to be clustered are represented as vectors of
measurements
- Points in a multidimensional space
Example
Initial seeds determine initial cluster boundaries
[Figure: points in the (X1, X2) plane with three seeds — seed1, seed2, seed3 — defining the initial cluster boundaries]
Example (Continued)
Calculating the Centroids of New Clusters
[Figure: the same points in the (X1, X2) plane, with an X marking the centroid of each new cluster]
Example (Concluded)
At Each Iteration, All Cluster Assignments are Reevaluated
[Figure: cluster boundaries shift as each point is reassigned to its nearest centroid]
Similarity Measures
We need a measure of similarity between two patterns drawn from
the same feature space
This measure must be chosen carefully because the quality of
clustering depends on this measure
One usually computes dissimilarity between two samples using a
distance measure defined on the feature space
Distance measure may be metric or quasi metric and used to
quantify the dissimilarity of samples
Hierarchical Clustering
In hierarchical cluster analysis, the number of clusters is not
specified as part of the input
- That is, the input is (X, s), where X is a set of samples and s is a
measure of similarity
The output is a hierarchy of clusters
Usually hierarchical clustering is not based on optimization, but on
finding approximate, suboptimal solutions
Two types of algorithms
- Divisive: starts with the entire set of samples, divides it into
subsets, and divides each subset further
- Agglomerative: regards each object as an initial cluster;
clusters are merged into coarser partitions until the
trivial partition is obtained
Partitional Clustering
A partitional clustering algorithm obtains a single partition of the data
instead of the clustering structure (e.g., a dendrogram) produced by
a hierarchical technique
Partitional methods have advantages for large data sets, since
constructing dendrograms may be computationally expensive
Partitioning techniques produce clusters by optimizing a criterion
function defined either locally (on a subset of samples) or globally
(over all samples)
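A common partitional method is k-means, which optimizes a global criterion (total within-cluster squared distance) by alternating reassignment and centroid recomputation, echoing the seed/centroid example earlier in this unit. A minimal sketch; the data and seed value are illustrative.

```python
import random

# Minimal k-means: pick k seeds, then repeatedly (1) assign each
# point to its nearest centroid and (2) recompute each centroid as
# the mean of its assigned points.
def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # reassign every point to its nearest centroid
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):  # recompute centroids
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(kmeans(points, 2)))  # two centroids, one per group
```

With two well-separated groups like these, the centroids settle at the group means after a few iterations, regardless of which seeds were drawn.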
Incremental Clustering
The other types of clustering work in main memory
There are applications where not all of the data can reside in main
memory
Incremental clustering techniques are needed to solve this problem
Basic process
- The data set is kept in secondary memory and subsets of the data
are clustered independently, followed by a merge of the clusters
(divide and conquer)
- Data is transferred to main memory one item at a time for clustering,
and only cluster representations are stored in main memory
- A parallel implementation may be used
Evaluating Clusters
We need clusters whose members have a high degree of similarity
How do we determine whether a cluster is good – that is, whether its
members are similar?
- Standard measure for within cluster similarity is variance; sum
of the squared differences of each element from the mean
Need solutions that generate clusters with lowest variance
- Another measure: Take whatever similarity measure or distance
metric used to form clusters and compare the average distance
within clusters to the average distance between clusters
Strengths/Weaknesses
Strengths
- Undirected knowledge discovery
- Works well with numeric, categorical and textual data
Weaknesses
- Difficult to choose right distance measures and weights
- May be hard to interpret results
Memory-Based Reasoning
What is Memory-based Reasoning (MBR)?
Example
Applying MBR
Distance Function
Combination Function
Challenges and Directions
Strengths/Weaknesses
What is Memory-Based Reasoning?
Identifying similar cases based on experience and then applying the
information from those cases to the problem at hand
Two functions: a distance function assigns a distance between any
two records; a combination function combines the results from the
neighbors to arrive at an answer
Used for
- Fraud detection: new cases of fraud are likely to be similar to
known cases
- Medical treatments: the most effective treatment is the one that
resulted in the best outcomes for similar patients
- Customer response prediction: the next customers likely to respond
to offers are similar to the previous customers who
responded
Example
Scatter Graph of Events attended by Age and Source
[Figure: scatter plot with Age (25–50) on the horizontal axis and Source (1–4) on the vertical axis; each point is marked with the event attended — A, B, C, or D]
Example (Concluded)
To predict last event attended by 3 unknowns find where they land on the
scatter graph and make a prediction based on nearest neighbor
[Figure: the same scatter plot with the three unknown points placed and their nearest neighbors circled]
Applying MBR
Basic Idea
- The idea is to use a training set and, for every record in the set,
compute a point in space
- When a new entry arrives, compute the point in space for the
new entry
- The new entry will have properties similar to its neighbors
Steps
- Choosing the Training Set
- Determining the Distance Function
- Choosing the number of nearest neighbors
- Determining the combination function
Distance Function
Distance is the way MBR measures similarity
The distance from A to B has four key properties
- Well defined: the distance between two points is a non-negative
real number
- Identity: the distance from a point to itself is zero
- Commutativity: the distance from A to B is the same as the distance
from B to A
- Triangle inequality: visiting an intermediate point C on the way from
A to B never shortens the distance
For MBR the points are records in a database; distance is the basis for
measuring similarity
- Every record has a neighbor somewhere in the database
- Build a distance function one field at a time
- Merge the distance functions for the different fields
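Building the distance one field at a time and then merging the pieces can be sketched as follows; the record fields, the scaling, and the merge-by-sum are illustrative assumptions, not prescribed by the slides.

```python
# Per-field distances: a scaled numeric distance and an exact-match
# categorical distance, merged here by simple summation.
def numeric_distance(a, b, lo, hi):
    return abs(a - b) / (hi - lo)      # scaled into [0, 1]

def categorical_distance(a, b):
    return 0.0 if a == b else 1.0      # 0 if equal, else 1

def record_distance(r1, r2):
    # hypothetical records: (age, gender)
    return (numeric_distance(r1[0], r2[0], lo=0, hi=100)
            + categorical_distance(r1[1], r2[1]))

print(record_distance((25, "F"), (30, "F")))  # 0.05
print(record_distance((25, "F"), (30, "M")))  # larger: the gender field differs
```

Scaling each field into [0, 1] before merging keeps any one field from dominating the total, which matters when fields have very different ranges.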
Combination Function
Sometimes the single nearest neighbor may not work
Example:
- A may be close to B, but A may be a millionaire while B is a
pauper
- A tornado hits a region, but the sun shines 5 miles away
Solution: a combination function
- The K nearest neighbors determine the answer
- Compute the distance to each neighbor and select the K
nearest
- Use some combination function (e.g., majority vote) to
determine the value for the new entry
- E.g., is the person likely to leave the company?
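The K-nearest-neighbor vote can be sketched as follows; the training data and labels are illustrative.

```python
from collections import Counter

# Combination step: sort training records by distance to the new
# point, take the k nearest, and let a majority vote decide.
def knn_predict(training, new_point, k=3):
    # training: list of ((x, y), label) pairs
    by_distance = sorted(
        training,
        key=lambda item: (item[0][0] - new_point[0]) ** 2
                         + (item[0][1] - new_point[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1, 1), "yes"), ((1, 2), "yes"), ((2, 1), "yes"),
            ((8, 8), "no"), ((9, 8), "no")]
print(knn_predict(training, (2, 2)))  # yes -- all 3 nearest neighbors vote yes
```

Using k > 1 guards against the millionaire/pauper problem above: one anomalous close neighbor is outvoted by the rest.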
Strengths/Weaknesses
Strengths
- Results are easily understood
- Application to arbitrary data types; even nonrelational data
- Works well with any number of fields
- Minimum effort to maintain a training set
Weaknesses
- Computationally intensive
- Storage for training set
- Dependence on distance function and combination function
Smart Card and Biometrics: Useful reference
- http://www.ibia.org/membersadmin/whitepapers/pdf/17/Ensuring%20integrity%20with%20fingerprint%20verification.pdf