Example: Data Mining for the NBA - The University of Texas at Dallas


Introduction to Biometrics
Dr. Bhavani Thuraisingham
The University of Texas at Dallas
Information Relevant to the Project
September 21, 2005
Outline of the Unit
 Project Information
 Data Mining Techniques (may be useful for the project)
Project Information
 In this assignment, you will experiment with the following two tasks
- Face recognition
- Pose recognition
 For this experiment you will use the neural network package given in the code directory at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
 For training and testing, you will use the face images that can be found in the trainset directory at
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
 (OPTIONAL) You will also use k-nearest neighbor on the same dataset and compare its performance with the neural network.
Project Information (Continued)
 You don't need to do significant amounts of coding for the main part of the project, but for the OPTIONAL k-nearest neighbor part some coding is needed: you will use k-nearest neighbor instead of back propagation in the code.
 Training your networks for part 1 will take time. It is recommended that you read the materials at the following location thoroughly and start early.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Continued)
 Face images
 The image data can be found in the faces directory at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
This directory contains 20 subdirectories, one for each person who volunteered for the photo shoot, named by userid. Each of these directories contains several versions of the face images.
 For the images the following naming convention is used:
<userid>_<pose>_<expression>_<eyes>_<scale>.pgm
 For further details see the assignment document, section 2, at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Continued)
 How to view the face images
 You will need View.java and View.class to view the images. These two files will be placed on my web site by tomorrow.
 To view the images, you can use the java program View.java.
View.java handles a variety of image formats, including the
PGM format in which the face images are stored. While we
won't go into detail about View in this document, we will
quickly describe the basics you need to know to use View.
 To start View, just specify on the command line:
java View ImageInput
- Here "ImageInput" corresponds to the image you want to view
Project Information (Continued)
 The neural network and image access code
 C code is available for a three-layer fully connected feed forward neural network, which uses the back propagation algorithm to tune its weights. The top-level program for training and recognition is also available, as a skeleton for you to modify.
 The code is located in the code directory at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
 For further details see the assignment document, section 4, at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Continued)
 Assignment
 Copy straight_train.list, straight_test1.list, straight_test2.list, all_train.list, all_test1.list, and all_test2.list into your home directory to obtain the training and test set data for this assignment, from the trainset link at
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
- Train with straight_train.list and test with straight_test1.list and straight_test2.list with the default learning parameter settings (learning rate 0.3, momentum 0.3) for 75 epochs.
Project Information (Continued)
 For further assistance see the assignment document, section 5-3, at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Report your train/test performance and error as a function of epochs. If you had stopped training when the performance on test1 leveled off, what would the performance have been on test2? And vice versa?
 Implement a face recognizer; i.e., implement a neural net which, when given an image as input, indicates who is in the image. To do this, you will need to implement a different output encoding. Describe your output encoding. (Hint: leave learning rate and momentum at 0.3, and use 20 hidden units.)
Project Information (Continued)
 Implement a pose recognizer; i.e., implement a neural net which, when given an image as input, indicates whether the person in the image is looking straight ahead, up, to the left, or to the right. You will also need to implement a different output encoding for this task. Describe your output encoding. (Hint: leave learning rate and momentum at 0.3, and use 6 hidden units.)
 Train with all_train.list and test with all_test1.list and all_test2.list with the default learning parameter settings (learning rate 0.3, momentum 0.3) for 100 epochs.
 For further assistance see the assignment document, section 5-8, at the following link
- http://www.cs.cmu.edu/afs/cs.cmu.edu/user/avrim/www/ML94/face_homework.html
Project Information (Concluded)
 Report your train/test performance and error as a
function of epochs. If you had stopped training when
the performance on test1 leveled off, what would the
performance have been on test2? And vice versa?
 (OPTIONAL) Use k-nearest neighbor to do all the tasks above. Here you will determine the best k value, as in the sketch below.
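
The following is a minimal Python sketch of the optional k-nearest neighbor approach, not the course's C package; the flattened pixel vectors, label lists, and candidate k values are illustrative assumptions.

import math
from collections import Counter

# Hypothetical data layout: each image is flattened into a list of pixel
# intensities; labels[i] is the class of train[i].
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, labels, query, k):
    # Sort training indices by distance to the query and vote among the k nearest.
    nearest = sorted(range(len(train)), key=lambda i: euclidean(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

def best_k(train, labels, test, test_labels, candidates=(1, 3, 5, 7)):
    # Determine the best k value by test-set accuracy, as the assignment asks.
    def accuracy(k):
        return sum(knn_predict(train, labels, x, k) == y
                   for x, y in zip(test, test_labels)) / len(test)
    return max(candidates, key=accuracy)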
Market Basket Analysis
 What is Market Basket Analysis?
 Example
 Basic process
 Association rules
 Disassociation rules
 Sequential time-series analysis
 Challenges and Directions
 Strengths/Weaknesses
What is Market Basket Analysis?
 Market basket analysis is a collection of techniques that discover rules such as which items are purchased together
 It has roots in point-of-sale transactions, but has gone beyond such applications
- E.g., who travels together, who is seen with whom, etc.
 Market basket analysis is used as a starting point when transaction data is available and we are not sure of the patterns we are looking for
- Find items that are purchased together
 Essentially, market basket analysis produces association rules
Example
Person      Countries Visited
John        England, France
James       Germany, England, Switzerland
William     England, Austria
Mary        England, Austria, France
Jane        Switzerland, France
Co-Occurrence Table
              England  Switzerland  Germany  France  Austria
England          4          1          1        2       2
Switzerland      1          2          1        1       0
Germany          1          1          1        0       0
France           2          1          0        3       1
Austria          2          0          0        1       2
Example (Concluded)
 England and France are more likely to be traveled together than any other two countries
 Austria is never traveled together with Germany or Switzerland
 Germany is never traveled together with Austria or France
 Rule:
- If a person travels to France then he/she also travels to England
Support for this rule is 2 out of 5, that is, 40%, since 2 trips out of five support this rule
Confidence for this rule is 66%, since two out of three trips that contain France also contain England
That is, the "if France then England" rule has support 40% and confidence 66%
 Challenge: How to automatically generate the rules (a worked check follows below)
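
As a worked check, a short Python sketch that recomputes the support and confidence of the "if France then England" rule directly from the travel data above.

trips = {
    "John":    {"England", "France"},
    "James":   {"Germany", "England", "Switzerland"},
    "William": {"England", "Austria"},
    "Mary":    {"England", "Austria", "France"},
    "Jane":    {"Switzerland", "France"},
}

both   = sum(1 for t in trips.values() if {"France", "England"} <= t)  # 2 trips
france = sum(1 for t in trips.values() if "France" in t)               # 3 trips
support    = both / len(trips)  # 2/5  = 40%
confidence = both / france      # 2/3 ~= 66%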
Basic Process
 Choosing the right set of items
- Need to gather the right set of transaction data at the right level of detail, ensuring data quality
 Generating rules from the data
- Generate the co-occurrence matrix for single items
- Generate the co-occurrence matrix for pairs of items and use it to find rules with 2 items
- Generate the co-occurrence matrix for triples of items and use it to find rules with 3 items; and so on
- Algorithm: Apriori (sketched below)
 Overcoming practical limits imposed by thousands of items
- Avoid combinatorial explosion
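
A minimal sketch of the level-wise Apriori idea described above, assuming transactions are given as Python sets; it omits the candidate-pruning refinements of the full algorithm.

def apriori(transactions, min_support):
    # Level-wise search: keep only itemsets whose support meets the threshold,
    # then join surviving k-itemsets into (k + 1)-item candidates.
    n = len(transactions)
    support = lambda s: sum(1 for t in transactions if s <= t) / n
    level = {s for s in {frozenset([i]) for t in transactions for i in t}
             if support(s) >= min_support}
    frequent, k = {}, 1
    while level:
        frequent.update({s: support(s) for s in level})
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

# E.g., apriori([{"England", "France"}, {"England", "Austria"}], 0.5)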
Association Rules
 Rules that find associations in data
 An example of an association rule is {x1, x2, x3} => x4, meaning that if x1, x2, and x3 are purchased, x4 is also purchased
 Association rules have confidence values
- Strong rules are rules with a confidence value above a threshold
 A challenge is to improve the Apriori algorithm
- E.g., partition-based approaches, sampling
 Frequent-pattern growth method
- First the database is projected onto its frequent items; the projection is then mined in main memory, building a data structure called a frequent-pattern tree
 Multidimensional association rule mining
- Mining multidimensional transactional databases
Disassociation Rules
 Similar to association rules, but with a negative connector
- E.g., if A and not B, then C
- Whenever A is present and B is not present, C is present
 Follows a similar process, but introduces a new set of items that are the inverses of the original items
 Because of the extra items, there are performance problems
 Much of the work focuses on association rules rather than disassociation rules
Sequential Time Series Analysis
 While market basket analysis focuses on associations that happen at one time, time-series analysis focuses on associations over a period of time
- E.g., after John goes to the bank, he goes to church
 Transactions must carry additional information, such as a time stamp or sequencing information, as well as identifying information that ties different transactions to the same customer
 Market basket analysis techniques have been extended to include time-series analysis
Challenges and Directions
 Performance improvements
 Applying the techniques to web mining, including web content mining, web structure mining, and web usage mining
- HITS and LOGSOM algorithms
- Path-traversal mining algorithms
 Finding associations in text
- Associations between words in a document or across multiple documents
- Text refining and transformations
Strengths/Weaknesses
 Strengths
- Understanding of results
- Supports data mining without knowing exactly what you want to
find
- Works on variable length data
- Simple computations
 Weaknesses
- Computationally expensive
- Determining the right data is difficult
Decision Trees
 What is a Decision Tree?
 Example
 Some Requirements
 Pruning Trees
 Evaluating Trees
 C4.5
 Variations of Decision Tree methods
 Strengths/Weaknesses
What is a Decision Tree?
 It is used for classification
 The tree is drawn with the root at the top and the leaves at the bottom
 A record enters the root and a test is applied to decide which child path to take
- Depends on the test that is applied
- Choose the test that best discriminates among the target classes
 All records that end up at a given leaf are classified the same way
 Essentially, decision trees produce classifiers
Example
Religion = X?
- Yes: 20 < Age < 50?
  - Yes: Class A
  - No: Class B
- No: Country = Y?
  - Yes: Class C
  - No: Class D

Class A: Highly suspicious
Class B: Suspicious
Class C: Mildly suspicious
Class D: Non-suspicious
Example (Concluded)
 Classification
- Example 1:
 Religion = X
 Age = 25
 Country = Y
 Class = Highly suspicious (Class A)
- Example 2:
 Religion = Z (not equal to X)
 Age = 28
 Country = P (not equal to Y)
 Class = Non-suspicious (Class D)
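
The toy tree above, written out as straight-line Python to make the classification paths explicit; the attribute values X and Y are the placeholders from the example.

def classify(religion, age, country):
    # Root test, then one follow-up test per branch, exactly as in the tree above.
    if religion == "X":
        return "Class A" if 20 < age < 50 else "Class B"
    return "Class C" if country == "Y" else "Class D"

print(classify("X", 25, "Y"))  # Example 1 -> Class A (highly suspicious)
print(classify("Z", 28, "P"))  # Example 2 -> Class D (non-suspicious)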
Some Requirements
 Some requirements have to be met for decision trees to work; they
are listed below
 Attribute-value description
 Pre-defined classes
 Discrete classes
 Sufficient data
 Logical classification methods
Pruning Trees
 Essentially, discard one or more subtrees and replace them with leaves
 There are various ways to prune trees
- Examples include:
 Decide not to divide a set of samples further under some condition, using some statistical test
 Remove parts of the tree (after the tree is constructed) using selected accuracy criteria; this is called post-pruning
 The well-known C4.5 algorithm uses post-pruning
 Identify candidate subtrees and then determine how to prune
Evaluating Sub-Trees
 Candidate subtrees are selected for pruning
 Next, determine the subtrees that will best classify the data
- Apply the error rate in selecting the best subtree
 The best subtree has to be at the right level: pruned enough but not over-pruned
 The best subtree may not work in all cases; alternate subtrees may be needed
 Costs may also need to be taken into account; for example, what is the cost of misclassifying data?
C4.5
 C4.5 is Ross Quinlan's technique based on decision trees
 It is a refinement of the 1986 technique called ID3
 C4.5 is used in products such as Clementine
 It uses the post-pruning approach, but with a specific technique to estimate the predicted error rate
- Pessimistic pruning
 The algorithm generates decision rules
- E.g., if Attribute 1 = A and Attribute 2 = B, then Classification = Class C
Variation of Decision Tree method: Multiple
variables
 Most common decision tree methods are C4.5, CART and CHAID
- All three algorithms use a test on a single variable to decide the split
 In some cases we may need to use more than one variable for the
test
- E.g. age and gender to make decisions
 Algorithms have been developed to use multiple attributes
Strengths/Weaknesses
 Strengths
- Understandable rules
- Classification without extensive computation
- Indicate which fields are important for classification
 Weaknesses
- Error-prone with many classes
- Expensive to train
Neural Network
 What is a neural network?
 Example
 Basic process
 Types of Neural Networks
 Back Propagation Learning
 Key Issues in Learning
 Neural networks for time series
 Strengths/Weaknesses
What is a Neural Network?
 Neural networks consist of basic units modeled on biological
neurons
 Each unit has inputs that it combines into a single output
 Main questions
- What are the units and how do they behave?
- How are they connected together?
 Neural networks are trained using examples and then they are run
on data sets
 Based on the training they receive they carry out activities
 Used both for directed and undirected data mining
Example
[Figure: a neural network whose inputs are living space, size of garage, age of house, and other features, and whose output is the appraised value of the house]
Example (Continued)
Features describing a house

Feature        Description           Values
# apts         # dwelling units      1-3
Yr built       Year built            1900-2000
Plumb. fix     # plumbing fixtures   5-17
Heating type   Heating system type   coded as A or B
Base gar.      Basement garage       0-2
Att. gar       Attached garage       0-228
Living         Total living space    714-4185
Deck           Deck space            0-738
Porch          Porch space           0-452
Recroom        Recreation space      0-672
Basement       Basement space        9-810
Example (Continued)
Training set example

Feature       Value
Sales Price   171,000
Months ago    4
Num apts      1
Yr built      1923
Plum. fix     9
Heat type     A
Base gar      0
Att. gar      120
Living        1614
Deck          0
Porch         210
Recroom       0
Basement      175
Example (Continued)
Massaged training set example

Feature       Original Value   Range of Values    Massaged Value
Sales Price   171,000          103,000-250,000    0.4626
Months ago    4                0-23               0.1739
Num apts      1                1-3                0.0000
Yr built      1923             1850-1986          0.5328
Plum. fix     9                5-17               0.3333
Heat type     A                A or B             1.0000
Base gar      0                0-2                0.0000
Att. gar      120              0-228              0.5263
Living        1614             714-4185           0.2593
Deck          0                0-738              0.0000
Porch         210              0-452              0.4646
Recroom       0                0-672              0.0000
Basement      175              0-810              0.2160
Example (Concluded)
 Computing the appraised value
 Example: Sales price
- 103,000 maps to 0.0000 and 250,000 maps to 1.0000
- Therefore, 171,000 maps to 0.4626 (see the sketch below)
 Using some computation, the neural network computes the appraised value to be $213,250
- The output of the neural network is between 0 and 1. To get the appraised value of $213,250, the output will be 0.75 (the massaged output)
 After the neural network is trained with examples, it determines the formula and then computes the appraised value
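
The massaging above is plain min-max scaling. A quick Python check of the values in the table, plus the inverse mapping that turns the network's 0.75 output back into the appraised value:

def massage(value, lo, hi):
    # Map a raw feature value onto [0, 1] using its observed range.
    return (value - lo) / (hi - lo)

def unmassage(output, lo, hi):
    # Map a network output in [0, 1] back onto the original range.
    return lo + output * (hi - lo)

print(round(massage(171_000, 103_000, 250_000), 4))  # 0.4626 (sales price)
print(round(massage(120, 0, 228), 4))                # 0.5263 (attached garage)
print(unmassage(0.75, 103_000, 250_000))             # 213250.0, the appraised value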
Basic Process
 Inputs are given to the network
 A combination function combines the inputs into a single value, usually a weighted summation
 A transfer function calculates the output value from the result of the combination function
 The result is exactly one output between 0 and 1 (see the one-unit sketch below)
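
A one-unit sketch of this process in Python, assuming the common choices: a weighted sum as the combination function and a sigmoid as the transfer function.

import math

def unit(inputs, weights):
    combination = sum(w * x for w, x in zip(weights, inputs))  # combination function
    return 1.0 / (1.0 + math.exp(-combination))                # sigmoid transfer, output in (0, 1)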
Types of Neural Network
 Feed forward neural network
- Processing propagates from inputs to outputs without any loops
- In a layered representation of a feed forward network, there are no links between nodes within a layer
- Hidden layers lie in between the input and output layers
 Recurrent neural network
- There is a feedback link that forms a circular path in the network
 The most widely used network is the feed forward network with a back propagation learning algorithm
Back Propagation Learning
 Training the network means setting the best weights on the inputs of each unit
- Use weights where the output is close to the desired output
 Consists of the following steps (sketched below)
- The network gets a training example and, using the existing weights in the network, calculates the output or outputs for the example
- Back propagation calculates the error by comparing the calculated result with the actual result
- The error is fed back through the network and the weights are adjusted to minimize the error
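
A minimal sketch of one back propagation step for a single sigmoid unit; the course's C package implements the full three-layer case, so this only illustrates the error-driven weight adjustment described above.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weights, inputs, target, rate=0.3):
    # 1. Forward pass: calculate the output with the existing weights.
    output = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    # 2. Error: compare the calculated result with the actual (target) result.
    error = target - output
    # 3. Feed the error back and adjust the weights to reduce it.
    delta = error * output * (1.0 - output)  # sigmoid derivative term
    return [w + rate * delta * x for w, x in zip(weights, inputs)]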
Key Issues
 Choosing appropriate training sets
 The training set needs to cover the full range of values
 Number of inputs and number of outputs: need good coverage
 Need computational power
 Preparing the input data
 Features with continuous and discrete values as well as categorical values
 Need to interpret the results
Neural Networks for Time Series Data
 Data has a time series
- E.g., stock market values fluctuate
- Need data mining techniques to predict the next value in the time series
 Neural networks have been adapted to handle time series data
- The network is trained using the time series, starting with the oldest point, then the second oldest point, etc. (a windowing sketch follows below)
- Feed forward back propagation network
 May take multiple time series inputs
- E.g., Euro, Dollar, etc.
 Heuristics may be used for feed forward back propagation networks
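
A sliding-window sketch of how a time series might be turned into training pairs for a feed forward back propagation network, oldest points first; the window width is an assumption.

def windows(series, width):
    # Each pair is (the last `width` values, the value to predict next).
    return [(series[i:i + width], series[i + width])
            for i in range(len(series) - width)]

# windows([1.02, 1.05, 1.04, 1.08], 2)
# -> [([1.02, 1.05], 1.04), ([1.05, 1.04], 1.08)]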
Strengths/Weaknesses
 Strengths
- Wide range of problems
- Good results in complicated domains
- Handle discrete, continuous, categorical values
- Many packages
 Weaknesses
- Require inputs in the range 0 to 1
- Cannot explain results
- May produce a solution prematurely
Link Analysis
 What is Link Analysis?
 Example
 Graph Theoretic Techniques
 Some problems in graph theory
 Examples
 Link analysis and Visualization
 Strengths/Weaknesses
What is Link Analysis
 Link analysis is based on graph theory
 It makes links between events, activities, places, and objects
- E.g., linking physicians to pharmaceutical companies
 Association rules also find relationships; however, using graph theoretic techniques one can find links not often found through other techniques
 Link analysis is not mature; there are few products
 There is much interest in using link analysis for counter-terrorism
- Connecting the dots
Example
Example travel database given in Unit #10: a weighted graph where each edge weight is the number of trips containing the countries represented by the nodes at either end
[Figure: a weighted graph over the nodes England, France, Germany, Switzerland, and Austria; for example, the England-France edge has weight 2]
Example (Continued)
Five calls link seven phone numbers
[Figure: a call graph linking 703-3658, 781-5166, 703-4271, 703-3068, 703-3108, 781-6595, and 202-1212, with edges labeled by call duration (e.g., 41 sec, 23 sec, 1 min 22 sec)]
Example (Continued)
Call graph for 15 numbers and 19 calls
[Figure: a larger call graph; edges are labeled with call durations in seconds]
Example (Continued)
Initial call graph: some nodes labeled, short calls removed
[Figure: the call graph with nodes marked F (Fax), I (Information), or U (Unknown)]
Example (Concluded)
Assigning labels to nodes
[Figure: the call graph with labels propagated across its nodes]
- Those connected initially to Fax are labeled Fax (F)
- Those connected to Information are labeled Voice (V)
- Those connected to both are labeled Shared (S)
- The rest remain Unknown (U)
Graph Theoretic Techniques
 Graphs have nodes and edges
 The earliest problem in graph theory is the Seven Bridges of Königsberg
 Various types of graphs include directed graphs and undirected graphs
 In directed graphs, edges are one-directional; in undirected graphs, they are bidirectional
Some Problems in Graph Theory
 Detecting cycles in graphs (sketched below)
- Whether a directed graph has cycles
- If it has no cycles, it is an acyclic graph
 Pattern matching problems
- Find recurring occurrences of subgraphs in a graph
 Graph reduction
- Reduce graphs and eliminate unnecessary nodes
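
A depth-first-search sketch of the cycle-detection problem for a directed graph given as an adjacency dict; the three-color marking is a standard device, not something prescribed by the slides.

def has_cycle(graph):
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on the current path / finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            if color.get(nxt, WHITE) == GRAY:   # back edge: the path loops
                return True
            if color.get(nxt, WHITE) == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(node, WHITE) == WHITE and visit(node) for node in graph)

# has_cycle({"A": ["B"], "B": ["C"], "C": ["A"]}) -> True (cyclic)
# has_cycle({"A": ["B"], "B": ["C"], "C": []})    -> False (acyclic)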
Link Analysis / Graph Reduction
[Figure: an original graph with nodes A through F, and the reduced graph obtained by eliminating unnecessary nodes]
Connecting the Dots
Either a data miner automatically connects the nodes, or a federated administrator combines the graphs, carries out the mining, and connects the nodes; connections are shown by darkened lines
Link Analysis and Visualization
 Link analysis is based on graphs and graphs can be visualized
 With visualization one can detect patterns not usually detected in
tables or other kinds of data
 Visualization can be used on data and the link analysis may be
performed on the concepts
 Works in both directions:
- Visualization for link analysis
- Link analysis on visual data
Strengths/Weaknesses
 Strengths
- Takes advantage of relationships
- Useful for visualization
 Weaknesses
- Not applicable to many types of data
- Few tools
- May be difficult to implement
Cluster Detection
 What is cluster detection?
 Example
 Similarity Measures
 Hierarchical Clustering
 Partitional Clustering
 Incremental Clustering
 Evaluating Clusters
 Strengths/Weaknesses
What is Cluster Detection?
 Clustering explores the interrelationships between samples
 Example: customers in Cluster 1 purchased cars costing more than 50K, and customers in Cluster 2 purchased cars costing more than 100K
 Various representations
- Represent as clusters of points in an n-dimensional space
- Represent graphically using nodes in a clustering tree
- Use some logical expression on sample attributes
 Usually the samples to be clustered are represented as vectors of measurements
- Points in multidimensional space
Example
Initial seeds determine initial cluster boundaries
[Figure: three seeds (seed1, seed2, seed3) and the initial cluster boundaries in the X1-X2 plane]
Example (Continued)
Calculating the centroids of the new clusters
[Figure: the recomputed centroids (marked X) in the X1-X2 plane]
Example (Concluded)
At each iteration, all cluster assignments are reevaluated
[Figure: points reassigned to the nearest recomputed centroid in the X1-X2 plane]
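
A k-means sketch matching the three pictures above: start from seeds, assign every point to its nearest centroid, recompute the centroids, and repeat; points are assumed to be tuples of coordinates.

import math

def kmeans(points, seeds, iterations=20):
    centroids = [list(s) for s in seeds]
    for _ in range(iterations):
        # Reevaluate every cluster assignment (third picture).
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Recompute the centroid of each new cluster (second picture).
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, clusters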
Similarity Measures
 We need a measure of similarity between two patterns drawn from
the same feature space
 This measure must be chosen carefully because the quality of
clustering depends on this measure
 One usually computes dissimilarity between two samples using a
distance measure defined on the feature space
 Distance measures may be metric or quasi-metric and are used to quantify the dissimilarity of samples
Hierarchical Clustering
 In hierarchical cluster analysis, the number of clusters is not specified as part of the input
- That is, the input is (X, s), where X is a set of samples and s is a measure of similarity
 The output is a hierarchy of clusters
 Usually hierarchical clustering is not based on optimization, but on finding approximate suboptimal solutions
 Two types of algorithms
- Divisive: starts with the entire set of samples, divides it into subsets, and divides each subset further
- Agglomerative: regards each object as an initial cluster; clusters are merged into coarser partitions until the trivial partition is obtained
Partitional Clustering
 A partitional clustering algorithm obtains a single partition of the data instead of the clustering structure (e.g., a dendrogram) produced by a hierarchical technique
 Partitional techniques have advantages for large data sets, as constructing dendrograms may be computationally complex
 Partitioning techniques produce clusters by optimizing a criterion function defined either locally (over a subset of samples) or globally (over all samples)
Incremental Clustering
 The other types of clustering work in main memory
 There are applications where not all of the data can reside in main memory
 Incremental clustering techniques are needed to solve this problem
 Basic process
- The data set is kept in secondary memory and subsets of the data are clustered independently, followed by a merge of the clusters (divide and conquer)
- Data is transferred to main memory one item at a time for clustering, and only the cluster representations are stored in main memory
- Parallel implementations may be used
Evaluating Clusters
 We need clusters whose members have a high degree of similarity
 How do we determine whether a cluster is good, that is, whether its members are similar?
- The standard measure for within-cluster similarity is variance: the sum of the squared differences of each element from the mean (see the sketch below)
 Need solutions that generate clusters with the lowest variance
- Another measure: take whatever similarity measure or distance metric was used to form the clusters and compare the average distance within clusters to the average distance between clusters
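
A small sketch of the within-cluster variance measure described above; a cluster is a list of coordinate tuples, and lower totals indicate tighter (better) clusters.

def within_cluster_variance(cluster):
    # Sum of squared differences of each element from the cluster mean.
    mean = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in cluster)

# within_cluster_variance([(1, 2), (3, 4)]) -> 4.0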
Strengths/Weaknesses
 Strengths
- Undirected knowledge discovery
- Works well with numeric, categorical and textual data
 Weaknesses
- Difficult to choose right distance measures and weights
- May be hard to interpret results
Memory-Based Reasoning
 What is Memory-based Reasoning (MBR)?
 Example
 Applying MBR
 Distance Function
 Combination Function
 Challenges and Directions
 Strengths/Weaknesses
What is Memory-Based Reasoning?
 Identifying similar cases based on experience and then applying the
information from these cases to the problem at hand
 Two functions: Distance function assigns a distance between any
two records; combination function combines the results from the
neighbors to arrive at an answer
 Used for
- Fraud detection: new cases of fraud are likely to be similar to known cases
- Medical treatments: the most effective treatment is the one that resulted in the best outcomes for similar patients
- Customer response prediction: the next customers likely to respond to an offer are similar to previous customers who have responded
Example
Scatter graph of events attended, by age and source
[Figure: attendees plotted by age (25-50) and source (1-4), each marked with the last event attended: A, B, C, or D]
Example (Concluded)
To predict the last event attended by 3 unknowns, find where they land on the scatter graph and make a prediction based on the nearest neighbors
[Figure: the same scatter graph with the 3 unknown points and their nearest neighbors marked]
Applying MBR
 Basic Idea
- The idea is to use the training set and, for every value in the set, compute a point in space
- When a new entry arrives, compute the point in space for the new entry
- The new entry will have properties similar to its neighbors
 Steps
- Choosing the Training Set
- Determining the Distance Function
- Choosing the number of nearest neighbors
- Determining the combination function
Distance Function
 Distance is the way MBR measures similarity
 The distance from A to B has four key properties
- Well-defined: the distance between 2 points is a non-negative real number
- Identity: the distance from one point to itself is zero
- Commutativity: the distance from A to B is the same as the distance from B to A
- Triangle inequality: visiting an intermediate point C on the way from A to B never shortens the distance
 For MBR the points are records in a database; distance is the basis for measuring similarity
- Every record has a neighbor somewhere in the database
- Build a distance function one field at a time
- Merge the distance functions for the different fields
Combination Function
 Sometimes the single nearest neighbor method may not work
 Example:
- A may be close to B, but A may be a millionaire while B may be a pauper
- A tornado hits a region, but the sun shines 5 miles away
 Solution: a combination function (sketched below)
- The k nearest neighbors determine the answer
- Compute the distance function to each neighbor and select the k nearest neighbors
- Use some combination function (e.g., majority vote) to determine the value for the new entry
- E.g., is the person likely to leave the company?
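
A combined sketch of the two MBR functions: a per-field distance merged into a record distance, and a majority-vote combination over the k nearest neighbors. The field ranges and the summation merge are illustrative assumptions.

from collections import Counter

def field_distance(a, b, lo, hi):
    # Numeric field distance, scaled onto [0, 1] by the field's range.
    return abs(a - b) / (hi - lo) if hi > lo else 0.0

def record_distance(r1, r2, ranges):
    # Merge the per-field distances (here by summing them).
    return sum(field_distance(r1[f], r2[f], lo, hi)
               for f, (lo, hi) in ranges.items())

def mbr_predict(train, labels, query, ranges, k=3):
    nearest = sorted(range(len(train)),
                     key=lambda i: record_distance(train[i], query, ranges))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]  # majority vote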
Strengths/Weaknesses
 Strengths
- Results are easily understood
- Application to arbitrary data types; even nonrelational data
- Works well with any number of fields
- Minimum effort to maintain a training set
 Weaknesses
- Computationally intensive
- Storage for training set
- Dependence on distance function and combination function
Smart Card and Biometrics: Useful reference
http://www.ibia.org/membersadmin/whitepapers/pdf/17/Ensuring%20integrity%20with%20fingerprint%20verification.pdf