Data Mining and Knowledge Discovery - Web


The Knowledge Discovery Process;
Data Preparation & Preprocessing
Bamshad Mobasher
DePaul University
The Knowledge Discovery Process
- The KDD Process
Steps in the KDD Process
- Learning the application domain
  - translate the business problem into a data mining problem
- Gathering and integrating data
- Cleaning and preprocessing data
  - may be the most resource-intensive part
- Reducing and selecting data
  - find useful features, dimensionality reduction, etc.
- Choosing the functions of data mining
  - summarization, classification, regression, association, clustering, etc.
- Choosing the mining algorithm(s)
- Data mining: discovering patterns of interest; constructing models
- Evaluating patterns and models
- Interpretation: analysis of results
  - visualization, alteration, removing redundant patterns, querying
- Use of discovered knowledge
Data Mining: What Kind of Data?
- Structured Databases
  - relational, object-relational, etc.
  - can use SQL to perform parts of the process, e.g.,
    SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category
Data Mining: What Kind of Data?
- Flat Files
  - most common data source
  - can be text (or HTML) or binary
  - may contain transactions, statistical data, measurements, etc.
- Transactional Databases
  - a set of records, each with a transaction id, a time stamp, and a set of items
  - may have an associated "description" file for the items
  - typical source of data used in market basket analysis
Data Mining: What Kind of Data?
- Other Types of Databases
  - legacy databases
  - multimedia databases (usually very high-dimensional)
  - spatial databases (containing geographical information, such as maps or satellite imaging data)
  - time series / temporal data (time-dependent information, such as stock market data; usually very dynamic)
- World Wide Web
  - basically a large, heterogeneous, distributed database
  - need for new or additional tools and techniques
    - information retrieval, filtering, and extraction
    - agents to assist in browsing and filtering
    - Web content, usage, and structure (linkage) mining tools
Data Mining: What Kind of Data?
- Data Warehouses
  - a data warehouse is a repository of data collected from multiple (often heterogeneous) data sources, so that analysis can be done under the same unified schema
  - data from the different sources are loaded, cleaned, transformed, and integrated
  - allows for interactive analysis and decision-making
  - usually modeled by a multi-dimensional data structure (data cube) to facilitate decision-making
    - aggregate values can be pre-computed and stored along many dimensions
    - each dimension of the data cube contains a hierarchy of values for one attribute
  - data cubes are well suited for fast interactive querying and analysis of data at different conceptual levels, known as On-Line Analytical Processing (OLAP)
    - OLAP operations allow navigation of the data at different levels of abstraction, such as drill-down, roll-up, slice, dice, etc.
DM Tasks: Classification
- Classifying observations/instances into different "given" classes (i.e., classification is "supervised")
  - example: classifying credit applicants as low, medium, or high risk
  - normally uses a training set where all objects are already associated with known class labels
  - the classification algorithm learns from the training set and builds a model
  - the model is used to classify new objects
- Suitable data mining tools
  - decision tree algorithms (CHAID, C5.0, and CART)
  - memory-based reasoning
  - neural networks
- Example (hypothetical video store)
  - build a model of users based on their rental history (returning on time, payments, etc.)
  - observe the current customer's rental and payment history
  - decide whether a deposit should be charged to the current customer
DM Tasks: Prediction
- Same as classification, except objects are classified according to some predicted or estimated future value
- In prediction, historical data is used to build a (predictive) model that explains the current observed behavior
  - the model can then be applied to new instances to predict future behavior or forecast the future value of some missing attribute
- Examples
  - predicting the size of a balance transfer if the prospect accepts the offer
  - predicting the load on a Web server in a particular time period
- Suitable data mining tools
  - association rule discovery
  - memory-based reasoning
  - decision trees
  - neural networks
DM Tasks: Affinity Grouping
- Determine which items often go together (usually in transactional databases)
- Often referred to as Market Basket Analysis
  - used in retail for planning shelf arrangement
  - used for identifying cross-selling opportunities
  - "should" be used to determine the best link structure for a Web site
- Examples
  - people who buy milk and beer also tend to buy diapers
  - people who access pages A and B are likely to place an online order
- Suitable data mining tools
  - association rule discovery
  - clustering
  - nearest-neighbor analysis (memory-based reasoning)
DM Tasks: Clustering
- Like classification, clustering is the organization of data into classes
  - however, the class labels are unknown, and it is up to the clustering algorithm to discover acceptable classes
  - also called unsupervised classification, because the classification is not dictated by given class labels
- There are many clustering approaches
  - all are based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity)
- Examples:
  - market segmentation of customers based on buying patterns and demographic attributes
  - grouping user transactions on a Web site based on their navigational patterns
Characterization & Discrimination
- Data characterization is a summarization of the general features of objects in a target class
  - the data relevant to the target class are retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction
  - example: characterize the video store's customers who regularly rent more than 30 movies a year
- Data discrimination compares the general features of objects between two classes
  - comparison of relevant features of objects between a target class and a contrasting class
  - example: compare the general characteristics of customers who rented more than 30 movies in the last year with those who rented fewer than 5
  - the techniques used for data discrimination are very similar to those used for data characterization, except that data discrimination results include comparative measures
Example: Moviegoer Database
Example: Moviegoer Database
SELECT moviegoers.name, moviegoers.sex, moviegoers.age,
sources.source, movies.name
FROM movies, sources, moviegoers
WHERE sources.source_ID = moviegoers.source_ID AND
movies.movie_ID = moviegoers.movie_ID
ORDER BY moviegoers.name;
moviegoers.name  sex  age  source      movies.name
Amy              f    27   Oberlin     Independence Day
Andrew           m    25   Oberlin     12 Monkeys
Andy             m    34   Oberlin     The Birdcage
Anne             f    30   Oberlin     Trainspotting
Ansje            f    25   Oberlin     I Shot Andy Warhol
Beth             f    30   Oberlin     Chain Reaction
Bob              m    51   Pinewoods   Schindler's List
Brian            m    23   Oberlin     Super Cop
Candy            f    29   Oberlin     Eddie
Cara             f    25   Oberlin     Phenomenon
Cathy            f    39   Mt. Auburn  The Birdcage
Charles          m    25   Oberlin     Kingpin
Curt             m    30   MRJ         T2 Judgment Day
David            m    40   MRJ         Independence Day
Erica            f    23   Mt. Auburn  Trainspotting
Example: Moviegoer Database
- Classification
  - determine sex based on age, source, and movies seen
  - determine source based on sex, age, and movies seen
  - determine most recent movie based on past movies, age, sex, and source
- Prediction or Estimation
  - for prediction, we need a continuous variable (e.g., "age")
  - predict age as a function of source, sex, and past movies
  - if we had a "rating" field for each moviegoer, we could predict the rating a new moviegoer gives to a movie based on age, sex, past movies, etc.
- Clustering
  - find groupings of movies that are often seen by the same people
  - find groupings of people that tend to see the same movies
  - clustering might reveal relationships that are not necessarily recorded in the data (e.g., we may find a cluster dominated by people with young children, or a cluster of movies that correspond to a particular genre)
Example: Moviegoer Database
- Affinity Grouping
  - market basket analysis (MBA): "which movies go together?"
  - need to create "transactions" for each moviegoer containing the movies seen by that moviegoer:

    name     TID   Transaction
    Amy      001   {Independence Day, Trainspotting}
    Andrew   002   {12 Monkeys, The Birdcage, Trainspotting, Phenomenon}
    Andy     003   {Super Cop, Independence Day, Kingpin}
    Anne     004   {Trainspotting, Schindler's List}
    ...      ...   ...

  - may result in association rules such as (see the sketch below):
    {"Phenomenon", "The Birdcage"} ==> {"Trainspotting"}
    {"Trainspotting", "The Birdcage"} ==> {sex = "f"}
- Sequence Analysis
  - similar to MBA, but the order in which items appear in the pattern is important
  - e.g., people who rent "The Birdcage" during one visit tend to rent "Trainspotting" on the next visit
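A minimal sketch (not from the slides) of how support and confidence could be computed for such rules over the transaction list above; the helper names are my own:

    # Computing support and confidence for a candidate association rule
    # over the movie "transactions" shown above.
    from typing import FrozenSet, List

    transactions: List[FrozenSet[str]] = [
        frozenset({"Independence Day", "Trainspotting"}),
        frozenset({"12 Monkeys", "The Birdcage", "Trainspotting", "Phenomenon"}),
        frozenset({"Super Cop", "Independence Day", "Kingpin"}),
        frozenset({"Trainspotting", "Schindler's List"}),
    ]

    def support(itemset: FrozenSet[str]) -> float:
        """Fraction of all transactions that contain the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str]) -> float:
        """Conditional probability of the consequent given the antecedent."""
        return support(antecedent | consequent) / support(antecedent)

    # Rule {Phenomenon, The Birdcage} ==> {Trainspotting}
    lhs = frozenset({"Phenomenon", "The Birdcage"})
    rhs = frozenset({"Trainspotting"})
    print(support(lhs | rhs))    # 0.25 (1 of the 4 transactions)
    print(confidence(lhs, rhs))  # 1.0  (every transaction with lhs also has rhs)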
The Knowledge Discovery Process
- The KDD Process
Data Preprocessing
- Why do we need to prepare the data?
  - In real-world applications, data can be inconsistent, incomplete, and/or noisy:
    - data entry, data transmission, or data collection problems
    - discrepancies in naming conventions
    - duplicated records
    - incomplete data
    - contradictions in data
- What happens when the data cannot be trusted?
  - Can the decisions be trusted? Decision making is jeopardized.
- There is a better chance to discover useful knowledge when the data is clean.
Data Preprocessing
- Data Cleaning
- Data Integration
- Data Transformation (e.g., -2, 32, 100, 59, 48  ->  -0.02, 0.32, 1.00, 0.59, 0.48)
- Data Reduction
Data Cleaning
- Real-world application data can be incomplete, noisy, and inconsistent
  - no recorded values for some attributes
  - not considered at time of entry
  - random errors
  - irrelevant records or fields
- Data cleaning attempts to:
  - fill in missing values
  - smooth out noisy data
  - correct inconsistencies
  - remove irrelevant data
Dealing with Missing Values
- Data is not always available (missing attribute values in records)
  - equipment malfunction
  - deleted due to inconsistency or misunderstanding
  - not considered important at the time of data gathering
- Handling missing data (see the sketch below):
  - ignore the record with missing values;
  - fill in the missing values manually;
  - use a global constant to fill in missing values (NULL, unknown, etc.);
  - use the attribute mean to fill in missing values of that attribute;
  - use the attribute mean of all samples belonging to the same class to fill in the missing values;
  - infer the most probable value to fill in the missing value
    - may need to use methods such as Bayesian classification or decision trees to automatically infer missing attribute values
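A minimal sketch (not from the slides) of the two mean-based strategies above; the records and field names are made up for illustration:

    # Filling missing values with the overall mean or the class-conditional mean.
    records = [
        {"age": 27, "class": "low"},
        {"age": None, "class": "low"},   # missing value
        {"age": 51, "class": "high"},
        {"age": 45, "class": "high"},
    ]

    def mean(values):
        return sum(values) / len(values)

    # Strategy 1: the overall attribute mean.
    known = [r["age"] for r in records if r["age"] is not None]
    overall_mean = mean(known)

    # Strategy 2: the mean over records of the same class.
    def class_mean(label):
        vals = [r["age"] for r in records
                if r["age"] is not None and r["class"] == label]
        return mean(vals)

    for r in records:
        if r["age"] is None:
            r["age"] = class_mean(r["class"])  # or: overall_mean

    print(records)  # the missing age becomes 27.0 (mean of the "low" class)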
Smoothing Noisy Data
- The purpose of data smoothing is to eliminate noise (see the sketch below).

Original data for "price" (after sorting): 4, 8, 15, 21, 21, 24, 25, 28, 34

Binning - partition into equidepth bins:
  Bin1: 4, 8, 15
  Bin2: 21, 21, 24
  Bin3: 25, 28, 34

Smoothing by bin means (each value in a bin is replaced by the mean value of the bin):
  Bin1: 9, 9, 9
  Bin2: 22, 22, 22
  Bin3: 29, 29, 29

Smoothing by bin boundaries (the min and max values in each bin are identified as
boundaries; each value in a bin is replaced with the closest boundary value):
  Bin1: 4, 4, 15
  Bin2: 21, 21, 24
  Bin3: 25, 25, 34
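A minimal sketch (not from the slides) of equidepth binning with both smoothing methods, reproducing the "price" example above:

    # Equidepth binning; smoothing by bin means and by bin boundaries.
    prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
    bin_size = 3
    bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

    def smooth_by_means(b):
        m = round(sum(b) / len(b))  # round to the nearest integer, as on the slide
        return [m] * len(b)

    def smooth_by_boundaries(b):
        lo, hi = min(b), max(b)
        # replace each value with whichever boundary is closer (ties go to the min)
        return [lo if v - lo <= hi - v else hi for v in b]

    print([smooth_by_means(b) for b in bins])
    # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
    print([smooth_by_boundaries(b) for b in bins])
    # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]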
Smoothing Noisy Data
- Other Methods
  - Clustering: similar values are organized into groups (clusters); values falling outside of the clusters may be considered "outliers" and may be candidates for elimination.
  - Regression: fit the data to a function. Linear regression finds the best line to fit two variables; multiple regression can handle multiple variables. The values given by the function are used instead of the original values.
Smoothing Noisy Data - Example
We want to smooth "Temperature" by bin means, with bins of size 3:
1. First sort the values of the attribute (keep track of the ID or key so that the
   transformed values can be replaced in the original table).
2. Divide the data into bins of size 3 (or less, in the case of the last bin).
3. Convert the values in each bin to the mean value for that bin.
4. Put the resulting values back into the original table.

Original table:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Sorted by Temperature and divided into bins of size 3:

ID  Temperature  Bin
7   58           Bin1
6   65           Bin1
5   68           Bin1
9   69           Bin2
4   70           Bin2
10  71           Bin2
8   72           Bin3
12  73           Bin3
11  75           Bin3
14  75           Bin4
2   80           Bin4
13  81           Bin4
3   83           Bin5
1   85           Bin5
Smoothing Noisy Data - Example
The value of every record in each bin is changed to the mean value for that bin. If it
is necessary to keep the values as integers, the means are rounded to the nearest integer.

ID  Temperature (sorted)  Temperature (smoothed)  Bin
7   58                    64                      Bin1
6   65                    64                      Bin1
5   68                    64                      Bin1
9   69                    70                      Bin2
4   70                    70                      Bin2
10  71                    70                      Bin2
8   72                    73                      Bin3
12  73                    73                      Bin3
11  75                    73                      Bin3
14  75                    79                      Bin4
2   80                    79                      Bin4
13  81                    79                      Bin4
3   83                    84                      Bin5
1   85                    84                      Bin5
Smoothing Noisy Data - Example
The final table with the new values for the Temperature attribute:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     84           85        FALSE
2   sunny     79           90        TRUE
3   overcast  84           78        FALSE
4   rain      70           96        FALSE
5   rain      64           80        FALSE
6   rain      64           70        TRUE
7   overcast  64           65        TRUE
8   sunny     73           95        FALSE
9   sunny     70           70        FALSE
10  rain      70           80        FALSE
11  sunny     73           70        TRUE
12  overcast  73           90        TRUE
13  overcast  79           75        FALSE
14  rain      79           80        TRUE
Data Integration
- Data analysis may require combining data from multiple sources into a coherent data store
- Challenges in Data Integration:
  - schema integration: CID = C_number = Cust-id = cust#
  - semantic heterogeneity
  - data value conflicts (different representations or scales, etc.)
  - synchronization (especially important in Web usage mining)
  - redundant attributes (an attribute is redundant if it can be derived from other attributes); redundancies may be identified via correlation analysis (see the sketch below):
    corr(A, B) = Pr(A, B) / (Pr(A) * Pr(B))
      = 1: independent
      > 1: positive correlation
      < 1: negative correlation
- Meta-data is often necessary for successful data integration
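A minimal sketch (not from the slides) of the Pr(A,B) / (Pr(A) * Pr(B)) measure for two binary attributes; the data is made up:

    # Correlation analysis over (A, B) indicator pairs.
    records = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
    n = len(records)

    p_a = sum(a for a, _ in records) / n          # Pr(A)
    p_b = sum(b for _, b in records) / n          # Pr(B)
    p_ab = sum(a and b for a, b in records) / n   # Pr(A, B)

    print(p_ab / (p_a * p_b))
    # 1.33 > 1 here: A and B co-occur more often than if they were independent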
Normalization
- Min-max normalization: linear transformation from v to v'
  - v' = [(v - min) / (max - min)] * (new_max - new_min) + new_min
  - example: transform $30,000 in the range [10,000..45,000] into [0..1]:
    [(30,000 - 10,000) / 35,000] * (1 - 0) + 0 = 0.571
- z-score normalization: normalization of v into v' based on the attribute's mean and standard deviation
  - v' = (v - Mean) / StandardDeviation
- Normalization by decimal scaling
  - moves the decimal point of v by j positions, where j is the minimum number of positions such that the absolute maximum value falls in [0..1]
  - v' = v / 10^j
  - example: if v ranges between -56 and 9,976, then j = 4, and v' ranges between -0.0056 and 0.9976

A sketch of all three methods follows below.
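A minimal sketch (not from the slides) of the three normalization methods, reproducing the numbers used on these slides:

    # Min-max, z-score, and decimal-scaling normalization.
    import math

    def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
        return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

    def z_score(v, mean, stdev):
        return (v - mean) / stdev

    def decimal_scaling(values):
        # smallest j such that max(|v|) / 10^j falls in [0..1]
        j = math.ceil(math.log10(max(abs(v) for v in values)))
        return [v / 10 ** j for v in values]

    print(min_max(30000, 10000, 45000))       # 0.571...
    print(round(z_score(85, 80.3, 9.84), 2))  # 0.48 (first Humidity value below)
    print(decimal_scaling([-56, 9976]))       # [-0.0056, 0.9976]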
Normalization: Example
- z-score normalization: v' = (v - Mean) / Stdev
- Example: normalizing the "Humidity" attribute (Mean = 80.3, Stdev = 9.84):

Humidity  Humidity (normalized)
85         0.48
90         0.99
78        -0.23
96         1.60
80        -0.03
70        -1.05
65        -1.55
95         1.49
70        -1.05
80        -0.03
70        -1.05
90         0.99
75        -0.54
80        -0.03
Normalization: Example II
- Min-max normalization on an employee database
  - max distance for Salary: 100,000 - 19,000 = 81,000
  - max distance for Age: 52 - 27 = 25
  - new min for Age and Salary = 0; new max for Age and Salary = 1

ID  Gender  Age  Salary     Gender'  Age'  Salary'
1   F       27   19,000     1        0.00  0.00
2   M       51   64,000     0        0.96  0.56
3   M       52   100,000    0        1.00  1.00
4   F       33   55,000     1        0.24  0.44
5   M       45   45,000     0        0.72  0.32
Data Reduction
- Data is often too large; reducing the data can improve performance
- Data reduction consists of reducing the representation of the data set while producing the same (or almost the same) results
- Data reduction includes:
  - data cube aggregation
  - dimensionality reduction
  - discretization
  - numerosity reduction
    - regression
    - histograms
    - clustering
    - sampling
Data Cube Aggregation
- Reduce the data to the concept level needed in the analysis
  - use the smallest (most detailed) level necessary to solve the problem
- Queries regarding aggregated information should be answered using the data cube, when possible
Dimensionality Reduction
- Feature selection (i.e., attribute subset selection):
  - select only the necessary attributes
  - the goal is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes
  - there is an exponential number of possible subsets, so heuristics are used: select the locally "best" (or most pertinent) attribute, e.g., using information gain (see the sketch below)
    - step-wise forward selection: {} -> {A1} -> {A1, A3} -> {A1, A3, A5}
    - step-wise backward elimination: {A1, A2, A3, A4, A5} -> {A1, A3, A4, A5} -> {A1, A3, A5}
    - combining forward selection and backward elimination
    - decision-tree induction
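A minimal sketch (not from the slides) of step-wise forward selection. The scoring function is a made-up placeholder; in practice it might be information gain or the cross-validated accuracy of a model built on those attributes:

    # Greedy forward selection of up to k attributes.
    from typing import Callable, FrozenSet, Set

    def forward_selection(attributes: Set[str],
                          score_subset: Callable[[FrozenSet[str]], float],
                          k: int) -> FrozenSet[str]:
        """At each step, add the attribute that most improves the score."""
        selected: FrozenSet[str] = frozenset()
        while len(selected) < k:
            candidates = attributes - selected
            best = max(candidates, key=lambda a: score_subset(selected | {a}))
            if score_subset(selected | {best}) <= score_subset(selected):
                break  # no candidate improves the score; stop early
            selected = selected | {best}
        return selected

    # Toy scoring function (hypothetical) that happens to prefer A1, A3, A5.
    weights = {"A1": 3.0, "A2": 0.1, "A3": 2.0, "A4": 0.2, "A5": 1.0}
    score = lambda s: sum(weights[a] for a in s)
    print(forward_selection({"A1", "A2", "A3", "A4", "A5"}, score, 3))
    # frozenset({'A1', 'A3', 'A5'}) -- matching the slide's example sequence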
Decision Tree Induction
Numerosity Reduction
- Reduction via histograms:
  - divide the data into buckets and store a representation of each bucket (sum, count, etc.)
- Reduction via clustering:
  - partition the data into clusters based on "closeness" in space
  - retain representatives of the clusters (centroids) and the outliers
- Reduction via sampling:
  - will the patterns in the sample represent the patterns in the data?
  - random sampling can produce poor results
  - stratified sampling (a stratum = a group based on an attribute value) can help
Sampling Techniques
[Figure: raw data vs. a cluster/stratified sample of the raw data]
Discretization
- Three types of attributes:
  - nominal: values from an unordered set (also called "categorical" attributes)
  - ordinal: values from an ordered set
  - continuous: real numbers (but sometimes also integer values)
- Discretization is used to reduce the number of values for a given continuous attribute
  - usually done by dividing the range of the attribute into intervals
  - interval labels are then used to replace the actual data values
- Some data mining algorithms only accept categorical attributes and cannot handle a range of continuous attribute values
- Discretization can also be used to generate concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (e.g., numeric values for "age") with higher-level concepts (e.g., "young", "middle-aged", "old")
Discretization - Example
Example: discretizing the "Humidity" attribute using 3 bins
(Low = 60-69, Normal = 70-79, High = 80+); see the sketch below.

Humidity  Humidity (discretized)
85        High
90        High
78        Normal
96        High
80        High
70        Normal
65        Low
95        High
70        Normal
80        High
70        Normal
90        High
75        Normal
80        High
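A minimal sketch (not from the slides) of this discretization:

    # Mapping Humidity values to the three interval labels used above.
    humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]

    def discretize(v: int) -> str:
        if v < 70:
            return "Low"     # 60-69
        if v < 80:
            return "Normal"  # 70-79
        return "High"        # 80+

    print([discretize(v) for v in humidity])
    # ['High', 'High', 'Normal', 'High', 'High', 'Normal', 'Low', ...]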
Converting Categorical Attributes to Numerical Attributes
Create separate columns for each value of a categorical attribute (e.g., 3 values for
the Outlook attribute and 2 values for the Windy attribute). There is no change to the
numerical attributes. A sketch of this transformation follows below.

Attributes:
  Outlook (overcast, rain, sunny)
  Temperature (real)
  Humidity (real)
  Windy (true, false)

Original table:

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

Standard Spreadsheet Format:

Outlook=overcast  Outlook=rain  Outlook=sunny  Temp  Humidity  Windy=TRUE  Windy=FALSE
0                 0             1              85    85        0           1
0                 0             1              80    90        1           0
1                 0             0              83    78        0           1
0                 1             0              70    96        0           1
0                 1             0              68    80        0           1
0                 1             0              65    70        1           0
1                 0             0              64    65        1           0
...               ...           ...            ...   ...       ...         ...
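A minimal sketch (not from the slides) of this encoding; the record layout is my own:

    # Converting categorical Outlook and Windy into indicator (0/1) columns,
    # leaving the numeric attributes unchanged.
    rows = [
        {"Outlook": "sunny", "Temperature": 85, "Humidity": 85, "Windy": False},
        {"Outlook": "overcast", "Temperature": 83, "Humidity": 78, "Windy": False},
        {"Outlook": "rain", "Temperature": 70, "Humidity": 96, "Windy": False},
    ]

    outlook_values = ["overcast", "rain", "sunny"]

    def one_hot(row):
        encoded = {f"Outlook={v}": int(row["Outlook"] == v) for v in outlook_values}
        encoded["Temp"] = row["Temperature"]
        encoded["Humidity"] = row["Humidity"]
        encoded["Windy=TRUE"] = int(row["Windy"])
        encoded["Windy=FALSE"] = int(not row["Windy"])
        return encoded

    for r in rows:
        print(one_hot(r))
    # {'Outlook=overcast': 0, 'Outlook=rain': 0, 'Outlook=sunny': 1, 'Temp': 85, ...}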
Visualizing Patterns
Example: Cross Tabulation

ID  Outlook   Temperature  Humidity  Windy
1   sunny     85           85        FALSE
2   sunny     80           90        TRUE
3   overcast  83           78        FALSE
4   rain      70           96        FALSE
5   rain      68           80        FALSE
6   rain      65           70        TRUE
7   overcast  58           65        TRUE
8   sunny     72           95        FALSE
9   sunny     69           70        FALSE
10  rain      71           80        FALSE
11  sunny     75           70        TRUE
12  overcast  73           90        TRUE
13  overcast  81           75        FALSE
14  rain      75           80        TRUE

           Outlook = sunny  Outlook = rain  Outlook = overcast
Windy      2                2               2
Not Windy  3                3               2

[Bar chart: counts of Windy vs. Not Windy days for each Outlook value]
Evaluating Models
- To train and evaluate models, the data are often divided into three sets: the training set, the test set, and the evaluation set
- Training Set
  - used to build the initial model
  - may need to "enrich the data" to get enough of the special cases
- Test Set
  - used to adjust the initial model
  - models can be tweaked to be less dependent on idiosyncrasies of the training data and adapted into a more general model
  - the idea is to prevent "over-training" (i.e., finding patterns where none exist)
- Evaluation Set
  - used to evaluate the model's performance
Training Sets
- The training set is used to train the models
  - most important consideration: it needs to cover the full range of values for all the features that the model might encounter
  - it is a good idea to have several examples for each value of a categorical feature and for a range of values of each numerical feature
- Data Enrichment
  - the training set should have a sufficient number of examples of rare events
  - a random sample of the available data is not sufficient, since common examples will swamp the rare ones
  - the training set needs to over-sample the rare cases so that the model can learn the features of rare events instead of predicting everything to be the common outcome
Test and Evaluation Sets
- Reading too much into the training set (overfitting)
  - a common problem with most data mining algorithms
  - the resulting model works well on the training set but performs miserably on unseen data
  - the test set can be used to "tweak" the initial model and to remove unnecessary inputs or features
- Evaluation Set for Final Performance Evaluation
  - the test set cannot be used to evaluate model performance on future unseen data
  - the error rate on the evaluation set is a good predictor of the error rate on unseen data
- Insufficient data to divide into three disjoint sets?
  - in such cases, validation techniques can play a major role:
    - cross validation
    - bootstrap validation
Cross Validation
- Cross validation is a heuristic that works as follows (see the sketch below):
  - randomly divide the data into n folds, each with approximately the same number of records
  - create n models using the same algorithm and training parameters; each model is trained on n-1 folds of the data and tested on the remaining fold
  - can be used to find the best algorithm and its optimal training parameters
- Steps in Cross Validation
  1. Divide the available data into a training set and an evaluation set
  2. Split the training data into n folds
  3. Select an architecture (algorithm) and training parameters
  4. Train and test n models
  5. Repeat steps 2 to 4 using different architectures and parameters
  6. Select the best model
  7. Use all the training data to train the model
  8. Assess the final model using the evaluation set
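A minimal sketch (not from the slides) of n-fold cross validation; train_and_test is a placeholder for "train a model on these records, return its error on those records":

    # n-fold cross validation returning the average test-fold error.
    import random
    from typing import Callable, List, Sequence

    def cross_validate(data: Sequence, n: int,
                       train_and_test: Callable[[List, List], float]) -> float:
        shuffled = list(data)
        random.shuffle(shuffled)
        folds = [shuffled[i::n] for i in range(n)]  # n roughly equal folds
        errors = []
        for i in range(n):
            test_fold = folds[i]
            train = [r for j, f in enumerate(folds) if j != i for r in f]
            errors.append(train_and_test(train, test_fold))
        return sum(errors) / n

    # Usage with a dummy learner that always reports an error of 0.2:
    print(cross_validate(range(100), n=10, train_and_test=lambda tr, te: 0.2))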
Bootstrap Validation
- Based on the statistical procedure of sampling with replacement (see the sketch below)
  - a data set of n instances is sampled n times (with replacement) to give another data set of n instances
  - since some elements will be repeated, there will be elements in the original data set that are never picked
  - these remaining instances are used as the test set
- How many instances end up in the test set?
  - probability of not getting picked in one sampling = 1 - 1/n
  - Pr(not getting picked in n samplings) = (1 - 1/n)^n ≈ e^(-1) ≈ 0.368
  - so, for a large data set, the test set will contain about 36.8% of the instances
  - to compensate for the smaller training sample (63.2%), the test-set error rate is combined with the re-substitution error on the training set:
    e = 0.632 * e_test + 0.368 * e_training
- Bootstrap validation increases the variance that can occur in each fold
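A minimal sketch (not from the slides) of one bootstrap validation round, with train_and_test again a placeholder returning an error rate:

    # One round of bootstrap validation with the 0.632/0.368 combination above.
    import random
    from typing import Callable, List, Sequence

    def bootstrap_validate(data: Sequence,
                           train_and_test: Callable[[List, List], float]) -> float:
        n = len(data)
        idx = [random.randrange(n) for _ in range(n)]  # n samples with replacement
        train = [data[i] for i in idx]
        picked = set(idx)
        test = [data[i] for i in range(n) if i not in picked]  # the ~36.8% never picked
        e_test = train_and_test(train, test)        # error on unseen instances
        e_training = train_and_test(train, train)   # re-substitution error
        return 0.632 * e_test + 0.368 * e_training

    data = list(range(1000))
    print(bootstrap_validate(data, lambda tr, te: 0.2))  # 0.2 with this dummy learner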
Measuring Effectiveness
- Predictive models are measured based on the accuracy of their predictions on unseen data
- Classification and Prediction Tasks
  - accuracy is measured in terms of the error rate (usually the % of records classified incorrectly)
  - the error rate on a pre-classified evaluation set estimates the real error rate
  - can also use the cross-validation methods discussed before
- Estimation Effectiveness
  - the difference between the predicted scores and the actual results (from the evaluation set)
  - the accuracy of the model is measured in terms of the variance (i.e., the average of the squared differences)
  - to express this in the original units, compute the standard deviation (i.e., the square root of the variance):

    e = sqrt( [ (p_1 - a_1)^2 + ... + (p_n - a_n)^2 ] / n )
Ordinal or Nominal Outputs
- When the output field is ordinal or nominal (e.g., in two-class prediction), we use the classification table, the so-called confusion matrix, to evaluate the resulting model
- Example (see the sketch below):

            Predicted T  Predicted F  Total
  Actual T  18           3            21
  Actual F  2            15           17
  Total     20           18           38

- Overall correct classification rate = (18 + 15) / 38 = 87%
- Given a prediction of T, correct classification rate = 18 / 20 = 90%
- Given a prediction of F, correct classification rate = 15 / 18 = 83%
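A minimal sketch (not from the slides) computing these rates from the matrix above, with rows as the actual class and columns as the predicted class:

    # Overall and per-prediction correct classification rates.
    matrix = {("T", "T"): 18, ("T", "F"): 3,   # (actual, predicted) -> count
              ("F", "T"): 2,  ("F", "F"): 15}

    total = sum(matrix.values())
    correct = matrix[("T", "T")] + matrix[("F", "F")]
    print(correct / total)  # 0.868... ~ 87% overall

    for predicted in ("T", "F"):
        col_total = matrix[("T", predicted)] + matrix[("F", predicted)]
        print(predicted, matrix[(predicted, predicted)] / col_total)
        # T 0.9 (18/20), F 0.833... (15/18)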
Measuring Effectiveness
- Market Basket Analysis
  - MBA may be used for estimation or prediction (e.g., "people who buy milk and diapers tend to also buy beer")
  - confidence: the percentage of the time "beer" occurs in transactions that contain "milk" and "diapers" (i.e., a conditional probability)
  - support: the percentage of the time "milk", "diapers", and "beer" occur together in the whole training set (i.e., a prior probability)
- Distance-Based Techniques
  - clustering and memory-based reasoning: a measure of distance is used to evaluate the "closeness" or "similarity" of a point to cluster centers or a neighborhood
  - regression: the line of best fit minimizes the sum of the distances between the line and the observations
  - for numeric variables, the Euclidean distance measure is often used (the square root of the sum of the squares of the differences along each dimension)
Measuring Effectiveness: Lift Charts
- Usually used for classification, but can be adapted to other methods
- Measures the change in the conditional probability of a target class when going from the general population (the full test set) to a biased sample:

  lift = Pr(class_t | sample) / Pr(class_t | population)

- Example (see the sketch below):
  - suppose the expected response rate to a direct mailing campaign is 5% in the training set
  - use a classifier to assign a "yes" or "no" value to the target class "predicted to respond"
  - the "yes" group will contain a higher proportion of actual responders than the test set
  - suppose the "yes" group (our biased sample) contains 50% actual responders
  - this gives a lift of 0.5 / 0.05 = 10
- What if the lift sample is too small?
  - need to increase the sample size
  - there is a trade-off between lift and sample size

[Chart: number of respondents and lift as a function of sample size]
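A minimal sketch (not from the slides) computing the lift for the mailing example; the (predicted, responded) pairs are made up to match the 5% and 50% rates above:

    # Lift = response rate in the classifier's "yes" group / overall response rate.
    population = [("yes", True)] * 5 + [("yes", False)] * 5 + [("no", False)] * 90

    overall_rate = sum(resp for _, resp in population) / len(population)  # 0.05
    yes_group = [resp for pred, resp in population if pred == "yes"]
    sample_rate = sum(yes_group) / len(yes_group)                         # 0.5

    print(sample_rate / overall_rate)  # 10.0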