Transcript MYRAW data

DSCI 4520/5240
DATA MINING
MYRAW
Nonprofit donor data
MYRAW Overview
Determine who is likely to donate to
a non-profit organization campaign
and target them for donation
solicitation
Lecture 2 - 1
The Scenario
A nonprofit organization wants to send greeting cards to
lapsed donors to encourage them to make a new
donation. The organization wants to build a model to
predict those most likely to donate. The types of
information available include
• personal information such as age, gender, and income
• past donation information such as average gift and
time since first donation
• census tract information based on the donor’s address,
such as the percent of households with at least one
member who works for the federal, state, or local
government.
Variable roles
The first several variables (Age
through FirstTime) have the
measurement level interval
because they are numeric in the
data set and have more than 20
distinct levels. The model role for
all interval variables is set to
input by default.
The variables Gender and
Homeowner have the
measurement level binary
because they have only two
different nonmissing levels. The
model role for all binary
variables is set to input by
default.
The variable IDCode is listed as a
nominal variable because it is a
character variable with more
than two nonmissing levels.
Furthermore, because it is
nominal and the number of
distinct values is greater than 20,
the IDCode variable has the
model role ID. If IDCode had
been stored as a number, it would
have been assigned an interval
measurement level and an input
model role.
The variable Income is listed as a
nominal variable because it is a
numeric variable with more than
two but no more than ten distinct
levels. All nominal variables are
set to have the input model role.
You will change its measurement level to
ordinal, because the income levels are ordered.
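
The default assignments described above can be sketched as a small rule function. This is only an illustration in Python, not the software's actual logic: the function name infer_role and the parameter class_threshold are hypothetical. The text gives 20 as the cutoff for interval and ID variables and ten for numeric nominal variables; the sketch assumes a single cutoff of 20 for simplicity. The "unary"/"rejected" branch for a column with fewer than two nonmissing levels is also an assumption.

```python
def infer_role(values, class_threshold=20):
    """Guess (measurement level, model role) for one column.

    `values` is a list of raw values; None marks a missing value.
    `class_threshold` is an assumed single cutoff on distinct levels.
    """
    levels = {v for v in values if v is not None}
    n_levels = len(levels)
    is_numeric = bool(levels) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in levels
    )

    if n_levels < 2:
        # Assumption: fewer than two nonmissing levels means no usable signal as coded.
        return ("unary", "rejected")
    if n_levels == 2:
        # Two nonmissing levels, e.g. Gender or Homeowner.
        return ("binary", "input")
    if is_numeric:
        # Many distinct numeric values -> interval; few -> nominal.
        level = "interval" if n_levels > class_threshold else "nominal"
        return (level, "input")
    # Character column with more than two nonmissing levels, e.g. IDCode.
    role = "id" if n_levels > class_threshold else "input"
    return ("nominal", role)


# Hypothetical columns shaped like the ones discussed in the text.
age = list(range(18, 90))
gender = ["F", "M", "F", None]
idcode = ["ID%04d" % i for i in range(50)]
income = [1, 2, 3, 4, 5, 6, 7, None]

age_role = infer_role(age)
gender_role = infer_role(gender)
idcode_role = infer_role(idcode)
income_role = infer_role(income)
```

Storing the same 50 identifiers as numbers would send them down the numeric branch and return interval/input, which is the behavior the text describes for IDCode.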
The variables Pets and PCOwner do
contain useful information; it is
only the way they are coded
that makes them seem useless.
Both variables contain the value
Y for a person if the person has
that condition (pet owner for
Pets, computer owner for
PCOwner) and a missing value
otherwise. Decision trees handle
missing values directly, so no
data modification needs to be
done for fitting a decision tree;
however, neural networks and
regression models ignore any
observation with a missing value,
so you will need to recode these
variables to get at the desired
information. For example, you
can recode the missing values as
U, for unknown. You will do this
later using the Impute node.