Chapter 17
Preparing Data for Mining
Introduction
• Just as manufacturing and refining are
about transformation of raw materials into
finished products, so too with data to be
used for data mining
• ECTL – extract, clean, transform, load –
is the process/methodology for preparing
data for data mining
• The goal: ideal DM environment (Ch 16)
What the Data Should Look Like
• All data mining algorithms want their input
in tabular form – rows & columns as in a
spreadsheet or database table
What the Data Should Look Like
• Customer Signature
– Continuous “snapshot” of customer behavior
– Each row represents the customer and whatever
might be useful for data mining
What the Data Should Look Like
• The columns
– Contain data that describe aspects of the
customer (e.g., sales $ and quantity for each
of product A, B, C)
– Contain the results of calculations referred to
as derived variables (e.g., total sales $)
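As a rough illustration of the columns described above, the sketch below rolls transaction records up into one row per customer, with sales $ and quantity columns for each of products A, B, and C plus a derived total sales $ column. The record layout, the column names (`sales_A`, `qty_A`, `total_sales`), and the sample values are assumptions made for this example, not something the slides specify.

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, product, sales_dollars, quantity)
TRANSACTIONS = [
    ("c1", "A", 10.0, 1),
    ("c1", "B", 5.0, 2),
    ("c2", "A", 20.0, 2),
    ("c1", "A", 7.5, 1),
]
PRODUCTS = ["A", "B", "C"]


def build_signature(transactions, products):
    """Return {customer_id: row}, one row of columns per customer."""
    sales = defaultdict(float)  # (customer, product) -> summed sales $
    qty = defaultdict(int)      # (customer, product) -> summed quantity
    customers = []
    for customer_id, product, dollars, quantity in transactions:
        if customer_id not in customers:
            customers.append(customer_id)
        sales[(customer_id, product)] += dollars
        qty[(customer_id, product)] += quantity

    signature = {}
    for customer_id in customers:
        row = {}
        for p in products:
            # Columns that describe an aspect of the customer
            row[f"sales_{p}"] = sales.get((customer_id, p), 0.0)
            row[f"qty_{p}"] = qty.get((customer_id, p), 0)
        # Derived variable: the result of a calculation on other columns
        row["total_sales"] = sum(row[f"sales_{p}"] for p in products)
        signature[customer_id] = row
    return signature


SIGNATURE = build_signature(TRANSACTIONS, PRODUCTS)
```

Every customer gets the same set of columns, including zeros for products never bought, so the result stays in the tabular form that data mining algorithms expect.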
What the Data Should Look Like
1. Columns with One Value – often not very useful
2. Columns with Almost Only One Value
3. Columns with Unique Values
4. Columns Correlated with Target Variable (synonyms
of the target variable)
What the Data Should Look Like
• Columns have important Model Roles in
Data Mining:
– Input columns – input into the model
– Target column(s) – used only for predictive
models – the values the model is built to
predict
– Ignored columns – not used in a particular
data mining analysis
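One way to picture the three model roles is as a simple mapping from column name to role that is applied before any modeling. The sketch below is only an illustration: the column names, the `churned` target, and the role labels are hypothetical.

```python
# Hypothetical role assignment for a small customer table
ROLES = {
    "customer_id": "ignored",  # key, not useful as a model input
    "total_sales": "input",
    "qty_A": "input",
    "churned": "target",       # what a predictive model would learn to predict
}

ROWS = [
    {"customer_id": "c1", "total_sales": 22.5, "qty_A": 2, "churned": 0},
    {"customer_id": "c2", "total_sales": 20.0, "qty_A": 2, "churned": 1},
]


def split_by_role(rows, roles):
    """Separate input values from target values; ignored columns are dropped."""
    input_cols = [c for c, r in roles.items() if r == "input"]
    target_cols = [c for c, r in roles.items() if r == "target"]
    inputs = [[row[c] for c in input_cols] for row in rows]
    targets = [[row[c] for c in target_cols] for row in rows]
    return input_cols, target_cols, inputs, targets


INPUT_COLS, TARGET_COLS, INPUTS, TARGETS = split_by_role(ROWS, ROLES)
```

Keeping the roles in one mapping makes it easy to reuse the same table for a different analysis by changing which column is the target and which are ignored.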
What the Data Should Look Like
• Variable Measures
– Categorical variables (e.g., CA, AZ, UT…)
– Ordered variables (e.g., course grades)
– Interval variables (e.g., temperatures)
– True numeric variables (e.g., money)
• Dates & Times
• Fixed-Length Character Strings (e.g., Zip Codes)
• IDs and Keys – used for linkage to other data in other
tables
• Names (e.g., Company Names)
• Addresses
• Free Text (e.g., annotations, comments, memos, email)
• Binary Data (e.g., audio, images)
What the Data Should Look Like
• Data Format Expectations for Data Mining
– All data in a single table (rows/columns)
– Each row corresponds to an entity (customer)
– Columns with a single value should be ignored
– Columns with unique values should be ignored
– Target column identified for predictive DM
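The column expectations above can be checked mechanically before modeling. The following is a minimal sketch, assuming each column is available as a plain list of values. The function name `profile_column` and the 99% cutoff for "almost only one value" are assumptions chosen for illustration, not thresholds taken from the slides.

```python
from collections import Counter


def profile_column(values, almost_threshold=0.99):
    """Classify a column as 'single value', 'unique values',
    'almost one value', or 'ok'. The threshold is an assumed cutoff."""
    n = len(values)
    if n == 0:
        return "empty"
    counts = Counter(values)
    if len(counts) == 1:
        return "single value"      # candidate to ignore
    if len(counts) == n:
        return "unique values"     # e.g., IDs; candidate to ignore
    top_share = counts.most_common(1)[0][1] / n
    if top_share >= almost_threshold:
        return "almost one value"  # worth a closer look
    return "ok"


# Hypothetical columns
COLUMNS = {
    "country": ["US"] * 5,
    "customer_id": [1, 2, 3, 4, 5],
    "fraud_flag": ["N"] * 199 + ["Y"],
    "state": ["CA", "AZ", "CA", "UT"],
}
REPORT = {name: profile_column(vals) for name, vals in COLUMNS.items()}
```

A column flagged as "almost one value" is not automatically useless: when the rare value is the thing being predicted, as with the hypothetical `fraud_flag` here, it may be the most important column in the table.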
Constructing the Customer Signature
Typical Customer Model
• 3 different definitions of Customer
The “Dark Side” of Data
• Missing values (nulls = empty or
something else)
• Dirty data (erroneous zip codes, etc.)
• Inconsistent values (different revisions)
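To make the first two problems concrete, the sketch below separates missing zip codes from dirty ones. It assumes US-style 5-digit or ZIP+4 codes stored as strings. The set of missing-value markers and the function name `clean_zip` are illustrative assumptions; real sources encode nulls in their own ways.

```python
import re

# Assumed format: 5 digits, optionally followed by a 4-digit extension
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")

# Assumed markers that stand in for "no value" in the source data
MISSING_MARKERS = {"", "NULL", "N/A", "NA", "?"}


def clean_zip(raw):
    """Return (value, status), where status is 'ok', 'missing', or 'dirty'."""
    if raw is None:
        return None, "missing"
    text = str(raw).strip()
    if text.upper() in MISSING_MARKERS:
        return None, "missing"   # null stored as "something else"
    if ZIP_PATTERN.fullmatch(text):
        return text, "ok"        # kept as a string so leading zeros survive
    return None, "dirty"         # present but erroneous


RAW_ZIPS = ["02139", " n/a ", "1234", None, "12345-6789"]
CLEANED = [clean_zip(z) for z in RAW_ZIPS]
```

Returning a status alongside the value keeps "missing" and "dirty" distinct, so each can be counted and handled separately instead of being collapsed into a single null.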
Conclusion
• Lots to think about and take action on for
Preparing Data for Mining
• Remember a process/methodology is
needed, which includes ECTL (extract, clean,
transform, load)
• Remember the Data Mining Group Skills
needed
End of Chapter 17