Data Mining Chapter 1

Data Mining – Input: Concepts,
instances, attributes
Chapter 2
• Thing to be learned
– Ignore any philosophy about what a concept is
– Need description that is
• Intelligible – can be understood, and thus can be argued / discussed as
to its validity by humans
• Operational – it can be applied to future examples
• How the concept is expressed is the “concept description”
• Concept may differ based on different styles of
learning … classification, association, clustering,
numeric prediction …
• Concept description may differ based on learning
scheme/algorithm used
Styles of Learning
• Classification – learn way of “classifying”
unseen examples – put them in the correct category
• Association – learn any association between attributes
• Clustering – seek groups of examples that
belong together, without pre-classification
• Numeric prediction – prediction of numeric
quantity instead of category
• “Supervised” – learning scheme is provided
correct classification/class/category for
“training” data
• Success is measured by trying out what is
learned on independent/ previous unseen “test”
data (withholding category/class until checking
the program’s answer)
• Classification and numeric prediction are “supervised”
• Association and Clustering are “unsupervised”
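A minimal sketch of measuring success on withheld test data. The "learner" here is only a baseline that predicts the most frequent class in the training data, used as a stand-in for a real learning scheme. The instances are the outlook, windy and play? columns of the weather data used later in this chapter. The function names, the 30% test fraction and the seed are assumptions made for illustration.

```python
import random
from collections import Counter

def holdout_split(instances, test_fraction=0.3, seed=0):
    """Shuffle the instances and split them into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(training):
    """'Learn' the most frequent class in the training data (a baseline)."""
    classes = [label for _, label in training]
    return Counter(classes).most_common(1)[0][0]

def accuracy(predicted_class, test):
    """Fraction of test instances whose withheld class matches the prediction."""
    correct = sum(1 for _, label in test if label == predicted_class)
    return correct / len(test)

# (attribute values, class) pairs: (outlook, windy) -> play?
data = [
    (("sunny", "false"), "no"), (("sunny", "true"), "no"),
    (("overcast", "false"), "yes"), (("rainy", "false"), "yes"),
    (("rainy", "false"), "yes"), (("rainy", "true"), "no"),
    (("overcast", "true"), "yes"), (("sunny", "false"), "no"),
    (("sunny", "false"), "yes"), (("rainy", "false"), "yes"),
    (("sunny", "true"), "yes"), (("overcast", "true"), "yes"),
    (("overcast", "false"), "yes"), (("rainy", "true"), "no"),
]

train, test = holdout_split(data)
model = train_majority(train)      # sees only the training data
acc = accuracy(model, test)        # class labels of test data checked afterwards
print(f"predicted class: {model}, test accuracy: {acc:.2f}")
```

The point of the split is that the test instances play no part in learning, so the accuracy reflects performance on unseen examples.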
Inputs – What’s in an Example?
• Input is a set of instances (records/examples)
• Instance has set of values for pre-determined
attributes (like a record in a DB)
• I.e. input is like a single DB table, or “flat file”
– There may be things we’d like to learn that don’t fit
into this simple structure – but current technology is
largely only up to handling simple input
– You may find it useful sometimes to “denormalize”
a DB – do a JOIN of two or more tables to produce a
flat file (just make sure you don’t just re-learn the
primary keys or foreign key!)
• Flat file format means that all examples are
expected to have values for the same attributes
– Some attributes may be irrelevant for some examples
– Some attributes’ relevance may depend on value of
another attribute
– Usual workaround – irrelevant attributes have a
special irrelevant “value”
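As a rough illustration of the "denormalize" idea above, the sketch below joins two tables into one flat set of rows using Python's built-in sqlite3 module. The table names, columns and values are hypothetical. The key column is used for the JOIN but left out of the output, so a learning scheme cannot simply re-learn it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, major TEXT, resident INTEGER);
    CREATE TABLE enrollment (student_id INTEGER, course TEXT, grade REAL);
    INSERT INTO student VALUES (1, 'CS', 1), (2, 'Math', 0);
    INSERT INTO enrollment VALUES (1, 'DB', 3.7), (1, 'AI', 3.3), (2, 'DB', 4.0);
""")

# One flat row per enrollment; student_id is needed for the JOIN
# but is deliberately not selected as an attribute.
flat_rows = conn.execute("""
    SELECT s.major, s.resident, e.course, e.grade
    FROM enrollment AS e
    JOIN student AS s ON s.student_id = e.student_id
    ORDER BY e.rowid
""").fetchall()

for row in flat_rows:
    print(row)
```

Every row now has values for the same four attributes, which is what the flat file format expects.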
Kinds of attributes
• Binary/boolean – two valued; e.g. Resident Student?
• Nominal/categorical/enumerated/discrete – multiple valued,
unordered; e.g. Major
• Ordinal – ordered, but no sense of distance between values
– e.g. Fr, So, Jr, Sr
– e.g. Household Income: 1 – < 15K, 2 – 15-20K, 3 – 20-25K, 4 – 25-30K, 5 – 30-40K, 6 – 40-50K, 7 – > 50K
• Interval – ordered, distance is measurable; e.g. birth year
• Ratio – an actual measurement with defined zero point such that we could say that one value is double another or
triple, or ½; e.g. GPA
Kinds of Attributes
• Many algorithms cannot handle all of those
different types of attributes
• One approach –
– treat binary and nominal as nominal
– Treat ordinal, interval, and ratio as “numeric”
• Requires coding ordinals such as Fr, So etc as numbers
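One simple way to code ordinals as numbers is a lookup table that preserves the order. The sketch below uses the Fr/So/Jr/Sr example; the specific integer codes and the function name are assumptions.

```python
# Integer ranks chosen to preserve the order Fr < So < Jr < Sr
CLASS_YEAR_CODES = {"Fr": 1, "So": 2, "Jr": 3, "Sr": 4}

def encode_ordinal(values, codes):
    """Replace each ordinal label with its integer rank."""
    return [codes[v] for v in values]

years = ["So", "Fr", "Sr", "Jr", "Fr"]
encoded = encode_ordinal(years, CLASS_YEAR_CODES)
print(encoded)  # [2, 1, 4, 3, 1]
```

The codes keep the ordering, but the spacing between them is arbitrary, so a scheme that computes distances will treat Fr to So as the same "distance" as Jr to Sr.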
Preparing the Data
• Preparing the data “usually consumes the bulk
of the effort invested in the entire data mining process”
• Real data is frequently low quality
• Data Cleaning is frequently necessary and
time consuming
Preparing the Data
• Integrating data from multiple sources
– E.g. data from different departments – marketing,
sales, billing, customer service
– E.g. sometimes outside data is valuable – economic
conditions, weather data
• Challenges – different coding conventions,
different time periods, different aggregations,
different keys, different kinds of errors
• Point of intersection with Data Warehousing –
this work needs to be done for BOTH!
• May need to iterate to get right
Preparing the Data
• Standard format – any tool needs data to be in
some standard format
• Weka tool requires data to be in ARFF format
ARFF Format
• Lines beginning with % are comments
• File starts with name of the relation
• Attributes are defined
– Nominal attributes are followed by the set of values
– Numeric attributes list the keyword “numeric”
– No identification of class to be predicted – flexible
• Beginning of data is flagged with @data
• Data itself is comma delimited (easily created from
Access or Excel)
• Missing values are represented with a ?
% ARFF file for the weather data with some numeric features
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
% 14 instances
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
rainy, 70, 96, false, yes
rainy, 68, 80, false, yes
rainy, 65, 70, true, no
overcast, 64, 65, true, yes
sunny, 72, 95, false, no
sunny, 69, 70, false, yes
rainy, 75, 80, false, yes
sunny, 75, 70, true, yes
overcast, 72, 90, true, yes
overcast, 81, 75, false, yes
rainy, 71, 91, true, no
Figure 2.2 ARFF file for the weather data.
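To make the ARFF conventions listed above concrete, here is a small sketch of a reader for them: % comments, @relation, @attribute declarations (nominal value sets or numeric), @data, comma-delimited rows, and ? for missing values. It is a simplification written for illustration, not Weka's own parser, and it does not handle quoted names or values. The function name and the returned structure are assumptions.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row.append(None)          # missing value
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)         # nominal value kept as text
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # nominal: keep the list of allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

sample = """\
% tiny excerpt in the style of the weather data
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute play? { yes, no }
@data
sunny, 85, no
overcast, ?, yes
"""

relation, attributes, rows = parse_arff(sample)
print(relation)
print(attributes)
print(rows)
```

Note that nothing in the file marks play? as the class to be predicted; that choice is left to whoever uses the data.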
Data Preparation
• You need to understand machine learning schemes before
using them for data mining
– Some schemes treat numerics as ordinals and only compare < > =
– Others treat numerics as ratios and perform distance and other calculations
• If distance measurements are to be made, avoid scheme if
datasets contain ordinals that distort distances (e.g. income
example earlier)
• Distance between nominals is frequently all or nothing (0 or 1)
• If scheme only deals with nominals, any numerics need to
be converted to nominals (e.g. age converted to young, mid,
old) (some info is lost)
• If dataset has nominals that are coded as integers, don’t
confuse the scheme by marking them numeric
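Two of the points above can be sketched in a few lines: converting a numeric attribute to a nominal one, and the all-or-nothing distance between nominal values. The age cut points (30 and 60) and the function names are illustrative assumptions, not values from the course material.

```python
def discretize_age(age):
    """Map a numeric age to a nominal value; the cut points are illustrative."""
    if age < 30:
        return "young"
    if age < 60:
        return "mid"
    return "old"

def nominal_distance(a, b):
    """All-or-nothing distance: 0 if the values match, 1 otherwise."""
    return 0 if a == b else 1

ages = [22, 45, 71]
labels = [discretize_age(a) for a in ages]
print(labels)  # ['young', 'mid', 'old']

print(nominal_distance("sunny", "sunny"))  # 0
print(nominal_distance("sunny", "rainy"))  # 1
```

After discretizing, ages 31 and 59 become indistinguishable, which is the information loss mentioned above.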
• Some schemes require all numeric attributes to be
on a similar scale – thus normalize or standardize
(different term than DB normalization)
• One normalization approach:
Norm val = (val – min value for attribute) / (max value for attribute – min value for attribute)
• One standardization approach:
Stand val = (val – mean) / SD
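Both formulas can be written directly in Python. This is a sketch under two assumptions: the SD is taken as the population standard deviation (the slide does not say which), and the attribute is not constant (otherwise both formulas divide by zero). The sample values are made up for illustration.

```python
from statistics import mean, pstdev

def normalize(values):
    """Rescale to [0, 1]: (val - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and SD 1: (val - mean) / SD (population SD assumed)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

humidity = [65, 70, 75, 85]   # illustrative values
normalized = normalize(humidity)
standardized = standardize(humidity)
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
print(standardized)
```

Normalization pins the smallest and largest values to 0 and 1, so a single extreme value compresses everything else; standardization is less sensitive to that but has no fixed range.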
Missing Values
• In real datasets, missing values are frequently coded with
a weird value (e.g. –1, 999999)
• Sometimes different types of missing values are
distinguished – unknown, vs unrecorded vs not applicable
vs …
• Missing values may have meaning –
– e.g. maybe income may be left blank more often by people whose
income is particularly high or low
– E.g. in diagnosis, a particular test may not need to be done for a
particular case
– Get data-knowledgeable person involved
• Most machine learning schemes assume that missing value
is not particularly meaningful
– If meaningful, need to let scheme know …
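A small sketch of handling coded missing values. It replaces sentinel codes with an explicit missing marker (None) and, as one possible way of letting a scheme know that missingness may be meaningful, builds a separate 0/1 "was missing" attribute. The sentinel set, the column values and the function names are assumptions.

```python
# Assumed sentinel codes for this hypothetical income column
MISSING_CODES = {-1, 999999}

def clean_missing(values, missing_codes=MISSING_CODES):
    """Replace sentinel codes with None so they are not treated as real numbers."""
    return [None if v in missing_codes else v for v in values]

def missing_flag(values):
    """Parallel 0/1 attribute recording whether each value was missing."""
    return [1 if v is None else 0 for v in values]

incomes = [42000, -1, 58000, 999999, 31000]
cleaned = clean_missing(incomes)
flags = missing_flag(cleaned)
print(cleaned)  # [42000, None, 58000, None, 31000]
print(flags)    # [0, 1, 0, 1, 0]
```

Left uncleaned, a code such as 999999 would be read as a very large income and would distort any minimum, maximum, mean or distance computed on the attribute.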
Inaccurate Values
• Errors and omissions may be more important to mining
algorithms than to source system
• Misspelling of nominal attribute values may suggest
incorrect possible values
• Typos or incorrect measurement may yield numeric outliers
– Find via graphing / involve data-knowledgeable person
• Duplicate records – confuse scheme by giving heavier
weight to the duplicated records
• Deliberate mis-entry occurs (e.g. supermarket checkout
entering own bonus card)
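Two of the checks above are easy to automate as a first pass: nominal values that occur very rarely (often misspellings) and records that appear more than once. The sketch below uses made-up data; the function names and the rarity threshold are assumptions, and anything it flags still needs review by someone who knows the data.

```python
from collections import Counter

def rare_values(column, max_count=1):
    """Nominal values seen at most max_count times; candidates for misspellings."""
    counts = Counter(column)
    return sorted(v for v, c in counts.items() if c <= max_count)

def duplicate_records(records):
    """Records (as tuples) that occur more than once."""
    counts = Counter(records)
    return [r for r, c in counts.items() if c > 1]

majors = ["CS", "Math", "CS", "Mth", "Math", "CS"]
print(rare_values(majors))  # ['Mth']

records = [("sunny", 85, "no"), ("rainy", 70, "yes"), ("sunny", 85, "no")]
print(duplicate_records(records))  # [('sunny', 85, 'no')]
```

A rare value is not necessarily an error, and an exact duplicate may be two genuinely identical cases, so these are prompts for inspection rather than automatic fixes.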
Data Age
• We are frequently using data to predict the future
• At some point, the world / business has changed
enough that the data is no longer appropriate for
Getting to Know Your Data
• Several points above reflect this need
• Graphic display of data can help find problems (e.g.
outliers, large numbers of unknown value (e.g. 9999),
typos of nominals)
• Domain knowledgeable people are valuable – explain
anomalies, missing values, coding schemes.
• Data cleaning is extremely important.
– At least look at some records to see what is going on
– “Time spent looking at your data is always time well spent”
End Chapter 2
• Work with basic formatting data into ARFF format
– do japanbank – see
• (Data Courtesy of Dr Markov of C Conn St U)