Data Mining
Chapter 2
Input: Concepts, Instances,
and Attributes
Kirk Scott
1
• Hopefully the idea of instances and attributes is clear
• Assuming there is something in the data to be mined, that something either is the concept itself or the concept is inherent in it
• Earlier data mining was defined as finding
a structural representation
• Essentially the same idea is now
expressed as finding a concept description
2
Concept Description
• The concept description needs to be:
• Intelligible
– It can be understood, discussed, disputed
• Operational
– It can be applied to actual examples
3
2.1 What’s a Concept?
4
Reiteration of Types of
Discovery
• Classification
– Prediction
• Clustering
– Outliers
• Association
• Each of these is a concept
• Successful accomplishment of these for a
data set is a concept description
5
Recall Examples
• Weather, contact lenses, iris, labor
contracts
• All were essentially classification problems
• In general, the assumption is that classes
are mutually exclusive
• In complicated problems, data sets may
be classified in multiple ways
• This means individual instances can be
“multilabeled”
6
Supervised Learning
• Classification learning is supervised
• There is a training set
• A structural representation is derived by
examining a set of instances where the
classification is known
• How to test this?
• Apply the results to another data set with
known classifications
7
Association Rules
• In any given data set there can be many
association rules
• The total may approach n(n – 1) / 2 for n
attributes
• The book doesn’t use the terms support
and confidence, but it discusses these
concepts
• These terms will be introduced
8
Support for Association Rules
• Let an association rule X = (x1, x2, …, xi) → y be given in a data set with m instances
• The support for X → y is the number of instances in the data set where the combination of x values, X, occurs, divided by m
• In other words, the association rule may be interesting if it occurs frequently enough
9
Confidence for Association
Rules
• Confidence here is based on the statistical
use of the term
• The confidence for X → y is the number of occurrences in the data set where the whole rule holds true, divided by the number of occurrences of X overall
• The book describes this idea as accuracy
• In other words, the association is more interesting the more likely it is that X does determine y
10
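• To make these two definitions concrete, the following is a minimal Java sketch (not from the book or Weka; the toy instances and the rule humidity = high → play = no are purely illustrative) that counts support and confidence as defined above:

  import java.util.List;
  import java.util.Map;

  public class RuleStats {
      // Toy instances: attribute name -> value (illustrative, not the book's data)
      static final List<Map<String, String>> DATA = List.of(
          Map.of("outlook", "sunny",    "humidity", "high",   "play", "no"),
          Map.of("outlook", "sunny",    "humidity", "high",   "play", "no"),
          Map.of("outlook", "overcast", "humidity", "high",   "play", "yes"),
          Map.of("outlook", "rainy",    "humidity", "normal", "play", "yes"));

      public static void main(String[] args) {
          Map<String, String> x = Map.of("humidity", "high");   // the X side of X -> y
          long m = DATA.size();
          long countX = DATA.stream().filter(i -> matches(i, x)).count();
          long countXy = DATA.stream()
                  .filter(i -> matches(i, x) && "no".equals(i.get("play")))
                  .count();
          double support = (double) countX / m;           // occurrences of X divided by m
          double confidence = (double) countXy / countX;  // rule holds / occurrences of X
          System.out.printf("support = %.2f, confidence = %.2f%n", support, confidence);
      }

      // True if the instance has every attribute value listed in x
      static boolean matches(Map<String, String> inst, Map<String, String> x) {
          return x.entrySet().stream().allMatch(e -> e.getValue().equals(inst.get(e.getKey())));
      }
  }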
Clustering
• We haven’t gotten the details yet, but this
is an interesting data mining problem
• Given a data set without predefined
classes, is it possible to determine classes
that the instances fall into?
• Having determined the classes, can you
then classify future instances into them?
• Outliers are instances that you can definitely say do not fall into any of the classes
11
Numerical Prediction
• This is a variation on classification
• Given n attribute values, determine the (n + 1)st attribute value
• Recall the CPU performance problem
• It would be a simple matter to dream up
sample data where the weather data
predicted how long you would play rather
than a simple yes or no
• (The book does so)
12
2.2 What’s in an Example?
13
• The authors are trying to present some
important ideas
• In case their presentation isn’t clear, I
present it here in a slightly different way
• The basic premise goes back to this
question:
• What form does a data set have to be in, in order to apply data mining techniques to it?
14
Data Sets Should Be Tabular
• The simple answer based on the
examples presented so far:
• The data has to be in tabular form,
instances with attributes
• The remainder of the discussion will
revolve around questions related to
normalization in db
15
Not All Data is Naturally Tabular
• Some data is not most naturally
represented in tabular form
• Consider OO db’s, where the natural
representation is tree-like
• How should such a representation be converted to a tabular form that is amenable to data mining?
16
Correctly Normalized Data May
Fall into Multiple Tables
• You might also have data which naturally
falls into >1 table
• Or, you might have data (god forbid) that
has been normalized into >1 table
• How do you make it conform to the single
table model (instances with attributes) for
data mining?
17
• Tree-like data and multi-table data may be
related questions
• It would not be surprising to find that a
conversion of a tree to a table resulted in
>1 table
18
Denormalization
• The situation goes against the grain of
correct database design
• The classification, association, and
clustering you intend to do may cross db
entity boundaries
• The fact that you want to do mining on a
single tabular representation of the data
means you have to denormalize
19
• In short, you combine multiple tables back
into one table
• The end result is the monstrosity that is
railed against in normalization theory:
• The monolithic, one-table db
20
The Book’s Family Examples
• Family relationships are typically viewed in
tree-like form
• The book considers a family tree and the
relationship “is a sister of”
• The factors for inferring sisterhood:
• Two people, one female
• The same (or at least one common)
parents for both people
21
Two People in the Same Table
• Suppose you want to do this in tabular
form
• You end up with the two people who might
be in a sisterhood relationship in the same
table
• These occurrences of people are matched
with a classification, yes or no
22
• Recall that according to normalization, a
truly one-to-one relationship can be stored
in a single table
• Pairings of all people would result in lots of
instances/rows where the classification
was simply no
• This isn’t too convenient
23
• In theory, you might restrict your attention
only to those rows where the classification
was yes
• This restriction is known as the “closed
world assumption” in data mining
• Unfortunately, it is hardly ever the case
that you have a problem where this kind of
simplifying assumption applies
• You have to deal with all cases
24
Two People with Attributes in
the Same Table
• Suppose the two people are only listed by
name in the table, without parent
information
• The classification might be correct, but this
is of no help
• There are no attributes to infer sisterhood
from
• The table has to include attributes about
the two people, namely parent information
25
The Connection with
Normalization
• There is a problem with denormalized data
mining which is completely analogous to
the normalization problem
• Suppose you have two people in the same
instance (the same row) with their
attributes
• By definition, you will have stray
dependencies
• The Person identifiers determine the attribute values
26
• So far we’ve considered classification
• However, what would happen if you mined
for associations?
• The algorithm would find the perfectly true,
but already known associations between
the pk identifiers of the people and their
attribute fields
• This is not helpful
• It’s a waste of effort
27
Recursive Relationships
• Recall the monarch and product-assembly
examples from db
• These give tables in recursive
relationships with themselves or others
• In terms of the book’s example, how do
you deal with parenthood when there is a
potentially unlimited sequence of
ancestors?
28
• In general, the answer is that you would
need recursive rules
• Mining recursive rules is a step beyond
classification, association, etc.
• The good news is that this topic will not be
covered further
• It’s simply of interest to know that such
problems can arise
29
One-to-Many Relationships
• A denormalized table might be the result of joining two tables in a pk-fk relationship
• If the classification is on the “one” side of
the relationship, then you have multiple
instances in the table which are not
independent
• In data mining this is called a multi-instance situation
30
• The multiple instances belonging to one
classification together actually form one
example of the concept under
consideration in such a problem
• Data mining algorithms have been
developed to handle cases like these
• They will be presented with the other
algorithms later
31
Summary of 2.2
• The fundamental practical idea here is that
data sets have to be manipulated into a
form that’s suitable for mining
• This is the input side of data mining
• The reality is that denormalized tables
may be required
• Data mining can facetiously be referred to as file mining, since the required form does not necessarily agree with db theory
32
• The situation can be restated in this way:
• Assemble the query results first; then mine
them
• This leads to an open question:
• Would it be possible to develop a data
mining system that could encompass >1
table, crawling through the pk-fk
relationships like a query, finding
associations?
33
2.3 What’s in an Attribute?
34
• This subsection falls into two parts:
• 1. Some ideas that go back to db design
and normalization questions
• 2. Some ideas having to do with data type
35
Design and Normalization
• You could include different kinds
(subtypes) of entities in the same table
• To make this work you would have to
include all of the fields of all of the kinds of
entities
• The fields that didn’t apply to a particular
instance would be null
• The book uses transportation vehicles as
an example: ships and trucks
36
• You could also have fields in a table that
depend on each other (ack)
• The book gives married T/F and spouse’s
name as examples
• Again, you can handle this with null values
37
Data Types
• The simplest distinction is numeric vs.
categorical
• Some synonyms for categorical: symbolic,
nominal, enumerated, discrete
• There are also two-valued variables
known as Boolean or dichotomy
38
Spectrum of Data Types
• 1. Nominal = unordered, unmeasurable
named categories
• Example: sunny, overcast, rainy
• 2. Ordinal = named categories that can be
put into a logical order but which have no
intrinsic numeric value and no defined
distance between them (support < or >)
• Example: hot, mild, cool
39
• 3. Interval = numeric values where the
distance between them makes sense
(support subtraction) but other operations
do not
• Example: Time expressed in years
40
• 4. Ratio = numeric values where all
operations make sense
• These are real or continuous (or possibly
integer) values on a scale with a natural 0
point
• Example: Physical distance
41
• In principle, data mining has to handle all
possible types of data
• In practice, applied systems typically have
some useful subset of the type distinctions
given above
• You adapt your data to the types provided
42
2.4 Preparing the Input
43
• In practice, preparing the data can take
more time and effort than doing the mining
• Data needs to be in the format required by
whatever mining software you’re using
• In Weka, this is ARFF, the Attribute-Relation File Format
44
• Real data tends to be low in quality
• Think data integrity and completeness
• “Cleaning” the data before mining it pays
off
45
• Weka
• From Wikipedia, the free encyclopedia
46
• The Weka or woodhen (Gallirallus
australis) is a flightless bird species of the
rail family. It is endemic to New Zealand,
where four subspecies are recognized.
Weka are sturdy brown birds, about the
size of a chicken. As omnivores, they feed
mainly on invertebrates and fruit. Weka
usually lay eggs between August and
January; both sexes help to incubate.
47
48
• Behaviour
• …
• Where the Weka is relatively common,
their furtive curiosity leads them to search
around houses and camps for food scraps,
or anything unfamiliar and transportable.[2]
49
50
51
52
Gathering the Data Together
• In a large organization, different
departments may manage their own data
• Global level data mining will require
integration of data from multiple databases
• If you’re lucky, the organization has
already created a unified archive, a data
warehouse
• Interesting mining may also require
integrating external data into the data set
53
Aggregation
• It may be necessary to aggregate data in
order to mine it successfully
• You may have data on parameters of
interest spread through multiple instances
• To be useful for solving the problem, it may be necessary, for example, to add the values of data points together
54
• The type of aggregation is important
• Remember the aggregation operators in
db: COUNT, SUM, AVERAGE, etc.
• The level of aggregation is important
• Remember GROUP BY in db
• Do you aggregate all instances, or is it
useful to do it by subsets of some sort?
55
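• As a sketch of the difference between aggregating everything and aggregating by subsets, here is a short Java example; the Reading record, the region names, and the values are hypothetical, and the grouping simply mirrors SQL's GROUP BY with SUM:

  import java.util.List;
  import java.util.Map;
  import java.util.stream.Collectors;

  public class Aggregate {
      // Hypothetical instances: a region name and a numeric amount
      record Reading(String region, double amount) {}

      public static void main(String[] args) {
          List<Reading> data = List.of(
              new Reading("north", 10), new Reading("north", 15),
              new Reading("south", 7),  new Reading("south", 3));

          // Aggregate over all instances at once (no grouping)
          double total = data.stream().mapToDouble(Reading::amount).sum();

          // Aggregate by subset, analogous to GROUP BY region with SUM in SQL
          Map<String, Double> byRegion = data.stream()
              .collect(Collectors.groupingBy(Reading::region,
                       Collectors.summingDouble(Reading::amount)));

          System.out.println("total = " + total + ", by region = " + byRegion);
      }
  }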
ARFF (Format)
• ARFF is the regular version of the data
format for Weka
• XRFF is the XML version
• In ARFF:
• % marks a comment
• @ marks file descriptor information,
relation, attributes, and data
56
• For a categorical attribute, the set of
values is given in braces
• Categorical values containing spaces have
to be put in quotation marks
• Numeric attributes are simply identified as
numeric
57
• Instances in the file are given line by line
• They are separated by new lines
• Attribute values within an instance are separated by commas
• Missing values (nulls) are indicated with a question mark
58
• In an ARFF file, a classification attribute, if there is one, is treated no differently than any other attribute
• The format is equally suited to
classification, association, or cluster
mining
• Figure 2.2, on the following overhead,
shows the weather data set in ARFF
59
60
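• Since Figure 2.2 is not reproduced here, the following is a minimal sketch of what the weather data looks like in ARFF; the attribute declarations follow the book's weather data, but the three data rows are merely illustrative, not the book's full data set:

  % ARFF version of the weather data (sketch; only a few illustrative rows)
  @relation weather

  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {TRUE, FALSE}
  @attribute play {yes, no}

  @data
  sunny, 85, 85, FALSE, no
  overcast, 83, 86, FALSE, yes
  rainy, 70, 96, TRUE, ?

• Note the % comment, the @ descriptors, the braces around the nominal value sets, and a ? standing in for a missing value, as described on the preceding overheads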
Weka Has Three Additional
Attribute Types
• String = the moral equivalent of
VARCHAR in db
• Date = the equivalent of DATE in db
• Relational = Stay tuned; this will require
some explanation
61
Relational-Valued Attributes
• The book gives an example which is OK,
but it’s not necessarily presented in the
clearest way possible
• My plan is to first give a bunch of
explanatory background
• Then explain the book’s example in a
slightly different order than it does
62
Relational Background
• Recall that multivalued problems can be
viewed as mining the result of a 1-m join
• In preparing a data set for mining, this is
what a relational-valued attribute is:
• It is an attribute that can contain or consist
of multiple instances of the same kind of
set of values, where these sets belong
together for some reason
63
• In a 1-m, pk-fk join, the multiple sets are
the rows of the many table which belong
together because they share the same fk
value
• In case this general overview isn’t clear,
the idea can be illustrated with mothers
and children
64
Mothers and Children
• Suppose you ran this query:
• SELECT *
• FROM Mother, Child
• WHERE Mother.motherid = Child.motherid
• ORDER BY motherid
• Children of the same mother would then appear grouped together in the result
65
• In data mining, it is possible that you
would want to elicit information about
children in general
• You might also want to elicit information
that generally held for children of the same
mother
• Conceptually, you would be mining
information about siblinghood
66
• This is where relational-valued attributes
come in
• From a relational point of view, the
representation is wrong
• First normal form says you have flat files
with no repeating groups
• But for data mining purposes, in ARFF
format, you want the repeating groups
67
Explaining the Book’s Example
• The book adapts the weather/play-a-game data to a multivalued example
• The new twist is this: Games extend over
2 days, not just one
• Each day is still a single instance
• But for each game, there are two of these
instances which belong together
68
• The instances in the rows of the table
representing game information will be
multivalued
• Each game will contain two days’ worth of
weather data
• These two days’ worth of data are the
contents of one relational-valued attribute
in the overall data set
69
• Note that in general, relational attributes,
multivalued attributes, are not limited to 2
sets of values
• This is just an artifact of their example,
where games last exactly 2 days
70
• The book also uses terminology which
could be clearer
• In their new weather table they name the
relational attribute “bag”
• Suffice it to say that it would have been
clearer if they had named the attribute
game_days or something like that
71
• There are three major attributes in the
book’s table:
• bag_ID (id for sets of days belonging to
games)
• bag (multivalued relational attribute
containing days belonging to games,
grouped by game)
• play (the classification to play, yes or no)
72
• The bag attribute has 4 (familiar) attributes
describing the multivalued instances (of
day):
• outlook
• temperature
• humidity
• windy
73
• In the body of the ARFF table, the
multivalued entries are structured in this
way:
• The data for the multiple days that belong
together for a single bag_ID is enclosed in
quotation marks
• Within the quotation marks, the individual
sets of day data are separated by “\n”, the
new line character
74
• The book’s ARFF table is shown in Figure
2.3 on the following overhead
75
76
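• Since Figure 2.3 is not reproduced here either, this is a minimal sketch of the multi-instance version; the attribute names come from the preceding overheads, the data rows are illustrative, and the declarations follow Weka's relational-attribute syntax as I understand it:

  % ARFF sketch of the two-day game data (illustrative rows only)
  @relation weather-games

  @attribute bag_ID numeric
  @attribute bag relational
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
  @end bag
  @attribute play {yes, no}

  @data
  1, "sunny,85,85,FALSE\novercast,80,90,TRUE", yes
  2, "rainy,65,70,TRUE\nrainy,70,96,FALSE", no

• Each quoted bag value packs two days' worth of weather readings, separated by \n, exactly as described above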
Sparse Data
• Some data sets are sparse
• In this context the book means 0’s for
numerical values, not nulls
• Rather than listing everything, a row can
be economically expressed by showing
only the values present
77
• In ARFF, the attribute values for a sparse row are:
• Identified by attribute number, starting with 0
• Followed by the value
• Separated by commas
• Enclosed in braces
• E.g.:
• {1 X, 6 Y, 10 “Class A”}
• This doesn’t work for nulls; you still have to include ?’s
78
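• For comparison, here is a sketch of one hypothetical instance with eleven attributes, written out densely and then sparsely; attribute numbering starts at 0:

  @data
  % dense form: every value listed, zeros included
  0, X, 0, 0, 0, 0, Y, 0, 0, 0, "Class A"
  % sparse form: only the nonzero values, each preceded by its attribute number
  {1 X, 6 Y, 10 "Class A"}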
Attribute Types
• The bottom line is that ARFF only has two
fundamental types: nominal and numeric
• String attributes are effectively nominal
• Date attributes are effectively numeric
• (Recall the discussions of stuff like this in
db)
• The rest of this subsection has to do with
numeric types in particular
79
Numerics as Ordinals
• The important point is this:
• Different data mining algorithms treat
vanilla numeric values differently
• One algorithm may treat numerics as ordinals, where only ordering applies, generating rules based on <, =, > comparisons
80
Numerics as Ratio Values
• Another algorithm may treat numerics as
ratio values
• Recall that all arithmetic operations are
defined in this case
• The algorithm may normalize ratio values
81
Normalization
• Normalization means putting values into a range, most commonly the range 0–1
• A simple approach for positive values:
Divide any given data value by the
maximum present
• Another simple approach for positive
values: Subtract the minimum from the
data value and divide by (max – min)
82
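• A short Java sketch of both approaches for positive values (the array of values is illustrative):

  import java.util.Arrays;

  public class Normalize {
      public static void main(String[] args) {
          double[] values = {2.0, 5.0, 9.0, 10.0};     // illustrative positive data
          double max = Arrays.stream(values).max().getAsDouble();
          double min = Arrays.stream(values).min().getAsDouble();
          for (double x : values) {
              double byMax = x / max;                   // divide by the maximum present
              double minMax = (x - min) / (max - min);  // subtract min, divide by (max - min)
              System.out.printf("%.1f -> %.2f (by max), %.2f (min-max)%n", x, byMax, minMax);
          }
      }
  }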
Standardization
• Values can also be statistically
standardized
• Each data point is converted using this
approach:
• x_standardized = (x – μ) / σ
• This puts the values into a distribution
where the mean is 0 and the standard
deviation is 1
83
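• A short Java sketch of the conversion, using the mean and standard deviation of an illustrative array:

  public class Standardize {
      public static void main(String[] args) {
          double[] values = {2.0, 5.0, 9.0, 10.0};     // illustrative data
          double mean = 0;
          for (double x : values) mean += x;
          mean /= values.length;
          double variance = 0;
          for (double x : values) variance += (x - mean) * (x - mean);
          double sigma = Math.sqrt(variance / values.length);  // population standard deviation
          for (double x : values) {
              double z = (x - mean) / sigma;            // result has mean 0, std deviation 1
              System.out.printf("%.1f -> %.2f%n", x, z);
          }
      }
  }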
Distance as an Example of
Ratio Values
• Consider the calculation of distance in n-dimensional space, 2-space for example
• Calculating the square root of the sum of
the squares of the differences of the
coordinates involves using arithmetic
operators other than subtraction
• Normalization is implicated in a situation
like this
84
• Given some (x, y) space, suppose x is in the range 0–10 and y is in the range 0–100
• Do you normalize both x and y before
calculating distances or not?
• Another way of stating this is, do x and y
make corresponding contributions to the
measure of distance between two data
points or not?
85
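• A short Java sketch of the question (the two points and the 0–10 / 0–100 ranges are illustrative): Euclidean distance on the raw coordinates versus on coordinates first scaled into 0–1:

  public class Distance {
      // Euclidean distance in 2-space: square root of the sum of squared coordinate differences
      static double euclidean(double x1, double y1, double x2, double y2) {
          return Math.sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));
      }

      public static void main(String[] args) {
          // Illustrative points: x in 0-10, y in 0-100
          double ax = 2, ay = 90, bx = 9, by = 10;
          double raw = euclidean(ax, ay, bx, by);                          // y dominates the result
          double scaled = euclidean(ax / 10, ay / 100, bx / 10, by / 100); // both contribute comparably
          System.out.printf("raw = %.2f, normalized = %.2f%n", raw, scaled);
      }
  }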
Nominal Attributes and Distance
• This is a crude measure of distance for
nominal attributes:
• If two instances have the same value for that attribute, the distance between them, measured on that attribute, is 0
• If two instances have a different value for
an attribute, the distance between them is
1
86
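• A short Java sketch of that rule (the attribute values are illustrative):

  public class NominalDistance {
      // 0 if the two nominal values match, 1 otherwise
      static int distance(String a, String b) {
          return a.equals(b) ? 0 : 1;
      }

      public static void main(String[] args) {
          System.out.println(distance("sunny", "sunny"));  // 0
          System.out.println(distance("sunny", "rainy"));  // 1
      }
  }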
• There are cases where nominal attributes
can be reverse engineered back to
numerics
• One example from the book: Zip codes
and geographic location coordinates
• Recall that zip codes came up in db in a
similar way as determinants of geographic
locations
87
Nominal vs. Numeric
• Just like in db, where the assertion is made that an id “number” field should be TEXT:
• In data mining there may be attributes
containing numeric digits which are simply
nominal fields and should be mined as
such
88
• Finally, some algorithms support nominals
but not ordinals
• In the contact lens data, young < pre-presbyopic < presbyopic
• If this ordinal relationship is not recognized, a complete and correct set of rules can still be mined
• However, a complete and correct set of rules about 1/3 as large can be mined in a system that recognizes the relationship
89
Missing Values
• This is essentially a discussion of nulls
• The only new element consists of two
questions:
• Can you infer anything from the absence
of values?
• Would it be possible to meaningfully code
why values are absent and mine
something from this?
90
Inaccurate Values
• This is essentially a discussion of data
integrity
• Both data miners and regular db users
have to cope with faulty data one way or
the other
• The authors say this is especially
important when mining
• It’s especially important to the data miner if the data miner ascribes more significance to an attribute than a regular user does
91
The End
92