
Data Mining
Chapter 1
Kirk Scott
1
Iris virginica
2
Iris versicolor
3
Iris setosa
4
1.1 Data Mining and Machine
Learning
5
Definition of Data Mining
• The process of discovering patterns in
data.
• (The patterns discovered must be
meaningful in that they lead to some
advantage, usually an economic one.)
• Useful patterns allow us to make predictions on new data.
6
Automatic or Semi-Automatic
Pattern Discovery
• Pattern discovery by, or with the help of, computers is of interest
• Hence machine “learning”
• The “learning” part comes from the
algorithms used
• Stay tuned for a brief discussion of
whether this is learning
7
Expression of Patterns
• 1. Black box
• 2. Transparent box
• A transparent box expression reveals the
structure of the pattern
• The structure can be examined, reasoned
about, and used to inform future decisions
8
Black Box vs. Transparent Box
• Black vs. transparent box is not a trivial
distinction
• Some modern computational approaches
are black box in nature
• They seek “answers” without necessarily
revealing the structure of the problem at
hand
9
Genetic Algorithms
• The book in particular says that genetic
algorithms are beyond the realm of
consideration
• They are explicitly designed for
optimization, not the revelation of structure
• They exemplify black box thinking
• They give a single (sub)optimal answer
without providing additional information
about the problem being optimized
10
What is the book about?
• Techniques for finding and describing
structural patterns in data
• Description of the patterns is an inherent
part of data mining as it will be considered
• Techniques that lead to black box
predictors will not be considered
11
Describing Structural Patterns
• First example:
• Contact lens table, page 6 in textbook
• The first 5 columns are effectively 5
factors under consideration
• The 6th column is the result
• Inspection reveals that this is essentially a
comprehensive listing of all possible
combinations
12
13
• The information in the table could be
stored syntactically in the form of a set of
rules
• For example (rendered as a code sketch below):
• If tear production rate = reduced then
recommendation = none
• Otherwise, if age = young and astigmatic =
no then recommendation = soft
14
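• A minimal Python sketch of the two example rules above; the function name and the "unknown" fallback are illustrative additions, and the book's full rule set would add more branches:

def recommend(age, astigmatic, tear_production_rate):
    """Apply the two example contact lens rules to one case."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return "unknown"  # the remaining cases would need further rules

print(recommend("young", "no", "normal"))    # -> soft
print(recommend("young", "yes", "reduced"))  # -> none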
• The same information could be
encapsulated graphically in a decision tree
• The representation is up to the person
working with the scenario
• The point is that the ability to represent
signifies that the structure has been
revealed
15
Input and Output
• The 5 factors represent input
• The 6th column represents output, or
“prediction”
• Overall, this distinction is typically present
in a data mining problem
16
Completeness
• This example is simplistic because it is
complete
• In interesting, practical, applied problems,
not all combinations may be present
• Individual data values will be missing
• The goal of data mining is to be able to
generalize from the given data so that
correct predictions can be made for other
cases where the results aren’t specified
17
Machine Learning
• High level learning in humans seems to
presuppose self-conscious adaptability
• Whether or not machines are capable of
learning in the human sense is open to
question
• No current machine/software combination
demonstrates behavior exactly analogous
to the behavior seen in biological systems
18
• My personal take on this is that the phrase
“machine learning” is an unfortunate
example of inflated naming
• The phrase “data mining” is neutral
• To ask whether a machine can mine is not
as fraught with difficulty as asking whether
it can learn
19
• ACM publishes a research journal with the title “Transactions on Knowledge Discovery from Data”
• Similarly, to ask whether a machine can
discover is not as fraught with difficulty as
asking whether it can learn
20
• “Machine learning” isn’t as inflated as
“Artificial intelligence”
• It is a step back from that level of hype
• In the field of artificial intelligence
researchers have scaled back their
expectations of what they might
accomplish
• They have not been able to mimic general
human intelligence
21
• If you look at data mining from an AI point
of view, you might see it as a step along
the road to mimicking learning and term it
machine learning
• If you come from a database management
systems background you might say the
point of view is results oriented rather than
process oriented
22
• The data are not animate
• The machines are not animate
• The algorithms are not animate
23
• The human mind cannot readily discern
some patterns, whether due to the quantity
of the data or the subtlety and complexity
of the patterns
• Human programmers devise algorithms
which are able to discern such patterns
• This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being
24
Naming Problems in Math
• Consider these terms from math:
• Imaginary numbers
• Complex numbers
• Chaos theory
• I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on
• You might as well talk about magic spells
25
A Biological Perspective
• Although I don’t worship at the church of
St. Charles of Darwin, the biologists make
an interesting point:
• When reasoning about animals, it’s a
mistake to anthropomorphize
• In other words, don’t ascribe human characteristics to non-human organisms
26
• The shortcoming of this point of view is that biologists tend towards the rigidity of theologians:
• Animals are not like us; they are no more
than machines
• A dog cannot feel anything approaching
human emotion, etc., etc.
27
• Whatever the truth of the emotional state
of dogs, isn’t this a valid question:
• Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines?
• The day may come when there is truth to
this, but isn’t it a wee bit premature?
28
• Ideas on this question?
29
Practically Speaking, What
Does Data Mining Consist of?
• Algorithms that examine data sets and
extract knowledge
• This time the authors have chosen the
relatively neutral words examine and
extract
• The word knowledge is admittedly a bit
tendentious itself
30
• In more detail, the idea is simply this:
• The computer is programmed to
implement an algorithm
• The algorithm is designed to run through
or process all of a data set
• The goal of the algorithm is to summarize
the data set
31
• The summary is a generalization
• For our purposes, the summary consists of
inferences that can be drawn about the
data set
• The inferences may be about relationships between the attribute values of a given data point
• They may also be about relationships among different data points in the set
32
• The contact lens data set illustrated the
idea that an inference is a rule
• For our purposes, such a rule takes the
form of “if x, then y”
• It is a comprehensive collection of such
rules that summarizes the structure of the
data points in the set
33
The Value of the Summary
• A set of rules is definitely useful for
predictions
• Once again the distinction is made
between black and transparent box
techniques
• The structural description is also of great
value because it helps the human
understand the data and what it means
34
• In this sense, at least, a degree of learning
is evident in the whole complex
• If the human consumer of the end product
ultimately learns something previously
unknown about the data set—
• Then the machine and algorithm must
have genuinely learned something about
the data that was previously unknown
35
• It may be of value to contrast this with black box systems again
• A black box system achieves results without a structural description
• This is still a valid question:
• Just because the end result is not human
learning, does that in fact lessen in any
way the degree of learning that the
computer system achieves?
36
• There are no fixed answers to these side
questions
• This is a 400 level course
• There’s no time like now to at least think
about such things, before you step out the
door with your sparkling new sheepskin
• Now, back to the grindstone
37
1.2 Simple Examples: The
Weather and Other Problems
38
• The book introduces some comparatively
simple (and unrealistic) data sets by way
of further illustration
• Note that complex, real data sets tend to
be proprietary anyway
• Databases and data sets are among the
most valuable assets of organizations that
have them
• This kind of stuff isn’t given away
39
The Weather Problem
• This is a fictitious data set
• Four symbolic categories (non-numeric
attributes) determine whether a game
should be played
• There are 36 possible combinations
• A table exists with only 14 of the
combinations
• See the following overhead
40
An (Incomplete) Table of Data
Point Values
41
Decision Lists
• More ideas about representing structure
with rules
• Sets of rules can be arranged in order
• You work your way through the list
• You accept the outcome of a rule if it
applies
• Otherwise you continue down the list
• This is called a decision list
42
• The following rules, sketched in code below, represent the idea
• No claim is made that this set of rules is complete or necessarily very good
43
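• A decision list can be sketched in Python as an ordered list of (condition, outcome) pairs; the rules below follow the flavor of the book's weather example but are illustrative, and the attribute names are assumptions:

# Ordered rules: work down the list and accept the first rule that fires.
RULES = [
    (lambda d: d["outlook"] == "sunny" and d["humidity"] == "high", "no"),
    (lambda d: d["outlook"] == "rainy" and d["windy"] == "true",    "no"),
    (lambda d: d["outlook"] == "overcast",                          "yes"),
    (lambda d: d["humidity"] == "normal",                           "yes"),
]
DEFAULT = "yes"  # fall-through outcome when no rule applies

def classify(day):
    """Evaluate the decision list on one data point."""
    for condition, outcome in RULES:
        if condition(day):
            return outcome
    return DEFAULT

day = {"outlook": "sunny", "temperature": "hot",
       "humidity": "high", "windy": "false"}
print(classify(day))  # -> no

• Note how order matters: a later rule is only consulted when every earlier rule has failed to apply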
• In an ordered list of rules:
• Each rule is potentially (most likely)
fragmentary
• It doesn’t include every factor
• Because they are ordered, the rules
depend on each other
• Individual rules, applied in isolation, do not
necessarily (most likely don’t) give the
correct result
44
Handling Numeric Data
• If all of the determining attributes are
numeric, you can refer to a numeric
attribute problem
• If some are numeric and some are
categorical, you can refer to a mixed
attribute problem
• Consider the table of the weather problem
with some numerical attributes on the
following overhead
45
46
• If numeric attributes are present, the
decision rules typically become
inequalities rather than equalities
• For example, a rule might test whether a numeric attribute exceeds a threshold (see the sketch below)
47
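• A hedged sketch of such an inequality rule; the threshold of 83 is a made-up example value, not necessarily the book's:

def play_rule(humidity):
    """A decision rule whose test is an inequality over a numeric attribute."""
    if humidity > 83:  # inequality rather than equality
        return "no"
    return "yes"

print(play_rule(90))  # -> no
print(play_rule(70))  # -> yes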
Classification vs. Association
• The premise of the foregoing discussion:
• A set of independent variables determines
the value of a dependent variable
• This is reminiscent of multivariate
statistical regression
• It is classification
• If relationships can be found among the
supposedly independent variables, this is
association
48
• The book gives examples of association rules taken from the original weather data set (one such candidate rule is scored in the sketch below)
• It is not hard to imagine that outlook,
temperature, humidity, and wind depend
on each other in whole or in part
49
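• One way to score a candidate association rule is by its confidence: the fraction of data points matching the antecedent that also match the consequent. A sketch, using three made-up rows rather than the book's full 14-instance table:

data = [
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": "false"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "true"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"},
]

def confidence(data, antecedent, consequent):
    """Fraction of rows matching `antecedent` that also match `consequent`."""
    matches = [row for row in data
               if all(row[k] == v for k, v in antecedent.items())]
    if not matches:
        return 0.0
    hits = [row for row in matches
            if all(row[k] == v for k, v in consequent.items())]
    return len(hits) / len(matches)

# Candidate rule: if temperature = cool then humidity = normal
print(confidence(data, {"temperature": "cool"}, {"humidity": "normal"}))  # -> 1.0

• Rules with confidence 1.0 hold for 100% of the matching data points; lower values correspond to rules that hold only for a high percentage of them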
• The data set is fictitious
• The data set is not exhaustive
• Unlike the contact lens example, it is not a
list of all possible combinations
• It is presumably based on some set of real
observations
50
• The book observes that many possible
association rules can be derived from the
data set
• Many would be true for 100% of the data
points in the set
• Many others would be true for a high
percent of the data points in the set
51
• In summary, the ability to derive
association rules depends on several
things
• The variables are dependent in reality
• The data set is representative of that
reality
• The scheme for eliciting the structure
produces a good model
52
Contact Lenses: An Idealized
Problem
• This subsection of the book presents the
problem again, which I will not repeat
• Recall again that the data set was a
complete list of all possible combinations
• The new part considers a complete set of
rules derived from the data set
• See the following overhead
53
54
Representing Structure
• Converting an exhaustive listing of all
possible cases into a complete set of rules
represents the structure of the problem
• Questions to consider:
• Would a smaller set of rules be possible?
• In practice, would we be willing to accept a
smaller rule set, even though this might be
an imperfect representation of the
structure?
55
A Decision Tree
• The figure on the following overhead
shows a decision tree for the data
• This is an alternative, graphical or visual
representation of the structure
• In summary, it illustrates 3 binary
decisions in a row
• As a matter of fact, this is a simplification
• It doesn't correctly classify 2 of the cases (a nested-if rendering of the tree is sketched below)
56
57
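• The simplified tree can be rendered as three nested binary tests; the branch order and outcomes below follow my reading of the book's figure and should be checked against it:

def tree_recommend(tear_rate, astigmatic, prescription):
    """The simplified contact lens tree as three binary decisions in a row."""
    if tear_rate == "reduced":       # decision 1
        return "none"
    if astigmatic == "no":           # decision 2
        return "soft"
    if prescription == "myope":      # decision 3
        return "hard"
    return "none"

print(tree_recommend("normal", "yes", "myope"))  # -> hard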
Irises: A Classic Numeric
Dataset
• In this problem, the attributes are numeric
• However, the result is still categorical
classification
• See the table illustrating the data set and a sample set of rules on the following overheads (a rule classifier is also sketched in code below)
58
59
60
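• A rule classifier over numeric attributes might look like the following sketch; the 2.45 cm petal-length split is the classic one for separating Iris setosa, while the petal-width threshold is an illustrative placeholder rather than the book's rule:

def classify_iris(petal_length, petal_width):
    """Classify an iris from numeric attributes using threshold rules."""
    if petal_length < 2.45:      # classic split isolating setosa
        return "Iris setosa"
    if petal_width < 1.75:       # placeholder threshold
        return "Iris versicolor"
    return "Iris virginica"

print(classify_iris(1.4, 0.2))  # -> Iris setosa
print(classify_iris(6.0, 2.3))  # -> Iris virginica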
CPU Performance: Introducing
Numeric Prediction
• Regression was mentioned earlier
• In general, it takes this form:
• y = b1x1 + b2x2 + … + bnxn + c
• An example of a data set with a numeric result is shown on the following overhead (a least-squares sketch follows below)
• Being able to derive an equation is nice
• Numeric data sets may yield data mining results whose structure can't be represented by a single equation
61
62
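• A least-squares fit of the regression form above can be sketched as follows; the rows are made-up stand-ins for a CPU-performance-style table, not the book's data:

import numpy as np

X = np.array([[125.0, 256.0, 6000.0],
              [ 29.0, 800.0, 3200.0],
              [ 29.0, 800.0, 3200.0],
              [ 26.0, 320.0, 2000.0]])    # attribute columns x1..x3
y = np.array([198.0, 269.0, 220.0, 75.0])  # numeric result to predict

# Append a column of ones so the intercept c is fitted along with b1..bn.
A = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b, c = coef[:-1], coef[-1]
print("weights:", b, "intercept:", c)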
Labor Negotiations: A More
Realistic Example
• This is a mixed attribute problem
• There are missing attribute values in some
data points
• It is based on actual results of Canadian
labor negotiations
63
• The classification is of an “acceptable”
labor contract
• The meaning of “acceptable”, i.e., the
classification, is open to interpretation
• For all of these reasons, this is a more
realistic example than the others
presented so far
64
• A table for this example is shown on the
following overhead
• For presentation purposes the attributes
are shown down the left hand side
• In other words, the attributes are the rows,
not the columns
• The 40 cases are abbreviated with ellipses
in the columns
65
66
Decision Trees, Again
• The book gives two possible decision
trees for the contract data, tree (a) and
tree (b)
• (a) is simpler than (b)
• (a) is also somewhat less accurate than
(b)
• (a) was obtained by trimming (b)
• See the figures on the following overhead
67
68
• The trees represent structure
• This makes it possible to reason about
what the data/structure mean
• Because the trees differ, it is possible to
draw inferences about the differences in
how they define or represent the structure
of the problem
69
• Consider the bottom left 3 options in tree
(b)
• Why is half of healthcare good, but both
none and full are bad?
• This comes down to the question of the
meaning of a good, or acceptable contract
70
• In negotiations, both labor and
management have to agree
• Half of healthcare may be “good” because
it represents an option that both parties
can accept
• In other words, compromise is a good
outcome
71
• On the other hand, it’s also possible that
the derived model is too dependent on the
data set
• The book introduces the terminology of overfitting
• If the data set were expanded to include more cases, we might find that the structure we think is there is not reflected in a more comprehensive view of the problem space
72
Soybean Classification: A Classic
Machine Learning Success
• The soybean classification problem was
based on an initial data set of
• 680 cases
• 35 attributes
• 19 disease categories (the classification)
• A human expert’s diagnosis of the cases
• See the table on the following overhead
73
74
• A subset of ~300 cases was selected from
the overall data set
• The subset was selected so that the data
points were spread out through the space
• Based on the subset, a set of rules could
be derived that correctly diagnosed plant
diseases based on the 35 attributes 97.5%
of the time
75
• The expert who collaborated and provided diagnoses also participated in a rule-building exercise
• Instead of doing automated pattern
discovery, the computer people worked
with the domain expert to build a set of
classification rules based on human
understanding of the heuristics used
76
• Back in the day, this was the human side
of artificial intelligence
• I don’t know if it’s still commonly used
• In any case, the expert-derived rule set was only successful in classifying 72% of the time
77
• In short, an automated system for
discovering patterns was more successful
than a human based system
• This is nice, but not overwhelming
• One of the strengths of computing is the ability to manage large amounts of data
• This is true in many areas and it’s not
surprising that it’s also true in pattern
discovery
78
• Continuing the philosophical discussion
from earlier, the algorithms for discovering
the patterns were still devised by humans
• Meta-pattern discovery would be the next
step up the food chain
• Can algorithms be devised that allow
programs to devise new algorithms for
diverse purposes?
79
1.3 Fielded Applications
80
Web Mining
• The relevance of pages to queries can be
ranked by applying data mining techniques
• User clicks/choices can be mined to select
suitable ads to display
• User clicks/choices can be mined to select
suitable products to suggest
• Social network pages and other online
information can be mined to form profiles
and target users
81
Decisions Involving Judgment
• Lending money is both profitable and risky
• Data mining can be used to assess risk vs.
reward
• A first-cut analysis can be done using statistics
• People above threshold x are accepted,
people below threshold y are rejected
82
• What to do with the 10% in the middle?
• These are potentially “good” customers
• In other words, people who are borderline
financially are likely to be a good pool of
customers to market loans to
• How to determine who among them is a good risk? (A sketch of the two-threshold first cut follows below.)
83
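• The two-threshold first cut can be sketched as follows; the score values and thresholds are made-up:

def first_cut(score, x=700, y=620):
    """Auto-accept above x, auto-reject below y, refer the middle band."""
    if score >= x:
        return "accept"
    if score < y:
        return "reject"
    return "refer to mined model"  # the borderline ~10%

for s in (750, 650, 600):
    print(s, first_cut(s))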
• A project was done on 1000 cases with
~20 attributes
• The data-mined system had a success rate of 2/3 in predicting repayment of loans
• Professional judgment had a success rate of 1/2
• If you can do better than flipping a coin, go
with it
84
Screening Images
• Basic problem: Identify oil slicks from
satellite images
• What was fielded:
• A system designed to process satellite
images
• The system could then be trained with
sample data to identify slicks
85
• Fielding a trainable, as opposed to pretrained, system meant that:
• It was customizable by the end user
• In particular, the user could tune the
undetected spill vs. false alarm rate
• (This is what statisticians refer to as type I
and type II error)
86
Load Forecasting
• The basic problem here is predicting
demand for electricity
• In general it depends on time of day, day
of week, and season
• A model was built and enhanced with input
based on current weather conditions
• The goal was hourly prediction of demand
two days in advance
87
• The book’s description makes this sound
like more of an analytical and statistical
model and less of a data mining problem
• It certainly includes traditional elements
• On the other hand, traditional elements
are valid as larger or smaller elements of
overall data mining
88
Diagnosis
• This refers to the diagnosis of problems in
mechanical equipment
• A traditional approach might rely on
eliciting rules from an expert
• This can be costly and time-consuming
• It can also be less accurate than what
results from data mining
89
• Specific example:
• ~300 examples of mechanical faults
• The goal was not to classify fault/no fault
• The goal was to identify fault type when there was a fault
• The book seems to think this is an
important example as much for the human
factor as the technical factor
90
• The system mined data
• The expert wasn’t satisfied
• The data was manipulated until the rule
set that resulted satisfied the expert
• The expert was not satisfied with rules that
did not conform to his understanding of the
problem domain
91
• He was satisfied with a rule set that was
consistent with his understanding of the
domain
• The derived rule set got better
performance
• It expanded the expert’s understanding of
the domain
92
• In short, the expert was not satisfied with a
black box solution
• The expert didn’t trust results based on
improved performance alone
• The expert would only accept the results
when the structural representation derived
by the system was understandable
93
Marketing and Sales
• Overall, this is a big area of data mining
application
• Banks and cell phone companies are good
examples
• They involve large sums of money and
they keep extensive transaction records
94
• They can classify customers
• They can target customers to keep them from changing providers
• They can target customers who have
behavior patterns that make them likely
candidates for profitable services
95
• Market basket analysis is another area of
data mining application
• Grocery stores are the classic example
• Your card allows them to profile you as a
customer
• Whether or not you individually are a highly profitable customer, your profile in aggregate with others is valuable
96
• Classification depends on a collection of
profiles
• Once classification is done, potentially
profitable customers can be targeted with
tailored offers
97
• Direct marketing/direct
mail/telemarketing/online marketing is a
profitable area for data mining
• A brick and mortar store may have virtually
no information on its customers
• Once a sale is made, direct marketers potentially have extensive information on the buyer
98
• Direct marketers certainly solicit repeat
customers
• They also have a virtually limitless
potential market
• How do they avoid wasting money
contacting unlikely prospects?
• How do they target marketing only towards
likely customers?
99
• Profile your customers
• Once again, one customer’s profile in
isolation may not be particularly valuable
• However, compare these profiles with the
general population
• Use this information to identify a high
probability market segment
100
Other Applications
• This is a grab bag, without detail
• There’s no reason to even list all of these
examples
• The foregoing examples were reasonably
representative
101
1.4 Machine Learning and
Statistics
102
• This can be summarized with a simple
observation:
• Data mining is an area where CS and
statistics converge
• Statistical thinking and techniques are an
integral part of data mining
103
• There is a continuum between classical
statistical techniques, like multivariate
regression, and the most recent,
computer-aided techniques
• It’s not either/or
• The middle ground is a mix
104
1.5 Generalization as Search
• This is an optional section
• I’m not going to cover it
105
1.6 Data Mining and Ethics
• In summary:
• If you’re using data to make decisions, you should consider whether the decisions are based on ethically or legally impermissible factors, like race, creed, sex, national origin, sexual preference, etc.
106
Reidentification
• You can read the details in the book
• The idea is this:
• Even if data is “anonymized”, quite
frequently the remaining attributes can be
used to infer identity
107
• This is a conundrum
• Completely anonymizing would remove
attributes
• At some point you would remove so much
that a data set would be of limited or no
value
108
• Note that this is really a database and
security question
• A structural representation of a problem by
definition is more abstract than individual
cases
• The problem arises when you post data
sets that have supposedly been
anonymized by the removal of key
attributes
109
Using Personal Information
• This is also fundamentally a database and
security problem
• Here is the data mining twist:
• When a user submits data and clicks “I
accept” on the use policy, have they
meaningfully been informed that they may
be data mined?
110
• Mining data and making decisions based
on it may affect that user or others
• It is not the same as static data collection
and dissemination
111
Wider Issues
• The book doesn’t express it this way, but
these are questions that statisticians have
long faced:
• A result may be statistically significant, but
is it of practical significance?
• If so, what?
112
• As noted earlier, what are the chances
and consequences of false positives and
false negatives?
• Also, if you can find correlations (associations), do they imply cause and effect, either one way or the other?
• Be leery of falsely inferring cause and
effect, which statistics alone can’t
establish
113
• The eternal question is, if you gain new
knowledge, how do you use it?
• Do you use your powers for good or evil?
• Who defines good and evil?
114
1.7 Further Reading
• In general, I’m not going to cover the
further reading sections
• You may want to look at them to get ideas
for your paper or project
115
The End
116