Operations Research & Data Mining


Siggi Olafsson
Associate Professor
Department of Industrial Engineering
Iowa State University
20th European Conference on Operational Research, Rhodes, Greece, July 4-7, 2004
Purpose of Talk




"Should I be here?"

- Give a definition and an overview of data mining as it relates to operations research
- Present some examples to give a flavor for the type of work that is possible
- My views on the future of OR and data mining
- Aim for the talk to be accessible without prior knowledge of data mining
Overview


- Background
- Intersection of OR and Data Mining
  - Optimization algorithms used for data mining
    - Data visualization
    - Attribute selection
    - Classification
    - Unsupervised learning
  - Data mining used in OR applications
    - Production scheduling
  - Optimization methods applied to the output of standard data mining algorithms
    - Selecting and improving decision trees
- Open research areas
Background

- Rapidly growing interest in data mining among operations research academics and practitioners
- Evidenced, for example, by the increased data mining presence in professional organizations:
  - New INFORMS Section on Data Mining
  - Large number of data mining sessions at INFORMS and IIE research conferences
  - Special issues of Computers & Operations Research, IIE Transactions, Discrete Applied Mathematics, etc.
  - Numerous presentations and sessions at this conference
What is Data Mining?
What is Data Mining, Really?

- Extracting meaningful, previously unknown patterns or knowledge from large databases
- The knowledge discovery process:
  1. Define objective: business/scientific objective, data mining objective
  2. Prepare data: data cleaning, data selection, attribute selection, visualization
  3. Mine knowledge: classification, association rule discovery, clustering
  4. Interpret results: predictive models, structural insights
Interdisciplinary Field
[Diagram: data mining at the intersection of statistics, machine learning, databases, and optimization]
Input Engineering


- Preparing the data may take as much as 70% of the entire effort
- Numerous steps, including:
  - Combining data sources
  - Transforming attributes
  - Data cleaning
  - Data selection
  - Attribute selection
  - Data visualization
- Many of those have connections with operations research, and optimization in particular
Data Visualization
- Visualizing the data is important in any data mining project
- Generally difficult because the data are high-dimensional, i.e., hundreds or thousands of attributes (variables)
- How can we best visualize such data in 2 or 3 dimensions?
- Traditional techniques include multidimensional scaling, which uses nonlinear optimization
Optimization Formulation



- Recent combinatorial optimization formulation by Abbiw-Jackson, Golden, Raghavan, and Wasil (2004)
- Map a set M of m points from R^r to R^q, q = 2, 3
- Approximate the q-dimensional space by a lattice N, with the binary variable x_{ik} = 1 if point i is placed at lattice point k:

$$\min \; \sum_{i \in M} \sum_{\substack{j \in M \\ j > i}} \sum_{k \in N} \sum_{l \in N} F\bigl(d^{\mathrm{original}}(i,j),\, d^{\mathrm{new}}(k,l)\bigr)\, x_{ik}\, x_{jl}$$

$$\text{s.t.} \quad \sum_{k \in N} x_{ik} = 1, \; i \in M; \qquad x_{ik} \in \{0,1\}$$

where d^original(i,j) is the distance measure in R^r, d^new(k,l) is the distance measure in R^q, and F is a function such as least squares, the Sammon map, etc.
Solution Methods




- The formulation is a Quadratic Assignment Problem (QAP)
- Not possible to solve exactly for large-scale problems
- A local search procedure is proposed
- Key to the formulation is the selection of the objective function F, e.g., the Sammon map:

$$\min \; \frac{1}{\sum_{i \in M} \sum_{\substack{j \in M \\ j > i}} d^{\mathrm{original}}(i,j)} \sum_{i \in M} \sum_{\substack{j \in M \\ j > i}} \sum_{k \in N} \sum_{l \in N} \frac{\bigl(d^{\mathrm{original}}(i,j) - d^{\mathrm{new}}(k,l)\bigr)^{2}}{d^{\mathrm{original}}(i,j)} \, x_{ik}\, x_{jl}$$
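To make the Sammon-map objective concrete, here is a minimal Python sketch (illustrative only, not the authors' code) that evaluates the objective for a candidate assignment of points to lattice sites; a local search would repeatedly perturb the assignment (e.g., by swapping sites) and keep changes that lower this value. The data, names, and lattice below are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def sammon_stress(X, lattice, assign):
    """Sammon-map objective for a given assignment of points to lattice sites.

    X       : (m, r) array of points in the original space R^r
    lattice : (n, q) array of candidate lattice coordinates in R^q (q = 2 or 3)
    assign  : length-m integer array, assign[i] = lattice site chosen for point i
    """
    d_orig = squareform(pdist(X))               # d^original(i, j)
    d_new = squareform(pdist(lattice[assign]))  # d^new between the chosen sites
    iu = np.triu_indices(len(X), k=1)           # sum over j > i only
    num = (d_orig[iu] - d_new[iu]) ** 2 / d_orig[iu]
    return num.sum() / d_orig[iu].sum()

# Toy usage: 10 points in R^5 mapped onto a 4x4 grid in R^2
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
lattice = np.array([(i, j) for i in range(4) for j in range(4)], dtype=float)
assign = rng.choice(len(lattice), size=10, replace=False)
print(sammon_stress(X, lattice, assign))
```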
Attribute Selection
- Usually a large number of attributes
- Some attributes are redundant or irrelevant and should be removed
- Benefits:
  - Faster subsequent induction
  - Simpler models (important in data mining)
  - Better (predictive) performance of models
  - Discovering which attributes are important (descriptive or structural knowledge)
Optimization Formulation

- Define decision variable

$$x_j = \begin{cases} 1, & \text{if attribute } j \text{ is selected,} \\ 0, & \text{otherwise.} \end{cases}$$

- Combinatorial optimization problem

$$\max \; f(\mathbf{x}), \quad \mathbf{x} = (x_1, x_2, \ldots, x_n), \qquad \text{s.t.} \quad x_j \in \{0,1\} \;\; \forall j$$

- The number of solutions is 2^n - 1
- How should the objective function be defined?
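One concrete (and purely illustrative) choice of the objective f(x) is a wrapper criterion: the cross-validated accuracy of a classifier restricted to the selected attributes. The sketch below maximizes it with a simple greedy forward search; the dataset and the naive Bayes classifier are arbitrary stand-ins, not choices made in the talk, and the approaches on the next slide search the same space more systematically.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

def f(subset):
    """Objective f(x): cross-validated accuracy using only the selected attributes."""
    if not subset:
        return 0.0
    return cross_val_score(GaussianNB(), X[:, sorted(subset)], y, cv=5).mean()

# Greedy forward selection: repeatedly add the attribute that most improves f(x)
selected, best = set(), 0.0
improved = True
while improved:
    improved = False
    for j in set(range(X.shape[1])) - selected:
        score = f(selected | {j})
        if score > best:
            best, best_j, improved = score, j, True
    if improved:
        selected.add(best_j)

print(sorted(selected), round(best, 3))
```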
Solution Methods

- Non-linear objective function (defining a good objective is a major issue)
- Mathematical programming approach (Bradley, Mangasarian and Street, 1998)
- Metaheuristics have been applied extensively
  - Genetic algorithms, simulated annealing
- Nested partitions method (Olafsson and Yang, 2004)
  - Intelligent partitioning: take advantage of what is known in data mining about evaluating attributes
  - Random instance sampling: in each step the algorithm uses a sample of instances, which improves scalability
Learning from Data
- Each data point (instance) represents an example from which we can learn
- The instances are either:
  - Labeled (supervised learning): one attribute is of special interest (called the class or target) and each instance is labeled by its class value
  - Unlabeled (unsupervised learning)
- Instances are assumed to be independent (however, spatial and temporal data mining are active areas of research)
Learning Tasks in Data Mining

- Classification (supervised learning): learn how to classify data into one of a given number of categories or classes
- Clustering (unsupervised learning): learn natural groupings (clusters) of data
- Association rule discovery: learn correlations (associations) among the data instances; also called market basket analysis
Classification
- Classification is the most common learning task in data mining
- Many methods have been proposed: decision trees, neural networks, support vector machines, Bayesian networks, etc.
- The algorithm is trained on part of the data and its accuracy is tested on independent data (or using cross-validation)
- Optimization is relevant to many classification methods
Optimization Formulation




- Suppose we have n attributes and each instance has been labeled as belonging to one of two classes
- Represent the two classes by the matrices A and B (one row per instance)
- Need to learn what separates the points in the two sets (if they can be separated)
- In a 1965 Operations Research article, Olvi Mangasarian studied the case where the two sets can be separated by a hyperplane $\mathbf{w}'\mathbf{x} - \alpha = 0$, i.e., where

$$A\mathbf{w} \geq e\alpha + e, \qquad B\mathbf{w} \leq e\alpha - e$$

has a solution (w, α).
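One way to check this condition numerically (a sketch that assumes the inequalities above are the intended ones; it is not code from the talk or from the 1965 paper) is to pose it as a linear programming feasibility problem:

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(A, B):
    """Check whether A w >= e*alpha + e and B w <= e*alpha - e has a solution (w, alpha)."""
    n = A.shape[1]
    # Variables: w (n components) and alpha (1 component); zero objective = pure feasibility.
    # Rewrite as -A w + e*alpha <= -e  and  B w - e*alpha <= -e.
    A_ub = np.vstack([np.hstack([-A, np.ones((len(A), 1))]),
                      np.hstack([ B, -np.ones((len(B), 1))])])
    b_ub = -np.ones(len(A) + len(B))
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1))
    if not res.success:
        return False, None
    return True, (res.x[:n], res.x[n])

# Two well-separated Gaussian clouds in the plane
rng = np.random.default_rng(1)
A = rng.normal(loc=[0, 0], size=(20, 2))
B = rng.normal(loc=[6, 6], size=(20, 2))
print(linearly_separable(A, B)[0])   # True for these clouds
```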
Separating Hyperplane
[Figure: classes A and B in the (x1, x2) plane, the closest points c and d in their convex hulls, and the separating hyperplane between them]
Finding the Closest Points
Formulate as a QP:

$$\min_{c,\, d} \; \tfrac{1}{2}\, \| c - d \|^{2}$$

$$\text{s.t.} \quad c = \sum_{i \,:\, \text{Class A}} \alpha_i x_i, \qquad d = \sum_{i \,:\, \text{Class B}} \alpha_i x_i,$$

$$\sum_{i \,:\, \text{Class A}} \alpha_i = 1, \qquad \sum_{i \,:\, \text{Class B}} \alpha_i = 1, \qquad \alpha_i \geq 0$$
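The QP above can be handed directly to a generic convex solver. The sketch below is illustrative; cvxpy and the toy Gaussian data are assumptions, not part of the talk.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(loc=[0, 0], size=(30, 2))   # Class A points (rows)
B = rng.normal(loc=[5, 5], size=(30, 2))   # Class B points (rows)

alpha_a = cp.Variable(len(A), nonneg=True)  # convex-combination weights for class A
alpha_b = cp.Variable(len(B), nonneg=True)  # convex-combination weights for class B
c = A.T @ alpha_a                           # closest point in conv(A)
d = B.T @ alpha_b                           # closest point in conv(B)

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(c - d)),
                  [cp.sum(alpha_a) == 1, cp.sum(alpha_b) == 1])
prob.solve()

# The separating hyperplane is orthogonal to c - d and passes through their midpoint
w = c.value - d.value
gamma = w @ (c.value + d.value) / 2
print(w, gamma)
```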
Support Vector Machines
[Figure: classes A and B in the (x1, x2) plane with the separating hyperplane and the support vectors highlighted]
Limitations

- The points (instances) may not be separable by a hyperplane
  - Add error terms to be minimized
- A linear separation is quite limited

[Figure: two classes in the (x1, x2) plane arranged so that no single hyperplane separates them]

The solution is to map the data to a higher-dimensional space.
Wolfe Dual Problem

- First formulate the Wolfe dual:

$$\max_{\boldsymbol{\alpha}} \; W(\boldsymbol{\alpha}) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

$$\text{subject to} \quad 0 \leq \alpha_i \leq C, \qquad \sum_i \alpha_i y_i = 0.$$

- Now the data only appears in the dot product in the objective function
Kernel Functions

- Use kernel functions to map the data and replace the dot product with

$$K(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}), \qquad \Phi : \mathbb{R}^n \rightarrow H$$

- For example,

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + 1)^p$$

$$K(\mathbf{x}, \mathbf{y}) = e^{-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2}$$

$$K(\mathbf{x}, \mathbf{y}) = \tanh(\kappa\, \mathbf{x} \cdot \mathbf{y} - \delta)$$
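For illustration (a sketch using a standard library implementation, not tied to any specific work cited in the talk), the kernels above correspond to the polynomial, RBF, and sigmoid options of a soft-margin SVM solver, and a nonlinearly separable data set shows why they matter:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the input space
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0)   # C bounds the dual variables: 0 <= alpha_i <= C
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:8s} accuracy: {acc:.3f}")
```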
Other Classification Work


- Extensive publications on SVM and mathematical programming for classification
- Several other approaches are also relevant, e.g.:
  - Logical Analysis of Data (LAD) learns logical expressions to classify the target attribute (series of papers by Hammer, Boros, et al.)
  - A related approach is the Logic Data Miner Lsquare (e.g., talk by Felici, Truemper, and Paola last Monday)
  - Bayesian networks are often used, and finding the best structure of such networks is a combinatorial optimization problem (further discussed in the next talk)
Data Clustering
- Now we do not have labeled data to train on (unsupervised learning)
- Want to identify natural clusters or groupings of data instances
- Many possible sets of clusters: what makes a set of clusters good?
Optimization Formulation

- Given a set A of m points, find the centers C_j of k clusters that minimize the 1-norm distance from each point to its nearest center:

$$\min_{C,\, D} \; \sum_{i=1}^{m} \; \min_{j = 1, \ldots, k} \; e^{T} D_{ij}$$

$$\text{s.t.} \quad -D_{ij} \leq A_i^{T} - C_j \leq D_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, k$$

- This formulation is due to Bradley, Mangasarian, and Street (1997)
- Much more work is needed in this area
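A common heuristic for this 1-norm (k-median) problem alternates between assigning each point to its nearest center and recomputing each center as the coordinate-wise median of its cluster. The sketch below is illustrative and is not the concave-minimization algorithm of Bradley, Mangasarian, and Street (1997); the toy data are made up.

```python
import numpy as np

def k_median(A, k, iters=50, seed=0):
    """Alternating heuristic for min sum_i min_j ||A_i - C_j||_1."""
    rng = np.random.default_rng(seed)
    C = A[rng.choice(len(A), size=k, replace=False)].astype(float)
    labels = np.zeros(len(A), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest center in the 1-norm
        dist = np.abs(A[:, None, :] - C[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: coordinate-wise median of each cluster
        for j in range(k):
            if np.any(labels == j):
                C[j] = np.median(A[labels == j], axis=0)
    return C, labels

# Toy usage: three clouds in the plane
rng = np.random.default_rng(3)
A = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in ([0, 0], [6, 0], [3, 5])])
centers, labels = k_median(A, k=3)
print(np.round(centers, 2))
```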
Association Rule Discovery




- Find strong associations among instances (e.g., high support and confidence)
- Originally used in market basket analysis, e.g., what products are candidates for cross-sell, up-sell, etc.
- Define an item as an attribute-value pair
- Algorithmic approach (Agrawal et al., 1992, Apriori and related methods); see the sketch below:
  - Generate frequent item sets with high support
  - Generate rules from these sets with high confidence
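To make support and confidence concrete, here is a minimal sketch with made-up transactions and a brute-force frequent-itemset pass (illustrative, not an Apriori implementation):

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"}, {"bread", "milk", "diapers", "beer"},
                {"bread", "milk", "diapers"}]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)
min_support, min_confidence = 0.4, 0.7

# Step 1: frequent itemsets with high support (brute force over small itemsets)
frequent = [set(c) for n in (1, 2, 3) for c in combinations(sorted(items), n)
            if support(set(c)) >= min_support]

# Step 2: rules X -> Y with high confidence = support(X u Y) / support(X)
for itemset in (f for f in frequent if len(f) > 1):
    for n in range(1, len(itemset)):
        for lhs in map(set, combinations(sorted(itemset), n)):
            conf = support(itemset) / support(lhs)
            if conf >= min_confidence:
                print(f"{sorted(lhs)} -> {sorted(itemset - lhs)} "
                      f"(support {support(itemset):.2f}, confidence {conf:.2f})")
```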
Objectives for Association Rules

- Want high support and high confidence
  - Maximizing support alone would lead to discovering only a few trivial rules (those that occur very frequently)
  - Maximizing confidence alone leads to obvious rules (those that are 100% accurate)
- Support and confidence are usually treated as constraints (user-specified minimums)
- Still need measures for good rules (i.e., rules that add insight and are hence interesting)
- Significant opportunities for optimizing the rules that are obtained (not much work yet)
Data Mining for OR Applications

- Data mining can be used to complement traditional OR methods in many areas
- Example application areas:
  - E-commerce
  - Supply chain management (e.g., to enable customer-value management in the chain)
  - Production scheduling
Data Mining for Scheduling

- Production scheduling is often ad hoc in practice, relying on the experience and intuition of human schedulers
- Li and Olafsson (2004) propose a method to learn directly from production data
- Benefits:
  - Makes scheduling practices explicit
  - Can be incorporated into an automatic scheduling system
  - Provides insights into operations
  - Can improve schedules
Background

- The scheduling task: given a finite set of jobs, sequence the jobs in order of priority
  - Many simple dispatching rules are available
- Machine learning in scheduling:
  - Considerable work over two decades
  - Expert systems
  - Inductive learning, used to select dispatching rules from simulated data
  - Has not been applied directly to scheduling data (which would be data mining)
Simple Example: Dispatching List
Job ID | Release Time | Start Time | Processing Time | Completion Time
-------|--------------|------------|-----------------|----------------
J5     | 0            | 0          | 17              | 17
J1     | 10           | 17         | 15              | 32
J3     | 18           | 32         | 20              | 52
J4     | 0            | 52         | 7               | 59
J2     | 30           | 59         | 5               | 64

How were these five jobs scheduled?
Longest processing time first (LPT)
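As a quick check (an illustrative sketch, not part of the talk), simulating a dispatcher that always starts the longest job among those already released reproduces exactly this sequence:

```python
jobs = {"J1": (10, 15), "J2": (30, 5), "J3": (18, 20), "J4": (0, 7), "J5": (0, 17)}
# job id -> (release time, processing time)

time, schedule, remaining = 0, [], dict(jobs)
while remaining:
    released = {j: rp for j, rp in remaining.items() if rp[0] <= time}
    if not released:                  # idle until the next release
        time = min(rp[0] for rp in remaining.values())
        continue
    j = max(released, key=lambda j: released[j][1])   # LPT among released jobs
    start = time
    time += remaining[j][1]
    schedule.append((j, start, time))
    del remaining[j]

print(schedule)   # [('J5', 0, 17), ('J1', 17, 32), ('J3', 32, 52), ('J4', 52, 59), ('J2', 59, 64)]
```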
Data Mining Formulation

- Determine the target concept
  - Dispatching rules are a pair-wise comparison
  - Learning task: given two jobs, which job should be dispatched first?
- Data preparation
  - Construct a flat file
  - Each line (instance/data object) is an example of the target concept
Prepared Data File
Job1 | ProcessingTime1 | Release1 | Job2 | ProcessingTime2 | Release2 | Job1ScheduledFirst
-----|-----------------|----------|------|-----------------|----------|-------------------
J1   | 15              | 10       | J2   | 5               | 30       | Yes
J1   | 15              | 10       | J3   | 20              | 18       | Yes
J1   | 15              | 10       | J4   | 7               | 0        | Yes
J1   | 15              | 10       | J5   | 17              | 0        | No
J2   | 5               | 30       | J1   | 15              | 10       | No
J2   | 5               | 30       | J3   | 20              | 18       | No
J2   | 5               | 30       | J4   | 7               | 0        | No
Input Engineering
- Attribute creation (i.e., composite attributes) and attribute selection are an important part of data mining
- Add attributes:
  - ProcessingTimeDifference
  - ReleaseDifference
  - Job1Longer
  - Job1ReleasedFirst
- Select the best subset of attributes
- Apply the C4.5 decision tree algorithm (a sketch of this pipeline follows below)
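A minimal sketch of this pipeline (illustrative only: it builds the pairwise flat file from the five-job example above and uses a CART-style tree from scikit-learn in place of C4.5):

```python
import pandas as pd
from itertools import permutations
from sklearn.tree import DecisionTreeClassifier, export_text

jobs = {"J1": (10, 15), "J2": (30, 5), "J3": (18, 20), "J4": (0, 7), "J5": (0, 17)}
order = ["J5", "J1", "J3", "J4", "J2"]      # observed dispatching sequence

# Build the pairwise flat file with the composite attributes
rows = []
for a, b in permutations(jobs, 2):
    (ra, pa), (rb, pb) = jobs[a], jobs[b]
    rows.append({"ProcessingTimeDifference": pa - pb,
                 "ReleaseDifference": ra - rb,
                 "Job1Longer": int(pa > pb),
                 "Job1ReleasedFirst": int(ra < rb),
                 "Job1ScheduledFirst": int(order.index(a) < order.index(b))})
df = pd.DataFrame(rows)

X, y = df.drop(columns="Job1ScheduledFirst"), df["Job1ScheduledFirst"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```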
Decision Tree
[Decision tree diagram: the root node splits on "Job 1 Longer?"; the subsequent splits are on "Job 1 Released First?" and on "Processing Time Difference" (thresholds 5 and -8); the leaves predict whether Job 1 is scheduled first. Annotations on the tree: LPT is applied among released jobs; do not wait for Job 1 if it is not much longer than Job 2; wait for Job 1 to be released if it is much longer than Job 2.]
Structural Knowledge



- The underlying dispatching rule is LPT
- Mine data generated by this rule, together with the processing time and release time data
- The induced model takes into account:
  - The possible range of processing times
  - The largest delay caused by a job that has not yet been released
- New structural patterns, not explicitly known by the dispatcher, are discovered
- The next step is to improve schedules:
  - Instance selection: learn from best practices
  - Optimize the decision tree
Optimizing Decision Trees
- Decision tree induction is often unstable
- Genetic algorithms have been used to select the best tree from a set of trees
  - Kennedy et al. (1997) encode decision trees and define crossover and mutation operators
  - The accuracy of the tree is the fitness function
  - A series of papers by Fu, Golden, et al. (2003; 2004a; 2004b) builds further on this approach
- Other optimization methods could also apply and other outputs can be optimized
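As a much simplified stand-in for the genetic-algorithm idea (illustrative only: it generates a population of randomized trees and keeps the fittest by held-out accuracy, with no crossover or mutation; dataset and settings are arbitrary):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Population of candidate trees: randomized splitters and varying depth limits
population = [DecisionTreeClassifier(splitter="random", max_depth=d, random_state=s)
              for d in (2, 3, 4, 5) for s in range(25)]

# Fitness = accuracy on held-out data; keep the fittest individual
fitness = [tree.fit(X_tr, y_tr).score(X_val, y_val) for tree in population]
best = population[max(range(len(population)), key=fitness.__getitem__)]
print(best.get_depth(), max(fitness))
```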
Conclusions


- Although optimization work related to data mining dates back to the 1960s, most problems are still open or need more research
- Need to be aware of the key concerns of data mining: extracting meaningful, previously unknown patterns or knowledge from large databases
  - Algorithms should handle massive data sets, that is, be scalable with respect to both time and memory use
  - Results often focus on simple-to-interpret, meaningful patterns that provide structural insights
  - "Previously unknown" means few modeling assumptions that restrict what can be discovered
Open Problems

- Many data mining problems can be formulated as optimization problems
  - We have seen numerous examples, e.g., classification and attribute selection (where most of the work has been done)
  - Many areas have not been addressed or need more work (in particular, clustering and association rule mining)
  - Optimizing model outputs is very promising
- The use of data mining in OR applications has been investigated very little, e.g., in:
  - Supply chain management
  - Logistics and transportation
  - Planning and scheduling
Questions?

- For more information after today:
  - Email me at [email protected]
  - Visit my homepage at http://www.public.iastate.edu/~olafsson
  - Consult Dilbert
Select References

The following surveys on optimization and data mining are available:

1. Padmanabhan, B. and A. Tuzhilin (2003). "On the Use of Optimization for Data Mining: Theoretical Interactions and eCRM Opportunities," Management Science 49: 1327-1343.
2. Bradley, P.S., U.M. Fayyad, and O.L. Mangasarian (1999). "Mathematical Programming for Data Mining: Formulations and Challenges," INFORMS Journal on Computing 11: 217-238.

Work mentioned in the presentation:

3. Abbiw-Jackson, B. Golden, S. Raghavan, and E. Wasil (2004). "A Divide-and-Conquer Local Search Heuristic for Data Visualization," Working Paper, University of Maryland.
4. Boros, E., P.L. Hammer, T. Ibaraki, and A. Kogan (1997). "Logical Analysis of Numerical Data," Mathematical Programming 79: 163-190.
5. Bradley, P.S., O.L. Mangasarian, and W.N. Street (1997). "Clustering via Concave Minimization," in M.C. Mozer, M.I. Jordan, and T. Petsche (eds.), Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA.
6. Bradley, P.S., O.L. Mangasarian, and W.N. Street (1998). "Feature Selection via Mathematical Programming," INFORMS Journal on Computing 10: 209-217.
7. Fu, Z., B. Golden, S. Lele, S. Raghavan, and E. Wasil (2003). "A Genetic Algorithm-Based Approach for Building Accurate Decision Trees," INFORMS Journal on Computing 15: 3-22.
8. Kennedy, H., C. Chinniah, P. Bradbeer, and L. Morss (1997). "The Construction and Evaluation of Decision Trees: A Comparison of Evolutionary and Concept Learning Methods," in D. Corne and J.L. Shapiro (eds.), Evolutionary Computing, Lecture Notes in Computer Science, Springer-Verlag, 147-161.
9. Li, X. and S. Olafsson (2004). "Discovering Dispatching Rules using Data Mining," Journal of Scheduling, to appear.
10. Mangasarian, O.L. (1965). "Linear and Nonlinear Separation of Patterns by Linear Programming," Operations Research 13: 455-461.
11. Olafsson, S. and J. Yang (2004). "Intelligent Partitioning for Feature Selection," INFORMS Journal on Computing, to appear.