The Inductive Software Engineering Manifesto: Principles

The Inductive Software Engineering Manifesto
Principles for Industrial Data Mining
Presentation By: Ebeid Soliman & Mason Schoolfield
Paper Authored By:
Menzies & Kocaguneli – Lane Dept of CS/EE, WVU
Bird, Zimmermann, & Schulte – Microsoft Research
Motivation
• This paper is a reflection on the authors’ applied data mining
work and their discussions with researchers and software
engineering practitioners
• Document methods and experience from industrial
practitioners
• The principal question is: what characterizes the difference
between academic and industrial data mining?
• Motivation: Successful data-mining projects in industry
Inductive Software Engineering
• “A branch of software engineering that focuses on the
delivery of data mining based software applications to
users”
• Understand user goals to inductively generate the
models that most matter to the user
• Industrial practitioners are focused on users, whereas
academic data mining research is focused on algorithms
Industrial Data Mining
7 Principles
• Users before algorithms
• Plan for scale
• Early feedback
• Be open-minded
• Do smart learning
• Live with the data you have
• Broad skill set, big toolkit
Users before algorithms
• Guiding Principle – Users Before
Algorithms
• Mining algorithms are only good if users
fund their use in real-world applications
Users before Algorithms
Hallmarks of good interaction meetings
• Users bring senior management to the meetings
• Users keep interrupting (you or each other) and debating your
results
  • Indicates the users understand your explanation of the results
  • Your results are touching on issues that concern them
• Users begin to offer more data sources for analysis
• Users invite you to their workspace to show how to do part of the
analysis
Plan for scale
Knowledge Discovery in Databases (KDD)
• KDD: the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in data
• Repetition required
[Figure: Steps that compose the KDD process (Fayyad)]
Plan for scale
• Most data mining is data pre-processing
• Gaining access to databases in business groups is
time consuming
• To ensure repeatability, automate as many KDD steps
as possible
• Data mining methods are repeated multiple times
  • Answer user questions
  • Enhance data mining method or fix bugs
  • Deploy to different user groups
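The advice to automate the KDD steps can be illustrated with a minimal sketch. The step names loosely follow the KDD process, but the data, the column names, and every function below are hypothetical and not taken from the paper; the "model" is deliberately trivial.

```python
import csv
import io
from collections import Counter

# Hypothetical raw export; in practice this would come from a business database.
RAW = """module,loc,defects
a.c,120,1
b.c,,0
c.c,900,4
d.c,45,0
"""

def select(text):
    """Selection: parse the raw export into row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def preprocess(rows):
    """Pre-processing: drop rows with missing values."""
    return [r for r in rows if all(v != "" for v in r.values())]

def transform(rows):
    """Transformation: convert numeric fields and derive a class label."""
    return [{"loc": int(r["loc"]), "buggy": int(r["defects"]) > 0} for r in rows]

def mine(rows):
    """Data mining: a deliberately trivial 'model' (class counts)."""
    return Counter(r["buggy"] for r in rows)

def run_pipeline(text, steps=(select, preprocess, transform, mine)):
    """Run every step in order so the whole analysis can be repeated."""
    result = text
    for step in steps:
        result = step(result)
    return result

summary = run_pipeline(RAW)
```

Because the whole chain is one function call, re-running it after a bug fix, a new user question, or a new user group costs nothing extra.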
Plan for scale
• Observed Phases
  • Scout - rapid prototyping, apply many methods to data,
  explore range of hypotheses, gain user interest (get feedback)
  • Survey - experiment to find stable models - focusing on user
  goals
  • Build - integrate models into a deployment framework –
  suitable for target user base
  • Team size doubles after scouting, doubles after surveying –
  time implications!
Early feedback
• Simplicity first: before conducting very elaborate studies, try
applying very simple tools to gain rapid early feedback
• Get feedback early and often
• Discretize continuous attributes (determine what is
ignorable)
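As one example of a very simple tool, the sketch below discretizes a continuous attribute into equal-frequency bins. The slides do not say which discretization method to use, so equal-frequency binning, the function names, and the sample values are all assumptions made for illustration.

```python
def equal_frequency_bins(values, n_bins=3):
    """Return cut points that split the sorted values into n_bins roughly equal groups."""
    ordered = sorted(values)
    return [ordered[i * len(ordered) // n_bins] for i in range(1, n_bins)]

def discretize(value, cuts):
    """Map a number to the index of the bin it falls into."""
    return sum(value >= c for c in cuts)

# Hypothetical attribute: lines of code per module.
loc = [10, 12, 15, 40, 42, 45, 300, 320, 350]
cuts = equal_frequency_bins(loc)
labels = [discretize(v, cuts) for v in loc]
```

Once an attribute is reduced to a handful of bins, it is easy to see whether the class distribution changes from bin to bin; an attribute where it does not change is a candidate to ignore.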
Be open-minded
• Avoid a fixed hypothesis
• Avoid a fixed approach, particularly for data that has not been
mined before
• Initial results are important and can change goals
Smart Learning
• Inductive agents, human or otherwise, make errors
  • Don’t torture the data to meet preconceptions, but it can be ok to
  go “fishing”
  • Important outcomes are riding on your conclusions - check &
  validate!
• Check the variance before concluding; the result may be based on
statistical noise
• Check conclusion stability against different sample sizes
• Check conclusion support to avoid conclusions based on a small
percent of the data
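The three checks above (variance, stability across sample sizes, support) can be sketched as follows. This is an illustration under assumptions, not a procedure from the paper: the helper names, the sample fractions, and the synthetic data are all hypothetical.

```python
import random
import statistics

def stability(data, metric, sizes=(0.25, 0.5, 1.0), repeats=20, seed=1):
    """For each sample fraction, return (mean, stdev) of metric over random subsamples."""
    rng = random.Random(seed)
    report = {}
    for frac in sizes:
        n = max(2, int(len(data) * frac))
        scores = [metric(rng.sample(data, n)) for _ in range(repeats)]
        report[frac] = (statistics.mean(scores), statistics.stdev(scores))
    return report

def support(data, condition):
    """Fraction of rows that a conclusion actually rests on."""
    return sum(1 for row in data if condition(row)) / len(data)

def bug_rate(sample):
    """Metric under study: share of buggy modules in a sample."""
    return sum(buggy for _, buggy in sample) / len(sample)

# Hypothetical data: (lines_of_code, is_buggy) pairs.
data_rng = random.Random(0)
rows = [(loc, loc > 500 and data_rng.random() < 0.8) for loc in range(10, 1010, 10)]

report = stability(rows, bug_rate)
large_support = support(rows, lambda r: r[0] > 500)
```

A conclusion whose mean shifts a lot between sample fractions, whose standard deviation is large relative to the effect, or whose support is a small share of the rows deserves more scrutiny before it is briefed to users.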
Smart Learning
• Prevent spurious conclusions by carefully controlling
data collection and focusing on a small space of
hypotheses (IF YOU CAN)
• Rule learners – RIPPER and INDUCT check against
randomly generated alternatives (if probabilities are the
same you can delete the rule)
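The idea of checking a rule against randomly generated alternatives can be sketched as below. This is not the actual procedure used by RIPPER or INDUCT; it is a simplified stand-in that compares a rule's precision with random selections of the same size, and all names and data are hypothetical.

```python
import random

def precision(rows, rule):
    """Fraction of the rows selected by the rule that belong to the target class."""
    hits = [label for features, label in rows if rule(features)]
    return sum(hits) / len(hits) if hits else 0.0

def beats_random(rows, rule, trials=200, seed=1):
    """Fraction of trials in which the rule out-scores a random selection of equal size."""
    rng = random.Random(seed)
    covered = sum(1 for features, _ in rows if rule(features))
    if covered == 0:
        return 0.0
    labels = [label for _, label in rows]
    score = precision(rows, rule)
    wins = 0
    for _ in range(trials):
        random_hits = rng.sample(labels, covered)
        if score > sum(random_hits) / covered:
            wins += 1
    return wins / trials

# Hypothetical data: large modules are buggy; the module id is irrelevant.
rows = [({"id": i, "loc": 10 * (i + 1)}, 10 * (i + 1) > 500) for i in range(100)]

strong = beats_random(rows, lambda f: f["loc"] > 500)
weak = beats_random(rows, lambda f: f["id"] % 2 == 0)
```

A rule that wins almost every trial is doing real work; a rule that wins only about as often as chance would is a candidate for deletion.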
Live with the data you have
• Collecting data comes at a cost!
• Remove spurious data - conduct instance or feature selection
studies
• Go mining with the data you have, not the data you hope to have
at a later date
• 80 to 90% of rows and all but the square root of columns can be
deleted before compromising performance of the learned model
• Be respectful but doubtful of all user-suggested domain
hypotheses
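The row and column figures above can be turned into a small pruning sketch. The slides do not say how rows or columns should be chosen, so random row sampling and ranking columns by the gap between class means are assumptions made here purely for illustration, as are the function name and the synthetic data.

```python
import math
import random

def prune(rows, labels, row_fraction=0.2, seed=1):
    """Keep a random fraction of rows and the sqrt(n) highest-scoring columns."""
    rng = random.Random(seed)
    n_rows = max(1, round(len(rows) * row_fraction))
    keep_rows = sorted(rng.sample(range(len(rows)), n_rows))
    n_cols = len(rows[0])
    keep_n = max(1, round(math.sqrt(n_cols)))

    def score(col):
        # Assumed ranking: gap between the class means of a column.
        pos = [rows[i][col] for i in keep_rows if labels[i]]
        neg = [rows[i][col] for i in keep_rows if not labels[i]]
        if not pos or not neg:
            return 0.0
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

    ranked = sorted(range(n_cols), key=score, reverse=True)
    keep_cols = sorted(ranked[:keep_n])
    small = [[rows[i][c] for c in keep_cols] for i in keep_rows]
    return small, [labels[i] for i in keep_rows], keep_cols

# Hypothetical data: 100 rows, 9 columns, only column 0 tracks the label.
data_rng = random.Random(0)
labels = [i % 2 == 0 for i in range(100)]
rows = [[10 * lab + data_rng.random()] + [data_rng.random() for _ in range(8)]
        for lab in labels]

small_rows, small_labels, kept = prune(rows, labels)
```

Whether the pruned table still supports a useful model has to be checked by re-running the learner on it and comparing against the full data; the sketch only performs the reduction.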
Broad skill set, big toolkit
• Try multiple inductive technologies
• Inductive Engineers generate novel and insightful
feedback for users
  • Researchers can work to perfect a single algorithm
• Big ecology: Use tools supported by a large ecosystem
of developers who are constantly building new modules
(e.g. R, WEKA, MATLAB)
What does this mean for Industry?
• Implications for Project Management
  • Scouting takes weeks, Surveying takes months, and Building
  takes years
• Implications for Training
  • Communications skills
  • Results briefing
  • Scripting
Research to help Industry
• Research themes to benefit industrial data mining
  • Analysis patterns for inductive engineers (like design patterns
  for developers)
  • Design patterns for data miners
  • Optimizations of learning algorithms
  • Anomaly detectors
  • Business-aware learners
Final Notes
• Conclusion – Be user-focused, keep these principles in
mind
• Hopefully these generalities will be helpful
• Share your experiences and knowledge so that
Industrial Inductive Engineering can mature