What is Data Mining?

Data Mining and
Knowledge Discovery
By Matt Goliber and
Jim Hougas
What is Data Mining?
• Not like gold or diamond mining
• Mining of knowledge from data
• Important to many different fields
• A Part of Knowledge Discovery in Databases (KDD)
The Process of Knowledge Discovery
• Raw data
• Data cleaning and integration → Data Warehouse
• Data transformation, selection, and mining → Patterns
• Pattern evaluation and knowledge presentation → KNOWLEDGE!
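As a rough illustration of the stages above, the following minimal sketch runs a toy dataset through cleaning, selection, mining, and evaluation. The record layout, the function names, and the `min_count` threshold are all assumptions made for illustration; none of them come from the presentation.

```python
from collections import Counter

# Hypothetical raw records: (customer, item); None marks a missing value.
RAW = [
    ("ann", "bagel"), ("ann", "cream cheese"), ("bob", "bagel"),
    ("bob", None), ("cy", "bagel"), ("cy", "cream cheese"),
    ("ann", "bagel"),  # exact duplicate of the first row
]

def clean(rows):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, out = set(), []
    for row in rows:
        if None in row or row in seen:
            continue
        seen.add(row)
        out.append(row)
    return out

def select(rows):
    """Selection and transformation: keep only the item column."""
    return [item for _, item in rows]

def mine(items):
    """Mining: count how often each item occurs."""
    return Counter(items)

def evaluate(counts, min_count=2):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {item: n for item, n in counts.items() if n >= min_count}

knowledge = evaluate(mine(select(clean(RAW))))
print(knowledge)
```

Each function stands in for one arrow of the process; a real system would replace the in-memory list with a data warehouse and the counter with an actual mining algorithm.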
Why is Data Mining useful?
• We are data rich but information poor
-Internet
-Intelligence
• Humans often lack the ability to comprehend and manage the
immense amount of available and sometimes seemingly unrelated
data
How long has this idea been around?
• Late 1960s and early 1970s
• Stanford’s Meta-DENDRAL (1970-76)
-Extension of DENDRAL
• Doug Lenat with AM (1976)
Meta-DENDRAL
• Extension of the DENDRAL (1965) program
-One of the first expert systems
-Interpreted mass spectra
• Meta-DENDRAL took the mass spectra of compounds of known 3D structure and formulated rules for interpreting the spectra
• Came up with known rules and some new ones!
Sample Mass Spec
ethyl 3-oxo-3-phenylpropanoate (ethyl benzoylacetate)
AM
• Doug Lenat, 1976
• The name means nothing; it stands alone
• AM was given sets, bags, ordered sets, and lists
• AM was also given operations to perform on these data sets
-Union, Intersection, etc.
• Came up with ideas about counting, addition, multiplication, prime
numbers, and Goldbach’s conjecture
• AM thought that these were all uninteresting
• Liked maximally divisible numbers though…
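The "maximally divisible numbers" that AM favoured are usually read as numbers with more divisors than any smaller number. The brute-force sketch below lists the first few under that reading. The function names are ours, and this is only an illustration of the concept, not a reconstruction of how AM worked.

```python
def divisor_count(n):
    """Count the positive divisors of n by trial division up to sqrt(n)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # d and n // d are both divisors; count a square root only once.
            count += 1 if d * d == n else 2
        d += 1
    return count

def maximally_divisible(limit):
    """Return numbers <= limit with more divisors than every smaller number."""
    best = 0
    result = []
    for n in range(1, limit + 1):
        c = divisor_count(n)
        if c > best:
            best = c
            result.append(n)
    return result

print(maximally_divisible(100))  # [1, 2, 4, 6, 12, 24, 36, 48, 60]
```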
What next?
• Not a whole lot…
• Databases were not prevalent enough, no great demand
• Did benefit from machine learning research
• Beginning of the 1990s, “The next area…”
-Ranked as one of the most promising research areas (NSF)
-Information explosion
• Early commercial systems
-Farm Journal
-GM
Next Generation Techniques
• Decision Trees
– Each branch is a classification question
– Allows businesses to segment customers, products, and sales regions
– Questions organize the data
• Rule Induction
– All patterns are pulled from the data
– Accuracy and significance are then attached to each pattern
– These help the user know how strong a pattern is and how likely it is to occur again
– Ex: If bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets
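Both techniques can be sketched in a few lines. In the hedged example below, `segment` is a hand-written decision tree in which every branch is one classification question, and `rule_stats` computes the two numbers attached to a rule such as the bagel example. Accuracy is read here as the share of bagel baskets that also contain cream cheese, and significance as the share of all baskets containing both items; that second reading is an assumption, since some texts measure it by the "if" part alone. The customer fields, the thresholds, and the basket data are all invented for illustration.

```python
def segment(customer):
    """Hand-written decision tree: each branch is one classification question."""
    if customer["annual_spend"] >= 1000:    # question 1 (hypothetical threshold)
        if customer["region"] == "urban":   # question 2 (hypothetical field)
            return "high-value urban"
        return "high-value rural"
    return "occasional"

def rule_stats(baskets, antecedent, consequent):
    """Return (accuracy, significance) for the rule antecedent -> consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    accuracy = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    significance = len(with_both) / len(baskets) if baskets else 0.0
    return accuracy, significance

# Invented shopping baskets, not data from the presentation.
baskets = [
    {"bagel", "cream cheese"},
    {"bagel", "cream cheese", "coffee"},
    {"bagel"},
    {"milk"},
    {"coffee", "cream cheese"},
]

print(segment({"annual_spend": 1500, "region": "urban"}))
print(rule_stats(baskets, "bagel", "cream cheese"))
```

With these five baskets the rule holds in two of the three bagel baskets and appears in two of the five baskets overall, so the toy numbers are far from the 90% and 3% quoted in the example.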
Decision Trees vs. Rule Induction
• Decision Trees
– Always one and only one rule covers an instance
• Rule Induction
– Many rules may cover the same instance, or
– no rule may cover an instance
• Example
– With correlated predictors such as height and shoe size, a decision tree tends to use one or the other to determine the size of a person
– Rule induction can use both
Examples of Significant Developments
• Stock Market Advances (1991)
– Astrophysicists Doyne Farmer and Norman Packard
– Their firm, Prediction Company, built models to predict stock market trends
• Bell Atlantic (1996)
– Consumer phone buying trends
– Rule Induction
• Advanced Scout (1997)
– Inderpal Bhandari assists NBA coaches
– Rule Induction
• Persuade 400,000 undecided voters (2004)
– MoveOn attempts to influence the election
– Decision Tree
Challenges
• Large Data Sets with High Complexity
- One or the other is currently possible, but not both
• Expensive
- Costs of Bell Atlantic (Experts are needed)
- Cost for a two-day course in Las Vegas ($1,300)
- Software ($100,000)
Research
• DARPA
– Defense Advanced Research Projects Agency
– ACLU claims this is an invasion of privacy
– Decision Tree
• Uncovering Terrorists in public chat rooms
– Tracks the times that messages are sent
• Advanced Scout
– Bhandari is working on Advanced Scout for the NHL
– Rule Induction
Current State
• Out of the Lab
– Into Fortune 500 companies
• Automate Model Scoring
– For now, fingers are crossed in the hope that scoring by IT personnel is done correctly
Future States
• Utilizing Company Warehouses
– Data miners must take advantage of a million-dollar warehouse that a company builds
• Effort Knob
– Low for quick model, high for quality model
• Computed Target Columns
– User could create a new target variable
– Ex: finance information that a business has
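One way to picture a computed target column is as a new field derived from columns the business already stores. The sketch below is purely hypothetical: the finance records, the column names, and the `add_target` helper are assumptions made for illustration and do not describe any particular product.

```python
# Hypothetical finance records; the column names are assumptions.
rows = [
    {"customer": "a", "revenue": 1200.0, "cost": 900.0},
    {"customer": "b", "revenue": 400.0, "cost": 650.0},
]

def add_target(rows, name, compute):
    """Return copies of rows with a new target column computed per row."""
    return [{**row, name: compute(row)} for row in rows]

# The user defines the target: is this customer profitable?
labelled = add_target(rows, "profitable", lambda r: r["revenue"] > r["cost"])
print([r["profitable"] for r in labelled])  # [True, False]
```

A model could then be trained to predict `profitable` instead of a column that was stored in the warehouse from the start.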
Sources
http://web.media.mit.edu/~haase/thesis/node54.html#SECTION00711000000000000000
http://smi-web.stanford.edu/projects/history.html#METADENDRAL
http://www.cs.cf.ac.uk/Dave/AI2/node151.html
http://64.233.161.104/search?q=cache:Q6eMD9tEKwIJ:www.cosc.brocku.ca/Offerings/4P79/Week12.ppt+meta-dendral&hl=en
http://laurel.actlab.utexas.edu/~cynbe/muq/muf3_21.html
http://64.233.161.104/search?q=cache:yft0cQ5tZJQJ:www.cs.uwaterloo.ca/~shallit/Talks/cct.ps+%22fundamental+theorem+of+arithmetic%22+computer+data+mining+prove&hl=en
http://mathworld.wolfram.com/GoldbachConjecture.html
http://www.quantlet.com/mdstat/scripts/csa/html/node202.html
http://www.thearling.com
http://www.wired.com
http://www.dmreview.com
http://www.ebscohost.com
http://www.thearling.com/text/dmtechniques/dmtechniques.htm
http://www.aaai.org/Library/Magazine/Vol13/13-03/vol13-03.html
Data Mining: Concepts and Techniques. Han J. and Kamber M.