Data Mining - Computer Science and Engineering
DATA MINING TECHNIQUES
Introductory and Advanced Topics
Eamonn Keogh
(some slides adapted from) Margaret Dunham
M. H. Dunham, Data Mining: Introductory and Advanced Topics,
Prentice Hall, 2002.
http://iubio.bio.indiana.edu/treeapp/treeprint-sample1.html
© Prentice Hall
Data Mining Outline
– Introduction
– Related Concepts
– Data Mining Techniques
Introduction Outline
Goal: Provide an overview of data mining.
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining issues
Introduction
Data is growing at a phenomenal rate (read “How
Much Information Is There in the World?” by Michael Lesk)
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining Definition
Finding hidden information in a database
Data Mining has been defined as
“The nontrivial extraction of implicit, previously
unknown, and potentially useful information
from data”.
Similar terms
– Exploratory data analysis
– Data driven discovery
– Deductive learning
– Discovery Science
– Knowledge Discovery
Database Processing vs. Data Mining Processing
Database
– Query: well defined (SQL)
– Output: subset of database
Data Mining
– Query: poorly defined, no precise query language
– Output: not a subset of database
Query Examples
Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk
Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(Clustering)
– Find all items which are frequently purchased
with milk. (association rules)
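The contrast between the two kinds of query can be sketched in Python. The transaction data below is invented for illustration. The first query is an exact lookup whose answer is a subset of the data; the second counts co-occurrences and returns derived information that is not stored in any single record.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "cereal", "bread"},
    {"bread", "eggs"},
]

# Database-style query: well defined, and the output is a subset of the data.
milk_transactions = [t for t in transactions if "milk" in t]

# Data-mining-style query: which items are frequently purchased with milk?
# The output (item -> count) is derived information, not a subset of the data.
bought_with_milk = Counter(
    item for t in milk_transactions for item in t if item != "milk"
)

print(len(milk_transactions))
print(bought_with_milk.most_common(1))
```

A real association-rule miner would also apply thresholds to decide what counts as "frequently"; this sketch only ranks the raw counts.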
Data Mining Models and Tasks
Basic Data Mining Tasks I
Classification maps data into predefined
groups or classes
– Supervised learning
– Pattern recognition
– Prediction
Regression is used to map a data item to a
real-valued prediction variable.
– Example: H = 1.31 (Fem + Fib) + 63.05
Clustering groups similar data together into
clusters.
– Unsupervised learning
– Segmentation
– Partitioning
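As a small illustration of regression as a mapping from a data item to a real-valued prediction, the linear formula H = 1.31 (Fem + Fib) + 63.05 can be evaluated directly. The slide does not define H, Fem, or Fib or give units, so the function name, parameter names, and sample inputs below are assumptions.

```python
def predict_h(fem: float, fib: float) -> float:
    """Evaluate H = 1.31 (Fem + Fib) + 63.05 for one data item."""
    return 1.31 * (fem + fib) + 63.05


# Illustrative inputs only; they are not taken from the source.
print(predict_h(50.0, 40.0))
```

In practice the coefficients 1.31 and 63.05 would be fitted from training data rather than written by hand; here they are simply copied from the slide.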
Basic Data Mining Tasks II
Summarization maps data into subsets with
associated simple descriptions.
– Characterization
– Generalization
Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.
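One way to make association rules concrete is to score a candidate rule with support and confidence, two commonly used measures that this slide does not define. The sketch below assumes transactions are Python sets; the data and function names are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of both sides together divided by support of the antecedent.

    Raises ZeroDivisionError if the antecedent never occurs.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
]

print(support(transactions, {"milk", "bread"}))
print(confidence(transactions, {"milk"}, {"bread"}))
```

Here the rule "milk implies bread" holds in two of the three transactions that contain milk, so its confidence is two thirds.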
KDD Process
Modified from [FPSS96C]
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format.
Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results
to user in meaningful manner.
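The five steps can be read as a pipeline in which each stage consumes the output of the previous one. The following is a minimal sketch under stated assumptions: the record layout, the function names, and the "above the mean" mining step are all invented stand-ins, not part of the KDD definition.

```python
# Hypothetical raw records from two sources; all names and values are illustrative.
source_a = [{"id": 1, "amount": "12.5"}, {"id": 2, "amount": None}]
source_b = [{"id": 3, "amount": "7.5"}, {"id": 4, "amount": "40"}]


def select(*sources):
    """Selection: obtain data from various sources."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Preprocessing: cleanse data (here, drop records with a missing amount)."""
    return [r for r in records if r["amount"] is not None]


def transform(records):
    """Transformation: convert to a common format (amounts as floats)."""
    return [float(r["amount"]) for r in records]


def mine(values):
    """Data mining: a stand-in step that keeps values above the mean."""
    mean = sum(values) / len(values)
    return [v for v in values if v > mean]


def interpret(results):
    """Interpretation/evaluation: present results in a readable form."""
    return f"{len(results)} value(s) above the mean: {results}"


report = interpret(mine(transform(preprocess(select(source_a, source_b)))))
print(report)
```

Keeping each step as its own function makes it easy to swap in a different cleansing rule or mining algorithm without touching the rest of the pipeline.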
KDD Process Ex: Shuttle Data
Selection:
– Select data (which missions, etc.) to use
Preprocessing:
– Remove Spikes
Transformation:
– DFT, DWT, PAA, etc.
Data Mining:
– Look for Rules…
Interpretation/Evaluation:
– Show rules to domain experts
Potential User Applications:
– Prediction of Failures
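The preprocessing and transformation steps of the shuttle example can be sketched on a toy time series. The slide does not say how spikes are removed, so the sliding-median filter below is an assumption. PAA (Piecewise Aggregate Approximation) is shown as the transformation because it is the simplest of the three listed; DFT and DWT are not covered. The sample signal and function names are invented.

```python
from statistics import mean, median


def remove_spikes(series, window=3):
    """Replace each point by the median of a centered window.

    Points near the edges use a shorter window.
    """
    half = window // 2
    return [
        median(series[max(0, i - half): i + half + 1])
        for i in range(len(series))
    ]


def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by segments")
    size = len(series) // segments
    return [mean(series[i:i + size]) for i in range(0, len(series), size)]


# Toy signal; the value 0.9 is an artificial spike.
signal = [0.1, 0.1, 0.9, 0.1, 0.2, 0.2, 0.3, 0.3]
cleaned = remove_spikes(signal)
reduced = paa(cleaned, segments=4)

print(cleaned)
print(reduced)
```

PAA reduces the eight cleaned points to four segment means, which is the kind of lower-dimensional representation a rule-finding step could then work on.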
Data Mining Development
• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines
• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis
• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures
• Neural Networks
• Decision Tree Algorithms
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (cont’d)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data (streams)
Integration
Application
Social Implications of DM
Privacy
Profiling
Unauthorized use
Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time Complexity