The Art and Technology of Data Mining


Data Mining
The Art and Science of
Obtaining Knowledge from Data
Dr. Saed Sayad
4/8/2016
University of Toronto
Agenda
• Explosion of data
• Introduction to data mining
• Examples of data mining in science and engineering
• Challenges and opportunities
Explosion of Data
• Data in the world doubles every 20 months!
• NASA’s Earth Observing System: 46 megabytes of data per second (4,000,000,000,000 bytes a day)
• FBI fingerprint image library: 200,000,000,000,000 bytes
• In-line image analysis for particle detection: 1 megabyte per second
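The per-second and per-day figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes 1 megabyte = 10^6 bytes and a 24-hour day of continuous acquisition; the function name is illustrative only.

```python
# Back-of-the-envelope check of the data rates quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def bytes_per_day(megabytes_per_second):
    """Convert a rate in MB/s to bytes per day, assuming 1 MB = 10**6 bytes."""
    return megabytes_per_second * 1_000_000 * SECONDS_PER_DAY

nasa_daily = bytes_per_day(46)    # NASA rate from the slide
inline_daily = bytes_per_day(1)   # in-line image analysis rate from the slide

print(f"NASA: {nasa_daily:.3e} bytes/day")
print(f"In-line imaging: {inline_daily:.3e} bytes/day")
```

Under these assumptions the NASA rate comes to roughly 4 x 10^12 bytes a day, which agrees with the figure on the slide.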
What do we need?
Fast, accurate, and scalable data analysis techniques to extract useful knowledge.
The answer is Data Mining.
What is Data Mining?
“Data Mining is the exploration and
analysis of large or small quantities of
data in order to discover meaningful
patterns, trends and rules.”
Data → Data Mining → Knowledge
Data Mining
[Diagram: Data Mining combines Data Analysis (Statistics; AI and Machine Learning) with Database technology (Database, Data Warehouse, OLAP).]
Database

             Text Files       Relational Database              Multidimensional Database
Entities     File             Table                            Cube
Attributes   Row and Column   Record, Field, Index             Dimension, Level, Measurement
Methods      Read, Write      Select, Insert, Update, Delete   Drill down, Drill up, Drill through
Language     -                SQL                              MDX
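As a rough illustration of the relational column (Table; Select, Insert, Update, Delete; SQL), the sketch below runs the four methods against an in-memory SQLite database using Python's standard `sqlite3` module. The table name, columns, and values are hypothetical and not taken from the talk.

```python
import sqlite3

# In-memory database; the table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, age INTEGER, income REAL)")

# Insert: add records (rows) to the table.
conn.executemany(
    "INSERT INTO customer (id, age, income) VALUES (?, ?, ?)",
    [(1, 34, 52000.0), (2, 51, 87000.0), (3, 27, 39000.0)],
)

# Update: change one field of an existing record.
conn.execute("UPDATE customer SET income = ? WHERE id = ?", (55000.0, 1))

# Delete: remove a record.
conn.execute("DELETE FROM customer WHERE id = ?", (3,))

# Select: read back the remaining records.
rows = conn.execute("SELECT id, age, income FROM customer ORDER BY id").fetchall()
print(rows)
conn.close()
```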
Data Analysis
• Classification
• Regression
• Clustering
• Association
• Sequence Analysis
Data Analysis
Input Variables or Attributes:
• X1: Numeric (age, income, …)
• X2: Categorical (gender, occupation, …)
Model (W1, W2): Linear Models or Decision Trees
Output Variables or Targets:
• Y1: Numeric → Regression
• Y2: Categorical, e.g. (0,1) or (good, bad) → Classification
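The distinction between a numeric target (regression) and a categorical target (classification) can be sketched with one input variable. The example below is a minimal illustration on hypothetical data: a least-squares line stands in for "Linear Models" and a single-split tree (a decision stump) stands in for "Decision Trees". The stump assumes exactly two class labels.

```python
# Hypothetical toy data: one numeric input with a numeric and a categorical target.
ages = [25, 32, 41, 50, 58]
incomes = [30.0, 38.0, 47.0, 56.0, 64.0]         # numeric target -> regression
credit = ["bad", "bad", "good", "good", "good"]  # categorical target -> classification

def fit_line(xs, ys):
    """Ordinary least squares for y = w0 + w1 * x (a one-input linear model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1

def fit_stump(xs, labels):
    """One-split decision tree: choose the threshold with the fewest training errors."""
    a, b = sorted(set(labels))  # assumes exactly two classes
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below, above in ((a, b), (b, a)):
            errors = sum(
                (below if x < threshold else above) != label
                for x, label in zip(xs, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]  # (threshold, label below, label at or above)

w0, w1 = fit_line(ages, incomes)
stump = fit_stump(ages, credit)
print(f"regression: income = {w0:.2f} + {w1:.2f} * age")
print(f"classification: {stump[1]} if age < {stump[0]} else {stump[2]}")
```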
Data Analysis (cont.)
Clustering
[Scatter plot: Income vs. Age]
Association
1, chips, coke, chocolate
2, gum, chips
3, chips, coke
4, …
Probability (chips, coke) ?
Sequence Analysis
…ATCTTTAAGGGACTAAAATGCCATAAAAATCCATGGGAGAGACCCAAAAAA…
Xt-1 → T → Xt
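One way to answer the "Probability (chips, coke)?" question is to count how often the two items occur together, which association analysis calls support. The sketch below uses only the three transactions listed on the slide (the fourth is elided there and is left out here); the function names are illustrative.

```python
# The three transactions listed on the slide; the elided fourth one is omitted.
transactions = [
    {"chips", "coke", "chocolate"},
    {"gum", "chips"},
    {"chips", "coke"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

p_chips_coke = support({"chips", "coke"}, transactions)
coke_to_chips = confidence({"coke"}, {"chips"}, transactions)
print(p_chips_coke)   # 2 of the 3 listed baskets contain both items
print(coke_to_chips)  # every listed basket with coke also contains chips
```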
Data Mining in Research Life Cycle
[Diagram: research life cycle connecting Questions and Needs, Library Search, Experiment, Research Data, Database, Data Analysis, Modeling, and Report]
Data Mining – Modeling Steps
1. Problem Definition
2. Data Preparation
3. Exploration
4. Modeling
5. Evaluation
6. Deployment
Agenda
• Explosion of data
• Introduction to data mining
• Examples of data mining in science and engineering
• Challenges and opportunities
Examples of data mining in science & engineering
1. Data mining in Biomedical Engineering
“Robotic Arm Control Using Data Mining Techniques”
2. Data mining in Chemical Engineering
“Data Mining for In-line Image Monitoring of
Extrusion Processing”
1. Problem Definition
“Control a robotic arm by means of EMG signals from biceps and triceps muscles.”
[Table: biceps and triceps contraction levels (H/L) for Supination, Pronation, Flexion (H, L), and Extension (L, H)]
[Images: Supination, Pronation, Flexion, Extension]
2. Data Preparation
• The dataset includes 80 records.
• There are two input variables: biceps signal and triceps signal.
• There is one output variable with four possible values: Supination, Pronation, Flexion, and Extension.
3. Exploration
[Scatter plot: triceps signal vs. record number, grouped by Flexion, Extension, Supination, and Pronation]
3. Exploration (cont.)
[Scatter plot: biceps signal vs. record number, grouped by Flexion, Extension, Supination, and Pronation]
4. Modeling
Classification:
• OneR
• Decision Tree
• Naïve Bayesian
• K-Nearest Neighbors
• Neural Networks
• Linear Discriminant Analysis
• Support Vector Machines
…
6. Model Deployment
A neural network model was successfully
implemented inside the robotic arm.
Examples of data mining in science & engineering
1. Data mining in Biomedical Engineering
“Robotic Arm Control Using Data Mining Techniques”
2. Data mining in Chemical Engineering
“Data Mining for In-line Image Monitoring of Extrusion
Processing”
Plastics Extrusion
[Diagram labels: Plastic pellets, Plastic melt]
Film Extrusion
[Diagram labels: Extruder, Plastic Film, Defect due to particle contaminant]
In-Line Monitoring
[Diagram labels: Transition Piece, Window Ports]
In-Line Monitoring
[Diagram labels: Optical Assembly, Light, Light Source, Extruder and Interface, Imaging Computer]
Melt Without Contaminant Particles (WO)
Melt With Contaminant Particles (WP)
1. Problem Definition
Classify images into those with particles (WP)
and those without particles (WO).
[Example images: WO, WP]
2. Data Preparation
• 2000 images
• 54 input variables, all numeric
• One output variable with two possible values:
  - With Particle
  - Without Particle
2. Data Preparation (cont.)
• Pre-processed images to remove noise
• Dataset 1 with sharp images: 1350 images, including 1257 without particles and 91 with particles
• Dataset 2 with sharp and blurry images: 2000 images, including 1909 without particles or with blurry particles and 91 with particles
• 54 input variables, all numeric
• One output variable with two possible values (WP and WO)
3. Exploration
Demo!
4. Modeling
Classification:
• OneR
• Decision Tree
• 3-Nearest Neighbors
• Naïve Bayesian
5. Evaluation
10-fold cross-validation

Dataset                 Attributes  Classes  One-R  C4.5  3-NN  Bayes
Sharp Images            54          2        99.9   99.8  99.8  95.8
Sharp + Blurry Images   54          2        98.5   97.8  97.8  93.3
Sharp + Blurry Images   54          3        87     87    84    79
If pixel_density_max < 142 then WP
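The rule above has the shape of a single-attribute threshold classifier, and 10-fold cross-validation scores such a rule on data it was not fitted to. The sketch below shows the mechanics under stated assumptions: the data are hypothetical rows generated for illustration (not the image features from the talk), `fit_threshold` is a simplified stand-in for a OneR-style learner on one numeric attribute, and all function names are illustrative.

```python
import random

def predict(pixel_density_max, threshold=142):
    """Rule from the slide: a value below the threshold is labelled WP, otherwise WO."""
    return "WP" if pixel_density_max < threshold else "WO"

def fit_threshold(rows):
    """Choose the cut point that misclassifies the fewest training rows."""
    values = sorted({x for x, _ in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: sum(predict(x, t) != y for x, y in rows))

def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical (pixel_density_max, label) rows; not the talk's image features.
data = [(v, predict(v)) for v in range(100, 190, 3)]

accuracies = []
for train, test in k_fold_indices(len(data), k=10):
    threshold = fit_threshold([data[i] for i in train])  # fit on 9 folds
    correct = sum(predict(data[i][0], threshold) == data[i][1] for i in test)
    accuracies.append(correct / len(test))               # score on the held-out fold

mean_accuracy = sum(accuracies) / len(accuracies)
print(f"mean accuracy over {len(accuracies)} folds: {mean_accuracy:.3f}")
```

Averaging the per-fold scores gives a single figure of the kind reported in the table, although the numbers produced here say nothing about the real image data.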
6. Deploy model
• A Visual Basic program will be developed to implement the model.
Agenda
• Explosion of data
• Introduction to data mining
• Examples of data mining in science & engineering
• Challenges and opportunities
Challenges and Opportunities
• Data mining is a ‘top ten’ emerging technology.
• High-paying jobs in the financial, medical, and engineering sectors.
• Faster, more accurate, and more scalable techniques.
• Incremental, on-line, and real-time learning algorithms.
• Parallel and distributed data processing techniques.
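As a small illustration of what "incremental" means in the list above, the sketch below updates a mean and variance one value at a time using Welford's algorithm, so the full dataset never has to be held in memory. It is a generic example, not a method described in the talk; the class name and sample values are illustrative.

```python
class RunningStats:
    """Welford's one-pass update of count, mean, and sum of squared deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; returned as 0.0 when fewer than two values were seen."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # arbitrary sample stream
    stats.update(value)
print(stats.mean, stats.variance)
```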
Data mining is an exciting and
challenging field with the ability to
solve many complex scientific and
business problems.
You can be part of the solution!