Tutorial on High Performance Data Mining
Download
Report
Transcript Tutorial on High Performance Data Mining
Data Mining
Dr. Mohsen Kahani
Email: [email protected]
http://www.um.ac.ir/~kahani/
Overview
Introduction
Data Mining Functions and Models
Data Mining Methodologies
Data Mining Case Studies
Final Remarks
Motivation:
“Necessity is the Mother of Invention”
Data explosion problem:
Automated data collection tools and mature
database technology lead to tremendous
amounts of data stored in databases, data
warehouses and other information
repositories
We are drowning in data, but starving for
knowledge!
Data pyramid
Wisdom
Knowledge + experience
Knowledge
Information + rules
Information
Data + context
Data
Related Fields
Machine
Learning
Visualization
Data Mining and
Knowledge Discovery
Statistics
Databases
Knowledge Discovery Process
Integration
Interpretation
& Evaluation
Knowledge
Knowledge
__ __ __
__ __ __
__ __ __
DATA
Ware
house
Transformed
Data
Target
Data
Patterns
and
Rules
Understanding
Raw
Dat
a
Data Mining and Business Intelligence
Increasing potential
to support
business decisions
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
End User
Business
Analyst
Data
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP
DBA
Definition of Data Mining
“…The non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]
The Evolution of Data Analysis
Evolutionary Step Business Question Enabling
Technologies
Product Providers Characteristics
Data Collection
(1960s)
"What was my total Computers, tapes,
revenue in the last disks
five years?"
IBM, CDC
Retrospective,
static data delivery
Data Access
(1980s)
"What were unit
sales in New
England last
March?"
Relational
databases
(RDBMS),
Structured Query
Language (SQL),
ODBC
Oracle, Sybase,
Informix, IBM,
Microsoft
Retrospective,
dynamic data
delivery at record
level
Data Warehousing
& Decision
Support
(1990s)
"What were unit
sales in New
England last
March? Drill down
to Boston."
On-line analytic
processing
(OLAP),
multidimensional
databases, data
warehouses
SPSS, Comshare, Retrospective,
Arbor, Cognos,
dynamic data
Microstrategy,NCR delivery at multiple
levels
Data Mining
(Emerging Today)
"What’s likely to
happen to Boston
unit sales next
month? Why?"
Advanced
algorithms,
multiprocessor
computers, massive
databases
SPSS/Clementine,
Lockheed, IBM,
SGI, SAS, NCR,
Oracle, numerous
startups
Prospective,
proactive
information
delivery
Need for Data Mining
Data accumulate and double every 9 months
There is a big gap from stored data to knowledge; and
the transition won’t occur automatically.
Manual data analysis is not new but a bottleneck
Fast developing Computer Science and Engineering
generates new demands
Seeking knowledge from massive data
Any personal experience?
When is DM useful
Data rich world
Large data (dimensionality and size)
Image data (size)
Gene chip data (dimensionality)
Little knowledge about data (exploratory
data analysis)
What if we have some knowledge?
Challenges
Increasing data dimensionality and data size
Various data forms
New data types
Streaming data, multimedia data
Efficient search and access to data/knowledge
Intelligent update and integration
Data Mining Survey
Industry Pioneers
23%
19%
17%
13%
12%
Manufacturing
Financial Serv.
Tele/Data communication
Media
Retail/Wholesaler
Objectives
21.4% Understanding Customer Segments and
Preferences,
19,5% Identifying Profitable Customers and Acquiring New
ones,
14,1% Increasing Revenue From Customers.
World Data Mining Survey, 6 August, 2002.
Results of Data Mining
Include:
Forecasting what may happen in the
future
Classifying people or things into groups by
recognizing patterns
Clustering people or things into groups
based on their attributes
Associating what events are likely to occur
together
Sequencing what events are likely to lead
to later events
Data Mining versus OLAP
OLAP - On-line
Analytical Processing
Provides you with a
very good view of what
is happening, but can
not predict what will
happen in the future or
why it is happening
Data Mining Versus Statistical Analysis
Data Mining
Originally developed to act
as expert systems to solve
problems
Less interested in the
mechanics of the technique
If it makes sense then let’s
use it
Does not require
assumptions to be made
about data
Can find patterns in very
large amounts of data
Requires understanding of
data and business problem
Data Analysis
Tests for statistical
correctness of models
Are statistical assumptions of
models correct?
• Eg Is the R-Square good?
Hypothesis testing
Is the relationship
significant?
• Use a t-test to validate
significance
Tends to rely on sampling
Techniques are not optimised
for large amounts of data
Requires strong statistical
skills
Data Mining Taxonomy
Predictive Method
- …predict the value of a particular
attribute…
Descriptive Method
- …foundation of human-interpretable
patterns that describe the data…
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Deviation Detection [Predictive]
Data Mining Tasks:
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
Classification: Linear
Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes
wi from data to
minimize squared error
to ‘fit’ the data
Not flexible enough
Classification: Decision
Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
Y
3
2
5
X
Decision Trees
-a way of representing a series of rules that
lead to a class or value;
-basic components of a decision tree:
decision node, branches and leaves;
Income>40,000
No
Yes
Job>5
Yes
Low Risk
High Debt
No
High Risk
Yes
High Risk
No
Low Risk
Decision Trees (cont.)
- handle very well non-numeric data;
- work best when the predictor variables
are categorical;
Example Decision Tree
Splitting Attributes
10
Tid Refund Marital
Status
Taxable
Income Cheat
1
125K
No
Yes
Single
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Refund
Yes
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
The splitting attribute at a node is
determined based on the Gini index.
Classification: Neural Networks
- efficiently model large and complex problems;
- may be used in classification problems or for
regressions;
- Starts with input layer => hidden layer =>
output layer
3
1
4
6
2
Inputs
5
Hidden Layer
Output
Neural Networks (cont.)
- can be easily implemented to run on
massively parallel computers;
- can not be easily interpret;
- require an extensive amount of training
time;
- require a lot of data preparation (involve
very careful data cleansing, selection,
preparation, and pre-processing);
- require sufficiently large data set and high
signal-to noise ratio.
Kohonen Network
Description
unsupervised
seeks to
describe
dataset in
terms of
natural
clusters of
cases
Classification Example
Tid Refund Marital
Status
Taxable
Income Cheat
Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
No
Single
75K
?
2
No
Married
100K
No
Yes
Married
50K
?
3
No
Single
70K
No
No
Married
150K
?
4
Yes
Married
120K
No
Yes
Divorced 90K
?
5
No
Divorced 95K
Yes
No
Single
40K
?
6
No
Married
No
No
Married
80K
?
60K
10
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
10
No
Single
90K
Yes
Training
Set
Learn
Classifier
Test
Set
Model
Classification Application
Direct Marketing
Fraud Detection
Customer Attrition/Churn
Sky Survey Cataloging
Data Mining Tasks:
Clustering
Goal is to identify categories
Natural grouping of customers
by processing all the available
data about them.
Other applications
market segmentation, discovering
affinity groups, and defect
analysis
Data Mining Tasks:
Association Rule Discovery
Given a set of records each of which contain
some number of items from a given collection;
Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.
TID
Items
1
2
3
4
5
Bread, Coke, Milk
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rule
Discovery Application
Marketing and Sales Promotion
Supermarket Shelf Management
Inventory Management
Deviation Detection & Pattern
Discovery
Deviation Detection:
…discovering most significant changes in data from
previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Pattern Discovery:
…process of looking for patterns and rules that
predict strong sequential dependencies among
different events…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Patterns
Identify frequently occurring sequences
from given records
40 percent of female customers buy a
gray skirt six months after buying a red
jacket
Data Mining Methodology:
SAS
Sample
Extract a portion of the dataset for data mining
Explore
Modify
create, select and transform variables with the intention of
building a model
Model
Specify a relationship of variables that reliably predicts a
desired goal
Assess
Evaluate the practical value of the findings and the model
resulting from the data mining effort
Data Mining Methodology:
CRISP-DM
Data understanding
Data preparation
Modeling
Evaluation
Deployment
CRISP-DM Phases
Phases and Tasks
Business
Understanding
Determine
Business Objectives
Background
Business Objectives
Business Success
Criteria
Situation Assessment
Inventory of Resources
Requirements,
Assumptions, and
Constraints
Risks and Contingencies
Terminology
Costs and Benefits
Determine
Data Mining Goal
Data Mining Goals
Data Mining Success
Criteria
Produce Project Plan
Project Plan
Initial Asessment of
Tools and Techniques
Data
Understanding
Collect Initial Data
Initial Data Collection
Report
Data
Preparation
Data Set
Data Set Description
Select Data
Data Description Report
Rationale for Inclusion /
Exclusion
Explore Data
Clean Data
Describe Data
Data Exploration Report
Verify Data Quality
Data Quality Report
Data Cleaning Report
Construct Data
Derived Attributes
Generated Records
Integrate Data
Merged Data
Format Data
Reformatted Data
Modeling
Select Modeling
Technique
Modeling Technique
Modeling Assumptions
Generate Test Design
Test Design
Build Model
Parameter Settings
Models
Model Description
Assess Model
Model Assessment
Revised Parameter
Settings
Evaluation
Evaluate Results
Assessment of Data
Mining Results w.r.t.
Business Success
Criteria
Approved Models
Review Process
Review of Process
Determine Next Steps
List of Possible Actions
Decision
Deployment
Plan Deployment
Deployment Plan
Plan Monitoring and
Maintenance
Monitoring and
Maintenance Plan
Produce Final Report
Final Report
Final Presentation
Review Project
Experience
Documentation
Major Application Areas for
Data Mining Solutions
Fraud/Non-Compliance
Anomaly detection
Isolate the factors that lead
to fraud, waste and abuse
Target auditing and
investigative efforts more
effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting
customers
Maximizing profitability
(cross selling, identifying
profitable customers)
Service Delivery and
Customer Retention
Build profiles of customers
likely to use which services
Web Mining
Health Care
Case Study: Search
Engines
Early search engines used mainly keywords
on a page – were subject to manipulation
Google success is due to its algorithm which
uses mainly links to the page
Google founders Sergey Brin and Larry Page
were students in Stanford doing research in
databases and data mining in 1998 which led
to Google
Case Study:
Direct Marketing and CRM
Most major direct marketing companies
are using modeling and data mining
Most financial companies are using
customer modeling
Modeling is easier than changing
customer behaviour
Some successes
Verizon Wireless reduced churn rate from 2%
to 1.5%
Biology: Molecular
Diagnostics
Leukemia: Acute Lymphoblastic (ALL) vs
Acute Myeloid (AML)
72 samples, about 7,000 genes
ALL
AML
Results: 33 correct (97% accuracy),
1 error (sample suspected mislabelled)
Outcome predictions?
Case Study:
Security and Fraud Detection
Credit Card Fraud Detection
Money laundering
FAIS (US Treasury)
Securities Fraud
NASDAQ Sonar system
Phone fraud
AT&T, Bell Atlantic, British
Telecom/MCI
Bio-terrorism detection at Salt
Lake Olympics 2002
3D example by MineSet
Data Mining and Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anon. ID
Give randomized outputs
Multi-party computation – distributed data
…
The Hype Curve for
Data Mining and Knowledge Discovery
Over-inflated
expectations
Growing acceptance
and mainstreaming
rising
expectations
Disappointment
Performance
Expectations
1990
1998
2000
2002
Final Remarks
Data Mining can be utilized for any
field that needs to find patterns or
relationships in their data.
Questions?
Special Data Types
Spatial Data
Streamed Data
Multimedia data
Spatial Mining
Spatial Data and Structures
Images
Spatial Mining Algorithms
Definitions
Spatial data is about instances located in
a physical space
Spatial data has location or georeferenced features
Some of these features are:
Address, latitude/longitude (explicit)
Location-based partitions in databases
(implicit)
Applications and Problems
Geographic information systems (GIS) store
information related to geographic locations on
Earth
Weather, community infrastructure needs, disaster
management, and hazardous waste
Homeland security issues such as prediction of
unexpected events and planning of evacuation
Remote sensing and image classification
Biomedical applications include medical imaging
and illness diagnosis
Use of Spatial Data
Map overlay – merging disparate data
Different views of the same area: (Level 1) streets, power
lines, phone lines, sewer lines, (Level 2) actual elevations,
building locations, and rivers
Spatial selection – find all houses near WSU
Spatial join – nearest for points, intersection for
areas
Other basic spatial operations
Region/range query for objects intersecting a region
Nearest neighbor query for objects closest to a given
place
Distance scan asking for objects within a certain radius
Spatial Data Structures
Minimum bounding rectangles
(MBR)
Different tree structures
Quad tree
R-Tree
kd-Tree
Image databases
MBR
Representing a spatial object by the
smallest rectangle [(x1,y1), (x2,y2)] or
rectangles
(x2,y2)
(x1,y1)
R-Tree
Indexing MBRs in a tree
An R-tree of order m has at most m entries
R8
R6 one node
R8
in
An
example (order of 3)
R1
R6
R7
R7
R2
R3
R4
R5
R1
R2
R3
R4
R5
Common Tasks dealing with
Spatial Data
Data focusing
Spatial queries
Identifying interesting parts in spatial data
Progress refinement can be applied in a tree
structure
Feature extraction
Extracting important/relevant features for an
application
Classification or others
Using training data to create classifiers
Many mining algorithms can be used
Classification, clustering, associations
Spatial Mining Tasks
Spatial classification
Spatial clustering
Spatial association rules
Spatial Classification
Use spatial information at different
(coarse/fine) levels (different indexing
trees) for data focusing
Determine relevant spatial or non-spatial
features
Perform normal supervised learning
algorithms
e.g., Decision trees,
Spatial Clustering
Use tree structures to index spatial data
DBSCAN: R-tree
CLIQUE: Grid or Quad tree
Clustering with spatial constraints
(obstacles need to adjust notion of
distance)
Spatial Association Rules
Spatial objects are of major interest, not
transactions
A B
A, B can be either spatial or non-spatial (3
combinations)
What is the fourth combination?
Association rules can be found w.r.t. the 3
types
Summary
Spatial data can contain both spatial and nonspatial features.
When spatial information becomes dominant
interest, spatial data mining should be applied.
Spatial data structures can facilitate spatial
mining.
Standard data mining algorithms can be
modified for spatial data mining, with a
substantial part of preprocessing to take into
account of spatial information.
Characteristics of Data
Streams
Data Streams
Data streams—continuous, ordered, changing, fast,
huge amount
Traditional DBMS—data stored in finite, persistent data
sets
Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing and requires fast, real-time response
Data stream captures nicely our data processing needs
of today
Random access is expensive—single linear scan
Stream Data Applications
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply
& manufacturing
Sensor, monitoring & surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even saved but random
access is too expensive)
Architecture: Stream Query
Processing
SDMS (Stream Data
Management System)
User/Application
Continuous Query
Results
Multiple streams
Stream Query
Processor
Scratch Space
(Main memory and/or Disk)
Challenges of Stream Data
Processing
Multiple, continuous, rapid, time-varying,
ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data mining,
Processing Stream Queries
Query types
One-time query vs. continuous query (being
evaluated continuously as stream continues to arrive)
Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
For real-time response, main memory algorithm
should be used
Memory requirement is unbounded if one will join
future tuples
Approximate query answering
With bounded memory, it is not always possible to
Stream Data Mining vs. Stream
Querying
Stream mining—A more challenging task
It shares most of the difficulties with stream
querying
Patterns are hidden and more general than
querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Stream Data Mining Tasks
Multi-dimensional (on-line) analysis of streams
Clustering data streams
Classification of data streams
Mining frequent patterns in data streams
Mining sequential patterns in data streams
Mining partial periodicity in data streams
Mining notable gradients in data streams
Mining outliers and unusual patterns in data
streams
Challenges for Mining Dynamics in
Data Streams
Most stream data are at pretty low-level or
multi-dimensional in nature: needs ML/MD
processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multidimensions/levels
Fast, real-time detection and response
Multi-Dimensional Stream Analysis:
Examples
Analysis of Web click streams
Raw data at low levels: seconds, web page addresses,
user IP addresses, …
Analysts want: changes, trends, unusual patterns, at
reasonable levels of details
E.g., Average clicking traffic in North America on
sports in the last 15 minutes is 40% higher than that
in the last 24 hours.”
Analysis of power consumption streams
Raw data: power consumption flow for every
household, every minute
Patterns one may find: average hourly power
A Key Step—Stream Data
Reduction
Challenges of OLAPing stream data
Raw data cannot be stored
Simple aggregates are not powerful enough
History shape and patterns at different levels are
desirable: multi-dimensional regression analysis
Proposal
A scalable multi-dimensional stream “data cube”
that can aggregate regression model of stream
data efficiently without accessing the raw data
Data Warehouse
Data Warehouse
Architecture
Data Warehouse Options