Tutorial on High Performance Data Mining

Download Report

Transcript Tutorial on High Performance Data Mining

Data Mining
Dr. Mohsen Kahani
Email: [email protected]
http://www.um.ac.ir/~kahani/
Overview
Introduction
Data Mining Functions and Models
Data Mining Methodologies
Data Mining Case Studies
Final Remarks
Motivation:
“Necessity is the Mother of Invention”
Data explosion problem:
Automated data collection tools and mature
database technology lead to tremendous
amounts of data stored in databases, data
warehouses and other information
repositories
We are drowning in data, but starving for
knowledge!
Data pyramid
Wisdom
Knowledge + experience
Knowledge
Information + rules
Information
Data + context
Data
Related Fields
Machine
Learning
Visualization
Data Mining and
Knowledge Discovery
Statistics
Databases
Knowledge Discovery Process
Integration
Interpretation
& Evaluation
Knowledge
Knowledge
__ __ __
__ __ __
__ __ __
DATA
Ware
house
Transformed
Data
Target
Data
Patterns
and
Rules
Understanding
Raw
Dat
a
Data Mining and Business Intelligence
Increasing potential
to support
business decisions
Making
Decisions
Data Presentation
Visualization Techniques
Data Mining
Information Discovery
End User
Business
Analyst
Data
Analyst
Data Exploration
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
OLAP, MDA
Data Sources
Paper, Files, Information Providers, Database Systems, OLTP
DBA
Definition of Data Mining
“…The non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data…”
Fayyad, Piatetsky-Shapiro, Smyth [1996]
The Evolution of Data Analysis
Evolutionary Step Business Question Enabling
Technologies
Product Providers Characteristics
Data Collection
(1960s)
"What was my total Computers, tapes,
revenue in the last disks
five years?"
IBM, CDC
Retrospective,
static data delivery
Data Access
(1980s)
"What were unit
sales in New
England last
March?"
Relational
databases
(RDBMS),
Structured Query
Language (SQL),
ODBC
Oracle, Sybase,
Informix, IBM,
Microsoft
Retrospective,
dynamic data
delivery at record
level
Data Warehousing
& Decision
Support
(1990s)
"What were unit
sales in New
England last
March? Drill down
to Boston."
On-line analytic
processing
(OLAP),
multidimensional
databases, data
warehouses
SPSS, Comshare, Retrospective,
Arbor, Cognos,
dynamic data
Microstrategy,NCR delivery at multiple
levels
Data Mining
(Emerging Today)
"What’s likely to
happen to Boston
unit sales next
month? Why?"
Advanced
algorithms,
multiprocessor
computers, massive
databases
SPSS/Clementine,
Lockheed, IBM,
SGI, SAS, NCR,
Oracle, numerous
startups
Prospective,
proactive
information
delivery
Need for Data Mining
 Data accumulate and double every 9 months
 There is a big gap from stored data to knowledge; and
the transition won’t occur automatically.
 Manual data analysis is not new but a bottleneck
 Fast developing Computer Science and Engineering
generates new demands
 Seeking knowledge from massive data
Any personal experience?
When is DM useful
Data rich world
Large data (dimensionality and size)
Image data (size)
Gene chip data (dimensionality)
Little knowledge about data (exploratory
data analysis)
What if we have some knowledge?
Challenges
Increasing data dimensionality and data size
Various data forms
New data types
Streaming data, multimedia data
Efficient search and access to data/knowledge
Intelligent update and integration
Data Mining Survey
Industry Pioneers
 23%
 19%
 17%
 13%
 12%
Manufacturing
Financial Serv.
Tele/Data communication
Media
Retail/Wholesaler
Objectives
 21.4% Understanding Customer Segments and
Preferences,
 19,5% Identifying Profitable Customers and Acquiring New
ones,
 14,1% Increasing Revenue From Customers.
World Data Mining Survey, 6 August, 2002.
Results of Data Mining
Include:
Forecasting what may happen in the
future
Classifying people or things into groups by
recognizing patterns
Clustering people or things into groups
based on their attributes
Associating what events are likely to occur
together
Sequencing what events are likely to lead
to later events
Data Mining versus OLAP
OLAP - On-line
Analytical Processing
Provides you with a
very good view of what
is happening, but can
not predict what will
happen in the future or
why it is happening
Data Mining Versus Statistical Analysis
Data Mining
Originally developed to act
as expert systems to solve
problems
Less interested in the
mechanics of the technique
If it makes sense then let’s
use it
Does not require
assumptions to be made
about data
Can find patterns in very
large amounts of data
 Requires understanding of
data and business problem
Data Analysis
Tests for statistical
correctness of models
Are statistical assumptions of
models correct?
• Eg Is the R-Square good?
Hypothesis testing
Is the relationship
significant?
• Use a t-test to validate
significance
Tends to rely on sampling
Techniques are not optimised
for large amounts of data
Requires strong statistical
skills
Data Mining Taxonomy
Predictive Method
- …predict the value of a particular
attribute…
Descriptive Method
- …foundation of human-interpretable
patterns that describe the data…
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Deviation Detection [Predictive]
Data Mining Tasks:
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
Classification: Linear
Regression
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes
wi from data to
minimize squared error
to ‘fit’ the data
Not flexible enough
Classification: Decision
Trees
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
Y
3
2
5
X
Decision Trees
-a way of representing a series of rules that
lead to a class or value;
-basic components of a decision tree:
decision node, branches and leaves;
Income>40,000
No
Yes
Job>5
Yes
Low Risk
High Debt
No
High Risk
Yes
High Risk
No
Low Risk
Decision Trees (cont.)
- handle very well non-numeric data;
- work best when the predictor variables
are categorical;
Example Decision Tree
Splitting Attributes
10
Tid Refund Marital
Status
Taxable
Income Cheat
1
125K
No
Yes
Single
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Refund
Yes
No
NO
MarSt
Single, Divorced
TaxInc
< 80K
NO
Married
NO
> 80K
YES
The splitting attribute at a node is
determined based on the Gini index.
Classification: Neural Networks
- efficiently model large and complex problems;
- may be used in classification problems or for
regressions;
- Starts with input layer => hidden layer =>
output layer
3
1
4
6
2
Inputs
5
Hidden Layer
Output
Neural Networks (cont.)
- can be easily implemented to run on
massively parallel computers;
- can not be easily interpret;
- require an extensive amount of training
time;
- require a lot of data preparation (involve
very careful data cleansing, selection,
preparation, and pre-processing);
- require sufficiently large data set and high
signal-to noise ratio.
Kohonen Network
Description
unsupervised
seeks to
describe
dataset in
terms of
natural
clusters of
cases
Classification Example
Tid Refund Marital
Status
Taxable
Income Cheat
Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
No
Single
75K
?
2
No
Married
100K
No
Yes
Married
50K
?
3
No
Single
70K
No
No
Married
150K
?
4
Yes
Married
120K
No
Yes
Divorced 90K
?
5
No
Divorced 95K
Yes
No
Single
40K
?
6
No
Married
No
No
Married
80K
?
60K
10
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
10
No
Single
90K
Yes
Training
Set
Learn
Classifier
Test
Set
Model
Classification Application
Direct Marketing
Fraud Detection
Customer Attrition/Churn
Sky Survey Cataloging
Data Mining Tasks:
Clustering
 Goal is to identify categories
 Natural grouping of customers
by processing all the available
data about them.
 Other applications
market segmentation, discovering
affinity groups, and defect
analysis
Data Mining Tasks:
Association Rule Discovery
Given a set of records each of which contain
some number of items from a given collection;
Produce dependency rules which will predict
occurrence of an item based on occurrences of other
items.
TID
Items
1
2
3
4
5
Bread, Coke, Milk
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rule
Discovery Application
Marketing and Sales Promotion
Supermarket Shelf Management
Inventory Management
Deviation Detection & Pattern
Discovery
Deviation Detection:
…discovering most significant changes in data from
previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Pattern Discovery:
…process of looking for patterns and rules that
predict strong sequential dependencies among
different events…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
Sequential Patterns
Identify frequently occurring sequences
from given records
40 percent of female customers buy a
gray skirt six months after buying a red
jacket
Data Mining Methodology:
SAS
 Sample
Extract a portion of the dataset for data mining
 Explore
 Modify
create, select and transform variables with the intention of
building a model
 Model
Specify a relationship of variables that reliably predicts a
desired goal
 Assess
Evaluate the practical value of the findings and the model
resulting from the data mining effort
Data Mining Methodology:
CRISP-DM
Data understanding
Data preparation
Modeling
Evaluation
Deployment
CRISP-DM Phases
Phases and Tasks
Business
Understanding
Determine
Business Objectives
Background
Business Objectives
Business Success
Criteria
Situation Assessment
Inventory of Resources
Requirements,
Assumptions, and
Constraints
Risks and Contingencies
Terminology
Costs and Benefits
Determine
Data Mining Goal
Data Mining Goals
Data Mining Success
Criteria
Produce Project Plan
Project Plan
Initial Asessment of
Tools and Techniques
Data
Understanding
Collect Initial Data
Initial Data Collection
Report
Data
Preparation
Data Set
Data Set Description
Select Data
Data Description Report
Rationale for Inclusion /
Exclusion
Explore Data
Clean Data
Describe Data
Data Exploration Report
Verify Data Quality
Data Quality Report
Data Cleaning Report
Construct Data
Derived Attributes
Generated Records
Integrate Data
Merged Data
Format Data
Reformatted Data
Modeling
Select Modeling
Technique
Modeling Technique
Modeling Assumptions
Generate Test Design
Test Design
Build Model
Parameter Settings
Models
Model Description
Assess Model
Model Assessment
Revised Parameter
Settings
Evaluation
Evaluate Results
Assessment of Data
Mining Results w.r.t.
Business Success
Criteria
Approved Models
Review Process
Review of Process
Determine Next Steps
List of Possible Actions
Decision
Deployment
Plan Deployment
Deployment Plan
Plan Monitoring and
Maintenance
Monitoring and
Maintenance Plan
Produce Final Report
Final Report
Final Presentation
Review Project
Experience
Documentation
Major Application Areas for
Data Mining Solutions
Fraud/Non-Compliance
Anomaly detection
Isolate the factors that lead
to fraud, waste and abuse
Target auditing and
investigative efforts more
effectively
Credit/Risk Scoring
Intrusion detection
Parts failure prediction
Recruiting/Attracting
customers
Maximizing profitability
(cross selling, identifying
profitable customers)
Service Delivery and
Customer Retention
 Build profiles of customers
likely to use which services
Web Mining
Health Care
Case Study: Search
Engines
Early search engines used mainly keywords
on a page – were subject to manipulation
Google success is due to its algorithm which
uses mainly links to the page
Google founders Sergey Brin and Larry Page
were students in Stanford doing research in
databases and data mining in 1998 which led
to Google
Case Study:
Direct Marketing and CRM
Most major direct marketing companies
are using modeling and data mining
Most financial companies are using
customer modeling
Modeling is easier than changing
customer behaviour
Some successes
Verizon Wireless reduced churn rate from 2%
to 1.5%
Biology: Molecular
Diagnostics
Leukemia: Acute Lymphoblastic (ALL) vs
Acute Myeloid (AML)
72 samples, about 7,000 genes
ALL
AML
Results: 33 correct (97% accuracy),
1 error (sample suspected mislabelled)
Outcome predictions?
Case Study:
Security and Fraud Detection
Credit Card Fraud Detection
Money laundering
FAIS (US Treasury)
Securities Fraud
NASDAQ Sonar system
Phone fraud
AT&T, Bell Atlantic, British
Telecom/MCI
Bio-terrorism detection at Salt
Lake Olympics 2002
3D example by MineSet
Data Mining and Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replacing sensitive personal data with anon. ID
Give randomized outputs
Multi-party computation – distributed data
…
The Hype Curve for
Data Mining and Knowledge Discovery
Over-inflated
expectations
Growing acceptance
and mainstreaming
rising
expectations
Disappointment
Performance
Expectations
1990
1998
2000
2002
Final Remarks
Data Mining can be utilized for any
field that needs to find patterns or
relationships in their data.
Questions?
Special Data Types
Spatial Data
Streamed Data
Multimedia data
Spatial Mining
Spatial Data and Structures
Images
Spatial Mining Algorithms
Definitions
Spatial data is about instances located in
a physical space
Spatial data has location or georeferenced features
Some of these features are:
Address, latitude/longitude (explicit)
Location-based partitions in databases
(implicit)
Applications and Problems
Geographic information systems (GIS) store
information related to geographic locations on
Earth
Weather, community infrastructure needs, disaster
management, and hazardous waste
Homeland security issues such as prediction of
unexpected events and planning of evacuation
Remote sensing and image classification
Biomedical applications include medical imaging
and illness diagnosis
Use of Spatial Data
Map overlay – merging disparate data
Different views of the same area: (Level 1) streets, power
lines, phone lines, sewer lines, (Level 2) actual elevations,
building locations, and rivers
Spatial selection – find all houses near WSU
Spatial join – nearest for points, intersection for
areas
Other basic spatial operations
Region/range query for objects intersecting a region
Nearest neighbor query for objects closest to a given
place
Distance scan asking for objects within a certain radius
Spatial Data Structures
Minimum bounding rectangles
(MBR)
Different tree structures
Quad tree
R-Tree
kd-Tree
Image databases
MBR
Representing a spatial object by the
smallest rectangle [(x1,y1), (x2,y2)] or
rectangles
(x2,y2)
(x1,y1)
R-Tree
Indexing MBRs in a tree
An R-tree of order m has at most m entries
R8
R6 one node
R8
in
An
example (order of 3)
R1
R6
R7
R7
R2
R3
R4
R5
R1
R2
R3
R4
R5
Common Tasks dealing with
Spatial Data
Data focusing
Spatial queries
Identifying interesting parts in spatial data
Progress refinement can be applied in a tree
structure
Feature extraction
Extracting important/relevant features for an
application
Classification or others
Using training data to create classifiers
Many mining algorithms can be used
Classification, clustering, associations
Spatial Mining Tasks
Spatial classification
Spatial clustering
Spatial association rules
Spatial Classification
Use spatial information at different
(coarse/fine) levels (different indexing
trees) for data focusing
Determine relevant spatial or non-spatial
features
Perform normal supervised learning
algorithms
e.g., Decision trees,
Spatial Clustering
Use tree structures to index spatial data
DBSCAN: R-tree
CLIQUE: Grid or Quad tree
Clustering with spatial constraints
(obstacles  need to adjust notion of
distance)
Spatial Association Rules
Spatial objects are of major interest, not
transactions
A  B
A, B can be either spatial or non-spatial (3
combinations)
What is the fourth combination?
Association rules can be found w.r.t. the 3
types
Summary
Spatial data can contain both spatial and nonspatial features.
When spatial information becomes dominant
interest, spatial data mining should be applied.
Spatial data structures can facilitate spatial
mining.
Standard data mining algorithms can be
modified for spatial data mining, with a
substantial part of preprocessing to take into
account of spatial information.
Characteristics of Data
Streams
Data Streams
Data streams—continuous, ordered, changing, fast,
huge amount
Traditional DBMS—data stored in finite, persistent data
sets
Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing and requires fast, real-time response
Data stream captures nicely our data processing needs
of today
Random access is expensive—single linear scan
Stream Data Applications
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply
& manufacturing
Sensor, monitoring & surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even saved but random
access is too expensive)
Architecture: Stream Query
Processing
SDMS (Stream Data
Management System)
User/Application
Continuous Query
Results
Multiple streams
Stream Query
Processor
Scratch Space
(Main memory and/or Disk)
Challenges of Stream Data
Processing
Multiple, continuous, rapid, time-varying,
ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data mining,
Processing Stream Queries
Query types
One-time query vs. continuous query (being
evaluated continuously as stream continues to arrive)
Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
For real-time response, main memory algorithm
should be used
Memory requirement is unbounded if one will join
future tuples
Approximate query answering
With bounded memory, it is not always possible to
Stream Data Mining vs. Stream
Querying
Stream mining—A more challenging task
It shares most of the difficulties with stream
querying
Patterns are hidden and more general than
querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Stream Data Mining Tasks
Multi-dimensional (on-line) analysis of streams
Clustering data streams
Classification of data streams
Mining frequent patterns in data streams
Mining sequential patterns in data streams
Mining partial periodicity in data streams
Mining notable gradients in data streams
Mining outliers and unusual patterns in data
streams
Challenges for Mining Dynamics in
Data Streams
Most stream data are at pretty low-level or
multi-dimensional in nature: needs ML/MD
processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multidimensions/levels
Fast, real-time detection and response
Multi-Dimensional Stream Analysis:
Examples
Analysis of Web click streams
Raw data at low levels: seconds, web page addresses,
user IP addresses, …
Analysts want: changes, trends, unusual patterns, at
reasonable levels of details
E.g., Average clicking traffic in North America on
sports in the last 15 minutes is 40% higher than that
in the last 24 hours.”
Analysis of power consumption streams
Raw data: power consumption flow for every
household, every minute
Patterns one may find: average hourly power
A Key Step—Stream Data
Reduction
 Challenges of OLAPing stream data
 Raw data cannot be stored
 Simple aggregates are not powerful enough
 History shape and patterns at different levels are
desirable: multi-dimensional regression analysis
 Proposal
 A scalable multi-dimensional stream “data cube”
that can aggregate regression model of stream
data efficiently without accessing the raw data
Data Warehouse
Data Warehouse
Architecture
Data Warehouse Options