Steven F. Ashby Center for Applied Scientific Computing Month DD

Download Report

Transcript Steven F. Ashby Center for Applied Scientific Computing Month DD

High Performance Data Mining
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~kumar
Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Overview
 Introduction
to Data Mining (What, Why,
and How?)
 Issues
and Challenges in Designing
Parallel Data Mining Algorithms
 Case
Study: Discovery of Patterns in
Global Climate Data using Data Mining
 Summary
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
What is Data Mining?
 Many
Definitions
– Non-trivial extraction of implicit, previously
unknown and potentially useful information
from data
– Exploration & analysis, by automatic or
semi-automatic means, of
large quantities of data
in order to discover
meaningful patterns
What is (not) Data Mining?
What is not Data
Mining?


What is Data Mining?
– Look up phone
number in phone
directory
– Certain names are more
prevalent in certain US
locations (O’Brien, O’Rurke,
O’Reilly… in Boston area)
– Query a Web
search engine for
information about
“Amazon”
– Group together similar
documents returned by
search engine according to
their context (e.g. Amazon
rainforest, Amazon.com,)
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Why Mine Data? Commercial Viewpoint

Lots of data is being collected
and warehoused
– Web data, e-commerce
– purchases at department/
grocery stores
– Bank/Credit Card
transactions

Computers have become cheaper and more powerful

Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Why Mine Data? Scientific Viewpoint

Data collected and stored at
enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene
expression data
– scientific simulations
generating terabytes of data


Traditional techniques infeasible for raw data
Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Mining Large Data Sets - Motivation



There is often information “hidden” in the data that is
not readily evident
Human analysts may take weeks to discover useful
information
Much of the data is never analyzed at all
4,000,000
3,500,000
The Data Gap
3,000,000
2,500,000
2,000,000
1,500,000
Total new disk (TB) since 1995
1,000,000
Number of
analysts
500,000
0
1995
1996
1997
1998
1999
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
Origins of Data Mining
Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
 Traditional Techniques
may be unsuitable due to
Statistics/
Machine Learning/
– Enormity of data
AI
Pattern
Recognition
– High dimensionality
of data
Data Mining
– Heterogeneous,
distributed nature
Database
systems
of data

© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Role of Parallel and Distributed
Computing

Many algorithms use computation time more than O(n)

High Performance Computing (HPC) is often critical for
scalability to large data sets

Sequential computers
have limited memory
– This may required multiple,
expensive I/O passes over data

Data may be distributed
– due to privacy reasons
– physically dispersed over
many different geographic
locations
© Vipin Kumar
Statistics/
AI
Machine Learning/
Pattern
Recognition
Data
Mining
Database
systems
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
High
Performance
Computing
‹#›
Data Mining Tasks...
Data
10
Milk
Tid Refund Marital
Status
Taxable
Income Cheat
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
11
No
Married
60K
No
12
Yes
Divorced 220K
No
13
No
Single
85K
Yes
14
No
Married
75K
No
15
No
Single
90K
Yes
60K
Predictive Modeling

Find a model for class attribute as a function of
the values of other attributes
Model for predicting tax evasion
Married
Yes
Tid Refund
Marital
Status
Taxable
Evade
Income
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
NO
Income100K
Yes
Yes
Yes
Learn
Classifier
No
Income
 80K
NO
NO
No
YES
10
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Predictive Modeling: Applications

Targeted Marketing

Customer Attrition/Churn

Classifying Galaxies
Early
Class:
• Stages of Formation
Intermediate
Attributes:
• Image features,
• Characteristics of light
waves received, etc.
Late
Sky Survey Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Courtsey: http://aps.umn.edu
Clustering

Given a set of data points, find groupings such that
– Data points in one cluster are more similar to
one another
– Data points in separate clusters are less similar
to one another
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Clustering: Applications

Market Segmentation

Gene expression clustering

Document Clustering
Category
Total
Articles
Correctly
Placed
555
364
Foreign
341
260
National
273
36
Metro
943
746
Sports
738
573
Entertainment
354
278
Financial
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Association Rule Discovery


Given a set of records, find dependency rules
which will predict occurrence of an item based
on occurrences of other items in the record
TID
Items
1
Bread, Coke, Milk
2
3
4
5
Beer, Bread
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Diaper, Milk
Applications
– Marketing and Sales Promotion
– Supermarket shelf management
– Inventory Management
© Vipin Kumar
Rules Discovered:
{Milk} --> {Coke} (s=0.6, c=0.75)
{Diaper, Milk} --> {Beer}
(s=0.4, c=0.67)
Support, s 
# transacti ons that contain X and Y
Total transacti ons
Confidence , c 
# transacti ons that contain X and Y
# transacti ons that contain X
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Deviation/Anomaly Detection
Detect significant deviations from normal behavior
 Applications:
– Credit Card Fraud Detection

– Network Intrusion
Detection
Typical network traffic at University level may reach over 100 million connections per day
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
General Issues and Challenges in
Parallel Data Mining

Dense vs. Sparse

Structured versus Unstructured

Static vs. Dynamic

Data mining computations tend to be unstructured,
sparse and dynamic.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Specific Issues and Challenges in
Parallel Data Mining

Disk I/O
– Data is often too large to fit in main memory
– Spatial locality is critical

Hash Tables
– Many efficient data mining algorithms require fast
access to large hash tables.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Constructing a Decision Tree
Tid Refund Marital
Status
Taxable
Income Evade
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
60K
Marital
Status
Single/Divorced
Pay: 3
Evade: 3
Refund
Married
Yes
Pay: 4
Evade: 0
Pay: 3
Evade: 0
Pay
10
No
Pay: 4
Evade: 3
Evade
Refund
3
0
No Refund
4
3
Key Computation
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Constructing a Decision Tree
Tid Refund Marital
Status
Taxable
Income Evade
1
Yes
Single
125K
No
2
No
Married
100K
No
3
No
Single
70K
No
4
Yes
Married
120K
No
5
No
Divorced 95K
Yes
6
No
Married
No
60K
7
Yes
Divorced 220K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
Refund: Yes
Tid Refund Marital
Status
Taxable
Income Evade
1
Yes
Single
125K
No
4
Yes
Married
120K
No
7
Yes
Divorced 220K
No
10
Refund: No
10
Tid Refund Marital
Status
Taxable
Income Evade
2
No
Married
100K
No
3
No
Single
70K
No
5
No
Divorced 95K
Yes
6
No
Married
60K
No
8
No
Single
85K
Yes
9
No
Married
75K
No
10
No
Single
90K
Yes
10
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Constructing a Decision Tree in Parallel

m categorical attributes
n records
Partitioning of data
only
– global reduction per
node is required
– large number of
classification tree
nodes gives high
communication cost
Pay
© Vipin Kumar
Evade
Refund
3
0
No Refund
4
3
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Constructing a Decision Tree in Parallel
10,000 training records
7,000 records
2,000

3,000 records
5,000
2,000
1,000
Partitioning of
classification tree nodes
– natural concurrency
– load imbalance as the
amount of work
associated with each
node varies
– child nodes use the same
data as used by parent
node
© Vipin Kumar
–
loss of locality
–
high data movement cost
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Challenges in Constructing Parallel Classifier

Partitioning of data only
– large number of classification tree nodes gives high
communication cost

Partitioning of classification tree nodes
– natural concurrency
– load imbalance as the amount of work associated with each node
varies
– child nodes use the same data as used by parent node
•
–
loss of locality
–
high data movement cost
Hybrid algorithms: partition both data and tree
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Experimental Results
(Srivastava, Han, Kumar, and Singh, 1999)


Data set
– function 2 data set discussed in SLIQ paper
(Mehta, Agrawal and Rissanen, EDBT’96)
– 2 class labels, 3 categorical and 6 continuous
attributes
IBM SP2 with 128 processors
– 66.7 MHz CPU with 256 MB real memory
– AIX version 4
– high performance switch
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Speedup Comparison of the Three
Parallel Algorithms
0.8 million examples
© Vipin Kumar
1.6 million examples
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Splitting Criterion Verification in the
Hybrid Algorithm
Splitting Criterion Ratio 
Moving Cost  Load Balancing
0.8 million examples on 8 processors
© Vipin Kumar
 Communication Cost
1.6 million examples on 16 processors
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Speedup of the Hybrid Algorithm with
Different Size Data Sets
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Scaleup of the Hybrid Algorithm
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Hash Table Access


Some efficient decision tree algorithms require random access to
large data structures.
Example: SPRINT (Shafer, Agrawal, Mehta)
Hash Table
ID
Income
ID
Age
ID
Left/
Right
0
25K
2
25
1
Left
2
28K
5
31
2
Left
8
30K
8
33
3
Right
4
30K
1
37
4
Right
5
35K
3
41
5
Right
1
50K
6
52
6
Left
3
520K
4
55
7
Right
6
55K
7
60
8
Left
7
70K
0
61
9
Left
10
10
10
Storing the entire has table on one processor makes the algorithm unscalable.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
ScalParC
(Joshi, Karypis, Kumar, 1998)

ScalParC is a scalable parallel decision tree
construction algorithm
– Scales to large number of processors
– Scales to large training sets

ScalParC is memory efficient
– The hash-table is distributed among the processors

ScalParC performs minimum amount of
communication
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
This Design is Inspired by..

Communication Structure of Parallel Sparse Matrix-Vector Algorithms.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Parallel Runtime
(Joshi, Karypis, Kumar, 1998)
Runtime (seconds)
120
100
0.2M
0.4M
0.8M
1.6M
3.2M
6.4M
80
60
40
20
0
0
50
100
150
Processors
128 Processor Cray T3D
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Computing Association Patterns
1. Market-basket transactions
TID
Items
1
Bread, Diaper,Milk
2
3
4
5
Beer, Diaper, Bread, Eggs
Beer, Coke, Diaper, Milk
Beer, Bread, Diaper, Milk
Coke, Bread, Diaper, Milk
3. Generate association
rules
© Vipin Kumar
2. Find item combinations (itemsets)
that occur frequently in data
Item Combination
Bread
Coke
Milk
…
Bread & Coke
Bread & Milk
…
Bread & Milk & Diaper
…
Count
4
2
4
…
1
3
…
3
…
{Diaper, Milk}  {Beer}
{Bread}  {Diaper}
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Computing Association Require
Exponential Computation
{a}
{a,b}
{b}
{a,c}
{a,b,c}
{c}
{a,d}
{a,b,d}
{b,c}
{d}
{b,d}
{a,c,d}
{c,d}
{b,c,d}
{a,b,c,d}
Given m items, there are 2m-1 possible item combinations
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Handling Exponential Complexity

Given n transactions and m different items:
m 1
– number of possible association rules: O(m2 )
m
O
(
nm
2
)
– computation complexity:

Systematic search for all patterns, based on support
constraint [Agarwal & Srikant]:
– If {A,B} has support at least a, then both A and B have
support at least a.
– If either A or B has support less than a, then {A,B} has
support less than a.
– Use patterns of n-1 items to find patterns of n items.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Illustrating Apriori Principle
(Agrawal and Srikant, 1994)
Item
Bread
Coke
Milk
Beer
Diaper
Eggs
Count
4
2
4
3
5
1
Items (1-itemset candidates)
Minimum Support = 3
If every subset is considered,
6C + 6C + 6C = 41
1
2
3
With support-based pruning,
6 + 6 + 1 = 13
© Vipin Kumar
Itemset
{Bread,Milk}
{Bread,Beer}
{Bread,Diaper}
{Milk,Beer}
{Milk,Diaper}
{Beer,Diaper}
Count
4
2
3
2
4
3
Pairs (2-itemset candidates)
Triplets (3-itemset candidates)
Itemset
{Bread,Milk,Diaper}
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
Count
3
‹#›
Counting Candidates


Frequent Itemsets are found by counting candidates.
Simple way:
– Search for each candidate in each transaction.
Expensive!!!
Transactions
N
© Vipin Kumar
Candidates
M
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Parallel Formulation of Association Rules
(Han, Karypis, and Kumar, 2000)

Need:
– Huge Transaction Datasets (10s of TB)
– Large Number of Candidates.

How?
– Partition the Transaction Database among processors

communication needed for global counts

local memory on each processor should be large enough to
store the entire hash tree
– Partition the Candidates among processors

redundant I/O for transactions
– Partition both Candidates and Transaction Database
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Parallel Association Rules: Scaleup Results
(100K,0.25%)
© Vipin Kumar
(Han, Karypis, and Kumar, 2000)
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Parallel Association Rules: Response Time
(np=64,50K)
© Vipin Kumar
(Han, Karypis, and Kumar, 2000)
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Discovery of Patterns in the Global
Climate System
Ocean and Land Temperature (Jan 1982)
Research Goals:
NPP
.
Pressure
© Vipin Kumar
Longitude
Find global climate patterns of
interest to Earth Scientists

Global snapshots of values for
a number of variables on land
surfaces or water.

Monthly over a range of 10 to
50 years.
NPP
.
Pressure
.
Precipitation
Precipitation
SST
SST
# grid points: 67K Land, 40K Ocean
Current data size range: 20 – 400 MB
Latitude
grid cell

Time
zone
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Importance of Global Climate Patterns
and NPP

Net Primary Production (NPP) is the net assimilation
of atmospheric carbon dioxide (CO2) into organic
matter by plants.

Keeping track of
NPP is important
because it includes
the food source of
humans and all
other organisms.

NPP is impacted by
global climate
patterns.
© Vipin Kumar
Image from http://www.pmel.noaa.gov/co2/gif/globcar.png
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Patterns of Interest

Zone Formation
– Find regions of the land or ocean which have similar behavior.

Associations
– Find relations between climate events and land cover.

Teleconnections
– Teleconnections are the simultaneous variation in climate and
related processes over widely separated points on the Earth.
– El Nino associated with droughts in Australia and Southern
Africa and heavy rainfall along the western coast of South
America.
Sea Surface
Temperature
Anomalies off
Peru
(ANOM 1+2)
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Clustering of Raw NPP and Raw SST
(Num clusters = 2)
Clusters for Raw SST and Raw NPP
90
60
Land Cluster 2
latitude
30
Land Cluster 1
0
Ice or No NPP
-30
Sea Cluster 2
-60
Sea Cluster 1
-90
-180
-150
-120
-90
-60
-30
0
30
60
90
120
150
180
longitude
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
Cluster
‹#›
K-Means Clustering of Raw NPP and Raw SST
(Num clusters = 2)
Land Cluster Cohesion:
North = 0.78
South = 0.59
Ocean Cluster Cohesion:
North = 0.77
South = 0.80
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Ocean Climate Indices:
Connecting the Ocean and the Land

An OCI is a time series of temperature or
pressure
– Based on Sea Surface Temperature (SST) or Sea
Level Pressure (SLP)

OCIs are important because
– They distill climate variability at a regional or
global scale into a single time series.
– They are related to well-known climate
phenomena such as El Niño.
©©Vipin
V. Kumar
Kumar
Keynote
Discovery
Talk
of Patterns
at VECPAR-2002,
in the Global
Porto,
Climate
Portugal,
System
June
using
27, 2002
Data Mining
‹#›
46
Ocean Climate Indices – ANOM 1+2

ANOM 1+2 is associated with El Niño and La Niña.
El Nino Events

Defined as the Sea Surface Temperature (SST) anomalies in a
regions off the coast of Peru

El Nino is associated with
– Droughts in Australia and Southern Africa
– Heavy rainfall along the western coast of South America
– Milder winters in the Midwest
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Connection of ANOM 1+2 to Land Temp
Correlation Between ANOM 1+2 and Land Temp (>0.2)
90
0.8
0.6
60
0.4
30
latitude
0.2
0
0
-0.2
-30
-0.4
-60
-0.6
-0.8
-90
-180 -150 -120 -90
-60
-30
0
30
60
90
120
150
180
longitude
OCIs capture teleconnections, i.e., the simultaneous variation in climate and
related processes over widely separated points on the Earth.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Ocean Climate Indices - NAO

The North Atlantic Oscillation (NAO) is associated with climate
variation in Europe and North America.
North Atlantic Oscillation
3
Iceland
2
Azores
1
0
-1
-2
82 83 84 85 86 87 88 89 90 91 92 93 94

Normalized pressure differences between Ponta Delgada,
Azores and Stykkisholmur, Iceland.

Associated with warm and wet winters in Europe and in cold and
dry winters in northern Canada and Greenland

The eastern US experiences mild and wet winter conditions.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Connection of NAO to Land Temp
Correlation Between NAO and Land Temperature (>0.3)
90
1
0.8
60
0.6
0.4
latitude
30
0.2
0
0
-0.2
-30
-0.4
-0.6
-60
-0.8
-1
-90
-180 -150
-120
-90
-60
-30
0
30
60
90
120
150
180
Correlation
longitude
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Discovery of Ocean Climate Indices

Use clustering to find areas of the oceans that have high
density, I.e., relatively homogeneous behavior.
– Cluster centroids are potential OCIs.
– For SLP pairs of cluster centroids are potential OCIs.

Evaluate the “influence” of potential OCIs on land points.

Determine if the potential OCI matches a known OCI.

For potential OCIs that are not well-known, conduct further
evaluation.
– Are there land points that have higher correlation for the potential
OCI than for known indices?
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
SNN Clustering - Advantages


Finding clusters of different shapes and sizes,
especially in the presence of noise, is a difficult
clustering problem.
SNN clustering
– Handles problems of varying density, shape and size.
– Is resistant to noise.

Earth Science data is noisy
– SNN clustering finds the number of clusters automatically.

Requires O(n2) time
– Need to calculate the pairwise similarity matrix
– This is a highly parallel operation.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
SST Clusters
107 SNN Clusters for Detrended Monthly Z SST (1958-1998)
90
60
latitude
30
0
-30
-60
-90
-180
© Vipin Kumar
-150
-120
-90
-60
-30
0
longitude
30
60
90
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
120
150
180
‹#›
SST Clusters that Correspond to El Nino
Climate Indices
EL Nino Related SST Clusters
90
60
latitude
30
0
75
78 67 94
Niño
Region
Range
Longitude
Range
Latitude
1+2 (94)
90°W-80°W
10°S-0°
3 (67)
150°W-90°W
5°S-5°N
3.4 (78) 170°W-120°W
5°S-5°N
4 (75) 160°E-150°W
5°S-5°N
El Nino Regions Defined
by Earth Scientists
-30
-60
-90
-180 -150 -120 -90
-60
-30
0
30
longitude
60
90
120
150
180
SNN clusters of SST that are highly correlated with El Nino indices, ~ 0.93 correlation.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
SST Clusters Highly Correlated to Known
Indices …
Examples of some SST clusters that are highly correlated to known
OCIs and have high area weighted correlation with land temperature.
These indices have a significant correlation with El Nino indices.
Cluster 29
Cluster 11
90
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-90
-180
-140
-100
-60
-20
20
60
100
140
180
-90
-180
-140
-100
-60
Cluster 31
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
© Vipin Kumar
-140
-100
-60
-20
20
20
60
100
140
180
60
100
140
180
Cluster 37
90
-90
-180
-20
60
100
140
180
-90
-180
-140
-100
-60
-20
20
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
SST Clusters Highly Correlated to Known
Indices…
However, there are areas (yellow) where these clusters correlate better.
Cluster 11 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
Cluster 29 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
90
90
0.6
0.5
70
70
0.4
50
0.3
0.2
30
0.4
50
30
0.2
0.1
10
10
0
-10
0
-10
-0.1
-30
-0.2
-0.3
-50
-0.2
-30
-50
-0.4
-0.4
-70
-70
-0.5
-0.6
-90
-180
-140
-100
-60
-20
20
60
100
140
-90
-180
180
Cluster 31 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
-140
-100
-60
-20
20
60
100
140
180
Cluster 37 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
90
90
0.4
70
0.4
70
50
50
0.2
30
0.2
30
10
10
0
-10
0
-10
-30
-30
-0.2
-0.2
-50
-50
-70
-70
-0.4
-0.4
-90
-180
-140
© Vipin Kumar
-100
-60
-20
20
60
100
140
180
-90
-180
-140
-100
-60
-20
20
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
60
100
140
180
‹#›
SST Cluster Moderately Correlated to
Known Indices
Cluster 62
Cluster 62 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20)
90
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-90
-180
-90
-180
0.6
0.4
0.2
0
-0.2
-0.4
-0.6
-140
-100
© Vipin Kumar
-60
-20
20
60
100
140
180
-140
-100
-60
-20
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
20
60
100
140
180
‹#›
Mining Associations in Earth Science Data
(Tan, Steinbach, Kumar, Potter, Klooster, Torregrosa, 2001)
(Lat,Long,time)
Events
(10N,10E,1) {Temp-Hi, Prec-Lo}
(10N,10E,2)
(10N,11E,2)
(10N,11E,5)
(10N,11E,10)


{Temp-Hi,Prec-Lo,NPP-Lo}
{Temp-Hi, NPP-Lo}
{Solar-Hi, NPP-Lo}
{Prec-Hi, PET-LO}
First, transform Earth Science data into transactions.
Find patterns using association discovery algorithms.
1 FPAR-HI PET-HI PREC-HI SOLAR-HI TEMP-HI ==> NPP-HI (support count=145, confidence=100%)
2 FPAR-HI PET-HI PREC-HI TEMP-HI ==> NPP-HI (support count=933, confidence=99.3%)
3 FPAR-HI PET-HI PREC-HI ==> NPP-HI (support count=1655, confidence=98.8%)
4 FPAR-HI PET-HI PREC-HI SOLAR-HI ==> NPP-HI (support count=268, confidence=98.2%)
…
75 FPAR-HI ==> NPP-HI (support count = 216924, confidence = 55.7%)
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Example of Interesting Association Rules
FPAR-Hi ==> NPP-Hi
(sup=5.9%, conf=55.7%)
© Vipin Kumar
Shrubland areas
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Land Cover Types
Barren
Croplands
90
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-90
-180 -140 -100 -60
-20
20
60
100
140 180
-90
-180 -140 -100 -60
Shrublands/ Grasslands
90
70
70
50
50
30
30
10
10
-10
-10
-30
-30
-50
-50
-70
-70
-20
20
60
20
60
100
140 180
60
100
140 180
Forests
90
-90
-180 -140 -100 -60
-20
100
140 180
-90
-180 -140 -100 -60
-20
20
Example of Interesting Association Rules…
Support Count
Land Cover
• Temp-Hi  NPP-Hi tends to occur in the forest and cropland regions in the northern
hemisphere
(Forests (33.5%), Grassland(8.7%), Cropland (24.5%), Desert (0.4%) )
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Need for Parallel Computing

Satellites are providing
measurements of finer
granularity.
Earth Observing System - EOS AM 1
– Finer spatial grids

1 by 1 grid produces 64,800
data points

0.1 by 0.1 grid produces
6,480,000 data points
– More frequent measurements


Daily measurements multiply
monthly data by a factor of 30
Looking at weather instead of
climate requires finer resolution
– Detection of movement of
fronts
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Need for Parallel Computing

SNN clustering analyses require
O(n2) comparisons.
– Evaluate correlation of every
ocean point with every land
point.

Association rule algorithms can
also be very compute intensive.
– Potentially very much greater
than O(n2)

Amount of memory required
exceeds for clustering and
association rule algorithms can
exceed 4GB of traditional
sequential servers.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Conclusion



Data mining techniques are increasingly being used for
discovering useful and previously unknown information
from data
HPC holds the promise of making data mining applicable
for massive datasets
HPC challenges
– Parallelization of existing data mining algorithms
– Development of novel parallel/distributed formulations
– Efficient implementation of hash tables in
parallel/distributed environment
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Bibliography

Large-Scale Parallel Data Mining.
Mohammed J. Zaki and Ching-Tien Ho (editors),
Springer-Verlag, 2000.

Advances in Distributed and Parallel Knowledge Discovery.
Hillol Kargupta and Philip Chan (editors),
AAAI Press/ MIT Press, 2000.

Data Mining for Scientific and Engineering Applications.
Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, and Raju
Namburu, Kluwer Academic Publishers, October 2001.

“Parallel Formulations of Decision-Tree Classification Algorithms,” Anurag Srivastava,
Eui-Hong (Sam) Han, Vipin Kumar, and Vineet Singh, Data Mining and Knowledge
Discovery: An International Journal, vol. 3, no. 3, pp 237-261, September 1999.

“Data Mining for the Discovery of Ocean Climate Indices”, Michael Steinbach, PangNing Tan, Vipin Kumar, Chris Potter, Steven Klooster, Workshop on Mining Scientific
Data, SDM 2002
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›
Bibliography …

“ScalParC: A New Scalable and Efficient Parallel Classification Algorithm for
Mining Large Datasets.” Mahesh V. Joshi, George Karypis and Vipin Kumar,
Proc. of 1998 International Parallel Processing Symposium, April 1998.

“Scalable Parallel Data Mining for Association Rules,” Eui-Hong (Sam) Han,
George Karypis and Vipin Kumar, IEEE Transactions on Knowledge and Data
Engineering, Vol. 12, No. 3, May/June 2000.

“Finding Spatio-Temporal Patterns in Earth Science Data,” Pang-Ning Tan,
Michael Steinbach, Vipin Kumar, Steven Klooster, Christopher Potter, Alicia
Torregrosa, KDD 2001 Workshop on Temporal Data Mining.

“Fast algorithms for mining association rules,” R. Agrawal and R. Srikant, In
Proc. of the 20th International Conference Very Large Data Bases, pages
487--499. Morgan Kaufmann, 1994.
© Vipin Kumar
Keynote Talk at VECPAR-2002, Porto, Portugal, June 27, 2002
‹#›