Three Challenges in Data Mining
Anne Denton, Department of Computer Science, NDSU
Why Data Mining?
Parkinson's Law of Data: data expands to fill the space available for storage
Disk-storage version of Moore's law: capacity $\propto 2^{t/18\,\text{months}}$
Available data grows exponentially!
Outline
Motivation for the 3 challenges
More records (rows)
More attributes (columns)
New subject domains
Some answers to the challenges
Thesis work: generalized P-tree structure; kernel-based semi-naïve Bayes classification
KDD-cup 02/03 and work with CSci 366 students: data with graph relationships
Outlook: data with time dependence
Examples
More records
Many stores save each transaction
Data warehouses keep historic data
Monitoring network traffic
Micro sensors / sensor networks
More attributes
Items in a shopping cart
Keywords in text
Properties of a protein (multi-valued categorical)
New subject domains
Data mining hype increases audience
Algorithmic Perspective
More records
Standard scaling problem
More attributes
Different algorithms needed for 1000 vs. 10 attributes
New subject domains
New techniques needed
Joining of separate fields
Algorithms should be domain-independent
Need for experts does not scale well
Twice as many data sets
Twice as many domain experts??
Ignore domain knowledge?
No! Formulate it systematically
Some Answers to the Challenges
Large data quantity (thesis)
Many records: P-tree concept and its generalization to non-spatial data
Many attributes: an algorithm that defies the curse of dimensionality
New techniques / joining separate fields: mining data on a graph
Outlook: mining data with time dependence
Challenge 1: Many Records
Typical question: how many records satisfy given conditions on the attributes?
Typical answer in record-oriented database systems: a database scan, O(N)
Sorting / indexes? Unsuitable for most problems
P-Trees
Compressed bit-column-wise storage
Bit-wise AND replaces the database scan (see the sketch below)
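To make the bit-column idea concrete, here is a minimal Python sketch. It is an illustration, not the actual P-tree implementation: real P-trees additionally compress runs of identical bits in a tree layout. It shows how a count query becomes a bitwise AND plus a popcount instead of a record-by-record scan.

```python
def bit_column(records, predicate):
    """Pack one bit per record: bit i is 1 if record i satisfies the predicate."""
    column = 0
    for i, rec in enumerate(records):
        if predicate(rec):
            column |= 1 << i
    return column

records = [
    {"age": 25, "income": 40},
    {"age": 52, "income": 90},
    {"age": 37, "income": 85},
]

over_30 = bit_column(records, lambda r: r["age"] > 30)         # 0b110
high_income = bit_column(records, lambda r: r["income"] > 80)  # 0b110

# "How many records satisfy both conditions?" is one AND plus a popcount,
# not a scan over the records.
print(bin(over_30 & high_income).count("1"))  # 2
```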
P-Trees: Compression and Ordering Aspects
Compression relies on long sequences of 0s or 1s
Images: neighboring pixels are probably similar, so Peano-ordering produces such runs
Other data? Peano-ordering can be generalized: Peano-order sorting (see the sketch below)
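As an illustration of Peano (Z-order) ordering and its generalization, here is a hedged Python sketch: interleaving the bits of the coordinates, or of discretized attribute values, gives a sort key that keeps similar records adjacent. The function name and the 8-bit width are assumptions for this example; the generalization in the thesis may differ in detail.

```python
def peano_key(values, bits=8):
    """Interleave the bits of several values (MSB first) into one sort key."""
    key = 0
    for b in range(bits - 1, -1, -1):   # most significant bit first
        for v in values:
            key = (key << 1) | ((v >> b) & 1)
    return key

# Spatial case: neighboring pixels (x, y) receive nearby keys.
print(peano_key([3, 5]))

# Generalization to non-spatial data: treat each discretized attribute
# as one coordinate and sort the records by the interleaved key.
rows = [[12, 200, 7], [13, 198, 7], [240, 3, 90]]
rows.sort(key=peano_key)
print(rows)  # the two similar rows end up adjacent
```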
Impact of Peano-Order Sorting
[Figure: speed comparison; time per test sample in milliseconds on the datasets adult, spam, mushroom, function, and crop, for unsorted data, simple sorting, and generalized Peano sorting]
Less than O(N) scaling for all algorithms; improvement especially for large data sets
[Figure: impact of sorting on execution speed; time in seconds vs. number of training points (0 to 30,000)]
So Far
Answer to challenge 1: many records
The P-tree concept allows scaling better than O(N) for AND (equivalent to a database scan)
Introduced an effective generalization to non-spatial data (thesis)
Challenge 2: Many Attributes
Focus: classification
Curse of dimensionality: some algorithms suffer more than others
Curse of Dimensionality
Many standard classification algorithms, e.g., decision trees and rule-based classification, split each attribute into 2 halves: relevant and irrelevant
How often can we divide by 2 before the small size of the "relevant" part makes the results insignificant?
This is the inverse of doubling the number of rice grains for each square of the chessboard
Many domains have hundreds of attributes
Occurrence of terms in text mining
Properties of genes
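A rough worked version of that argument, with illustrative numbers: if each of $d$ attributes splits the data into a relevant and an irrelevant half, only a fraction $2^{-d}$ of the records stays relevant; for $d = 30$ and $10^6$ records, $10^6 \cdot 2^{-30} \approx 0.001$ records remain, so results become meaningless long before hundreds of attributes are reached.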
Possible Solution
Additive models: each attribute contributes to a sum
Techniques exist (statistics) but are computationally intensive
Simplest: naïve Bayes

$$P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P\!\left(x^{(k)} \mid C = c_i\right)$$

where $x^{(k)}$ is the value of the k-th attribute and M is the number of attributes.

Considered an additive model: the logarithm of the probability is additive (see the sketch below)
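A minimal sketch of why naïve Bayes is additive, using hypothetical toy probabilities: taking logarithms turns the product over attributes into a sum of per-attribute contributions.

```python
import math

# Naive Bayes in log space: log P(x | C=c) = sum over k of log P(x^(k) | C=c),
# so each attribute contributes one additive term. The priors and conditional
# probabilities below are hypothetical toy values.

prior = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": [{"free": 0.7, "hello": 0.3}, {"!": 0.8, ".": 0.2}],
    "ham":  [{"free": 0.1, "hello": 0.9}, {"!": 0.2, ".": 0.8}],
}

def log_score(x, c):
    """Additive score: log prior plus one log-likelihood term per attribute."""
    s = math.log(prior[c])
    for k, value in enumerate(x):
        s += math.log(cond[c][k].get(value, 1e-9))  # floor instead of log(0)
    return s

x = ("free", "!")
print(max(prior, key=lambda c: log_score(x, c)))  # spam
```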
Semi-Naïve Bayes Classifier
Correlated attributes are joined
This has been done for categorical data (Kononenko '91, Pazzani '96)
Previously, continuous data was discretized
New (thesis): kernel-based evaluation of the correlation

$$\mathrm{Corr}(a, b) = \frac{N \sum_{t=1}^{N} \prod_{k \in \{a,b\}} K^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right)}{\prod_{k \in \{a,b\}} \sum_{t=1}^{N} K^{(k)}\!\left(x^{(k)}, x_t^{(k)}\right)}$$

where $K^{(k)}$ is the kernel function for attribute k, evaluated between the query point and the N data points (see the sketch below)

[Figure: distribution function vs. kernel density estimate over the data points]
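A minimal sketch of the correlation measure as reconstructed above, assuming Gaussian kernels with unit bandwidth (the kernel choice and bandwidth are assumptions for illustration): it compares the joint kernel density estimate at a query point with the product of the marginal estimates; values far from 1 indicate correlation between the two attributes.

```python
import math

def gauss(u, v, h=1.0):
    """One-dimensional Gaussian kernel (bandwidth h is an assumption)."""
    return math.exp(-((u - v) ** 2) / (2 * h * h))

def corr(x, data, a, b):
    """Corr(a, b) at query point x: N * joint KDE / product of marginal KDEs."""
    N = len(data)
    joint = sum(gauss(x[a], t[a]) * gauss(x[b], t[b]) for t in data)
    marginal_a = sum(gauss(x[a], t[a]) for t in data)
    marginal_b = sum(gauss(x[b], t[b]) for t in data)
    return N * joint / (marginal_a * marginal_b)

data = [(0.1, 0.2), (0.9, 1.1), (0.5, 0.4), (1.2, 1.3)]
print(corr((0.5, 0.5), data, a=0, b=1))
```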
Results
Error decrease in units of standard deviation for different parameter sets
Improvement for a wide range of correlation thresholds: 0.05 (white) to 1 (blue)
[Figure: semi-naïve classifier compared with P-tree naïve Bayes; decrease in error rate for parameter sets (a), (b), and (c) on the datasets spam, crop, adult, sick-euthyroid, mushroom, gene function, and splice]
So Far
Answer to challenge 1 (more records): generalized P-tree structure
Answer to challenge 2 (more attributes): additive algorithms, e.g., kernel-based semi-naïve Bayes
Challenge 3: new subject domains
Data on a graph
Outlook: data with time dependence
Standard Approach to Data Mining
Conversion to a relation (table); domain knowledge goes into the table creation
A standard table can be mined with standard tools
Does that solve the problem? To some degree, yes. But we can do better.
"Everything should be made as simple as possible, but not simpler" (Albert Einstein)
Claim: representation as a single relation is not rich enough
Example: contribution of a graph structure to standard mining problems
Genomics: protein-protein interactions
WWW: link structure
Scientific publications: citations
(Scientific American, 05/03)
Data on a Graph: Old Hat?
Common topics
Analyzing the edge structure (Google; biological networks)
Sub-graph matching (chemistry)
Visualization
These focus on the graph structure
Our work: focus on mining the node data; the graph structure provides connectivity
Protein-Protein Interactions
Protein data from the Munich Information Center for Protein Sequences (also KDD-cup 02)
Hierarchical attributes: function, localization, pathways
Gene-related properties
Interactions: from experiments; undirected graph
Questions
Prediction of a property (KDD-cup 02: AHR*)
Which properties of neighbors are relevant?
How should we integrate neighbor knowledge?
What are interesting patterns?
Which properties say more about neighboring nodes than about the node itself?
*AHR: Aryl Hydrocarbon Receptor signaling pathway
Possible Representations (see the sketch below)
OR-based: at least one neighbor has the property (example: neighbor essential = true)
AND-based: all neighbors have the property (example: neighbor essential = false)
Path-based (depends on the maximum number of hops): one record for each path
Classification: weighting?
Association rule mining: the record base changes
[Figure: example interaction graph with nodes labeled AHR / essential, AHR / essential, and AHR / not essential]
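To make the OR- and AND-based representations concrete, here is a hedged Python sketch over a tiny hypothetical interaction graph; the node names and attribute values are invented for illustration.

```python
# OR- and AND-based neighbor representations for node data on an
# undirected graph. Graph and attributes are hypothetical examples.

graph = {"p1": {"p2", "p3"}, "p2": {"p1"}, "p3": {"p1"}}
essential = {"p1": True, "p2": False, "p3": True}

def neighbor_or(node, prop):
    """True if at least one neighbor has the property."""
    return any(prop[n] for n in graph[node])

def neighbor_and(node, prop):
    """True only if all neighbors have the property."""
    return all(prop[n] for n in graph[node])

print(neighbor_or("p1", essential))   # True  (p3 is essential)
print(neighbor_and("p1", essential))  # False (p2 is not)

# A path-based representation would instead emit one record per path
# up to a maximum number of hops, which changes the record base.
```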
Association Rule Mining
OR-based representation
Conditions: the association rule involves AHR; support across a link is greater than within a node; conditions on minimum confidence and support
Top 3 with respect to support:
AHR essential
AHR nucleus (localization)
AHR transcription (function)
(Results by Christopher Besemann, CSci 366 project)
Classification Results
Problem (especially for the path-based representation): varying amount of information per record
Many algorithms are unsuitable in principle, e.g., algorithms that divide the domain space
KDD-cup 02: a very simple additive model, based on visually identifying the relationship
The number of interacting essential genes adds to the probability of predicting a protein as AHR (see the sketch below)
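A minimal sketch of that kind of additive model (the weight and threshold are illustrative assumptions, not the tuned contest values): a protein's score grows with the number of interacting essential genes.

```python
# Additive scoring sketch: each interacting essential gene adds a fixed
# amount to the AHR score. Weight and threshold are illustrative only.

def ahr_score(node, graph, essential, weight=1.0):
    return weight * sum(1 for n in graph[node] if essential.get(n, False))

graph = {"p1": {"p2", "p3", "p4"}, "p2": {"p1"}, "p3": {"p1"}, "p4": {"p1"}}
essential = {"p2": True, "p3": True, "p4": False}

threshold = 1.5
print(ahr_score("p1", graph, essential) > threshold)  # True: predict AHR
```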
KDD-Cup 02: Honorable Mention (NDSU Team)
Outlook: Time-Dependent Data
KDD-cup 03: prediction of citations of scientific papers
Old: time-series prediction
New: combination with similarity-based prediction
Conclusions and Outlook
Many exciting problems in data mining
Various challenges
Scaling of existing algorithms (more records)
Different types of algorithms gain importance (more attributes)
Identifying and solving new challenges in a domain-independent way (new subject areas)
Examples of general structural components that apply to many domains
Graph structure
Time dependence
Relationships between attributes
Software design of scientific applications
Rows vs. columns
Software engineering aspects