Three Challenges in Data Mining
Anne Denton
Department of Computer Science, NDSU
Why Data Mining?
 Parkinson’s Law of Data
Data expands to fill the space available for storage
 Disk-storage version of Moore’s law
$\text{Capacity} \propto 2^{\,t / (18\ \text{months})}$
 Available data grows exponentially!
Outline
 Motivation of 3 challenges
 More records (rows)
 More attributes (columns)
 More subject domains
 Some answers to the challenges
 Thesis work
 Generalized P-Tree structure
 Kernel-based semi-naïve Bayes classification
 KDD-cup 02/03 and projects with CSci 366 students
 Data with graph relationship
 Outlook: Data with time dependence
Examples
 More records
 Many stores save each transaction
 Data warehouses keep historic data
 Monitoring network traffic
 Micro sensors / sensor networks
 More attributes
 Items in a shopping cart
 Keywords in text
 Properties of a protein (multi-valued categorical)
 More subject domains
 Data mining hype increases audience
Algorithmic Perspective
 More records
 Standard scaling problem
 More attributes
 Different algorithms needed for 1000 vs. 10 attributes
 More subject domains
 New techniques needed
 Joining of separate fields
 Algorithms should be domain-independent
 Need for experts does not scale well
 Twice as many data sets
 Twice as many domain experts??
 Ignore domain knowledge?
 No! Formulate it systematically
Some Answers to Challenges
 Large data quantity (Thesis)
 Many records
 P-Tree concept and its generalization to
non-spatial data
 Many attributes
 Algorithm that defies curse of dimensionality
 New techniques / Joining separate fields
 Mining data on a graph
 Outlook: Mining data with time dependence
Challenge 1: Many Records
 Typical question
 How many records satisfy given conditions on
attributes?
 Typical answer
 In record-oriented database systems
 Database scan: O(N)
 Sorting / indexes?
 Unsuitable for most problems
 P-Trees
 Compressed bit-column-wise storage
 Bit-wise AND replaces database scan
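To make this concrete, here is a minimal, uncompressed sketch of the bit-column idea in Python; `bit_columns` and `count_equal` are illustrative helpers, not the actual P-Tree implementation (which adds tree compression on top):

```python
# Minimal sketch of the bit-column idea behind P-Trees (uncompressed,
# for illustration only): every bit position of an attribute is stored
# as one bitmask over all records, so a selection becomes bitwise ANDs
# plus a popcount instead of a full table scan.

def bit_columns(values, n_bits):
    """Turn integer attribute values into n_bits bitmasks over records."""
    cols = [0] * n_bits
    for row, v in enumerate(values):
        for b in range(n_bits):
            if (v >> b) & 1:
                cols[b] |= 1 << row
    return cols  # cols[b] has bit `row` set iff record `row` has bit b set

def count_equal(cols, n_rows, target):
    """Count records whose attribute equals `target`, using only ANDs."""
    mask = (1 << n_rows) - 1  # start with all records selected
    for b, col in enumerate(cols):
        # keep records whose b-th bit matches the target's b-th bit
        mask &= col if (target >> b) & 1 else (~col) & ((1 << n_rows) - 1)
    return bin(mask).count("1")  # popcount = number of matching records

ages = [3, 5, 3, 7, 3]  # toy 3-bit attribute
print(count_equal(bit_columns(ages, 3), len(ages), 3))  # -> 3
```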
P-Trees: Compression Aspect
P-Trees: Ordering Aspect
 Compression relies on long sequences of 0s or 1s
 Images
 Neighboring pixels are probably similar
 Peano-ordering
 Other data?
 Peano-ordering can be generalized
 Peano-order sorting
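As a rough illustration of generalized Peano-order sorting, the following sketch sorts records by a Z-order (Morton) key; the helper `peano_key` and the 3-bit toy data are assumptions for illustration:

```python
# Rough illustration of generalized Peano-order sorting: interleave the
# bits of all attributes into a Z-order (Morton) key and sort by it, so
# records that agree on high-order bits of many attributes end up
# adjacent, lengthening the 0/1 runs that P-Tree compression relies on.

def peano_key(values, n_bits):
    """Interleave attribute bits, most significant bits first."""
    key = 0
    for b in range(n_bits - 1, -1, -1):  # high-order bits first
        for v in values:
            key = (key << 1) | ((v >> b) & 1)
    return key

records = [(5, 1), (4, 0), (5, 0), (0, 7)]  # toy pairs of 3-bit attributes
records.sort(key=lambda r: peano_key(r, 3))
print(records)  # [(0, 7), (4, 0), (5, 0), (5, 1)] -- similar records adjacent
```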
Peano-Order Sorting
Impact of Peano-Order Sorting
 Speed
[Figure: bar chart "Time per Test Sample in Milliseconds" comparing Unsorted, Simple Sorting, and Generalized Peano Sorting on the adult, spam, mushroom, function, and crop data sets]
 Less than O(N) scaling for all algorithms; improvement especially for large data sets
[Figure: line chart "Impact of Sorting on Execution Speed": time in seconds vs. number of training points, 0 to 30,000]
So Far
 Answer to challenge 1: Many records
 P-Tree concept allows scaling better than O(N)
for AND (equivalent to database scan)
 Introduced effective generalization to non-spatial
data (thesis)
 Challenge 2: Many attributes
 Focus: Classification
 Curse of dimensionality
 Some algorithms suffer more than others
Curse of Dimensionality
 Many standard classification algorithms
 E.g., decision trees, rule-based classification
 For each attribute, 2 halves: relevant / irrelevant
 How often can we divide by 2 before the small size of the “relevant” part makes results insignificant? (see the numeric sketch below)
 Inverse of doubling the number of rice grains for each square of the chessboard
 Many domains have hundreds of attributes
 Occurrence of terms in text mining
 Properties of genes
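A hypothetical back-of-the-envelope check of the halving argument (record count and significance threshold are made-up numbers):

```python
# Hypothetical back-of-the-envelope version of the halving argument:
# each attribute split keeps (at best) half of the records, so with
# N records and a minimum leaf size for significance, only about
# log2(N / min_leaf) attributes can contribute. All numbers made up.
import math

N, min_leaf = 100_000, 50
print(math.floor(math.log2(N / min_leaf)))  # -> 10 attributes at most
```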
Possible Solution
 Additive models
 Each attribute contributes to a sum
 Techniques exist (statistics)
 Computationally intensive
 Simplest: Naïve Bayes

$$P(\mathbf{x} \mid C = c_i) = \prod_{k=1}^{M} P\big(x^{(k)} \mid C = c_i\big)$$

where $x^{(k)}$ is the value of the k-th attribute
 Considered an additive model: the logarithm of the probability is additive
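A minimal sketch of this additivity, with made-up toy probabilities:

```python
# Minimal sketch of why Naive Bayes is additive: taking the logarithm
# turns the product over attributes into a sum, so each attribute
# contributes one additive term per class. All probabilities made up.
import math

def log_posterior(x, cls, prior, cond_prob):
    """log P(c) + sum over k of log P(x_k | c), up to the normalizer."""
    return math.log(prior[cls]) + sum(
        math.log(cond_prob[cls][k][v]) for k, v in enumerate(x)
    )

prior = {"a": 0.6, "b": 0.4}  # toy model: 2 classes, 2 binary attributes
cond_prob = {
    "a": [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}],
    "b": [{0: 0.4, 1: 0.6}, {0: 0.9, 1: 0.1}],
}
x = (0, 1)
print(max(prior, key=lambda c: log_posterior(x, c, prior, cond_prob)))  # 'a'
```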
Semi-Naïve Bayes Classifier
 Correlated attributes are joined
 Has been done for categorical data
 Kononenko ’91, Pazzani ’96
 Previously: Continuous data discretized
 New (thesis)
 Kernel-based evaluation of correlation:

$$\mathrm{Corr}(a,b) = \frac{\dfrac{1}{N}\displaystyle\sum_{t=1}^{N} \prod_{k \in \{a,b\}} K^{(k)}\big(x^{(k)}, x_t^{(k)}\big)}{\displaystyle\prod_{k \in \{a,b\}} \dfrac{1}{N} \sum_{t=1}^{N} K^{(k)}\big(x^{(k)}, x_t^{(k)}\big)}$$

where t runs over the N data points and each sum is a kernel density estimate
[Figure: distribution function with kernel density estimate, y-axis 0 to 0.1]
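A sketch of the correlation measure as reconstructed above, assuming Gaussian kernels and an arbitrary bandwidth; these choices are illustrative, not necessarily those of the thesis:

```python
# Sketch of the reconstructed correlation measure: ratio of the joint
# kernel density estimate to the product of the marginal estimates,
# evaluated at a query point x. Gaussian kernels, the bandwidth h, and
# the toy data are assumptions for illustration.
import math

def gauss(u, v, h):
    return math.exp(-((u - v) ** 2) / (2.0 * h * h))

def corr(data, a, b, x, h=1.0):
    n = len(data)
    joint = sum(gauss(x[a], t[a], h) * gauss(x[b], t[b], h) for t in data) / n
    marg_a = sum(gauss(x[a], t[a], h) for t in data) / n
    marg_b = sum(gauss(x[b], t[b], h) for t in data) / n
    return joint / (marg_a * marg_b)

# correlated toy data: the second attribute tracks the first
data = [(0.0, 0.1), (1.0, 1.2), (2.0, 1.9), (3.0, 3.1)]
print(corr(data, a=0, b=1, x=(1.0, 1.0)))  # > 1 signals correlation
```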
Results
 Error decrease in units of standard deviation
for different parameter sets
 Improvement for wide range of correlation
thresholds: 0.05 (white) to 1 (blue)
[Figure: "Semi-Naïve Classifier Compared with P-Tree Naïve Bayes": decrease in error rate (about -5 to 25) for parameter sets (a), (b), and (c) on the spam, crop, adult, sick-euthyroid, mushroom, gene-function, and splice data sets]
So Far
 Answer to challenge 1: More records
 Generalized P-tree structure
 Answer to challenge 2: More attributes
 Additive algorithms
 Example: Kernel-based semi-naïve Bayes
 Challenge 3: More subject domains
 Data on a graph
 Outlook: Data with time dependence
Standard Approach to Data Mining
 Conversion to a relation (table)
 Domain knowledge goes into table
creation
 Standard table can be mined with
standard tools
 Does that solve the problem?
 To some degree, yes
 But we can do better
“Everything should be made as simple as possible, but not simpler” (Albert Einstein)
Claim: Representation as a single relation is not rich enough
 Example:
Contribution of a
graph structure to
standard mining
problems
 Genomics
 Protein-protein
interactions
 WWW
 Link structure
 Scientific publications
 Citations
Scientific American 05/03
Data on a Graph: Old Hat?
 Common Topics
 Analyze edge structure
 Google
 Biological Networks
 Sub-graph matching
 Chemistry
 Visualization
 Focus on graph structure
 Our work
 Focus on mining node data
 Graph structure provides connectivity
Protein-Protein Interactions
 Protein data
 From Munich Information Center for Protein
Sequences (also KDD-cup 02)
 Hierarchical attributes
 Function
 Localization
 Pathways
 Gene-related
properties
 Interactions
 From experiments
 Undirected graph
Questions
 Prediction of a property
(KDD-cup 02: AHR*)
 Which properties in
neighbors are relevant?
 How should we integrate
neighbor knowledge?
 What are interesting
patterns?
 Which properties say
more about neighboring
nodes than about the
node itself?
*AHR: Aryl Hydrocarbon Receptor Signaling Pathway
Possible Representations
 OR-based
 At least one neighbor has property
 Example: Neighbor essential true
 AND-based
 All neighbors have property
 Example: Neighbor essential false
 Path-based
(depends on maximum hops)
 One record for each path
 Classification: weighting?
 Association Rule Mining:
Record base changes
[Figure: example graphs of AHR nodes with essential and non-essential neighbors]
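A minimal sketch of the OR-based representation on a made-up toy graph (the AND-based variant only replaces `any` with `all`):

```python
# Minimal sketch of the OR-based representation: a node's neighbor
# feature is true if at least one neighbor has the property; the
# AND-based variant replaces any() with all(). Toy graph made up.
from collections import defaultdict

edges = [("p1", "p2"), ("p2", "p3")]  # undirected interactions
essential = {"p1": True, "p2": False, "p3": False}

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

neighbor_essential = {
    node: any(essential[n] for n in adj[node]) for node in essential
}
print(neighbor_essential)  # {'p1': False, 'p2': True, 'p3': False}
```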
Association Rule Mining
 OR-based representation
 Conditions
 Association rule involves AHR
 Support across a link greater than within a
node
 Conditions on minimum confidence and support
 Top 3 with respect to support:
AHR → essential
AHR → nucleus (localization)
AHR → transcription (function)
(Results by Christopher Besemann, project CSci 366)
Classification Results
 Problem
(especially path-based representation)
 Varying amount of information per record
 Many algorithms unsuitable in principle
 E.g., algorithms that divide domain space
 KDD-cup 02
 Very simple additive model
 Based on visually identifying relationship
 Number of interacting essential genes adds to
probability of predicting protein as AHR
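A sketch of that additive scoring idea on a made-up toy graph; the threshold is illustrative, not the competition entry's:

```python
# Sketch of the simple additive idea: each interacting essential gene
# adds to a protein's score, and proteins scoring above a threshold are
# predicted as AHR. Graph, scores, and threshold are illustrative, not
# the actual competition entry.

def ahr_scores(adj, essential):
    """Score each protein by its number of essential interaction partners."""
    return {p: sum(essential.get(n, False) for n in nbrs)
            for p, nbrs in adj.items()}

adj = {"p1": {"p2", "p3"}, "p2": {"p1"}, "p3": {"p1"}}
essential = {"p1": False, "p2": True, "p3": True}
scores = ahr_scores(adj, essential)
print({p for p, s in scores.items() if s >= 2})  # threshold 2 -> {'p1'}
```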
KDD-Cup 02: Honorable Mention
NDSU Team
Outlook: Time-Dependent Data
 KDD-cup 03
 Prediction of citations of scientific papers
 Old: Time-series prediction
 New: Combination with similarity-based
prediction
Conclusions and Outlook
 Many exciting problems in data mining
 Various challenges
 Scaling of existing algorithms (more records)
 Different properties in algorithms become
relevant (more attributes)
 Identifying and solving new domain-independent challenges (more subject areas)
 Examples of general structural
components that apply to many domains
 Graph-structure
 Time-dependence
 Relationships between attributes