Memory-Constrained Data Mining
Slobodan Vucetic
Assistant Professor
Department of Computer and Information Sciences
Center for Information Science and Technology
Temple University, Philadelphia
Scientific Data Mining Lab
Need (see Nature, March 23, 2006):
The amount of data in science grows every year
A shift from computers supporting scientists to playing a central role in testing, and even formulating, scientific hypotheses
Lab Mission:
Developing an interface between data analysis and applied sciences
Working on collaborative projects at the interface between computer
science and other disciplines (sciences, engineering, business)
Training students to become computational research scientists
Research Tasks:
Predictive Modeling
Pattern Discovery
Summarization
Scientific Data Mining Lab:
Research Challenges
Spatial and temporal dependency
High dimensional data
Data collection bias
Data and knowledge fusion from multiple sources
Large-scale data
Missing/noisy/unstable attributes …
Scientific Data Mining Lab:
Current Projects
Data Mining
Resource-Constrained Data Mining (NSF)
Earth Science Applications
Estimation of geophysical parameters from satellite data (NSF)
Biomedical Applications
Gene expression data analysis (NIH, PA Dept. of Health)
Bioinformatics of protein disorder (PA Dept. of Health)
Bioinformatics core facility (PA Dept. of Health)
Text mining and Information retrieval (NSF)
Spatial modeling of disease and infection spread
Spatial and Temporal Knowledge Discovery
Spatial-temporal data reduction (NSF)
Analysis of deregulated electricity markets
Analysis of highway traffic data
Scientific Data Mining Lab:
Multiple-Source Spatial-Temporal Data
Analysis
Aim: Accurate and efficient estimation of geophysical parameters from the MISR and MODIS instruments on the Terra satellite and from ground-based observations (huge data streams)
MISR: Multi-angle Imaging SpectroRadiometer
9 view angles at the Earth surface (0.0º, ±26.1º, ±45.6º, ±60.0º, ±70.5º)
4 spectral bands
400-km swath width
[Figure: MISR camera geometry, cameras Df through Da]
Vucetic, S., Han, B., Mi, W., Li, Z., Obradovic, Z., A
Data Mining Approach for the Validation of Aerosol
Retrievals, IEEE Geoscience and Remote Sensing
Letters, 2008.
Scientific Data Mining Lab:
Temporal Data Mining
Aim: analyze price vs. load dependencies by discovering semi-stationary segments in multivariate time series
Per-regime summary:

Regime   Size (hours)   Price prediction R² (Local)   Price prediction R² (Global)   Price Volatility
1        5707           0.79                          0.76                           40
2        4630           0.81                          0.75                           19
3        1425           0.72                          -0.49                          9
4        1191           0.48                          -0.03                          56

[Figure: price [$/MWh] vs. load [GWh], Jan 1998 - Oct 1999, with the four discovered pricing regimes marked]
Result: several pricing regimes existed in California market
Vucetic, S., Obradovic, Z. and Tomsovic, K. (2001) “Price-Load Relationships in California's
Electricity Market," IEEE Trans. on Power Systems.
Scientific Data Mining Lab:
Text Mining: Re-Ranking of Articles Retrieved by a
Search Engine
When a topic is difficult to express as a query, often:
No relevant articles are found by keyword search
Too many irrelevant articles are returned
Biomedical example:
"Apurinic/apyrimidinic endonuclease": 638 citations returned by PubMed
"Apurinic/apyrimidinic endonuclease disorder": 1 citation (irrelevant) returned
Result: Large lift of relevant retrievals in top 10
Han, B., Obradovic, Z., Hu, Z.Z., Wu, C. H. and Vucetic, S. (2006)
“Substring Selection for Biomedical Document Classification,” Bioinformatics.
Scientific Data Mining Lab:
Collaborative filtering
Aim: Predict preferences of an active customer given his/her
preferences on some items and a database of preferences of
other customers
Result: The regression-based collaborative filtering algorithm is superior to the neighbor-based approach: it is two orders of magnitude faster at on-line prediction, more accurate, and more robust to a small number of observed votes.
Vucetic, S., Obradovic, Z., Collaborative Filtering Using a Regression-Based Approach, Knowledge and
Information Systems, Vol. 7, No. 1, pp. 1-22, 2005.
Scientific Data Mining Lab:
Bioinformatics: Protein Disorder Analysis
Aim: Understanding protein disorder and its functions
Results:
• Protein disorder is very common (contrary to a 20th-century belief)
• The fraction of disorder varies widely across genomes
• Different types of disorder exist in proteins
• Disorder is involved in many important functions
[Figure: Kissinger et al., 1995]
Scientific Data Mining Lab:
Analysis of Highway Traffic Data
Aim: understand traffic patterns, predict traffic congestion and delays
In progress…
Scientific Data Mining Lab:
Spatio-Temporal Disease Modelling
Aim: predict infection or disease risk, given information about population movement
[Figure 1: Illustration of location clusters (activity type vs. location type) and the associated infection risks.]
Result: movement information is very useful in prediction of the infection risk
Vucetic, S., Sun, H., Aggregation of Location Attributes for Prediction of Infection Risk, Workshop on
Spatial Data Mining: Consolidation and Renewed Bearing, SDM, Bethesda, MD, 2006.
Scientific Data Mining Lab:
Resource-Constrained Data Mining
Aim:
Efficient knowledge discovery from large data by limited-capacity
computing devices
Approach:
Integration of data mining and data compression
Figure 1. (left) Noisy checkerboard data; the goal is to discriminate between black and yellow dots, and the achievable accuracy is 90%. (middle) 100 randomly selected examples and the trained prediction model, with 76% accuracy. (right) 100 examples selected by the reservoir algorithm and the trained prediction model, with 88% accuracy.
Resource-Constrained Data Mining:
Motivation
Data mining objective:
Efficient and accurate algorithms for learning from large data
Performance measures:
Accuracy
Scaling with data size (# examples, #attributes)
Mainstream data mining:
many accurate learning algorithms that scale linearly or even sublinearly with data size and dimension, in both runtime and space
Caveat:
linear space scaling is often not sufficient: it implies unbounded growth in memory with data size
Challenge:
how to learn from large, or practically infinite, data sets/streams
using limited memory resources
Resource-Constrained Data Mining:
Learning Scenario
Examples are observed sequentially, in a single pass
Data stream examples are independent and identically distributed (IID)
A data summary can be stored in a reservoir with fixed memory
Resource-Constrained Data Mining:
Approaches
Model-Free: Reservoir Approach
Maintains a random sample of size R from data stream
Add xₜ with probability min(1, R/t); remove a randomly chosen element
Caveat: random sampling is often not optimal
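The reservoir approach above keeps a uniform random sample of fixed size R from the stream. A minimal sketch of the standard reservoir-sampling rule (a generic illustration, not the lab's implementation):

```python
import random

def reservoir_sample(stream, R, rng=random.Random(0)):
    """Maintain a uniform random sample of size R from a stream.

    Example t (1-based) is added with probability min(1, R/t);
    when the reservoir is full, it replaces a random element.
    """
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if len(reservoir) < R:
            reservoir.append(x)              # always add while below capacity
        elif rng.random() < R / t:           # otherwise add with probability R/t
            reservoir[rng.randrange(R)] = x  # ...replacing a random element
    return reservoir
```

Each example is seen exactly once, and the memory footprint is fixed at R regardless of the stream length.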
Data-Free: Online algorithms
Updates the model as examples are observed
Perceptron: wₜ₊₁ = wₜ + (yₜ − f(xₜ)) xₜ, where f(x) = wᵀx
Caveat: sensitive to data ordering
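The online update above can be exercised on a toy stream. This sketch follows the slide's rule but adds a small learning rate eta, which the slide omits; the data and the target concept are hypothetical:

```python
import numpy as np

def online_update(w, x, y, eta=0.1):
    """One step of the slide's rule, w <- w + eta*(y - f(x))*x with f(x) = w'x.
    The learning rate eta is an assumption; the slide omits it."""
    return w + eta * (y - w @ x) * x

# single pass over a toy IID stream; the true concept is sign(x1 + x2)
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(200):
    x = rng.normal(size=2)
    y = 1.0 if x[0] + x[1] > 0 else -1.0
    w = online_update(w, x, y)
```

Only the current weight vector is stored, so memory is constant; but, as the caveat notes, the result can depend on the order in which examples arrive.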
Hybrid: Data + Model
Implicitly done with Support Vector Machines (SVMs)
Resource-Constrained Data Mining:
Objective
Develop a memory-constrained SVM algorithm
What is SVM?
Popular data mining algorithm for classification
The most accurate on many problems
Theoretically and practically appealing
Computationally expensive
Cubic training time cost, O(N³) (e.g., neural nets are O(N))
Quadratic training memory cost, O(N²) (e.g., neural nets are O(N))
Linear prediction cost, O(N) (e.g., neural nets are O(1))
Resource-Constrained Data Mining:
SVM Overview
Goal:
Use x1 and x2 to predict class y ∈ {−1, 1}
Assume a linear prediction function f(x) = w1x1 + w2x2 + b
sign(f(x)) is the final prediction
Challenge:
Which is better, f1(x) or f2(x)?
What is the best choice for f(x)?
Answer:
The best f(x) has the most wiggle room: it has the largest margin
[Figure: two candidate separating lines f1(x) and f2(x) in the (x1, x2) plane]
Resource-Constrained Data Mining:
SVM Overview
Maximizing the margin is equivalent to:
minimize ||w||²
such that yᵢ f(xᵢ) ≥ 1
What if data are noisy? Introduce slack variables ξᵢ:
minimize ||w||² + C Σᵢ ξᵢ
such that yᵢ f(xᵢ) ≥ 1 − ξᵢ, ξᵢ ≥ 0
What if the problem is nonlinear? Map inputs to a feature space:
x → Φ(x)
Resource-Constrained Data Mining:
SVM Overview
Standard approach: convert to the dual problem

Primal:
minimize ||w||² + C Σᵢ ξᵢ
such that yᵢ f(xᵢ) ≥ 1 − ξᵢ, ξᵢ ≥ 0

Dual:
min over α:  W = ½ Σᵢ,ⱼ αᵢ Qᵢⱼ αⱼ − Σᵢ αᵢ + b Σᵢ yᵢ αᵢ,  subject to 0 ≤ αᵢ ≤ C

where Qᵢⱼ = yᵢ yⱼ Φ(xᵢ)ᵀΦ(xⱼ) = yᵢ yⱼ K(xᵢ, xⱼ), and K is the kernel function
Gaussian kernel: K(xᵢ, xⱼ) = exp(−||xᵢ − xⱼ||² / A)
The αᵢ are Lagrange multipliers
The optimization becomes a quadratic programming problem (minimizing a convex function under linear constraints)
The optimal solution can be found in O(N³) time and O(N²) space
SVM predictor:
f(x) = Σᵢ yᵢ αᵢ K(xᵢ, x) + b,  sum over i = 1…N
To predict the class of example x, we compare it against all training examples with αᵢ > 0
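The predictor above is a weighted sum of kernel evaluations against the stored examples with αᵢ > 0. A minimal sketch with the Gaussian kernel from the previous slide and a hypothetical two-vector model (toy values, not the authors' code):

```python
import numpy as np

def gaussian_kernel(xi, xj, A=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / A); A is the kernel-width parameter."""
    return np.exp(-np.sum((xi - xj) ** 2) / A)

def svm_predict(x, support, y, alpha, b, A=1.0):
    """f(x) = sum_i y_i alpha_i K(x_i, x) + b; the class is sign(f(x))."""
    return sum(a * yi * gaussian_kernel(xi, x, A)
               for xi, yi, a in zip(support, y, alpha)) + b

# tiny illustrative model: two stored vectors of opposite class
support = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
y = [+1, -1]
alpha = [1.0, 1.0]   # only examples with alpha_i > 0 need to be stored
b = 0.0
```

Note that prediction cost grows with the number of stored vectors, which is exactly why bounding that number matters in the memory-constrained setting.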
Resource-Constrained Data Mining:
SVM Overview
f(x) = Σᵢ yᵢ αᵢ K(xᵢ, x) + b,  sum over i = 1…N
Support vectors: 0 < αᵢ < C
Error vectors: αᵢ = C
Reserve vectors: αᵢ = 0
[Figure: decision boundary with margins f(x) = −1 and f(x) = +1]
Resource-Constrained Data Mining:
Incremental-Decremental SVM
The standard SVM solution is "batch", meaning that all training data must be available for learning
The alternative is an "online" SVM that can be updated when new training data arrive
Incremental-Decremental SVM [Cauwenberghs, Poggio, 2000]
For each new example, the update takes
O(Nₛ²) time, Nₛ = number of support vectors (0 < αᵢ < C)
O(NₛN) memory; since Nₛ = O(N), memory is O(N²)
The total cost of online training on N examples is
O(N³) time
O(N²) memory
The same as for batch mode
Resource-Constrained Data Mining:
Memory-Constrained IDSVM
Idea
Modify IDSVM by upper-bounding the number of support vectors
How: Twin Vector Machine (TVM)
Define a budget B and a set of pivot vectors q₁…q_B
Quantize each example to its nearest pivot:
Q(x) = {q_k, k = arg min_{j=1:B} ||x − q_j||}
D = {(xᵢ, yᵢ), i = 1…N} → Q(D) = {(Q(xᵢ), yᵢ), i = 1…N}
Original SVM:
minimize ||w||² + C Σᵢ ξᵢ
such that yᵢ f(xᵢ) ≥ 1 − ξᵢ, ξᵢ ≥ 0, i = 1…N
SVM on the quantized data:
minimize ||w||² + C Σⱼ (nⱼ⁺ ξⱼ⁺ + nⱼ⁻ ξⱼ⁻)
such that f(qⱼ) ≥ 1 − ξⱼ⁺, −f(qⱼ) ≥ 1 − ξⱼ⁻, ξⱼ⁺, ξⱼ⁻ ≥ 0, j = 1…B
Training an SVM on Q(D) is equivalent to an SVM on TV = {TVⱼ, j = 1…B} (Twin Vector Set)
TVⱼ = {(qⱼ, +1, nⱼ⁺), (qⱼ, −1, nⱼ⁻)} (Twin Vector)
O(N³) → O(B³) (constant) time; O(N²) → O(B²) (constant) memory
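The quantization step above maps each incoming example to its nearest pivot. A small sketch (the pivot values are toy numbers chosen for illustration, not from the talk):

```python
import numpy as np

def quantize(x, pivots):
    """Q(x): return the index k of the nearest pivot, k = argmin_j ||x - q_j||."""
    d = np.linalg.norm(pivots - x, axis=1)  # distances to all B pivots
    return int(np.argmin(d))

pivots = np.array([[0.0, 0.0],   # B = 3 toy pivots
                   [1.0, 1.0],
                   [2.0, 0.0]])
```

After quantization, the per-pivot counts nⱼ⁺ and nⱼ⁻ summarize how many positive and negative examples mapped to pivot qⱼ, which is all the twin vector stores.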
Resource-Constrained Data Mining:
Online TVM
Online-TVM
Input: data stream D = {(xᵢ, yᵢ), i = 1…N}, budget B, kernel function K, slack parameter C
Output: TVM with parameters α₁⁺, α₁⁻, …, α_B⁺, α_B⁻, and b
1. Initialize TVM = 0, TV = ∅
2. for i = 1 to N
3.   if Beneficial(xᵢ)
4.     Update-TV
5.     Update-TVM
Resource-Constrained Data Mining:
Online TVM
Beneficial
1. if size(TV) < B or |f(xᵢ)| ≤ m₁
2.   return 1
3. else
4.   return 0
[Figure: decision-function levels +1, 0, −1 with a buffer of width m₁ around the margins; examples inside the buffer are considered beneficial]
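The admission test can be sketched directly. The comparison |f(x)| ≤ m1 is an assumption, since the relation symbol did not survive the transcript; the function name is mine:

```python
def beneficial(f_x, tv_size, B, m1):
    """Admit an example while the twin-vector set is below the budget B,
    or when the example falls inside the margin buffer |f(x)| <= m1."""
    return 1 if tv_size < B or abs(f_x) <= m1 else 0
```

In the online loop, examples rejected by this test are simply discarded, so most of the stream never touches the expensive SVM update.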
Resource-Constrained Data Mining:
Online TVM
Update-TV
1. s = size(TV)
2. TV_{B+1} = {(xᵢ, yᵢ, 1), (xᵢ, −yᵢ, 0)}
3. if s < B
     TV_{s+1} = TV_{B+1}
4. elseif max_{i=1:B} |f(qᵢ)| > m₂
     k = arg max_{i=1:B} |f(qᵢ)|
     TV_k = TV_{B+1}
5. else
     find the best pair TVᵢ, TVⱼ to merge
     use (**) to calculate q_new
     TVᵢ = {(q_new, +1, sᵢ⁺ + sⱼ⁺), (q_new, −1, sᵢ⁻ + sⱼ⁻)}
     TVⱼ = TV_{B+1}

q_new = ((sᵢ⁺ + sᵢ⁻) qᵢ + (sⱼ⁺ + sⱼ⁻) qⱼ) / (sᵢ⁺ + sᵢ⁻ + sⱼ⁺ + sⱼ⁻)   (**)

[Figure: twin vectors with |f(qᵢ)| > m₂ are candidates for replacement]
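Equation (**) merges two pivots into a single one by a count-weighted average, so pivots that summarize more examples pull the merged point toward themselves. A direct sketch (the function name is mine):

```python
import numpy as np

def merge_pivots(qi, qj, si_plus, si_minus, sj_plus, sj_minus):
    """Equation (**): average of the two pivots, weighted by their
    total (positive + negative) example counts."""
    wi = si_plus + si_minus
    wj = sj_plus + sj_minus
    return (wi * qi + wj * qj) / (wi + wj)
```

With equal counts the merged pivot is the midpoint; with unequal counts it lands closer to the heavier pivot.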
Resource-Constrained Data Mining:
Online TVM
Merging Heuristics:
Nearest versus Weighted
Global versus One-Sided
Rejection merging
[Figure: illustration of OneSideMerge and GlobalMerge on the decision-function levels +1, 0, −1]
Resource-Constrained Data Mining:
Results
[Figures: TVM decision boundaries on 2-D toy data after 100, 400, and 10,000 stream examples; budget B = 100]
Resource-Constrained Data Mining:
Results
[Figures: On noisy checkerboard data (budget B = 100): accuracy vs. length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling; CPU time (in seconds) vs. length of data stream for TVM and IDSVM]
Resource-Constrained Data Mining:
Results
[Figures: On noisy checkerboard data: accuracy and CPU time (in seconds) vs. length of data stream for TVM with budgets 50, 100, and 200]
Resource-Constrained Data Mining:
Results
[Figures: On noisy checkerboard data (budget B = 100): accuracy with vs. without the buffer; on the Adult data set: accuracy of OneSideMerge vs. GlobalMerge]
Resource-Constrained Data Mining:
Results
[Figures: Accuracy vs. length of data stream (log scale) for TVM, IDSVM, LIBSVM, and Random Sampling on the Banana, Gauss, Adult, Checkerboard, IJCNN, and Pendigits data sets]
Resource-Constrained Data Mining:
Conclusions
Memory-Constrained SVM is successful
Significantly higher accuracy than baseline
Close to the optimal approach
Merging heuristics are very important
Future work
Further improvements
Forgetting
Probabilistic merging
Use data compression
Non-IID streams
Thank You!
More information:
http://www.ist.temple.edu/~vucetic/
Collaboration/assistantship contact:
Slobodan Vucetic
CIS Department, IST Center,
Temple University
[email protected]