Beyond Streams and Graphs: Dynamic Tensor Analysis

Beyond Streams and Graphs:
Dynamic Tensor Analysis
Jimeng Sun, Dacheng Tao, Christos Faloutsos
Speaker: Jimeng Sun
Motivation
Goal: incremental pattern discovery on streaming applications
Streams:
  E1: Environmental sensor networks
  E2: Cluster/data center monitoring
Graphs:
  E3: Social network analysis
  E4: Network forensics
Tensors:
  E5: Financial auditing
  E6: fMRI: brain image analysis
How to summarize streaming data effectively and incrementally?
E3: Social network analysis
Traditionally, people focus on static networks and find community structures.
We plan to monitor the change of the community structure over time and identify abnormal individuals.
[Figure: author × keyword matrices for 1990 and 2004, showing a DB community in 1990 joined by a DM community by 2004]
E4: Network forensics
Directional network flows
A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
450 GB/hour with compression
Task: identify abnormal traffic patterns and find out the cause
[Figure: source × destination traffic matrices for normal vs. abnormal traffic]
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
Static Data model
[Figure: a source × destination matrix at time = 0]
For a timestamp, the stream measurements can be modeled using a tensor.
Dimension: a single stream, e.g., <Christos, "graph">
Mode: a group of dimensions of the same kind, e.g., Source, Destination, Port
Static Data model (cont.)
Tensor
Formally, an Mth-order tensor is X ∈ R^(N1 × N2 × … × NM).
Generalization of matrices
Represented as a multi-array or data cube

Order | Correspondence | Example
1st   | Vector         | Sensors
2nd   | Matrix         | Authors × Keywords
3rd   | 3D array       | Sources × Destinations × Ports
Dynamic Data model (our focus)
[Figure: a sequence of source × destination matrices arriving over time]
Streams come with structure:
(time, source, destination, port)
(time, author, keyword)
Dynamic Data model (cont.)
Tensor Streams
A sequence of Mth-order tensors X1, …, Xn, where n is increasing over time.

Order | Correspondence       | Example
1st   | Multiple streams     | Sensors × time
2nd   | Time-evolving graphs | (author, keyword, time)
3rd   | 3D arrays            | (source, destination, port, time)
Dynamic tensor analysis
[Figure: as a new tensor arrives, the old core tensors and the projection matrices U_Source, U_Destination are incrementally updated]
Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion
Background – Singular value decomposition (SVD)
SVD: A_(m×n) ≈ U_(m×k) Σ_(k×k) V^T_(k×n)
Best rank-k approximation in L2
PCA is an important application of SVD; so is latent semantic indexing (LSI).
Singular vectors are useful for clustering and correlation detection.
Example (LSI): a document-term matrix over the terms {cluster, frequent, pattern, query, cache} decomposes into document-concept × concept-association × concept-term matrices, grouping the DM terms apart from the DB terms:

  [1 1 1 0 0]   [0.18  0  ]
  [2 2 2 0 0]   [0.36  0  ]
  [1 1 1 0 0]   [0.18  0  ]   [9.64  0  ]   [0.58 0.58 0.58  0    0  ]
  [5 5 5 0 0] = [0.90  0  ] x [ 0   5.29] x [ 0    0    0   0.71 0.71]
  [0 0 0 2 2]   [ 0   0.53]
  [0 0 0 3 3]   [ 0   0.80]
  [0 0 0 1 1]   [ 0   0.27]
  document-term  document-     concept-       concept-term
                 concept       association
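As a sanity check, here is a minimal numpy sketch (ours, not from the talk) that reproduces the rank-2 decomposition above; singular vectors are recovered up to sign:

```python
import numpy as np

# the document-term matrix from the LSI example
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the two strongest concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-2 approximation in L2

print(np.round(s[:k], 2))      # -> [9.64 5.29]
print(np.round(U[:, :k], 2))   # document-to-concept weights (up to sign)
```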
Tensor Operation: Matricize X(d)
Unfold a tensor into a matrix.
Example: a 2×2×2 tensor with frontal slices [1 3; 2 4] and [5 7; 6 8] unfolds along the first mode as
X(1) = [1 3 5 7]
       [2 4 6 8]
Thanks to Tammy Kolda for this slide.
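A minimal numpy sketch of matricization (ours; note that the column ordering of the unfolding depends on the convention, so it may differ from the slide's layout):

```python
import numpy as np

def matricize(X, m):
    """Unfold tensor X along mode m: rows are indexed by mode m."""
    return np.moveaxis(X, m, 0).reshape(X.shape[m], -1)

X = np.arange(1, 9).reshape(2, 2, 2)   # a 2x2x2 tensor holding 1..8
print(matricize(X, 0))
# [[1 2 3 4]
#  [5 6 7 8]]
```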
Tensor Operation: Mode-product
Multiply a tensor with a matrix along one mode.
[Figure: a (source × destination) tensor multiplied along the source mode by a (source group × source) matrix yields a (source group × destination) tensor]
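A minimal numpy sketch of the mode-m product, implemented by unfolding, multiplying, and folding back (the function name and shapes are our own illustration):

```python
import numpy as np

def mode_product(X, U, m):
    """Multiply tensor X along mode m by matrix U (shape: groups x X.shape[m])."""
    Xm = np.moveaxis(X, m, 0).reshape(X.shape[m], -1)   # matricize along mode m
    Ym = U @ Xm                                         # transform that mode
    new_shape = (U.shape[0],) + tuple(np.delete(X.shape, m))
    return np.moveaxis(Ym.reshape(new_shape), 0, m)     # fold back into a tensor

# e.g., aggregate 50 sources into 5 source groups:
X = np.random.rand(50, 60, 10)      # (source, destination, port)
U = np.random.rand(5, 50)           # source -> source-group mapping
print(mode_product(X, U, 0).shape)  # (5, 60, 10)
```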
Related work
Low-rank approximation
  PCA, SVD: orthogonal projection based
Stream mining
  Scan data once to identify patterns
  Sampling: [Vitter85], [Gibbons98]
  Sketches: [Indyk00], [Cormode03]
Multilinear analysis
  Tensor decompositions: Tucker, PARAFAC, HOSVD
Graph mining
  Explorative: [Faloutsos04], [Kumar99], [Leskovec05], …
  Algorithmic: [Yan05], [Cormode05], …
Our work sits at the intersection of these areas.
Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion
Tensor analysis
Given a sequence of tensors X1, …, Xt, …, find the projection matrices U1, …, UM (one per mode) such that the reconstruction error e is minimized:

e = Σt ‖ Xt − Xt ×1 (U1 U1ᵀ) ×2 (U2 U2ᵀ) … ×M (UM UMᵀ) ‖²F

[Figure: a (sources × destinations × ports) tensor is compressed into a small core tensor by the source, destination, and port projections]

Note that this is a generalization of PCA when n is a constant.
Why do we care?
Anomaly detection
  Reconstruction-error driven
  Multiple resolutions
Multiway latent semantic indexing (LSI)
  [Figure: author projection UA and keyword projection UK associate author groups (e.g., Michael Stonebraker; Philip Yu) with keyword concepts (DB; DM) across 1990–2004, supporting pattern and query matching]
1st order DTA - problem
Given x1, …, xn, where each xi ∈ R^N, find U ∈ R^(N×R) such that the reconstruction error e is small:

e = Σi ‖ xi − U Uᵀ xi ‖²

[Figure: an n × N matrix X of sensor streams (indoor, outdoor, …) is projected onto R dimensions]

Note that Y = XU.
1st order DTA
Input: new data vector x ∈ R^N, old variance matrix C ∈ R^(N×N)
Output: new projection matrix U ∈ R^(N×R)
Algorithm:
1. Update the variance matrix: Cnew = xᵀx + C
2. Diagonalize: Cnew = U Λ Uᵀ
3. Determine the rank R and return U

Diagonalization has to be done for every new x!
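A minimal numpy sketch of this 1st-order DTA update (ours; the forgetting factor from the paper is omitted for simplicity):

```python
import numpy as np

def dta_update(C, x, R):
    """Fold one new vector into the variance matrix and re-diagonalize."""
    C = C + np.outer(x, x)                  # C_new = x^T x + C
    eigvals, eigvecs = np.linalg.eigh(C)    # diagonalize: C_new = U Lambda U^T
    top = np.argsort(eigvals)[::-1][:R]     # keep the R strongest eigenvectors
    return C, eigvecs[:, top]

N, R = 100, 5
C = np.zeros((N, N))
for _ in range(10):                         # a full eigen-decomposition per point!
    C, U = dta_update(C, np.random.rand(N), R)
```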
1st order STA
Adjust U smoothly when new data arrive, without diagonalization [VLDB05].
For each new point x and for i = 1, …, k:
  yi := Uiᵀ x               (projection onto Ui)
  di ← di + yi²             (energy ∝ i-th eigenvalue)
  ei := x − yi Ui           (error)
  Ui ← Ui + (1/di) yi ei    (update estimate)
  x ← x − yi Ui             (repeat with remainder)

[Figure: two sensors; the current line U is rotated toward a new point]
Project onto the current line, estimate the error, then rotate the line in the direction of the error and in proportion to its magnitude.
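A minimal numpy sketch of this SPIRIT-style update (ours; the rank k is fixed here rather than adapted as in [VLDB05]):

```python
import numpy as np

def sta_update(U, d, x):
    """Adjust the k tracked directions in U for one new data vector x."""
    x = x.astype(float).copy()
    for i in range(U.shape[1]):
        y = U[:, i] @ x              # project onto current estimate U_i
        d[i] += y * y                # energy, tracks the i-th eigenvalue
        e = x - y * U[:, i]          # estimation error
        U[:, i] += (y / d[i]) * e    # rotate U_i toward the error
        x = x - y * U[:, i]          # pass the remainder to the next direction
    return U, d

N, k = 100, 5
U = np.linalg.qr(np.random.rand(N, k))[0]   # initial orthonormal directions
d = np.full(k, 1e-3)                        # small initial energies
for _ in range(10):                         # stream of data points
    U, d = sta_update(U, d, np.random.rand(N))
```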
Mth order DTA
For each mode d:
1. Matricize the new tensor: X(d)
2. Reconstruct the variance matrix: Cd ← Ud Sd Udᵀ
3. Update the variance matrix with the incremental tensor: Cd ← Cd + X(d)ᵀ X(d)
4. Diagonalize the variance matrix: Cd = Ud Sd Udᵀ
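A minimal numpy sketch of one Mth-order DTA step (ours; under our row-major unfolding, Xm·Xmᵀ plays the role of X(d)ᵀX(d) in the slide):

```python
import numpy as np

def dta_step(X, Us, Ss, Rs):
    """One DTA step: update (U_d, S_d) for every mode d with the new tensor X."""
    for m in range(X.ndim):
        Xm = np.moveaxis(X, m, 0).reshape(X.shape[m], -1)  # matricize along mode m
        C = Us[m] @ np.diag(Ss[m]) @ Us[m].T               # reconstruct variance matrix C_d
        C += Xm @ Xm.T                                     # update with the new tensor
        eigvals, eigvecs = np.linalg.eigh(C)               # diagonalize C_d
        top = np.argsort(eigvals)[::-1][:Rs[m]]            # keep rank R_d
        Us[m], Ss[m] = eigvecs[:, top], eigvals[top]
    return Us, Ss

shape, Rs = (50, 50, 10), [5, 5, 3]
Us = [np.zeros((n, r)) for n, r in zip(shape, Rs)]   # empty initial subspaces
Ss = [np.zeros(r) for r in Rs]
Us, Ss = dta_step(np.random.rand(*shape), Us, Ss, Rs)
```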
Mth order DTA – complexity
Storage: O(Π Ni), i.e., the size of an input tensor at a single timestamp
Computation: Σ Ni³ (or Σ Ni²) for diagonalizing the Cd, plus (Σ Ni)(Π Ni) for the matrix multiplications X(d)ᵀ X(d)
For low-order tensors (order < 3), diagonalization is the main cost.
For high-order tensors, matrix multiplication is the main cost.
Mth order STA
Run 1st-order STA along each mode: matricize the new tensor as X(d) and feed each column to the 1st-order update of Ud.
[Figure: a column x of X(d) updates U1 via its projection y1 and error e1]
Complexity:
  Storage: O(Π Ni)
  Computation: Σ Ri Π Ni, which is smaller than DTA
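A minimal sketch of the Mth-order STA step, reusing sta_update from the 1st-order sketch above (ours):

```python
import numpy as np

def sta_step(X, Us, ds):
    """One STA step: stream every mode-m fiber of X through the 1st-order update."""
    for m in range(X.ndim):
        Xm = np.moveaxis(X, m, 0).reshape(X.shape[m], -1)  # matricize along mode m
        for col in Xm.T:                                   # each column has length N_m
            Us[m], ds[m] = sta_update(Us[m], ds[m], col)   # 1st-order STA, no diagonalization
    return Us, ds
```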
Roadmap
Motivation and main ideas
Background and related work
Dynamic and streaming tensor analysis
Experiments
Conclusion
Experiment
Objectives:
  Computational efficiency
  Accurate approximation
  Real applications
    Anomaly detection
    Clustering
Data set 1: Network data
TCP flows collected at the CMU backbone
Raw data: 500 GB with compression
Construct 3rd-order tensors <source, destination, port> with hourly windows
Each tensor: 500 × 500 × 100
1200 timestamps (hours)
Sparse data with a power-law value distribution
[Figure: value distribution, 10AM to 11AM on 01/06/2005]
Data set 2: Bibliographic data (DBLP)
Papers from VLDB and KDD conferences
Construct 2nd-order tensors <author, keyword> with yearly windows
Each tensor: 4584 × 3741
11 timestamps (years)
Computational cost
[Figure: CPU time (sec) over time for the 3rd-order network tensor and the 2nd-order DBLP tensor; OTA is the offline tensor analysis]
Performance metric: CPU time (sec)
Observations:
  DTA and STA are orders of magnitude faster than OTA
  The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time)
Accuracy comparison
[Figure: accuracy over time for the 3rd-order network tensor and the 2nd-order DBLP tensor]
Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%
Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes.
Network anomaly detection
[Figure: reconstruction error over time, with abnormal traffic spiking well above normal traffic]
Reconstruction error gives an indication of anomalies.
The prominent difference between normal and abnormal traffic is mainly due to unusual scanning activity (confirmed by the campus admin).
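A minimal sketch of such reconstruction-error-driven detection, reusing mode_product from the earlier sketch; the mean + 2σ threshold is our illustrative choice, not the paper's rule:

```python
import numpy as np

def recon_error(X, Us):
    """Squared reconstruction error of tensor X under projections Us (one per mode)."""
    Xhat = X
    for m, U in enumerate(Us):
        Xhat = mode_product(Xhat, U @ U.T, m)   # project and reconstruct mode m
    return float(np.sum((X - Xhat) ** 2))

def is_anomalous(e, history, n_std=2.0):
    """Flag error e if it exceeds mean + n_std * std of the past errors."""
    return len(history) > 10 and e > np.mean(history) + n_std * np.std(history)

# per timestamp: e = recon_error(X_t, Us); flag = is_anomalous(e, history); history.append(e)
```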
Multiway LSI
Authors                                                                   | Keywords                                                   | Year | Concept
michael carey, michael stonebreaker, h. jagadish, hector garcia-molina   | queri, parallel, optimization, concurr, objectorient      | 1995 | DB
surajit chaudhuri, mitch cherniack, michael stonebreaker, ugur cetintemel| distribut, systems, view, storage, servic, process, cache | 2004 | DB
jiawei han, jian pei, philip s. yu, jianyong wang, charu c. aggarwal     | streams, pattern, support, cluster, index, gener, queri   | 2004 | DM

Two groups are correctly identified: databases and data mining.
People and concepts are drifting over time.
Conclusion
Tensor stream is a general data model.
DTA/STA incrementally decompose tensors into core tensors and projection matrices.
The results of DTA/STA can be used in other applications:
  Anomaly detection
  Multiway LSI
Final word: think structurally!
The world is not flat; neither should data mining be.
Contact:
Jimeng Sun
[email protected]