Transcript Main Title

Charu C. Aggarwal (I2 Contributions)
Scalable Graph Querying and Indexing
Task I2.2
Charu C. Aggarwal
IBM
Collaborators (across all tasks): Jiawei Han (UIUC), Guojun Qi
(UIUC), Thomas Huang (UIUC), Tarek Abdelzaher (UIUC), Xifeng
Yan (UCSB), Arjit Khan (UCSB), Nan Li (UCSB), Amotz Bar-Noy
(CUNY), Simon Shamoun (CUNY)
Other Tasks: I1.2 (funded), I3 (collaborator)
INARC
Overview of Contributions
 Project I1 Contributions:
 Methods for Sensor Selection in Dynamic Information Networks (Collaboration
with A. Bar-Noy (CUNY) and S. Shamoun (CUNY))=>Submitted to DCOSS 2011
 Data fusion of heterogeneous data with the use of network links: specific
application to text and visual data
 QoI robust inference with the use of heterogeneous data fusion
 Joint work with G. Qi (UIUC) and T. Huang (UIUC)
 Papers accepted in CVPR, WWW 2011; one submitted to KDD 2011
 Collaboration with Tarek Abdelzaher (UIUC) on data selection for regression
and fact-finders for fusion=> Submitted to DCOSS 2011, Fusion 2011
 Project I2 Contributions: Large Scale Indexing (Focus of this talk):
 Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao
(UIUC), and Han (UIUC))=> Submit to PVLDB 2012
 Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan
(UCSB), Yan (UCSB))=> Submit to PVLDB 2012
 Methods for label-based query index (Joint work with Li (UCSB))=>SDM 2011
INARC
Overview of Contributions
 Project I3 Contributions:
 Unfunded Collaborator for project I3=> Actively collaborated with
Jiawei Han on mining of networks with heterogeneous links and
incomplete attributes (Task I3.1)
 Designed methods for clustering heterogeneous information networks
with heterogeneous links and incomplete attributes (joint work with
Yizhou Sun (UIUC) and Jiawei Han (UIUC)) => Submitted to KDD 2011
 Designed methods for link inference in the noise and heterogeneous
information network scenario (joint work with Barbier (UIUC), Gupta
(UIUC), Sun (UIUC) and Han (UIUC))=> Submitted to ASONAM 2011
INARC
Scalable Indexing and Querying Massive Graphs (I2.2)
 Indexing Methods for Large Scale Static and Dynamic Networks
 Methods for Indexing Massive Disk-Resident Graphs (Aggarwal
(IBM), Zhao (UIUC), and Han (UIUC))
Application to Shortest Path Queries
Can be extended to connectivity and other structural
queries
 Methods for Indexing Dynamic Network Streams (Aggarwal
(IBM), Khan (UCSB), Yan (UCSB))
Applications such as social and information networks show
continuous edge-base activity
Results in edge streams
Methods for frequency-based and structural indexing of
graph streams
INARC
Label-based Querying of Massive Dynamic Graphs
 Many networks have labels associated with some of the nodes which
may need to be learned or queried
 Eg. Which node belongs to a specific topic?
 Challenges:
 Network may be dynamic with edges and nodes which
continuously evolve over time
 The nodes in the network may contain content, which needs to
be used in the classification process
 The classification-queries need to be resolved in real-time.
 Designed methods for constructing dynamic structural index, which
can be continuously updated and used for label-based queries
(Aggarwal (IBM) and Li (UCSB)): Accepted to SIAM Conference on
Data Mining, 2011.
INARC
Indexing Graphs: The compression approach
 Core idea is to coarsen the graph into a smaller number of nodes
 Method for coarsening depends upon application:
 In static (disk-resident) methods, we use dense pattern mining
methods in order to coarsen graph into a smaller number of
nodes
 In dynamic (stream-based) methods, we use a hash-based
probabilistic sketch in order to coarsen graph into a smaller
number of nodes.
 Use probabilistic query-processing on the coarsened graph in order
to provide approximate responses to queries
INARC
Static Indexing of Disk-Resident Graphs
 Design methods for indexing massive disk-resident networks
 Typical social and information networks may be massive and may
need to be stored on disk
 For example, for a network containing 10^7 nodes, we may have
of the order of 10^(13) edges
 Network cannot be held in main memory, and in some cases not
even on disk
 First present methods where network cannot be held in main
memory, but is stored on disk
 Design methods for shortest path indexing methods: For a given
pair of nodes s and t, determine shortest distance between them
 Very fast for memory resident graphs=> Very slow for diskresident case
INARC
Static Indexing of Disk-Resident Graphs
 Core Approach for static indexing: Determine dense sets of nodes
in the network using a pattern mining approach
 Compress sets of nodes into supernodes
 Node compression will result in self-loops: Eliminate self-loops
 Perform query-processing on compressed graph
 Note that passing through super-node is equivalent to passing
through a set of edges: associate node-penalty with super-node
 Observation: Vast majority of edges lie in dense regions which are
compressed into self-loops
 Massive reduction in size of network: allows main-memory storage
for query-processing
INARC
Setting Node Penalties (Observations)
 When we run the query-processing technique on the coarsened
graph as a surrogate, passing through a super-node is modeled as
entering through a random node in the compressed subgraph and
exiting another random node
 Use probabilistic node penalty associated with such nodes
corresponding to pairwise shortest-path distances in the
compressed portion
 Modified Query Processing: Determine shortest-path distance
between s’ and t’ in compressed graph with node-penalties
 Note that node-penalties are random variables and the shortestpath distance could be average or worst-case.
 Important: Dense region compression leads to histogram with small
number of buckets (most distances are 1 or 2 within region).
INARC
Setting Node Penalties
INARC
Query Processing
 Start at node s’ and progressively maintain the random variables
corresponding to the shortest average (or worst-case) distances to
the nodes.
 Maintain probabilistic label with each node containing the
probabilistic histogram with shortest-path distance
 Grow set S of nodes with probabilistic shortest path distances
 In each iteration, we grow set S by one node after updating the
node labels of its nearest neighbors (requires probabilistic addition
of distances of shortest path to edge weight and node penalty).
 Terminate on reaching sink node t’
 Provides either approximation using compressed graph or access
compressed fragments from disk to get exact distances
INARC
Adding shortest path distances to node penalty
INARC
Other details of approach
 Other algorithmic components of approach
 A dense pattern mining algorithm which is designed in order to
determine the dense regions of the network for compression
purposes
Design a sampling-based approach which uses a limited
number of passes over the disk-resident data set
 A method for reconstructing the shortest path from the
compressed shortest path distance
Retrieve fragments from disk-resident data set
Use shortest path distance within each fragment to
reconstruct the distance
INARC
Experimental Results (DBLP Data Set)
INARC
Dynamic Indexing of Massive Graph Streams (Joint work with A. Khan
(UCSB) and X. Yan (UCSB))
 The previous methods were designed for static indexing of massive
graphs
 In many applications, we need indexes for the case in which the
graphs may continuously be updated over time (edge activity)
 Edge-based activity in networks result in a stream of edges
 Activity in a social network can be modeled as an edge-stream
 Assumption is much more rigorous in previous case
 Not enough space to hold even the edges on disk and their
arrival times
 Not enough space to hold even the total number of possible
distinct edges on disk
 Hard to perform structural analysis when the network is so
dynamic and cannot even be stored at a given point in time
INARC
Dynamic Indexing of Massive Graph Streams
 Use a 3-dimensional hash-based sketch structure in order to
create the dynamic index
 Borrows the concept of a sketch structure from data stream
analysis and generalizes it to the case of structural data
 Queries which can be handled:
 Determine the frequency of a given edge in the graph
 Determine the aggregate frequency of edges in a subgraph
 Determine the minimum frequency of edges in a subgraph
 Determine all edges with frequency above a given threshold
 Determine the connected components with frequency above a
given threshold
INARC
The sketch based approach: Core Idea
 Model: Assume that edges are received continuously over time
 Use a 3-dimensional matrix of size h X h X w
 The parameter h indicates the range of the hashing
 The parameter w indicates the number of hash functions
 For an edge (i, j) the node i is hashed into the first dimension
using k-th hash function g^k() with range [0, h-1]
 The node j is hashed into the second dimension using k-th hash
function g^k() in the range [0, h-1]
 There are w slices of size h X h corresponding to different
hash functions g^k() where k lies in the range [1, w]
 Multiple hash functions provide robustness
 The entire frequency behavior of edges is mapped into the
structure of size h X h X w
INARC
The sketch based approach: Core Idea
 Model: Assume that edges are received continuously over time
 Use a 3-dimensional matrix of size h X h X w
 The parameter h indicates the range of the hashing
 The parameter w indicates the number of hash functions
 For an edge (i, j) the node i is hashed into the first dimension
using k-th hash function g^k() with range [0, h-1]
 The node j is hashed into the second dimension using k-th hash
function g^k() in the range [0, h-1]
 There are w slices of size h X h corresponding to different
hash functions g^k() where k lies in the range [1, w]
 Multiple hash functions provide robustness
 The entire frequency behavior of edges is mapped into the
structure of size h X h X w
INARC
The sketch based approach: Stream Incremental Update
 The update process is straightforward and requires an
incremental step involving the application of w different
hash functions for each incoming edge (i, j)
 Start off by setting each cell in the h X h x w synopsis
matrix V to 0
 Assume incoming edge (i, j) and frequency r(i, j)
 For each incoming edge (i, j) and k-th hash function
g^k(), update the cell V(g^k(i), g^k(j), k) by r(i, j)
 The update process is applied for each of the w
different hash functions
INARC
Subgraph Frequency-based Queries
 Query: Determine frequency of edges in subgraph with
node set S
 For each edge (i, j) in subgraph S, determine the value
of V(g^k(i), g^k(j), k) for different hash functions k
 Determine the minimum value of V(g^k(i), g^k(j), k) over
the different hash functions k
 Sum up value over different edges
 Key Result: Derived mathematical error bound => With
probability at least 1 – (|S|*|S|/(h*h*f))^w the error is
at most L.f, where L is the total frequency of edges
received=> Details of proof available in paper.
INARC
Inverse Frequency-based Queries
 Query: Determine all edges with frequency above a given
threshold
 Key Augmentation to Data Structure: Each hash cell contains
an inverted list of all the nodes which map to it
 For a given frequency threshold determine all hash cells (for
each hash function) above the threshold
 Determine the intersection of all the inverted lists pointed
to by determined hash cells=> using multiple hash functions
to reduce collision error.
 Report only edges which are present in the intersection of
the hash functions => error reduces logarithmically with
number of hash functions
INARC
Connected-subgraph Queries
Query: Determine all connected subgraphs
containing only edges with frequency above a
given threshold
Determine edges using the approach discussed in
previous query
Determine connected components with the use
of the found edges
Numerous other structural queries can be
handled with the use of the compressed graph.
Error bounds hold for a subset of queries
INARC
Experimental Results (last.fm Data Set)
INARC
Label-based querying of Massive Graphs (Joint work With N. Li
(UCSB))
 We assume that we have a massive graph with nodes
interconnected by edges, and nodes which contain content.
 The nodes and edges of the graph are continuously received over
time.
 Need to respond to label-based queries dynamically using both the
structure and content
 Need to use the content in the classification process
 Core Idea:
 Construct semi-bipartite augmentation of the network which
uses the content in the form of synthetic word nodes
 Create a fast inverted index which can respond to label-based
queries.
INARC
Augmented Bipartite Network
INARC
Key Results
 Method is designed for dynamic classification queries
(no static pre-processing) A dynamic classification
index
 Maintain the semi-bipartite augmentation dynamically
 Maintain inverted representation with the word
nodes=>Incorporation of content with structure
 Perform random walk starting at query node and report
majority label as the appropriate class
 Theoretical Result: Sampling a certain number of walks
reduces the error estimation exponentially=> Proof
available in paper => Use of Hoeffding bound
 Paper accepted in SDM, 2011.
INARC
Experimental Results
INARC
Military Relevance
 Military information networks are extremely large and
require dynamic and time-varying methods for querying
and indexing
 Require real-time responses to structural queries for
data which is large and may be added to rapidly =>
network stream scenario
 Methods are also applicable to resource-constrained
environments even for very large data sets
 Classification query provides methods to identify nodes
of relevance in a given information network query =>
which is the node most relevant to a given topic? =>
Where is the information I want?
INARC
The Path Ahead: Future Work
 Extend the kinds of queries which can be handled by
different kinds of methods:
Structural connectivity queries
Shortest paths in dynamic graph streams
Extend first method to dynamic case
 Extend methods to uncertain graphs
Challenge of uncertainty: Expected path length methods
do not work!
Even a small probability of disconnection of source and
sink leads to expected path length of infinity
Design probabilistic model which uses the shortest path
with threshold probability
INARC
Publications (Accepted)
 C. C. Aggarwal (IBM), N. Li (UCSB). On Dynamic Node-Classification in
Content-based Networks, Accepted to the SDM Conference, 2011.
 G. Qi (UIUC), C. C. Aggarwal (IBM), T. Huang. (UIUC) Towards Semantic
Knowledge Propagation between Text and Web Images, Accepted to the WWW
Conference, 2011.
 G. Qi (UIUC), C. C. Aggarwal (IBM), T. Huang (UIUC). Towards CrossCategory Learning of Visual Concepts. Accepted to the CVPR Conference,
2011.
 C. C. Aggarwal (IBM), A. Khan (UCSB), X. Yan (UCSB). On Flow Authority
Discovery in Social Networks, Accepted to the SDM Conference, 2011.
 C. C. Aggarwal, Y. Xie, P. Yu. On Community Discovery in Locally
Heterogeneous Networks, Accepted to the SDM Conference, 2011.
 C. C Aggarwal, Y. Zhao, P. Yu. On Wavelet Decomposition of Uncertain
Streams, CIKM Conference, 2010.
 M. Masud, L. Khan, B. Thuraisingham, C. Aggarwal (IBM), J Gao (UIUC), J.
Han (UIUC), On Novel Class Detection in Concept Drifting Data Streams,
ICDM Conference, 2010
INARC
Publications (Submitted)
 G. Qi (UIUC), C. C. Aggarwal (IBM), T. Huang (UIUC). Community
Detection with Edge Content, Submitted to the SIGIR Conference,
2011.
 Y. Sun (UIUC), C. C. Aggarwal (IBM), J. Han (UIUC), On
Community Discovery in Heterogeneous Networks with Incomplete
Attributes, Submitted to the KDD Conference, 2011.
 G. Qi (UIUC), C, C, Aggarwal (IBM), T. Huang (UIUC), Transfer
Learning with Distance Functions between Text and Images,
Submitted to the KDD Conference, 2011.
 C. C. Aggarwal (IBM), A. Bar-Noy (CUNY), S. Shamoun (CUNY).
On Sensor Selection in Linked Information Networks, Submitted to
the DCOSS Conference, 2011.
INARC
Publications (Submitted)
 Y. Sun (UIUC), C. C. Aggarwal (IBM), J. Han (UIUC), On Link Inference
in Bibliographic Networks, Submitted to the ASONAM Conference, 2011.
 M. Gupta (UIUC), C, C, Aggarwal (IBM), J. Han (UIUC), On Evolutionary
Clustering and Analysis of Heterogeneous Information Networks, Submitted
to the ASONAM Conference, 2011.
 C. C. Aggarwal (IBM), A. Bar-Noy (CUNY), S. Shamoun (CUNY). On
Sensor Selection in Linked Information Networks, Submitted to the
DCOSS Conference, 2011.
 C. C. Aggarwal (IBM), P. Zhao (UIUC), J. Han (UIUC). On Shortest Path
Indexing with Disk-Resident Graphs, Submitted to the PVLDB Conference,
2012.
 C. C. Aggarwal (IBM), A. Khan (UCSB), X. Yan (UCSB). On Query
Processing of Massive Graph Streams, Submitted to the PVLDB
Conference, 2012.
INARC
Publications (Submitted)
 T. Abdelzaher (UIUC), D. Wang (UIUC), H. Ahmadi (UIUC), J.
Pasternack (UIUC), M. Gupta (UIUC), J. Han (UIUC), O. Fetemieh
(UIUC), H. Le (UIUC) C. C. Aggarwal (IBM), On Bayesian Information of
Fact-Finding in Information networks, Submitted to the Fusion Conference,
2011.
 D. Wang (UIUC), H. Ahmadi (UIUC), T. Abdelzaher (UIUC), H. Chenji. R.
Stoleru, C, C, Aggarwal (IBM). Data Models for Optimizing Quality-ofInformation in Cost-Sensitive Data Fusion, Submitted to the DCOSS
Conference, 2011.
INARC