Slides for INARC mid

Transcript Slides for INARC mid

I2.2 Large-Scale Information Network Processing
Mid-Year Report
Charu Aggarwal (IBM)
Christos Faloutsos (CMU)
Ambuj Singh (UCSB)
Xifeng Yan (UCSB)
March, 2011
Task Setting
Indexing, Partitioning, and Distributed Processing
on Time-Varying Networks
2
INARC I2.2 Mid-Year Report


Objectives
– Novel graph index model and advanced graph distributed computing
theory to facilitate processing of (military) linked data that becomes a
bottleneck for many research tasks in network science
Key Technical Innovations:
− Dynamic graph indexing models and structures
− Scalable graph processing
− Graph partition overlapping and re-balancing theory


Primary Members
– Xifeng Yan (UCSB), Ambuj Singh (UCSB), Charu Aggarwal (IBM), Christos
Faloutsos (CMU)
Collaborative Members
– Z. Wen IBM/SCNARC, J. Bao RPI/IRC, J. Han UIUC/INARC, M. Srivatsa
IBM/INARC, V. Kawadia BBN/IRC, S. Desai Army
3
I2.2: Large-Scale Information Network Processing
Key Objective:
−
Novel graph index model and advanced graph distributed
computing theory to facilitate processing of (military) linked
data that becomes a bottleneck for many research tasks in
network science
Deliverables:
Q1: Data collection and cleaning for graph indexing and
distributed graph computing
Q2: Design graph indices with time-varying concerns
Q3: Design and test distributed graph computing strategies
Q4: Hypotheses validation and research paper submission
Impact:
−
Provide fast, scalable, and linked information access to
soldiers and commanders
Key Technical Innovations
−
Dynamic graph indexing models and structures to resolve
graph queries in time-varying information networks
−
Query cost models for distributed graph processing
−
Graph partition overlapping and re-balancing theory to (1)
improve locality of data for parallel computing, and (2)
accommodate dynamic network data updates and query
workload changes
−
Self evolving distributed graph processing environment to
adjust graph partitions dynamically
Role
Researchers
Lead
Xifeng Yan, UCSB
Primary
Ambuj Singh, UCSB
Primary
Charu Aggarwal, IBM
Primary
Christos Faloutsos, CMU
Collab
Z. Wen, IBM, SCNARC
Collab
G. Cao, PSU, CNARC
Collab
J. Han, UIUC, INARC
Total
$326.7K
4
Advance State-of-the-Art Network Science




Large-Scale Information Network Processing: Invent scalable
information network infrastructure
Facilitate processing of (military) linked data that becomes a
bottleneck for many research tasks in network science
Advance our understanding of scalability challenges, not only for
information networks but also for other genres of complex networks
The models and the proposed experimental systems provide
fundamental analysis of
– How indexing of dynamic network data affects query performance,
– How graph partitioning schemes affect distributed query processing,
– How the models and laws of real networks affect the design of graph
indexing and partitioning strategy
5
5
Military Relevance



Subtask 1: Graph Index and Search (UCSB, IBM)
– Fast access and processing of time-varying information networks is the
key for tasks such as intelligence service and query processing. Simply
speaking, we cannot access networks nodes by nodes!
Subtask 2: Graph over MapReduce (CMU)
– To process overwhelming amount of data on the Web, social networks,
emails, telecommunications, to distill important information such as
people’s opinion about extremists, to find potential radical groups, to
identify influential nodes, we need powerful graph processing methods.
– Needed by any large-scale network data processing including
information, social and communication networks
Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU)
– Military information is often distributed in many devices, distributed
graph processing run graph algorithms without putting all data
together in the same machine
6
6
Subtask 1: Graph Index and Search

Indexing Methods for Large Scale Static and Dynamic Networks
 Methods for Indexing Massive Disk-Resident Graphs (Aggarwal
(IBM), Zhao (UIUC), and Han (UIUC))
 Methods for Indexing Dynamic Network Streams (Aggarwal (IBM),
Khan (UCSB), Yan (UCSB))

Dynamic structural index for label-based queries (Aggarwal
(IBM) and Li (UCSB)): SDM 2011 accepted.

Analysis of significant substructures in time-varying networks
(Singh (UCSB) et al.)
 Find highest scoring substructures combines structure and time
7
gDensity: Model-Based Indexing

Problem definition (labeled proximity search)
– Label-based graph proximity search, seeks to find
the top-k vertex subsets with the smallest diameters,
for a given query of distinct labels. Each subset must
cover all the labels specified in the query.
Q=(“reconnaissance”,
Q=(a,
b, c)
“biometric matching”,
“failure modeling”)
d=2
d=3
Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
8
gDensity: Ideas and Results

Can we do better?
Which one is more
promising?
u’s density
distribution

v’s density
distribution
10 – 300 times faster
Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal
9
Graph Search: a Model-Based Approach

Align two networks

Ideas




Use information
propagation model to
(a) linkedin
propagate labels in
information networks
Convert vertices to vectors
Align sets of vectors
(b) facebook
Query Speed: 0.1 sec for
WebGraph:10M vertices, 213M
edges
Information Propagation Model
A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11
10
SEARCH ALGORITHM

Step 1: Match a node u of target
graph G with some node v of query
graph Q, if L(v) ⊆ L(u) and cost(u,v) is
less than a predefined cost threshold ε.
u1
 Step 2: Discard the labels of the
f
unmatched nodes in the target graph.
u4
u2
v1
v3
v2
u3
v4
u5
u6
 Step 3: Propagate the labels only
among the matched nodes from the
previous step. Repeat steps 1 and 2
until no node can be discarded further.
G
Q
11
Dynamic Updates
Dynamic Update in Index vs. Re-indexing (DBLP)

Indexing is performed for h=2 hops.
12
Subtask 2: Graph Over MapReduce

Investigate graph properties and graph algorithms using
MapReduce
– Spectral Analysis of Billion-Scale Graphs
– Patterns on the Connected Components of Terabyte-Scale
Graphs

Study the limitation of the MapReduce architecture on
processing network-centric data
– Using the discovered patterns of terabyte-scale real-life
graphs.
13
13
13
Spectral Analysis of Billion-Scale Graphs

Billion-Scale Eigen-solver





Computes top-k eigen-values and eigenvectors
Find anomalies in large graphs.
Many application: SVD, triangle counting, spectral clustering, …
A careful implementation of Lanczos on hadoop can give
excellent accuracy as well as scalability
Contribution:


HEigen: a billion-scale eigensolver which can handle 1000x
larger matrices than previous methods
Application of the eigensolver on the twitter graph helps us
spot abnormal users (adult advertisers)
U Kang, et al. Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11
14
Patterns on the Connected Components of
Terabyte-Scale Graphs

A large graph is composed of many connected
components
– Q1: static patterns?
– Q2: evolution patterns?
Count
– Q3: model?
Metric:
Graph Fractal Dimension(G): log |E| / log |V|
Size
YahooWeb graph
|V| = 1.4 billion
|E| = 6.7 billion
120 GBytes
U Kang, et al. Patterns on the Connected Components of Terabyte-Scale Graphs. ICDM 2010
15
Subtask 3: Graph Partition for Distributed Graph Computing



Are typical techniques efficient for graph
queries?
Graph partitioning and distribution
techniques (e.g., Pregel) Limitations:
– Unavailable to the public
– Unbalanced workload due to skewed
uniformly distributed graph queries.
– Communication overhead due to
inter‐machine (cross partition)
communication.
Sedge: distributed graph processing
– Model-based Graph Partitioning Techniques
– First-of-Its Kind Distributed Graph Computing Platform for
Information, Social, and Communication Networks
Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing
submitted to VLDB 2011
16
Graph Partition Models
 Complementary Partitions
- Generate partitions sets that are
complementary to each other

Dynamic Workload: Replicate Partitions
- Replicate partitions that are intensively
accessed by many queries

Dynamic Workload: New Partitions
- Generate new partitions that are
intensively accessed by many
cross-partition queries
Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing
submitted to VLDB 2011
17
Graph partitioning with region constraint

Optimal Solution:

Where:

NP-hard
18
Global Optimization

Iteratively repartition
the graph

Before each iteration, increase
the weight of edges in each
region wrt. its priority
19
Graph Partition for Distributed Graph Computing
# of Machines vs. Throughput Improvement Ratio

10,000 random queries. Increase partition number by adding
more machines.
20
Collaborations and Path Ahead


Collaborations within I2
– Monthly meeting
– Strong connection between I2.1 and I2.2: One problem, two sides.
information network processing on DTN and Clusters
– (I2.1) Work with Arun Iyengar and Mudahakar Srivatsa (IBM), who has
done much work on DTN and Storage. Shengqi Yang will intern at IBM
this summer.
Collaborations with researchers in other networks
– (S1.1) Work with Zhen Wen (IBM), on the social network application of
graph density indexing. U Kang was a summer intern at IBM
– (E1.1, R2.3) Work with Jie Bao (RPI), on RDF queries using neighborhoodbased graph search.
– (T2.3) Work with Vikas Kawadia (BBN), on using graph query processing
for distributed trust computing. Ziyu Guan is collaborating with Vikas
– Graph search has connection with (T2.4) M. Goldberg’s work on trust
structure.
– Work with Sachi Desai (Army) on graph query language/system.
21
Next Six Months and Path Ahead to 2012

Continue research on large-scale information network
processing (more specific)
(1) Graph indexing on multiple time-varying graph snapshots
(2) Compression-based, Model-based Info Network Processing
(3) Edge lay-out on Hadoop file system for better compression
and better performance
(4) Complementary graph partitioning theories.

Other research topics planned
– Models and methods for building complex graph queries
– Models and methods for routing complex graph queries to
data sources (for both I2.1 and I2.2)
– Tensor analysis on Hadoop
22
Research Papers (Accepted/Published)






A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based
Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of
Data (SIGMOD'11), 2011.
Nicholas D Larusso and Ambuj K. Singh, "Synopses for Probabilistic Data over
Large Domains", EDBT'11
C. C. Aggarwal, N. Li, On Dynamic Node-Classification in Content-based
Networks, SIAM International Conference on Data Mining (SDM) 2011
U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos. Patterns on the
Connected Components of Terabyte-Scale Graphs. IEEE International Conference
on Data Mining (ICDM) 2010, Sydney, Australia.
U Kang, Brendan Meeder, Christos Faloutsos, Spectral Analysis of Billion-Scale
Graphs: Discoveries and Implementation, PAKDD'11
U Kang, Duen Horng Chau, and Christos Faloutsos. Mining Large Graphs:
Algorithms, Inference, and Discoveries. IEEE International Conference on Data
Engineering (ICDE) 2011, Hannover, Germany.
23
Research Papers





Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing LargeScale Graphs for Efficient Distributed Processing, submitted to VLDB 2011
Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity
Search on Large Graphs, to be submitted to VLDB Journal
Petko Bogdanov, Misael Mogiovi, Ambuj Singh, Mining Heavy-Edges Subnetworks
in Time, to be submitted to VLDB Journal
C. C. Aggarwal, P. Zhao, J. Han. On Shortest-Path Indexing of Massive Disk
Resident Graphs, Research Report, to be submitted to VLDB Journal
C. C. Aggarwal, A. Khan, X. Yan. A Probabilistic Index for Massive and Dynamic
Graph Streams, Research Report, to be submitted to VLDB Journal
24
Big Picture
Stage 1: How to distribute graphs (we are here)
Stage 2: How to construct queries
Stage 3: How to execute/route queries
Make Information Network Accessible by Soldiers and Commanders
25
Questions?
26

Slides for INARC mid

Transcript Slides for INARC mid

Directory