Transcript Sahin

Presented by
Ozgur D. Sahin
Outline







Introduction
Neighborhood Functions
ANF Algorithm
Modifications
Experimental Results
Data Mining using ANF
Conclusions
Introduction & Motivation

Graph-based data is becoming more importatnt


Example Questions:





Internet modeling, academic citations, phone records,
movie databases, CAD circuits
How robust is the Internet to failures?
What are the most influential database papers?
What is the best opening move in tic-tac-toe?
Are phone call patterns in Asia similar to those in the U.S.?
Goal: Quickly answer questions on graphrepresented data
Answering Questions

We can answer these questions if we can
compute following three properties related to
connectivity and neighborhood structure:




Graph Similarity: Decide if two graphs have similar
connectivity/neighborhood structure
Subgraph Similarity: Compare how two subgraphs of a
given graph are connected
Vertex Importance: Assign an importance to each node
based on its connectivity
This paper provides such a tool: ANF
(Approximate Neighborhood Function)
Challenges

Following properties should be satisfied:







Error Guarantees: Accurate estimates
Fast: Scale linearly with n (# of nodes) and m (# of edges)
Low Storage
Adapts to available memory
Parallelizable
Sequential scan of the edge file
Estimates per node
Definitions - Neighborhood Functions
dist(u,v): # of edges on the shortest path from u to v
Define following neighborhood functions:
Definitions - Neighborhood Functions
Generalize these two definitions to deal with subgraphs:
Basic ANF Algorithm

N(h) can be computed by a graph traversal



Graph traversal accesses edges in random order
Running time is O(nm)
Access edges in sequential order:

M(x,h) is the set of nodes within distance h of node x
Basic ANF Algorithm

How to compute the number of distinct elements in
the set M(x,h):
 A dictionary data structure: O(n2log n) time/space
 Use bits to mark membership: O(n2) space
 Use ‘probabilistic counting algorithm’

Approximate set sizes using ‘log n+r’ bits
Probabilistic counting algorithm




Approximate set sizes using ‘log n+r’ bits
Instead of one bit per node, give half the nodes bit 0, a
quarter of them bit 1, and so on (A node is given bit i with
probability 1/2i+1)
The approximation of the size of a set is proportional to 2b,
where b is the least bit that has not been set in the bit
representation of this set
Use k parallel approximations

M(x,h) is represented by k(log n+r) bits
Basic ANF Algorithm

Consider a ring with 5 nodes



M(2,1) is the union of M(2,0), M(1,0), and M(3,0):


Example for k=3 and r=0
Bit 0 is the leftmost bit in each 3-bit mask
M(2,1)=M(2,0) OR M(1,0) OR M(3,0)
IN(2,1) is computed from the average of the least zero bit positions:

Avg=(2+1+1)/3=4/3  IN(2,1) = (24/3)/0.77359 = 3.25
Basic ANF Algorithm
Modifications



M(x,h) uses M(y,h-1) but not M(y,h-2), so just
keep the M(y,h-1) during iteration h.
Include a mark bit to handle generalized
neighborhood functions
Break bit masks into smaller pieces if they
are larger than the available memory
Leading Ones Compression



As ANF runs, most bit masks will have many
leading 1’s
Compress bit masks by including a counter of
the leading ones
Bit shuffling of k parallel bit masks enables
further compression:


11010,11100  1111011000
Provides up to 23% speed-up
Experiments

Data Sets: 3 real (Router, Cornell, Cora) and 4 synthetic

Evaluation Metric:
Experiments - Accuracy
k=64:
- ANF achieves less than 7% error
- ANF’s error is independent of the data set
Experiments - Time
Experiments - Scalability
Data Mining with ANF

ANF tool can be used to answer graph mining
problems:




Best opening move for Tic-Tac-Toe game
Clustering movie classes
Measuring the robustness of the Internet
Use summarized statistics derived from
neighborhood function:

Many real graphs follow a power law:


N(h) hH, where H is defined as the ‘hop exponent’
Use ‘individual hop exponent’ as a measure of importance
Tic-Tac-Toe



Show: The best opening move is the center square
Each possible board configuration is a node and
there is an edge from board x to board y if it is a
possible move
Compute individual neighborhood functions for each
of the 9 possible first moves
Clustering Movies


Consider IMDB (Internet Movie Data Base) where each
movie is identified as being in one or more classes (such as
documentaries, dramas, comedies, etc)
Construct a graph for each class and cluster similar ones
Internet Router Data

How robust the Internet is to router failures

Delete some number of routers and measure connectivity
-Random failures do
not disrupt the Internet
-Targeted failures can
dramatically disrupt it
Conclusions


ANF uses an efficient and accurate
approximation algorithm
ANF tool provides several advantages
including following:





Accurate
Fast
Low storage requirements
Parallelizable
ANF makes it possible to answer many
interesting questions