Guest lecture I: Amos Fiat’s Social Networks class
Edith Cohen
TAU, December 2014
Topics Today
Intro to summary structures and streaming
Counting distinct elements
MinHash sketches: approximate distinct count and set relations
Intro to working with large graphs
Computing MinHash sketches of all reachability sets
Applications to estimating reachability cardinality, influence, similarity
Synopsis (Summary) Structures
A small summary of a large data set that
(approximately) captures some statistics/properties
we are interested in.
Examples: random samples, sketches/projections,
histograms, …
[Diagram: Data 𝒙 → Synopsis 𝑺]
Query a synopsis: Estimators
A function 𝑓 we apply to a synopsis 𝑺 in order to
obtain an estimate 𝑓(𝑺) of a
property/statistics/function 𝑓(𝒙) of the data 𝒙
[Diagram: Data 𝒙 with query 𝑓(𝒙); Synopsis 𝑺 yields the estimate 𝑓(𝑺)]
Synopsis Structures
A small summary of a large data set that
(approximately) captures some statistics/properties
we are interested in.
Useful features:
Easy to add an element
Mergeable: can create a summary of the union from summaries of the data sets
Deletions/“undo” support
Flexible: supports multiple types of queries
Mergeability
[Diagram: Synopsis 1 of Data 1 and Synopsis 2 of Data 2 are merged into Synopsis 12 of Data 1 + 2]
Enough to consider merging two sketches
Why mergeability is useful
[Diagram: a merge tree over five synopses: S. 1∪2 from Synopses 1 and 2, S. 3∪4 from Synopses 3 and 4, S. 1∪2∪5 from S. 1∪2 and Synopsis 5, and finally the synopsis of 1∪2∪3∪4∪5]
Synopsis Structures: Why?
Data can be too large to:
Keep for long or even
short term
Transmit across the
network
Process queries over
in reasonable
time/computation
[Image: “Data, data, everywhere.” The Economist, 2010]
The Data Stream Model
Data is read sequentially in
one (or few) passes
We are limited in the size of
working memory.
We want to create and
maintain a synopsis which
allows us to obtain good
estimates of properties
Streaming Applications
Data Sources
Traffic going through high-speed routers (data cannot be revisited)
Feeds from user interactions with
Web services (social, commerce,
search)
Scientific data, satellite feeds
Applications:
Network monitoring
Efficient synopsis for queries (anomaly
detection, access patterns, ad
placement models)
I/O efficiency (sequential access is
cheaper than random access)
Streaming model
Sequence of elements from some domain
<x1, x2, x3, x4, ..... >
Bounded storage:
working memory << stream size
usually 𝑂(log^𝑘 𝑛) or 𝑂(𝑛^𝛼) for 𝛼 < 1
Fast processing time per stream element
What can we compute over a stream ?
32, 112, 14, 9, 37, 83, 115, 2,
Some functions are easy: min, max, sum, …
We use a single register 𝒔, simple update:
• Maximum: Initialize 𝒔 ← 0. For element 𝒙: 𝒔 ← max{𝒔, 𝒙}
• Sum: Initialize 𝒔 ← 0. For element 𝒙: 𝒔 ← 𝒔 + 𝒙
The “synopsis” here is a single value.
It is also mergeable.
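To make this concrete, here is a minimal Python sketch of these single-value synopses (the function names stream_max and stream_sum are made up for this illustration):

```python
# Single-register synopses for a stream: one value s, O(1) update per element.

def stream_max(stream):
    s = float("-inf")        # initialize s (the slide uses 0 for non-negative streams)
    for x in stream:
        s = max(s, x)        # maximum: s <- max{s, x}
    return s

def stream_sum(stream):
    s = 0
    for x in stream:
        s += x               # sum: s <- s + x
    return s

# Mergeability: the synopsis of a union is computed from the synopses alone,
# e.g. the max of two maxima, or the sum of two sums.
print(stream_max([32, 112, 14, 9, 37, 83, 115, 2]))  # 115
print(stream_sum([32, 112, 14, 9, 37, 83, 115, 2]))  # 404
```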
Counting Distinct Elements
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Keys occur multiple times; we want to count the number of distinct keys in the stream
In this example:
The number of distinct keys is 𝒏 = 6
The number of stream elements is 11
Counting Distinct Elements:
Some Applications
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
IP Packet streams: Number of distinct IP
addresses or IP flows (source+destination IP, port,
protocol)
Anomaly detection, traffic monitoring
Search: Find how many distinct search queries
were issued to a search engine (on a certain
topic) yesterday
Web services: How many distinct users (cookies)
searched/browsed a certain term/item
advertising, marketing, trends
Distinct Elements: Exact Solution
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Exact solution:
Maintain an array/associative array/ hash table
Hash/place each element to the table
Query: count number of entries in the table
Problem: For 𝑛 distinct elements, size of table is Ω(𝑛)
But this is the best we can do (Information theoretically) if
we want an exact distinct count.
Distinct Elements: Approximate Counting
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4,
Exact counting of 𝑛 distinct elements requires a structure of size Ω(𝑛)
We are often happy with an approximate count
obtained using a small-size working memory.
We want to be able to compute and maintain
a small sketch 𝑠(𝑁) of the set 𝑁 of distinct
items seen so far 𝑵 = {𝟑𝟐, 𝟏𝟐, 𝟏𝟒, 𝟕, 𝟔, 𝟒}
Wanted: Distinct Elements Sketch
Small size: |s(𝑁)| ≪ |𝑁| = 𝑛
Can query s(𝑁) to get a good estimate 𝑛̂(𝑠) of 𝑛 (small relative error)
Streaming: for a new element 𝑥, easy to compute s(𝑁 ∪ {𝑥}) from s(𝑁) and 𝑥
Mergeability: if 𝑁1 and 𝑁2 are (possibly overlapping) sets, we can compute the union sketch 𝑠(𝑁1 ∪ 𝑁2) from 𝑠(𝑁1) and 𝑠(𝑁2)
MinHash Sketch: [Flajolet & Martin 85, …]
32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
ℎ(𝑥) ∼ 𝑈[0,1]: ℎ is a random hash function from keys to uniform random numbers in [0,1]
Maintain the Min-Hash value 𝑦:
Initialize 𝑦 ← 1
Processing an element with key 𝑥:
𝑦 ← min {𝑦, ℎ 𝑥 }
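A minimal Python sketch of this 𝑘 = 1 procedure; the salted SHA-256 hash below is an assumed stand-in for the random hash function ℎ of the slides:

```python
import hashlib

def h(x, salt="demo"):
    """Map a key to a pseudo-uniform value in [0,1) (stand-in for a random hash h)."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

y = 1.0                               # initialize y <- 1
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    y = min(y, h(x))                  # y <- min{y, h(x)}
# y is the minimum over the 6 distinct keys only: repeats never change it.
# E[y] = 1/(n+1), which is what the estimators below exploit.
```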
Distinct Elements: Approximate Counting
Stream (initially 𝑦 = 1):

𝑥:     32    12    14    32    7     12    32    7     6     12    4
ℎ(𝑥):  0.45  0.35  0.74  0.45  0.21  0.35  0.45  0.21  0.14  0.35  0.92
𝑛:     1     2     3     3     4     4     4     4     5     5     6
𝑦:     0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.14  0.14  0.14

The minimum hash value 𝑦 = min ℎ(𝑥) is:
Non-increasing and unaffected by repeated elements.
Precise relation: E[𝑦] = 1/(𝑛+1)
Distinct Elements: Approximate Counting
How does the minimum hash 𝑦 give information on the number of distinct elements 𝑛?
[Diagram: 𝑛 random points in the interval [0,1]; the minimum is marked]
The expectation of the minimum is E[min ℎ(𝒙)] = 𝟏/(𝒏+𝟏)
A single value gives only limited information.
To boost information, we maintain 𝒌 ≥ 𝟏 values
Why is the expectation 1/(𝑛+1)?
Take a circle of circumference 1
Throw a random red point to “mark” the start of the unit interval (circle points map to [0,1])
Throw another 𝑛 points independently at random
The circle is cut into 𝑛 + 1 segments by these points.
By symmetry, the expected length of each segment is 1/(𝑛+1).
The same holds for the segment clockwise from the red point, and its length is distributed exactly like the minimum of 𝑛 independent 𝑈[0,1] values.
MinHash Sketches
These sketches maintain 𝑘 values 𝒚𝟏, 𝒚𝟐, …, 𝒚𝒌 from the range of the hash function (distribution).
k-mins sketch: Use 𝑘 “independent” hash functions ℎ1, ℎ2, …, ℎ𝑘. Track the respective minimum 𝑦1, 𝑦2, …, 𝑦𝑘 for each function.
Bottom-k sketch: Use a single hash function ℎ. Track the 𝑘 smallest values 𝑦1, 𝑦2, …, 𝑦𝑘.
k-partition sketch: Use a single hash function ℎ′. Use the first log₂ 𝑘 bits of ℎ′(𝑥) to map 𝑥 uniformly to one of 𝑘 parts; call the remaining bits ℎ(𝑥). For 𝑖 = 1, …, 𝑘: track the minimum hash value 𝑦𝑖 of the elements in part 𝑖.
All sketches are the same for 𝑘 = 1
Min-Hash Sketches: Examples
k-mins, k-partition, bottom-k 𝒌 = 𝟑
𝑁 = { 32 , 12 , 14 , 7 , 6 , 4 }
The min-hash value and sketches only depend on
The random hash function/s
The set 𝑁 of distinct elements
Not on the order elements appear or their multiplicity
MinHash Sketches
k-mins, bottom-k, k-partition
Why study all 3 variants ? Different tradeoffs between
update cost, accuracy, usage…
Beyond distinct counting:
MinHash sketches can be used as samples of the distinct
key population
Estimating statistics over segments of the keys
When we sketch different data sets using same hash:
Similarity queries between datasets
Size of union of data sets
Complex relations
Min-Hash Sketches: Example
k-mins, 𝒌 = 𝟑

𝑥:      32    12    14    7     6     4
ℎ1(𝑥):  0.45  0.35  0.74  0.21  0.14  0.92
ℎ2(𝑥):  0.19  0.51  0.07  0.70  0.55  0.20
ℎ3(𝑥):  0.10  0.71  0.93  0.50  0.89  0.18

(𝑦1, 𝑦2, 𝑦3) = (0.14, 0.07, 0.10)
MinHash Sketches: k-mins
k-mins sketch: Use 𝑘 “independent” hash functions: ℎ1 , ℎ2 , … , ℎ𝑘
Track the respective minimum 𝑦1 , 𝑦2 , … , 𝑦𝑘 for each function.
Processing a new element 𝑥 :
For 𝑖 = 1, … , 𝑘 : 𝑦𝑖 ← min{ 𝑦𝑖 , ℎ𝑖 𝑥 }
Example: ℎ1(𝑥) = 0.35, ℎ2(𝑥) = 0.51, ℎ3(𝑥) = 0.71
Computation: 𝑂(𝑘) per element, whether or not the sketch is actually updated.
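A Python sketch of the k-mins update, reusing the illustrative hash h from the earlier example with a per-function salt (an assumption for this sketch, not the lecture’s construction):

```python
class KMinsSketch:
    """k-mins MinHash: one minimum per 'independent' hash function (illustrative)."""
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k                                  # (y_1, ..., y_k) <- (1, ..., 1)

    def add(self, x):
        for i in range(self.k):                             # O(k) work per element,
            self.y[i] = min(self.y[i], h(x, salt=f"h{i}"))  # updated or not
```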
MinHash Sketches: Example
k-partition, 𝒌 = 𝟑

𝑥:                  32    12    14    7     6     4
𝑖(𝑥) (part-hash):   2     3     1     1     2     3
ℎ(𝑥) (value-hash):  0.19  0.51  0.07  0.70  0.55  0.20

(𝑦1, 𝑦2, 𝑦3) = (0.07, 0.19, 0.20)
MinHash Sketches: k-partition
k-partition sketch: Use a single hash function ℎ′. Use the first log₂ 𝑘 bits of ℎ′(𝑥) to map 𝑥 uniformly to one of 𝑘 parts; call the remaining bits ℎ(𝑥).
For 𝑖 = 1, …, 𝑘: track the minimum hash value 𝑦𝑖 of the elements in part 𝑖.
Processing a new element 𝑥:
𝑖 ← first log₂ 𝑘 bits of ℎ′(𝑥)
ℎ ← remaining bits of ℎ′(𝑥)
𝑦𝑖 ← min{𝑦𝑖, ℎ}
Example: 𝑖(𝑥) = 2, ℎ(𝑥) = 0.19 ⟹ 𝑦2 ← min{𝑦2, 0.19}
Computation: 𝑂(1) to test or update
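A corresponding Python sketch of the k-partition update; the “first log₂ 𝑘 bits / remaining bits” split is rendered here as arithmetic on a single uniform value, and 𝑘 is assumed to be a power of two:

```python
class KPartitionSketch:
    """k-partition MinHash: one hash; leading bits pick the part, the rest give the value."""
    def __init__(self, k):
        self.k = k                        # assume k is a power of 2
        self.y = [1.0] * k

    def add(self, x):
        u = h(x)                          # h'(x), uniform in [0,1)
        i = int(u * self.k)               # "first log2(k) bits": part index in 0..k-1
        v = u * self.k - i                # "remaining bits": value, uniform in [0,1)
        self.y[i] = min(self.y[i], v)     # O(1) test/update of part i's minimum
```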
MinHash Sketches: Example
Bottom-k, 𝒌 = 𝟑

𝑥:     32    12    14    7     6     4
ℎ(𝑥):  0.19  0.51  0.07  0.70  0.55  0.20

(𝑦1, 𝑦2, 𝑦3) = (0.07, 0.19, 0.20)
MinHash Sketches: bottom-k
Bottom-k sketch: Use a single hash function: ℎ
Track the 𝑘 smallest values 𝑦1 < 𝑦2 < ⋯ < 𝑦𝑘
Processing a new element 𝑥 :
If ℎ(𝑥) < 𝑦𝑘: (𝑦1, …, 𝑦𝑘) ← sort{𝑦1, …, 𝑦𝑘−1, ℎ(𝑥)}
Computation:
The sketch (𝑦1, …, 𝑦𝑘) is maintained as a sorted list or as a priority queue.
𝑂(1) to test if an update is needed
𝑂(𝑘) to update a sorted list; 𝑂(log 𝑘) to update a priority queue.
We will see that #changes ≪ #distinct elements
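A Python sketch of the bottom-k update using a max-heap of negated values (heapq), so the test against 𝑦𝑘 is 𝑂(1) and an update is 𝑂(log 𝑘):

```python
import heapq

class BottomKSketch:
    """Bottom-k MinHash: the k smallest distinct hash values (illustrative)."""
    def __init__(self, k):
        self.k = k
        self.heap = []                 # max-heap via negation; -heap[0] is y_k
        self.vals = set()              # guard against repeated keys / hash values

    def add(self, x):
        v = h(x)
        if v in self.vals:
            return                     # repeated key: sketch unchanged
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -v)
            self.vals.add(v)
        elif v < -self.heap[0]:        # O(1) test: h(x) < y_k ?
            evicted = -heapq.heappushpop(self.heap, -v)   # O(log k) update
            self.vals.remove(evicted)
            self.vals.add(v)

    def values(self):
        return sorted(-u for u in self.heap)              # y_1 < ... < y_k
```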
MinHash Sketches: Number of updates
Claim: The expected number of actual updates
(changes) of the MinHash sketch is O(𝑘 ln 𝑛)
Proof: First consider 𝒌 = 𝟏. Look at the distinct elements in the order they first occur.
The 𝒊th distinct element has a lower hash value than the current minimum with probability 𝟏/𝒊. This is the probability of being first in a random permutation of 𝑖 elements.
⟹ The total expected number of updates is Σ_{i=1}^{n} 1/𝑖 = 𝐻_𝑛 ≤ ln 𝑛 + 1.

Stream:        32   12   14   32   7    12   32   7    6    12   4
Update prob.:  1    1/2  1/3  0    1/4  0    0    0    1/5  0    1/6
MinHash Sketches: Number of updates
Claim: The expected number of actual updates
(changes) of the MinHash sketch is O(𝑘 ln 𝑛)
Proof (continued): Recap for 𝑘 = 1: the 𝑖th distinct element causes an update with probability 1/𝑖 ⟹ the expected total is Σ_{i=1}^{n} 1/𝑖 ≤ ln 𝑛 + 1.
k-mins: 𝑘 min-hash values (apply the 𝑘 = 1 bound 𝑘 times).
Bottom-k: We keep the 𝑘 smallest values, so the update probability of the 𝑖th distinct element is min{1, 𝑘/𝑖} (the probability of being among the first 𝑘 in a random permutation).
k-partition: 𝑘 min-hash values, each over ≈ 𝑛/𝑘 distinct values.
Merging MinHash Sketches
!! We apply the same set of hash functions to all elements/data sets/streams.
The union sketch 𝒚 from the sketches 𝒚′, 𝒚′′ of two sets:
k-mins: take the minimum per hash function: 𝑦𝑖 ← min{𝑦𝑖′, 𝑦𝑖′′}
k-partition: take the minimum per part: 𝑦𝑖 ← min{𝑦𝑖′, 𝑦𝑖′′}
Bottom-k: the 𝑘 smallest in the union of the data must be among the 𝑘 smallest of their own set:
{𝑦1, …, 𝑦𝑘} = bottom-𝑘{𝑦1′, …, 𝑦𝑘′, 𝑦1′′, …, 𝑦𝑘′′}
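The three merge rules as Python sketches, consistent with the illustrative classes above:

```python
def merge_kmins(y1, y2):
    """Coordinate-wise minimum, one coordinate per hash function."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_kpartition(y1, y2):
    """Coordinate-wise minimum, one coordinate per part (same h' everywhere)."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_bottomk(y1, y2, k):
    """The k smallest distinct values among the two sketches."""
    return sorted(set(y1) | set(y2))[:k]
```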
Using MinHash Sketches
Recap:
We defined MinHash Sketches (3 types)
Adding elements, merging MinHash sketches
Some properties of these sketches
Next: We put MinHash sketches to work
Estimating Distinct Count from a MinHash Sketch
Estimating Jaccard Similarity of two sets from their
sketches
The Exponential Distribution Exp(𝑛)
PDF 𝑛e^(−𝑛𝑥), 𝑥 ≥ 0; CDF 1 − e^(−𝑛𝑥); 𝜇 = 𝜎 = 1/𝑛
Very useful properties:
Memorylessness: ∀𝑡, 𝑦 ≥ 0, Pr[𝑥 > 𝑦 + 𝑡 ∣ 𝑥 > 𝑦] = Pr[𝑥 > 𝑡]
Min-to-Sum conversion: min{Exp(𝑛1), …, Exp(𝑛𝑡)} ∼ Exp(𝑛1 + ⋯ + 𝑛𝑡)
Relation with uniform: 𝑢 ∼ 𝑈[0,1] ⇔ −ln(1−𝑢)/𝑛 ∼ Exp(𝑛); 𝑥 ∼ Exp(𝑛) ⇔ 1 − e^(−𝑛𝑥) ∼ 𝑈[0,1]
Estimating Distinct Count from a MinHash Sketch: k-mins
• Change to an exponential distribution: ℎ(𝑥) ∼ Exp(1)
• By the Min-to-Sum property, 𝑦𝑖 ∼ Exp(𝑛)
– In fact, we can just work with ℎ(𝑥) ∼ 𝑈[0,1] and use 𝑦𝑖 ← −ln(1 − 𝑦𝑖) when estimating.
• The number of distinct elements becomes a parameter estimation problem: given 𝑘 independent samples from Exp(𝑛), estimate 𝑛
Estimating Distinct Count from a MinHash Sketch: k-mins
Each 𝑦𝑖 ∼ Exp(𝑛) has expectation 1/𝑛 and variance 𝜎² = 1/𝑛².
The average 𝑎 = (1/𝑘) Σ_{i=1}^{k} 𝑦𝑖 has expectation 𝜇 = 1/𝑛 and variance 𝜎² = 1/(𝑘𝑛²). The CV is 𝜎/𝜇 = 1/√𝑘.
𝑎 is a good unbiased estimator for 1/𝑛. But 1/𝑛 is the inverse of what we want.
To estimate 𝑛: 𝒏̂ = (𝒌 − 𝟏) / Σ_{i=1}^{k} 𝒚𝒊 is unbiased, with 𝜎/𝜇 = 1/√(𝑘 − 2)
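A Python sketch of this estimator on top of KMinsSketch from earlier; the −ln(1 − 𝑦) transform converts the uniform minima into the Exp(𝑛) samples used in the analysis:

```python
import math

def kmins_distinct_estimate(sketch):
    """n-hat = (k-1) / sum_i -ln(1 - y_i): unbiased under the Exp(n) view (illustrative)."""
    s = sum(-math.log(1.0 - y) for y in sketch.y)   # uniform minima -> Exp(n) samples
    return (sketch.k - 1) / s

km = KMinsSketch(k=64)
for x in range(10_000):
    km.add(x)
print(kmins_distinct_estimate(km))   # close to 10000; CV about 1/sqrt(k-2)
```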
Inverse probability estimators
[Horvitz Thompson 1952]
Model: Value 𝑥 is observed/sampled with probability 𝑝 > 0. We want to estimate 𝑓(𝑥) ≥ 0.
If 𝑥 is sampled, we know both 𝑝, 𝑥, and can compute 𝑓(𝑥).
Inverse Probability Estimator:
If 𝑥 is sampled, 𝑓̂ = 𝑓(𝑥)/𝑝(𝑥). Else, 𝑓̂ = 0.
𝑓̂ is unbiased: E[𝑓̂] = (1 − 𝑝) ⋅ 0 + 𝑝 ⋅ 𝑓(𝑥)/𝑝 = 𝑓(𝑥)
Inverse-probability estimate for a sum
We want to estimate the sum 𝑛 = Σ𝑥 𝑓(𝑥).
We have a sample 𝑆 of elements. 𝑓(𝑥) > 0 ⟹ 𝑝(𝑥) > 0, and we know 𝑓(𝑥), 𝑝(𝑥) when 𝑥 ∈ 𝑆.
We use 𝑓̂𝑥 = 𝑓(𝑥)/𝑝(𝑥) when 𝑥 ∈ 𝑆; 𝑓̂𝑥 = 0 otherwise.
Sum estimator: 𝑛̂ = Σ𝑥 𝑓̂𝑥 = Σ_{𝑥∈𝑆} 𝑓̂𝑥 (unbiased)
For distinct count, 𝑓(𝑥) = I_{𝑥∈𝑁} (indicator function).
Bottom-k sketches: Inverse probability estimator
We work with the uniform distribution ℎ(𝑥) ∼ 𝑈[0,1]
For each distinct element, we consider the probability that it is one of the 𝑘 − 1 lowest-hash elements.
For the sketch 𝑦1 < ⋯ < 𝑦𝑘, we say element 𝑥 is “sampled” ⟺ 𝑦𝑖 = ℎ(𝑥) for some 𝑖 ≤ 𝑘 − 1
Caveat: this probability is (𝑘 − 1)/𝑛 for all elements, but we do not know 𝑛.
Bottom-k sketches: Inverse probability estimator
We use an inverse probability estimate: if 𝒙 is not sampled (not one of the 𝑘 − 1 smallest-hash elements), the estimate is 0. Otherwise, it is 1/𝒑(𝒙).
But we do not know 𝒑! What can we do?
We compute 𝒑(𝒙) conditioned on fixing 𝒉 on 𝑵 ∖ {𝒙} but taking ℎ(𝒙) ∼ 𝑼[𝟎, 𝟏]
We need to be able to compute 𝒑(𝒙) only for “sampled” elements.
Bottom-k sketches: Inverse probability estimator
What is the probability 𝑝 that 𝑥 is sampled if we fix ℎ on 𝑁 ∖ {𝑥} but take ℎ(𝑥) ∼ 𝑈[0,1]?
𝑥 is sampled ⟺ ℎ(𝑥) < the (𝑘 − 1)th smallest of {ℎ(𝑧) ∣ 𝑧 ∈ 𝑁 ∖ {𝑥}}
For sampled 𝑥, the (𝑘 − 1)th smallest of {ℎ(𝑧) ∣ 𝑧 ∈ 𝑁 ∖ {𝑥}} = 𝑦𝑘
⟹ 𝑝(𝑥) = 𝑦𝑘
⟹ The inverse probability estimate is 1/𝑝(𝑥) = 1/𝑦𝑘
Summing over the 𝑘 − 1 “sampled” elements: 𝑛̂ = (𝑘 − 1)/𝑦𝑘
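In code, the bottom-k estimator is a one-liner over the BottomKSketch from earlier (a Python sketch; the under-full case is handled by exact counting):

```python
def bottomk_distinct_estimate(sketch):
    """Unbiased inverse-probability estimate n-hat = (k-1)/y_k (illustrative)."""
    y = sketch.values()
    if len(y) < sketch.k:          # fewer than k distinct keys seen: count is exact
        return len(y)
    return (sketch.k - 1) / y[-1]  # y[-1] is y_k, the k-th smallest hash

bk = BottomKSketch(k=64)
for x in range(10_000):
    bk.add(x)
print(bottomk_distinct_estimate(bk))   # close to 10000
```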
Explaining conditioning in the Inverse Probability Estimate for bottom-k
𝑁 = {𝒂, 𝑏, 𝑐, 𝑑, 𝑒}, 𝑘 = 3
The probability that 𝒂 has one of the 𝒌 − 𝟏 = 𝟐 smallest values in ℎ(𝑎), ℎ(𝑏), …, ℎ(𝑒) is Pr[𝒂 ∈ 𝑆] = 𝟐/𝟓, but we cannot compute it since we do not know 𝒏 (= 𝟓).
The conditional probability Pr[𝒂 ∈ 𝑆 ∣ ℎ fixed on 𝑁 ∖ {𝒂}] is the probability that ℎ(𝒂) falls below the (𝑘 − 1)th smallest hash of the other elements; when 𝒂 is sampled this threshold equals 𝑦𝑘, so the conditional probability can be computed from the sketch.
Bottom-𝑘 sketches: Inverse probability estimators
𝑛̂ = (𝑘 − 1)/𝑦𝑘
Unbiased estimator.
No need to track keys (the sample view is only used for analysis).
How good is this estimator? We can (but do not here) show:
The CV is 𝜎/𝜇 ≤ 1/√(𝑘 − 2), at least as good as the k-mins estimator
Better distinct count estimators ?
Recap:
Our estimators (k-mins, bottom-k) have CV 𝜎/𝜇 ≤ 1/√(𝑘 − 2)
The lower bound for a sketch (CRLB) says CV 𝜎/𝜇 ≥ 1/√𝑘
Can we improve? Also, what about k-partition?
The CRLB applies when we are limited to using only the information in the sketch.
Idea: Use information we discard along the way
HIP: “Historic” Inverse Probability Estimators
We maintain an approximate count together with the sketch: (𝒚𝟏, …, 𝒚𝒌, 𝒄)
Initially (𝒚𝟏, …, 𝒚𝒌) ← (𝟏, …, 𝟏) and 𝒄 ← 𝟎
When the sketch 𝒚 is updated, we compute the probability 𝑝 that a new distinct element would cause an update to the current sketch.
We increase the counter: 𝒄 ← 𝒄 + 1/𝑝
Easy to apply with all MinHash sketches
The estimate is unbiased
We can (but do not here) show: CV 𝜎/𝜇 ≤ 1/√(2𝑘 − 2) < (1/√2) ⋅ 1/√(𝑘 − 2)
Maintaining a k-mins “historic” sketch
k-mins sketch: Use 𝑘 “independent” hash functions ℎ1, …, ℎ𝑘. Track the respective minimum 𝑦1, …, 𝑦𝑘 for each function.
Update probability: the probability 𝑝 that ℎ𝑖(𝑥) < 𝑦𝑖 for at least one 𝑖 = 1, …, 𝑘:
𝑝 = 1 − ∏_{𝑖=1}^{𝑘} (1 − 𝑦𝑖)
Processing a new element 𝑥:
𝑝 ← 1 − ∏_{𝑖=1}^{𝑘} (1 − 𝑦𝑖)
For 𝑖 = 1, …, 𝑘: 𝑦𝑖 ← min{𝑦𝑖, ℎ𝑖(𝑥)}
If 𝑦 changed: 𝑐 ← 𝑐 + 1/𝑝
Maintaining a k-partition “historic” sketch
Processing a new element 𝑥:
𝑖 ← first log₂ 𝑘 bits of ℎ′(𝑥)
ℎ ← remaining bits of ℎ′(𝑥)
If ℎ < 𝑦𝑖:
𝑝 ← (1/𝑘) Σ_{𝑗=1}^{𝑘} 𝑦𝑗
𝑐 ← 𝑐 + 1/𝑝
𝑦𝑖 ← ℎ
Update probability: the probability 𝑝 that ℎ(𝑥) < 𝑦𝑖 for a part 𝑖 selected uniformly at random
Maintaining a bottom-𝑘 “historic” sketch
Bottom-k sketch: Use a single hash function ℎ. Track the 𝑘 smallest values 𝑦1 < 𝑦2 < ⋯ < 𝑦𝑘
Processing a new element 𝑥:
If ℎ(𝑥) < 𝑦𝑘:
𝑐 ← 𝑐 + 1/𝑦𝑘
(𝑦1, …, 𝑦𝑘) ← sort{𝑦1, …, 𝑦𝑘−1, ℎ(𝑥)}
The probability of an update is 𝑦𝑘
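A Python sketch of the bottom-k HIP counter (self-contained apart from the illustrative hash h; while fewer than 𝑘 distinct keys have been seen, 𝑦𝑘 = 1, so the counter grows by exactly 1 and the count is exact):

```python
class BottomKHIP:
    """Bottom-k sketch plus a HIP ('historic inverse probability') counter (illustrative)."""
    def __init__(self, k):
        self.k = k
        self.y = [1.0] * k        # y_1 <= ... <= y_k, initialized to 1
        self.c = 0.0              # HIP estimate of the distinct count

    def add(self, x):
        v = h(x)
        yk = self.y[-1]
        if v < yk and v not in self.y:    # update <=> h(x) < y_k, which has prob. y_k
            self.c += 1.0 / yk            # c <- c + 1/p with p = y_k
            self.y = sorted(self.y[:-1] + [v])

hip = BottomKHIP(k=64)
for x in range(10_000):
    hip.add(x)
print(hip.c)    # unbiased estimate of n = 10000; CV about 1/sqrt(2k-2)
```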
Summary: Historic distinct estimators
Recap:
Maintain sketch and count. The CV is ≤ 1/√(2𝑘 − 2) < (1/√2) ⋅ 1/√(𝑘 − 2)
Easy to apply. Trivial to query. Unbiased.
More (not shown here):
The CV is almost tight for this type of estimator (one that estimates the presence of each distinct element entering the sketch) ⟹ we can’t do better than 1/√(2𝑘)
The HIP and other basic estimators have good concentration (Chernoff style)
Jaccard Similarity
A common similarity measure of two sets (keys 𝑁1 of set 1, keys 𝑁2 of set 2): the ratio of the size of the intersection to the size of the union:
𝐽(𝑁1, 𝑁2) = |𝑁1 ∩ 𝑁2| / |𝑁1 ∪ 𝑁2|
[Diagram: two overlapping sets; here 𝐽 = 3/8 = 0.375]
Jaccard Similarity from MinHash sketches
𝐽(𝑁1, 𝑁2) = |𝑁1 ∩ 𝑁2| / |𝑁1 ∪ 𝑁2|
For each 𝑁𝑖 we have a MinHash sketch 𝑠(𝑁𝑖) (use the same hash function/s ℎ for all sets)
Merge 𝑠(𝑁1) and 𝑠(𝑁2) to obtain 𝑠(𝑁1 ∪ 𝑁2)
For each 𝑥 ∈ 𝑠(𝑁1 ∪ 𝑁2) we know everything about its membership in 𝑁1 or 𝑁2:
𝒙 ∈ 𝒔(𝑵𝟏 ∪ 𝑵𝟐) is in 𝑵𝒊 if and only if 𝒙 ∈ 𝒔(𝑵𝒊)
In particular, we know whether 𝑥 ∈ 𝑁1 ∩ 𝑁2
𝐽 is the fraction of union members that are intersection members: apply the subset estimator to 𝑠(𝑁1 ∪ 𝑁2)
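For k-mins sketches, the subset estimator is just the fraction of coordinates whose union minimum appears in both sketches; a Python sketch:

```python
def jaccard_estimate_kmins(y1, y2):
    """Fraction of coordinates where both sketches share the union minimum (illustrative)."""
    union = [min(a, b) for a, b in zip(y1, y2)]            # sketch of N1 U N2
    hits = sum(1 for a, b, u in zip(y1, y2, union) if a == u and b == u)
    return hits / len(union)          # each coordinate is Bernoulli(J): unbiased, var J(1-J)/k
```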
k-mins sketches: Jaccard estimation, 𝒌 = 𝟒
𝑠(𝑁1) = (0.22, 0.11, 0.14, 0.22)
𝑠(𝑁2) = (0.18, 0.24, 0.14, 0.35)
𝑠(𝑁1 ∪ 𝑁2) = (0.18, 0.11, 0.14, 0.22)
Coordinate by coordinate: 0.18 ∈ 𝑁2 ∖ 𝑁1; 0.11, 0.22 ∈ 𝑁1 ∖ 𝑁2; 0.14 ∈ 𝑁1 ∩ 𝑁2
⇒ 𝛼̂(𝑁1 ∖ 𝑁2) = 2/4 = 1/2, 𝛼̂(𝑁2 ∖ 𝑁1) = 1/4, 𝛼̂(𝑁1 ∩ 𝑁2) = 1/4
Can estimate each of 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = 𝛼(1 − 𝛼)/𝑘
k-partition sketches: Jaccard estimation, 𝒌 = 𝟒
𝑠(𝑁1) = (1.00, 1.00, 0.14, 0.21), 𝑘′ = 2 non-empty parts
𝑠(𝑁2) = (0.18, 1.00, 0.14, 0.35), 𝑘′ = 3
𝑠(𝑁1 ∪ 𝑁2) = (0.18, 1.00, 0.14, 0.21), 𝑘′ = 3
Per non-empty part: 0.18 ∈ 𝑁2 ∖ 𝑁1; 0.14 ∈ 𝑁1 ∩ 𝑁2; 0.21 ∈ 𝑁1 ∖ 𝑁2
⇒ 𝛼̂ = 1/3 for each of the three relations
Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = 𝛼(1 − 𝛼)/𝑘′ (conditioned on 𝑘′)
Bottom-k sketches: Jaccard estimation, 𝒌 = 𝟒
𝑠(𝑁1) = {0.09, 0.14, 0.18, 0.21}
𝑠(𝑁2) = {0.14, 0.17, 0.19, 0.35}
𝑠(𝑁1 ∪ 𝑁2) = {0.09, 0.14, 0.17, 0.18} (the smallest 𝑘 = 4 in the union of the sketches)
Membership: 0.09, 0.18 ∈ 𝑁1 ∖ 𝑁2; 0.17 ∈ 𝑁2 ∖ 𝑁1; 0.14 ∈ 𝑁1 ∩ 𝑁2
⇒ 𝛼̂(𝑁1 ∖ 𝑁2) = 2/4, 𝛼̂(𝑁2 ∖ 𝑁1) = 1/4, 𝛼̂(𝑁1 ∩ 𝑁2) = 1/4
Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = (𝛼(1 − 𝛼)/𝑘)(1 − (𝑘 − 1)/(𝑛 − 1))
Bottom-k sketches: better estimate, 𝒌 = 𝟒
𝑠(𝑁1) = {0.09, 0.14, 0.18, 0.21}
𝑠(𝑁2) = {0.14, 0.17, 0.19, 0.35}
Extended union view: {0.09, 0.14, 0.17, 0.18, 0.19, 0.21}, 𝑘′ = 6 > 4
We can look beyond the union sketch: we have complete membership information on all elements with ℎ(𝑥) ≤ min{max 𝑠(𝑁1), max 𝑠(𝑁2)}. We have 2𝑘 > 𝑘′ ≥ 𝑘 elements!
Bottom-k sketches: better estimate, 𝒌 = 𝟒
𝑠(𝑁1) = {0.09, 0.14, 0.18, 0.21}
𝑠(𝑁2) = {0.14, 0.17, 0.19, 0.35}
Extended union view: {0.09, 0.14, 0.17, 0.18, 0.19, 0.21}, 𝑘′ = 6 > 4
Membership: 0.09, 0.18, 0.21 ∈ 𝑁1 ∖ 𝑁2; 0.17, 0.19 ∈ 𝑁2 ∖ 𝑁1; 0.14 ∈ 𝑁1 ∩ 𝑁2
⇒ 𝛼̂(𝑁1 ∖ 𝑁2) = 3/6 = 1/2, 𝛼̂(𝑁2 ∖ 𝑁1) = 2/6 = 1/3, 𝛼̂(𝑁1 ∩ 𝑁2) = 1/6
Can estimate 𝛼 = |𝑁1 ∖ 𝑁2|/|𝑁1 ∪ 𝑁2|, |𝑁2 ∖ 𝑁1|/|𝑁1 ∪ 𝑁2|, |𝑁1 ∩ 𝑁2|/|𝑁1 ∪ 𝑁2| unbiasedly with 𝜎² = (𝛼(1 − 𝛼)/𝑘′)(1 − (𝑘′ − 1)/(𝑛 − 1)) (conditioned on 𝑘′)
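A Python sketch of this beyond-the-union-sketch estimator for the intersection fraction (the other 𝛼 fractions are analogous); inputs are bottom-k sketches represented as sets of hash values:

```python
def jaccard_estimate_bottomk(s1, s2):
    """Estimate |N1 ∩ N2| / |N1 ∪ N2| from all values below both sketch maxima (illustrative)."""
    thresh = min(max(s1), max(s2))             # full membership info below this value
    window = {v for v in set(s1) | set(s2) if v <= thresh}
    k_prime = len(window)                      # k <= k' < 2k elements
    inter = sum(1 for v in window if v in s1 and v in s2)
    return inter / k_prime

s1 = {0.09, 0.14, 0.18, 0.21}
s2 = {0.14, 0.17, 0.19, 0.35}
print(jaccard_estimate_bottomk(s1, s2))        # 1/6, matching the example above
```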
Graph Datasets:
Represent relations between “things”
[Images: the bowtie structure of the Web (Broder et al. 2001); a dolphin interaction network]
Graph Datasets
Hyperlinks (the Web)
Social graphs (Facebook, Twitter, LinkedIn,…)
Email logs, phone call logs , messages
Commerce transactions (Amazon purchases)
Road networks
Communication networks
Protein interactions
…
Properties
Directed/Undirected
Snapshot or with time dimension (dynamic)
One or more types of entities (people,
pages, products)
Meta data associated with nodes
Some graphs are really large: billions of
edges for Facebook and Twitter graphs
Mining the link structure
• Centrality (who are the most important nodes?)
• Reach/Influence: coverage of one or more nodes
(facility location, influence maximization)
• Similarity of node pairs (link prediction, targeted
ads, friend/product recommendations, attribute
completion)
• Communities: set of nodes that are more tightly
related to each other than to others
Computing on Very Large Graphs
Many applications, platforms, algorithms
Clusters (MapReduce, Hadoop) when applicable
Giraph/Pregel better with edge traversals
(Semi-)streaming: pass(es), keep small info (per node)
General algorithm design principles:
settle for approximations
keep total computation/communication/storage “linear” in the size of the data
Parallelize (minimize chains of dependencies)
Localize dependencies
Next: Node sketches
MinHash sketches of reachability sets
Compute a sketch for each node, efficiently
From the sketch(es) we can estimate properties that are “harder” to compute exactly
Sketching Reachability Sets
[Diagram: a directed graph with highlighted reachability sets of sizes 4, 5, and 13]
Why sketch reachability sets ?
From reachability sketch(es) we can:
Estimate cardinality of reachability/influence set
of one or more nodes
Get a uniform sample of the reachable nodes
Estimate similarity of nodes by relations between
reachability sets (e.g., Jaccard similarity)
Exact computation is costly: 𝑂(𝑚𝑛) with 𝑛 nodes and 𝑚 edges, and the representation size is massive: it does not scale to large networks!
[Diagram: the influence set of a union of two nodes, of size 9]
MinHash sketches of all Reachability sets
[Diagram: a directed graph whose nodes carry random hash values 𝐡(𝒗) ∼ 𝑼[𝟎,𝟏]: 0.37, 0.12, 0.45, 0.28, 0.34, 0.23, 0.85, 0.06, 0.95, 0.93, 0.77, 0.69, 0.32]
MinHash sketches of all Reachability Sets: 𝑘 = 1
For each 𝑣: 𝐬(𝒗) ← min_{𝒗↝𝒖} 𝒉(𝒖)
Depending on the application, we may also want to include the node ID in the sketch: argmin_{𝒗↝𝒖} 𝒉(𝒖)
MinHash sketches of all Reachability Sets: 𝑘 = 1
[Diagram: the example graph with each node 𝑣 labeled by 𝐬(𝒗) = min_{𝒗↝𝒖} 𝒉(𝒖); the resulting labels are {0.06}, {0.12}, and {0.23}]
MinHash sketches of all Reachability Sets: bottom-2 (𝑘 = 2)
For each 𝑣: 𝐬(𝒗) ← bottom-𝟐_{𝒗↝𝒖} 𝒉(𝒖)
MinHash sketches of all Reachability Sets: bottom-2
[Diagram: the example graph with bottom-2 sketches, e.g. {0.23, 0.37}, {0.12, 0.23}, {0.06, 0.12}]
Next: Computing MinHash sketches of all reachability sets efficiently
Sketch size for a node: 𝑂(𝑘)
Total computation ≈ 𝑂(𝑘𝑚)
Algorithms/methods:
Graph searches (BFS)
Dynamic programming / Distributed (DP)
Computing MinHash Sketches of all Reachability Sets: 𝑘 = 1, BFS method
𝐬(𝒗) ← min_{𝒗↝𝒖} 𝒉(𝒖)
Iterate over nodes 𝑢 by increasing ℎ(𝑢):
Visit nodes 𝑣 through a reverse search from 𝑢:
IF 𝑠(𝑣) = ∅: 𝑠(𝑣) ← ℎ(𝑢); continue the search on inNeighbors(𝑣)
ELSE, truncate the search at 𝑣
Compute MinHash sketches of all Reachability Sets: 𝑘 = 1, BFS
[Diagram: reverse searches from the nodes in increasing hash order (starting at 0.06) assign each node 𝑣 its label 𝐬(𝒗) = min_{𝒗↝𝒖} 𝒉(𝒖): {0.06}, {0.12}, or {0.23}]
Computing MinHash sketches of all
reachability sets: 𝑘 = 1 BFS method
Analysis:
Each arc is used exactly once, so the total work is 𝑂(𝑚), where 𝑚 is the number of arcs
Note: sorting the nodes is not needed. A random permutation of the 𝑛 nodes can be generated in 𝑂(𝑛) time
Computing MinHash Sketches of all Reachability Sets: bottom-𝑘, BFS method
Next: computing sketches using the BFS method for 𝑘 > 1
𝐬(𝒗) ← bottom-𝒌_{𝒗↝𝒖} 𝒉(𝒖)
Computing MinHash Sketches of all Reachability Sets: bottom-𝑘, BFS method
𝐬(𝒗) ← bottom-𝒌_{𝒗↝𝒖} 𝒉(𝒖)
Iterate over nodes 𝑢 by increasing ℎ(𝑢):
Visit nodes 𝑣 through a reverse search from 𝑢:
IF |𝑠(𝑣)| < 𝑘: 𝑠(𝑣) ← 𝑠(𝑣) ∪ {ℎ(𝑢)}; continue the search on inNeighbors(𝑣)
ELSE, truncate the search at 𝑣
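A compact Python sketch of this pruned-BFS computation (the graph is assumed to be given as reverse adjacency lists; with 𝑘 = 1 it reduces to the previous method):

```python
from collections import deque

def reachability_sketches(in_neighbors, hashes, k):
    """Bottom-k sketch s(v) of every reachability set via pruned reverse BFS (illustrative).

    in_neighbors: dict node -> list of its in-neighbors; hashes: dict node -> h(node).
    """
    s = {v: [] for v in in_neighbors}                # sketches; values arrive in increasing order
    for u in sorted(in_neighbors, key=hashes.get):   # iterate nodes by increasing h(u)
        visited, q = {u}, deque([u])
        while q:                                     # reverse search from u
            v = q.popleft()
            if len(s[v]) < k:                        # room in s(v): h(u) is among its bottom-k
                s[v].append(hashes[u])
                for w in in_neighbors[v]:
                    if w not in visited:
                        visited.add(w)
                        q.append(w)
            # else: truncate the search at v
    return s
```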
MinHash sketches of all Reachability Sets: bottom-2
[Diagram: the example graph with bottom-2 sketches {0.23, 0.37}, {0.12, 0.23}, {0.06, 0.12}]
Computing MinHash Sketches of all Reachability Sets: 𝑘 = 1, Distributed (DP)
Next: back to 𝑘 = 1.
We present another method to compute the sketches. The algorithm has fewer dependencies and is specified per node. It is suitable for computation that is:
Distributed, asynchronous
Dynamic programming (DP)
Multiple passes over the set of arcs
Computing MinHash Sketches of all Reachability Sets: 𝑘 = 1, Distributed (DP)
𝐬(𝒗) ← min_{𝒗↝𝒖} 𝒉(𝒖)
Initialize 𝐬(𝒗) ← 𝒉(𝒗)
IF 𝑠(𝑣) is initialized/updated, send 𝑠(𝑣) to inNeighbors(𝑣)
IF value 𝑥 is received from a neighbor: 𝑠(𝑣) ← min{𝑠(𝑣), 𝑥}
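A Python sketch that simulates this message passing in synchronous rounds (same reverse-adjacency input as the BFS sketch above):

```python
def dp_sketches_k1(in_neighbors, hashes):
    """k=1 MinHash of every reachability set by value propagation (illustrative)."""
    s = dict(hashes)                              # initialize s(v) <- h(v)
    frontier = set(in_neighbors)                  # nodes whose s(v) was just set/updated
    while frontier:
        nxt = set()
        for v in frontier:                        # v "sends" s(v) to its in-neighbors
            for w in in_neighbors[v]:
                if s[v] < s[w]:                   # receive: s(w) <- min{s(w), s(v)}
                    s[w] = s[v]
                    nxt.add(w)
        frontier = nxt                            # only updated nodes send again
    return s
```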
DP computation of MinHash sketches, 𝑘 = 1
[Diagram sequence over the example graph: (1) Initialize 𝐬(𝒗) ← 𝒉(𝒗). (2) Send 𝑠(𝑣) to inNeighbors(𝑣). (3) Update 𝑠(𝑣) ← min{𝑠(𝑣), 𝑥} for each received value 𝑥. (4) If updated, send to inNeighbors again. Repeat until no node is updated; every node then holds min_{𝒗↝𝒖} 𝒉(𝒖), i.e. the labels {0.06}, {0.12}, {0.23} of the 𝑘 = 1 example. Done.]
Analysis of DP: Edge traversals
Lemma: Each arc is used in expectation at most 𝑯𝒏 ≤ ln 𝒏 + 1 times.
Proof: We bound the expected number of updates of 𝒔(𝒗).
Consider nodes 𝑣 = 𝑢1, 𝑢2, … in the order that ℎ(𝑢𝑖) is propagated to (can reach) 𝑣.
The probability that ℎ(𝑢𝑖) updates 𝑠(𝑣) is Pr[𝒉(𝒖𝒊) < min_{𝒋<𝒊} 𝒉(𝒖𝒋)] = 𝟏/𝒊
Summing over nodes (linearity of expectation): Σ_{𝒊=𝟏}^{𝒏} 𝟏/𝒊 = 𝑯𝒏 ≤ ln 𝒏 + 1
Analysis of DP: dependencies
The longest chain of dependencies is at most
the longest shortest path (the diameter of the
graph)
Distinct counting/MinHash sketches bibliography 1
First use of k-mins Min-Hash sketches for distinct counting; first streaming algorithm for approximate
distinct counting; also proposes k-partition as an application of stochastic averaging:
P. Flajolet and G. N. Martin, “Probabilistic Counting Algorithms for Data Base Applications”, JCSS (31), 1985.
Use of Min-Hash sketches for similarity, union size, mergeability, size estimation (k-mins, propose
bottom-k):
E. Cohen “Size estimation framework with applications to transitive closure and
reachability”, JCSS (55) 1997
Use of shingling with k-mins sketches for Jaccard similarity of text documents:
A. Broder “On the Resemblance and Containment of Documents” Sequences 1997
A. Broder and S. Glassman and M. Manasse and G. Zweig “Syntactic Clustering of
the Web” SRC technical note 1997
Better similarity estimators (beyond the union sketch) from bottom-k samples:
E. Cohen and H. Kaplan, “Leveraging discarded samples for tighter estimation of multiple-set aggregates”, SIGMETRICS 2009.
Asymptotic Lower bound on distinct counter size (taking into account hash representation)
N. Alon, Y. Matias, M. Szegedy, “The space complexity of approximating the frequency moments”, STOC 1996.
Asymptotic epsilon,delta treatment of some k-partition and k-mins based distinct counting:
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan “Counting distinct elements in a
data stream” RANDOM 2002.
Distinct counting/Min-Hash sketches bibliography 2
Practical distinct counters based on k-partition sketches:
P. Flajolet, E. Fusy, O. Gandouet, F. Meunier, “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”, AofA 2007.
S. Heule, M. Nunkeser, A. Hall, “HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm”, EDBT 2013.
Theoretical algorithm with asymptotic bounds that match the AMS lower bound:
D. M. Kane, J. Nelson, D. P. Woodruff, “An optimal algorithm for the distinct elements problem”, PODS 2010.
Inverse probability “historic” estimators; application of the Cramér-Rao bound to min-hash sketches:
E. Cohen “All-Distances Sketches, Revisited: Scalable Estimation of the Distance Distribution
and Centralities in Massive Graphs” PODS 2014.
The concepts of min-hash sketches and sketch coordination are related to concepts from the
survey sampling literature: Order samples (bottom-k), coordination of samples using the PRN
method (Permanent Random Numbers).
More on bottom-k sketches, ML estimator for bottom-k:
E. Cohen, H. Kaplan, “Summarizing data using bottom-k sketches”, PODC 2007; “Tighter Estimation using bottom-k sketches”, VLDB 2008.
Inverse probability estimator with priority (type of bottom-k) sketches:
N. Alon, N. Duffield, M. Thorup, C. Lund: “Estimating arbitrary subset sums with a few
probes” PODS 2005
Bibliography: Reachability sketches, All-Distances sketches
(extension of reachability sketches which captures distances)
E. Cohen “Size-Estimation Framework with
Applications to Transitive Closure and
Reachability” JCSS 1997
E. Cohen H. Kaplan “Spatially-decaying
aggregation over a network” JCSS 2007
E. Cohen H. Kaplan “Summarizing data using
bottom-k sketches” PODC 2007
E. Cohen: “All-Distances Sketches, Revisited: HIP
Estimators for Massive Graphs Analysis” PODS
2014 arXiv 1306.3284
Parallelizing BFS-based MinHash
computation
Each graph search depends on all previous
ones: seems like we need to perform 𝑛
searches sequentially.
How can we reduce dependencies ?
Parallelizing BFS-based MinHash
computation
Idea (𝑘 = 1):
Create a super-node of the 𝑛/2 lowest hash
nodes.
Perform a (reverse) search from super-node and
mark all nodes that are accessed.
Concurrently perform searches:
From the lowest-hash 𝑛/2 nodes (sequentially)
From the highest-hash 𝑛/2 nodes (sequentially)
Prune searches also at marked nodes
Parallelizing BFS-based MinHash
computation
Correctness:
For the lower 𝒏/𝟐 hash values: computation is
the same.
For the higher 𝒏/𝟐:
We do not know the minimum reachable hash
from higher-hash nodes, but we do know it is
one of the lower 𝑛/2 hash values. This is all we
need to know for correct pruning.
Parallelizing BFS-based MinHash
computation
This only gives us 𝒏/𝟐 instead of 𝒏
sequential searches.
How can we obtain more parallelism ?
We recursively apply this to each of the
lower/higher sets:
Parallelizing BFS-based MinHash
computation
[Diagram: nodes ordered by ℎ(𝑢); super-nodes created in the recursion]
The depth of dependencies is at most log₂ 𝑛
The total number of edge traversals can increase by a factor of log₂ 𝑛