Aggregation Algorithms and
Instance Optimality
Moni Naor
Weizmann Institute
Joint work with
Ron Fagin and Amnon Lotem
1
Aggregating information from
several lists/sources
• Define the problem
• Ways to evaluate algorithms
• New algorithms
• Further Research
2
The problem
• Database D of N objects
• An object R has m fields: (x1, x2, …, xm)
– Each xi ∈ [0,1]
• The objects are given in m lists L1, L2, …, Lm
– List Li: all objects sorted by xi value.
• An aggregation function t(x1,x2,…xm)
– t(x1,x2,…xm) - a monotone increasing function
Wanted: top k objects according to t
3
Goal
• Touch as few objects
as possible
• Access to object?
List L1: c1 = 0.9, b1 = 0.8, s1 = 0.65, r1 = 0.5, a1 = 0.4
List L2: s2 = 0.85, a2 = 0.84, r2 = 0.75, b2 = 0.3, c2 = 0.2
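A minimal sketch of the setting, assuming the database is held as a Python dictionary and using the values of the two lists on this slide. The function names `build_sorted_lists` and `naive_top_k` are illustrative, not from the talk. The naive baseline touches every object, which is exactly what the algorithms in this talk try to avoid.

```python
import heapq

def build_sorted_lists(db):
    """db maps object id -> tuple of m field values in [0, 1].
    Returns m lists of (object id, value), each sorted by descending value."""
    m = len(next(iter(db.values())))
    return [sorted(((obj, fields[i]) for obj, fields in db.items()),
                   key=lambda pair: pair[1], reverse=True)
            for i in range(m)]

def naive_top_k(db, t, k):
    """Baseline that touches every object: score all of them, keep the k best."""
    return heapq.nlargest(k, db, key=lambda obj: t(db[obj]))

# Field values read off lists L1 and L2.
db = {"a": (0.4, 0.84), "b": (0.8, 0.3), "c": (0.9, 0.2),
      "r": (0.5, 0.75), "s": (0.65, 0.85)}
lists = build_sorted_lists(db)
print([obj for obj, _ in lists[0]])   # order of L1: c, b, s, r, a
print(naive_top_k(db, min, 1))        # top object under t = min
```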
4
Where?
Problem arises when
combining
information from
several sources/criteria
Concentrate on
middleware
complexity without
changing subsystems
5
Example: Combining Fuzzy Information
Lists are results of query: “find object with
color ‘red’ and shape ‘round’”
• Subsystems for color and for shape.
– Each returns a score in [0,1] for each object
• Aggregating function t is how the middleware
system should combine the two criteria
– Example: t(R = (x1,x2)) could be min(x1,x2)
6
Example: scheduling pages
Each object - page in a data broadcast system
• 1st field - number of users requesting the page
• 2nd field - longest time a user has been waiting
Combining function t - product of the two
fields (ranks pages the same way as their geometric mean)
Goal: find the page with the largest product
7
Example: Information Retrieval
[Figure: matrix with one row per document D1, D2, …, Dk and one column per term T1, T2, …, Tn; entries are weights such as W12]
Query T1, T2, T3: find documents with largest sum of entries
Aggregation function t is Σ xi
8
Modes of Access to the Lists
• Sequential/sorted access: obtain next object in list Li
– cost cS
• Random access: for object R and 1 ≤ i ≤ m obtain xi
– cost cR
Cost of an execution:
cS · (# of sequential accesses) + cR · (# of random accesses)
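The cost measure can be written as a one-line function. This is only a sketch: the name `execution_cost` and the default unit costs are assumptions made for illustration.

```python
def execution_cost(num_sorted, num_random, c_s=1.0, c_r=1.0):
    """Middleware cost: c_s per sorted access plus c_r per random access."""
    return c_s * num_sorted + c_r * num_random

# Hypothetical run: 10 sorted accesses and 4 random accesses,
# with random access five times as expensive as sorted access.
print(execution_cost(10, 4, c_s=1, c_r=5))
```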
9
Interesting Cases
• cR/cS is small
• cS ≈ cR, or
• cR >> cS
Number of lists m - small
10
Fagin’s Algorithm - FA
• For all lists L1, L2, …, Lm get next object in
sorted order.
• Stop when there is set of k objects that appeared
in all lists.
• For every object R encountered
– retrieve all fields x1, x2, …, xm.
– Compute t(x1,x2,…xm)
• Return top k objects
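The steps of FA can be sketched as follows. This is a minimal illustration, not the authors' implementation: random access is simulated with one dictionary per list, lists are assumed to have equal length, and the sample data are the two five-object lists shown earlier in the deck.

```python
def fagin_algorithm(lists, t, k):
    """lists[i] is [(obj, value), ...] sorted by descending value."""
    m = len(lists)
    # Random access is simulated with one dictionary per list.
    lookup = [dict(lst) for lst in lists]
    seen_in = {}            # obj -> set of list indices where it appeared
    sorted_accesses = 0
    for depth in range(len(lists[0])):
        for i in range(m):
            obj, _ = lists[i][depth]
            sorted_accesses += 1
            seen_in.setdefault(obj, set()).add(i)
        # Stop once k objects have appeared in all m lists.
        if sum(len(s) == m for s in seen_in.values()) >= k:
            break
    # Retrieve all fields of every object encountered and score it.
    scores = {obj: t([lookup[i][obj] for i in range(m)]) for obj in seen_in}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return top, sorted_accesses

L1 = [("c", 0.9), ("b", 0.8), ("s", 0.65), ("r", 0.5), ("a", 0.4)]
L2 = [("s", 0.85), ("a", 0.84), ("r", 0.75), ("b", 0.3), ("c", 0.2)]
top, accesses = fagin_algorithm([L1, L2], min, k=1)
print(top, accesses)
```

On this small input FA goes three levels deep before object s shows up in both lists, by which point it has already encountered every object in the database.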
11
Correctness of FA...
For any monotone t and any database D of
objects, FA finds the top k objects.
Proof: an object never seen under sorted access is
no better in any field than each of the k objects in the
intersection, so by monotonicity it cannot beat them.
12
Performance of FA
Performance: assuming that the fields are
independent, the cost is on the order of N^((m-1)/m).
Better performance - correlation between
fields
Worse performance - negative correlation
Bad aggregating function: max
13
Goals of this work
• Improve complexity and analysis: worst case is not meaningful
Instead consider Instance Optimality
• Expand the range of functions
want to handle all monotone aggregating
functions
• Simplify implementation
14
Instance Optimality
A = class of algorithms,
D = class of legal inputs.
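For an algorithm A in the class A and an input D in the class D, measure cost(A,D) ≥ 0.
• An algorithm A in the class A is instance optimal over
A and D if there are constants c1 and c2 s.t.
for every A’ in the class A and every D in the class D:
cost(A,D) ≤ c1 · cost(A’,D) + c2.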
c1 is called the optimality ratio
15
…Instance Optimality
• Common in competitive online analysis
– Compare an online decision making algorithm
to the best offline one.
• Approximation Algorithms
– Compare the size that the best algorithm can
find to the one the approx. algorithm finds
In our case
– Offline corresponds to nondeterminism
16
…Instance Optimality
• We show algorithms that are instance
optimal for a variety of
– Classes of algorithms
• deterministic, Probabilistic, Approximate
– Databases
– access cost functions
17
Guidelines for Design of
Algorithms
• Format: do sequential/sorted access (with
random access on other fields) until you
know that you have seen the top k.
• In general: greedy gathering of information;
If a query might allow you to know top k
objects do it.
Works in all considered scenarios
18
The Threshold Algorithm - TA
• For all lists L1, L2, …, Lm get next object in
sorted order.
• For each object R returned
– Retrieve all fields x1, x2, …, xm.
– Compute t(x1,x2,…xm)
– If one of top k answers so far - remember it.
• For 1 ≤ i ≤ m let xi be the bottom value seen in Li (so far)
– Define the threshold value τ to be t(x1,x2,…xm)
• Stop when found k objects with t value ≥ τ.
– Return top k objects
19
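Example: m=2, k=1, t is min
Objects:
c = (0.9, 1/12), b = (0.7, 1/11), r = (0.4, 1/8), a = (0.1, 1/13)
s = (0.05, 3/4), w = (0.07, 2/3), z = (0.09, 1/2), q = (0.08, 1/4)
List L1 (top entries): c1 = 0.9, b1 = 0.7, r1 = 0.4, a1 = 0.1
List L2 (top entries): s2 = 3/4, w2 = 2/3, z2 = 1/2, q2 = 1/4
Maintained information, step by step:
• Step 1: top object (so far) = c, t(c) = 1/12; bottom values x1 = 0.9, x2 = 3/4; threshold = 3/4
• Step 2: top object (so far) = b, t(b) = 1/11; bottom values x1 = 0.7, x2 = 2/3; threshold = 2/3
• Step 3: top object (so far) = r, t(r) = 1/8; bottom values x1 = 0.4, x2 = 1/2; threshold = 0.4
• Step 4: top object (so far) = r, t(r) = 1/8; bottom values x1 = 0.1, x2 = 1/4; threshold = 0.1 (stop: 1/8 ≥ 0.1)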
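The run on this example can be reproduced with a short sketch of TA. It is an illustration under stated assumptions rather than the authors' code: random access is simulated with dictionaries, the full sorted lists are rebuilt from the eight objects on the slide, and only the current top k (a bounded buffer) is remembered.

```python
import heapq

def threshold_algorithm(lists, t, k):
    """lists[i] is [(obj, value), ...] sorted by descending value."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # simulated random access
    best = []                               # min-heap of (score, obj): current top k
    for depth in range(len(lists[0])):
        for i in range(m):
            obj, _ = lists[i][depth]
            if any(o == obj for _, o in best):
                continue                    # already among the remembered top k
            score = t([lookup[j][obj] for j in range(m)])
            heapq.heappush(best, (score, obj))
            if len(best) > k:
                heapq.heappop(best)         # forget the weakest candidate
        # Threshold: t applied to the bottom value seen so far in each list.
        threshold = t([lists[i][depth][1] for i in range(m)])
        if len(best) == k and best[0][0] >= threshold:
            return sorted(best, reverse=True), depth + 1
    return sorted(best, reverse=True), len(lists[0])

db = {"c": (0.9, 1/12), "s": (0.05, 3/4), "b": (0.7, 1/11), "r": (0.4, 1/8),
      "w": (0.07, 2/3), "z": (0.09, 1/2), "q": (0.08, 1/4), "a": (0.1, 1/13)}
lists = [sorted(((o, f[i]) for o, f in db.items()),
                key=lambda p: p[1], reverse=True) for i in range(2)]
result, depth = threshold_algorithm(lists, min, k=1)
print(result, depth)   # r with t(r) = 1/8, found after 4 levels
```

The sketch stops at depth 4, when the threshold min(0.1, 1/4) = 0.1 drops below t(r) = 1/8.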
20
Correctness of TA
For any monotone t and any database D of
objects, TA finds the top k objects.
Proof: If object z was not seen
for all 1 ≤ i ≤ m: zi ≤ xi
⇒ t(z1, z2,…zm) ≤ t(x1,x2,…xm)
21
Implementation of TA
Requires only bounded buffers:
• Top k objects
• Bottom m values x1,x2,…xm
22
Robustness of TA
Approximation: Suppose want a (1+ε) approx.
- for any R returned and R’ not returned
t(R’) ≤ (1+ε) t(R)
Modified stopping condition:
Stop when found k objects with t value at least τ/(1+ε), where τ is the threshold value.
Early Stopping: can modify TA so that at any point
user is
– Given current view of top k list
– Given a guarantee about approximation
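The modified stopping condition is a one-line change to the exact rule. The sketch below assumes the approximation factor is written 1+ε (`eps` in the code); the function name `can_stop` and the numeric values are illustrative only.

```python
def can_stop(kth_best_score, threshold, eps=0.0):
    """TA stopping test: eps = 0 is the exact rule, eps > 0 the relaxed one."""
    return kth_best_score >= threshold / (1.0 + eps)

# Illustrative values: the k-th best score seen is 0.125, the threshold is 0.4.
print(can_stop(0.125, 0.4))            # exact rule: keep going
print(can_stop(0.125, 0.4, eps=2.5))   # a loose approximation may stop here
```

For early stopping, the same inequality can be read in the other direction: at any point, threshold / kth_best_score is an approximation factor that the current top k list is guaranteed to satisfy.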
23
Instance Optimality
Intuition: Cannot stop any sooner, since the
next object to be explored might have the
threshold value.
But, life is a bit more delicate...
24
Wild Guesses
Wild guesses: random access for a field i of
object R that has not been sequentially
accessed before
• Neither FA nor TA use wild guesses
• Subsystem might not allow wild guesses
More exotic queries: jth position in ith list...
25
Instance Optimality- No Wild Guesses
Theorem: For any monotone t let
• A be the class of algorithms that
– correctly find top k answers for every database
with aggregation function t.
– Do not make wild guesses
• D be the class of all databases.
Then TA is instance optimal over A and D
Optimality ratio is m + m^2 · cR/cS - best possible!
26
Proof of Optimality
Claim: If TA gets to iteration d, then any
(correct) algorithm A’ must get to depth d-1
Proof: let Rmax be top object returned by TA, and let τ(j) be the threshold value at iteration j
τ(d) ≤ t(Rmax) < τ(d-1)
There exists D’ with R’ at level d-1
R’ = (x1(d-1), x2(d-1), …, xm(d-1))
Where A’ fails
27
Do wild guesses help?
Aggregation function - min, k=1
Database:
Object: 1 2 … n n+1 n+2 … 2n+1
x1: 1 1 … 1 1 0 … 0
x2: 0 0 … 0 1 1 … 1
L1 : 1, 2, …, n, n+1, …, 2n+1
L2 : 2n+1, …, n+1, n, …, 1
Wild guess: access object n+1 and top
elements
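A small sketch makes the gap concrete. It builds the database from this slide for a given n (the function name and the choice n = 1000 are assumptions for illustration) and reports how deep sorted access must go in each list before object n+1 is seen. An algorithm that guesses object n+1 needs only a constant number of accesses.

```python
def wild_guess_database(n):
    """Objects 1..2n+1; x1 = 1 for the first n+1 objects, x2 = 1 for the last n+1."""
    objs = range(1, 2 * n + 2)
    db = {o: (1 if o <= n + 1 else 0, 1 if o >= n + 1 else 0) for o in objs}
    L1 = list(objs)              # sorted by x1 (ties ordered as on the slide)
    L2 = list(reversed(objs))    # sorted by x2 (ties ordered as on the slide)
    return db, L1, L2

n = 1000
db, L1, L2 = wild_guess_database(n)
winner = max(db, key=lambda o: min(db[o]))   # the only object with min = 1
depth_L1 = L1.index(winner) + 1
depth_L2 = L2.index(winner) + 1
print(winner, depth_L1, depth_L2)   # object n+1 sits at depth n+1 in both lists
```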
28
Strict Monotonicity
• An aggregation function t is strictly monotone if
when for all 1 ≤ i ≤ m: xi < x’i
Then
t(x1, x2,…xm) < t(x’1,x’2,…x’m)
Examples: min, max, avg...
29
Instance Optimality - Wild
Guesses
Theorem: For any strictly monotone t let
• A be the class of algorithms that
– correctly find top k answers for every database.
• D be the class of all databases with distinct
values in each field.
Then TA is instance optimal over A and D
Optimality Ratio is c · m where
c=max{cR /cS ,cS /cR }
30
Related Work
An algorithm similar to TA was discovered
independently by two other groups
• Nepal and Ramakrishna
• Güntzer, Balke and Kiessling
No instance optimality analysis
Hence proposed modifications that are not
instance optimal
Power of Abstraction?
31
Dealing with the Cost of Random
Access
In some scenarios random access may be impossible
Cannot ask a major search engine for its internal score on
some document
In some scenarios random access may be expensive
Cost corresponds to disk access (seq. vs. random)
Need algorithms to deal with these scenarios
• NRA - No Random Access
• CA - Combined Algorithm
32
No Random Access - NRA
March down the lists getting the next object
Maintain:
• For any object R with discovered fields S ⊆ {1,..,m}:
– W(R) = t(x1,x2,…,x|S|,0,…,0)
Worst (smallest) value t(R) can obtain
– B(R) = t(x1,x2,…,x|S|, x|S|+1, …, xm), where each undiscovered field is replaced by the bottom value seen so far in its list
Best (largest) value t(R) can obtain
33
…maintained information (NRA)
• Top k list, based on k largest W(R) seen so far
– Ties broken according to B values
Define Mk to be the kth largest W(R) in top k list
• An object R is viable if B(R) > Mk
Stop when there are no viable elements left
I.e. B(R) ≤ Mk for all R ∉ top list
Return the top k list
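The bookkeeping of NRA can be sketched as below. This is a deliberately naive illustration, not the authors' implementation: it recomputes W and B for every seen object at each level, and it assumes that an object not yet seen at all is treated as viable while t of the bottom values exceeds Mk. The sample data are the two five-object lists shown earlier in the deck.

```python
def nra(lists, t, k):
    """lists[i] is [(obj, value), ...] sorted by descending value. Sorted access only."""
    m = len(lists)
    known = {}            # obj -> {list index: value} discovered so far
    bottom = [1.0] * m    # bottom value seen so far in each list

    def worst(obj):       # W(R): undiscovered fields set to 0
        return t([known[obj].get(i, 0.0) for i in range(m)])

    def best(obj):        # B(R): undiscovered fields set to the bottom values
        return t([known[obj].get(i, bottom[i]) for i in range(m)])

    top = []
    for depth in range(len(lists[0])):
        for i in range(m):
            obj, value = lists[i][depth]
            known.setdefault(obj, {})[i] = value
            bottom[i] = value
        # Top k by W, ties broken by B.
        ranked = sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        m_k = worst(top[-1])
        # Assumption: a completely unseen object is viable while t(bottom) > M_k.
        if t(bottom) <= m_k and all(best(o) <= m_k for o in rest):
            return top, depth + 1
    return top, len(lists[0])

L1 = [("c", 0.9), ("b", 0.8), ("s", 0.65), ("r", 0.5), ("a", 0.4)]
L2 = [("s", 0.85), ("a", 0.84), ("r", 0.75), ("b", 0.3), ("c", 0.2)]
print(nra([L1, L2], min, k=1))
```

Note that NRA returns the top k objects without necessarily knowing their exact t values; here s happens to be fully discovered by the time the algorithm stops.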
34
Correctness
For any monotone t and any database D of
objects, NRA finds the top k objects.
Proof: At any point, for all objects t(R) ≤ B(R)
Once B(R) ≤ Mk for all but top list
no other objects with t(R) > Mk
35
Optimality
Theorem: For any monotone t let
• A be the class of algorithms that
– correctly find top k answers for every database.
– make only sequential access
• D be the class of all databases.
Then NRA is instance optimal over A and D
Optimality Ratio is m
36
Implementation of NRA
• Not so simple - need to update B(R) for all
existing R when x1,x2,…xm changes
• For specific aggregation functions (min)
good data structures
Open Problem: Which aggregation functions
have good data structures?
37
Combined Algorithm CA
Can combine TA and NRA
Let h = cR /cS
Maintain information as in NRA
For every h sequential accesses:
• Do m random accesses on an object
from each list. Choose the top viable object for
which not all fields are known
38
Instance Optimality
Instance optimality statement a bit more complex
Under certain assumptions (including t = min, sum)
CA is instance optimal with optimality ratio ~ 2m
39
Further Research
• Middleware Scenario:
– Better implementations of NRA
– Is large storage essential?
– Additional useful information in each list?
• How widely applicable is instance optimality?
– String Matching, Stable Marriage...
• Aggregation functions and methods in other scenarios
– Rank Aggregation of Search Engines
• P=NP?
40
More Details
See
www.wisdom.weizmann.ac.il/~naor/PAPERS/middle_agg.html
41