The Geometric Method - 2013

Download Report

Transcript The Geometric Method - 2013

Monitoring Distributed
Data Streams
Assaf Schuster
3/21/2017
1
Distributed Stream Networks


Large scale and widespread networked systems
Continuous production of data
High volume
 Dynamic nature


Required to detect a global property

Often in (near) real time
2
Web Page Frequency Counts



Mirrored web site
Mirrors record the frequency of requests for pages
Detect when the global frequency of requests for a page
exceeds a predetermined threshold
Req #1
Req #2
Req #3
Europe
Asia
100
50
25
0
America
3/21/2017
Africa
3
Air Quality Monitoring




Sensors monitoring the
concentration of air
pollutants.
Each sensor holds a data vector comprising
measured concentration of various pollutants
(CO2, SO2, O3, etc.).
A function on the average readings determines
the Air Quality Index (AQI)
Issue an alert in case the AQI exceeds a given
threshold.
3/21/2017
4
Sensor Networks

Sensors monitoring the temperature in a server
room (machine room, conference room, etc.)


Ensure uniform temp.: monitor variance of readings
Alert in case variance exceeds a threshold

Temperature readings by n sensors x1, …, xn

Each sensor holds a data vector vi = (xi2, xi )T

The average data vector is v =

1

n

Var(all sensors) =
1
n
3/21/2017
n

i 1
1
xi  
n

2
n

i 1

xi 


n
x
i 1
2
i
1
n
n

i 1

xi 


T
2
5
Search Engine

Distributed datacenter/warehouse




10Ks horizontal partitions
“Our logs are larger than any other data by orders of
magnitude. They are our source of truth.” Sridhar
Ramaswamy. SIGMOD’08 keynote on “Extreme Data Mining”
Mining the logs: Compute pairs of keywords for
which the correlation index is high
Thousands simultaneous tasks

“Network bandwidth is a relatively scarce resource in
our computing environment”. Dean and Ghemawat.
MapReduce paper, OSDI’04
3/21/2017
6
Cloud Health Monitoring
Amazon Web Services » Service Health Dashboard
Amazon S3 Availability Event: July 20, 2008
Amazon S3 Availability Event: July 20, 2008
“At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and
our alarms went off. By 8:50am PDT, error rates were significantly elevated and
very few requests were completing successfully. By 8:55am PDT, we had multiple
engineers engaged and investigating the issue. Our alarms pointed at problems
processing customer requests in multiple places within the system and across
multiple data centers. While we began investigating several possible causes, we tried
to restore system health... At 9:41am PDT, we determined that servers within
Amazon S3 were having problems… By 11:05am PDT, all server-to-server
communication was stopped, request processing components shut down, and the
system's state cleared…. “
3/21/2017
7
Ad-Hoc Mobile P2P Networks
Peer-to-peer network invites drivers to get connected
CarTorrent could smarten up our daily commute, reducing accidents and bringing multimedia journey data to our fingertips
•Laura Parker
•The Guardian,
•Thursday January 17 2008
“The name BitTorrent has become part of most people's
day-to-day vernacular, synonymous with downloading every
kind of content via the internet's peer-to-peer networks.
But if a team of US researchers have their way, we may all be talking
about CarTorrent in the not too distant future…..
Researchers from the University of California Los Angeles are working on a
wireless communication network that will allow cars to talk to each other,
simultaneously downloading information in the shape of road safety
warnings, entertainment content and navigational tools….”
3/21/2017
8
3/21/2017
9
Distributed Monitoring –
State of the Art

Periodically send all data to a central location
High communication
 High latency


A tradeoff
Expensive central resources
 Power inefficient


Can we do better?
Linear systems 
 Non-linear systems 

3/21/2017
10
Monitoring Distributed Non-Linear Functions
f ( x)
Threshold
x
T
x y
2
y
 x y
f ( x)  T , f ( y )  T , f 
T
 2 
3/21/2017
11
Mutual Information
Given a 2X2 table
 , the mutual information is defined as




11
12
MI ( )  11 log 
 12 log 



 (11 12 )(11  21 ) 
 (11 12 )(12  22 ) 




 21
 22
 21 log 
  22 log 


(



)(



)
(



)(



)
 21 22 11 21 
 21 22 12 22 
 0.9 0.03 
 0.04 0.02 
1  
, 2  

0.02
0.05
0.03
0.91




 1   2 
MI ( 1 )  0.104, MI ( 2 )  0.082, MI 
  0.494
 2 
The mutual information of the global table is much larger than the
local values. As in the parabola case, there’s no way to infer about the
global MI given the local ones.
3/21/2017
12
Non-Linear Functions
“…
The link function is, of course, nonlinear. So we
agonize over trading off optimization
performance with ability to use the massive
infrastructure.
…”
Sridhar Ramaswamy.
SIGMOD’08 Keynote talk on “Extreme Data Mining”
Slide title: “10 top reasons why googlers do not sleep at night”
(Coffee is reason #5)
3/21/2017
13
Geometric Method – Idea



The behavior of a general function over
distributed data may be hard to see
 Local indications may be misleading
 Non-linear
Looking at the *domain* of the function
may be easier
For long periods, the local inputs are
stationary, or do not change much
3/21/2017
14