Systems Area: OS and Networking

Download Report

Transcript Systems Area: OS and Networking

The Data-Centric Revolution in
Networking?
Scott Shenker
International Computer Science Institute
U. C. Berkeley
Liberally stealing the insight and work of others,
particularly Hari Balakrishnan, Deborah Estrin,
Ramesh Govindan, Joe Hellerstein, and Ion Stoica
1
Two Communities Apart

Networking (Internet) researchers:
- don’t know and don’t care about databases

Vast gap between communities
- much more overlap with other systems communities

But data-centrism has narrowed the gap
- metaphors and algorithms

This talk will tell that story: in reverse order
- Internet, then sensornets
2
Our Central Mission
Get data from here to there
3
Host-centric Protocols

Protocols defined in terms of IP addresses:
- Unicast: IP address = host
- Multicast: IP address = set of hosts

Destination address is given to protocol

Protocol delivers data from one host to another
- unicast: conceptually trivial
- multicast: address is logical, not physical
4
Host-centric Applications

Classic applications: destination is “intrinsic”
-

telnet: target machine
FTP: location of files
electronic mail: email address turns into mail server
multimedia conferencing: machines of participants
Destination is specified by user (not network)
- Usually specified by hostname not address

DNS translates names into addresses
5
Domain Name System (DNS)

DNS is built around recursive delegation
- Top level domains (TLDs): .com, .net, .edu, etc.
- TLDs delegate authority to subdomains
• berkeley.edu
- Subdomains can further delegate
• cs.berkeley.edu

Hierarchy fits host administrative structure
- Local decentralized control
- Crucial to efficient hostname resolution
6
Network Research in Early 90’s

Consumed by a few obsessions:
- Quality of service for streaming media
- Multicast
- Congestion control

But nobody questioned host-centricity:
- assumed to be the only way to build Internet
7
Surprise #1: The web catches on!
But we don’t....
8
The Web

Web URLs have host-name/path format
- Essentially the same information as FTP

Early web:
- browsers basically a GUI for FTP
- URLs were easily transmitted pointers

Early web was host-centric
- and largely ignored (but used) by net researchers
9
Modern Web

URLs often function as names of data
- users think of www.cnn.com as data, not a host
- Fact that www.cnn.com is a hostname is irrelevant

Users want data, not access to particular host

The web is now data-centric
10
Data-centric App in Host-centric World

Data still associated with host names (URLs)
- administrative structure of data same as hosts
- weak point in current web

Key enabler: search engines
- Searchable databases map keywords to URLs
- Allowed users to find desired data

Networkers focused on technical problems:
- HTTP, persistence (URNs), replication (CDNs), ...
11
We Missed the Point!
We thought:
 web was an aberration
 search engines were a sufficient hack
No networker (except Jacobson) articulated that:
 web had gone from host-centric to data-centric
 it was a harbinger of future applications
12
Surprise #2: Stolen Music is Popular!
And we finally get the message...
13
The P2P Filesharing Phenomena

Napster: “Fastest growing Internet application”

Music sharing is intrinsically data-centric
- data never associated with hosts

Centralized searchable database
- listed IP addresses where content could be found
- analogous to Google+DNS in the web

Legal problems forced decentralization
- Led to Gnutella and other distributed programs
14
Gnutella-style File Sharing

Gnutella nodes form an overlay network
- each node has a few “neighbors” in a virtual network
- virtual link: node knows other’s IP address
- do app-level “networking” on this graph
15
Gnutella-style Searching

Keyword queries are flooded (within scope)
- query is processed locally at each node
- all nodes having hits respond to source
- many variations on this theme (freenet, etc.)

Clearly not scalable

P2P traffic now sizable fraction of overall load

We finally realize that we need a scalable way
to find data for data-centric applications
16
Is there life outside the Internet?
Yes, and we should have been listening!
17
Sensornets (predating P2P)

Vision:
- Many sensing devices with radio and processor
- Enable fine-grained measurements over large areas
- Huge potential impact on science, and society

Technical challenges:
- untethered: power consumption must be limited
- unattended: robust and self-configuring
- wireless: ad hoc networking
18
Conceptual Challenge

Sensornets are inherently data-centric
- Users know what data they want, not where it is
- Estrin, Govindan, Heidemann (2000, etc.)

Centralized database infeasible
- vast amount of data, constantly being updated
- small fraction of data will ever be queried
- sending to single site expends too much energy
19
Flood-then-Aggregate
General class of methods:
 Flood query to all nodes (or in region)
 Nodes with data matching query respond
 Responses are aggregated as appropriate
Examples:
 Directed diffusion: reinforce based on data
 TAG: tree for flood and return-path aggregation
 Etc....
20
Scaling Problems
This approach suffers as:
 systems get bigger
 queries more frequent and more specific
For current deployments, not an issue:
 systems are small, queries primitive
But if technology progresses as hoped:
 want to get relevant data without flooding
 similar to situation in Internet
21
Is Data-centric Flooding Necessary?

The initial decentralized data-centric designs (in
both Internet and sensornets) used flooding
- unscalable and unsustainable

Since data-centrism is here to stay, we can’t
ignore this problem

We had to broaden our research charter
22
Our Revised Mission
Get data from here to there
23
A DNS for Data?

Can we map data names into addresses?
- a data-centric DNS, distributed and scalable
- doesn’t alter net protocols, but aids data location
- not just about stolen music, but a general facility

A formidable challenge:
- Data does not have a clear administrative hierarchy
- Likely need to support a flat namespace
- Can one do this scalably?

Data-centrism requires scalable flat lookups
24
Distributed Hash Tables (DHTs)
The latest networking fad....
Presented from the Internet perspective but
applies to sensornets as well
25
An Internet-scale Distributed Index
Interface: put(key,object), get(key)
DHTs form a structured overlay network

nodes choose particular neighbors

all objects have keys, usually hash(name)

each node responsible for range of keys

puts/gets routed to appropriate node
26
Example Design: Chord

Node

and

object keys:

Neighbors:
 



- random location around a circle







- nodes 2-i around the circle

- found by routing to desired key

Routing: greedy




- pick nbr closest to destination

Storage: “own” interval





Ownershi

p
range



- node owns key range between
her key and previous node’s key
27
Key Properties

Large aggregate capacity: O(n) storage/b’width

Scalable:
- O(log n) routing hops and state
- O((log n)2) update costs for node join/leaves

Robust: self-configuring and resilient to failures

Nonproperty: strict guarantees when failures
28
Our Version of “Data Independence”

DHT interface allows us to get data by name
- We no longer care where data is

A radical transition in databases
- perhaps it will be one in networking as well

Apologies to Joe Hellerstein...
- see latest SIGMOD Record for his article
29
Caveat!

DHTs are a work-in-progress

A flurry of research activity on:
-

security
replication
proximity
real operational experience
.....
For rest of talk, we put these worries aside...
30
Why Not Centralized Solutions?

Ugh! (and infeasible for sensornets)

Fault tolerance: avoid single point of failure

Economic:
- DNS: “donated” machines, scales organically
- Centralized solutions require business model

Issue still open....but irrelevant to data-centrism
- need to support interface
- DHTs allow us to choose between cent. and decent.
31
Multiple Roles for DHTs

Application-specific
- rolled into P2P application, run on “peers”

General-purpose service
- run on managed nodes

Intrinsic part of Internet architecture
- run on managed nodes
32
Multiple Roles for DHTs

Application-specific
- rolled into P2P app, run on “peers”

General-purpose service
- run on managed nodes

Intrinsic part of Internet architecture
- run on managed infrastructure nodes
33
Some Applications using DHTs
Partial list:
 File sharing
 Storage repositories and file systems
 Backup systems
 Event notification systems
 Electronic mail
 App-layer multicast and streaming media
 .....
Useful substrate for many (not all) large distributed
applications because HTs are useful
34
Multiple Roles for DHTs

Application-specific
- rolled into P2P app, run on “peers”

General-purpose service
- run on managed nodes

Intrinsic part of Internet architecture
- run on managed infrastructure nodes
35
Internet-scale Query Processing

Superficial motivation:
- Joins can be implemented with hash tables so...
- Distributed joins can be implemented with DHTs
- Scaling: latency O(log n) while computation O(n)

PIER (talk later today in session A9!):
- joins, aggregation, recursive and continuous queries

Intended targets:
- data “in the wild” (filesharing, net monitoring, etc.)
- schema provided by standardized protocols
- no need for ACID semantics
36
More Complex Queries
Range search:
 using “prefix hash table”
 no need to walk tree
Keyword search:
 engineering the boolean approach
Active research on DHT-based distributed data
structures for search (net and db communities)
37
Multiple Roles for DHTs

Application-specific
- rolled into P2P app, run on “peers”

General-purpose service
- run on managed nodes

Intrinsic part of Internet architecture
- run on managed infrastructure nodes
38
Cleaning Up the Architecture

Making URNs a reality
- webNG based on flat and opaque DHT keys
- enables persistence and eliminates branding

Host identifiers versus routing information
- IP addresses currently (and stupidly) serve as both
- DHT key = host id, resolves to routing address
- Architectural challenge for basic protocols
39
Subverting the Architecture

Use DHT for forwarding, not just lookup!
- e.g., Internet Indirection Infrastructure (i3)
- similar in spirit to multicast (logical addressing)
- transcends current naming/addressing structures

Make overlay the real “network” layer
- turn IP into a link layer technology

Leverages, not limited by, current infrastructure

New network layer is still simple, but not IP
40
New Generation of Networking?

Current Internet relies on hierarchies to scale:
- DNS naming, IP addressing, etc.

Hierarchies limit flexibility:
- addresses and names have to fit given structure
- need to care “where” data/machines are

Scalable flat lookup avoids hierarchy
- network would be “structure independent”

Less of a distinction between hosts and data
41
Do DHTs Apply to Sensornets?
Can we build them?
Do they help?
42
Finding Sensornet Data w/o Flooding

Extract high-level features or “events”
- Temperature spikes, toxins, animal sightings
- Name these events

Store/Access events with DHT-like structure
- Can later get detailed data from specific nodes

Call this “data-centric storage” (DCS)
- Good for frequent specific queries
- Not good for long-running or aggregate queries

But how do you build a sensornet DHT?
43
Geographic Routing


Nodes know own and neighbors’ positions
Packets routed to geographic destination
- Greedy forwarding, when possible
- If greedy fails at a void, use the right hand rule to
navigate around the void
A
B (x,y)
44
“Geographic” Hash Table (GHT)




Keys hashed to random coordinates
Likely no node exists at that location!
Forwarding ends at node closest to destination
Closest node stores the data
A
(x,y)
45
Additional Algorithms

Caching and replication
- Cache around perimeter, replicate independently

Structured replication (SR)
- Hierarchical decomposition of key space
- Tree of “mirror images”
Hash(event)
Mirror
Images
46
More Complex Queries
Using GHT+SR: (which has spatial structure)
 Range searches in space and value
 Wavelet analysis
New data structures:
 Higher-dimensional range searches
Active research in distributed data structures for
sensornet queries (in net and db communities)
47
We are finally on our way to the
land of “data independence”...
We ask for your guidance....
48
Areas of Common Interest

Algorithmic:
- distributed data structures for search

Metaphoric:
- thinking about data independence
49