φ - Frontiers in Distributed Information Systems
φ (phi)
public health for the internet
joe hellerstein
intel research & uc berkeley
agenda
• three visions driving φ
• building block: the PIER query engine
• challenges, synergies
vision 1: shift network security from medicine to public health
• security tools focused on “medicine”
• vaccines for viruses
• improving the world one patient at a time
• weakness/opportunity in the “public health” arena
• public health: population-focused, community-oriented
• epidemiology: incidence, distribution, and control in a population
φ: a new approach
• enable population-wide measurement
• engage end users: education and prevention
• understand risky behaviors, at-risk populations.
a center for disease control?
• [staniford/paxson/weaver 2002]
  • am I being targeted?
  • is this remote host a “bad guy”?
  • is there a new type of activity?
  • is there global-scale activity?
• who owns the center? what do they control?
• this will be unpopular at best
• electronic privacy for individuals
• the internet as “a broadly surveilled police state”?
• dan geer, former cto of @Stake
• provider disincentives
• Transparency = maintenance cost
• and hardly ubiquitous
• can monitor the chokepoints (isp’s)
• but inside intranets??
• e.g. corporate IT
• e.g. berkeley dorms
• e.g. grassroots WiFi agglomerations?
energizing the end-users
• endpoints are ubiquitous
• internet, intranet, hotspot
• toward a uniform architecture
• end-users will help
• populist appeal to home users is timely
• enterprise IT can dictate endpoint software
• differentiating incentives for endpoint vendors
• the connection: peer-to-peer technology
  • harnessed to the good!
  • ease of use
  • built-in scaling
  • decentralization of trust and liability
p2p technology is ripe. is there a noble app here with significant uptake?
demo time
vision 2: shared network monitoring
• endpoint monitoring becoming a trend
  • NETI@Home (GA Tech)
  • DIMES (TAU)
  • ForNet (Polytechnic)
  • DShield
  • DOMINO (Wisconsin)
• we share the vision!
• but all facing key challenges in getting uptake
• what’s in it for the community members?
• disincentives: privacy & security risks
a communal approach
• enable multiple efforts with a single distributed infrastructure
• extensible endpoint “sensors” and visualizations
• shared engine connecting them up
• a group bands together on the hard systems and crypto
  • cost-effective data processing and analysis
  • verifiable data and processing
  • distributed resource limiting
  • toolkit of privacy-preserving, distributed dataflow components
• a theme: dissemination is as important as collection
• attract end-users with visible community information
• enable real-time swapping across research teams
• there may be much more here (see next vision!)
• intel research is prepared to invest in this community
• as we did with planetlab
vision 3: the network oracle
• imagine that you knew everything about the internet, at every moment
  • network maps
  • link loading
  • point-to-point latency and bandwidth
  • event detections (e.g., from firewalls)
  • naming (DNS, ASes, etc.)
  • end-system software configuration information
  • router configurations and routing tables
• how would this change things?
  • the design of protocols
  • the design of networked applications
  • network and system management (performance and security)
  • the economy (and policy) of network clients and ISPs
  • etc.
a dirty (not-so) secret
• we’re sneaking up on the oracle already
• overlays are a subversive attempt to wrest control from ISPs
• overlays compute and disseminate measurements
• appetite for measurement and functionality is growing
• everybody’s favorite planetlab exercise: all-pairs ping (sketch below)
• detour routing a la RON
• custom routing a la i3/ROSE
• but this is not being done systematically
  • every overlay does its own thing, opaquely
  • granularity of aggregation in time and space not well explored
  • measurement & dissemination often secondary/implicit
  • algorithmic/architectural choices abound, little exploration
• and the brass ring remains…
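a minimal sketch of the all-pairs ping exercise mentioned above, assuming nothing about the real PlanetLab tooling: every node probes every other node and the results form a full latency matrix. probe_rtt() is a placeholder, not an actual ICMP or application-level probe.

```python
import itertools
import random

# Illustrative only: the structure of the all-pairs ping exercise.
# probe_rtt() is a stand-in; a real deployment would send ICMP or
# application-level probes from each PlanetLab node.

def probe_rtt(src: str, dst: str) -> float:
    """Placeholder for one round-trip measurement, in milliseconds."""
    return round(random.uniform(10, 300), 1)

def all_pairs_ping(nodes):
    """Every node probes every other node, yielding a full latency matrix."""
    return {
        (src, dst): probe_rtt(src, dst)
        for src, dst in itertools.permutations(nodes, 2)
    }

matrix = all_pairs_ping(["nodeA", "nodeB", "nodeC"])
for (src, dst), rtt in sorted(matrix.items()):
    print(f"{src} -> {dst}: {rtt} ms")
```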
wrapping up: 3 visions
• multiple rationales to pursue this agenda
• commonalities
  • many networked sensors
  • many computational agents for data processing
  • many destinations for result dissemination
  • decentralized infrastructure:
    • organic scaling
    • no centralized maintenance
    • no single unified repository of raw data (privacy ramifications)
• differences (invariably!)
• desired data granularities, in time and space
• “reach” of querying and dissemination
• sensitivity to privacy issues
• goal: a shared infrastructure
• shared effort to develop and extend it, seeded by intel research
• shared bootstrap deployment (planetlab and beyond)
agenda
• three visions driving φ
• building block: the PIER query engine
• challenges, synergies
pier: p2p information exchange & retrieval
• a wide-area distributed dataflow engine
• designed to scale to thousands or millions of nodes
• outfitted with “streaming” relational operators, recursive graph queries
• fully extensible dataflow graphs, SQL-like interface for convenience
• built on distributed hash table (DHT) overlays
• a put()/get() hashtable interface for the Internet.
• content-based routing, soft-state semantics
• pier is DHT-agnostic (CAN, Chord, Bamboo); toy sketch below
• a very different design point than DB2, Oracle, etc.
  • scale = # machines, not necessarily # bytes
  • relaxed consistency a requirement (not really a dataBASE at all)
  • organic scaling
  • data lives in its natural habitat
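a minimal, single-process sketch of the put()/get() soft-state interface PIER expects from a DHT; the hashing scheme, node ring, and expiry handling here are illustrative stand-ins, not the actual CAN, Chord, or Bamboo implementations.

```python
import hashlib
import time
from bisect import bisect_left

# Illustrative only: a single-process stand-in for the put()/get()
# interface PIER layers over a DHT (CAN, Chord, Bamboo, ...).

def key_hash(key: str) -> int:
    """Map a key into the identifier space (content-based routing)."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ToyDHT:
    def __init__(self, node_ids):
        # Sorted ring of node identifiers; each node owns a slice of the key space.
        self.ring = sorted(node_ids)
        self.store = {node: {} for node in self.ring}

    def _owner(self, key: str) -> int:
        """The key's hash, not a central directory, decides where it lives."""
        h = key_hash(key) % (max(self.ring) + 1)
        i = bisect_left(self.ring, h)
        return self.ring[i % len(self.ring)]

    def put(self, key, value, ttl=30.0):
        # Soft state: items expire unless the publisher refreshes them.
        self.store[self._owner(key)][key] = (value, time.time() + ttl)

    def get(self, key):
        value, expires = self.store[self._owner(key)].get(key, (None, 0.0))
        return value if time.time() < expires else None

dht = ToyDHT(node_ids=[10, 200, 3000, 40000])
dht.put("snort:192.0.2.7", {"alerts": 12})
print(dht.get("snort:192.0.2.7"))   # -> {'alerts': 12} until the TTL lapses
```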
initial pier applications
• φ intrusion app
  • real-time snort aggregation from ~300 planetlab nodes
  • identification of top-10 attackers (validating DOMINO); sketch below
  • real-time joins: “who are my attackers attacking?”
  • plausible end-user visualizations
• transitive closures and other graph algorithms
• distributed gnutella crawler
• distributed web crawler
• shortest paths queries (distance vector routing)
• improved filesharing for rare items
• deployed as hybrid gnutella ultrapeer on 50 planetlab nodes
• intercepts gnutella queries, identifies “rare items”, and publishes them
• 18% decrease in number of unnecessarily empty query results
• 66% possible with better “rare item” identification
• upshot: reasons to believe the generality is real
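a sketch of the logic behind the top-10 attackers aggregate, run here in a single process under an assumed event format; in PIER the partial counts would be computed at each endpoint and merged through the DHT rather than in one script.

```python
from collections import Counter

# Illustrative only: what the "top-10 attackers" aggregate computes.
# The event format and the two-phase split (local count, then merge)
# are assumptions; PIER's actual operators are not shown here.

def local_counts(snort_events):
    """Each endpoint counts alerts per source address from its own Snort log."""
    return Counter(event["src_ip"] for event in snort_events)

def merge_and_top_k(partial_counts, k=10):
    """Partial aggregates from many nodes merge into one global top-k list."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total.most_common(k)

# Two endpoints' worth of (made-up) alerts:
node_a = local_counts([{"src_ip": "198.51.100.9"}, {"src_ip": "203.0.113.4"}])
node_b = local_counts([{"src_ip": "198.51.100.9"}])
print(merge_and_top_k([node_a, node_b]))   # [('198.51.100.9', 2), ('203.0.113.4', 1)]
```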
pier in the φ context
• goal is for pier to serve as an information plane
• gather data from “sensors”
• perform basic filtering, aggregation, combination
• though aggregation can be rather fancy (e.g., wavelet encoding; sketch below)
• disseminate the right “cooked” data to the right people
• and do so in a “trusted” way
• privacy and security
• manageability
• but … only a piece of the puzzle
  • active probing
  • mapping
  • backbone monitors
  • network forensics, tomography
  • honeypots
  • etc.
• we won’t do all of this ourselves!
• gathering playmates
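a sketch, under simplifying assumptions, of why wavelet encoding works as a data-reducing aggregate: one level of a Haar transform turns a vector of measurements into pairwise averages plus detail coefficients, and small details can be dropped before shipping data onward. This illustrates the idea only; it is not PIER's encoding operator.

```python
# Illustrative only: one Haar wavelet step used as a data-reducing aggregate.

def haar_step(values):
    """Split an even-length vector into pairwise averages and detail coefficients."""
    pairs = list(zip(values[0::2], values[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return averages, details

def compress(values, threshold=1.0):
    """Keep all averages, but only detail coefficients above the threshold."""
    averages, details = haar_step(values)
    kept_details = [(i, d) for i, d in enumerate(details) if abs(d) >= threshold]
    return averages, kept_details   # much smaller than the raw vector when data is smooth

# e.g., per-minute alert counts from one sensor (made-up numbers)
raw = [4, 4, 5, 5, 30, 2, 6, 6]
print(compress(raw))   # ([4.0, 5.0, 16.0, 6.0], [(2, 14.0)])
```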
agenda
• three visions driving φ
• building block: the PIER query engine
• challenges, synergies
challenges
[diagram: layered view, declarative queries → query plan → overlay network → physical network, annotated with challenges at each layer]
• general challenges: security, privacy, quality of service
• declarative queries / query plan: query optimization, multi-query optimization, catalogs, persistent storage, recursion on graphs
• overlay network: query dissemination, replication, soft state, quality of service, net-embedded functions
• physical network: resilience, route flapping, efficiency
current limitations of pier
• query per client
• no systematic sharing of computation/results across queries
• locality control forfeited to dht
• difficult to express local gossiping rules
• queries, not triggers
• alerts currently supported via polling (sketch below)
• loose query semantics
• network dynamics and timing make guarantees hard
• active monitoring
• we can do it, but it’s not systematic
• security/privacy
• we’re attacking many of these now
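a minimal sketch of the polling workaround noted above: re-issue a snapshot query on a timer and act when the result crosses a threshold. run_query() is a hypothetical stand-in for submitting a query to the engine, not a real PIER API.

```python
import time

# Illustrative only: emulating an alert trigger by polling a snapshot query.
# run_query() is a hypothetical stand-in, not a real PIER client call.

def run_query(query_text: str) -> int:
    """Placeholder: would return the current value of a monitoring query."""
    return 0

def poll_for_alert(query_text, threshold, interval_s=30.0, max_polls=10):
    """Re-run the query every interval; report when the result crosses the threshold."""
    for _ in range(max_polls):
        value = run_query(query_text)
        if value >= threshold:
            return value   # in a real system: push a notification to the user
        time.sleep(interval_s)
    return None

poll_for_alert("SELECT COUNT(*) FROM alerts WHERE src_ip = '198.51.100.9'",
               threshold=100, interval_s=0.1, max_polls=3)
```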
so, is pier the “right” infrastructure?
• not today
• though many of the decisions seem sound
• level of indirection between task specification and execution
• non-hierarchical model provides flexibility and simplicity
• vs. domain hierarchy (a la ip naming)
• vs. data hierarchies (a la xml)
• extensible aggregation + relational operators cover a lot of territory
  • monitoring
  • routing
potential synergies
• design of shared info plane
  • scenarios & requirements
  • architectural brickbats
  • built-in components
  • complementary components
• and requirements for integration
• understanding the opportunity
• what if the network oracle existed?
• fostering the community
• leveraging each other’s efforts to get mindshare
• resources
• if the intel genie granted you a wish…
• (think about building/leveraging community)
backup slides
A Note on Structured Data on Networks
• Industrial Revolution for Information
• Mechanized data generation
• Sensing the physical world
• Monitoring software, networks, machines
• Tracking objects, processes, behaviors
• Uniformity of products
• Mass Transport of Data and Computation
• Data generators and consumers spread over the Internet and the Planet
• Happening at both extremes
• Compare to hand-generation of text