Towards Wireless Overlay Network Architectures


A Research Program in
Reliable Adaptive
Distributed Systems (RADS)
Armando Fox*, Michael Jordan, Randy Katz, George
Necula, David Patterson, Ion Stoica, Doug Tygar
University of California, Berkeley
*Stanford University
1
What Are We Trying to Do:
New Approach for RADS
Dramatically improve the trustworthiness of
networked systems
• Observe: design observation points throughout
system
• Analyze: infer via statistical learning
– Respond: detect anomalous behavior vs. baseline
– Learn: use observations to modify responses to future
observations
• Act:
– Reactive: use control points in system for rapid recovery
when something wrong is detected
– Proactive/protective: prophylactically act on system to
prevent predicted impending failure
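The observe/analyze/act cycle above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Monitor` class, the z-score test against a sliding-window baseline, and the `"recover"` action are hypothetical stand-ins, not the statistical learning methods the RADS program proposes.

```python
from collections import deque
from statistics import mean, pstdev


class Monitor:
    """Hypothetical monitor: observe a metric, compare to a baseline, choose an action."""

    def __init__(self, window=20, threshold=3.0):
        self.history = deque(maxlen=window)  # recent normal observations (the baseline)
        self.threshold = threshold           # z-score treated as anomalous (assumed value)

    def analyze(self, value):
        """Return True if value deviates from the baseline by more than threshold sigmas."""
        if len(self.history) < 5:
            return False  # not enough data to form a baseline yet
        mu = mean(self.history)
        sigma = pstdev(self.history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

    def step(self, value):
        """Run one observe/analyze/act cycle and return the chosen action."""
        if self.analyze(value):
            return "recover"        # Act (reactive): a control point would fire here
        self.history.append(value)  # Learn: fold normal observations into the baseline
        return "ok"


monitor = Monitor()
readings = [100, 102, 99, 101, 100, 98, 103, 100, 500, 101]
actions = [monitor.step(r) for r in readings]
print(actions)
```

Only the reading of 500 triggers `"recover"` here. Anomalous readings are deliberately kept out of the baseline so that one bad observation does not mask the next.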
2
Today’s Systems are Too Brittle
• Fragile, easily broken, yielding poor dependability and
security
– E.g., Amazon: yearly revenue $3.1B, downtime costs $600,000/hr
• Why?
– Existing systems focus on performance, not fast adaptive detection
and response to failure and attack
– Fundamentally incorrect assumptions
» Humans are perfect
» Software can be made bug free
» Maintenance is “free”
• People/HW/SW failures are facts, not problems
“If a problem has no solution, it may not be a problem,
but a fact--not to be solved, but to be coped with over time”
— Shimon Peres
3
Failures and Attacks Inevitable … so
Design for Rapid Adaptation
• Rapid application and server recovery, agile network
rerouting, proactive protective actions ...
– No distinction between “normal operation” and “recovery”
• Elements of our solution
– Programming paradigms for robust recovery
– Crash-only software design for rapid server recovery
– Network protocols designed for observation to allow rapid
detection of behavioral violations
– Instrumentation and online statistical analysis for anomaly
detection and failure diagnosis/localization
• Adaptation benchmarks to measure progress
– What you can’t measure, you can’t improve
– Collect real failure data to drive benchmarks
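The idea of making no distinction between normal operation and recovery can be sketched as follows. This is an illustrative toy under stated assumptions, not the project's middleware: `CrashOnlyCounter`, its JSON log format, and the file name are all hypothetical. The component has no shutdown routine, and its only startup path is replaying a durable log.

```python
import json
import os
import tempfile


class CrashOnlyCounter:
    """Hypothetical counter whose only startup path is recovery from its on-disk log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.value = 0
        self._recover()  # startup and recovery share one code path

    def _recover(self):
        """Rebuild in-memory state by replaying every committed log entry."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as log:
            for line in log:
                try:
                    self.value += json.loads(line)["delta"]
                except (ValueError, KeyError, TypeError):
                    break  # torn final write left by a crash; ignore it

    def add(self, delta):
        """Commit the change to disk before applying it in memory."""
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"delta": delta}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.value += delta


log_path = os.path.join(tempfile.mkdtemp(), "counter.log")
counter = CrashOnlyCounter(log_path)
counter.add(5)
counter.add(7)
del counter  # simulate a crash: no shutdown routine runs
recovered = CrashOnlyCounter(log_path)
print(recovered.value)
```

Because every restart exercises the recovery path, that path is tested constantly rather than only after rare failures.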
4
Example: anomaly detection meets crash-only design
• Use simple time series analysis on key operating statistics
(committed writes, offered load, etc.)
• Count relative frequencies of all substrings of length k or
shorter, look for discrepancies in relative frequencies across
replicas
• Works even when period is irregular or not known a priori
• If you see anything unusual, coerce to a crash and recover from
that; reboot is nearly free, so occasional false positives OK
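The substring-frequency comparison above might look like the following sketch. It assumes each replica's operating statistics have already been discretized into a string of symbols; the function names, the median-based comparison, and the threshold of 0.5 are hypothetical choices, not details given in the slides.

```python
from collections import Counter
from statistics import median


def substring_frequencies(symbols, k):
    """Relative frequency of every substring of length 1..k in a symbol string."""
    counts = Counter()
    for length in range(1, k + 1):
        for start in range(len(symbols) - length + 1):
            counts[symbols[start:start + length]] += 1
    total = sum(counts.values())
    return {sub: n / total for sub, n in counts.items()}


def discrepancy(freq_a, freq_b):
    """Sum of absolute differences in relative frequency (0 means identical)."""
    keys = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(key, 0.0) - freq_b.get(key, 0.0)) for key in keys)


def find_outliers(replica_traces, k=3, threshold=0.5):
    """Indices of replicas whose median discrepancy from the others exceeds threshold."""
    freqs = [substring_frequencies(trace, k) for trace in replica_traces]
    outliers = []
    for i, freq in enumerate(freqs):
        others = [discrepancy(freq, other) for j, other in enumerate(freqs) if j != i]
        if median(others) > threshold:
            outliers.append(i)
    return outliers


# Three healthy replicas with the same pattern at different phases, one stuck replica.
traces = ["abcabcabcabc", "bcabcabcabca", "cabcabcabcab", "aaaaaaaaaaaa"]
outliers = find_outliers(traces)
print(outliers)  # -> [3]
```

The phase-shifted healthy traces differ only slightly in their substring frequencies, which is why the comparison does not need the period to be known in advance. In a crash-only system, the flagged replica would simply be rebooted.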
5
Security Challenges for RADS
• Need new techniques to detect and respond to
rapidly-evolving attacks
• But these techniques can themselves be used
to mount attacks
– So we must secure the learning process
• Rapid secure protocol synthesis tools can be
applied to this problem
6
Approach for Success:
Interdisciplinary Expertise
• Interdisciplinary Team
– Armando Fox/Dave Patterson: Dependable System Design
– Randy Katz/Ion Stoica: Network Services/Protocols
– Michael Jordan: Statistical Learning Theory
– Ion Stoica/Doug Tygar: Verification of networks and security
– George Necula: Language/Applications-level mechanisms
• Spans algorithm design and system implementations
– Comprehensive distributed architecture embedding SLT as a
primitive building block
– Embedding observational and inference means at strategic points
throughout the distributed system
– New kinds of statistical inference and verification techniques
able to execute on-line and in real-time
7
RADS Conceptual Architecture
[Figure: architecture diagram. Research components and leads:]
• Programming Abstractions for Roll-back (Necula)
• Crash-Only Middleware & Servers, System O&C Infrastructure (Fox)
• Protocols Enabling Fast Detection & Route Recovery, Network O&C Infrastructure (Katz, Stoica)
• Benchmarks, Tools for Human Operators (Patterson)
• Online Statistical Learning Algorithms (Jordan)
• Prototype Application: User Messaging, E-Mail for Operational Systems
[Diagram elements: Client, Operator, Server, Distributed Middleware, SLT Services, PNE, Edge Network, Application-Specific Overlay Network, Routers, Commodity Internet & IP networks]
Reduction to practice of on-line SLT and observe/analyze/act infrastructure
Reusable embeddable components
Pervasive security considerations (Tygar)
8
Vulnerable Messaging Application
that Requires Trustworthiness
[Figure: network diagram of the messaging scenario. Labels: DHS/Federal Network; Coalition Internet; Trust Relations; Allies Networks; Local Police, Fire, State Police; Active Adversary; Adversary; Service Attacks; Net Failure; Compromised Network With Embedded Adversaries. Data carried: Incident Reports, Responder Locations, GIS Data, etc.]
Exploit DETER Testbed for Prototyping
9
Scientific Foundation For
“Self-*” Systems
• New design principles and tools for systems
that continuously adjust their behavior in
response to analysis of online observations
• New metrics and benchmarks for evaluating
self-adapting networked systems
• Advances in Statistical Learning Theory to
move from offline to online analysis of large-scale distributed systems
10
Measuring Success
• Build messaging prototype using RADS design
principles and tools
• Put realistic performance workload on prototype,
embed in DHS DETER testbed
• Subject prototype to increasingly aggressive failure
and attack workloads
– E.g., hardware failures, software failures, operator failures, worm
attacks, DDoS attacks, …
• Measure false positive rates, accuracy rates, time to
analyze failures, time to act, performance impact of
actions, availability of prototype, performability of
prototype, …
• Compare results with conventional systems under
similar performance, failure, and attack workloads
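Some of the measurements listed above reduce to simple ratios. The sketch below uses the conventional definitions of false positive rate, accuracy, and availability; the slides do not give formulas, so these definitions and the sample counts are assumptions for illustration only.

```python
def false_positive_rate(false_pos, true_neg):
    """Fraction of normal events that were wrongly flagged as failures or attacks."""
    return false_pos / (false_pos + true_neg)


def accuracy(true_pos, true_neg, false_pos, false_neg):
    """Fraction of all events that the detector classified correctly."""
    correct = true_pos + true_neg
    return correct / (correct + false_pos + false_neg)


def availability(uptime_hours, downtime_hours):
    """Fraction of total time the prototype was able to serve requests."""
    return uptime_hours / (uptime_hours + downtime_hours)


# Made-up counts from a hypothetical fault-injection run of 1000 events.
fpr = false_positive_rate(false_pos=10, true_neg=940)
acc = accuracy(true_pos=45, true_neg=940, false_pos=10, false_neg=5)
avail = availability(uptime_hours=719, downtime_hours=1)
print(f"FPR={fpr:.4f} accuracy={acc:.3f} availability={avail:.5f}")
```

Running the same computation on a conventional system under the same workloads gives the baseline for comparison.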
11
New Funding Opportunity:
NSF CyberTrust Program
From RFP:
• People rely on systems based on networked computers
– Too vulnerable to cyber attacks: inhibit function, corrupt data, or
expose private information
• Promote vision where networked systems are:
– More predictable, more accountable, and less vulnerable to attack
and abuse;
– Developed, configured, operated and evaluated by a well-trained
and diverse workforce;
– Used by a public educated in their secure and ethical operation
• Example research area: improve trustworthiness of
networks; explore evolving nature of security protocols
and policies in communications networks
• Individual, Team projects and 1-2 Centers
12
CATS: Center for Adaptive
Trustworthy Systems
Dramatically improve the trustworthiness of networked systems
• New understanding of how to construct such systems
– Observe-Analyze-Act
– From responding to known problems to learning new problems
– From reacting to problems to proactively responding before problems
become significant
– Experimental method of benchmarking, prototyping, and deployment to
provide context
• Technical Thrusts
– Statistical Learning Theory
– Crash-Only Software
– Behaviorally-Consistent and Secure Protocols
– Programmable Network Elements
• Integration Vehicle
– Application: Disaster Response Messaging
– Supported by prototype distributed system architecture
– Deployment and Evaluation Plan
13
We need your help and support!
Discussion?
14