SSTiC lecture 5 - Disaster resilient networks and


Disaster Resilient
Networks & Transport Services
Jane W. S. Liu
Institute of Information Science
Academia Sinica, Taiwan
http://openisdm.iis.sinica.edu.tw
International Summer School on Trends in Computing, Tarragona Spain, July 2013
Topic Outline
 Overview on ICT for disaster management
 Introduction: definitions, scenarios and scope
 State of the art in disaster detection and prediction
 State-of-the-art information and communication
infrastructures and remaining technology gaps
 Selected topics on critical real-time computing and
information systems (CRICIS) for disasters, including
• Exploiting linked data and semantic web technologies
• Information access control and privacy protection
• Ubiquitous smart devices and applications for disaster
preparedness and early response
 Crowdsourcing human sensor data by disaster
surveillance and early warning systems
 Disaster resilient networks and transport services
Outline
 Resilience versus disaster resilience
 Requirements for disaster resiliency
 Examples of recent efforts
 OIGY: Open Information Gateway
• Connectivity assessment
• Physical layer defense and recovery
• Transport layer recovery
“Resilience is the ability to provide and
maintain an acceptable level of service
in the face of faults and challenges to
normal operations” – from Wikipedia
For disaster resilience, what are
 Challenges to normal operations, &
 Acceptable levels of service?
2011 Japan Earthquake, Tsunami, and Nuclear Power Plant Accident
2010 Haiti and 2011 Chilean Earthquakes
2009 Typhoon Morakot, Heavy Rain, and Landslide in Taiwan
Killer Tornados in USA, 2012 and 2013
Types of Information
 For EOCs (Emergency Operation Centers): real-time
data and information to support command & control
decisions and operations and resource allocations
 Within each affected area: information on
• demands for and availability of services/resources
• conditions of affected people
• restoration status of transportation, power,
communication, hospitals in the area
 Across multiple affected areas: asynchronous and
synchronous interactions among family members,
friends, communities and so on.
 Outside of affected areas: information on affected
infrastructures, casualties, and resource needs.
Timing criticality

Partial List of Requirements
 Can meet relative deadlines of alert messages:
• Earthquakes: a fraction of a second to a few seconds
• Tsunamis, tornados, landslides, etc.: a few minutes
• Typhoons and seasonal floods: hours to days
• Cries for help: minutes to tens of minutes
 Can accommodate diverse information sources, e.g.,
• Authorized senders of information and alert messages
• News media, social networking services, panic crowds, etc.
 Use open and sustainable solutions: can
• Support diverse end-user devices and equipment
• Exploit redundancy, diversity and heterogeneity
• Adapt with needs, technological advances, lessons learned
 Can defend, detect, remediate and recover
Resilience Disciplines
https://wiki.ittc.ku.edu/resilinets/Related_Work
Trustworthiness
 Dependability:
 Maintainability
 Availability
 Performability:
QoS measures
Challenge tolerance
 Survivability: many to numerous
failures, to global failure
 Disruption tolerance
 Environmental: connectivity,
mobility, and delay
 Energy
 Traffic tolerance: flash crowd
Disaster Resilience
From J. P. G. Sterbenz et al., “Resilience and survivability in communication networks: strategies,
principles and survey of disciplines,” Computer Networks, 54, 2010
Topics on Disaster Resilient Networks and Transport Services
• Multi-failure survivability
• Random failures
• Correlated (disaster) failures
• Modeling disaster failures
• Combating disaster failures
• Connectivity assessment
• Transport service recovery
• Full restoration
• Dynamic data delivery
• Physical connectivity recovery
• Backup connectivity
• Connectivity repairs
• Degraded service provision
Common Sense Strategies, Success Stories and Future Visions
Common sense strategies: exploiting redundancy, diversity, and heterogeneity
Examples of means to provide physical connectivity: satellite phones, mobile base stations, Wi-Fi
All should be parts of solutions for disaster resilience
[Figure: each means is rated on five axes: bandwidth, energy, portability, weather, and coverage/infrastructure]
OIGY (Open Information Gateway)
Connectivity Repair
 Mobile cellular base stations provide coverage for large
areas temporarily.
 Last-mile mesh networks & gateway routers extend the
coverage area and relay data through mobile cellular base
stations to and from the Internet
[Figure labels: Internet; mobile base stations; gateway router; last-mile mesh]
Restoring last-mile connections with wireless meshes
Wi-Fi networks
From “Towards Throughput Optimization of Wireless Mesh Networks in Disaster Areas,” by W. Liu, et al., Proceedings of WPMC, 2012
Using vehicular network for temporary connectivity restoration
[Figure labels: onboard wireless gateway; multi-hop vehicular network; connection to WLAN]
From “Multi-hop Communication and Multi-protocol Gateway by Using Plural ITS,” by T. Ohyama, et al., Proc. of WPMC, 2012
Using WLAN to Repair Optical BS Backhaul
From “Rapid Recovery of Base Station Backhaul … “ by Y. Kishi, et al., Proceedings of WPMC, 2012
Multi-Layering for Resilient Communication
From “Secured Information Service Platforms Effective in Case of Disasters,”
by F. Adachi, et al., Proceedings of WPMC, 2012
Multi-layered network: cellular networks,
Internet, mobile LANs and satellite network
From “R&D Project of Multilayered Communications Network -For disaster-resilient communications,”
by F. Adachi, et al., Proc. of WPMC, 2012
Exploiting Cooperation Among Access Points
From “Multi-AP Cooperative Diversity for Disaster-resilient Wireless LAN,” by
F. Adachi and S. Kumagai, in Proc. of WPMC, 2012
TRIPS: Trustful Real-time Information Pub/sub Services
OIGY: Open Information Gateway
HAPPY: Heterogeneous And Plug-n-PlaY networks
[Figure labels: Connectivity Assessment; Network Recovery (HAPPY); Service Recovery (TRIPS); Real-Time Pub/Sub Services; delayed or failed delivery; access points required; backup nodes available; service brokers required; service recovered]
Connectivity Assessment Service
(Use Scenario and Operations)
1. Disaster Response Authorities (DRA) issue RequestAssessment(areas_list, …)
2. Active Network Probe (ANP) passes the possibly disconnected areas to Reactive Footprint Search (RFS)
3. The disconnected area list is returned
L. J. Chen, et al., “A rapid method for detecting geographically disconnected areas after disasters,”
Proc. of IEEE International Conf. on Technologies for Homeland Security, Nov. 2011
Active Network Probe
 ANP pre-installed in preparedness phase
 Two approaches and their shortcomings
IP Geolocation Service (IP2Location and Quova)
 Not sufficiently accurate
 One way mapping
 Too costly
Network Topology Discovery
 For core network only
 ICMP based
 Cannot work with dynamic and distributed DNS
Reactive (Internet) Footprint Search
 People in high-risk and affected areas are likely to
use all possible means to send messages.
 RFS leverages their footprints on location-based
social networks, e.g., Facebook Places, Google
Latitude, Foursquare, Gowalla, Twitter, Flickr, ...
Connectivity Assessment Service
 ANP proactively queries landmarks, which are computers
and devices at known locations maintained at all times
 RFS looks for footprints in potentially disconnected areas
reported by ANP.
[Figure: Disaster Response Authorities send a request for connectivity report; ANP passes a list of polygons around disconnected landmarks to RFS; polygons of disconnected areas are returned to the Disaster Response Authorities]
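The two-stage assessment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Landmark` record, the `is_reachable` probe callback, and the set of areas with recent footprints are all hypothetical stand-ins, and areas are plain identifiers rather than polygons.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Set


@dataclass(frozen=True)
class Landmark:
    """A computer or device at a known location (hypothetical record layout)."""
    host: str
    area: str  # identifier of the area the landmark sits in


def active_network_probe(landmarks: Iterable[Landmark],
                         is_reachable: Callable[[str], bool]) -> List[str]:
    """ANP stage: return areas in which no landmark answered the probe."""
    answered: Dict[str, bool] = {}
    for lm in landmarks:
        # An area counts as connected as soon as one of its landmarks answers.
        answered[lm.area] = answered.get(lm.area, False) or is_reachable(lm.host)
    return sorted(area for area, ok in answered.items() if not ok)


def reactive_footprint_search(suspect_areas: Iterable[str],
                              areas_with_footprints: Set[str]) -> List[str]:
    """RFS stage: drop suspect areas where recent location-based posts exist."""
    return [a for a in suspect_areas if a not in areas_with_footprints]


def request_assessment(landmarks: Iterable[Landmark],
                       is_reachable: Callable[[str], bool],
                       areas_with_footprints: Set[str]) -> List[str]:
    """Chain ANP and RFS and return the list of disconnected areas."""
    suspects = active_network_probe(landmarks, is_reachable)
    return reactive_footprint_search(suspects, areas_with_footprints)
```

In a real deployment `is_reachable` would wrap an actual probe and the footprint set would come from queries to location-based social networks; both are left abstract here.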
Performance of Landmark-Based ANP
 Accuracy of landmark-based ANP is compared with that of
two off-the-shelf IP geolocation services
 Network landmarks are K-12 schools in greater Kaohsiung
[Figure panels: Quova; IP2Location]
Proof-of-concept RFS
http://nrl.iis.sinica.edu.tw/RFS/
Where, How, How
Physical Connectivity Recovery
A Means to Repair Cellular Coverage
By adjusting power of existing base stations
[Figure: B3 failed]
From slides presented by Po-Kai Tseng and Wei-Ho Chung of CITI,
Academia Sinica on 2013/7/5
Recovery procedure when B1 failed
[Flowchart labels: search available BSs; miss comm. users; interference and demand; channel and power allocation; served by the BS with min distance]
From slides presented by Po-Kai Tseng and Wei-Ho Chung of CITI, Academia Sinica on 2013/7/5
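As a rough illustration of the "served by the BS with min distance" step only, the sketch below reassigns the users of a failed base station to the nearest surviving one. It is an assumption-laden simplification: coordinates are planar, and the interference, demand, channel, and power allocation steps of the procedure are not modelled. The function and parameter names are hypothetical.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def reassign_users(users: Dict[str, Point],
                   stations: Dict[str, Point],
                   serving: Dict[str, str],
                   failed: str) -> Dict[str, str]:
    """Return a new user -> base station map after `failed` goes down.

    Users of the failed station move to the closest surviving station;
    all other users keep their current station.
    """
    alive = {name: pos for name, pos in stations.items() if name != failed}
    if not alive:
        raise ValueError("no surviving base station")
    new_serving = dict(serving)
    for user, bs in serving.items():
        if bs == failed:
            new_serving[user] = min(
                alive, key=lambda name: math.dist(users[user], alive[name]))
    return new_serving
```

A fuller model would check that the chosen station can actually reach the user after its power is adjusted; that check is omitted here.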
Recovery from Failure of B1 without BSC
[Flowchart labels: search available BSs; respond with received signal strength; respond with a random number; interference and demand; channel and power allocation; determine BS by the best signal or largest number]
Mobile Ad-Hoc Network
 Requires no infrastructure
 Can provide coverage during disasters
Network partitioning makes it a less than ideal solution
Based on slides for a presentation on “Autonomous mobile mesh network,” by W. L. Shen, et al., of CITI,
Academia Sinica. The original slides can be found at http://openisdm.iis.sinica.edu.tw
Autonomous Mobile Mesh Networks
A more robust alternative
[Figure labels: bridge router; inter-group router; intra-group router; client]
From “Autonomous Mobile Mesh Networks,” by Shen, W.-L., C.-S. Chen, K. C.-J. Lin and K. A. Hua, to
appear in IEEE Transactions on Mobile Computing.
On Use of Wireless Mobile Networks
for Temporary Connectivity Repair
Mobile ad-hoc networks
 Minimal resource/effort to put in place
 Unpredictable partitioning/connectivity
Autonomous wireless mesh networks
 Can ensure good connectivity
 Resource limitation: require well-placed
routers and gateways
Others, e.g., ITS networks
 Special resources required
 Network partitioning less problematic
What, How, How
Message Delivery Service Recovery
OIGY
[Figure labels: review data after disaster; store sensor data in persistent cloud; receive/report alerts; publish/subscribe disaster data; TRIPS API; TRIPS Broker; TRIPS communication infrastructure; Heterogeneous And Plug-n-Play Network]
A Simpler View of OIGY
[Figure labels: TRIPS service network; App; Messaging service; HAPPY; WAN; LAN; WLAN; PAN]
A Use Scenario
Broadcast pathways
<?xml version = "1.0"?>
<alert xmlns = …
…
<event>Earthquake</event>
<urgency>Immediate</urgency>
<severity>Strong</severity>
<certainty>Observed</certainty>
…
<parameter>
<valueName>Magnitude</valueName>
<value>8.1</value>
</parameter>
…
<area>
<circle>32.9525,-115.5585 0</circle>
</area>
…
[Figure labels: message processor (alert extraction); action activation rule evaluation; device interface; authorized alert sender; PuSH; IP network; iGaD; elevator controller]
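The alert extraction and rule evaluation steps on a receiving device might look like the following sketch. Several things here are assumptions rather than part of the slides: the CAP 1.2 namespace URI, the `<info>` wrapper element around the fields (elided in the fragment above), the function names, and the threshold in the elevator rule.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the slide elides the xmlns value; this is the CAP 1.2 URI.
CAP_NS = {"cap": "urn:oasis:names:tc:emergency:cap:1.2"}

SAMPLE_ALERT = """<?xml version="1.0"?>
<alert xmlns="urn:oasis:names:tc:emergency:cap:1.2">
  <info>
    <event>Earthquake</event>
    <urgency>Immediate</urgency>
    <severity>Strong</severity>
    <certainty>Observed</certainty>
    <parameter>
      <valueName>Magnitude</valueName>
      <value>8.1</value>
    </parameter>
  </info>
</alert>"""


def extract_alert(xml_text: str) -> dict:
    """Message processor: pull the fields a device rule needs out of an alert."""
    root = ET.fromstring(xml_text)
    info = root.find("cap:info", CAP_NS)
    fields = {
        tag: info.findtext(f"cap:{tag}", default="", namespaces=CAP_NS).strip()
        for tag in ("event", "urgency", "severity", "certainty")
    }
    fields["parameters"] = {
        p.findtext("cap:valueName", default="", namespaces=CAP_NS).strip():
        p.findtext("cap:value", default="", namespaces=CAP_NS).strip()
        for p in info.findall("cap:parameter", CAP_NS)
    }
    return fields


def should_stop_elevator(alert: dict, min_magnitude: float = 6.0) -> bool:
    """Hypothetical activation rule for an elevator controller."""
    magnitude = float(alert["parameters"].get("Magnitude", "0"))
    return (alert["event"] == "Earthquake"
            and alert["urgency"] == "Immediate"
            and magnitude >= min_magnitude)
```

For the sample alert, `should_stop_elevator(extract_alert(SAMPLE_ALERT))` evaluates the rule against a magnitude of 8.1; the device interface that would actually signal the controller is out of scope.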
Publish/Subscribe Messaging Over
Heterogeneous Networks
Pub/Sub
Overlay Network
IP Network
Related Work
 RSS and Atom: web feed formats used to publish frequently
updated data and information
 PubSubHubbub: a simple, open, server-to-server web-hook-based
publish/subscribe protocol that extends the Atom and
RSS protocols for data feeds. Parties speaking the protocol
can get near-instant notifications (via webhook callbacks)
when a topic (resource URL) they subscribe to is updated
 AMQP: Advanced Message Queuing Protocol for delivering
time-sensitive messages to a large number of clients.
 Qpid: an open source implementation of AMQP, which
• allows content-based message subscription, so that clients
receive only the information they are interested in
• assumes a static and reliable network environment
 Data Distribution Service: a publish/subscribe protocol
aiming to enable scalable, real-time, dependable, high
performance and interoperable data exchanges
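To make the idea of content-based subscription concrete, here is a toy in-memory broker. It is not the Qpid, AMQP, or PubSubHubbub API; `TinyBroker` and its methods are hypothetical names used only to show how a subscriber-supplied predicate filters messages by content.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]
Predicate = Callable[[Message], bool]
Handler = Callable[[Message], None]


class TinyBroker:
    """Toy publish/subscribe broker with per-subscription content filters."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Tuple[Predicate, Handler]]] = {}

    def subscribe(self, topic: str, handler: Handler,
                  predicate: Predicate = lambda message: True) -> None:
        """Register a handler; it only sees messages the predicate accepts."""
        self._subs.setdefault(topic, []).append((predicate, handler))

    def publish(self, topic: str, message: Message) -> int:
        """Deliver a message to matching subscribers; return the delivery count."""
        delivered = 0
        for predicate, handler in self._subs.get(topic, []):
            if predicate(message):
                handler(message)
                delivered += 1
        return delivered
```

A subscriber interested only in strong events would pass something like `lambda m: m["severity"] == "Strong"` as the predicate. Delivery here is synchronous and in-process, so none of the network reliability concerns discussed in this lecture arise.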
Architecture of Service Network
[Figure labels: Client; Qpid Broker; PuSH]
Structures of Qpid Brokers
[Figure labels: clients; hub; PuSH; data update monitor; data transfer service; data retrieval service; incoming data monitor; data bridge; data; exchange bindings; priority queues (decreasing priority); broker]
Assumption and Objectives
 Messaging services are deployed on wide area
IP networks.
 The goal is to design a run time mechanism to
recover failed services
 The recovery procedure should be as automatic
as possible and take a short time:
• When only parts of the network fail, recover
the services before technicians arrive
• The recovery procedure should require
a minimal amount of technician time.
Landmark-Based (Centralized) Service Recovery
 The network is divided into disjoint subnets
 Each subnet has a unique recovery service called
landmark, which monitors the brokers in the subnet
 Landmark replaces a failed broker by a backup broker
[Figure labels: Qpid Broker; Client; Landmark]
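A minimal sketch of one landmark's monitor-and-replace loop, assuming a hypothetical `is_alive` health check and a pre-provisioned list of backup brokers. How the backup is configured and how clients are redirected is not shown; the class below is illustrative and is not the implementation evaluated in the lecture.

```python
from typing import Callable, Dict, List, Optional


class Landmark:
    """Monitors the brokers of one subnet and swaps in backups (sketch)."""

    def __init__(self, brokers: List[str], backups: List[str],
                 is_alive: Callable[[str], bool]) -> None:
        self.brokers = list(brokers)
        self.backups = list(backups)
        self.is_alive = is_alive  # hypothetical health check, e.g. a heartbeat

    def check_once(self) -> Dict[str, Optional[str]]:
        """Poll every broker once; return {failed broker: replacement or None}."""
        replaced: Dict[str, Optional[str]] = {}
        for i, broker in enumerate(self.brokers):
            if self.is_alive(broker):
                continue
            backup = self._next_live_backup()
            replaced[broker] = backup
            if backup is not None:
                self.brokers[i] = backup  # the backup takes over the slot
        return replaced

    def _next_live_backup(self) -> Optional[str]:
        """Take backups in order, skipping any that are themselves down."""
        while self.backups:
            candidate = self.backups.pop(0)
            if self.is_alive(candidate):
                return candidate
        return None
```

Because a single landmark makes every decision for its subnet, no agreement protocol is needed, but the landmark itself is a single point of failure.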
Evaluation
 Up to 40 broker nodes are deployed in the network,
which can serve upwards of tens of thousands of client
devices and applications
 One landmark node monitors the brokers and recovers
failed services, if any
Initialization Overhead
The landmark node has to
 learn the network topology and service map when
the system starts.
 update its configuration data when a node (broker)
fails and a backup is set up to replace it.
[Figure: data rate versus time in seconds]
Monitor and Recovery Overhead
 Network traffic during regular operation increases
as the number of broker nodes increases.
 When there is no concurrent failure, recovery
overhead is not sensitive to the number of brokers
Recovery Time Versus
Number of Concurrent Failures
Distributed Service Recovery
 The service network should tolerate a total of N or fewer
concurrent failures of broker and recovery services
 When a broker service fails, the recovery services reach
consensus on selecting a backup service. This is done
according to the Paxos algorithm
 Participants of the Paxos algorithm play the roles of
 Proposers: They generate proposals to recover; each
proposal includes the address of a proposed backup
service and a score representing its evaluation
 Acceptors: They vote for proposals and are known to
the proposers
 Each broker
 Is monitored by N recovery services acting as proposers
 Needs 2N-1 recovery services serving as acceptors
Handshakes According to Paxos
Proposer
Phase 1
 Send Prepare_Request
 Wait for ACK from N (i.e., more than half) acceptors
Phase 2
 Send Accept_Request
 Wait for N or more ACK
 Commit
• Do service recovery
• Send Commitment to acceptors
Acceptor
Promised_Proposal := none
Phase 1
 Receive Prepare_Request
 If Prepare_Request is better than Promised_Proposal,
• Send ACK
• Replace Promised_Proposal by Received_Proposal
Phase 2
 Receive Accept_Request
 If Accept_Request is better than Promised_Proposal, then
• Send ACK
• Wait for Commitment
Properties of Paxos Algorithm
 Safety: One proposal and only one proposal is
selected by consensus
 Liveness: As long as one proposer and more than half
of the acceptors remain alive, the algorithm
can succeed in selecting a proposal.
 To tolerate N failures (of brokers and recovery
services), at least N proposers and 2N-1
acceptors are required.
See http://en.wikipedia.org/wiki/Paxos_(computer_science) for details
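The two-phase handshake can be tried out with a small synchronous simulation. The sketch follows textbook single-decree Paxos, in which each proposal carries a ballot and a proposer adopts any value already accepted. Reading the slides' "better than" comparison as a comparison of ballots, and using a `(round, proposer_id)` tuple as the ballot, are assumptions. Message loss, timeouts, and the commit notification are not modelled. With 2N-1 acceptors, a quorum of N is always more than half.

```python
from typing import List, Optional, Tuple

Ballot = Tuple[int, int]          # (round, proposer_id); an assumed encoding
Accepted = Tuple[Ballot, str]     # (ballot, backup address)


class Acceptor:
    def __init__(self) -> None:
        self.promised: Optional[Ballot] = None    # Promised_Proposal := none
        self.accepted: Optional[Accepted] = None  # last accepted proposal

    def on_prepare(self, ballot: Ballot) -> Tuple[bool, Optional[Accepted]]:
        """Phase 1: promise if the ballot beats anything promised so far."""
        if self.promised is None or ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def on_accept(self, ballot: Ballot, value: str) -> bool:
        """Phase 2: accept unless a higher ballot has been promised meanwhile."""
        if self.promised is None or ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def propose(ballot: Ballot, value: str,
            acceptors: List[Acceptor], quorum: int) -> Optional[str]:
    """Run both phases; return the chosen backup, or None if the round failed."""
    replies = [a.on_prepare(ballot) for a in acceptors]
    promises = [accepted for ok, accepted in replies if ok]
    if len(promises) < quorum:
        return None
    prior = [accepted for accepted in promises if accepted is not None]
    if prior:
        # Safety: adopt the value of the highest-ballot accepted proposal.
        value = max(prior, key=lambda item: item[0])[1]
    acks = sum(a.on_accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None
```

For N = 2 there are three acceptors and a quorum of two. A first proposer that selects a backup fixes the choice: a later proposer with a higher ballot learns the accepted value in phase 1 and re-proposes it instead of its own candidate, which is what keeps the selected proposal unique.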
Reconfigurable Paxos
 The Paxos protocol assumes that during the selection of a
backup, the set of participants is static: the sets of acceptors
known to all proposers are the same.
 Reconfigurable Paxos removes this assumption, which may
not be valid when participants are likely to fail.
 To change the acceptor set, a monitored service
 Sends to acceptors in the current set an acceptor_change
proposal containing acceptors in the new set
 On each acceptor set change, the monitored service
assigns an increasing version number to the new set.
 Every Paxos message carries the version number of
the latest version of the acceptor set known to the sender
 If the known versions of the sender and receiver of any
message are not equal, the participant with the newer version
informs the other participant.
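The version check attached to every message can be illustrated as follows. The `Participant` record and the `exchange` function are hypothetical, and the choice to signal a retry after a mismatch is an assumption; the slides only state that the participant with the newer version informs the other.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class Participant:
    name: str
    version: int                  # version of the acceptor set known locally
    acceptors: FrozenSet[str]     # members of that acceptor set


def exchange(sender: Participant, receiver: Participant) -> bool:
    """Simulate the version check on one message.

    Returns True if both sides already agreed on the acceptor set version.
    Otherwise the side with the newer version informs the other, and False
    is returned so the caller can resend the message (an assumption).
    """
    if sender.version == receiver.version:
        return True
    if sender.version > receiver.version:
        newer, older = sender, receiver
    else:
        newer, older = receiver, sender
    older.version = newer.version
    older.acceptors = newer.acceptors
    return False
```

Because version numbers only increase, a participant never replaces its acceptor set with a staler one, whichever side sends first.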
Performance Evaluation
 Experiment environment
• Network simulator: EstiNet
• OS: Fedora 14 (on virtual machine)
• CPU: Xeon E5620 (1 core)
• RAM: 2GB
 Objective: To determine the time required to recover
from concurrent failures
• Using centralized & distributed recovery schemes
• Varying the total number of hosts, number of
failed hosts, and background traffic
 Method: collect data by logging timestamps and
traffic on simulated network interface
Control Message Rate/Size
 When a recovery service monitors a broker, it
retrieves QMF messages periodically.
 The data rate is about 2.4 KB/sec per broker
Number of Brokers   Data Rate (KB/sec)   Standard Deviation
10                  23.97                1.67
20                  42.68                2.97
30                  68.92                3.27
40                  89.49                6.68
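The per-broker figure can be checked directly against the measurements. The sketch below divides each measured rate by the broker count and also fits a least-squares slope. The data points are the ones reported on this slide; the helper name is hypothetical.

```python
# (number of brokers, measured data rate in KB/sec), as reported on the slide
rows = [(10, 23.97), (20, 42.68), (30, 68.92), (40, 89.49)]


def least_squares_slope(points):
    """Slope of the best-fit line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


per_broker = [rate / brokers for brokers, rate in rows]
slope = least_squares_slope(rows)

for (brokers, _), value in zip(rows, per_broker):
    print(f"{brokers} brokers: {value:.2f} KB/sec per broker")
print(f"least-squares slope: {slope:.2f} KB/sec per additional broker")
```

The ratios fall between roughly 2.1 and 2.4 KB/sec per broker, the same order as the approximate figure quoted above, and the rate grows close to linearly with the number of brokers.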
Control Message Rate/Size
 The data transmitted during recovery is related
to number of objects on the broker.
 It takes about 10 KB to create a queue and
12 KB to create a route
# of Queues   Data size (KB)   Std Dev.   # of Routes   Data size (KB)   Std Dev.
0             496.38           4.51       0             567.03           2.23
10            595.80           5.16       5             627.87           3.99
20            704.23           5.48       10            690.02           5.30
30            800.05           6.09       15            751.34           6.19
40            907.87           5.80       20            810.38           6.91
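The approximate per-object costs follow from the measurements by dividing the growth in transmitted data by the number of objects added. A short check, using only the values reported on this slide (the helper name is hypothetical):

```python
# (number of objects, data transmitted during recovery in KB), from the slide
queues = [(0, 496.38), (10, 595.80), (20, 704.23), (30, 800.05), (40, 907.87)]
routes = [(0, 567.03), (5, 627.87), (10, 690.02), (15, 751.34), (20, 810.38)]


def average_increment(points):
    """Average extra KB per object between the first and last measurement."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return (y1 - y0) / (x1 - x0)


queue_cost = average_increment(queues)
route_cost = average_increment(routes)
print(f"about {queue_cost:.1f} KB per queue, {route_cost:.1f} KB per route")
```

This gives roughly 10.3 KB per queue and 12.2 KB per route, on top of a fixed cost of about 500 KB that is transmitted even when the broker holds no objects.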
Topology
Re-initialization Latency of Distributed Recovery
 When one broker fails, one recovery service conducts
the recovery procedure
 When one recovery service fails, K brokers find new
proposers/acceptors
[Figure: number of brokers = 25; number of recovery services = 25; curves for K = 5, K = 7, K = 9]
Comparison between
Centralized and Distributed Recovery
[Figure panels: 50 hosts, bandwidth = 10 Mbps; 24 hosts, bandwidth = 10 Mbps]
Effect of Limited Bandwidth
 24 hosts
 Bandwidth = 1Mbps, 80% consumed by
background traffic
Future Plans
 The source code of landmark-based recovery will be
submitted to the Qpid open source community.
 Field trials:
• Collaborate with National Science and Technology
Center for Disaster Reduction to deploy the system for
earthquake alerts in Taipei area.
• Collaborate with domestic telecommunication service
providers to deliver messages to phone users.
• Continue to experiment with pushing messages in CAP
format to iGaDs, smart device applications for disaster
preparedness purposes
Thank
You!