SSTiC Lecture 5 Transcript: Disaster Resilient Networks and Transport Services
Disaster Resilient
Networks & Transport Services
Jane W. S. Liu
Institute of Information Science
Academia Sinica, Taiwan
http://openisdm.iis.sinica.edu.tw
International Summer School on Trends in Computing, Tarragona Spain, July 2013
Topic Outline
Overview of ICT for disaster management
Introduction: definitions, scenarios and scope
State of the art in disaster detection and prediction
State-of-the-art information and communication
infrastructures and remaining technology gaps
Selected topics on critical real-time computing and
information systems (CRICIS) for disasters, including
• Exploiting linked data and semantic web technologies
• Information access control and privacy protection
• Ubiquitous smart devices and applications for disaster
preparedness and early response
Crowdsourcing human sensor data by disaster
surveillance and early warning systems
Disaster resilient networks and transport services
Outline
Resilience versus disaster resilience
Requirements for disaster resiliency
Examples of recent efforts
OIGY: Open Information Gateway
Connectivity assessment
Physical layer defense and recovery
Transport layer recovery
“Resilience is the ability to provide and
maintain an acceptable level of service
in the face of faults and challenges to
normal operations” – from Wikipedia
For disaster resilience, what are the
challenges to normal operations, and what are
acceptable levels of service?
2011 Japan
Earthquake,
Tsunami, and
Nuclear Power
Plant Accident
2010 Haiti and 2011 Chilean Earthquakes
2009 Typhoon
Morakot,
Heavy Rain,
and Landslides
in Taiwan
Killer
Tornadoes
in the USA,
2012 and
2013
Types of Information
For EOCs (Emergency Operations Centers): real-time
data and information to support command & control
decisions and operations and resource allocations
Within each affected area: information on
demands for and availability of services/resources
conditions of affected people
restoration status of transportation, power,
communication, hospitals in the area
Across multiple affected areas: asynchronous and
synchronous interactions among family members,
friends, communities and so on.
Outside of affected areas: information on affected
infrastructures, casualties, and resource needs.
Timing criticality
Partial List of Requirements
Can meet relative deadlines of alert messages:
Earthquakes: A fraction of a second to a few seconds
Tsunami, tornadoes, landslides, etc.: a few minutes
Typhoons and seasonal floods: hours to days
Cries for help: minutes to tens of minutes
Can accommodate diverse information sources, e.g.,
Authorized senders of information and alert messages
News media, social networking services, panic crowds, etc.
Use open and sustainable solutions: can
Support diverse end-user devices and equipment
Exploit redundancy, diversity and heterogeneity
Adapt with needs, technological advances, lessons learned
Can defend, detect, remediate and recover
Resilience Disciplines
https://wiki.ittc.ku.edu/resilinets/Related_Work
Trustworthiness
Dependability:
Maintainability
Availability
Performability:
QoS measures
Challenge tolerance
Survivability: many to numerous
failures, to global failure
Disruption tolerance
Environmental: connectivity,
mobility, and delay
Resilience Disciplines (continued)
Traffic tolerance: flash crowds
Disruption tolerance: energy and disaster challenges
From J. P. G. Sterbenz et al., “Resilience and survivability in communication networks: strategies,
principles and survey of disciplines,” Computer Networks, 54, 2010
Topics on Disaster Resilient Networks and Transport Services
Multi-failure survivability: random failures; correlated (disaster) failures; modeling disaster failures; combating disaster failures
Connectivity assessment
Physical connectivity recovery: backup connectivity; connectivity repairs
Transport service recovery: full restoration; degraded service provision; dynamic data delivery
Common sense strategies, success stories, and future visions
Common Sense Strategies: Exploiting Redundancy, Diversity, and Heterogeneity
All means of providing physical connectivity should be parts of solutions for disaster resilience. Examples, each with limiting factors of coverage/infrastructure, bandwidth, energy, portability, and weather:
Satellite phones
Mobile base stations
Wi-Fi
OIGY (Open Information Gateway)
Connectivity Repair
Mobile cellular base
stations provide
coverage for large
areas temporarily.
Last-mile mesh
networks & gateway
routers extend the
coverage area and
relay data through
mobile cellular base
stations to and from
the Internet.
Internet
Mobile
base
stations
Gateway
router
Last-mile mesh
Restoring last-mile connections with wireless mesh Wi-Fi networks
From “Towards Throughput Optimization of Wireless Mesh Networks in Disaster Areas,” by W.
Liu, et al. Proceedings of WPMC, 2012
Using vehicular network
for temporary
connectivity restoration
Onboard
Wireless
gateway
Multi-hop vehicular
network
Connection to WLAN
From “Multi-hop Communication and Multi-protocol Gateway by Using Plural ITS,” by T. Ohyama,
et al., Proc. of WPMC, 2012
Using WLAN to Repair Optical BS Backhaul
From “Rapid Recovery of Base Station Backhaul … “ by Y. Kishi, et al., Proceedings of WPMC, 2012
Multi-Layering for Resilient Communication
From “Secured Information Service Platforms Effective in Case of Disasters,”
by F. Adachi, et al., Proceedings of WPMC, 2012
Multi-layered network: cellular networks,
Internet, mobile LANs and satellite network
From “R&D Project of Multilayered Communications Network -For disaster-resilient communications,”
By F. Adachi, et al. Proc. of WPMC, 2012
Exploiting Cooperation Among Access Points
From “Multi-AP Cooperative Diversity for Disaster-resilient Wireless LAN,” by
F. Adachi and S. Kumagai, in Proc. of WPMC, 2012
TRIPS:
Trustful Real-time Information Pub/sub
Services
OIGY
Open Information Gateway
HAPPY:
Heterogeneous And Plug-n-PlaY networks
Delayed or
failed delivery
Connectivity
Assessment
Access
points
required
HAPPY
Real-Time
Pub/Sub
Services
Backup nodes
available
Network
Recovery
TRIPS
Service
recovered
Service
Recovery
Service
brokers
required
Connectivity Assessment Service
(Use Scenario and Operations)
Disaster Response Authorities (DRA)
1. DRA sends RequestAssessment(areas_list, …)
2. Active Network Probe (ANP) identifies possibly disconnected areas
3. Reactive Footprint Search (RFS) returns the disconnected area list
L. J. Chen, et al., “A rapid method for detecting geographically disconnected areas after disasters,”
Proc. of IEEE International Conf. on Technologies for Homeland Security, Nov. 2011
Active Network Probe
ANP pre-installed in preparedness phase
Two approaches and their shortcomings
IP geolocation services (IP2Location and Quova): not sufficiently accurate; one-way mapping; too costly
Network topology discovery: for the core network only; ICMP based; cannot work with dynamic and distributed DNS
Reactive (Internet) Footprint Search
People in high-risk and affected areas are likely to
use all possible means to send messages.
RFS leverages their footprints on location-based
social networks, e.g., Facebook Places, Google
Latitude, Foursquare, Gowalla, Twitter, flickr, ...
Connectivity Assessment Service
ANP proactively queries landmarks, which are computers
and devices at known locations maintained at all times
RFS looks for footprints in potentially disconnected areas
reported by ANP.
ANP
RFS
List of polygons
around
disconnected
landmarks
Request for connectivity report
Polygons of disconnected areas
Disaster Response Authorities
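As an illustration (not from the lecture), a minimal Python sketch of the ANP step: probe landmarks at known locations and report the unreachable ones as candidate disconnected areas for RFS to check. The landmark list, hosts, and `probe` implementation are all assumptions for illustration only.

```python
import subprocess

# Hypothetical landmark registry: (landmark_id, host, lat, lon).
# In the lecture's trial, landmarks were K-12 schools in greater Kaohsiung.
LANDMARKS = [
    ("school-01", "192.0.2.10", 22.63, 120.30),
    ("school-02", "192.0.2.11", 22.65, 120.32),
]

def probe(host, timeout_s=2):
    """Ping a landmark once; True if it answered (assumed probe method)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True,
    )
    return result.returncode == 0

def find_disconnected(landmarks, probe_fn=probe):
    """Return landmarks that failed to answer; their surrounding
    polygons become the possibly disconnected areas passed to RFS."""
    return [lm for lm in landmarks if not probe_fn(lm[1])]
```

In a deployment, the resulting list would be turned into polygons around the disconnected landmarks, as in the slide's flow between ANP, RFS, and the DRA.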
Performance of Landmark-Based ANP
Accuracy of landmark-based ANP is compared with that of
two off-the-shelf IP geolocation services
Network landmarks are K-12 schools in greater Kaohsiung
Quova
IP2Location
Proof-of-concept RFS
http://nrl.iis.sinica.edu.tw/RFS/
Where, How, and How Well:
Physical Connectivity Recovery
A Means to Repair Cellular Coverage
By adjusting power of existing base stations (example: B3 failed)
From slides presented by Po-Kai Tseng and Wei-Ho Chung of CITI,
Academia Sinica on 2013/7/5
Recovery procedure when B1 failed:
Find users who missed communication
Search available BSs
Allocate channels and power according to interference and demand
Each user is served by the BS at the minimum distance
Recovery from Failure of B1 without BSC:
Each available BS responds with its received signal strength or a random number
Search available BSs; allocate channels and power according to interference and demand
Each user determines its serving BS by the best signal or the largest number
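As an illustration (not from the cited slides), a minimal Python sketch of the no-BSC selection rule above: each reachable BS reports either its received signal strength or, when it cannot measure one, a random tie-break number, and the user attaches to the BS reporting the largest value. Function and variable names are hypothetical.

```python
import random

def bs_response(signal_dbm=None, rng=random.random):
    """A BS reports its measured signal strength (dBm) if available,
    otherwise a random number, as in the no-BSC recovery procedure."""
    return signal_dbm if signal_dbm is not None else rng()

def choose_serving_bs(responses):
    """Pick the serving BS from responses {bs_id: reported value}:
    the user attaches to the BS reporting the largest value."""
    if not responses:
        return None
    return max(responses, key=responses.get)
```

With signal strengths in dBm, "largest value" is the strongest signal, so the same rule covers both kinds of responses.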
Mobile Ad-Hoc Networks
Require no infrastructure
Can provide coverage during disasters
Network partitioning makes them a less than ideal solution
Based on slides for a presentation on “Autonomous mobile mesh network,” by W. L. Shen, et al., of CITI
Academia Sinica. The original slides can be found at http://openisdm.iis.sinica.edu.tw
Autonomous Mobile Mesh Networks
A more robust alternative
bridge router
Inter-group
router
Intra-group router
Client
From, “Autonomous Mobile Mesh Networks,” by Shen, W.-L., C.-S. Chen, K. C.-J. Lin and K. A. Hua to
appear in IEEE Transactions on Mobile Computing.
On Use of Wireless Mobile Networks
for Temporary Connectivity Repair
Mobile ad-hoc networks
Minimal resource/effort to put in place
Unpredictable partitioning/connectivity
Autonomous wireless mesh networks
Can ensure good connectivity
Resource limitation: require well-placed
routers and gateways
Others, e.g., ITS networks
Special resources required
Network partitioning less problematic
What, How, and How Well:
Message Delivery Service Recovery
OIGY
Review data after disaster
Store sensor data in
persistent cloud
TRIPS
API
TRIPS
API
Receive/report
alerts
TRIPS
API
Heterogeneous And Plug-n-Play Network
TRIPS Broker
TRIPS
Communication
infrastructure
TRIPS Broker
TRIPS Broker
TRIPS Broker
TRIPS
API
TRIPS
API
TRIPS
API
Publish/subscribe disaster data
A Simpler View of OIGY
TRIPS
Service network
App
Messaging
service
App
Messaging
service
App
Messaging
service
WAN
LAN
WLAN
PAN
HAPPY
A Use Scenario
Broadcast pathways
<?xml version="1.0"?>
<alert xmlns = …
…
<event>Earthquake</event>
<urgency>Immediate</urgency>
<severity>Strong</severity>
<certainty>Observed</certainty>
…
<parameter>
<valueName>Magnitude</valueName>
<value>8.1</value>
</parameter>
…
<area>
<circle>32.9525 -115.5585 0</circle>
</area>
…
Message processor (alert extraction)
Action activation rule evaluation
Device interface
Authorized
alert sender
PuSH
PuSH
IP
Network
iGaD
Elevator controller
PuSH
iGaD
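As an illustration of the "message processor (alert extraction)" and "action activation rule" steps above, a minimal Python sketch that parses the CAP fields shown in the sample alert and evaluates a rule for an elevator-controller iGaD. The sample here omits the CAP namespace for brevity, and both the parsing helper and the rule threshold are assumptions, not the lecture's implementation.

```python
import xml.etree.ElementTree as ET

# Simplified CAP-like alert (namespace omitted for the sketch).
CAP_SAMPLE = """<alert>
  <info>
    <event>Earthquake</event>
    <urgency>Immediate</urgency>
    <severity>Strong</severity>
    <certainty>Observed</certainty>
    <parameter>
      <valueName>Magnitude</valueName>
      <value>8.1</value>
    </parameter>
    <area><circle>32.9525 -115.5585 0</circle></area>
  </info>
</alert>"""

def extract_alert(xml_text):
    """Pull the fields an action-activation rule might test."""
    info = ET.fromstring(xml_text).find("info")
    fields = {tag: info.findtext(tag)
              for tag in ("event", "urgency", "severity", "certainty")}
    fields["magnitude"] = float(info.find("parameter/value").text)
    return fields

def should_stop_elevator(alert):
    """Hypothetical activation rule for an elevator-controller iGaD."""
    return alert["event"] == "Earthquake" and alert["magnitude"] >= 5.0
```

The device interface would then forward the activated action to the controller, as in the slide's PuSH-to-iGaD pathway.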
Publish/Subscribe Messaging Over
Heterogeneous Networks
Pub/Sub
Overlay Network
IP Network
Related Work
RSS and Atom: web feed formats used to publish frequently
updated data and information
pubsubhubbub: a simple, open, server-to-server, web-hook-based publish/subscribe protocol that extends the Atom and
RSS protocols for data feeds. Parties speaking the protocol
can get near-instant notifications (via webhook callbacks)
when a topic (resource URL) they subscribe to is updated
AMQP: Advanced Message Queuing Protocol for delivering
time-sensitive messages to large numbers of clients
QPID: an open-source implementation of AMQP, which
allows content-based message subscription so that clients
receive only information of interest;
assumes a static and reliable network environment
Data Distribution Service: a publish/subscribe protocol
aiming to enable scalable, real-time, dependable, high-performance, and interoperable data exchanges
Architecture of Service Network
Client
Client
Qpid
Broker
Qpid
Broker
Qpid
Broker
Qpid
Broker
PuSH
Qpid
Broker
Qpid
Broker
Client
Structures of Qpid Brokers
[Figure: each broker has an incoming data monitor, a data update monitor, exchange bindings that route data into priority queues ordered by decreasing priority, a data transfer service, and a data retrieval service; brokers are interconnected by data bridges, and a hub links PuSH clients to the brokers.]
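As an illustration of the broker structure above (not Qpid's actual code), a minimal Python sketch of bindings routing published messages into priority queues, with retrieval in decreasing priority order and FIFO order within a level. Class and parameter names are hypothetical.

```python
import heapq
import itertools

class PriorityBroker:
    """Sketch of a broker: exchange-style bindings map a topic to a
    priority; retrieval returns the highest-priority message first,
    FIFO within a priority level."""

    def __init__(self, bindings):
        self._bindings = bindings        # topic -> priority (0 = highest)
        self._heap = []
        self._seq = itertools.count()    # monotonic counter for FIFO ties

    def publish(self, topic, payload):
        prio = self._bindings.get(topic, 99)   # unbound topics: lowest priority
        heapq.heappush(self._heap, (prio, next(self._seq), payload))

    def retrieve(self):
        """Pop the next message in decreasing-priority order, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A real broker would also bridge queues to peer brokers; this sketch covers only the single-broker queuing discipline.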
Assumptions and Objectives
Messaging services are deployed on wide-area
IP networks.
The goal is to design a run-time mechanism to
recover failed services.
The recovery procedure should be as automatic
as possible and take a short time:
When only parts of the network fail, recover
the services before technicians arrive
The recovery procedure should require
minimal amount of technician time.
Landmark-Based (Centralized) Service Recovery
The network is divided into disjoint subnets.
Each subnet has a unique recovery service called a
landmark, which monitors the brokers in the subnet.
The landmark replaces a failed broker with a backup broker.
Qpid
Broker
Client
Qpid
Broker
Qpid
Broker
Client
Client
Qpid
Broker
Landmark
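As an illustration (not the lecture's implementation), a minimal Python sketch of the landmark's monitor-and-replace loop: poll each broker in the subnet and promote a standby broker when one stops answering. The probe function and all names are assumptions.

```python
class Landmark:
    """Sketch of a landmark for one subnet: it polls its brokers and
    promotes a backup broker when one stops answering."""

    def __init__(self, brokers, backups, is_alive):
        self.brokers = list(brokers)   # active broker addresses
        self.backups = list(backups)   # standby broker addresses
        self.is_alive = is_alive       # probe function (an assumption)
        self.replacements = {}         # failed broker -> promoted backup

    def check_once(self):
        """One monitoring pass; in practice this would run periodically
        and also trigger a configuration-map update (see the
        initialization-overhead slide)."""
        for i, broker in enumerate(self.brokers):
            if not self.is_alive(broker) and self.backups:
                backup = self.backups.pop(0)
                self.brokers[i] = backup
                self.replacements[broker] = backup
        return self.replacements
```
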
Evaluation
Up to 40 broker nodes are deployed in the network,
which can serve tens of thousands of client
devices and applications
One landmark node monitors and recovers failed
services if there are any
Initialization Overhead
The landmark node has to
learn the network topology and service map when
the system starts.
update its configuration data when a node (broker)
fails and a backup is set up to replace it.
Monitor and Recovery Overhead
Network traffic during regular operation increases
as the number of broker nodes increases.
When there is no concurrent failure, recovery
overhead is not sensitive to the number of brokers
Recovery Time Versus
Number of Concurrent Failures
Distributed Service Recovery
The service network should tolerate a total of N or fewer
concurrent failures of broker and recovery services.
When a broker service fails, the recovery services reach
consensus on selecting a backup service. This is done
according to the Paxos algorithm.
Participants in the Paxos algorithm play the roles of
Proposers: They generate proposals to recover; each
proposal includes the address of a proposed backup
service and a score representing its evaluation
Acceptors: They vote for proposals and are known to
the proposers
Each broker
Is monitored by N recovery services acting as proposers
needs 2N-1 recovery services serving as acceptors
Handshakes According to Paxos
Proposer:
Phase 1: send Prepare_Request; wait for ACKs from N (i.e., more than half of the) acceptors
Phase 2: send Accept_Request; wait for N or more ACKs
Commit: do service recovery; send Commitment to the acceptors
Acceptor (initially Promised_Proposal := none):
Phase 1: receive Prepare_Request; if it is better than Promised_Proposal, send ACK and replace Promised_Proposal by the received proposal
Phase 2: receive Accept_Request; if it is better than Promised_Proposal, send ACK and wait for Commitment
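As an illustration, a heavily simplified Python sketch of the two-phase handshake above: a proposal (here a pair of a proposal number and a proposed backup address) succeeds only if each phase gathers a quorum of acceptor ACKs. This sketch omits parts of full Paxos, in particular acceptors reporting previously accepted values, and all names are hypothetical.

```python
class Acceptor:
    """Acceptor state from the handshake: remembers the best
    proposal it has promised so far (higher proposal number wins)."""

    def __init__(self):
        self.promised = None   # (number, backup_address) or None

    def prepare(self, proposal):
        """Phase 1: ACK and remember the proposal only if it is
        better than the current promise."""
        if self.promised is None or proposal > self.promised:
            self.promised = proposal
            return True        # ACK
        return False

    def accept(self, proposal):
        """Phase 2: ACK unless a better proposal has been promised."""
        return self.promised is None or proposal >= self.promised

def propose(proposal, acceptors, quorum):
    """Run both phases; succeed only if each phase gathers a quorum
    (N, i.e., a majority of the 2N-1 acceptors) of ACKs."""
    if sum(a.prepare(proposal) for a in acceptors) < quorum:
        return False
    return sum(a.accept(proposal) for a in acceptors) >= quorum
```

A committed proposal would then name the backup service used for recovery, matching the proposer's commit step in the slide.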
Properties of Paxos Algorithm
Safety: One proposal and only one proposal is
selected by consensus
Liveness
As long as one proposer and more than half
of the acceptors remain alive, the algorithm
can succeed in selecting a proposal.
To tolerate N failures (of brokers and recovery
services), at least N proposers and 2N-1
acceptors are required.
See http://en.wikipedia.org/wiki/Paxos_(computer_science) for details
Reconfigurable Paxos
Paxos protocol assumes that during the selection of a
backup, the set of participants is static: The sets of acceptors
known to all proposers are the same.
Reconfigurable Paxos removes this assumption, which may
not be valid when participants are likely to fail.
To change the acceptor set, a monitored service
sends to acceptors in the current set an acceptor_change
proposal containing the acceptors in the new set.
On each acceptor-set change, the monitored service
assigns an increasing version number to the new set.
Every Paxos message carries the version number of
the latest acceptor set known to the sender.
If the versions known to the sender and receiver of any
message are not equal, the participant with the newer version
informs the other participant.
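As an illustration of the versioning rule above (not the lecture's code), a minimal Python sketch of the reconfigurable-Paxos bookkeeping: each acceptor-set change bumps a version number, and on a version mismatch the participant with the newer version brings the other up to date. All names are hypothetical.

```python
class VersionedAcceptorSet:
    """A participant's view of the acceptor set, tagged with the
    version number that every Paxos message carries."""

    def __init__(self):
        self.version = 0
        self.acceptors = []

    def change(self, new_acceptors):
        """Install a new acceptor set with an increased version number."""
        self.version += 1
        self.acceptors = list(new_acceptors)

def reconcile(a, b):
    """On a version mismatch between two participants, the one with
    the newer version informs the other."""
    if a.version > b.version:
        b.version, b.acceptors = a.version, list(a.acceptors)
    elif b.version > a.version:
        a.version, a.acceptors = b.version, list(b.acceptors)
```
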
Performance Evaluation
Experiment environment
Network simulator: EstiNet
OS: Fedora 14 (on virtual machine)
CPU: Xeon E5620 (1 core)
RAM: 2GB
Objective: To determine the time required to recover
from concurrent failures
Using centralized & distributed recovery schemes
Varying the total number of hosts, number of
failed hosts, and background traffic
Method: collect data by logging timestamps and
traffic on simulated network interface
Control Message Rate/Size
When a recovery service monitors a broker, it
retrieves QMF messages periodically.
The data rate is about 2.4 KB/sec per broker
Number of Brokers | Data Rate (KB/sec) | Standard Deviation
10 | 23.97 | 1.67
20 | 42.68 | 2.97
30 | 68.92 | 3.27
40 | 89.49 | 6.68
Control Message Rate/Size
The data transmitted during recovery is related
to the number of objects on the broker.
It takes about 10 KB to create a queue and
12 KB to create a route.
# of Queues | Data size (KB) | Std Dev.
0 | 496.38 | 4.51
10 | 595.80 | 5.16
20 | 704.23 | 5.48
30 | 800.05 | 6.09
40 | 907.87 | 5.80

# of Routes | Data size (KB) | Std Dev.
0 | 567.03 | 2.23
5 | 627.87 | 3.99
10 | 690.02 | 5.30
15 | 751.34 | 6.19
20 | 810.38 | 6.91
Re-initialization Latency of Distributed Recovery
When one broker fails, one recovery service conducts
the recovery procedure.
When one recovery service fails, K brokers find new
proposers/acceptors.
Number of
brokers = 25
Number of
recovery
services = 25
K=5
K=7
K=9
Comparison between
Centralized and Distributed Recovery
50 hosts
Bandwidth = 10Mbps
24 hosts
Bandwidth = 10Mbps
Effect of Limited Bandwidth
24 hosts
Bandwidth = 1Mbps, 80% consumed by
background traffic
Future Plans
The source code of landmark-based recovery will be
submitted to the QPID open source community.
Field Trials:
Collaborate with National Science and Technology
Center for Disaster Reduction to deploy the system for
earthquake alerts in Taipei area.
Collaborate with domestic telecommunication service
providers to deliver messages to phone users.
Continue to experiment with pushing messages in CAP
format to iGaDs, smart-device applications for disaster
preparedness purposes.
Thank
You!