Transcript CS514-lec

CS514: Intermediate
Course in Operating
Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 20 Nov. 2
Scalable Reliable
Multicast
• Became a hot topic with the
introduction of Berkeley’s SRM
• Idea is to provide a primitive that
– Has best-effort reliability (tries to repair
problems… doesn’t promise)
– Is receiver driven
• Sender doesn’t know who is in the group
• Receiver must solicit repairs
– Has localized costs
SRM goal compared to
virtual synchrony
• Virtual synchrony provides a strong
guarantee
– Either all processes that stay
operational receive the message
– Or none does
• But this seems to limit scalability
• Virtual synchrony can be fast but
apparently is limited to groups of
about 50 processes or fewer
Stock Exchange Problem:
Vsync. multicast is too “fragile”
Most members are
healthy….
… but one is slow
[Figure "Effect of Perturbation": throughput (msgs/sec, 0–250) vs. amount perturbed (0.1–0.9); curves for the Virtual Synchrony Protocol (ideal and actual) and the Pbcast Protocol]
Figure 1: Multicast throughput in an 8-member group perturbed by transient failures
The problem gets worse as
the system scales up
[Figure: virtually synchronous Ensemble multicast protocols; average throughput (0–250 msgs/sec) on nonperturbed members vs. perturb rate (0–0.9), for group sizes 32, 64, and 96]
SRM
• Unlike virtual synchrony, the
membership of the group is
loosely defined
– Do it in the style of IP multicast
– Receivers join at will
• Receiver has a way to request
missing data
Application Level
Framing
• Really, a separate idea
• But was bundled with SRM:
– Application is assumed to be doing video
multicast streaming
– Perhaps it has its own way to recover
lost data
• So instead of retransmitting precisely the lost
data…
• … we solicit retransmission but let the
application decide what to send in response
Comparison
• Virtual synchrony:
– Strong membership
– Each multicast has its own identity
– Strong guarantees of ordering,
synchronization, atomicity
• SRM:
– Weak membership
– Multicast has a sequence number but
retransmission determined separately
– Weak guarantees
Flow control
• Virtual synchrony does flow control
– Sender has some amount of space
– If it runs low, new multicasts are
stopped
• SRM lacks any form of flow control
• Sender just runs at whatever rate it
may select, oblivious to network
• In fact, SRM doesn’t even use
acknowledgements from receiver to
sender!
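The contrast can be sketched in code. The sketch below is illustrative, not from any real toolkit: a virtual-synchrony-style sender blocks when its buffer budget runs out, whereas an SRM sender would just transmit unconditionally.

```python
import threading

# Illustrative sketch (not from any real toolkit): virtual-synchrony-style
# sender-side flow control. The sender holds a fixed buffer budget and
# blocks when it runs out; space returns as earlier multicasts stabilize.
class SendWindow:
    def __init__(self, capacity: int):
        self.sem = threading.Semaphore(capacity)

    def send(self, multicast_fn, msg):
        self.sem.acquire()      # blocks if no buffer space remains
        multicast_fn(msg)

    def on_stable(self):
        self.sem.release()      # buffer slot reclaimed

# An SRM-style sender, by contrast, would just call multicast_fn(msg)
# at whatever rate it likes, with no acknowledgement path back to it.
```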
How SRM works
• Assumes that network has an IP
multicast routing tree within it
• So, IP multicast is reasonably cheap
• SRM is built using broadcasts within
this tree
– One exception: so called “scalable
session messages”
– We’ll return to this later
Basic algorithm
• Application generates and
numbers a packet
• Sends it using IP multicast
• Receiver receives it, puts it in
order, delivers it
… cost is one packet per link in
the IP multicast tree, one send
and one receive per process
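A minimal sketch of the numbering step (the 4-byte header format here is an assumption for illustration; the actual transmission is one IP-multicast send):

```python
import struct

# Hypothetical framing for the basic algorithm: a big-endian 32-bit
# sequence number prefixed to the application payload.
HEADER = struct.Struct("!I")

def frame(seqno: int, payload: bytes) -> bytes:
    return HEADER.pack(seqno) + payload

def unframe(packet: bytes):
    (seqno,) = HEADER.unpack_from(packet)
    return seqno, packet[HEADER.size:]

# The sender frames each packet with the next sequence number and hands
# it to IP multicast; each receiver uses the number to restore order.
```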
Basic algorithm:
application perspective
[Diagram: at the application level, the sender S simply multicasts to the receivers R (S = sender, R = receiver)]
Basic algorithm:
network perspective
[Diagram: the same multicast seen in the network: S sits at the root of an IP multicast routing tree of routers, with the receivers R at the leaves (S = sender, R = receiver, ○ = network router)]
Things to notice
• IP multicast lives “in the network”
• SRM protocol lives in the application
• There may be many more network
nodes involved than application
nodes
• Similar observation applies to any
protocol unless implemented by the
router itself
SRM error recovery
• Suppose a packet is lost…
• This will either happen
– In the O/S near the receiver
• Affects just one process
– Or in the network or the O/S of the
sender
• Affects a subtree
• In SRM, subtree might be very large
Basic algorithm:
network perspective
[Diagram: the multicast tree again, now with the packet lost at an interior router, so an entire subtree of receivers misses it]
Receiver notices
• Either it receives the next
packet in the stream
– In this case it notices a gap
• Or it receives a “session
message”
– Again, this causes it to see a gap
– Sent periodically
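Gap detection on the receiver side might look like this sketch (the names and the session-message content are assumptions):

```python
class ReorderBuffer:
    """Per-sender sketch: deliver in order, and report gaps noticed
    either from a later data packet or from a session message."""

    def __init__(self):
        self.next_expected = 0
        self.buffered = {}          # out-of-order packets by seqno

    def receive(self, seqno, data):
        # Any packet beyond next_expected exposes a gap.
        gap = list(range(self.next_expected, seqno))
        self.buffered[seqno] = data
        delivered = []
        while self.next_expected in self.buffered:
            delivered.append(self.buffered.pop(self.next_expected))
            self.next_expected += 1
        return delivered, gap

    def on_session_message(self, highest_seqno):
        # A periodic session message advertises the sender's highest
        # sequence number, exposing gaps even with no later data.
        return [s for s in range(self.next_expected, highest_seqno + 1)
                if s not in self.buffered]
```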
A wave of discovery
• Receivers have some distance
from the sender
• Can measure it in milliseconds
• Close receivers notice the
problem first.
Basic algorithm:
network perspective
[Diagram: receivers in the affected subtree notice the gap; those closest to the sender notice it first]
Basic algorithm:
network perspective
[Diagram: a receiver that noticed the gap multicasts a NAK into the tree]
Basic algorithm:
network perspective
[Diagram: the NAK propagates through the tree, where other receivers and an upstream node holding a copy of the packet all see it]
NAK spreads like a
wave
• Other receivers see the NAK
– It inhibits them from sending their
retransmission request
• Some upstream sender sees it
too
– It generates a repair
NAK or Repair Flood
• But why won’t all receivers at a
given distance NAK
simultaneously?
• Or all senders retransmit?
– In fact this would be a problem as
the protocol was described so far
SRM borrows an idea
from the Ethernet
• Introduce a few milliseconds of
randomness
• Instead of sending the NAK as soon
as the timeout expires, pick a delay
Δ and send at a randomly selected
time in the interval [t … t+Δ]
• With any luck at all, the first NAK
inhibits the others
• Similarly for retransmissions
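A sketch of the randomized back-off (the constants and structure are illustrative, not the exact SRM timer rules):

```python
import random

class NakScheduler:
    """Illustrative SRM-style NAK suppression, not the exact SRM timers."""

    def __init__(self, distance_ms: float):
        self.distance_ms = distance_ms   # measured distance to the sender
        self.pending = {}                # seqno -> scheduled send time

    def schedule_nak(self, seqno, now_ms, spread_ms=10.0):
        # Base the delay on distance, plus a random term so receivers
        # at the same distance rarely fire at the same instant.
        self.pending[seqno] = (now_ms + self.distance_ms
                               + random.uniform(0, spread_ms))

    def on_nak_heard(self, seqno):
        # Someone else's NAK for this packet inhibits ours.
        self.pending.pop(seqno, None)

    def due(self, now_ms):
        # NAKs whose timers expired without being suppressed.
        return [s for s, t in self.pending.items() if t <= now_ms]
```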
Why will this work?
• Ideally, packet loss is low
– Hence zero or one losses in the
entire tree
– But many impacted receivers
• With this scheme
– Ideally, one NAK inhibits others
– One retransmission inhibits
others
Scalable Session
Messages
• In a large tree may see losses in
different subtrees
• Want to limit cost of
– NAK
– Repair
• Do this using the regional TTL
setting for IP multicast
• Limits the propagation of the various
messages, except for initial
multicast of the message itself
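On most stacks this TTL scoping is a per-socket option; a sketch (the particular TTL values are arbitrary examples):

```python
import socket

def make_scoped_socket(ttl: int) -> socket.socket:
    # IP_MULTICAST_TTL bounds how many router hops a multicast packet
    # may cross, confining NAKs and repairs to a region of the tree.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock

repair_sock = make_scoped_socket(ttl=4)    # region-local NAKs/repairs
data_sock = make_scoped_socket(ttl=64)     # wide-area initial multicast
```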
Experience?
• SRM works well in networks with low loss
rates
• But if packet loss becomes higher
– SRM overheads begin to soar
– May see many solicitations, repairs on each
network link
– Eventually, this overloads the network
• Example: with 100 receivers in a large
network we easily provoked scenarios
where each packet triggers 5 or 10 NAKs
and a similar number of repairs
Why is this bad?
• All processes see these NAK
and repair messages
• Hence costs are global
A quadratic cost!
• Risk of a “multi-loss” rises with
network size, roughly linearly
• And cost of a NAK or repair is
also linear
• Could argue that SRM has an
O(n²) mechanism buried within
• In practice, dies around 50
processes… similar to vsync!
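A back-of-envelope version of the argument (constants elided): with per-receiver loss probability p, expected distinct losses per multicast grow like p·n, and every NAK or repair is seen by all n processes.

```python
# Rough model only: losses per multicast (~p*n) times the global
# audience for each NAK/repair (n) gives O(n^2) total message events.
def expected_nak_repair_events(n: int, p: float) -> float:
    return (p * n) * n

# Doubling the group size quadruples the expected overhead.
assert expected_nak_repair_events(100, 0.05) == 4 * expected_nak_repair_events(50, 0.05)
```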
Status?
• In the Internet today, IP multicast is
mostly disabled
• If used at all, limited to individual
companies
– But even then is usually disabled
• Internet MBone set aside for
multicast
• But has a rigid structure, used for a
single Internet “video feed”
Alternatives to SRM
• RMTP was developed at Lucent
– Reliable Multicast Transport Protocol
• Supports 1-many file transfer
• Uses notion of regional repair
servers
– Each region has a repair server that
handles processes in its vicinity
– Hierarchical repair architecture
– Also has flow-control built in
RMTP status
• Experience suggests it works
well
– But placement of repair servers is
very important
– Also, not very helpful for arbitrary
uses; most experience is with file
transfer applications
• An IETF standard as of 1999
PGM
• A Cisco proposal
• Idea is to put a form of reliable
multicast right into the routers
– Routers talk to each other
– Recovery of lost packets occurs
locally
– But mechanism is very simple and
offers no guarantees
PGM
• Cisco promoting it as a
standard
• But seems not to be taking off
• To be useful, needs to work on
all routers in a network
– Cisco has a tunneling scheme to
leapfrog small numbers of noncompliant routers
LGM
• Lightweight group multicast:
another network-level proposal
• Getting a lot of attention now
• But no implementations in the
field
• And there are three or four
other proposals
Forward Error
Correction
• Can be used with any protocol
• Idea is:
– Instead of just sending data:
• m0.. m1… m2…
– Send an error correction code from time
to time:
• m0.. m1… m2… ECC012… m3…
• Receiver can use code to correct for
packet loss
FEC continued
• Example?
– Suppose that ECC simply is XOR of prior
k packets
– Then with any k-1 of them plus the
ECC we can reconstruct the missing
packet
– Hence we would need to lose 2 packets
in a block to have an unrecoverable gap
• Can use more powerful codes but
they tend to be slower to compute
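The XOR case is easy to sketch (assumes equal-length packets; real codes pad short packets or use stronger algebra):

```python
from functools import reduce

def xor_parity(packets):
    # Bytewise XOR of k equal-length packets forms the ECC packet.
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)),
                        packets))

def recover_missing(survivors, parity):
    # XOR of the k-1 surviving packets with the parity packet yields
    # the single missing packet.
    return xor_parity(list(survivors) + [parity])
```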
FEC continued
• With genuinely random packet loss,
FEC works really well
• No need to send a NAK or a repair
• Good match to Internet router
behavior
– RED drops 'em
– FEC fixes 'em!
• One idea: FEC plus congestion
control equals SRM!
Issues?
• Dependency on a single route or
single IP multicast forwarding
tree
– Routers rarely fail…
– But if router becomes heavily
loaded, may drop many packets
– Similarly if routes change – cf. the
Jahanian study
Wishful thinking?
• Recall our idea of a network routing
scheme that would offer failure-independent redundant routes
– (A,B) and (A’,B’) “independent”
• If we had such a scheme for IP
multicast…
• … we could retransmit over it and
would have a good chance of
recovery even if network becomes
very overloaded!
IBM Gryphon Routers
• Gryphon: high-speed publish-subscribe project at IBM
– Idea was to build hardware
– Started with software on UNIX boxes…
highly optimized
• Special purpose “network” just for
publish-subscribe applications
– Targets stock markets, financial systems
– Likely to have a substantial market
– But expensive: dedicated infrastructure
What will future bring?
• Topic of much debate
• Two broad schools of thought
– Mine (pessimistic)
– Theirs (optimistic)
Mine?
• Perhaps I’m too negative
• But I suspect:
– The Internet is a mess
– QoS is not coming any time soon
– Multicast isn’t about to start working
– The situation is broken and we need a
whole new “web” design
• But perhaps can reuse many pieces of
existing architecture and technology
• Much early evidence suggests this can work
Theirs
• The prevailing academic research
perspective
• They see
– Diffserv, multilevel queuing…
– Various multicast mechanisms
• And they expect
– Internet telephony
– Publish-subscribe hardware
– Streaming media over IP
• Vast riches for all!