EECE 411: Design of Distributed Software Applications

Peer to Peer Networking and Application
Hongfei Yan
http://net.pku.edu.cn/~course/cs501/2009/
Refer to Matei Ripeanu's slides
Today’s Objectives
Understand real-world applications in terms of:
Motivation and objectives
Requirements: compute/storage/network resources
Architecture (“distributed systems” part)
Examples: Recent p2p applications
Start thinking of computer networks from the
perspective of a networked application
More intuitive
P2P Definition(s)
Def 1: “A class of applications that takes
advantage of resources — storage, cycles, content,
human presence — available at the edges of the
Internet.”
Edges often turned on/off, without permanent IP addresses
Def 2: “A class of decentralized, self-organizing
distributed systems, in which all or most
communication is symmetric.”
Lots of other definitions that fit in between
Lots of (P2P?) systems that fit nowhere …
P2P Impact: Widespread adoption
Skype: >300M users
KaZaA: 389 million downloads (1M/week)
one of the most popular applications ever!
Number of users for file-sharing applications
(estimate, www.slyck.com, Sept '06):
  eDonkey              3,108,909
  FastTrack (Kazaa)    2,114,120
  Gnutella             2,899,788
  Overnet                691,750
  Filetopia                3,405
P2P Impact (2): Huge resource users
P2P-generated traffic now dominates the Internet load
Cornell.edu (March '02): 60% P2P
UChicago estimate (March '01): Gnutella control traffic alone is about 1% of all Internet traffic
[Chart: Internet2 traffic statistics, Feb. '02 to Feb. '03: share of file sharing, data transfers, unidentified, and other traffic, 0-100%]
P2P Impact (3) – Demonstrate that underused
resources can be efficiently harnessed
Resources: CPU cycles, storage space
But also: network bandwidth, availability, user
attention and expertise
Seti@Home statistics:
                               Total           Last 24 Hours
  Users                        4,236,090       2,365
  Results received             764M            1.13M
  Total CPU time               1.3M years      1.3K years
  Floating-point operations    2.564760e+21    4.441277e+18 (51.40 TeraFLOPs/s)
P2P Impact (4) – Social / Business
Data copying and distribution at (almost) zero cost
Might force companies to change their business models
• Digital content production and distribution
• Telecommunications companies
But what business models?
P2P is great for ‘volunteer’ computing, yet its
applicability is unclear for infrastructures
Roadmap
Definitions
Impact
Applications
Mechanisms
An In-depth Case Study
Applications: Number crunching
Examples: Seti@Home, Entropia, UnitedDevices, etc.
Characteristics (for Seti@Home):
Massive parallelism
Low bandwidth/computation ratio
Fixed-rate data processing task
Error tolerance
Users do donate *real* resources
Problems
$1.5M/year in extra power consumed
Centralized. Does it scale?
How to prevent cheating? (see the redundancy sketch below)
Approach suitable only for a particular class of problems.
• How to extend the model to problems that are not massively parallel?
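The Seti@Home-style master/worker pattern above can be sketched in a few lines. This is a minimal illustration, not the actual Seti@Home/BOINC protocol; the redundancy factor, volunteer names, and majority-vote validation rule are assumptions:

    # Minimal master/worker sketch with redundant assignment: each work unit
    # goes to several volunteers, and a result is accepted only on majority
    # agreement, which is one simple way to limit cheating.
    from collections import Counter
    import random

    REDUNDANCY = 3  # how many volunteers process each work unit (assumed)

    def assign(work_units, volunteers):
        """Map each work unit to REDUNDANCY distinct volunteers."""
        return {wu: random.sample(volunteers, REDUNDANCY) for wu in work_units}

    def validate(results_per_unit):
        """Accept a result only if a majority of the replicas agree on it."""
        accepted = {}
        for wu, results in results_per_unit.items():
            value, votes = Counter(results).most_common(1)[0]
            if votes > len(results) // 2:
                accepted[wu] = value
        return accepted

    # Example run: one volunteer returns a bogus value for unit "wu-1".
    volunteers = ["v1", "v2", "v3", "v4", "v5"]
    assignments = assign(["wu-0", "wu-1"], volunteers)
    results = {"wu-0": [42, 42, 42], "wu-1": [7, 7, 999]}
    print(validate(results))  # {'wu-0': 42, 'wu-1': 7}: the outlier is voted down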
Applications: File sharing & content
distribution
The ‘killer application’ to date
Too many to list them all:
BitTorrent, Napster, FastTrack (KaZaA, KazaaLite,
iMesh), Gnutella (LimeWire, Morpheus, BearShare)
Two independent problems
1. Distributed index
2. Fast content download
Environment: unreliable, non-cooperative
Applications: Performance evaluation
Performance evaluation & monitoring requires
multiple measurement points
Connectivity statistics
Routing errors
Evaluate Web-site performance from the end-user perspective (a measurement sketch follows)
Poor online performance costs businesses $25 billion per year (Zone Research)
28% of attempted online purchases fail (BCG)
• Slow page performance is the primary reason for transaction abandonment
• Business transactions are at particular risk
Eight seconds used to be regarded as the threshold, but that is a ridiculous notion now.
• The real expectation is around 4 seconds
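The end-user measurement idea can be sketched as a small peer-side probe. A minimal sketch, assuming a target URL and ignoring how samples are reported back (the collection service is hypothetical):

    # A peer times a full page fetch from its own vantage point.
    import time
    import urllib.request

    def measure(url, timeout=10.0):
        """Return the time in seconds to download the page body, or -1 on failure."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
        except OSError:
            return -1.0
        return time.monotonic() - start

    if __name__ == "__main__":
        latency = measure("http://example.com/")
        print("page load took %.2fs" % latency)
        # A real peer would report this sample, tagged with its location/ISP,
        # to an aggregation service (omitted here).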
Measurements: The Performance “Blind Spot”
[Diagram: the request path from the back-end infrastructure (database, Web server, app server, firewall) across the network landscape (backbone ISPs, enterprise and major providers, regional networks, local ISPs, T1 links, 3rd-party content) to corporate and consumer users. The last mile is the measurement “blind spot”: component-testing tools (BMC, Mercury Interactive, Tivoli, ProactiveNet, HP OpenView, Computer Associates) and datacenter-monitoring “beacon” services (Keynote Systems, Mercury Interactive, BMC/SiteAngel, Service Metrics) cover only the datacenter side.]
Critical to estimate end-to-end performance
Slide source: www.porivo.com
Measurements: End-to-end Performance
[Diagram: the same landscape, but with peers at the corporate and consumer endpoints performing end-to-end Web performance testing across the last mile, complementing component testing and datacenter monitoring.]
Slide source: www.porivo.com
More applications …
Backup storage (HiveNet, OceanStore)
Instant messaging (Yahoo, AOL)
Collaborative environments
Spam filtering
Anonymous email
Censorship-resistant publishing systems
(Eternity, Freenet)
Roadmap
Definitions
Impact
Uses and Examples
Mechanisms
An In-depth Case Study
Mechanisms (I)
To obtain a resilient system:
integrate multiple components with uncorrelated
failure curves.
use replication for data and services to move the
service ‘closer’ to the user
Example: Content Delivery Networks
[Diagram: DNS redirection. The browser asks its resolver for akamai.cnn.com; the CDN's DNS server (dnssrv) returns a proxy (httpprx), preferably one near the client, e.g. 216.165.108.10; the proxy serves the request, fetching missing data from a nearby cooperating Web cache or from the origin server.]
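The DNS-redirection step can be sketched as follows. This is a toy illustration, not an actual CDN's logic; the prefix-to-region table and proxy addresses are made up:

    # The CDN's DNS server maps the resolver's address to a region and
    # answers with the address of a nearby HTTP proxy.
    import ipaddress

    # region -> CDN proxy address (hypothetical)
    PROXIES = {"west": "216.165.108.10", "east": "216.165.109.20"}

    # client prefix -> region (hypothetical geolocation table)
    REGIONS = {
        ipaddress.ip_network("137.82.0.0/16"): "west",
        ipaddress.ip_network("18.0.0.0/8"): "east",
    }

    def resolve(name, resolver_ip):
        """Answer a DNS query for the CDN name with a proxy near the resolver."""
        addr = ipaddress.ip_address(resolver_ip)
        for prefix, region in REGIONS.items():
            if addr in prefix:
                return PROXIES[region]
        return PROXIES["east"]  # fall back to a default proxy

    print(resolve("akamai.cnn.com", "137.82.1.5"))  # -> 216.165.108.10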
Mechanisms (II)
To improve the quality of the service delivered and
reduce costs:
integrate multiple providers with uncorrelated
demand curves
(lower over-provisioning at each of them)
move service delivery closer to the user
Example: Server consolidation (a worked example follows)
Average utilization:
  mainframes are idle 40% of the time
  Unix servers are idle 90% of the time
  PC servers are idle 95% of the time
But peak utilizations are different:
  mainframes are idle 0-15% of the peak hour
  PC servers are idle 70% of the peak hour
Source: "Grid Computing", Dr. Daron G. Green
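A toy worked example of why uncorrelated demand curves reduce total over-provisioning: provisioning each service for its own peak needs more capacity than provisioning a shared pool for the peak of the sum. The two hourly demand traces below are invented:

    # Capacity needed separately vs. consolidated.
    web_demand = [20, 30, 80, 100, 60, 30]    # peaks during the workday
    batch_demand = [90, 100, 20, 10, 30, 80]  # peaks overnight

    separate = max(web_demand) + max(batch_demand)                  # 100 + 100 = 200
    shared = max(w + b for w, b in zip(web_demand, batch_demand))   # 130

    print("capacity if provisioned separately:", separate)
    print("capacity if consolidated:          ", shared)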
Load Is Dynamic
[Chart: throughput (requests/s) over one week, ibm.com external site, February 2001: daily fluctuations (3x), workday cycle, weekends off.]
[Chart: throughput (requests/s) over two months, World Cup soccer site, May-June 1998: seasonal fluctuations and event surges (11x).]
For Example:
Energy-Conscious Provisioning
Light load: concentrate traffic on a minimal set of servers
Step down surplus servers to a low-power state
Activate surplus servers on demand
Even smarter: also manage air conditioning
Idling consumes 60% to 70% of peak power demand.
[Chart: per-server power states: boot 136 W, CPU max 120 W, CPU idle 93 W, disk spin 6-10 W, off/hibernate 2-3 W.]
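The provisioning decision can be sketched as follows, using the power figures from the chart above; the per-server request capacity is an assumption:

    # Keep only as many servers active as the current load needs and
    # hibernate the rest; estimate the power saved versus letting them idle.
    import math

    IDLE_W, HIBERNATE_W = 93, 3    # from the power-state chart
    REQS_PER_SERVER = 500          # assumed capacity of one server (req/s)

    def plan(load_reqs_per_s, total_servers):
        active = min(total_servers, max(1, math.ceil(load_reqs_per_s / REQS_PER_SERVER)))
        surplus = total_servers - active
        saved_w = surplus * (IDLE_W - HIBERNATE_W)
        return active, surplus, saved_w

    active, surplus, saved = plan(load_reqs_per_s=1800, total_servers=20)
    print("%d active, %d hibernated, ~%d W saved vs idling" % (active, surplus, saved))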
Mechanisms (III)
To provide anonymity:
use a large number of independent components
(“hide in the crowd”) and make search
impossible (or at least costly)
Example: onion routing
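A minimal sketch of onion routing's layered encryption (not the real Tor protocol): the sender wraps the message once per relay, and each relay can only peel its own layer. The third-party cryptography package stands in for the per-relay symmetric keys a real sender would establish; relay names are made up:

    from cryptography.fernet import Fernet

    relays = ["relayA", "relayB", "relayC"]                    # hypothetical relays
    keys = {r: Fernet(Fernet.generate_key()) for r in relays}  # one key per relay

    def wrap(message, path):
        """Encrypt in layers; the innermost layer belongs to the last relay."""
        onion = message
        for relay in reversed(path):   # last relay's layer goes on first
            onion = keys[relay].encrypt(onion)
        return onion

    def unwrap_at(relay, onion):
        """Each relay peels exactly one layer with its own key."""
        return keys[relay].decrypt(onion)

    onion = wrap(b"GET /page", relays)
    for r in relays:                   # the message travels relay by relay
        onion = unwrap_at(r, onion)
    assert onion == b"GET /page"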
Mechanisms (IV)
To detect anomalies and to generate good statistics:
use multiple views
Example: Web-server performance
characterization
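A minimal sketch of combining multiple views: latency samples for the same server taken from several vantage points are aggregated, and views that differ sharply from the rest are flagged. The vantage-point names and numbers are made up:

    from statistics import median

    samples_ms = {  # vantage point -> measured response time (ms)
        "ubc": 210, "mit": 190, "calgary": 225, "chicago": 2400,
    }

    def outliers(samples, factor=3.0):
        """Return vantage points whose measurement exceeds factor x the median."""
        m = median(samples.values())
        return [vp for vp, v in samples.items() if v > factor * m]

    print(outliers(samples_ms))  # ['chicago']: likely a local problem, not the server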
Roadmap
Definitions
Impact
Uses and Examples
Mechanisms
An In-depth Case Study
File sharing – The Gnutella Network
Basic Primitives for File Sharing
Join: How do I begin participating?
Publish: How do I advertise my file(s)?
Search: How do I find a file?
Fetch: How do I retrieve a file?
Lots of different solutions for each of these four primitives; a sketch of them as an interface follows. We'll look at the Gnutella network.
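One way to compare systems is to write the four primitives down as an interface. This is a hypothetical sketch, not an actual Gnutella or Napster API:

    from abc import ABC, abstractmethod

    class FileSharingPeer(ABC):
        @abstractmethod
        def join(self, bootstrap_nodes):
            """Begin participating by contacting a few known nodes."""

        @abstractmethod
        def publish(self, filename):
            """Advertise a locally stored file (a no-op in Gnutella)."""

        @abstractmethod
        def search(self, keywords):
            """Return addresses of peers that hold matching files."""

        @abstractmethod
        def fetch(self, peer, filename):
            """Retrieve the file directly from the chosen peer."""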
What makes Gnutella network interesting?
Large scale
up to 2M nodes, 1,000 TB of data, 100M files today
Self-organizing network
Fast growth in its early stages
more than 50 times during first half of 2001
(50 times again 2001 to 2006)
Open architecture, simple and flexible protocol
Interesting mix of social and technical issues
Gnutella search mechanism
[Diagram: Gnutella nodes in Boston, Chicago, and Calgary and at MIT and UBC, connected by TCP overlay tunnels running over the underlying routers; UBC holds "Beatles: Yellow Submarine" and another node issues the query "Beatles".]
Search steps:
1. A node initiates a search for "Yellow Submarine"
2. It sends the query message to all its neighbors
3. Neighbors forward the message
4. A node with a match initiates a reply message
5. The reply message is back-propagated along the query path
6. The file is downloaded directly
Gnutella: Overview
Join: on startup, client contacts a few other
nodes; these become its “neighbors”
Publish: no need
Search:
Flooding: ask neighbors, who ask their
neighbors, and so on... when/if found, reply
to the sender.
Back-propagation in case of success
Fetch: get the file directly from peer (HTTP)
[Note: this was the original design. Later the network
moved to a two-layer structure]
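The flooding search described above can be sketched as follows. This is a simplified illustration (direct method calls instead of TCP messages, no real Gnutella message format): queries carry a TTL, duplicates are dropped, and hits travel back along the reverse path.

    class Node:
        def __init__(self, name):
            self.name = name
            self.neighbors = []   # overlay links (TCP connections in real Gnutella)
            self.files = set()
            self.seen = set()     # query IDs already handled

        def query(self, qid, keyword, ttl, came_from=None):
            """Handle a query: answer if we hold a match, else flood onward."""
            if qid in self.seen or ttl == 0:
                return []
            self.seen.add(qid)
            hits = [(self.name, f) for f in self.files if keyword in f]
            for n in self.neighbors:
                if n is not came_from:
                    # back-propagation: hits return along the path the query took
                    hits += n.query(qid, keyword, ttl - 1, came_from=self)
            return hits

    a, b, c = Node("MIT"), Node("Chicago"), Node("UBC")
    a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
    c.files.add("Beatles - Yellow Submarine.mp3")
    print(a.query(qid=1, keyword="Beatles", ttl=7))
    # -> [('UBC', 'Beatles - Yellow Submarine.mp3')]; the download itself is a
    #    direct HTTP fetch from UBC, outside the overlay.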
Gnutella: Summary
Gnutella: self-organizing, large-scale, P2P
application based on a hierarchical overlay
network. It works!
Growth hindered by inefficient resource use.
P2P: Summary
Huge impact
Many architecture styles -- pros and cons for each:
centralized, flooding, swarming, …
Lessons learned:
Single points of failure are bad
Flooding messages to everyone does not scale
Underlying network topology is important
Not all nodes are equal
Need incentives to discourage freeloading
Privacy and security are important
…
References
Eng Keong Lua, Jon Crowcroft, Marcelo Pias, Ravi Sharma, and Steven Lim, "A Survey and Comparison of Peer-to-Peer Overlay Network Schemes", IEEE Communications Surveys & Tutorials, 7(2): 22-73, April 2005.
An excellent survey of modern peer-to-peer systems, covering structured as well as unstructured networks.
A good introduction for those who want to get deeper into the subject but do not really know where to start.