Transcript ppt

CS 34701: Large-Scale
Networked Systems
Professor: Ian Foster
TA: Adriana Iamnitchi
http://dsl.cs.uchicago.edu/Courses/cs347-2002/
CS 34701 Course Goals

Primary
– Gain deep understanding of fundamental
issues that effect design of large-scale
networked systems
– Map primary contemporary research themes
– Gain experience in network research

Secondary
– By studying a set of outstanding papers,
build knowledge of how to present research
– Learn how to read papers & evaluate ideas
How the Class Works

Research papers
– Prior to each class, we all read and evaluate
two research papers
– During each class, we discuss those papers

Project
– One-page project description by 2nd week
– Five-page project summary by 5th week
– 10-20 final paper by 9th week
– Project presentations: 9th and 10th weeks.
Paper Review & Discussion


Everyone reads two papers per class and
submits an evaluation (see below)
We discuss (not present) papers in class
– A team of 2-3 leads each discussion
– The leading team submits discussion plan
before class, submits “master critique” and
summarizes discussion at the beginning of
following class

Look over schedule between now & Friday,
when we will allocate discussants
Evaluations

You must submit evaluations of papers
– Email them by 6pm the day before

Answer a set of standard questions
1. State the main contribution of the paper
2. Critique the main contribution
3. What are the three strongest and/or most
interesting ideas in the paper?
4. Three most striking weaknesses in the paper?
5. Three questions to ask the authors?
6. Detail an interesting extension to the work
not mentioned in the future work section.
7. Optional comments on the paper that you’d
like to see discussed in class.
What I’ll Assume You Know

Basic Internet architecture
– IP, TCP, DNS, HTTP

Basic principles of distributed computing
– Asynchrony (cannot distinguish between
communication failures and latency)
– Partial global state knowledge (cannot know
everything correctly)
– Failures happen. In very large systems,
even rare failures happen often

If there are things that don’t make sense,
ask!
Large-Scale Networked Systems


Internet-connected networks with a large
number of components, spanning multiple
DNS domains (usually WAN)
Designed to solve specific problems:
– Content distribution
– Cycle sharing
– File sharing
– Sensor data fusion
– Distributed data analysis
–…
Example: Gnutella

Peer-to-peer file sharing system
– File sharing: goal is to enable publication
and access to files
– P2P: no central servers; all clients also act
as servers and are equivalent (more or less)

Issues
– Scaling to very large numbers of nodes
– Properties: bootstrapping, reliability, cost,
anonymity, security, freeloading, …
Gnutella Protocol Overview

P2P file sharing application on top of an
overlay network:
– Nodes maintain open TCP connections.
– Messages are broadcasted (flooded) or
back-propagated.

Protocol:
Membership
Query
File download
Broadcast
(Flooding)
Backpropagated
PING
PONG
QUERY
QUERY HIT
Node to node
GET, PUSH
Gnutella search mechanism
Steps:
1. Node 2 initiates search for file A
7
1
A
4
2
6
3
5
Gnutella search mechanism
A
Steps:
1. Node 2 initiates search for file A
2. Sends message to all neighbors
7
1
4
2
3
A
6
A
5
Gnutella search mechanism
A
A
Steps:
1. Node 2 initiates search for file A
2. Sends message to all neighbors
3. Neighbors forward message
7
1
4
2
6
3
A
5
A
Gnutella search mechanism
A:7
A
7
1
4
2
6
3
A:5
A
5
A
Steps:
1. Node 2 initiates search for A
2. Sends message to all neighbors
3. Neighbors forward message
4. Nodes that have file A initiate a
reply message
Gnutella search mechanism
7
1
4
2
3
A:7
A:5
A 6
A
5
Steps:
1. Node 2 initiates search for A
2. Sends message to all neighbors
3. Neighbors forward message
4. Nodes that have file A initiate
a reply message
5. Query reply message is backpropagated
Gnutella search mechanism
7
1
A:7
2
4
A:5
6
3
5
Steps:
1. Node 2 initiates search for A
2. Sends message to all neighbors
3. Neighbors forward message
4. Nodes that have file A initiate
a reply message
5. Query reply message is backpropagated
6. Node 2 gets replies
Gnutella search mechanism
download A
1
7
4
2
6
3
5
Steps:
1. Node 2 initiates search for A
2. Sends message to all neighbors
3. Neighbors forward message
4. Nodes that have file A initiate
a reply message
5. Query reply message is backpropagated
6. Node 2 gets replies
7. File download
Tools for network exploration


Eavesdropper - modified node inserted into
the network to log traffic.
Crawler - connects to all active nodes and uses
the membership protocol to discover graph
topology.


Parallel crawling.
Graph analysis tools

high-volume offline
computations.
Network growth
Number of nodes in the largest
network component ('000)
50
Gnutella Network Growth
.
40
 High user interest:

30
Users tolerate high latency,
low quality results.
 Better resources:
20
10

05/12/01
05/16/01
05/22/01
05/24/01
05/29/01
02/27/01
03/01/01
03/05/01
03/09/01
03/13/01
03/16/01
03/19/01
03/22/01
03/24/01
11/20/00
11/21/00
11/25/00
11/28/00
-
DSL and cable modem
nodes grew from 24% to
41% over 6 months.
 Open architecture / open-source environment:

Competing implementations,
 Lower overhead network traffic, improved
resource utilization, better structure,
 Recently, two-level structure.
Growth invariants
Percent of node pairs (%)
1. Graph connectivity: 3.4 links per node on average.
2. Path length distribution: node-to-node distance
maintains similar distributions.
50%
40%
30%
20%
10%
0%
1
2
3
4
5 6 7 8 9 10 11 12
Node-to-node shortest path (hops)
 Avg. node-to-node
distance grew 25%
while the network
grew 50 times over 6
months.
 Random graph theory
predicts about 75%
increase.
Is Gnutella a power-law network?
Power-law networks: the number of nodes N with
exactly L links is proportional to L-k
N ~ L-k
Num. of nodes (log scale)
10000
Examples:
 The Internet,
 In/out links to/from
HTML pages,
 Citations network,
 US power grid,
 Social networks.
November 2000
1000
100
10
1
1
10
100
Number of links (log scale)
Implication: High tolerance to random node failure but low
reliability when facing an ‘intelligent’ adversary
Is Gnutella a power-law network?
 Later, larger networks display a bimodal distribution.
 Implications:

High tolerance to random node failures preserved
Increased reliability
10000
when facing an
attack.
1000
Number of nodes
(log scale)

May 2001
100
10
1
1
10
100
Number of links (log scale)
Traffic analysis
Message Frequency
25
.
20
Ping
Push
Query
Other
15
10
5
353
321
289
257
225
193
161
129
97
65
33
-
1
messages per secod
  6-8 kbps per link over any connection.
 Traffic structure changed over time.
minute
Total generated traffic
1Gbps (or 330TB/month)!
– Note that this estimate excludes actual file transfers
– Q: Does it matter?
– Compare to 15,000TB/month estimated in US Internet
backbone (Dec. 2000).
Reasoning:





and PING messages are flooded. They form more than 90% of
generated traffic
predominant TTL=7
>95% of nodes are less than 7 hops away
measured traffic at each link about 6 to 8kbs
network with 50k nodes and 170k links
QUERY
Topology mismatch
The overlay network topology doesn’t match
the underlying Internet infrastructure
topology!




40% of all nodes are in the 10 largest Autonomous
Systems (AS).
Only 2-4% of all TCP connections link nodes within
the same AS.
Largely ‘random wiring’.
Entropy experiment gives similar results.
Course Topics

Internet Architecture and Design Principles

Flat Pricing vs. Prioritized Traffic

Internet Measurements

Availability in Wide-Area

Patterns in Real Networks

Modeling the Internet Topology

Internet Services: DNS

Web Caching, Content Distribution Networks

Overlay Networks

Peer-to-Peer systems

Computational Grids

Security Issues

Sensor Nets

Wireless Networks

XML SOAP and Web Services
Course Topics

Internet Design Principles
– How do I deliver Internet
services: end-to-end vs.
within the network?

Flat Pricing vs. Prioritized
Traffic
– How do I determine
which traffic to pass over
the Internet?

Internet Measurements
– What does the Internet
really look like?
Course Topics

Availability in Wide-Area
– How reliable is the Internet?

Patterns in Real Networks
– What does Internet traffic
look like?

Modeling the Internet
Topology
– How can I construct realistic
models of Internet
structure?
Course Topics

Internet Services: DNS
– How well does DNS
work?

Web Caching, Content
Distribution Networks
– How do we optimize
Web content mgmt?

Overlay Networks
– Improving routing
performance
Course Topics

Peer-to-Peer systems
– Gnutella, etc., etc.

Computational Grids
– Globus, etc.

Security Issues
– Authorization, etc.
Circulatory Net
Course Topics

Sensor Nets
– How do I structure &
program networks of
lightweight devices?

Wireless Networks
– How do I route in ad
hoc networks?

XML SOAP and Web
Services
– What are Web
services anyway?
Disaster Response
Projects




Literature surveys, real implementations,
analytical evaluations
Can be performed individually or in a team
of two
Your project ideas appreciated (to be
discussed before proposal due date)
Primary goal is to do something interesting
and to do it well
Example Project

Gnutella network analysis
– Develop a “crawler” that
traverses network,
collects membership &
connectivity info
– Analyze structure
– Characterize structure

See, e.g.:
– Mapping the Gnutella Network: Properties of LargeScale Peer-to-Peer Systems and Implications for
System Design, M. Ripeanu, I. Foster, A. Iamnitchi,
in IEEE Internet Computing Journal, vol. 6(1), 2002
Project Ideas


http://dsl.cs.uchicago.edu/Courses/cs3472002/cs347_projects.htm
Gnutella network measurements
– Topology discovery for 500K nodes
– Structural analysis with 500K nodes
– Study impact of overlay networks
– Etc.
Project Ideas

Overlay networks: build unstructured or
semistructured self-organizing overlays optimizing
different cost functions:
– Topology-aware: map onto physical infrastructure
– Usage-aware: map onto usage patterns

Analysis of Sloan Digital Sky Survey logs to
explore access patterns
– What files are accessed how often
– What community usage patterns emerge?
– How can we exploit these in content distribution
networks?
Project Ideas

Compare qualitatively and analytically current filelocation solutions (CAN, Chord, Gnutella, Napster,
etc.) in the context of scientific file-sharing
collaborations.
– Evaluate sharing patterns based on real usage traces
in a scientific collaboration
– Use these patterns to evaluate benefits/drawbacks
and propose better alternatives

Expand existing simulator to evaluate request
forwarding techniques for resource location in grid
environments
For More Information

Contact me
– Ian Foster, [email protected]
– Email or set up a meeting

Contact Anda, our TA
– Adriana Iamnitchi, [email protected]

Monitor the class web page
– http://dsl.cs.uchicago.edu/Courses/cs347-2002/
Next 2 Classes

Friday:
– Discuss:
> J. Saltzer, D. Reed, and D. Clark, End-to-end Arguments in
System Design. ACM Transactions on Computer Systems,
Vol. 2, No. 4, pp. 195-206, 1984.
> D. Clark and M. Blumenthal, Rethinking the design of the
Internet: The end to end arguments vs. the brave new
world, Workshop on Policy Implications of End-to-End.
December 1, 2001.
– Leading group: Ian + 2 volunteers (who?)

Wednesday:
– Leading Group: Anda + 1-2 volunteers (who?)