eccs07_willinger_ldl_01

Download Report

Transcript eccs07_willinger_ldl_01

“Lies, Damned Lies, and Statistics:”
A Critical Assessment of
Preferential Attachment-type Network Models
of the Internet
Walter Willinger
AT&T Labs-Research
[email protected]
Acknowledgments
Main Collaborators
• John Doyle (Caltech)
• David Alderson (Caltech)
• Lun Li (Caltech)
Contributions
• Matt Roughan (U. Adelaide, Australia)
• Steven Low (Caltech)
• Ramesh Govindan (USC)
• Reiko Tanaka (RIKEN, Japan)
• Stanislav Shalunov (Abilene)
• Heather Sherman (CENIC)
2
Recap: What “Network Science” says about the Internet
• Measurements
– Large datasets of Internet-related connectivity
measurements are available and can be downloaded
– Diverse data sources (traceroute, BGP-based,
WHOIS, special-purpose crawlers)
• Inference
– (Exclusive) focus on node degree distribution
– Inferred node degree distributions follow a power law
• Modeling
– Preferential attachment-type growth model
•Incremental growth
•Preferential attachment: p(k)  degree of node k
– There exist many variants of this basic PA model
3
Recap: What “Network Science” says about the Internet (cont)
• Key features of PA-type models
– Randomness enters via attachment mechanism
– Exhibit power law node degree distributions with or
without exponential cutoffs
• Model validation
– The model “fits the data …”
– Reproduces observed node degree distribution
• Highly publicized claims about Internet topology
– High-degree nodes form a hub-like core
– Fragile/vulnerable to targeted node removal
– Achilles’ heel
– Zero epidemic threshold
4
Basic Question
Do the available Internet-related connectivity
measurements and their analysis support the sort
of claims that can be found in the existing complex
networks literature?
Key Issues
•What about data hygiene?
•What about statistical rigor?
•What about model validation?
5
The Internet: The User Perspective
web
server
my
computer
router
router
6
The Internet: The Engineering Perspective
web
server
my
computer
router
router
HTTP
TCP
IP
LINK
7
The Internet is a LAYERED Network
my
computer
The perception of the Internet as a simple, userfriendly, and robust system is router
enabled by
router
FEEDBACK and other CONTROLS that operate
both WITHIN LAYERS and ACROSS LAYERS.
web
server
HTTP
TCP
IP
packet
packet
packet
packet
packet
packet
These ARCHITECTURAL DETAILS
(protocols, interfaces, etc.) are MOST
ESSENTIAL to the nature of the Internet.
LINK
8
Internet Architecture: Vertical Decomposition
web
server
my
computer
router
Vertical decomposition
Protocol Stack
router
Benefits:
HTTP
• Each layer can evolve independently
TCP
• Substitutes,
complements
Requirements:
IP
1. Each layer follows the rules
2. Every other layer does “good
enough” with its implementation
LINK
9
Internet Architecture: Horizontal Decomposition
web
server
my
computer
router
router
Benefit: Individual components can fail
(provided that they “fail off”) without
HTTP
disrupting the network.
TCP
IP
Horizontal decomposition
Each level is decentralized and asynchronous
LINK
10
Internet Architecture: Hourglass Design
Applications
Web
FTP
Mail
News
Video
Audio
ping
napster
Transport protocols
TCP SCTP UDP
ICMP
IP
Ethernet 802.11
Power lines ATM
Optical
Satellite Bluetooth
Link technologies
11
Hourglass Design: Internet as Critical Infrastructure
Applications
Web
FTP
Mail
TCP
News
Video
Audio
ping
napster
Everything
Transport protocols
on IP
SCTP
UDP
ICMP
IP
Ethernet 802.11
IP on
Power lines ATM Optical
everything
Satellite Bluetooth
Link technologies
12
Courtesy Hari Balakrishnan
Internet Connectivity/Topology
WWW, Email, Napster, FTP, …
Applications
TCP
IP
Transmission
Ethernet, ATM, POS, WDM, …
• Consider a (vertical) layer of the Internet hourglass
• Expand it horizontally
• Give layer-specific meaning to “nodes” and “links” 13
“nodes”
“links”
14
The Many Facets of Internet Connectivity/Topology
Applications
TCP
IP
Transmission




Web graph
Email graph
P2P graph
and many others …
 Autonomous System (AS) or
AS-level ecosystem
 IP-level connectivity (i.e.,
layer 3)
 Router-level connectivity
(i.e., layer 2)
15
Internet Connectivity/Topology
virtual
WWW, Email, Napster, FTP, …
dynamic
Applications
TCP
IP
Transmission
physical
Ethernet, ATM, POS, WDM, …
static
16
On Measuring Internet Connectivity
• No central agency/repository
• Economic incentive for ISPs to obscure network
structure
• Direct inspection is typically not possible
• Based on measurement experiments, hacks
• Mismatch between what we want to measure and can
measure
• Specific examples covered in this talk
– Physical connectivity (ISP router-level topology)
– Logical connectivity (Internet AS-level topology)
17
Back to our Basic Question
Do the available Internet-related connectivity
measurements and their analysis support the sort
of claims that can be found in the existing complex
networks literature?
Key Issues
•What about data hygiene?
•What about statistical rigor?
•What about model validation?
18
On Data Hygiene
On Measuring the Internet’s Router-level Topology
• traceroute tool
– Discovers compliant (i.e., IP) routers along path
between selected network host computers
• Large-scale traceroute experiments
– Pansiot and Grad (router-level map from around
1995)
– Cheswick and Burch (mapping project 1997--)
– Mercator (router-level maps from around 1999 by R.
Govindan and H. Tangmunarunkit)
– Skitter (ongoing mapping project by CAIDA folks)
– Rocketfuel (state-of-the-art router-level maps of
individual ISPs by UW folks)
– Dimes (EU project)
20
http://research.lumeta.com/ches/map/
21
http://www.isi.edu/scan/mercator/mercator.html
22
http://www.caida.org/tools/measurement/skitter/
23
http://www.cs.washington.edu/research/networking/rocketfuel/bb
24
http://www.cs.washington.edu/research/networking/rocketfuel/
25
HOWEVER: Problems with existing measurements
• traceroute-based measurements are ambiguous
– traceroute is strictly about IP-level connectivity
– traceroute cannot distinguish between high connectivity
nodes that are real (i.e., represent a physical device
such as a router or switch) and that are not real (i.e.,
represent entire opaque Layer-2 structures such as
national/global ATM or MPLS networks)
26
The Internet: The Engineering Perspective
web
server
my
computer
router
router
HTTP
TCP
IP
LINK
27
Illusion of a fully-meshed
Network due to use of MPLS
http://www.cs.washington.edu/research/networking/rocketfuel/
28
 www.savvis.net
 managed IP and
hosting company
 founded 1995
 offering “private IP
with ATM at core”
This “node” is an
entire network!
(not just a router)
http://www.caida.org/tools/measurement/skitter/
29
HOWEVER: Problems with existing measurements
• traceroute-based measurements are ambiguous
– traceroute is strictly about IP-level connectivity
– traceroute cannot distinguish between high connectivity
nodes that are real (i.e., represent a physical device
such as a router or switch) and that are not real (i.e.,
represent entire opaque Layer-2 structures such as
national/global ATM or MPLS networks)
• The irony inherent in traceroute-based measurements
– The high-degree nodes in the middle of the network that
traceroute reveals are not real ….
– The high-degree nodes that are real but can only exist at
the edge of the network will never be revealed by
traceroute …
30
HOWEVER: Additional problems with existing measurements
• traceroute-based measurements are inaccurate
– “Alias resolution” problem
– Which IP addresses/interface cards refer to the same
router?
– Only heuristic solutions exist
• traceroute-based measurements are incomplete/biased
– IP-level connectivity is more easily/accurately inferred the
closer the routers are to the traceroute source(s)
– Node degree distribution is inferred to be of the power-law
type even when the actual distribution is not
– traceroute-bias has been extensively studied but is the
least relevant problem with traceroute-based
measurements ….
31
On Statistical Rigor
How to lie with statistics …
Given: Samples from an exponential
distribution
Want: Claim power law behavior
Recipe: Use size-frequency plots!
10 2
Freq.
Given: Samples from a Pareto
distribution with =1.0
Want: Claim power law with =1.5
Recipe: Use size-frequency plots!
3
10
Ys
Ye
2
10
Freq.
10 1
1
10
α+1 = 1.5
10 0
0
2
10
yk
Size
3
10 0
1
2
10
Size
3
4
33
Size-Frequency vs. Size-Rank Plots
or
Non-cumulative vs. Cumulative
3
3
10
10
2
2
10
10
α = 0.5
1
1
10
10
α+1 = 1.5
α = 1.0
0
10 0
10
1
10
2
10
3
10
0
4 10 0
10
10
1
10
2
10
3
10
4
10
34
6
Noncumulative Size-Frequency
5
raw MERCATOR data
10
Binned Size-Frequency
5
10
10
a common
reporting
technique
4
10
4
10
3
10
3
10
2
10
2
10
1
10
1
10
0
10 0
10
6
10
5
10
0
1
2
10
3
10
4
10
10
Size-Rank (log-log scale)
1
10
10
Size-Rank (log-linear scale)
6
without 2 largest nodes
3
2
10
10
without 2 largest nodes
5
10
4
4
10
10
3
exponential
in tail…
3
10
10
2
2
10
10
1
1
10
10
0
10 0
10
10 0
10
1
10
2
10
3
10
0
10
0
50
100
150
200
250
300
35
350
On Model Validation
Taking Model validation more serious …
• Mathematical Modeling 101
– For one and the same observed phenomenon, there are
usually many different explanations/models
– All models are wrong, but some are “damned lies”
• Model validation ≠ data fitting
– The ability to reproduce a few graph statistics does not
constitute “serious” model validation
– Which of the observed properties does a proposed
model have to satisfy before it is deemed “valid”?
• What constitutes “serious” model validation?
– There is more to networks than connectivity
– Meaning of “node” and “link”
37
Cisco 12000 Series Routers
• Modular in design, creating flexibility in configuration.
• Router capacity is constrained by the number and speed of line
cards inserted in each slot.
Chassis
Rack size
Slots
Switching
Capacity
12416
Full
16
320 Gbps
12410
1/2
10
200 Gbps
12406
1/4
6
120 Gbps
12404
1/8
4
80 Gbps
Source: www.cisco.com
38
Router Technology Constraint
10
Cisco 12416 GSR, circa 2002
3
high BW
low degree
Bandwidth (Gbps)
Total Bandwidth
10
10
high degree
low BW
2
1
15 x 1-port 10 GE
10
0
15 x 3-port 1 GE
15 x 4-port OC12
15 x 8-port FE
Technology constraint
10
-1
10
0
10
1
Degree
10
2
39
Back to the Basic Question:
Do the available Internet-related connectivity
measurements and their analysis support the sort
of claims that can be found in the existing complex
networks literature?
Short Answer:
No!
40
Network Science and the Internet: “Lies, damned lies and statistics”
• Power-law (scale-free) node degree distribution
– Example of “how to lie with statistics”
• Preferential attachment-type models
– (White) lies …
• Highly popularized claims (e.g., Achilles’ heel,
fragile/vulnerable to targeted node removal, zero
epidemic threshold)
– Damned lies …
– These claims are not “controversial” – they are
simply wrong!
• Bad analysis of bad data = bad models (“damned lies”)
“Bad [models] are potentially important: they can be
used to stir up public outrage or fear; they can
distort our understanding of our world; and they can
lead us to make poor policy choices.” (J. Best)
41
The “Math” Perspective of the Internet
• Assumption
– Node degree distributions follow a power-law
• Rigorous model definition/formulation
– Preferential attachment-type models
• Rigorous proofs
– Achilles’ heel
– Fragile/vulnerable to targeted node removal
– Zero epidemic threshold
• End result is the same
– Rigorous analysis of bad model = “damned lies”
42
How to avoid such Fallacies?
• Taking model validation more serious
• Applying an engineering perspective to engineered systems
43
Internet Modeling: An Engineering Perspective
• ISPs design their router-level topology for a purpose,
namely to carry an expected traffic demand
• Surely, the way an ISP designs its physical infrastructure
is not the result of a series of coin tosses …
• Randomness enters in terms of uncertainty in traffic
demands
• ISPs are constrained in what they can afford to build,
operate, and maintain (economics).
• The “nodes” and “links” are physical things that have
hard constraints (technology).
• Decisions of ISPs are driven by objectives (performance)
and reflect tradeoffs between what is feasible and what is
desirable (heuristic optimization)
• Power laws: Full of sound and fury, signifying nothing!
44
Heuristically Optimized Topologies (HOT)
Given realistic technology constraints on routers, how well is
the network able to carry traffic?
Step 1: Constrain to
be feasible
Step 2: pick traffic demand model
Bj
Bandwidth (Mbps)
1000000
xij  Bi B j
xij
100000
10000
1000
100
Bi
Abstracted
Technologically
Feasible Region
Step 3: Compute max flow
max  xij  max  Bi B j

10
degree
1
10
100
1000
i, j
s.t.
i, j
x
i , j:krij
ij
 Bk , k
45
Preferential Attachment
102
node rank
HOT model
101
100
101
node degree
46
HOT-type Network Models
• Very recent alternatives to PA-type models
– Extremely unlikely to occur at random
• Key features of HOT models
– Consistent with existing ISP router-level topologies
– Consistent with existing technologies
– Consistent with (complementary) measurements
– Node degree distribution is a non-issue
• Same story for AS-level Internet topology
47
Some Implications of this Engineering Perspective
• New paradigm for network modeling
– Network modeling ≠ data fitting
– Node degree distribution is a non-issue
• Constrained optimization formulation
– Optimization of tradeoffs between multiple functional
objectives of networks
– Subject to constraints on their components
– With an explicit source of uncertainty (in the
environment) against which solutions must be
tolerant or robust
48
Further Implications of this Engineering Perspective
• Dynamics of graphs
– Evolution of connectivity structures
– Evolution of (internal) node/link structure
• Dynamics over graphs
– Traffic dynamics (bytes, packets, flows, …)
• Challenging feedback problem
– Traffic dynamics/routing impacts network structure
– Network structure impacts traffic dynamics/routing
• Can’t (shouldn’t) model connectivity without traffic
49
Further Implications of this Engineering Perspective (cont.)
• Internet-related robustness/fragility considerations
– Cannot be studied in the void
– Can only be addressed in the context of the broader
system, i.e., protocol stack
• Fact 1: Internet is extremely robust to attacks on its
physical infrastructure (routers, switches, links)
– Top design objective of original Internet architecture
– IP routing “sees damage and routes around it”
• Fact 2: Internet is extremely fragile to the malicious
exploitation or hijacking of the very mechanisms that
confer much of its robustness (i.e., protocols).
– Original Internet architecture relied on trust
– Nightmare scenario: Attackers hijacking DNS or BGP
50
Further Implications of this Engineering Perspective (cont.)
• Key question #1: What is the network as whole trying to
achieve?
– Internet router-level: see earlier
– Internet AS-level: ?
– WWW, P2P: ??
– Social Networks: ???
• Key question #2: How is the network trying to achieve its
objective?
- Decentralized, distributed
- Duality gap (“price of anarchy”)
51
A Reminder
• Past: Modeling in the presence of high-quality data
– “All models are wrong … but some are useful” (G.E.P. Box)
• Future: Modeling in the presence of highly ambiguous data
– Take the ambiguities in the data into account
– “When exactitude is elusive, it is better to be approximately
right than certifiably wrong.” (B.B. Mandelbrot)
52
http://hot.caltech.edu/topology.html
• L. Li, D. Alderson, J.C. Doyle, W. Willinger. Toward a Theory of Scale-Free
Networks: Definition, Properties, and Implications. Internet Mathematics
2(4), 2006.
• D. Alderson, L. Li, W. Willinger, J.C. Doyle. Understanding Internet Topology:
Principles, Models, and Validation. ACM/IEEE Trans. on Networking 13(6),
2005.
• J.C. Doyle, D. Alderson, L. Li, S. Low, M. Roughan, S. Shalunov, R. Tanaka,
and W. Willinger. The "robust yet fragile" nature of the Internet. PNAS
102(41), 2005.
• D. Alderson and W. Willinger. A contrasting look at self-organization in the
Internet and next-generation communication networks. IEEE Comm.
Magazine. July 2005.
• W. Willinger, D Alderson, J.C. Doyle, and L. Li, More “normal” than Normal:
scaling distributions in complex systems. Proc. Winter Simulation Conf.
2004.
• W. Willinger, D Alderson, and L. Li, A pragmatic approach to dealing with
high-variability in network measurements, Proc. ACM SIGCOMM IMC
2004.
• L. Li, D. Alderson, W. Willinger, and J. Doyle, A first-principles approach to
understanding the Internet’s router-level topology, Proc. ACM SIGCOMM
2004.
53