Towards Wireless Overlay Network Architectures


Berkeley-Helsinki Summer Course
Lecture #7: Network Measurement and Monitoring
Randy H. Katz
Computer Science Division
Electrical Engineering and Computer Science Department
University of California
Berkeley, CA 94720-1776
Outline
• Web Traffic Measurement
• Multi-layer Tracing and Analysis
• Network Distance Mapping
• SLA Verification
• Service Management
Measuring/Characterizing Web Traffic
• Motivation for Measurement
– Insights into Web site design
– Managing Proxies and Servers
– Operating IP Networks
• Measurement Process
– Monitoring from some network location
– Generate measurement records in some format
– Preprocessing for subsequent analysis
• Based on Chapter 9, “Web Traffic
Measurement,” in Web Protocols and Practice,
Krishnamurthy and Rexford, Addison Wesley,
Reading, MA, 2001.
Web Measurement
• Content Creators
– Measurements of user browsing patterns
» Number of visitors and site stickiness influence advertising revenue
» Optimize for common user sequences
» User-perceived latency influences server and placement decisions
• Web Hosting Company
– Number of response messages/bytes served influences load balancing strategy among multiple hosted sites
» Mix of busy day sites/busy night sites
» Managing persistent connections
– Resource usage influences billing
» When to introduce more servers, better connectivity
Web Measurement
• Network Operators
– Resource decisions: where to add bandwidth, when to
upgrade links, where to place proxies, caches, how to
modify routing within the provider cloud, etc.
– User community: relative mix of clients with low vs. high
bandwidth connectivity
• Web/Networking Researchers
– Evaluating performance of protocols and software
– Drive evolution of protocols, policies, algorithms
– Better understanding of Internet traffic dynamics
Measurement Techniques
• Server Logging
– Log entry per HTTP request
– Requesting client
» Could be a user, a proxy, or a cache—the latter two
represent aggregated patterns
» Identified by an IP address
• Could represent the workload of multiple users
• Dynamically assigned addresses do not consistently identify the same user each time they are encountered
– Request time
– Request/response messages
– Coarse-grained, aggregated times
– NOTE: requests satisfied by proxies/caches are filtered out before reaching the server
– Hard to obtain!
Measurement Techniques
• Proxy Logging
– Proxies can be associated with clients or servers, e.g.,
proxy for UC Berkeley vs. proxy for Google
– Former provides insights into client behavior aggregated
by administrative domain; more detailed information about
individual clients may be available
– Degree of aggregation depends on how close proxy is to
clients (close implies small community, far implies large
community)
– Limited scope, accesses filtered by browser caches
– Hard to obtain!
Measurement Techniques
• Packet Monitoring
– Network level logging (HTTP, IP, TCP)
– Fine grained time stamping possible
– Some requests are satisfied from client caches and never reach the network; encrypted packets make collection difficult
– Monitor needs to be placed where it can eavesdrop on the packets
Measurement Techniques
• Active Measurement
– Generate requests in a controlled manner, observe their
performance
– Issues:
» Where to locate the modified user agents: geographical placement, quality of connectivity to the wide-area network
» What requests to generate, e.g., based on a profile of popular web sites
» What measurements to collect: DNS queries, TCP timeouts, and proxy interception make it difficult to distinguish sources of latency
Inferences from Measurement Data
• Limitation of HTTP Header Information
– Incomplete header logging
– Heuristics needed to reconstruct behavior from log
• Ambiguous Client/Server Identity
– Client identity: an IP address does not uniquely identify a client
– Many IP addresses may be associated with the same server
• Inferring User Actions
– Difficult to correlate user-level actions like mouse clicks with observed network activity
– One click can trigger many HTTP requests
• Detecting Resource Modifications
– Web-level measurements typically miss resource modifications
– Incomplete use of Last-Modified and Date fields by servers
Web Workload Characterization
• Applications of Workload Models
– Identifying performance problems
» High latency/low throughput under specific load scenarios
– Benchmarking Web components
» Selecting among competing architectures
– Capacity planning
» “Right sizing” net b/w, CPU, disk, memory given
expected loads
• Workload Parameters
– Protocols: Request method/Response code
– Resources: Content type, Resource size, Response size,
Popularity, Modification frequency, Temporal locality,
Number of embedded resources
– Users: Session interarrival times, Number of clicks per
session, Request interarrival times
Workload Characteristics
• HTTP Requests/Responses
– GET method predominates, with a small number of POSTs (forms); most responses are OK (200)
– More intelligent protocols for communicating with caches may
change distribution of requests (e.g., HEAD)
• Web Resources
– Text and images dominate, increasing audio/video content
– Small resources dominate: average HTML file size is 4-8 KB and average image size about 14 KB; wide variation around the mean suggests a Pareto (“heavy-tailed”) distribution
– Higher b/w connections imply larger web objects over time
• Response Sizes
– Users are likely to abort large transfers, so the median response size is smaller than the median resource size; very heavy tail
– Effect of higher b/w connections on response size?
Workload Characteristics
• Resource Popularity
– Zipf’s Law: a small number of objects are highly popular
– Implies caching is effective at all levels (client browser cache, site proxy cache, even DNS name cache)
• Resource Changes
– Static content vs. script-based descriptions
– Periodic changes (“young die young”)
• Temporal Locality
– Correlated access to resources in time
• Embedded Resources
– Web pages have median of 8-20 embedded resources,
heavy tailed distribution
Workload Characteristics
• User Behavior
– Session and request arrivals
» Infer session via repeated access to same server
» Burst of HTTP requests, think time
– Clicks per session
» 4-10 clicks on average; distinguish between “sticky”
sites and directory/redirection sites
» Heavy user vs. light user
– Request interarrival times
» Activity punctuated with think times
» Request interarrival times on the order of 60 seconds
Research Perspectives on Measurement
• Packet monitoring of HTTP traffic
• Analyzing Web server logs
• Publicly available logs and traces
• Measuring multimedia streams
Packet Monitoring of HTTP Traffic
• Tapping a link carrying IP packets
• Capturing packets from HTTP transfers
• Demultiplexing packets into TCP connections
• Reconstructing the ordered stream of bytes
• Extracting HTTP messages from the byte stream
• Generating a log of HTTP messages
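The reconstruction pipeline above can be sketched in Python. This is a minimal illustration under stated assumptions: packets are already captured and decoded into a hypothetical `Packet` record, each direction of a connection is treated as its own byte stream, sequence-number wraparound and SYN/FIN handling are ignored, and request extraction assumes body-less requests such as typical GETs.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """A captured TCP segment (hypothetical, simplified record format)."""
    ts: float       # capture timestamp in seconds
    src: str        # source IP address
    sport: int      # source port
    dst: str        # destination IP address
    dport: int      # destination port
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes  # TCP payload


ConnKey = Tuple[str, int, str, int]


def demux(packets: List[Packet]) -> Dict[ConnKey, List[Packet]]:
    """Group packets by direction-specific 4-tuple (one byte stream per direction)."""
    streams: Dict[ConnKey, List[Packet]] = defaultdict(list)
    for p in packets:
        streams[(p.src, p.sport, p.dst, p.dport)].append(p)
    return streams


def reassemble(segments: List[Packet]) -> bytes:
    """Order segments by sequence number, dropping retransmitted bytes."""
    data = bytearray()
    next_seq = None
    for p in sorted(segments, key=lambda s: s.seq):
        if next_seq is None:
            next_seq = p.seq
        if p.seq + len(p.payload) <= next_seq:
            continue                             # retransmission: bytes already present
        if p.seq > next_seq:
            break                                # gap: a segment was not captured
        data += p.payload[next_seq - p.seq:]     # skip any overlapping prefix
        next_seq = p.seq + len(p.payload)
    return bytes(data)


def http_request_lines(stream: bytes) -> List[str]:
    """Extract request lines such as 'GET /index.html HTTP/1.1' from a client stream."""
    lines = []
    for message in stream.split(b"\r\n\r\n"):
        first = message.split(b"\r\n", 1)[0]
        if first.split(b" ", 1)[0] in (b"GET", b"POST", b"HEAD"):
            lines.append(first.decode("ascii", "replace"))
    return lines
```

Each extracted request line, together with the packet timestamps, is the raw material for the final step of generating a log of HTTP messages.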
Analyzing Web Server Logs
• Parsing and Filtering
– Logs in multiple formats
– Interleaved log records
– Timestamp diversity
• Transforming
– Remove erroneous records
– Normalize diverse URL formats and convert URLs to unique integers for easier processing
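A minimal sketch of the parsing, filtering, and transforming steps, assuming logs in the NCSA Common Log Format (real servers use many other formats). Malformed lines are dropped, and each distinct URL is mapped to a small integer; the URL normalization policy in `normalize_url` is a hypothetical example.

```python
import re
from datetime import datetime
from typing import Dict, List, Optional, Tuple

# NCSA Common Log Format: host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[Tuple[str, datetime, str, str, int, int]]:
    """Return (host, time, method, url, status, size), or None for a malformed record."""
    m = CLF.match(line)
    if m is None:
        return None
    try:
        ts = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None                       # unparseable timestamp: treat as erroneous
    size = 0 if m["size"] == "-" else int(m["size"])
    return m["host"], ts, m["method"], m["url"], int(m["status"]), size


def normalize_url(url: str) -> str:
    """Canonicalize URL spellings (hypothetical policy): drop fragments and
    reduce absolute-form URLs, e.g. from proxies, to their path."""
    url = url.split("#", 1)[0]
    if url.startswith("http://"):
        _host, _sep, path = url[len("http://"):].partition("/")
        url = "/" + path
    return url


def load(lines: List[str]) -> Tuple[List[tuple], Dict[str, int]]:
    """Filter erroneous records and replace each distinct URL by an integer id."""
    url_ids: Dict[str, int] = {}
    records = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        host, ts, method, url, status, size = rec
        uid = url_ids.setdefault(normalize_url(url), len(url_ids))
        records.append((host, ts, method, uid, status, size))
    return records, url_ids
```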
Publicly Available Logs and Traces
• Internet Traffic Archive
– http://www.acm.org/sigcomm/ita
• World Wide Web Consortium’s Web
Characterization Group Repository
– http://www.purl.org/net/repository
• NLANR
– http://ircache.nlanr.net/Cache/
• CAnet Squid logs
– http://ardnoc41.canet2.net/cache/
Measuring Multimedia Streams
• Static analysis of multimedia resources
– Locating video content at various web sites
– Acquiring copies
– Computing statistics
• Multimedia server logs
– VCR-like operations
– User access patterns, frequency of early abort
• Packet monitoring of multimedia streams
– Infer session identity from src/dst IP address, port #,
protocol
• Multilayer packet monitoring
– Correlation of control and data streams
Probability Distributions in Web Workload Models
• Exponential: Session interarrival times
• Pareto:
– Response sizes (tail of distribution)
– Resource sizes (tail of distribution)
– Number of embedded images
– Request interarrival times
• Lognormal:
– Response sizes (body of distribution)
– Resource sizes (body of distribution)
– Temporal locality
• Zipf-like: Resource popularity
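To make the models concrete, the sketch below draws synthetic workload values from these families using Python's standard `random` module. All parameter values, and the 90/10 split between the lognormal body and the Pareto tail, are illustrative assumptions rather than fitted values.

```python
import random
from typing import List

rng = random.Random(42)  # fixed seed so the synthetic workload is reproducible


def session_interarrivals(n: int, mean_s: float = 30.0) -> List[float]:
    """Exponential session interarrival times (mean_s is a hypothetical value)."""
    return [rng.expovariate(1.0 / mean_s) for _ in range(n)]


def response_size(body_mu: float = 8.0, body_sigma: float = 1.0,
                  tail_cut: float = 10_000.0, tail_alpha: float = 1.2) -> float:
    """Hybrid size model: lognormal body, Pareto tail starting at tail_cut bytes.
    The 90/10 body/tail split and all parameters are illustrative assumptions."""
    if rng.random() < 0.9:
        return rng.lognormvariate(body_mu, body_sigma)
    return tail_cut * rng.paretovariate(tail_alpha)  # Pareto with minimum tail_cut


def zipf_requests(n: int, num_objects: int, s: float = 1.0) -> List[int]:
    """Zipf-like popularity: the object of rank i is requested with probability
    proportional to 1 / i**s."""
    weights = [1.0 / (i ** s) for i in range(1, num_objects + 1)]
    return rng.choices(range(1, num_objects + 1), weights=weights, k=n)
```

Comparing `statistics.median` and `statistics.mean` over a few thousand `response_size()` draws typically shows the Pareto tail pulling the mean well above the median.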
Wireless Link Management
• Modeling GSM data network layers
– Media access, link, routing, and transport
– Validated ns modeling suite and BONES simulator
– GSM channel error models from Ericsson
• Reliable Link Protocols
– Wireless links have high error rates (> 1%)
– Reliable transport protocols (TCP) interpret errors as congestion
» Need tools to determine multi-layer interaction effects
» Large amounts of data: 120 bytes/s
» Important for design of next generation networks
– One solution: use a reliable link layer (ARQ) protocol
» However, retransmissions introduce jitter
– Alternative: use error-resilient algorithms to allow apps to handle
corrupted data (only protect network protocol headers)
» Less end-to-end delay, constant jitter, higher throughput
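The “protect only the headers” approach corresponds to UDP Lite (RFC 3828), whose Checksum Coverage field limits the checksum to the first bytes of the datagram. A simplified sketch: the RFC 1071 Internet checksum computed over a prefix only, omitting the IP pseudo-header that real UDP Lite includes. A bit error in the uncovered payload leaves the checksum unchanged, so the packet can be delivered to an error-resilient decoder instead of being dropped.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 16-bit ones'-complement checksum."""
    if len(data) % 2:
        data += b"\x00"                              # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)     # fold the carry back in
    return ~total & 0xFFFF


def udplite_checksum(datagram: bytes, coverage: int) -> int:
    """Checksum over only the first `coverage` bytes, as with UDP Lite's
    Checksum Coverage field (simplified: no IP pseudo-header)."""
    return internet_checksum(datagram[:coverage])
```

With a coverage of, say, 20 bytes (an 8-byte UDP Lite header plus a 12-byte RTP header), corruption in the H.263+ payload passes the check while corruption in the headers is still detected.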
Testbed, Protocols, Tools
[Figure: testbed protocol stacks and tracing tools]
• Mobile host and fixed host (Unix BSDi 3.0), connected via GSM BTS, GSM network, and PSTN
• Protocol stack on each host: H.263+ encoder / decoder, RTP packetization / de-packetization, socket interface, UDP / UDP Lite, IP, PPP, GSM data service (transparent / non-transparent)
• Tracing: SocketDUMP and RLPDUMP on both hosts; MultiTracer; plotting & analysis in MATLAB
MultiTracer Time-Sequence Plots
[Figure: time-sequence plot of bytes (398,000 to 416,000) versus time of day (480 to 520 s), showing TcpSnd_data, TcpSnd_ack, TcpRcv_data, TcpRcv_ack, and RlpSnd_rst events. Annotations: “5 segments lost due to RLP reset”, “13 segments dropped at TCP receiver”, “18 segments”.]
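Correlating events across layers, as in the MultiTracer time-sequence plot, starts by merging per-layer traces onto one timeline. A minimal sketch, assuming each trace is a time-ordered list of a hypothetical `Event` record with clocks already synchronized; the layer and kind strings are loosely modeled on the plot's labels and are not MultiTracer's actual format.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    t: float     # time of day in seconds (clocks assumed synchronized)
    layer: str   # e.g. "tcp_snd", "tcp_rcv", "rlp_snd" (hypothetical names)
    kind: str    # e.g. "data", "ack", "rst"
    seq: int     # byte sequence number, or -1 when not applicable


def merge_traces(*traces: Iterable[Event]) -> List[Event]:
    """Merge per-layer traces (each already time-ordered) into one timeline."""
    return list(heapq.merge(*traces, key=lambda e: e.t))


def events_after(timeline: List[Event], trigger_layer: str, trigger_kind: str,
                 layer: str, window_s: float) -> List[Event]:
    """Events on `layer` within window_s seconds after any trigger event,
    e.g. TCP data seen shortly after an RLP reset."""
    triggers = [e.t for e in timeline
                if e.layer == trigger_layer and e.kind == trigger_kind]
    return [e for e in timeline if e.layer == layer
            and any(0 <= e.t - t0 <= window_s for t0 in triggers)]
```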
Applications of Network Distance Mapping
• Mirror Selection
• Cache-infrastructure Configuration
• Service Redirection
• Service Placement
• Overlay Routing/Location
Distance Mapping Framework
Goal: Develop scalable, robust distance information
collection/sharing infrastructure
• Feasible distance metrics
– Number of hops
– Latency
– Bandwidth
• Continuous measurement
• Provide approximate distance information
• Continue to operate in the presence of component changes/failures
• Scale the measurement by self-adaptation
Distance Mapping Challenges
• Select how many probes/monitors to deploy
• Monitor placement
• Choose the appropriate monitor for a given client
• Statistically quantify estimation error, e.g., x% of the estimates fall within a given factor of the actual distances
• How stable are these clusterings?
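The error criterion in the last challenge can be computed directly. A minimal sketch; `factor` stands for the unspecified tolerance factor and is a hypothetical parameter name.

```python
from typing import Sequence


def fraction_within_factor(estimates: Sequence[float], actuals: Sequence[float],
                           factor: float) -> float:
    """Fraction of estimates e satisfying actual/factor <= e <= actual*factor."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    if len(estimates) != len(actuals) or not estimates:
        raise ValueError("need equal-length, non-empty sequences")
    hits = sum(1 for e, a in zip(estimates, actuals) if a / factor <= e <= a * factor)
    return hits / len(estimates)
```

For example, estimates [10, 30, 100, 55] against actual distances [20, 20, 40, 50] with factor 2 give 0.75: only the estimate of 100 for an actual distance of 40 falls outside its band.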
IDMaps Project
• Internet-wide infrastructure to collect
distance information
• IDMaps provides:
– Long-term approximate distances
– Distance estimation between any 2 points on the Internet
• IDMaps does not provide:
– End-to-end application-level performance
– Available bandwidth or current delay
– Characteristics of any specific path
IDMaps Components
• Tracers: autonomous instrumentation boxes
• Tracers measure distances among themselves and to APs
• APs (Address Prefixes): regions of the Internet; hosts within an AP are equidistant from the rest of the Internet
[Figure: tracers, with the hosts of each AP near a tracer]
• Measurement cost: T*T + AP, where T = number of tracers and AP = number of APs
Courtesy of IDMaps group
IDMaps Architecture
Courtesy of IDMaps group
IDMaps Results and Limitations
[Figure: complementary distribution function of the percentage of correct answers; cyan: random selection; other curves: various heuristics & algorithms]
• Simulation results on synthetic and static network topology
Courtesy of IDMaps group
IDMaps Limitations
• Based on the triangle inequality
• Considers only the number of hops
• Ignores the dynamics of the Internet; no stability study
[Figure: clients A and B, monitors C and D: does AB = AC + CD + DB?]
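The additive estimate AB = AC + CD + DB can be written as a short function. A sketch assuming hypothetical inputs: each client's nearest tracer, the measured client-to-tracer distance, and tracer-to-tracer distances (hop counts or latencies).

```python
from typing import Dict, Tuple


def idmaps_estimate(a: str, b: str,
                    nearest_tracer: Dict[str, str],
                    host_to_tracer: Dict[str, float],
                    tracer_dist: Dict[Tuple[str, str], float]) -> float:
    """Estimate d(A, B) as d(A, C) + d(C, D) + d(D, B), where C and D are the
    tracers nearest to A and B; tracer distances are assumed symmetric."""
    c, d = nearest_tracer[a], nearest_tracer[b]
    if c == d:
        middle = 0.0                                  # both clients share a tracer
    elif (c, d) in tracer_dist:
        middle = tracer_dist[(c, d)]
    else:
        middle = tracer_dist[(d, c)]
    return host_to_tracer[a] + middle + host_to_tracer[b]
```

The sketch makes the limitation visible: the estimate is only as good as the assumption that the path from A to B actually passes near C and D.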
Wide-area Network Measurement and Monitoring Services
Goal: Understand the behavior of the Internet and provide adaptation to Internet apps through monitoring services
• Layered Architecture
– Bottom layer: a common core with generic metrics, shared across multiple apps
– Top layer: more application-specific
• Modularity
– Separation of functionality
– Clear definition of interaction between different layers
– Ease of customization and modification
Layered Architecture (top to bottom)
• Decision/Design Procedures
• Dissemination Layer: application side; pull-/push-based APIs
• Federation for Sharing Layer
• Measurement Collection, Transformation and Storage Layer
• Measurement Layer: what to measure, with what tools? Probe placement & density
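As a rough illustration of pull-/push-based APIs at the dissemination layer, the sketch below keeps the latest value per metric for pull queries and invokes registered callbacks on each update for push delivery. The class, method, and metric names are hypothetical; a real layer would add federation, storage, and filtering.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List, Optional

Callback = Callable[[str, float], None]


class Dissemination:
    """Toy dissemination layer: pull = query the latest value; push = subscribe."""

    def __init__(self) -> None:
        self._latest: Dict[str, float] = {}
        self._subs: DefaultDict[str, List[Callback]] = defaultdict(list)

    def publish(self, metric: str, value: float) -> None:
        """Called by the lower (collection/storage) layer with a new measurement."""
        self._latest[metric] = value
        for cb in self._subs[metric]:
            cb(metric, value)          # push the update to interested applications

    def pull(self, metric: str) -> Optional[float]:
        """Return the most recent value, or None if nothing has been measured."""
        return self._latest.get(metric)

    def subscribe(self, metric: str, cb: Callback) -> None:
        self._subs[metric].append(cb)
```

Pull suits applications that query occasionally (e.g. at connection setup), while push suits applications that must adapt as soon as a metric changes.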
Current Focus at Berkeley: Internet “Iso-bar”
• Regions of the network that perceive similar performance to the rest of the Internet, i.e., spatial correlation
– How to find them without knowing the topology?
• Used to determine the number and placement of monitors; iso-bar clustering in a high-dimensional feature space
– Each host collects distance values to m hosts as m-dim feature
vector
– Use K-means for high-dimension clustering
– Choose site closest to the cluster center as monitor
– Initially m can be the total number of clients, later it may be
the number of representative monitoring sites
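The clustering steps above can be sketched with plain Lloyd's k-means. This assumes Euclidean distance between feature vectors and uses made-up RTT values; the number of clusters k is taken as given, although in practice choosing it is part of the problem.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dist(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def kmeans(points: List[Vector], k: int, iters: int = 50, seed: int = 0) -> List[List[float]]:
    """Lloyd's k-means: alternate nearest-center assignment and center update."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist(p, centers[j]))].append(p)
        new = [[sum(col) / len(g) for col in zip(*g)] if g else centers[j]
               for j, g in enumerate(groups)]
        if new == centers:
            break                       # assignments stable: converged
        centers = new
    return centers


def choose_monitors(features: Dict[str, Vector], k: int) -> List[str]:
    """For each cluster center, pick the host whose feature vector is closest."""
    hosts = list(features)
    centers = kmeans([features[h] for h in hosts], k)
    return [min(hosts, key=lambda h: dist(features[h], c)) for c in centers]
```

Here each host's feature vector holds its distances (e.g. RTTs in ms) to the m reference hosts.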
Iso-bar Experiments
• Remove the triangle inequality assumption
• Stationarity: predictability of network properties (temporal correlation)
– Global stationarity: change in the total number of clusters
– Local stationarity: expansion and shrinking of each cluster
• Experiments with the NLANR Active Measurement Project (AMP) data set
– 119 sites in the US and New Zealand
– Traceroute between every pair of hosts every minute
– Use daily average round-trip time (RTT)
– Color the clustered hosts and map them on a US map using longitude and latitude (imprecise mapping)
Geographic Distribution of NLANR AMP Monitoring Sites
Underlying Topology of NLANR Sites
Most of the NLANR sites use Abilene Network
Preliminary Clustering Results
Stationarity of Iso-bar
• Global stationarity quite good
• Local stationarity still under investigation
• Will apply more statistical learning methods, e.g., Gaussian
mixture model, kernel methods for clustering and its dynamics
• Will evaluate its prediction with real measurement data
Inferring Internet Topology
Goal: Determine the hierarchy among autonomous systems (ASes) based on the types of relationships among them
• Assume two types of relationships
– Provider-Customer
– Peer-Peer
• Providers are above customers in the hierarchy;
peers mostly in same level in the hierarchy.
• Inferences
– 5-level hierarchy in the Internet
– Connectivity across levels is strictly non-hierarchical
Inferring Internet Topology
• CAIDA & Mercator
– Traceroutes from different locations to get connectivity
– Whois & BGP dumps to find IP addr ownership
• Krishnamurthy et al.
– BGP dumps to find IP addr ownership
– Use web server logs to cluster IP addrs by behaviour
• GT-ITM
– Generated topologies
– Useful for testing on specific cases, but not actual Internet
• Our work
– BGP dumps to find AS connectivity
– BGP dumps to find the number of paths carried by each link
– BGP dumps to find AS preferences for links
Inferring Type of Relationship
Assumption: with high probability, ISPs do not forward BGP advertisements from their peers or providers to other peers or providers
• Implication: If assumption is completely true,
every AS path is “valley-free” (no traversal
from peer/provider to customer and back to
peer/provider)
• Features of inference algorithm
– Collected large # of BGP dumps;
Partial views of Internet from different sources
– Assign every AS rank based on every dump;
Apply dominance/clustering rules to find type of
relationships
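The valley-free property can be checked mechanically once relationships are labeled. A minimal sketch, assuming a hypothetical encoding in which each directed AS edge is tagged c2p (customer to provider), p2p (peer to peer), or p2c (provider to customer); a valid path climbs zero or more c2p edges, crosses at most one p2p edge, then descends zero or more p2c edges.

```python
from typing import Dict, List, Tuple

Rel = Dict[Tuple[str, str], str]   # (from_as, to_as) -> "c2p" | "p2p" | "p2c"


def build_rel(provider_customer: List[Tuple[str, str]],
              peers: List[Tuple[str, str]]) -> Rel:
    """Label both directions of every provider-customer and peer-peer link."""
    rel: Rel = {}
    for p, c in provider_customer:
        rel[(c, p)] = "c2p"
        rel[(p, c)] = "p2c"
    for a, b in peers:
        rel[(a, b)] = rel[(b, a)] = "p2p"
    return rel


def is_valley_free(path: List[str], rel: Rel) -> bool:
    """True if the AS path never goes back up after crossing a peer link or
    going downhill, and crosses at most one peer link."""
    descending = False
    for a, b in zip(path, path[1:]):
        r = rel[(a, b)]
        if r == "c2p":
            if descending:
                return False            # uphill after downhill/peer: a valley
        elif r == "p2p":
            if descending:
                return False            # second peer link or peer after downhill
            descending = True
        elif r == "p2c":
            descending = True
        else:
            raise ValueError(f"unknown relationship {r!r}")
    return True
```

The inference direction runs the other way: the algorithm searches for relationship labels under which (almost) all observed paths are valley-free.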
Layers in the Internet
• Layer 0 (Strong Core)
– Dense subgraph (peering links) of the Internet topology consisting of only Tier-1 ISPs
• Layer 1 (Transit Core)
– Consists of all top transit providers/large national ISPs
• Layer 2 (Outer Core)
– Last layer where any two ASs have peering relationship
• Layer 3 (Regional)
– Collection of regional ISPs that support small customer base
• Layer 4 (Customers)
– Large collection (87%) of ASs that are only customers
Our Findings
• Inner core of 20 ASes is highly connected
– 271 edges (full clique = 380)
• Full graph has 10,918 ASes
– 24,598 edges out of
119,191,806 possible edges
• Distribution of paths
carried by edges
Our Graph of the Core
Quantifying the Layering
Layer          # of ASs     %     # Intra-layer edges   # Inter-layer edges
Strong Core         20     0.2                    329                  9600
Transit Core       162     1.5                   1052                  6000
Outer Core         674     6.3                   1070                  3600
Regional           950     9.2                    202                  2400
Customers         8852    83.0                      0                     0
Note: Edges directed from providers to
customers; peer-peer links directed both ways
“Trust but Verify”
• Monitoring is integral to SLA verification
• Built on top of SNMP Architecture
– SNMP Agents
– SNMP Manager
– SNMP Protocol (polling/trapping)
– Objects and Management Information Bases (MIBs)
[Figure: an SNMP manager on a management station communicates over the network with SNMP agents on managed nodes and managed elements]
Network Connectivity
SLA Monitoring
• Need to monitor availability, traffic
(bandwidth, latency) between access routers
• Standard SNMP MIBs
– Current interface status (up/down)
– Time since last status change
– # bytes/packets received/transmitted
– # packets discarded/received in error
– Length of packet queue
• Not really sufficient for determining
connectivity SLA!
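Byte counters such as IF-MIB's `ifInOctets`/`ifOutOctets` only become useful once two polls are differenced and normalized by `ifSpeed`. A sketch assuming the values have already been retrieved by SNMP polling (retrieval not shown) and that a 32-bit counter wrapped at most once between polls; high-speed links would use the 64-bit `ifHCInOctets` counters instead.

```python
COUNTER32_MOD = 2 ** 32


def counter_delta(prev: int, curr: int, modulus: int = COUNTER32_MOD) -> int:
    """Difference between two polls of a wrapping counter (at most one wrap)."""
    return (curr - prev) % modulus


def utilization(prev_octets: int, curr_octets: int,
                interval_s: float, if_speed_bps: int) -> float:
    """Average link utilization over the polling interval, from two
    ifInOctets (or ifOutOctets) samples and the interface's ifSpeed."""
    bits = counter_delta(prev_octets, curr_octets) * 8
    return bits / (interval_s * if_speed_bps)
```

Such averages say nothing about end-to-end latency or reachability between access routers, which is why the standard MIBs alone fall short for a connectivity SLA.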
Remote Monitoring of IP Network
• RMON Architecture
– Manager (SNMP Manager), Probe Points (SNMP Agents)
– Network is a collection of LAN segments;
For each, collect:
» Segment statistics (e.g., packet counts)
» Host specific statistics
» Traffic matrix between hosts on same segment
– Lots of stats can be collected, but they are difficult to correlate across the LAN segments
– Best for finding bottleneck segments and to drive capacity
planning
– Not helpful for delays or latency measurements
Monitoring Flows
• Flow: correlated subset of network traffic,
e.g., with a common source and destination
• Cisco Proprietary NetFlow Architecture
– Flow Collector
– Router to collect the flow information
– Traffic counts on virtual links
• IETF Real-Time Flow Monitoring
– Standardized Flow MIB
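A minimal sketch of flow aggregation in the spirit of NetFlow: packets are grouped by 5-tuple, and a flow record is exported once its key has been idle longer than an inactive timeout. The input format and timeout value are assumptions; real exporters also apply active timeouts and TCP FIN/RST handling.

```python
from typing import Dict, List, NamedTuple, Tuple

FlowKey = Tuple[str, str, int, int, int]   # src, dst, sport, dport, protocol


class Flow(NamedTuple):
    key: FlowKey
    first: float    # timestamp of the first packet
    last: float     # timestamp of the most recent packet
    packets: int
    octets: int


def aggregate(packets: List[Tuple[float, FlowKey, int]],
              inactive_timeout: float) -> List[Flow]:
    """Group (timestamp, key, length) packets into flow records."""
    active: Dict[FlowKey, Flow] = {}
    done: List[Flow] = []
    for ts, key, length in sorted(packets):
        f = active.get(key)
        if f is not None and ts - f.last > inactive_timeout:
            done.append(f)                     # export the expired flow
            f = None
        if f is None:
            active[key] = Flow(key, ts, ts, 1, length)
        else:
            active[key] = f._replace(last=ts, packets=f.packets + 1,
                                     octets=f.octets + length)
    done.extend(active.values())               # flush remaining flows at end of trace
    return done
```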
Network Monitoring with Active Probing
• Ping Program
– Active probing via ICMP echo messages
– Determines loss rates and delays
• Traceroute
– Path that packets follow in the IP network, with estimated per-hop delay
– Sends probe packets with increasing TTL; each router at which the TTL expires returns an ICMP Time Exceeded message, revealing itself
– Per-hop delays are inferred from these ICMP replies, which can cause high variability in the reported delays
• NTP Sync Messages
– Clock offset, round trip delay, dispersion info exchange
• Various Statistical Probing Schemes
– Delays and loss rates
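Two of the computations behind these probes, sketched in Python: summarizing ping results (a lost probe is marked `None`, a hypothetical convention) and the standard NTP on-wire calculation of clock offset and round-trip delay from the four timestamps of one exchange (t1 client send, t2 server receive, t3 server send, t4 client receive).

```python
import statistics
from typing import List, Optional, Tuple


def ping_summary(rtts_ms: List[Optional[float]]) -> Tuple[float, float, float, float]:
    """(loss rate, min, mean, max RTT) from probe results; None marks a lost probe."""
    if not rtts_ms:
        raise ValueError("no probes")
    got = [r for r in rtts_ms if r is not None]
    loss = 1 - len(got) / len(rtts_ms)
    if not got:
        return loss, float("nan"), float("nan"), float("nan")
    return loss, min(got), statistics.mean(got), max(got)


def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> Tuple[float, float]:
    """Server clock offset relative to the client, and round-trip network delay."""
    offset = ((t2 - t1) + (t3 - t4)) / 2
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay
```

For instance, if the server clock is 5 s ahead, each one-way trip takes 10 s, and the server holds the request for 1 s, the timestamps (100, 115, 116, 121) yield an offset of 5 s and a delay of 20 s.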
SLA Monitoring Issues
• Client- versus operator-side monitoring and
reporting
• Monitoring in a multi-class network
• Transport- and Application-level monitoring
• Monitoring in an overlay network
• Monitoring in a multi-service provider
environment (finding “the weakest link”)
• Accuracy in monitoring
– Number of measurements, frequency of measurements,
stability of results, confidence intervals
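For the accuracy question, a minimal sketch of a mean with a confidence interval, assuming independent samples and a normal approximation (a Student t quantile would be more appropriate for small samples).

```python
import math
import statistics
from typing import List, Tuple


def mean_ci(samples: List[float], confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower bound, upper bound) of a normal-approximation CI."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    m = statistics.mean(samples)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)   # ~1.96 for 95%
    half = z * statistics.stdev(samples) / math.sqrt(n)
    return m, m - half, m + half
```

The half-width shrinks roughly as 1/√n for a fixed spread, which gives one way to decide how many measurements are enough.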
Measurement Points for Verifying SLAs
• Distinguish between measuring within service provider
cloud and end-to-end between customer nodes
From Network Management to Service Management
[Figure: progression from Server Load Balancing through Advanced Traffic Management to Service Level Control]
• Server Load Balancing: server and site availability; balanced server and site load
• Advanced Traffic Management: rapid change; network and application flexibility; scalability
• Service Level Control: complex site administration; rapid problem diagnosis/isolation; service level measurement; multi-tier resource monitoring; preferential services; resource provisioning; self-tuning; problem prevention
Morino, Resonate
Service Reliability is Critical
[Figure: breakdown of service failures by cause (Source: IDC)]
• Network failure: 18.2% (e.g., ISP connection down, LAN segment overloaded)
• Applications/systems failure: 28.5%
• Server failure: 20%
• OS failure: 24.6%
• Administration: 8.7%
• Other example causes shown: CPU overloaded, NIC failure, process hung, slowed database performance
Morino, Resonate
Traditional Traffic Management
• Single tier, single site, service level control
– Higher service levels
– Better resource utilization
– Multiple features to meet unique needs
[Figure: users reach content servers over the Internet through a traffic management layer]
Morino, Resonate
Basic LAN Solution Requirements
• Simple load balancing
– Establish Virtual IP address (VIP)
– Delivers scalable performance
• Health checks and service monitoring
– Look beyond layer 3/4 characteristics
– Returned content, response times, etc.
– Better information to determine server status
– Use traffic management techniques to insulate users from the effects of server or software failure
Morino, Resonate
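A minimal sketch of the VIP idea with content-aware health checks: a server stays in the round-robin pool only if its check returned HTTP 200, contained the expected content, and responded within a time limit. The `Probe` record and thresholds are hypothetical; issuing the checks themselves is not shown.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Probe:
    """Result of one application-level health check (hypothetical fields)."""
    server: str
    status: int         # HTTP status returned for the check URL
    body: str           # returned content
    response_ms: float  # response time


def healthy(p: Probe, expected: str, max_ms: float) -> bool:
    """Look beyond layer 3/4: correct status, expected content, timely response."""
    return p.status == 200 and expected in p.body and p.response_ms <= max_ms


def vip_scheduler(probes: List[Probe], expected: str, max_ms: float) -> Iterator[str]:
    """Round-robin over the servers that passed the latest health check."""
    pool = [p.server for p in probes if healthy(p, expected, max_ms)]
    if not pool:
        raise RuntimeError("no healthy servers behind the VIP")
    return itertools.cycle(pool)
```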
Advanced LAN Solution Requirements
• Complex traffic management
– More intelligent policies for application state management
– Enforce sophisticated user based policies
– Inspection of application header
» URL parsing - Direct requests to systems with available content
• Functional segregation of Web site
» SSL Session IDs: requirement to maintain persistence
• Maintains application state
• Multiple TCP sessions within a single SSL session
» Cookies - More precise user identification and classification
• Look through proxies, firewalls
• Establish preferential services
– Integration with WAN solutions
Advanced Traffic Management Features Require Delayed Binding
Morino, Resonate
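Binding
[Figure: a client (browser) connection to a server (HTTPd); with immediate binding the connection is bound to a server at the SYN, with delayed binding the binding is deferred until the application data arrives]
Morino, Resonate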
Delayed Binding Issues
• Push Packet contains URL, cookie, all
application information (except port number)
• Must read application header to deliver
advanced traffic management features
• Delayed Binding is the only way to see the
application header before decision is made to
‘bind’ to server
Morino, Resonate
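A minimal sketch of a delayed-binding decision: buffer the client's first data until the HTTP header is complete, then choose a server pool by URL prefix and keep a session on one server by hashing a session cookie. The pool layout, cookie name, and hashing policy are hypothetical; the pools must include a catch-all `/` prefix.

```python
import zlib
from typing import Dict, List, Optional


def parse_request_head(data: bytes) -> Optional[Dict[str, str]]:
    """Parse the request line and headers; None if the header is still incomplete."""
    head, sep, _ = data.partition(b"\r\n\r\n")
    if not sep:
        return None                                  # keep buffering
    lines = head.decode("latin-1").split("\r\n")
    method, path, _version = lines[0].split(" ", 2)
    fields = {"method": method, "path": path}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return fields


def bind(fields: Dict[str, str], pools: Dict[str, List[str]], session_cookie: str) -> str:
    """Pick a pool by longest matching URL prefix, then a server within it."""
    prefix = max((p for p in pools if fields["path"].startswith(p)), key=len)
    servers = pools[prefix]
    session = None
    for item in fields.get("cookie", "").split(";"):
        name, _, value = item.strip().partition("=")
        if name == session_cookie:
            session = value
    if session is None:
        return servers[0]                            # placeholder for a load-based choice
    return servers[zlib.crc32(session.encode()) % len(servers)]   # sticky per session
```

Every one of these steps runs on the load balancer for every new connection, which is where the throughput cost of delayed binding comes from.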
Be Careful What You Wish For...
• Now that you have the header, what
do you do with it?
• Unstructured format, application
specific, might be encrypted
• CPU sinkhole!
• Be sure to watch what happens to throughput
when you turn on Delayed Binding features
Morino, Resonate
Deeper Visibility for Managing Complex Infrastructures
Multi-tier service level control
• Instrument back-end systems
• Capture health and status
• Diagnose and isolate problems
• Take corrective action immediately!
[Figure: users reach a multi-tier site (traffic management, content servers, app servers, data layer) over the Internet; systems management relies on server-side instrumentation]
Morino, Resonate
Redundant Site Implementation: Growth and Failover
• Multi-site service level control
• Higher service levels
• Better resource utilization
• Not a networking solution
• Not a performance issue
– POP persistence dominates issues
[Figure: users reach replicated sites (SF, NY) over the Internet via WAN traffic management]
Morino, Resonate
Management and Administration is Crucial
Enterprise Services Console: consolidated view of multiple sites
• Eases management of complex e-businesses
• Reduces costs associated with undetected problems
[Figure: a sys admin uses the console to manage sites in SF; Denver, CO; and NY]
Morino, Resonate
Intelligent Service Management
[Figure: policy-based control linked by feedback loops to IP-application traffic management functions and to systems management functions]
Closed-loop Real-Time Control of IP-based Applications
Morino, Resonate
Resonate Case Study
• Central Dispatch
– Software-based load balancer for servers on a LAN
– Sophisticated policy-driven filtering, redirection, load balancing
– Class of service support for server access
• Global Dispatch
– Multi-site management, wide-area redirection, disaster recovery
– Advanced Traffic Mapping capabilities:
» Sticky/persistent session support and sticky session failover
» Directed Traffic Table directs users to predefined POP
– Configurable scheduling based on WAN latency and site load
– POP failover handling
– Advanced stats: avg. DNS response, POP hit rate, other QoS
– Coexistence with existing DNS and load balancing architecture
– Pass multiple IP addresses to client for browser-based failover
– Weighted round-robin scheduling
http://www.resonate.com
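The slide does not say which weighted round-robin variant Global Dispatch uses. For illustration, the sketch below implements smooth weighted round-robin (the variant used by nginx's upstream module), which interleaves picks instead of sending a burst to the heaviest site; the site names and weights are hypothetical.

```python
from typing import Dict


class SmoothWRR:
    """Smooth weighted round-robin: over each cycle of sum(weights) picks,
    every site is chosen exactly `weight` times, spread out over the cycle."""

    def __init__(self, weights: Dict[str, int]) -> None:
        self.weights = dict(weights)
        self.current = {s: 0 for s in weights}
        self.total = sum(weights.values())

    def pick(self) -> str:
        for s, w in self.weights.items():
            self.current[s] += w                 # every site earns its weight
        best = max(self.current, key=self.current.__getitem__)
        self.current[best] -= self.total         # the chosen site pays the total
        return best
```

With weights {"sf": 5, "ny": 1, "denver": 1}, one cycle yields sf, sf, ny, sf, denver, sf, sf, so the smaller sites are not starved while the larger one still receives five of every seven requests.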
Resonate Case Study
• Commander
– End-to-End monitoring
» URL tests, host access tests, HTTP service availability tests
» SNMP traps
– Test, statistics, and control features
» Gather availability info: site + Web/app/DB servers
» Process events (inaccessible file servers, DB, network congestion, etc.) for reporting or to initiate user-defined actions
– Features:
» Rapid identification and resolution of site problems
» Multi-tier resource monitoring of site servers
» Identify problems before service levels are affected
» Identify network trends essential to optimized site planning
» User-defined service mgmt policies for automated control
http://www.resonate.com
Resonate Case Study
• Automated Control for Policy-Based Problem Resolution
– Sophisticated server-level control policies
– Monitors events/processes them according to pre-defined rules &
action(s)
– E.g., sending email/electronic pages, script invocation
• Examples of policy-based control include:
– Schedule traffic from Web server w/ slow/failed backend app server
– Increase/decrease traffic to server when perf crosses thresholds
– Enable backup content server in a Central Dispatch site when one or
more active content servers fail/become too busy
– Monitor apps and server processes; restart any that fail
http://www.resonate.com