
Network Monitoring,
WAN Performance Analysis,
& Data Circuit Support
at Fermilab
Phil DeMar
US-CMS Tier-3 Meeting
Fermilab
October 23, 2008
Active Wide-Area Network Monitoring

PerfSONAR: distributed network monitoring infrastructure
• Supported by US-LHC T1 sites and the Internet2 community

PerfSONAR-PS: active monitoring package
• Web services collection built on trusted monitoring tools:
  • ping, BWCTL (iperf), OWAMP, NPAD, NDT toolkit
  • Web service interface for pulling data into other monitoring tools
• Zero configuration; out-of-the-box deployment
  • Based on a Knoppix Live CD bootable disk
  • Optional software bundle deployment
• Modest hardware requirements for on-site deployment
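The toolkit's throughput tests are driven by BWCTL, which wraps iperf. As a rough, hypothetical illustration of what pulling such results into another monitoring tool involves, the sketch below parses the summary line of a classic iperf run; the sample line and the regex are assumptions for illustration, not perfSONAR-PS code:

```python
import re

# Hypothetical sample of a classic iperf (v2) summary line; real output
# may differ slightly by version.
SAMPLE = "[  3]  0.0-10.0 sec  1.12 GBytes   962 Mbits/sec"

def parse_iperf_bandwidth(line):
    """Extract the measured throughput from an iperf summary line,
    normalized to Mbits/sec; return None if no bandwidth is present."""
    m = re.search(r"([\d.]+)\s+([KMG])bits/sec", line)
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    scale = {"K": 1e-3, "M": 1.0, "G": 1e3}[unit]
    return value * scale

print(parse_iperf_bandwidth(SAMPLE))  # 962.0
```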
PerfSONAR Deployment Status

US-ATLAS is moving ahead with perfSONAR-PS at its T1 & T2s:
• Two dedicated systems per site, one each for latency & bandwidth testing
• Systems are spec'ed devices, $628 each (KOI Computer)
• Utilize Knoppix disks & standard configurations

We've recommended the same model for US-CMS

Current PerfSONAR-PS deployment:
• Both US-LHC Tier-1s (FNAL & BNL)
• UNL (CMS), U-Mich (ATLAS), U-Delaware, Internet2, ESnet
• Complete active monitoring matrix of the above
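A "complete active monitoring matrix" means every deployed site runs tests against every other. The pairing is just all site combinations; the site list below is taken from the slide, the enumeration logic is an assumed illustration:

```python
from itertools import combinations

# Sites in the current perfSONAR-PS deployment (from the slide above)
sites = ["FNAL", "BNL", "UNL", "U-Mich", "U-Delaware", "Internet2", "ESnet"]

# Full mesh: with n sites there are n*(n-1)/2 bidirectional test pairs.
pairs = list(combinations(sites, 2))
print(len(pairs))  # 21 pairs for 7 sites
```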
Background information

PerfSONAR-PS project: http://code.google.com/p/perfsonar-ps/

A tour of the perfSONAR-PS services is available: http://code.google.com/p/perfsonar-ps/wiki/CodeTour

Knoppix Live CD bootable disk info: http://code.google.com/p/perfsonar-ps/wiki/NPToolkit

Appliance PCs:
• Vendor: KOI Computing – (630) 627-8811
• Spec: 1U Intel Pentium Dual-Core E2200 2.2GHz system
• Cost: $628/each
Performance Analysis Support

In 1999, Matt Mathis coined the term 'Wizard's Gap'
• Users often don't know about:
  • Common OS tuning issues for WAN data movement
  • The wide-area network path, its characteristics, and the available tools
• It's still an end-to-end problem
  • Today, it's still an issue
  • And the world is still short on wizards
• Our structured analysis methodology seeks to put some of the wizardry into a structured process
  • Find the performance problem area(s)
Network Application Performance Factors
[Diagram: the end-to-end data path from applications through CPU, memory, disks, operating system, and NIC on each end system, across the LAN, routers/switches, and WAN in between.]

End-system factors:
• CPU speed
• Memory size
• System load
• Disk I/O speed
• Operating system: R/W buffer size, disk cache size
• NIC speed

Network factors:
• Network delay
• Bandwidth
• Packet drop rate
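The interaction between the OS R/W buffer size and the network's delay and bandwidth can be made concrete with the bandwidth-delay product, which sets how much data must be in flight to fill the path. The numbers below are illustrative only:

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight
    (and hence buffered by TCP) to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8

# Illustrative case: a 1 Gb/s WAN path with 100 ms round-trip time
# needs ~12.5 MB of TCP buffer, far above typical OS defaults.
bdp = bdp_bytes(1e9, 0.100)
print(bdp)  # 12500000.0 bytes = 12.5 MB
```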
Performance Analysis Methodology

Structured approach to performance analysis
• Model the process like a medical diagnosis:
  • Collect the physical characteristics
  • Run diagnostic tests
  • Record everything; develop a history of the analysis
• Strategic approach:
  • Sub-divide the problem space:
    • Application-related problems
    • Host diagnosis and tuning
    • Network path analysis
  • Then divide and conquer
Network Performance Analysis Architecture
[Diagram: the end-to-end path from a network end system across the LAN, border routers, and WAN, with diagnosis servers attached along each segment.]

• NES: Network End System
• NESDS: Network End System Diagnosis Server
• NPDS: Network Path Diagnosis Server
• PTDS: Packet Trace Diagnosis Server
• BR: Border Router
Performance Analysis Tools…

Host diagnosis
• Script that pulls the system configuration
• Network Diagnostic Tool (NDT)
  • Faulty network connections & NICs, duplex mismatches

Network path diagnosis
• OWAMP to collect and diagnose one-way network path statistics
  • Packet loss, latency, jitter
• Other tools such as ping and traceroute, as needed

Packet trace diagnosis
• Port mirror on border router(s)
• tcpdump to collect packet traces
• tcptrace to analyze packet traces
• xplot for visual examination
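As a toy stand-in for the kind of analysis tcptrace performs on a tcpdump capture, the sketch below flags sequence numbers seen more than once in a simplified one-directional trace as likely retransmissions. The trace data is invented and real tcptrace output is far richer; this only illustrates the idea:

```python
def find_retransmissions(packets):
    """Given (timestamp, seq) pairs from one direction of a TCP flow,
    return packets whose sequence number was already seen."""
    seen, retx = set(), []
    for ts, seq in packets:
        if seq in seen:
            retx.append((ts, seq))
        seen.add(seq)
    return retx

# Hypothetical trace: the segment at seq 2460 reappears 240 ms later.
trace = [(0.00, 1000), (0.01, 2460), (0.02, 3920), (0.25, 2460)]
print(find_retransmissions(trace))  # [(0.25, 2460)]
```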
Network path characteristics collected





• Round-trip time
• Sequence of routers along the path
• One-way delay, delay variance
• One-way packet drop rate
• Packet reordering
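One-way delay and delay variance of the kind OWAMP reports can be summarized directly from raw delay samples. The numbers below are illustrative, not real measurements:

```python
import statistics

# Hypothetical one-way delay samples in milliseconds; the 63.0 ms
# outlier stands in for a transient congestion event.
delays_ms = [51.2, 50.8, 51.5, 63.0, 51.1, 50.9]

mean = statistics.mean(delays_ms)
# Delay variation summarized as the population standard deviation.
jitter = statistics.pstdev(delays_ms)
print(round(mean, 2), round(jitter, 2))
```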
Network Performance Analysis Methodology

Step 1: Define the problem space

Step 2: Collect host information & network path characteristics

Step 3: Host tuning & diagnosis

Step 4: Network path performance analysis
• Does the route change frequently?
• Network congestion: is the delay variance large?
• Infrastructure failures: examine the counters one by one
• Packet reordering: load balancing? Parallel processing?

Step 5: Evaluate the packet trace pattern
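The packet-reordering check in Step 4 can be sketched as counting arrivals whose sequence number falls behind one already seen, which is the typical symptom of per-packet load balancing across parallel paths (data and logic below are illustrative only):

```python
def count_reordered(seqs):
    """Count packets arriving with a sequence number lower than the
    highest sequence number seen so far (a simple reordering metric)."""
    high, reordered = -1, 0
    for s in seqs:
        if s < high:
            reordered += 1
        else:
            high = s
    return reordered

# e.g. per-packet load balancing over two paths can interleave arrivals:
print(count_reordered([1, 2, 4, 3, 5, 7, 6, 8]))  # 2 reordered packets
```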
Tier-2/Tier-3 sites worked with
• UERJ (Brazil)
• IHEP (China)
• RAL (UK)
• University of Florida
• IFCA (Spain)
• TTU (Texas)
• CIEMAT (Spain)
• Belgium
• OWEA (Austria)
• CSCS (Switzerland)
Performance Analysis Status & Summary

An available service for CMS Tier-2/3 sites
• A work in progress at this point
• Focus is on process as well as results
• Willing to work with others in this area

Future areas of effort:
• Incorporate into the workflow & content management system
• Make use of the perfSONAR monitoring infrastructure

https://plone3.fnal.gov/P0/WAN/netperf/methodology/

How to get hold of us:
• Send email to [email protected]
• Wide Area Work Group video-conference meetings every other Friday
Strategic Direction Toward Circuits

DOE High Performance Network Planning Workshop established a strategic model to follow:
• High-bandwidth backbones for reliable production IP service
• Separate high-bandwidth network paths for large-scale science data flows
  • ESnet Science Data Network
• Metropolitan Area Networks (MANs) for local access
  • Fermi LightPath is a cornerstone of the Chicago-area MAN
ESnet4: Core networks 50-60 Gbps by 2009-2010 (10 Gb/s circuits)
[Map: ESnet4 topology showing the production IP core and the Science Data Network core across US hubs (Boston, New York, Washington DC, Denver, Albuquerque, Tulsa, Boise, LA, San Diego, Jacksonville), with international connections to Canada (CANARIE), Asia-Pacific, GLORIAD (Russia and China), Europe (GEANT), CERN (30+ Gbps), Australia, and South America (AMPATH).]

Legend:
• Production IP core (10 Gbps)
• SDN core (20-30-40-50 Gbps)
• MANs (20-60 Gbps) or backbone loops for site access
• International connections
Topology of circuit connections

Circuits utilize the MAN infrastructure:
• 10GE channel(s) reserved for routed IP service (purple)
• LHCOPN circuit (orange) to CERN
• SDN channels for E2E circuits to CMS Tier-2/3 sites (shades of green)

Circuits based on end-to-end vLANs
• Direct BGP peering with the remote site

Multiple provider domains is the norm
• Deployed technology varies by the domains involved
• Complexity is higher than for IP service
FNAL Alternate Path Circuits

Supported since 2004
• But based on end-to-end layer-2 paths

Serve a wide spectrum of experiments
• Usefulness has varied

Implemented on multiple technologies

CMS Tier-2s are heavy users
E2E Circuit Summary

FNAL currently supports E2E circuits to the Tier-0 & Tier-2s
• A few Tier-3s

Today, circuits are largely static configurations

Dynamic circuit services are becoming available
• Driven largely by Internet2 DCN services

Alternate path support services are also emerging
• Lambda Station (FNAL)
• TeraPaths (BNL)

Contact [email protected] for help or information
