Transcript Slide 1
Protocol Analysis in a
Complex Enterprise
April 2nd, 2008
Hansang Bae
Senior VP | Citigroup
SHARKFEST '08
Foothill College
March 31 - April 2, 2008
Challenges:
As it turns out, size does matter!
Citi’s branch network spans 5,000+ locations in the US
Citi’s network infrastructure includes 30,000+ devices
300,000 users located in over 100 countries.
Compliance/Security Quagmire
It’s for your own protection, or so I’m told!
Doing a full packet capture is difficult
Wireshark is the only approved protocol analyzer at Citi. It
dislodged past market leaders.
Challenges (cont’d):
Capturing and Analyzing: Two pieces of the same
puzzle
Enormous amounts of PCAP data are involved.
In most cases, header analysis is adequate.
Wireshark/WinPcap is not well suited to this much volume
Citi uses a commercial product for packet capturing.
Working with the vendor, it took over three years of
development before it was deemed “Citi-ready”
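Since header analysis is usually adequate, one way to cut the volume is to keep only the first bytes of every packet. The following is a minimal sketch of that idea, not the commercial product Citi uses. It assumes classic little-endian pcap files with microsecond timestamps; the name truncate_pcap and the 96-byte default are illustrative choices.

```python
import struct

# Classic pcap global header: magic, version major/minor, thiszone,
# sigfigs, snaplen, link type (24 bytes in total).
PCAP_GLOBAL_HDR = struct.Struct("<IHHiIII")
# Per-packet record header: ts_sec, ts_usec, incl_len, orig_len (16 bytes).
PCAP_RECORD_HDR = struct.Struct("<IIII")
PCAP_MAGIC_LE = 0xA1B2C3D4


def truncate_pcap(data: bytes, snaplen: int = 96) -> bytes:
    """Return a copy of a pcap byte string with every packet cut to snaplen bytes.

    Hypothetical helper for illustration. orig_len is preserved so an
    analyzer can still report the on-the-wire size of each packet.
    """
    magic, vmaj, vmin, zone, sigfigs, _old_snaplen, linktype = (
        PCAP_GLOBAL_HDR.unpack_from(data, 0)
    )
    if magic != PCAP_MAGIC_LE:
        raise ValueError("this sketch only handles little-endian microsecond pcap")

    out = [PCAP_GLOBAL_HDR.pack(magic, vmaj, vmin, zone, sigfigs, snaplen, linktype)]
    offset = PCAP_GLOBAL_HDR.size
    while offset + PCAP_RECORD_HDR.size <= len(data):
        ts_sec, ts_usec, incl_len, orig_len = PCAP_RECORD_HDR.unpack_from(data, offset)
        offset += PCAP_RECORD_HDR.size
        payload = data[offset:offset + incl_len]
        offset += incl_len
        kept = payload[:snaplen]  # headers only; the rest of the payload is dropped
        out.append(PCAP_RECORD_HDR.pack(ts_sec, ts_usec, len(kept), orig_len))
        out.append(kept)
    return b"".join(out)
```

For Ethernet/IPv4/TCP traffic, 96 bytes normally covers the link, network, and transport headers with room for common TCP options; a larger value may be needed where tunnels or long option lists are involved.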
Example One: Path MTU
Infrastructure size makes it interesting.
Very difficult problem without a proper protocol
analyzer
Example One (cont’d)
In-depth understanding of routers and protocols was
required.
Usenet to the rescue!
ICMP and IP.ADDR filters were key!
So which side am I on in the “religious debate” about
whether ICMP messages should be included in the “ip.addr”
display filter?
..\..\..\Traces\Consumer\CBNA\ICMPRateLimit.pcap
In retrospect, it was an easy problem to solve. Yet the
sheer size made it difficult to spot.
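Some background on why ICMP matters here: an ICMP error quotes the IP header of the packet that triggered it (RFC 792), so the affected hosts' addresses appear inside the ICMP payload even though the outer source is a router. For Path MTU discovery the relevant message is Destination Unreachable, code 4 ("fragmentation needed"), which carries the next-hop MTU (RFC 1191). The sketch below pulls those fields out of a raw IPv4 packet. The function name parse_frag_needed is hypothetical, and the code assumes the bytes start at the IPv4 header.

```python
import socket
import struct


def parse_frag_needed(ip_packet: bytes):
    """Return (router, next_hop_mtu, inner_src, inner_dst), or None.

    None means the packet is not an IPv4 ICMP "fragmentation needed"
    message (type 3, code 4) or is too short to contain the quoted header.
    """
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        return None
    ihl = (ip_packet[0] & 0x0F) * 4  # outer header length in bytes
    if ip_packet[9] != 1:            # IP protocol 1 = ICMP
        return None

    icmp = ip_packet[ihl:]
    if len(icmp) < 8 + 20:           # ICMP header plus quoted IPv4 header
        return None
    icmp_type, icmp_code, _cksum, _unused, mtu = struct.unpack_from("!BBHHH", icmp, 0)
    if (icmp_type, icmp_code) != (3, 4):
        return None

    inner = icmp[8:]                 # quoted header of the original packet
    router = socket.inet_ntoa(ip_packet[12:16])
    inner_src = socket.inet_ntoa(inner[12:16])
    inner_dst = socket.inet_ntoa(inner[16:20])
    return router, mtu, inner_src, inner_dst
```

The inner addresses are the ones a troubleshooter usually cares about, which is the practical argument for a host filter that also matches the quoted header.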
Example Two: Clock Drift
MarketData-driven business complains of extreme
delays from the UK to the US.
At first glance, application logs seem to confirm delays of
200+ ms. RTT is 70 ms.
Because it’s easy, let’s blame the firewall and the network!
SLA tracking and further investigation of routers/switches
gets us nowhere with problem resolution.
Our analysis shows that something is not right!
Example Two (cont’d)
Due to mismatched traffic flow, the pcap data itself yields
unreliable results.
For example, we would see an ACK for a packet that was not yet
delivered. This was traced to the output buffer of the
SPAN on the switch.
The SPAN issue forced us to look
at the packets in detail, including the
data timestamp
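The symptom described above, an ACK that shows up in the trace before the data it acknowledges, can be checked mechanically. The following is a minimal sketch under stated assumptions: packets have already been reduced to (direction, seq, payload_len, ack) tuples in capture order, direction is the label "A" or "B", and sequence-number wraparound and SYN/FIN accounting are ignored. The tuple layout and the name find_premature_acks are hypothetical.

```python
def find_premature_acks(packets):
    """Return indexes of packets whose ACK covers bytes not yet seen in the trace.

    packets: iterable of (direction, seq, payload_len, ack) in capture order,
    where direction is "A" or "B". A flagged index means the capture order
    cannot match the order on the wire, as with a SPAN port that buffers
    one direction longer than the other.
    """
    highest_end = {"A": None, "B": None}  # highest seq + len seen per sender
    flagged = []
    for index, (direction, seq, length, ack) in enumerate(packets):
        peer = "B" if direction == "A" else "A"
        seen = highest_end[peer]
        # Nothing can be judged until at least one packet from the peer was seen.
        if seen is not None and ack > seen:
            flagged.append(index)
        end = seq + length
        if highest_end[direction] is None or end > highest_end[direction]:
            highest_end[direction] = end
    return flagged
```

A few flagged packets in an otherwise healthy flow point at the capture path rather than at the hosts, which is a reason to stop trusting inter-packet timing from that capture point.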
Example Two (cont’d)
Charting the pcap timestamp with the data timestamp
showed a peculiar pattern.
[Chart: App Log Delay. Y-axis: Delay in Milliseconds (0 to 900). X-axis: Packet # (1 to 9857). Series: Delay @ fe0, Delay @ fe2.]
By spotting the pattern above, we were able to show
the vendor that their clock was drifting!
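The comparison behind the chart can also be put into numbers. The sketch below fits a straight line to (application timestamp minus capture timestamp) against capture time; a slope that is clearly nonzero suggests one clock is running faster than the other. This is an illustration of the idea, not the analysis actually performed. The function name estimate_drift is hypothetical, and both inputs are assumed to be in seconds.

```python
def estimate_drift(capture_ts, app_ts):
    """Least-squares fit of (app_ts - capture_ts) against elapsed capture time.

    Returns (slope, intercept): slope is the drift in seconds per second,
    intercept is the estimated offset at the first capture timestamp.
    """
    if len(capture_ts) != len(app_ts) or len(capture_ts) < 2:
        raise ValueError("need two equally long series with at least two samples")

    t0 = capture_ts[0]
    xs = [t - t0 for t in capture_ts]                 # elapsed capture time
    ys = [a - c for a, c in zip(app_ts, capture_ts)]  # apparent delay per packet

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("capture timestamps are all identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Multiplying the slope by 1,000,000 gives the drift in parts per million. A real network delay would show up as a change in the intercept, not as a steady slope.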
Lessons Learned/Feature Request
A picture really is worth a thousand words.
The two pictures above show the same event!
Bounce diagrams can quickly pinpoint issues.
Lessons Learned (cont’d)
Allow a zoom-in feature from the bounce diagram for
even easier troubleshooting.
The above shows the slow start in action. It’s immediately
obvious what’s going on with one look at the chart!
Increase performance for TCP/IP dissection. Although
Wireshark’s support for protocols is impressive, most
folks in the enterprise deal with TCP/IP problems.