
Protocol Analysis in a
Complex Enterprise
April 2nd, 2008
Hansang Bae
Senior VP | Citigroup
SHARKFEST '08
Foothill College
March 31 - April 2, 2008
Challenges:


As it turns out, size does matter!

Citi’s branch network spans 5,000+ locations in the US

Citi’s network infrastructure includes 30,000+ devices

300,000 users located in over 100 countries.
Compliance/Security Quagmire

It’s for your own protection, or so I’m told!

Doing a full packet capture is difficult

Wireshark is the only approved protocol analyzer at Citi. It
dislodged past market leaders.
Challenges (cont’d):

Capturing and Analyzing: Two pieces of the same
puzzle

Enormous amounts of PCAP data are involved.

In most cases, header analysis is adequate.

Wireshark/WinPcap is not well suited to this much volume.

Citi uses a commercial product for packet capturing.
Working with the vendor, it took over three years of
development before it was deemed “Citi-ready”
Example One: Path MTU


Infrastructure size makes it interesting.
Very difficult problem without a proper protocol
analyzer
Example One (Cont’d)

In-depth understanding of routers and protocols was
required.

Usenet to the rescue!

ICMP and IP.ADDR filters were key!

So which side am I on in the “religious debate” about
whether ICMP messages should be included in the “ip.addr”
display filter?
..\..\..\Traces\Consumer\CBNA\ICMPRateLimit.pcap

In retrospect, it was an easy problem to solve. Yet the
sheer size made it difficult to spot.
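The ip.addr debate above exists because an ICMP error message carries a copy of the IP header of the packet that triggered it, and Wireshark dissects that embedded header as well. The sketch below is only an illustration of that layout, not the analysis performed at Citi: it builds a hypothetical "fragmentation needed" message (ICMP type 3, code 4, as used by Path MTU Discovery) and pulls out the next-hop MTU and the embedded addresses. The addresses and MTU value are made up for the example.

```python
import socket
import struct

def build_ipv4_header(src, dst, total_len=1500, proto=6):
    """Build a minimal 20-byte IPv4 header (checksum left at 0; illustration only)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45,        # version 4, header length 5 x 32-bit words
        0,           # DSCP/ECN
        total_len,   # total length
        0,           # identification
        0x4000,      # flags: Don't Fragment set
        64,          # TTL
        proto,       # protocol (6 = TCP)
        0,           # header checksum (not computed in this sketch)
        socket.inet_aton(src),
        socket.inet_aton(dst),
    )

def parse_frag_needed(icmp):
    """Parse an ICMP Destination Unreachable / Fragmentation Needed message.

    Returns None for any other ICMP type/code.
    """
    icmp_type, code, _cksum, _unused, next_hop_mtu = struct.unpack("!BBHHH", icmp[:8])
    if (icmp_type, code) != (3, 4):
        return None
    inner = icmp[8:28]  # embedded IPv4 header of the original packet (no options assumed)
    return {
        "next_hop_mtu": next_hop_mtu,
        "inner_src": socket.inet_ntoa(inner[12:16]),
        "inner_dst": socket.inet_ntoa(inner[16:20]),
    }

# Hypothetical example: a router reports that a 1500-byte packet needs a 1400-byte MTU.
inner_header = build_ipv4_header("10.1.1.10", "10.2.2.20")
icmp_message = struct.pack("!BBHHH", 3, 4, 0, 0, 1400) + inner_header + b"\x00" * 8
info = parse_frag_needed(icmp_message)
print(info)
```

Because the embedded header names the original endpoints, a filter on a host address can also match ICMP errors sent by a router in the middle of the path, which is what makes such messages easy to find or easy to miss depending on the filter used.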
Example Two: Clock Drift

A market-data-driven business complains of extreme
delays from the UK to the US.




At first glance, application logs seem to confirm delays of
200+ ms. RTT is 70 ms.
Because it’s easy, let’s blame the firewall and the network!
SLA tracking and further investigation of routers/switches
gets us nowhere with problem resolution.
Our analysis shows that something is not right!
Example Two (Cont’d)

Due to mismatched traffic flows, the pcap data itself was
unreliable.

For example, we would see an ACK for a packet that had not yet
been delivered. This was traced to the output buffer of the
SPAN on the switch.
The SPAN issue forced us to look at the packets in detail,
including the data timestamp.
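A capture showing an ACK ahead of the data it acknowledges can be checked for mechanically. The following is a minimal sketch under stated assumptions, not the tooling used in the talk: packets are given in capture order as hypothetical (direction, seq, payload_len, ack) tuples, sequence numbers are relative to the first data byte of each side, and wraparound and the handshake are ignored.

```python
def find_early_acks(packets):
    """Return indices of packets whose ACK covers bytes not yet seen in the capture.

    packets: list of (direction, seq, payload_len, ack) in capture order,
             where direction is "A" or "B" (the two endpoints).
    """
    highest_sent = {"A": 0, "B": 0}  # highest seq + len observed from each sender
    early = []
    for index, (direction, seq, length, ack) in enumerate(packets):
        peer = "B" if direction == "A" else "A"
        # An ACK beyond what the peer has been seen sending means the capture
        # order does not reflect the order on the wire.
        if ack > highest_sent[peer]:
            early.append(index)
        highest_sent[direction] = max(highest_sent[direction], seq + length)
    return early

# Hypothetical capture: the third packet ACKs data that only shows up afterwards.
capture = [
    ("A", 0, 100, 0),    # A sends bytes 0..99
    ("B", 0, 0, 100),    # B ACKs up to 100: consistent
    ("B", 0, 0, 300),    # B ACKs up to 300, but bytes 100..299 are not in the capture yet
    ("A", 100, 200, 0),  # the acknowledged data appears late
]
early = find_early_acks(capture)
print(early)
```

Any hit from a check like this is a hint that packet order and timestamps in the capture should not be trusted on their own.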

Example Two (Cont’d)
Charting the pcap timestamp with the data timestamp
showed a peculiar pattern.

[Chart: App Log Delay. Y-axis: Delay in Milliseconds, 0 to 900. X-axis: Packet #, 1 to 9857. Series: Delay @ fe0, Delay @ fe2.]
By spotting the pattern above, we were able to show
the vendor that their clock was drifting!
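One way to put a number on such a pattern is to fit a straight line to the difference between the two timestamps. The sketch below is an assumption about how this could be done, not the method used in the talk: it takes hypothetical lists of capture timestamps and application data timestamps (both in seconds) and returns the least-squares slope of their offset in parts per million. A steady non-zero slope suggests one clock is running faster than the other; a constant offset alone does not.

```python
def drift_ppm(capture_ts, data_ts):
    """Least-squares slope of (data_ts - capture_ts) versus capture_ts, in ppm."""
    n = len(capture_ts)
    offsets = [d - c for c, d in zip(capture_ts, data_ts)]
    mean_x = sum(capture_ts) / n
    mean_y = sum(offsets) / n
    sxx = sum((x - mean_x) ** 2 for x in capture_ts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(capture_ts, offsets))
    return sxy / sxx * 1e6

# Hypothetical data: the application clock runs 50 ppm fast and starts 0.2 s ahead.
capture_times = [0.0, 100.0, 200.0, 300.0]
data_times = [t * (1 + 50e-6) + 0.2 for t in capture_times]
estimated = drift_ppm(capture_times, data_times)
print(round(estimated, 3))
```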
Lessons Learned/Feature Request

A picture really is worth a thousand words.

The two pictures above show the same event!

Bounce diagrams can quickly pinpoint issues.
Lessons Learned (Cont’d)

Allow a zoom-in feature in the bounce diagram for
even easier troubleshooting.


The above shows TCP slow start in action. It’s immediately
obvious what’s going on with one look at the chart!
Increase performance for TCP/IP dissection. Although
Wireshark’s support for protocols is impressive, most
folks in the enterprise deal with TCP/IP problems.
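For readers without the slow-start chart in front of them, the shape it refers to can be reproduced with a toy model. This is an idealized sketch only: it assumes no loss, one window of segments sent per round trip, and a hypothetical slow-start threshold. Real stacks differ in initial window, delayed ACKs and loss recovery.

```python
def slow_start_cwnd(initial_cwnd, ssthresh, rtts):
    """Congestion window (in segments) at the start of each round trip, idealized."""
    cwnd = initial_cwnd
    history = []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: window doubles each RTT
        else:
            cwnd += 1                       # congestion avoidance: one segment per RTT
    return history

# Hypothetical parameters: start at 1 segment, threshold of 16 segments, 8 round trips.
windows = slow_start_cwnd(1, 16, 8)
print(windows)  # [1, 2, 4, 8, 16, 17, 18, 19]
```

Plotted against time, the doubling phase gives the steep staircase that makes slow start recognizable at a glance.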