Transcript Slides

The New Way to Debug Java in Production
Intro for Customer Name
Hi!
Nice to meet you, I’m Nick Durkin,
Monitoring Engineer
Know When, Where and Why Code Breaks in Production
Current Tools – Not Solving Problems in Timely Fashion
DevOps tools for finding root cause
Application Logs: 94%
Profilers: 69%
Database Logs: 68%
Debuggers: 52%
Language’s Built-In Tooling: 46%
Memory Dump Analyzers: 46%
Thread Dump Analyzers: 41%
OS Command Line Tooling: 33%
APM Tools: 13%
Avg. time-to-resolution
<1 hour: 6%
<1 day: 18%
<1 week: 55%
<1 month: 19%
>1 month: 2%
Copyright © 2016 OverOps. All rights reserved.
Source: DZone 2015 guide to performance and monitoring
How it Works
How TripAdvisor Solves Production Issues
Hi!
Nice to meet you, I’m Steve Rogers,
Director of Software Engineering
The world’s largest online resource and category leader for travel destination activities
Viator by the numbers
7M visitors per month
100 AWS Instances
120M API requests per month
3 Front ends
60 Developers in Sydney and SF
20 Backend services
PCI level 1 compliant
Open Source stack
Monitoring Tools
Architecture Overview
Workflow
1. Agile environment
Two-week sprints, 12 teams
2. Error resolution
Each team is encouraged to set aside time in each sprint to diagnose the exceptions it considers critical
3. Each team can change anything
Full-stack teams. No team owns a platform
4. Teams are responsible for their releases
Multiple per sprint, with releases every day
Challenges
1. PCI Level 1 compliant environment
Makes it nearly impossible to replicate issues locally, and limits production access
2. Hard to test
Different data in production and development
3. Legacy code
The original developer has long gone
4. Very noisy logs
Logging ‘Just-in-case’
5. Observability
All teams need knowledge of everything
Error Resolution (Pre-OverOps)
Problem
After a release we saw a sudden spike in 500 errors on one of our APIs, with our logs filling up with:
[21:17:24,668] [ERROR] com.viator.saint.Saint:1342
java.lang.NullPointerException
    at com.viator.lookup.DestinationTaxonomyNode.filtersAreEmpty(DestinationTaxonomyNode.java:1152)
    at com.viator.lookup.DestinationTaxonomyNode.getFilteredNode(DestinationTaxonomyNode.java:1171)
    at com.viator.lookup.DestinationTaxonomyNode.getFilteredNode(DestinationTaxonomyNode.java:1141)
    ...
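When logs fill up with traces like this one, reducing each trace to its top frame makes repeated errors easier to count. The sketch below is illustrative only: TopFrame and its regular expression are assumptions, not part of Viator's or OverOps' tooling, and it assumes the standard "at package.Class.method(File.java:line)" frame format shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the first "at ..." frame from a stack trace.
final class TopFrame {
    // Matches frames such as: at com.example.Foo.bar(Foo.java:42)
    private static final Pattern FRAME = Pattern.compile(
        "^\\s*at\\s+([\\w.$]+)\\.([\\w$<>]+)\\(([^:()]+):(\\d+)\\)\\s*$",
        Pattern.MULTILINE);

    // Returns "fully.qualified.Class.method:line" for the first frame, if any.
    static Optional<String> of(String stackTrace) {
        Matcher m = FRAME.matcher(stackTrace);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "." + m.group(2) + ":" + m.group(4));
    }
}
```

Grouping log entries by this key would show how many of the errors come from the same line, though it still says nothing about which value was null.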
Error Resolution (Pre-OverOps)
private boolean filtersAreEmpty(FilterCollection filters) {
    return
        isFilterEmpty(filters) &&
        isFilterEmpty(filters.getSubFilters()) &&
        isFilterEmpty(filters.getSiteParentFilters()) &&
        (filters.getPatternMatchedFilters() == null ||
         filters.getPatternMatchedFilters().isEmpty());
}
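The method dereferences filters several times, and the trace alone does not say which reference was null. A minimal sketch of a more defensive version follows. It rests on assumptions: FilterCollection here is a simplified, hypothetical stand-in with two list-valued getters (the real class is not shown in the slides, and the getSubFilters and getSiteParentFilters checks are omitted), and isNullOrEmpty is an illustrative helper, not Viator's isFilterEmpty.

```java
import java.util.Collection;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for the real FilterCollection, which the slides do not show.
final class FilterCollection {
    private final List<String> filters;
    private final List<String> patternMatchedFilters;

    FilterCollection(List<String> filters, List<String> patternMatchedFilters) {
        this.filters = filters;
        this.patternMatchedFilters = patternMatchedFilters;
    }

    List<String> getFilters() { return filters; }
    List<String> getPatternMatchedFilters() { return patternMatchedFilters; }
}

final class FilterChecks {
    // Treats a null collection the same as an empty one.
    static boolean isNullOrEmpty(Collection<?> c) {
        return c == null || c.isEmpty();
    }

    // Fails fast with a named message if the argument itself is null,
    // so the log states which reference was missing.
    static boolean filtersAreEmpty(FilterCollection filters) {
        Objects.requireNonNull(filters, "filters must not be null");
        return isNullOrEmpty(filters.getFilters())
            && isNullOrEmpty(filters.getPatternMatchedFilters());
    }
}
```

Whether a null should count as "empty" or be rejected is a design decision for the real code. The sketch only shows that making the choice explicit turns an anonymous NullPointerException into a message that names the culprit.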
Error Resolution (Pre-OverOps)
How we had to solve it:
1. Rolled back the release
2. Looked in the logs and code, and nothing was obvious :(
3. Created a new hotfix release with extra logging
4. Released the new version
5. Waited for the issue to reproduce (it did not take long)
6. Got the new logs and rolled back the release
7. Fixed the issue
8. Finally, released the new version
That took 3 full days. Ouch!
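The "extra logging" hotfix in step 3 might look something like the sketch below. This is not Viator's actual hotfix: DiagnosticLogging and withInputLogged are hypothetical names, and the sketch assumes java.util.logging as the logging API. It wraps a call so that the input is written to the log whenever the call throws.

```java
import java.util.function.Function;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for a temporary diagnostic release.
final class DiagnosticLogging {
    private static final Logger LOG =
        Logger.getLogger(DiagnosticLogging.class.getName());

    // Runs fn on input; on failure, logs the input that triggered it and rethrows.
    static <T, R> R withInputLogged(String label, T input, Function<T, R> fn) {
        try {
            return fn.apply(input);
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, label + " failed for input: " + input, e);
            throw e;
        }
    }
}
```

The cost is the one the steps above describe: every new question about program state means another release, another wait for the error, and another rollback. In a PCI environment, anything logged this way would also need to exclude cardholder data.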
Monitoring Today (or How We Solve Issues)
Dev Tools in monitoring stack:
▶ New Relic - For performance investigation
▶ OverOps - For error investigation
▶ Rigor - End user performance
▶ Home-grown tools
What we do after each deployment
▶ Smoke Tests
▶ Dashboard checks (OverOps, New Relic)
▶ Real-time alerting of issues
We use in-house tools for the basics, and best-of-breed tools for the hard stuff
Problem Resolution (Post-OverOps)
Problem
After a release we saw a performance degradation
How we solved it:
1. Of course we started to look at New Relic, but nothing was obvious.
2. Then we received an email from OverOps with the following:
This showed that many gets from a cache were failing, so objects were being re-generated
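A cache whose gets keep failing looks, from the outside, like plain slowness, because every miss silently rebuilds the object. A minimal sketch of how that could be made visible is to count hits and misses. CountingCache is a hypothetical class written for illustration, not the cache Viator uses, and it assumes the loader never returns null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical cache wrapper that counts hits and misses.
final class CountingCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    // Returns the cached value, or builds and stores it on a miss.
    // Two threads missing on the same key may both run the loader.
    V get(K key, Function<K, V> loader) {
        V cached = store.get(key);
        if (cached != null) {
            hits.incrementAndGet();
            return cached;
        }
        misses.incrementAndGet();
        V built = loader.apply(key); // assumed to be non-null
        store.put(key, built);
        return built;
    }

    // Fraction of lookups that missed; 0.0 before any lookup.
    double missRatio() {
        long h = hits.get();
        long m = misses.get();
        long total = h + m;
        return total == 0 ? 0.0 : (double) m / total;
    }
}
```

A miss ratio that jumps after a release would point at the same problem the OverOps email surfaced, without anyone having to guess that the cache was involved.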
Why OverOps for Viator
1. Variable values across the entire call stack in production for each error
2. Fixing hard-to-solve issues in minutes, without having to reproduce locally
3. Makes the barrier to fixing bugs lower. Positive small bug ROI
4. Helps drive a TDD approach to fixing bugs for legacy code
“Short of attaching a debugger in production, OverOps is the next best thing.”
Thanks!
Questions?
www.overops.com
Free 14-day trial