Investigating AR System performance issues


Investigating performance outside of workflow
and indexes
Danny Kellett
Java System Solutions
Husband and father
• Worked for Remedy / Peregrine / BMC from 1999 to 2007
• BSM ITSM Solution Architect / Consultant
• Single Sign-On architect for Java System Solutions
© 2013 WWRUG Canada Inc. All Rights Reserved
Warning: there is a lot of content in this presentation!
This is intentional. Not all of this information will mean anything yet, but if and when you decide to look at your own system, this presentation will hopefully fill in the knowledge gaps.
Feel free to email me at [email protected]
Agenda
Latency
- What is it and why is it such a big deal?
- How do I test for it?
- How to use this information to demonstrate performance for your users
Queues & Threads
- What are they?
- How to read the logs
- How to let your AR Server tell you how many it needs
Plugins
- What types exist?
- How they can impact performance
- The process of diagnosis and fix
Objectives and Results
Objectives
- To understand that no matter how many CPUs and how much RAM you have, a poor network can bring an application to its knees
- Demystify the confusion about queues and threads
- Understand the black boxes called plugins
Results
- Understand what the above means and give you the tools and knowledge to understand your own AR System
Skills developed
- To sit through a very technical and possibly boring tech talk and live to talk about it
- The knowledge to understand the “not so well documented” parts of the AR System
Latency
“Higher latency decreases app response time, user performance and perceived app quality”
Latency :: What Is It And Why Is It Such A Big Deal?
What is it?
- In a network, latency, a synonym for delay, is an expression of how much time it takes for a packet of data to get from one designated point to another. There are two typical types:
  • One way
    – The time from the source sending a packet to the destination receiving it
  • Round trip
    – The one-way latency from source to destination plus the one-way latency from the destination back to the source
Latency is not bandwidth
- Two key elements of network performance are bandwidth and latency. The average person is more familiar with the concept of bandwidth, as that is the one advertised by manufacturers of network equipment. However, latency matters equally to the end user experience
Latency :: What Is It And Why Is It Such A Big Deal?
Why is it such a big deal to us and BMC Software?
- The BMC AR System architecture has multiple network node points
  • Each line in the diagram is affected by latency
  • If each line adds even a few milliseconds, it all adds up!
  • If any point adds a delay then the whole “trip” is affected
The AR System API (ARAPI) is very “chatty”
- Instead of eating with a big spoon it eats with a small spoon
Latency :: What Is It And Why Is It Such A Big Deal?
Real life example
- ITSM 7.6.04 SP2, load balanced environment as in the diagram. Browser was Firefox v21.0. URL is HTTPS
Test
- Incident console, double clicking an incident.
Baseline (Initial, first load example)
- Took a measurement on the current network to get the number of trips, amount of data and response time in seconds
Second test with latency
- With added latency of 50ms (0.05s) from the client (browser on desktop) through the Load Balancer and to one of the Mid Tier servers
Latency :: What Is It And Why Is It Such A Big Deal?
Results of Baseline (Initial, first load example)
                    # Trips   Amount of Data
Client to server    65        126.6k
Server to client    103       318.6k
Total               168       445.2k
Time (seconds): 9s
Results of test with 50ms added latency
                    # Trips   Amount of Data
Total               168       445.2k
Time (seconds): 12s
Test summary
Just 50ms of network latency in just one piece of the BMC architecture, from the browser to the Mid Tier, can add ⅓ to your end user response times!
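The arithmetic behind the "⅓" figure can be sketched in a few lines of Python. The function names are illustrative and not part of any BMC or JSS tool. The per-trip figure is only an observation on the numbers above: the extra 3 seconds spread over the 65 client-to-server trips works out at about 46 ms each, close to the 50ms that was injected. Treat that as a rough sanity check, not as a statement of how the browser schedules its requests.

```python
def relative_slowdown(baseline_s: float, delayed_s: float) -> float:
    """Fraction by which the delayed run is slower than the baseline."""
    return (delayed_s - baseline_s) / baseline_s


def added_delay_per_trip_ms(baseline_s: float, delayed_s: float, trips: int) -> float:
    """Average extra time per trip, in milliseconds."""
    return (delayed_s - baseline_s) * 1000.0 / trips


# Figures taken from the test results above.
BASELINE_S = 9
WITH_50MS_S = 12
CLIENT_TO_SERVER_TRIPS = 65

slowdown = relative_slowdown(BASELINE_S, WITH_50MS_S)
per_request_ms = added_delay_per_trip_ms(BASELINE_S, WITH_50MS_S, CLIENT_TO_SERVER_TRIPS)

print(f"Response time grew by {slowdown:.0%}")
print(f"Extra delay per client-to-server trip: {per_request_ms:.1f} ms")
```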
Latency :: How Did I Do Those Tests?
Before you start, understand these things
- What happens when you type the Mid Tier address in the URL bar:
  • Your browser will use your desktop network configuration to get the network details of your Mid Tier.
  • First it will look at your local hosts file for the Mid Tier host name.
  • If it is not in there it will ask the Domain Name System (DNS).
  • If your Mid Tier’s IP address is configured in your DNS database, then the browser will connect to it and everything works, you see the application etc.
  • BUT if you added a line in your local hosts file (c:\windows\system32\drivers\etc\hosts) so that your desktop believes it’s not that IP address but a different one, e.g. 127.0.0.1, then the browser will try to connect to that instead.
  • 127.0.0.1 is the loopback address, which basically means the machine you are typing on. Unless you have a Mid Tier running on your machine, the connection will fail.
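To check what a hosts file would answer for a given host name, a small parser is enough. This is a minimal sketch that only models the hosts-file step of the lookup described above (it does not query DNS). The function name and the sample file content are illustrative; try.onbmc.com is used purely as an example host name.

```python
from typing import Optional


def lookup_in_hosts(hosts_text: str, hostname: str) -> Optional[str]:
    """Return the IP a hosts file maps `hostname` to, or None if absent."""
    wanted = hostname.lower()
    for raw_line in hosts_text.splitlines():
        # Everything after '#' is a comment in a hosts file.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        if wanted in (name.lower() for name in names):
            return ip
    return None


# Illustrative hosts file content, not copied from a real machine.
SAMPLE_HOSTS = """\
# local overrides
127.0.0.1   localhost
127.0.0.1   try.onbmc.com   # point the Mid Tier name at this desktop
"""

print(lookup_in_hosts(SAMPLE_HOSTS, "try.onbmc.com"))
print(lookup_in_hosts(SAMPLE_HOSTS, "midtier.example.com"))
```

On a real desktop the text would come from reading c:\windows\system32\drivers\etc\hosts (or /etc/hosts on *NIX). A result of None means the lookup falls through to DNS.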
Latency :: How Did I Do Those Tests?
Before you start, understand these things
- What happens when you type the Mid Tier address in the URL bar (part 2)
  • What if you had a Mid Tier on your desktop and you added the same Mid Tier host name to your local hosts file with 127.0.0.1?
    – Your browser would still display the correct URL address but you would be connecting to the Mid Tier on your desktop and not the one on the network.
- OK so why do I need to know that?
  • If you installed a piece of software on your desktop that wasn’t a Mid Tier but something that connected to your REAL Mid Tier on the network, BUT delayed all connections, adding latency, then this is called a proxy, and this is what I used. Confused? See the next slide for a diagram
Latency :: How Did I Do Those Tests?
Add a proxy on the desktop to simulate latency
Latency :: How Did I Do Those Tests?
Find your Mid Tier’s real IP using ping or nslookup
Insert that IP value into the proxy app as the MAP IP
Add the Mid Tier URL host name to the loopback address in your local hosts file
- 127.0.0.1 try.onbmc.com
Click Start on the proxy. Use your browser as before and record the timings.
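For readers who want to see what such a proxy does internally, here is a minimal sketch in Python using asyncio. It is not the JSS tool and makes no claim about how that tool is implemented. It listens on the loopback address, relays raw bytes to the real Mid Tier IP, and holds every chunk back by a fixed one-way delay in each direction. Because it only relays bytes, HTTPS passes through untouched. The function names and parameters are assumptions made for this illustration, and delaying per chunk is only a crude approximation of real network latency.

```python
import asyncio


async def _pump(reader, writer, delay_s):
    """Copy bytes one way, holding each chunk back by delay_s seconds."""
    try:
        while True:
            chunk = await reader.read(65536)
            if not chunk:              # the sending side closed the connection
                break
            await asyncio.sleep(delay_s)
            writer.write(chunk)
            await writer.drain()
    except ConnectionError:
        pass                           # one side dropped; stop copying
    finally:
        writer.close()


async def start_delay_proxy(listen_port, target_ip, target_port, one_way_ms):
    """Listen on 127.0.0.1:listen_port and relay to target_ip:target_port.

    Returns the asyncio server object; the caller keeps the event loop alive.
    """
    delay_s = one_way_ms / 1000.0

    async def on_client(client_reader, client_writer):
        try:
            up_reader, up_writer = await asyncio.open_connection(target_ip, target_port)
        except OSError:
            client_writer.close()      # real server unreachable
            return
        # Relay both directions at once, each with the same one-way delay.
        await asyncio.gather(
            _pump(client_reader, up_writer, delay_s),
            _pump(up_reader, client_writer, delay_s),
        )

    return await asyncio.start_server(on_client, "127.0.0.1", listen_port)
```

To use it, a small driver would await start_delay_proxy(443, "<real Mid Tier IP>", 443, 50) and then call serve_forever() on the returned server. Binding to port 80 or 443 usually needs administrator rights.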
Latency :: How Can This Predict Your Response Times?
Obtain your users’ latency times to the Mid Tier server.
            Latency (ms)   Open Incident
London      15
Paris       45
Houston     484
Latency :: How Can This Predict Your Response Times?
Obtain your users’ latency times to the Mid Tier server.
- Using ping, which uses ICMP but is sometimes turned off on network equipment
- Or http-ping with the free JSS Network Simulator
- Those times are round trip, so the time taken from client to server AND back from server to client. Therefore when testing, halve those values!
- E.g. 156 / 2 = 78
- The one-way latency is 78ms
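Where ICMP ping is blocked, timing a TCP connection is one possible substitute. The sketch below is an assumption-laden alternative, not the http-ping method mentioned above: opening a TCP connection takes roughly one network round trip, plus name resolution time if a host name rather than an IP is given. The function names are illustrative.

```python
import socket
import time


def one_way_ms(round_trip_ms: float) -> float:
    """Halve a round-trip time to estimate the one-way latency."""
    return round_trip_ms / 2


def tcp_connect_rtt_ms(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Time how long it takes to open (and immediately close) a TCP connection.

    This approximates one round trip. It raises OSError if the host cannot
    be reached, so callers should be ready to handle that.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


# The worked example from the slide: a 156ms round trip.
print(one_way_ms(156))
```

In practice it is worth taking several samples with tcp_connect_rtt_ms and using the median, since a single connection can be slowed by unrelated events.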
Latency :: How Can This Predict Your Response Times?
Screenshot of free JSS app
https://www.javasystemsolutions.com/download/networksim/jss-networksim.zip
Example data for onbmc.com
Latency :: How Can This Predict Your Response Times?
Using free JSS tools, you can test the response times of all your users from your own desktop and, more importantly, before your users do
            Latency (ms)   Open Incident
London      15             6s
Paris       45             8s
Houston     484            30s
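With a few measured points like these, a straight-line fit gives a rough way to predict the response time for a site you have not tested yet. This is a sketch under a strong assumption: that response time grows linearly with latency, which three data points cannot prove. The function names are illustrative, and the latency values are assumed to be in milliseconds.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# (latency, seconds to open an incident) from the table above.
MEASUREMENTS = [(15, 6), (45, 8), (484, 30)]

slope, intercept = fit_line(MEASUREMENTS)


def predict_open_incident_s(latency_ms: float) -> float:
    """Predicted time to open an incident for a site with this latency."""
    return intercept + slope * latency_ms


print(f"Each extra ms of latency adds about {slope * 1000:.0f} ms to the action")
print(f"Predicted time at 100 ms latency: {predict_open_incident_s(100):.1f} s")
```

The intercept is the part of the response time that latency does not explain (server work, rendering), so it is the floor that network changes alone cannot beat.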
Latency :: If Your Latency Is High
Speak to your network teams about Quality of Service (QoS)
- Some network equipment can prioritise certain protocols. The Mid Tier uses either HTTP, typically on port 80, or HTTPS, typically on port 443. Have these prioritised if possible
Make sure your architecture has as little latency as possible between the Mid Tiers and AR Servers and, more importantly, between the AR Server and the database.
- ITSM 7.6.04 with approx 900 concurrent users fires approx 127 SQL statements per second at the database. High latency would bring the app to its knees!
Install local Mid Tier instances near your end users.
- There is more traffic between the browser and the Mid Tier than there is from the Mid Tier to the AR Server.
Make customisations to workflow to remove trips altogether.
- Tell story at large outsourcer
Threads & Queues
Threads :: Let’s All Get Up To Speed On Queues & Threads
A queue is an entry point into the AR System
- They are identified by a number, and sometimes referred to as an RPC queue. Here are some examples
  • 390600 = Admin
  • 390603 = Escalation
  • 390620 = Fast API calls (just a name without intending to indicate performance)
  • 390635 = List API calls (just a name as well but was aimed at things that search and return lists/large amounts of data)
A queue can have one or more threads defined for it. On startup, each thread creates a connection to the database that it uses throughout its existence. Threads only close when you shut down the server or when it cannot connect to the database
One queue has one or more threads
Threads :: Let’s All Get Up To Speed On Queues & Threads
If an API call gets routed to a queue and all the current threads are being used, the server will look at the Max Threads value configured for that RPC queue. If the current thread count is lower, then it will create another thread and use that.
If the List queue is at max resource, it will put the work on the Fast queue and vice versa
If both are full, it will move the work to the Admin thread
The AR System has a set of queues -- some pre-defined, some private and defined per instance -- and each of them has a number of processing threads as configured in the ar.cfg/ar.conf
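The routing rules on this slide can be modelled in a few lines. This is a toy model of the behaviour as described here, not the AR Server's actual implementation: it ignores Min Threads, private queues and work that waits in a queue, and the class and function names are invented for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Queue:
    name: str
    max_threads: int
    started: int = 0   # threads created so far
    busy: int = 0      # threads currently doing work

    def try_take(self) -> bool:
        """Claim a thread for one call; grow the pool up to Max Threads."""
        if self.busy < self.started:          # an idle thread already exists
            self.busy += 1
            return True
        if self.started < self.max_threads:   # allowed to create another thread
            self.started += 1
            self.busy += 1
            return True
        return False                          # queue is at max resource


def route(call_queue: str, fast: Queue, list_q: Queue, admin: Queue) -> Optional[str]:
    """Return the name of the queue that ends up handling the call."""
    primary, other = (fast, list_q) if call_queue == "fast" else (list_q, fast)
    # Own queue first, then the sibling queue, then Admin as the last resort.
    for queue in (primary, other, admin):
        if queue.try_take():
            return queue.name
    return None  # everything is busy; in this toy model the call is not served


fast = Queue("fast", max_threads=1)
list_q = Queue("list", max_threads=1)
admin = Queue("admin", max_threads=1)
handled_by = [route("list", fast, list_q, admin) for _ in range(3)]
print(handled_by)
```

With one thread per queue, three List calls in a row land on List, then Fast, then Admin, which is the overflow path described above.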
Threads :: Confusing Or Mixed Messages
Search: Google/BMC Support/ARSlist/BMC Communities
BMC Atrium Core 8.0.00_20120921_docs.pdf
- Page 87: set Min Threads to 5 and Max Threads to 10
- Page 262, same information repeated on 356, 383
  • Fast threads: At minimum, the same number as you have CPU cores; at maximum, 3 times the number of CPU cores, but not exceeding 32
  • List threads: At minimum, the same number as you have CPU cores; at maximum, 5 times the number of CPU cores, but not exceeding 32
- Page 2048
  • CPU x 1.5 for the Private queue.
SW00427239 - Fast and List threads are not set as per the recommended Queue settings.
Doug Mueller ARSList post - In theory, there is no reason you cannot have 10s or even 100s of threads in a queue.
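To compare the quoted Fast and List guidance with what your own logs show, it helps to turn it into numbers. The helper below only encodes the two rules quoted above (minimum equal to the core count, maximum a multiple of the core count capped at 32). The names are illustrative, and capping the minimum at 32 as well is my assumption for machines with more than 32 cores, since the quoted text does not cover that case.

```python
FAST_MAX_MULTIPLIER = 3   # "3 times the number of CPU cores"
LIST_MAX_MULTIPLIER = 5   # "5 times the number of CPU cores"
THREAD_CAP = 32           # "not exceeding 32"


def documented_thread_range(cpu_cores: int, max_multiplier: int, cap: int = THREAD_CAP):
    """Return (min_threads, max_threads) according to the quoted guidance."""
    minimum = min(cpu_cores, cap)  # assumption: never suggest a minimum above the cap
    maximum = min(cpu_cores * max_multiplier, cap)
    return minimum, maximum


for cores in (4, 8, 16):
    fast_range = documented_thread_range(cores, FAST_MAX_MULTIPLIER)
    list_range = documented_thread_range(cores, LIST_MAX_MULTIPLIER)
    print(f"{cores} cores -> Fast {fast_range}, List {list_range}")
```

These figures are a starting point for comparison only. The slides that follow argue that the right values should come from measuring the running system.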
Threads :: Confusing Or Mixed Messages
I do not believe the number of threads should be based on the number of CPUs alone. In all my tests, the CPU usage never rose over 55% (excluding Developer Studio work)
If the infrastructure can handle it, why not? E.g. an MSSQL database has a maximum of 32767 connections. So theoretically, if the AR Server could open that many connections and process them, then why not?
Just for now, consider a connection to the database that is held up by a long query: the CPU on the AR Server is still the quickest component in the architecture, so it has to wait and will therefore do “other things”
It’s like saying a car can only handle 100BHP
- Sure it can handle more if the rest of the car components and driver can handle it!
Threads :: In My Experience The Answer Is…
There are so many variables that make your system unique
- CPU types, HT, SMT, cores etc
  • Virtual CPU vs a bare metal CPU differs
  • http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-LatencySensitive-Workloads.pdf
  • http://scn.sap.com/thread/1646435
- Virtualisation or bare metal
- Operating systems and settings
Therefore after 14 years of experience, hours and hours of research, “Googling” the WHOLE internet on system architecture, and posting and reading on so many forums, my answer is:
- Every environment is different, so suck it and see and get the system to tell you
- And here’s how I do it.
Threads :: Here’s How :: Simple Principles
In a queue, if all the threads are constantly busy then you need to do some investigation
- If you already have a high number of threads and they are taking too long to complete, investigate the queries to the database and work with your DBA to speed them up
- If the above doesn’t work, or the DBA doesn’t want to play ball, then it’s time to increase the Max Threads count in the AR System Administration Console
Each queue and thread takes system resources such as CPU cycles and memory. If those resources are maxed out, then it’s time for an upgrade or to add another AR Server to the server group
There is no substitute for capacity management and load sharing
Threads :: Here’s How :: Step 1 – Create Logs
In case you didn’t know, you can write multiple log types to one file. Therefore start a combined log of API, Escalation, SQL and Filter as required
Run this log in your peak periods if you can. Or if you are not live, then use a volume and performance testing application
This log will get large, so make sure you have enough disk space and use your best judgement regarding the log file size and the amount of time you leave it on. Monitor it; do not turn it on and go home.
Threads :: Here’s How :: Step 2 – Run Log Analyzer
There are a couple of tools you can use
- ARLogAnalyzer
  • https://communities.bmc.com/docs/DOC-2973
- AR Log File Analyser
  • http://www.missingpiecessoftware.com/products/ar-log-file-analyser
Threads :: Here’s How :: Step 2 – Run Log Analyzer
Understanding the detail in the AR logs
- Green -- The same user
- Purple -- The same RPC 390620 which is the Fast queue
- Red -- Two RPC IDs, meaning two different API calls being processed
  • Every call that the dispatcher thread receives is assigned an RPC ID that can be used to identify the call from the time the call is placed into the queue until a response is sent back to the client
- Cyan -- One thread executing both API calls one after the other
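The highlighted fields can also be pulled out of a log line programmatically. The sketch below assumes each field appears as an angle-bracketed "<Name: value>" tag, which matches the TID and Client-RPC examples quoted in this deck. The sample line itself is made up for illustration, and the exact layout of your log may differ between AR System versions.

```python
import re

# Matches tags such as "<TID: 0000000130>" or "<Client-RPC: 390620   >".
TAG = re.compile(r"<([A-Za-z][A-Za-z -]*?):\s*([^>]*?)\s*>")


def parse_tags(line: str) -> dict:
    """Return the <Name: value> tags of one log line as a dictionary."""
    return {name: value for name, value in TAG.findall(line)}


# Illustrative line only; not copied from a real log.
SAMPLE_LINE = (
    "<API > <TID: 0000000130> <RPC ID: 0000021396> <Queue: Fast      > "
    "<Client-RPC: 390620   > <USER: Demo      > "
    "/* Mon Jun 03 2013 13:08:16.7440 */ +GLE ARGetListEntry"
)

tags = parse_tags(SAMPLE_LINE)
print(tags["USER"], tags["Client-RPC"], tags["RPC ID"], tags["TID"])
```

Grouping lines by the "RPC ID" value follows one API call from start to finish, while grouping by "TID" shows everything a single thread worked on.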
Threads :: Here’s How :: Step 3 – What To Look For?
Verify the number of threads that are actually running. If the numbers in the log file match the Max Threads then you know at some point all threads were utilised
Both ARLogAnalyzer and AR Log File Analyser show the same results:
- Fast Max is 30 but 27 being used
- List Max is 40 but 35 being used
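If neither analyser is to hand, the same check can be approximated by counting the distinct thread IDs seen per queue. This sketch assumes the "<TID: ...>" and "<Client-RPC: ...>" tags shown elsewhere in this deck and treats the Client-RPC number as the queue, which is a simplification because work can overflow to another queue. The sample lines and function names are invented for the illustration.

```python
import re
from collections import defaultdict

TID = re.compile(r"<TID:\s*(\d+)\s*>")
CLIENT_RPC = re.compile(r"<Client-RPC:\s*(\d+)\s*>")


def threads_seen_per_queue(lines):
    """Count the distinct thread IDs that appear for each Client-RPC number."""
    seen = defaultdict(set)
    for line in lines:
        tid, rpc = TID.search(line), CLIENT_RPC.search(line)
        if tid and rpc:
            seen[rpc.group(1)].add(tid.group(1))
    return {queue: len(tids) for queue, tids in seen.items()}


def queues_at_max(seen, configured_max):
    """Queues whose observed thread count reached the configured Max Threads."""
    return [q for q, n in seen.items() if n >= configured_max.get(q, float("inf"))]


# Illustrative lines only; not copied from a real log.
SAMPLE_LINES = [
    "<API > <TID: 0000000028> <RPC ID: 0000000001> <Client-RPC: 390620   > <USER: Demo>",
    "<API > <TID: 0000000029> <RPC ID: 0000000002> <Client-RPC: 390620   > <USER: Demo>",
    "<API > <TID: 0000000028> <RPC ID: 0000000003> <Client-RPC: 390620   > <USER: Demo>",
    "<API > <TID: 0000000040> <RPC ID: 0000000004> <Client-RPC: 390635   > <USER: Demo>",
]

seen = threads_seen_per_queue(SAMPLE_LINES)
print(seen)
print(queues_at_max(seen, {"390620": 2, "390635": 40}))
```

Applied to the figures above (27 of 30 Fast, 35 of 40 List), queues_at_max would return an empty list: neither queue reached its maximum during the logged period.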
Threads :: Here’s How :: Step 4 – Idle Time
Thread idle time is the time between a thread completing one piece of work and starting the next
Therefore the lower the idle time, the busier the thread is
Remember this will probably spike during the working day, but this is truly the best way to monitor when your busy periods are and how busy your system is
There is no such thing as 0 idle time. Even getting work from the dispatcher takes at least some time
- Therefore ignore the MIN idle time column; examples include 0.0007
Look for very small numbers in the AVE idle time column
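The idle-time definition above is easy to compute once you have the start and end time of each call handled by one thread. This is a minimal sketch of that calculation, not a description of how either log analyser works internally. Times are plain seconds and the function names are illustrative.

```python
from statistics import mean


def idle_gaps(calls):
    """Idle gaps for ONE thread.

    `calls` is a list of (start_s, end_s) tuples in any order. Each gap is
    the time between one call finishing and the next one starting.
    """
    ordered = sorted(calls)
    return [
        max(0.0, next_start - prev_end)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def idle_summary(calls):
    """Return min, average and max idle time, or None with fewer than 2 calls."""
    gaps = idle_gaps(calls)
    if not gaps:
        return None
    return {"min": min(gaps), "ave": mean(gaps), "max": max(gaps)}


# Three calls handled by the same thread (made-up timings, in seconds).
THREAD_CALLS = [(0.0, 1.0), (1.5, 2.0), (4.0, 5.0)]
summary = idle_summary(THREAD_CALLS)
print(summary)
```

A thread whose average idle time stays very small through the peak period is close to saturated, which is the signal the slide tells you to look for.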
Threads :: Identifying Busy Periods
You can identify when the AR Server has needed to increase the thread count on a queue
We can see one queue, 390626, which is configured to start with 6 threads (Min Threads value below)
Looking through the log, we can see the thread numbers increment sequentially for the first 6, 28 to 33, then we see 130
Threads :: Identifying Busy Periods
Note the thread ID 0000000130 underlined in red on the previous slide
Search your log file for TID: 0000000130
The above screenshot of the log entry is on one line, but I had to cut it to fit on the slide
Find the first instance of the thread ID (TID) and note the time, which in this example is 13:08. This is when the AR Server decided it needed to create a new thread on the 390626 queue
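The search for the first appearance of a thread ID can be scripted for large logs. The sketch below assumes the timestamp sits between "/*" and "*/" on each line, as in typical AR System logs, so check that against your own file. The sample lines, including their dates, are invented for the illustration.

```python
import re
from typing import Iterable, Optional

TIMESTAMP = re.compile(r"/\*\s*(.*?)\s*\*/")


def first_seen(lines: Iterable[str], tid: str) -> Optional[str]:
    """Return the timestamp text of the first line that mentions this TID."""
    tid_tag = re.compile(r"<TID:\s*" + re.escape(tid) + r"\s*>")
    for line in lines:
        if tid_tag.search(line):
            stamp = TIMESTAMP.search(line)
            return stamp.group(1) if stamp else None
    return None


# Illustrative lines only; not copied from a real log.
SAMPLE_LOG = [
    "<API > <TID: 0000000033> <RPC ID: 0000000101> <Client-RPC: 390626   > "
    "/* Mon Jun 03 2013 13:07:59.1200 */ +GE ARGetEntry",
    "<API > <TID: 0000000130> <RPC ID: 0000000102> <Client-RPC: 390626   > "
    "/* Mon Jun 03 2013 13:08:02.4500 */ +GE ARGetEntry",
    "<API > <TID: 0000000130> <RPC ID: 0000000103> <Client-RPC: 390626   > "
    "/* Mon Jun 03 2013 13:09:41.0300 */ +GE ARGetEntry",
]

print(first_seen(SAMPLE_LOG, "0000000130"))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so a multi-gigabyte log does not need to be loaded into memory.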
Threads :: Server Statistics
Another way of identifying the number of threads started
- AR System Administration Console > System > General > Review Statistics
Plugins…
Plugins :: Let’s All Get Up To Speed On Plugins
Three types of plugins. Each does a different thing
- AREA – External Authentication
- ARF – Filter plugins, which are used to extend the actions of Filters
- ARDBC – Access data outside of forms but mimic the behaviour of forms
Plugins come in two implementation types: C and Java
- Which obviously relates to the programming language they are built with
C
- Runs through an executable file: arplugin.exe (Windows), arplugin (*NIX)
Java
- Surprisingly runs from separate Java processes, or Java Virtual Machines
They have completely separate configuration, logging output etc
Plugins :: Let’s All Get Up To Speed On Plugins
C specific
- arplugin.exe / arplugin is started via armonitor and configured to run in armonitor.cfg / armonitor.conf
  • E.g. /opt/bmc/ARSystem/bin/arplugin -s srv1 -i /opt/bmc/ARSystem
- Configured through the ar.cfg / ar.conf
- “Plugin:”, “Plugin-Path:” & “Plugin-Port:” apply only to the C plugin daemon.
- How to identify them? In the ar.cfg/ar.conf
  • Plugin: followed by a .dll on Windows, or a .so or .a on *NIX systems
  • E.g. Plugin: ServerAdmin.so
  • E.g. Plugin: ardbcconf.dll
- Logging is controlled through the AR System Administration Console
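Listing the C plugins is a matter of picking out the "Plugin:" lines from ar.cfg / ar.conf. The sketch below does that for configuration text passed in as a string. The function name and the sample configuration are illustrative; only the option names and the two library names come from this slide.

```python
def c_plugins(ar_cfg_text: str):
    """Return the library names listed on 'Plugin:' lines of ar.cfg / ar.conf."""
    plugins = []
    for raw_line in ar_cfg_text.splitlines():
        line = raw_line.strip()
        # "Plugin-Path:" and "Plugin-Port:" do not start with "Plugin:",
        # so they are skipped by this test.
        if line.startswith("Plugin:"):
            plugins.append(line.split(":", 1)[1].strip())
    return plugins


# Illustrative configuration fragment, not a complete ar.conf.
SAMPLE_AR_CONF = """\
Plugin-Port: 9999
Plugin-Path: /opt/bmc/ARSystem/bin
Plugin: ServerAdmin.so
Plugin: ardbcconf.dll
"""

print(c_plugins(SAMPLE_AR_CONF))
```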
Plugins :: Let’s All Get Up To Speed On Plugins
Java specific
- A Java process is started via armonitor and configured to run in armonitor.cfg / armonitor.conf
  • E.g. /usr/java/jdk1.6.0_06/jre/bin/java -Xmx512m -classpath /opt/bmc/ARSystem/pluginsvr:/opt/bmc/ARSystem/pluginsvr/arpluginsvr75.jar com.bmc.arsys.pluginsvr.ARPluginServerMain -x svr1 -i /opt/bmc/ARSystem
- Typical ITSM instance has 4 Java plugin servers running
  • Primary plugin server
  • Full Text Search Engine
  • 2 CMDB plugin servers
- Configured via three separate pluginsvr_config.xml files
- How to identify them? Within pluginsvr_config.xml files and some are aliased in the ar.cfg/ar.conf
- Logging is controlled through each log4j_pluginsvr.xml file
Plugins :: Let’s All Get Up To Speed On Plugins
2 main types of functionality within the plugins: 1-Way and 2-Way
- 1-Way is where the AR Server calls the plugin and the plugin returns a response
- 2-Way is where the AR Server calls the plugin but, in order to complete that request, the plugin must connect back to the AR Server and look up some data before it returns a response
The 2-Way plugins are typically the ones to look out for with regards to performance
- E.g. REMEDY.ARDBC.APPQUERY
Plugins :: Monitoring The Configuration In Log Files
Same process of assigning queue numbers, permitting the monitoring of data within the logs
- E.g. If the plugin is configured on RPC queue 390624 then the API calls and SQL that the plugin executes against the AR Server will appear in the API and SQL logs with:
  • <Client-RPC: 390624 >
Plugins that connect back to the AR Server are just clients, which use the same API as the User Tool, driver, Mid Tier, etc
Plugins :: Example - REMEDY.ARDBC.APPQUERY
Seen in the log files when viewing the Overview Console
Find out if it’s a C or a Java plugin by looking in the ar.cfg/ar.conf
- Server-Plugin-Alias: REMEDY.ARDBC.APPQUERY REMEDY.ARDBC.APPQUERY srv1:9999
- Note the port number, :9999. If Plugin-Port: is also 9999 then it’s a C plugin. Otherwise it’s a Java plugin.
Search for NAME in the Java plugin config XML (pluginsvr_config.xml)
Now look for the above classname line in the log4j_pluginsvr.xml
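The port comparison described on this slide can be automated when there are many aliases to check. The sketch below applies exactly that rule: find the port on the alias line and compare it with the Plugin-Port value. The function name and sample fragments are illustrative, and it assumes the alias line ends with host:port as in the example above.

```python
def classify_alias(ar_cfg_text: str, alias_name: str) -> str:
    """Return 'C', 'Java' or 'unknown' for a Server-Plugin-Alias entry."""
    plugin_port = None
    alias_port = None
    for raw_line in ar_cfg_text.splitlines():
        line = raw_line.strip()
        if line.startswith("Plugin-Port:"):
            plugin_port = line.split(":", 1)[1].strip()
        elif line.startswith("Server-Plugin-Alias:"):
            # Expected fields: alias name, real plugin name, host:port
            fields = line.split(":", 1)[1].split()
            if fields and fields[0] == alias_name and ":" in fields[-1]:
                alias_port = fields[-1].rsplit(":", 1)[1]
    if alias_port is None:
        return "unknown"
    # Same port as the C plugin daemon means a C plugin; otherwise Java.
    return "C" if alias_port == plugin_port else "Java"


# Illustrative fragments; the alias line is the one shown on the slide.
ALIAS_LINE = "Server-Plugin-Alias: REMEDY.ARDBC.APPQUERY REMEDY.ARDBC.APPQUERY srv1:9999\n"
CFG_SAME_PORT = "Plugin-Port: 9999\n" + ALIAS_LINE
CFG_OTHER_PORT = "Plugin-Port: 9998\n" + ALIAS_LINE

print(classify_alias(CFG_SAME_PORT, "REMEDY.ARDBC.APPQUERY"))
print(classify_alias(CFG_OTHER_PORT, "REMEDY.ARDBC.APPQUERY"))
```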
Plugins :: Example - REMEDY.ARDBC.APPQUERY
Now look for the above line in the log4j_pluginsvr.xml
Change warn to trace, restart and open the arjavaplugin.log
In ITSM out of the box (OOTB), this plugin is not configured to run on its own queue. Add the line in the pluginsvr_config.xml
Plugins :: Summary
Plugins that connect back to the AR Server are clients just like the User Tool and Mid Tier etc
Most, if not all, are not configured to run on their own queues OOTB
It’s OK to run trace logs in production as long as it is managed!
- No better log than one with real user transactions, on a real working system
- Use log rotation with some form of log monitoring
- You don’t need a lot of log data. An hour can be enough
Include this analysis in your capacity management assessment
- E.g. If we add 1000 users, will I need to increase my thread count?
Conclusion
A lot of content, but hopefully it was practical content
Each system is unique so use these techniques on your own system
Don’t be afraid to monitor. I agree log files do grow fast, but manage this rather than not taking any logs at all
Email me if you have any questions. I am geeky enough to actually enjoy this
Wrap-up