In a lot of enterprises, performance tests are conducted regularly. As part of these tests, QA team gathers various metrics and publishes them in a performance report. Metrics reported in the performance report typically tend to be:
- CPU utilization
- Memory utilization
- Response time of key transactions
- Response time of backend systems
- Network Bandwidth (some of the organizations)
I would like to categorize the above-mentioned metrics as macro metrics. Macro metrics are great; however, they have few short-comings:
a. Performance problems not caught in test environment
Despite a number of performance tests conducted in the test environment, still, performance degradation finds its way to production. In a test environment, you will notice acute performance degradation, but above-mentioned Macro metrics don’t capture them. These acute degradations are the ones when unnoticed, manifests as major performance problems in production. However micro-level metrics discussed below brings visibility to these degradations.
b. Doesn’t facilitate in troubleshooting
Macro metrics to a major extent does not facilitate development team to debug and troubleshoot problems. Say, macro metrics indicates CPU consumption is high, there will be no indication whether CPU consumption increased because of heavy Garbage Collection activity or thread looping problem or some other coding issue. Similarly, if there is degradation in the response time, it won’t indicate whether degradation is because of the locks in the application code or a backend connectivity issue.
Macro metrics should be complemented with micro metrics to address the above-mentioned shortcomings. In this article, I have listed 10 micro metrics, which you may consider adding to your performance reports.
Memory related micro metrics
Here are 4 memory related micro metrics:
- Garbage Collection Pause Times
- Object Creation/Reclamation Rate
- Garbage Collection Throughput
- Memory Consumption By Each Generation
Let’s review them in detail:
1. Garbage Collection Pause Times
You should measure Garbage collection pause times, as during GC pauses entire application freezes. No customer activity will be performed. It would have direct customer impact. One should always aim for low pause time.
2. Object Creation/Reclamation Rate
The rate at which objects are created heavily influences the CPU utilization. If inefficient data structures or code are used, then more objects will be generated to process the same number of transactions. A high object creation rate translates to frequent Garbage Collection (GC). Frequent GC translates to increased CPU consumption.
3. Garbage Collection Throughput
Throughput is basically the amount of time your application spends in processing customer transactions vs amount of time it spends in doing garbage collection activities. One should target for high throughput (i.e. application should spend more time in processing customer transactions and very less time in garbage collection activities)
4. Memory consumption by each generation
In JVM (Java Virtual Machine), ART (Android Runtime), and other platforms, memory is divided into few internal regions. You need to know what is the allocated size, peak utilization size of each region are. Under allocation of internal memory regions, will degrade the performance of the application. Over allocation will increase the bill from your hosting provider 😊.
Fig: GC KPI generated by GCeasy.io
How to source memory related micro metrics?
All the memory related micro metrics can be captured from the Garbage Collection logs. Here are instructions on how to turn ON the Garbage collection logs. You can use free online GC log analyzer tool like GCeasy.io, which will report all of the above memory related micro metrics in a visual/graphical format.
Thread related micro metrics
Here are 4 thread related micro metrics:
- Thread States
- Thread groups
- Daemon vs non-daemon
- Code execution path
Let’s review the importance of these metrics below:
Fig: Thread Group metric generated by fastThread.io
5. Thread States
Threads can be in one of the following states: NEW, BLOCKED, RUNNABLE, WAITING, TIMED_WAITING, TERMINATED state. Threads count by each state should be reported. If threads are in a BLOCKED state for a prolonged period then the application can become unresponsive. If there are lot of threads in a RUNNABLE state, then the application’s CPU consumption will become high. If application threads are spending more time in WAITING, TIMED_WAITING or BLOCKED states, then response time will degrade.
6. Thread groups
A thread group represents a set of threads. Each application has multiple thread groups. You should measure the size of each thread group and report it. An increase in thread group size might indicate a certain type of performance degradation.
7. Daemon vs non-Daemon
There are two types of thread statuses: Daemon and non-daemon (i.e. user) threads. You should report threads count by status. Because when non-daemon threads are running, JVM won’t terminate.
8. Code execution path
Your application’s CPU, memory consumption, and response time would differ based on the code execution path. If most of the threads execute a specific code execution path, then that particular code execution should have to be studied in detail.
How to source thread related micro metrics?
Thread activity related metrics can be sourced from thread dumps. Here are 8 different options to capture thread dumps. You can use the option that is convenient to you. Once you have captured thread dumps, you can upload them to free online thread dump analysis tools like fastThread.io. This tool provides all of the above thread activity related micro-level metrics.
Network related micro metrics
Here are 2 network related micro metrics:
- Outbound connections
- Inbound connections
Let’s review the importance of these metrics below:
9. Outbound connections
In today’s world, you will seldom see enterprise applications that don’t communicate with other applications. Your application’s performance is heavily dependent on the applications to which it communicates. Measuring the number of ESTABLISHED connections by each end-point should be measured. Any variance in the connection count can influence the performance of the application.
10. Inbound connections
The application can get traffic from multiple channels: Web, Mobile, API and multiple protocols: HTTP, HTTPS, JMS, Kafka,… You need to measure the number of connections coming from each channel and each protocol as they also influence the performance of the application.
How to source network related micro metrics?
Application Performance Monitoring (APM) tools like New Relic, App Dynamics… can report this metric OR you can configure custom probes in APM tools to report these metrics. On the other hand, if you aren’t using APM tools, you can also use ‘netstat’ utility.
netstat -an | grep 'hostname' | grep 'ESTABLISHED'