Long GC Pauses are undesirable for applications. It affects your SLAs; it results in poor customer experiences, and it causes severe damages to mission critical applications. Thus in this article, I have laid out key reasons that can cause long GC pauses and potential solutions to solve them.

1. High Object Creation Rate

If your application’s object creation rate is very high, then to keep with it, garbage collection rate will also be very high. High garbage collection rate will increase the GC pause time as well. Thus, optimizing the application to create less number of objects is THE EFFECTIVE strategy to reduce long GC pauses. This might be a time-consuming exercise, but it is 100% worth doing. In order to optimize object creation rate in the application, you can consider using java profilers like JProfiler, YourKit, JVisualVM….). These profilers will report

• What are the objects that created?
• What is the rate at which these objects are created?
• What is the amount of space they are occupying in memory?
• Who is creating them?

Always try to optimize the objects which occupy the most amount of memory. Go after big fish in the pond.

bulb-1  Tit-bit: How to figure out object creation rate?
objectstats

Upload your GC log to the Universal Garbage Collection log analyzer tool GCEasy. This tool will report the object creation rate. There will be field by name ‘Avg creation rate’ in the section ‘Object Stats.’ This field will report the object creation rate. Strive to keep this value lower always. See the image (which is an excerpt from the GCEasy generated report), showing the ‘Avg creation rate’ to be 8.83 mb.sec.

 2. Undersized Young Generation

When young Generation is undersized, objects will be prematurely promoted to Old Generation. Collecting garbage from old generation takes more time than collecting it from young Generation. Thus increasing young generation size has a potential to reduce the long GC pauses. Young Generation can be increased setting either one of the two JVM arguments

-Xmn: specifies the size of the young generation

-XX:NewRatio: Specifies ratio between the old and young generation. For example, setting -XX:NewRatio=3 means that the ratio between the old and young generation is 3:1. i.e. young generation will be fourth of the overall heap. i.e. if heap size is 2 GB, then young generation size would be 0.5 GB.

3. Choice of GC Algorithm

Choice of GC algorithm has a major influence on the GC pause time. Unless you are a GC expert or intend to become one 🙂 or someone in your team is a GC expert – you can tune GC settings to obtain optimal GC pause time. Assume if you don’t have GC expertise, then I would recommend using G1 GC algorithm, because of it’s auto-tuning capability. In G1 GC, you can set the GC pause time goal using the system property ‘-XX:MaxGCPauseMillis.’ Example:

 -XX:MaxGCPauseMillis=200

As per the above example, Maximum GC Pause time is set to 200 milliseconds. This is a soft goal, which JVM will try it’s best to meet it. If you are already using G1 GC algorithm and still continuing to experience high pause time, then refer to this article.

4. Process Swapping

Sometimes due to lack of memory (RAM), Operating system could be swapping your application from memory. Swapping is very expensive as it requires disk accesses which is much slower as compared to the physical memory access. In my humble opinion – no serious application in a production environment should be swapping. When process swaps, GC will take a long time to complete.

Below is the script obtained from StackOverflow (thanks to the author) – which when executed will show all the process that are being swapped. Please make sure your process is not getting swapped.

#!/bin/bash 
# Get current swap usage for all running processes
# Erik Ljungstrom 27/05/2011
# Modified by Mikko Rantalainen 2012-08-09
# Pipe the output to "sort -nk3" to get sorted output
# Modified by Marc Methot 2014-09-18
# removed the need for sudo

SUM=0
OVERALL=0
for DIR in `find /proc/ -maxdepth 1 -type d -regex "^/proc/[0-9]+"`
do
    PID=`echo $DIR | cut -d / -f 3`
    PROGNAME=`ps -p $PID -o comm --no-headers`
    for SWAP in `grep VmSwap $DIR/status 2>/dev/null | awk '{ print $2 }'`
    do
        let SUM=$SUM+$SWAP
    done
    if (( $SUM > 0 )); then
        echo "PID=$PID swapped $SUM KB ($PROGNAME)"
    fi
    let OVERALL=$OVERALL+$SUM
    SUM=0
done
echo "Overall swap used: $OVERALL KB"

If you find your process to be swapping then do one of the following:

a. Allocate more RAM to the server
b. Reduce the number of processes that running on the server, so that it can free up the memory (RAM).
c. Reduce the heap size of your application (which I wouldn’t recommend, as it can cause other side effects).

5. Less GC Threads

For every GC event reported in the GC log, user, sys and real time are printed. Example:

[Times: user=25.56 sys=0.35, real=20.48 secs]

To know the difference between each of these times, please read the article. (I highly encourage you to read the article, before continuing this section). If in the GC events you consistently notice that ‘real’ time isn’t significantly lesser than the ‘user’ time – then it might be indicating that there aren’t enough GC threads. Consider increasing the GC thread count. Say suppose ‘user’ time 25 seconds, and you have configured GC thread count to be 5, then real time should be close to 5 seconds (because 25 seconds / 5 threads = 5 seconds).

WARNING: Adding too many GC threads will consume a lot of CPU and takes away a resource from your application. Thus you need to conduct thorough testing before increasing the GC thread count.

6. Background IO Traffic

If there is a heavy file system I/O activity (i.e. lot of reads and writes are happening) it can also cause long GC pauses. This heavy file system I/O activity may not be caused by your application. Maybe it is caused by another process that is running on the same server, still, can cause your application to suffer from long GC pauses. Here is a brilliant article from LinkedIn Engineers, which walks through this problem in detail.

When there is a heavy I/O activity, you will notice the ‘real’ time to be significantly more than ‘user’ time. Example:

[Times: user=0.20 sys=0.01, real=18.45 secs]

When this pattern happens, here are the potential solutions to solve it:

a. If high I/O activity is caused by your application, then optimize it.
b. Eliminate the processes which are causing high I/O activity on the server
c. Move your application to a different server where I/O activity is less

bulb-1  Tit-bit: How to monitor I/O  activity?

You can monitor I/O activity, using the sar (System Activity Report), in Unix. Example:

sar -d -p 1

Above commands reports the reads/sec and writes/sec made to the device every 1 second. For more details on ‘sar’ command refer to this tutorial.

7. System.gc() calls

When System.gc() or Runtime.getRuntime().gc() method calls are invoked it will cause stop-the-world Full GCs. During stop-the-world full GCs, entire JVM is freezed (i.e. No user activities will be performed during period). System.gc() calls are made from one of the following sources:

  1. Your own application developers might be explicitly calling System.gc() method.
  2. It could be 3rd party libraries, frameworks, sometimes even application servers that you use could be invoking System.gc() method.
  3. It could be triggered from external tools (like VisualVM) through use of JMX
  4. If your application is using RMI, then RMI invokes System.gc() on a periodic interval. This interval can be configured using the following system properties:

– Dsun.rmi.dgc.server.gcInterval=n
– Dsun.rmi.dgc.client.gcInterval=n

Evaluate whether it’s absolutely necessary to explicitly invoke System.gc(). If there is no need to then, please remove it. On the other hand, you can forcefully disable the System.gc() calls by passing the JVM argument: ‘-XX:+DisableExplicitGC‘. For complete details on System.gc() problems & solution refer to this article.

bulb-1  Tit-bit: How to know whether System.gc() calls are explicitly called?
gccauses

Upload your GC log to the Universal Garbage Collection log analyzer tool GCEasy. This tool has a section called ‘GC Causes.’  If GC activity is triggered because of ‘System.gc()’ calls then it will be reported in this section. See the image (which is an excerpt from the GCEasy generated report), showing that System.gc() was made 4 times during the lifetime of this application.

8. Large Heap size

Large heap size (-Xmx) can also cause long GC pauses. If heap size is quite high, then more garbage will be get accumulated in the heap. When Full GC is triggered to evict the all the accumulated garbage in the heap, it will take long time to complete. Logic is simple: If you have small can full of trash, it’s going to be quick and easy to dispose them. On the other hand if you have truck load of trash, it’s going to take more time to dispose them.

Suppose your JVMs heap size is 18GB, then consider having three 6 GB JVM instances, instead of one 18GB JVM. Small heap size has great potential to bring down the long GC pauses.

CAUTION: All of the above mentioned strategies should be rolled to production only after thorough testing & analysis. All strategies may not apply to your application. Improper usage of these strategies can result in negative results.

9. Workload distribution

Eventhough there are multiple GC threads, sometimes work load is evenly distributed between GC worker Threads. There are multiple reasons why GC workloads may not be evenly broken up amoing GC threads. For example:

a. Scanning of large linear data structures currently can not be parallelized.
b. Some times of events only triggers single thread collector (example when there is ‘concurrent mode failure’ in CMS collection)

If you happen to use CMS (Concurrent Mark & Sweep algorithm), you can consider passing -XX:+CMSScavengeBeforeRemark argument. This can create more balanced workloads among GC worker threads.