The time taken by every GC event is reported in the GC log. For each GC event, the log shows 'user', 'sys', and 'real' times. What do these times mean, and what is the difference between them?
- ‘real’ time is the total elapsed (wall-clock) time of the GC event. This is the time you would see on a clock.
- ‘user’ time is the CPU time spent in user-mode code (outside the kernel).
- ‘sys’ time is the amount of CPU time spent in the kernel, i.e. CPU time spent executing system calls, as opposed to library code, which still runs in user space.
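A minimal sketch can make the three numbers concrete. The snippet below (not part of the original article; it uses Python's standard `os.times()` and `time.perf_counter()`) measures a mix of waiting and busy work in one process — the sleep adds to real time but contributes almost no user or sys time:

```python
import os
import time

# 'real' (wall-clock) time via perf_counter; 'user'/'sys' CPU time via os.times().
t0 = os.times()
start = time.perf_counter()

time.sleep(0.5)                    # waiting: real time passes, almost no CPU time

busy_until = time.perf_counter() + 0.5
while time.perf_counter() < busy_until:
    pass                           # spinning: burns user-mode CPU time

real = time.perf_counter() - start
t1 = os.times()
user = t1.user - t0.user           # CPU time spent in user-mode code
sys_ = t1.system - t0.system       # CPU time spent in the kernel

print(f"real={real:.2f}s user={user:.2f}s sys={sys_:.2f}s")
```

For this single-threaded process, user + sys comes out well below real, because half the elapsed time was spent sleeping rather than consuming CPU.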
To learn more about these times, please refer to this article.
In almost all normal GC events, real time will be less than user + sys time. This is because multiple GC threads work concurrently and share the workload, so the wall-clock time is shorter than the total CPU time summed across threads. Say user + sys time is 2 seconds. If 5 GC threads are working concurrently, then real time should be somewhere in the neighbourhood of 400 milliseconds (2 seconds / 5 GC threads).
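The back-of-the-envelope estimate above can be written as a one-line function (a rough sketch only — it ignores synchronization overhead between GC threads, so real events will be somewhat slower):

```python
def estimated_real_time(user: float, sys: float, gc_threads: int) -> float:
    """Rough lower bound on wall-clock GC time: total CPU time / thread count."""
    return (user + sys) / gc_threads

# 2 seconds of CPU time spread across 5 GC threads:
print(estimated_real_time(1.5, 0.5, 5))  # → 0.4
```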
But in certain circumstances you might see real time greater than user + sys time, for example:
[Times: user=0.20 sys=0.01, real=18.45 secs]
If you notice multiple occurrences of this scenario in your GC log then it might be indicative of one of the following problems:
1. Heavy I/O activity
2. Lack of CPU
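This check can be automated against your GC log. Below is a small sketch (my own illustration, assuming the classic HotSpot `[Times: ...]` format shown above) that flags events where real time exceeds user + sys time:

```python
import re

# Matches HotSpot GC log entries like: [Times: user=0.20 sys=0.01, real=18.45 secs]
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>[\d.]+) sys=(?P<sys>[\d.]+), real=(?P<real>[\d.]+) secs\]"
)

def is_suspicious(log_line: str) -> bool:
    """True when real time exceeds user + sys (possible I/O or CPU starvation)."""
    m = TIMES_RE.search(log_line)
    if not m:
        return False
    user, sys_, real = (float(m.group(g)) for g in ("user", "sys", "real"))
    return real > user + sys_

print(is_suspicious("[Times: user=0.20 sys=0.01, real=18.45 secs]"))  # → True
print(is_suspicious("[Times: user=2.00 sys=0.05, real=0.41 secs]"))   # → False
```

A single flagged event can be noise; as the article says, it is repeated occurrences that point to one of the two problems below.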
1. Heavy I/O activity
When there is heavy I/O activity (e.g. networking, disk access, user interaction) on the server, real time tends to spike considerably. As part of GC logging, your application makes a disk access; if the disk is busy with other I/O, the GC event can be stranded waiting on that write, inflating the real time.
Note: Even if your application is not causing the heavy I/O activity, other processes on the server may cause it, leading to high real time. Here is a brilliant article from LinkedIn engineers explaining a GC problem they experienced because of high I/O activity.
You can monitor I/O activity on your server using sar (System Activity Report) on Unix:
sar -d -p 1
The above command reports the reads/sec and writes/sec made to each device every 1 second. For more details on the ‘sar’ command, refer to this tutorial.
If you notice high I/O activity on your server then you can do one of the following to fix the problem:
a. If high I/O activity is caused by your application, then optimize your application’s I/O activity.
b. Eliminate the processes which are causing high I/O activity on the server
c. Move your application to a different server where I/O activity is less
2. Lack of CPU
If multiple processes are running on your server and your application doesn’t get enough CPU cycles, it will start to wait. When the application is waiting rather than running, real time will be considerably higher than user + sys time.
You can use commands like ‘top’ or monitoring tools (Nagios, New Relic, AppDynamics…) to observe CPU utilization on the server. If CPU utilization is very high and your process doesn’t get enough cycles to run, you can do one of the following to fix the problem:
a. Reduce the number of processes that are running on the server, so that your application gets a fair chance to run
b. Increase the CPU capacity – if you are in AWS cloud (or equivalent), move to a bigger instance type which has more CPU cores
c. Move your application to a new server where there is adequate CPU capacity
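A quick first-pass check for CPU starvation, before reaching for ‘top’ or a monitoring tool, is to compare the load average against the CPU count. This sketch (my own illustration using Python's `os.getloadavg()`, which is available on Unix-like systems only) flags a host where runnable threads outnumber CPUs:

```python
import os

def cpu_saturated(load_1min: float, cpus: int) -> bool:
    """True when the 1-minute load average exceeds the CPU count,
    i.e. runnable threads are queuing and GC threads may stall."""
    return load_1min > cpus

load, _, _ = os.getloadavg()        # (1 min, 5 min, 15 min) load averages
cpus = os.cpu_count() or 1
print(f"load={load:.2f} cpus={cpus} saturated={cpu_saturated(load, cpus)}")
```

Load average is a coarse signal (it also counts threads blocked on disk I/O on Linux), so treat it as a prompt to dig deeper with ‘top’ or sar, not as proof on its own.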
November 26, 2019 at 5:19 pm
How does high disk IO contribute to high real time? Isn’t the GC event completed BEFORE it’s written to disk?