Even today, garbage collection logs are analyzed in a tedious, manual fashion: you get hold of the DevOps engineer who has access to the production servers, they email you the application's GC logs, you upload the logs to a GC analysis tool, and then you apply your own intelligence to analyze them. There is no programmatic way to analyze garbage collection logs proactively. To eliminate this hassle, gceasy.io is introducing a RESTful API to analyze garbage collection logs. With one line of code you can get your GC logs analyzed instantly.

Here are a few use cases where this API can be extremely useful.

Use case 1: Automatic Root Cause Analysis

Most DevOps teams rely on a simple HTTP ping or APM tools to monitor application health. A ping is good for detecting whether the application is alive or not. APM tools are great at informing you that the application's CPU spiked by 'x%', memory utilization increased by 'y%', or response time degraded by 'z' milliseconds. They won't tell you what caused the CPU to spike, what caused memory utilization to increase, or what caused the response time to degrade. If you configure a cron job to capture thread dumps/GC logs at periodic intervals and invoke our REST API, we apply our intelligent patterns & machine learning algorithms to instantly identify the root cause of the problem.
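For instance, a periodic upload could be wired up with a cron entry and a small script along these lines. This is only a minimal sketch: the schedule, script path, and GC log location are hypothetical, and {API_KEY_SENT_IN_EMAIL} stands for the key you receive during registration.

#!/bin/bash
# /opt/scripts/upload-gc-log.sh (hypothetical path)
# Scheduled via cron, e.g.: */30 * * * * /opt/scripts/upload-gc-log.sh
GC_LOG=/opt/myapp/logs/my-app-gc.log   # adjust to your application's GC log location
curl -s -X POST --data-binary @"${GC_LOG}" \
     --header "Content-Type:text" \
     "http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}" > /tmp/gc-analysis.json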

Advantage 1: When these sorts of production problems happen, in the heat of the moment the DevOps team often recycles the servers without capturing the thread dumps and GC logs. To diagnose the problem, thread dumps and GC logs need to be captured at the moment the problem is happening. With this new strategy you don't have to worry about that: because your cron job is capturing thread dumps/GC logs at periodic intervals and invoking the REST API, all your thread dumps/GC logs are archived on our servers.

Advantage 2: APM tools claim to add less than 3% overhead, whereas in reality they add many times that. The beauty of this strategy is that it adds no overhead (or negligible overhead), because the entire analysis of the thread dumps/GC logs is done on our servers and not on your production servers.

Use case 2: Performance Tests

When you conduct performance tests, you may want to capture thread dumps/GC logs periodically and get them analyzed through the API. If the thread count goes beyond a threshold, too many threads are WAITING, threads are BLOCKED for a prolonged period, a lock isn't getting released, frequent Full GC activity is happening, or GC pause times exceed thresholds, you need that visibility right then and there, before the code hits production. In such circumstances this API becomes very handy.
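As an illustration, a post-test check could compare a couple of fields from the JSON response against thresholds. This is a rough sketch assuming bash with curl, jq, and bc available; the threshold values are arbitrary examples, not recommendations.

curl -s -X POST --data-binary @./my-app-gc.log \
     --header "Content-Type:text" \
     "http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}" > gc-analysis.json

# Example thresholds: throughput must stay above 98%, and no more than 5 Full GCs
THROUGHPUT=$(jq '.throughputPercentage' gc-analysis.json)
FULL_GC_COUNT=$(jq -r '.gcStatistics.fullGCCount' gc-analysis.json)
if (( $(echo "$THROUGHPUT < 98" | bc -l) )) || [ "$FULL_GC_COUNT" -gt 5 ]; then
    echo "GC health check failed: throughput=$THROUGHPUT%, fullGCCount=$FULL_GC_COUNT"
    exit 1
fi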

Use case 3: Continuous Integration

As part of continuous integration it's highly encouraged to execute performance tests. Thread dumps/GC logs should be captured and can be analyzed using the API. If the API reports any problems, the build can be failed. In this way, you catch performance degradation at code-commit time instead of catching it in the performance lab or in production.
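For example, a CI step could fail the build whenever the API flags a problem. A minimal sketch, assuming bash, curl, and jq on the build agent (the GC log file name is illustrative):

curl -s -X POST --data-binary @./perf-test-gc.log \
     --header "Content-Type:text" \
     "http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}" > gc-analysis.json

if [ "$(jq -r '.isProblem' gc-analysis.json)" = "true" ]; then
    echo "GC analysis reported problems; failing the build:"
    jq -r '.problem[]' gc-analysis.json
    exit 1   # non-zero exit fails the CI step
fi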

How to invoke the Garbage Collection log analysis API?

Invoking the Garbage Collection log analysis API is very simple:

  1. Register with us. We will email you the API key. This is a one-time setup process. Note: If you have purchased the enterprise version with the API, you don't have to register; the API key will be provided to you as part of the installation instructions.
  2. Post an HTTP request to http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}
  3. The body of the HTTP request should contain the garbage collection log that needs to be analyzed.
  4. The HTTP response is sent back in JSON format. The JSON contains several important stats about the GC log. The primary element to look at in the JSON response is "isProblem". This element has the value "true" if any memory/performance problems have been discovered. The "problem" element contains a detailed description of the memory problem.

CURL command

Assuming your GC log file is located at "./my-app-gc.log", the CURL command to invoke the API is:

curl -X POST --data-binary @./my-app-gc.log http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL} --header "Content-Type:text"

It can't get any simpler than that, can it? 🙂

Note: use the "--data-binary" option in the CURL command instead of the "--data" option. With "--data", new line breaks are not preserved in the request. New line breaks must be preserved for the log to be parsed correctly.

Other Tools

You can also invoke the API using any web service client tool, such as SOAP UI or the Postman browser plugin.


Fig: POSTing GC logs through the Postman plugin

Sample Response

{
  "isProblem": true,
  "problem": [
    "Our analysis tells that Full GCs are consecutively running in your application. It might cause intermittent OutOfMemoryErrors or degradation in response time or high CPU consumption or even make application unresponsive.",
    "Our analysis tells that your application is suffering from long running GC events. 4 GC events took more than 10 seconds. Long running GCs are unfavourable for application's performance.",
    "342 times application threads were stopped for more than 10 seconds."
  ],
  "jvmHeapSize": {
    "youngGen": {
      "allocatedSize": "7.5 gb",
      "peakSize": "6 gb"
    },
    "oldGen": {
      "allocatedSize": "22.5 gb",
      "peakSize": "22.5 gb"
    },
    "metaSpace": {
      "allocatedSize": "1.04 gb",
      "peakSize": "48.52 mb"
    },
    "total": {
      "allocatedSize": "30 gb",
      "peakSize": "28.5 gb"
    }
  },
  "gcStatistics": {
    "totalCreatedBytes": "249.49 gb",
    "measurementDuration": "7 hrs 32 min 52 sec",
    "avgAllocationRate": "9.4 mb/sec",
    "avgPromotionRate": "1.35 mb/sec",
    "minorGCCount": "62",
    "minorGCTotalTime": "1 min 19 sec",
    "minorGCAvgTime": "1 sec 274 ms",
    "minorGCAvgTimeStdDeviation": "2 sec 374 ms",
    "minorGCMinTIme": "0",
    "minorGCMaxTime": "13 sec 780 ms",
    "minorGCIntervalAvgTime": "7 min 25 sec 442 ms",
    "fullGCCount": "166",
    "fullGCTotalTime": "14 min 11 sec 620 ms",
    "fullGCAvgTime": "5 sec 130 ms",
    "fullGCAvgTimeStdDeviation": "5 sec 207 ms",
    "fullGCMinTIme": "120 ms",
    "fullGCMaxTime": "57 sec 880 ms",
    "fullGCIntervalAvgTime": "2 min 19 sec 104 ms"
  },
  "gcDurationSummary": {
    "groups": [
      {
        "start": "0",
        "end": "6",
        "numberOfGCs": 212
      },
      {
        "start": "6",
        "end": "12",
        "numberOfGCs": 4
      },
      {
        "start": "12",
        "end": "18",
        "numberOfGCs": 2
      },
      {
        "start": "42",
        "end": "48",
        "numberOfGCs": 1
      },
      {
        "start": "54",
        "end": "60",
        "numberOfGCs": 1
      }
    ]
  },
  "gcCauses": [
    {
      "cause": "Allocation Failure",
      "count": 57
    },
    {
      "cause": "Concurrent Mode Failure",
      "count": 162
    },
    {
      "cause": "Full GC - Allocation Failure",
      "count": 369
    }
  ],
  "commandLineFlags": " -XX:CMSInitiatingOccupancyFraction=65 -XX:+CMSScavengeBeforeRemark -XX:CMSWaitDuration=2000 -XX:ConcGCThreads=8 -XX:+DisableExplicitGC -XX:GCLogFileSize=104857600 -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=32212254720 -XX:InitialTenuringThreshold=4 -XX:+ManagementServer -XX:MaxHeapSize=32212254720 -XX:MaxTenuringThreshold=4 -XX:NewRatio=3 -XX:NumberOfGCLogFiles=10 -XX:OldPLABSize=16 -XX:ParGCCardsPerStrideChunk=32768 -XX:+PrintClassHistogram -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintPromotionFailure -XX:+PrintTenuringDistribution -XX:SurvivorRatio=3 -XX:ThreadStackSize=334 -XX:+UnlockDiagnosticVMOptions -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseGCLogFileRotation -XX:+UseParNewGC ",
  "heapTuningTips": [
    "It looks like you have over allocated Metaspace size. During entire run, Metaspace's peak utilization was only 4.53% of the allocated size. You can consider lowering the Metaspace Size."
  ],
  "tipsToReduceGCTime": [
    {
      "issue": "15.34% of GC time (i.e 54 min 13 sec 710 ms) is caused by 'Concurrent Mode Failure'. The CMS collector uses one or more garbage collector threads that run simultaneously with the application threads with the goal of completing the collection of the tenured generation before it becomes full. In normal operation, the CMS collector does most of its tracing and sweeping work with the application threads still running, so only brief pauses are seen by the application threads. However, if the CMS collector is unable to finish reclaiming the unreachable objects before the tenured generation fills up, or if an allocation cannot be satisfied with the available free space blocks in the tenured generation, then the application is paused and the collection is completed with all the application threads stopped. The inability to complete a collection concurrently is referred to as concurrent mode failure and indicates the need to adjust the CMS collector parameters. Concurrent mode failure typically triggers Full GC..",
      "solution": "The concurrent mode failure can either be avoided by increasing the tenured generation size or initiating the CMS collection at a lesser heap occupancy by setting CMSInitiatingOccupancyFraction to a lower value and setting UseCMSInitiatingOccupancyOnly to true. CMSInitiatingOccupancyFraction should be chosen carefuly, setting it to a low value will result in too frequent CMS collections."
    }
  ],
  "throughputPercentage": 99.996,
  "responseId": "8296b5c3-25c7-4157-92df-a54d9083bab7",
  "graphURL": "http://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAxNy8wMi8xNy8tLWFwaS1lMDk0YTM0ZS1jM2ViLTRjOWEtODI1NC1mMGRkMTA3MjQ1Y2NjOWU0NGEzMS0yMDg2LTRhMzAtOWU5YS1jMDc0ZWQ4MWNlZjgudHh0LS0=&channel=API"
}

JSON Response Elements

Element Description
responseId Unique transaction Id that is generated for every response. This is used for any debugging or diagnosis purposes
graphURL Graphical visualization of the GC log can be found at this location.
isProblem 'true' is returned if any memory problems are found; 'false' is returned if no memory problems are found. This element can be used to build alerts for proactive GC monitoring
problem Description of the memory problem is reported in this element: what type of memory problem it is, and what side-effects and symptoms it might cause
jvmHeapSize The data points in this section are gathered from the GC log, thus they may or may not match the sizes specified by the JVM system properties (i.e. -Xmx, -Xms, ...). Say you have configured the total heap size (i.e. -Xmx) as 2gb, whereas at runtime the JVM has allocated only 1gb; then in this report you are going to see the allocated size as 1gb only.
youngGen
> allocatedSize Young Generation’s allocated size (i.e. specified at the JVM level)
> peakSize Young Generation’s peak utilization size at runtime
oldGen
> allocatedSize Old Generation’s allocated size (i.e. specified at the JVM level)
> peakSize Old Generation’s peak utilization size at runtime
metaSpace
> allocatedSize Metaspace’s allocated size (i.e. specified at the JVM level)
> peakSize MetaSpace’s peak utilization size at runtime
permGen
> allocatedSize Perm Generation’s allocated size (i.e. specified at the JVM level)
> peakSize Perm Generation’s peak utilization size at runtime
total
> allocatedSize Total allocated heap size (i.e. specified at the JVM level) includes Young + Old + Perm (or Metaspace)
> peakSize Peak utilization of the heap size at runtime
gcStatistics GC statistics summary is based on 'real' time reported in the GC logs. For more details on the difference between user, sys, and real time, refer to https://blog.gceasy.io/2016/04/06/gc-logging-user-sys-real-which-time-to-use/
> totalCreatedBytes Total amount of objects created by the application
> measurementDuration Total duration for which the application has been running
> avgAllocationRate Object creation rate of the application. When more objects are allocated, the young generation fills up quickly, and Minor GC runs more frequently.
> avgPromotionRate Rate at which objects are promoted from the Young Generation to the Old Generation. When more objects are promoted to the Old Generation, Full GCs run more frequently.
> minorGCCount Number of Minor GCs that occurred in the application
> minorGCTotalTime Total time spent in Minor GCs
> minorGCAvgTime Average time taken by the Minor GCs (i.e. total time taken by all minor GCs / number of minor GCs)
> minorGCAvgTimeStdDeviation Standard Deviation of the Minor GC average times
> minorGCMinTIme Minimum Time of all the minor GCs
> minorGCMaxTime Maximum time of all the minor GCs
> minorGCIntervalAvgTime Average time interval between minor GCs
> fullGCCount Number of Full GCs that occurred in the application
> fullGCTotalTime Total time spent in Full GCs
> fullGCAvgTime Average time taken by the Full GCs (i.e. total time taken by all full GCs / number of full GCs)
> fullGCAvgTimeStdDeviation Standard Deviation of the Full GC average times
> fullGCMinTIme Minimum time of all the full GCs
> fullGCMaxTime Maximum time of all the full GCs
> fullGCIntervalAvgTime Average time interval between Full GCs
gcDurationSummary
> groups Array of GC duration summary groups. Example of a group:
{
"start": "0.2",
"end": "0.3",
"numberOfGCs": 5
}
This indicates that 5 GCs completed between 0.2 seconds and 0.3 seconds
>> start Group’s start time reported in seconds
>> end Group’s end time reported in seconds
>> numberOfGCs Number of GCs that completed between ‘start’ time and ‘end’ time
gcCauses Array reporting the GC causes
> cause The reason which triggered the GC. Values could be: Concurrent Mode Failure, Allocation Failure, System.gc() calls
> count Number of times GC occurred because of this reason
commandLineFlags JVM arguments that were passed to the application
heapTuningTips Array. Optional element. Provides tips to tune your JVM heap size
tipsToReduceGCTime Array. Optional element. Provides tips to reduce GC time
> issue Indicates the reason that caused the GC
> solution Solution to reduce this GC cause
throughputPercentage Percentage of time spent processing real transactions vs. time spent in GC activity. A higher percentage is a good indication that GC overhead is low. One should aim for high throughput.

Preserving GC Log file name in the report

This step is optional. The response JSON contains a 'graphURL' element. This URL provides a visual report of the GC log that was parsed through the API. At the top of the report, the GC log file name is printed. If you want to print a unique file name in the report, you can pass the request parameter 'fileName', i.e. http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}&fileName={GC_LOG_FILE_NAME}.

Whatever value you pass in the 'fileName' parameter will be printed at the top of the visual report. Users find this feature handy especially when they are sending GC logs from multiple production servers: they suffix the production server's name to the file name, which makes it easy to switch between multiple reports in the web browser.
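For example, to tag the report with the originating server's name, the curl invocation could look like this (the file name shown is just an illustration):

curl -X POST --data-binary @./my-app-gc.log \
     --header "Content-Type:text" \
     "http://api.gceasy.io/analyzeGC?apiKey={API_KEY_SENT_IN_EMAIL}&fileName=prod-server-1-gc.log"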

Best Practice: Building alerts

For production monitoring or for performance tests, you can build alerts around the JSON response.

a. If the "isProblem" element has the value "true", then an alert can be generated with the content from the "problem" element.

b. You can also set up thresholds on elements like "fullGCCount", "fullGCTotalTime", "fullGCAvgTime", "fullGCAvgTimeStdDeviation", "fullGCMinTIme", "fullGCMaxTime", and "fullGCIntervalAvgTime". If any of these elements' values exceeds your threshold, then alerts can be triggered, as sketched below.
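Here is a rough sketch of such alert checks, assuming bash with jq available and an already downloaded response in gc-analysis.json; the threshold value and the mail recipient are placeholders for your own alerting mechanism:

IS_PROBLEM=$(jq -r '.isProblem' gc-analysis.json)
FULL_GC_COUNT=$(jq -r '.gcStatistics.fullGCCount' gc-analysis.json)

# a. alert on reported problems
if [ "$IS_PROBLEM" = "true" ]; then
    jq -r '.problem[]' gc-analysis.json | mail -s "GC alert" oncall@example.com   # hypothetical recipient
fi

# b. alert on a threshold, e.g. more than 100 Full GCs in the analyzed window
if [ "$FULL_GC_COUNT" -gt 100 ]; then
    echo "Full GC count $FULL_GC_COUNT exceeded threshold" | mail -s "GC threshold alert" oncall@example.com
fi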