We believe enterprises are wasting millions of dollars on garbage collection, and that most of them are doing so without even knowing it. The intent of this post is to bring visibility to how those millions of dollars are wasted.
What is Garbage?
All applications have a finite amount of memory. When a new request comes in, the application creates objects to service the request. Once the request is processed, the objects created to service it are no longer needed. In other words, those objects become garbage. They have to be evicted from memory so that there is room to service new incoming requests.
Garbage collection evolution: Manual → Automatic
Three to four decades ago, C and C++ were the programming languages most widely used by the development community. In those languages, garbage collection has to be done by the developers, i.e., application developers need to write code to free objects that are no longer needed. If developers forget to write that logic, the application suffers from a memory leak, and memory leaks can eventually cause the application to crash. Memory leaks were reportedly quite pervasive in those days.
When the Java programming language was introduced in the mid-1990s, it provided automatic garbage collection, i.e., developers no longer had to write logic to dispose of unreferenced objects. The Java Virtual Machine (JVM) automatically removes unreferenced objects from memory. It was a great productivity improvement, and developers enjoyed the feature. On top of that, the number of memory-leak-related crashes also came down. Sounds great so far, right? But there was one catch to this automatic garbage collection.
To do this automatic garbage collection, the JVM has to pause the application to identify unreferenced objects and dispose of them. This pause can last anywhere from a few milliseconds to a few minutes, depending on the application, workload and JVM settings. While an application is paused for garbage collection, no customer transactions are processed, and any transactions that are in the middle of processing are halted. The result is poor response time for customers. So this was the trade-off: in exchange for developer productivity and fewer memory-leak-related crashes, automatic garbage collection introduced application pause times. Effective tuning can bring the pause time down, but it cannot be eliminated.
This might sound like a minor hit to customers' response time. But it does not stop there: today, enterprises are losing millions of dollars because of automatic garbage collection. The details are below.
Garbage collection Throughput
‘GC Throughput’ is one of the key metrics studied in garbage collection tuning. It is reported as a percentage. What is ‘GC Throughput %’? It is the share of time the application spends processing customer transactions, as opposed to the time it spends on garbage collection activities. Suppose an application has a GC throughput of 98%: it means the application spends 98% of its time processing customer transactions and the remaining 2% on garbage collection activities.
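As a rough illustration of the definition above, here is a minimal sketch that estimates GC throughput for the running JVM from the standard `java.lang.management` beans. The class and method names (`GcThroughput`, `throughputPercent`, `totalGcMillis`) are made up for this example. Note that `getCollectionTime()` is only an approximate accumulated collection time, and for collectors that do part of their work concurrently it may not equal the time the application was actually paused, so treat the result as an estimate rather than a precise measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcThroughput {

    // Percentage of elapsed time that was NOT spent in garbage collection.
    static double throughputPercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return 100.0 * (elapsedMillis - gcMillis) / elapsedMillis;
    }

    // Sums the approximate accumulated collection time reported by each collector.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Uptime in milliseconds since the JVM started; guard against a zero reading.
        long uptime = Math.max(1, ManagementFactory.getRuntimeMXBean().getUptime());
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime, throughput ~ %.2f%%%n",
                gc, uptime, throughputPercent(gc, uptime));
    }
}
```

For example, `throughputPercent(2_000, 100_000)` corresponds to 2 seconds of GC in 100 seconds of runtime, i.e., the 98% case discussed in this post.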
Does 98% GC throughput sound good to you? Since we are trained to read 98% as an A-grade score, 98% GC throughput certainly sounds good. In reality, it is not. Let us look at the calculations below.
In 1 day, there are 1440 minutes (i.e. 24 hours x 60 minutes).
98% GC throughput means the application spends 28.8 minutes/day in garbage collection (i.e., it spends 2% of its time on GC activities, and 2% of 1440 minutes is 28.8 minutes).
What is this telling us? Even if your GC throughput is 98%, your application spends 28.8 minutes/day (almost 30 minutes) in garbage collection. For those 28.8 minutes your application is paused. It is not doing anything for your customers.
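The arithmetic above can be sketched in a few lines of Java. The class name `GcTimeLost` and the method `gcMinutesPerDay` are hypothetical names chosen for this example; the only inputs are the throughput percentage and the 1440 minutes in a day.

```java
class GcTimeLost {

    static final double MINUTES_PER_DAY = 24 * 60; // 1440 minutes

    // Minutes per day spent in garbage collection for a given GC throughput %.
    static double gcMinutesPerDay(double throughputPercent) {
        double gcFraction = (100.0 - throughputPercent) / 100.0;
        return gcFraction * MINUTES_PER_DAY;
    }

    public static void main(String[] args) {
        // Print the daily GC time for throughput values from 99% down to 95%.
        for (int throughput = 99; throughput >= 95; throughput--) {
            System.out.printf("%d%% throughput: %.1f minutes/day in GC%n",
                    throughput, gcMinutesPerDay(throughput));
        }
    }
}
```

Running it for other throughput values makes it easy to see how quickly the lost time grows as the percentage drops by even a single point.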
One way to visualize this problem: say you have bought a brand-new, expensive car and you want to drive it for a couple of hours. How will you feel if the car runs for only 1 hour and 50 minutes, stops intermittently in the middle of the road for a total of 10 minutes, and still consumes gasoline the whole time? This is exactly what happens in automatic garbage collection: the JVM keeps pausing intermittently while the application is in the middle of processing customer transactions.
Even a healthy application's GC throughput ranges from 95% to 99%, and sometimes it goes even lower. In the table below I have summarized how many dollars mid-size (1K instances), large (10K instances) and very large (100K instances) enterprises would be wasting per year based on their application's GC throughput percentage.
| GC Throughput % | 99% | 98% | 97% | 96% | 95% |
|---|---|---|---|---|---|
| Minutes wasted by 1 instance per day | 14.4 min | 28.8 min | 43.2 min | 57.6 min | 72 min |
| Hours wasted by 1 instance per year | 87.6 hrs | 175.2 hrs | 262.8 hrs | 350.4 hrs | 438 hrs |
| Dollars wasted per year by mid-size company (1K instances) | $50.07K | $100.14K | $150.21K | $200.28K | $250.36K |
| Dollars wasted per year by large company (10K instances) | $500.72K | $1.00M | $1.50M | $2.00M | $2.50M |
| Dollars wasted per year by X-Large company (100K instances) | $5.00M | $10.01M | $15.02M | $20.02M | $25.03M |
Here are the assumptions I have used for this calculation:
- A midsize enterprise runs its application on 1,000 EC2 instances, a large enterprise on 10,000 EC2 instances, and a very large enterprise on 100,000 EC2 instances.
- These enterprises run on t2.2xlarge (32 GB) RHEL on-demand EC2 instances in the US West (N. California) region. The cost of this type of EC2 instance is $0.5716/hour.
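To make the calculation behind the table reproducible, here is a minimal sketch under the assumptions listed above: the $0.5716/hour price, the three fleet sizes, and a 365-day year (which is what the per-year hours in the table imply). The names `GcCostEstimate`, `gcHoursPerYear` and `wastedDollarsPerYear` are hypothetical and exist only for this example.

```java
class GcCostEstimate {

    static final double HOURLY_RATE_USD = 0.5716;  // on-demand price assumed in this post
    static final double HOURS_PER_YEAR = 24 * 365; // 8760 hours, assuming a 365-day year

    // Hours per year a single instance spends in garbage collection.
    static double gcHoursPerYear(double throughputPercent) {
        return (100.0 - throughputPercent) / 100.0 * HOURS_PER_YEAR;
    }

    // Yearly cost of the instance-hours spent in garbage collection across a fleet.
    static double wastedDollarsPerYear(double throughputPercent, int instances) {
        return gcHoursPerYear(throughputPercent) * HOURLY_RATE_USD * instances;
    }

    public static void main(String[] args) {
        int[] fleetSizes = {1_000, 10_000, 100_000};
        for (int instances : fleetSizes) {
            for (int throughput = 99; throughput >= 95; throughput--) {
                System.out.printf("%,7d instances at %d%%: $%,.0f per year%n",
                        instances, throughput, wastedDollarsPerYear(throughput, instances));
            }
        }
    }
}
```

For 1,000 instances at 99% throughput this works out to roughly $50,072 per year, in line with the first dollar figure in the table. Changing `HOURLY_RATE_USD` is enough to rerun the estimate for a different instance type or pricing model.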
The graphs below show the amount of money midsize, large and very large enterprises would be wasting due to garbage collection:
Note 1: I have made these calculations assuming that GC throughput ranges only from 95% to 99%. Several applications tend to have much poorer throughput, in which case the amount of dollars wasted will be a lot more.
Note 2: I have used a t2.2xlarge (32 GB) RHEL instance for the calculation. Several enterprises tend to use machines with much larger capacity, in which case the amount of dollars wasted will be a lot more.
The following counterarguments can be raised against this study:
- For my study I used AWS EC2 on-demand instances; I could instead have used dedicated instances for my calculations. The difference between on-demand and dedicated instances is only approximately 30%, so the price point can fluctuate only by 30%. Even 70% of the above cost is outrageous.
- Another argument is that the AWS cloud is costly, and that I could have used some other cloud provider, bare metal machines or a serverless architecture. These are all valid counterarguments, but they would shift the calculation only by a few percent. The case that garbage collection is wasting resources cannot be disputed.
You are welcome to raise any other counterarguments in the comments section. I will try to respond to them.
In this post I have presented the case for how an exorbitant amount of money is wasted due to garbage collection. The unfortunate thing is that the money is wasted without our even being aware of it. As application developers, managers and executives, we can do the following:
- We should tune garbage collection performance so that our applications spend far less time on garbage collection.
- Modern applications tend to create tons of objects even to service simple requests. Here is our case study, which shows the amount of memory wasted by the well-celebrated Spring Boot framework. We can try to write efficient code so that our applications create far fewer objects to service incoming requests. If our applications create fewer objects, less garbage needs to be evicted from memory, and if there is less garbage, the pause time also comes down.
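As one small, hedged illustration of what "creating fewer objects" can look like in code, the sketch below compares a loop that accumulates into a boxed `Long` with one that uses a primitive `long`. The class and method names are hypothetical, and this is just one common pattern, not a complete tuning recipe. In the boxed version, each `+=` unboxes the value, adds to it and boxes the result again, which may allocate a new `Long` object on each iteration unless the JIT compiler manages to optimize the allocation away.

```java
class FewerObjects {

    // Allocation-heavy: 'sum' is a boxed Long, so each += may create a new Long object.
    static long sumBoxed(int n) {
        Long sum = 0L;
        for (long i = 0; i < n; i++) {
            sum += i; // unbox, add, box again
        }
        return sum;
    }

    // Same result with primitives only: the loop itself creates no objects.
    static long sumPrimitive(int n) {
        long sum = 0L;
        for (long i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        // Both methods compute the same value; only the amount of garbage differs.
        System.out.println(sumBoxed(n) == sumPrimitive(n));
    }
}
```

Both methods return the same sum; the difference is only in how much short-lived garbage the loop can leave behind for the collector.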