Should I run my application on a few instances (i.e. machines) with a large memory size or on many instances with a small memory size? Which strategy is optimal? This question comes up often. After building applications for two decades and building JVM performance engineering/troubleshooting tools (GCeasy, FastThread, HeapHero), I still don’t know the right answer to this question. At the same time, I believe there is no binary answer to it. In this article, I would like to share my observations and experiences on this topic.
A story of two multi-billion-dollar enterprises
Since our JVM performance engineering/troubleshooting tools are widely used in major enterprises, I have had the opportunity to see world-class enterprise application implementations in action. Recently I had the chance to see two hyper-growth technology companies (if I said their names, everyone reading this article would know them). Both companies are headquartered in Silicon Valley. Their business is technology, so they know what they are doing when it comes to engineering. They are Wall Street darlings, enjoying great valuations. Their market caps are on the order of several billion dollars. They are the poster children of modern thriving enterprises. For our conversation, let’s call these two enterprises company-A and company-B.
It surprised me immensely to see how the two enterprises have adopted *two extremes* when it comes to memory size. Company-A has set its heap size (i.e. -Xmx) to 250gb, whereas company-B has set its heap size to 2gb; i.e. company-A’s heap size is 125 times company-B’s. Both enterprises are confident about their memory size settings. As they say, ‘the proof is in the pudding’: both enterprises are scaling and handling billions of business-critical transactions.
It is a great experience to see two companies that are in the same business, have more or less the same revenue and market cap, and are located in the same geographical region adopt two extremes in memory size at the same point in time. Given this real-life experience, what is the right answer? Large or small memory size? My conclusion is: you can succeed with either strategy if you have a good team in place.
Large memory size can be expensive
A large memory size with few instances (i.e. machines) tends to be more expensive than a small memory size with a greater number of instances. Here is some simple math, based on the cost of AWS EC2 instances in the US East (N. Virginia) region:
m4.16xlarge – 256GB RAM – Linux on-demand instance cost: $3.2/hour
t3a.small – 2GB RAM – Linux on-demand instance cost: $0.0188/hour
So, to have a capacity of 256GB RAM, we would have to get 128 t3a.small instances (i.e. 128 instances x 2GB = 256GB).
128 x t3a.small – 2GB RAM – Linux on-demand instance cost: $2.4064/hour (i.e. 128 x $0.0188/hour)
This means the large memory size with few instances costs $0.7936/hour (i.e. $3.2 – $2.4064) more than the small memory size with a lot of instances. In other words, the ‘large memory size with few instances’ strategy is about 33% more expensive.
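The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that uses the prices quoted in this article, which will drift as AWS changes its pricing. It compares RAM capacity only and ignores CPU, network, and operational costs; the variable names are illustrative.

```python
import math
from decimal import Decimal

# Prices as quoted in this article (Linux on-demand, US East); treat them as a snapshot.
LARGE_PRICE = Decimal("3.2")     # m4.16xlarge, 256 GB RAM, dollars per hour
SMALL_PRICE = Decimal("0.0188")  # t3a.small, 2 GB RAM, dollars per hour
LARGE_RAM_GB = 256
SMALL_RAM_GB = 2

# Number of small instances needed to match the RAM of one large instance.
small_count = math.ceil(LARGE_RAM_GB / SMALL_RAM_GB)

# Hourly cost of that fleet, and how much more the single large instance costs.
small_total = SMALL_PRICE * small_count
extra = LARGE_PRICE - small_total
extra_pct = extra / small_total * 100

print(f"{small_count} small instances cost ${small_total}/hour")
print(f"The large instance costs ${extra}/hour more ({extra_pct:.1f}% more)")
```

Decimal is used instead of float so that the dollar amounts match the hand calculation exactly.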
Of course, a counter-argument can be made: you might need fewer engineers, less electricity, and less real estate if you have a smaller number of machines. Patching and upgrading servers might be easier as well.
In some cases, the nature of your business itself dictates the memory size of your application. Here is a real-life case that we faced: when we built HeapHero (a heap dump analysis tool), our tool’s memory size had to be larger than the heap dump file it parses. Suppose the heap dump file size is 100gb; then the HeapHero tool’s memory size must be more than 100gb. There is no choice.
Suppose you are caching a large amount of data (say 200gb) to maximize your application’s performance; then your heap size must be more than 200gb. You will not have a choice. Thus, in some cases, the business requirement will dictate your memory size.
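The constraint in these two examples can be written as a simple check. This is only a sketch: the function name `heap_can_hold` and the 20% headroom for everything else the application allocates are illustrative assumptions, not figures taken from HeapHero or from any measured workload.

```python
def heap_can_hold(required_gb: float, heap_gb: float, headroom: float = 0.2) -> bool:
    """Return True if a heap of heap_gb can hold one data set of required_gb.

    headroom is an assumed fraction of extra space for the rest of the
    application; 0.2 is a placeholder, not a recommendation.
    """
    return heap_gb >= required_gb * (1 + headroom)


# The 200 GB cache from the example, checked against the two heap sizes
# mentioned earlier in the article (2 GB and 250 GB).
print(heap_can_hold(200, 2))
print(heap_can_hold(200, 250))
```

When the check fails for the small heap, adding more small instances does not help unless the data itself can be divided among them.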
Performance & Troubleshooting
If your memory size is large, then Garbage Collection pause times will typically also be high. Garbage collection is a process that runs in your application to clean up unreferenced objects in memory. If your memory size is large, then the amount of garbage in memory will tend to be large as well. Thus, the amount of time taken to clean up the garbage will also be high. When garbage collection runs, it pauses your application. But there are solutions to this problem:
- You can use a pauseless JVM (such as Azul’s)
- You can do proper GC tuning to reduce pause times
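As one illustration of what GC tuning can look like, the sketch below assembles a HotSpot command line that caps the heap and asks the G1 collector for a pause-time goal. `-Xmx`, `-XX:+UseG1GC` and `-XX:MaxGCPauseMillis` are standard HotSpot options; the helper name, the 200 ms goal and `app.jar` are placeholders chosen for this example. The pause goal is a target the collector aims for, not a guarantee, and real tuning should be driven by your own GC logs.

```python
from typing import List


def jvm_command(heap_gb: int, max_pause_ms: int, jar: str = "app.jar") -> List[str]:
    """Build a java command line with a heap cap and a G1 pause-time goal.

    The jar name is a hypothetical placeholder for your application.
    """
    return [
        "java",
        f"-Xmx{heap_gb}g",                       # maximum heap size
        "-XX:+UseG1GC",                          # select the G1 collector
        f"-XX:MaxGCPauseMillis={max_pause_ms}",  # pause-time goal, best effort
        "-jar",
        jar,
    ]


# Illustrative values only: a 2 GB heap with a 200 ms pause goal.
print(" ".join(jvm_command(heap_gb=2, max_pause_ms=200)))
```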
Similarly, if you need to troubleshoot a memory problem, you will have to capture heap dumps from the application. A heap dump is basically a file that contains all the information about your application’s memory: which objects were present, what they reference, how much memory each object occupies, and so on. Heap dumps of large-memory applications also tend to be very large, and analyzing them is difficult. Even the world’s best heap dump tools, like Eclipse MAT and HeapHero, have challenges parsing heap dumps that are larger than 100gb. Reproducing these problems in a test lab, storing these heap dump files, and sharing them are all challenges.
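Because storing the dump is one of the challenges, it can help to check free disk space before capturing one. The sketch below does that with Python’s `shutil.disk_usage` and then builds (but does not run) a `jmap -dump` command. The helper names, the PID and the output path are hypothetical, and using the configured heap size as an upper bound for the dump size is a rough assumption.

```python
import shutil
from typing import List


def has_room_for_dump(directory: str, heap_bytes: int) -> bool:
    """Return True if the filesystem holding directory has at least heap_bytes free."""
    return shutil.disk_usage(directory).free >= heap_bytes


def heap_dump_command(pid: int, out_file: str) -> List[str]:
    """Build a jmap command that writes a binary heap dump for the given process."""
    return ["jmap", f"-dump:format=b,file={out_file}", str(pid)]


# Hypothetical values: a 2 GB heap, process id 12345, dump written to heap.hprof.
heap_bytes = 2 * 1024**3
if has_room_for_dump(".", heap_bytes):
    print(" ".join(heap_dump_command(12345, "heap.hprof")))
else:
    print("Not enough free disk space for the heap dump")
```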
Emotions come first, rationale next
After reading books like ‘How We Decide’ by Jonah Lehrer, I am fairly convinced that your prior experience and emotions play a key role in deciding your application’s memory size. I used to work for a major financial institution. The chief architect of this financial institution suggested that we run our JVMs with a very large memory size. The rationale he gave was: “We used to run mainframes with very large memory size” 😊.
If you are working for a very large corporation, then there is a 99.99% chance that you will not have a say in what the memory size of your application should be, because that decision has already been made by the elites/demi-gods sitting in the ivory tower 😊. It might be hard to reverse or change that decision.
But if you do have the option to make that decision, your choice of memory size will most likely be influenced by your prior experience and emotions :). Either way (i.e. few instances with large memory size or a lot of instances with small memory size), you can’t go wrong, provided you have the right team in place.