How to keep capacity predictions on target and cut CPU usage by 5x
Lessons from capacity planning a Java enterprise application
Kansas City, Sep 27 2016
Stefano Doni, stefano.doni@moviri.com, @stef3a, linkedin.com/in/stefanodoni
A Business-centric Capacity Modelling Framework
[Figure: IT resource utilization (e.g. CPU utilization %) plotted against business volume (e.g. number of users), marking the current working area, the IT saturation threshold, the residual capacity available, and the maximum business capacity.]
In its current configuration, this system can manage up to 14k users before reaching saturation.
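The framework above can be sketched as a straight-line fit of resource utilization against business volume, extrapolated to the saturation threshold. This is a minimal sketch, assuming a linear relationship; the function names, the sample points, and the 75% threshold are hypothetical and were chosen only so that the result lands near the 14k users quoted on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_business_capacity(volumes, utilization_pct, saturation_pct):
    """Business volume at which the fitted line reaches the saturation threshold."""
    slope, intercept = fit_line(volumes, utilization_pct)
    return (saturation_pct - intercept) / slope


# Hypothetical observations: concurrent users and CPU utilization %.
users = [2000, 4000, 6000, 8000]
cpu_pct = [15.0, 25.0, 35.0, 45.0]

# Hypothetical saturation threshold of 75% CPU utilization.
capacity = max_business_capacity(users, cpu_pct, saturation_pct=75.0)
residual = capacity - max(users)  # headroom left above the highest observed volume
print(f"Maximum business capacity: about {capacity:.0f} users")
print(f"Residual capacity: about {residual:.0f} users")
```

The same extrapolation applies to any resource metric; the rest of the deck argues that for Java applications the metric should be a memory KPI rather than CPU alone.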
What's the Problem with Java Applications?
Application CRASH! Hardware resources were healthy, so where is the bottleneck?
[Figure: CPU utilization %.]
The bottleneck was Java heap memory!
Java Memory Bottlenecks: A Devastating Impact
[Figure: IT resource utilization (e.g. CPU utilization %) vs. business volume (e.g. number of users), comparing the estimated business capacity with the actual business capacity imposed by the Java memory bottleneck.]
Capacity is hugely overestimated! Java bottlenecks must be considered in the model!
KEY TAKEAWAY: Traditional capacity planning techniques can severely overestimate the business capacity of Java applications.
Keeping your Capacity Predictions on Target even with Java Applications!
Java 101: Heap Memory
[Figure: server memory layout, split into Java heap memory, operating system, and free memory.]
Important facts:
1. The size of Java heap memory is fixed: it cannot grow beyond its configured maximum.
2. When memory is exhausted, the garbage collection process kicks in and stops your application!
And why should I care? Well, when the application stops, your customers cannot shop. You're going to lose at least $3000 every second!
KEY TAKEAWAY: Exhaustion of Java heap memory is one of the most common bottlenecks causing outages in Java applications.
Is the Widely Used Java Heap Utilization a Good Metric for Capacity Planning?
[Figure: heap utilization (all app. servers) plotted together with live sessions.]
First Challenge: Finding the Right Metric
Java heap utilization measures how much heap memory is being used, and it is provided by most Java monitoring solutions.
[Figure: Java heap memory of a given heap size, split into used and free; heap memory utilization % vs. number of users.]
Heap utilization is flat, irrespective of the workload increase.
Why Is Heap Utilization a Poor Metric, and How to Come Up with a Better One?
[Figure: heap utilization over time, annotated with garbage collection events, garbage, and live data size.]
Live Data Size is the amount of memory consumed by the set of long-lived objects required to run the application.
How about using the Live Data Size for capacity planning models?
The Ultimate Java Memory KPI: Live Data
[Figure: Java heap memory of a given heap size, split into free and used; the used part is made of garbage and live data (e.g. the application memory footprint).]
KEY TAKEAWAY: Java heap utilization is a combination of live data and garbage. Live Data represents the real memory footprint of the application and is the correct KPI to use for capacity planning.
Mastering Java Applications Capacity - December 2015 #movinar
How to Measure Live Data
Most Java monitoring tools won't make Live Data available. However, let's take a look at a garbage collection log file.
[Figure: example of a garbage collection log (Oracle JVM with Concurrent Mark-Sweep).]
KEY TAKEAWAY: Live Data can be derived from garbage collection logs.
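One way to derive Live Data from a GC log is to read the old generation occupancy right after a full collection, when most garbage has just been removed. The sketch below assumes log lines shaped roughly like the pre-Java 9 HotSpot `-XX:+PrintGCDetails` output with CMS; the exact layout varies with JVM version and flags, so the regular expression, the function name, and the two sample lines are illustrative assumptions, not the log shown in the talk.

```python
import re

# Assumed layout of a full collection entry:
#   "... [Full GC ... [CMS...: <before>K-><after>K(<capacity>K), ..."
# The first "[CMS" group on the line is taken to be the old generation.
OLD_GEN_AFTER_FULL_GC = re.compile(
    r"Full GC.*?\[CMS[^:]*: (\d+)K->(\d+)K\((\d+)K\)"
)


def live_data_samples(log_lines):
    """Return (live_data_kb, old_gen_capacity_kb) for each full collection found."""
    samples = []
    for line in log_lines:
        match = OLD_GEN_AFTER_FULL_GC.search(line)
        if match:
            after_kb = int(match.group(2))      # occupancy after the collection
            capacity_kb = int(match.group(3))   # old generation capacity
            samples.append((after_kb, capacity_kb))
    return samples


# Illustrative lines only: one young collection (ignored) and one full collection.
sample_log = [
    "12.345: [GC [ParNew: 235968K->26176K(235968K), 0.0941 secs] "
    "532140K->330231K(2070976K), 0.0943 secs]",
    "98.765: [Full GC [CMS: 1200000K->800000K(1835008K), 3.5012 secs] "
    "1400000K->800000K(2070976K), 3.5020 secs]",
]

samples = live_data_samples(sample_log)
for live_kb, capacity_kb in samples:
    print(f"live data: {live_kb} KB ({100 * live_kb / capacity_kb:.1f}% of old gen)")
```

Sampling only after full collections gives few data points; a production collector would likely also use the occupancy reported at the end of concurrent CMS cycles.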
The Result of the Data Collection: Live Data Size Looks Promising!
[Figure: heap utilization, live data size, and live sessions.]
The Final Test: Is Live Data Size Correlated with Live Sessions? YES!
[Figure: live data size vs. live sessions, R-squared = 91%.]
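The correlation test can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming R-squared is taken as the squared Pearson correlation of a simple linear fit; the function name and the sample values are hypothetical and are not the data behind the 91% figure.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical samples: live sessions and live data size in GB.
live_sessions = [100, 200, 300, 400]
live_data_gb = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(live_sessions, live_data_gb)
print(f"R-squared = {r2:.0%}")
```

A value close to 1 supports using live sessions as the business driver of Live Data Size; a low value would suggest looking for a different driver.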
Balance Between Cost and Performance
Wasted capacity: conservative thresholds might lead to inefficient use of available capacity.
Performance issues: aggressive thresholds might lead to excessive garbage collections.
[Figure: two heap utilization charts over time, one with the Live Data threshold at 80% and one at 20%.]
KEY TAKEAWAY: A suggested threshold to start from is 50% of heap (Old Gen) size.
Putting Together Java-aware and Business-centric
From Java-aware capacity models to business-centric capacity planning
[Figure: live data utilization (bytes) vs. business volume (e.g. number of users), showing the new estimated business capacity.]
Current infrastructure: 3500 current users, 45 app. server instances
Required infrastructure to support the business initiative: 4500 target users, 60 estimated app. server instances
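The step from a Java-aware model to an instance count can be sketched as follows. This assumes Live Data grows linearly with the users served by one instance and that each instance is sized against a Live Data threshold; every name and number below is hypothetical, and the result is not meant to reproduce the 45 and 60 instances reported on the slide.

```python
import math


def users_per_instance(old_gen_mb, threshold_fraction, base_live_mb, live_mb_per_user):
    """Users one instance can serve before Live Data crosses the threshold."""
    budget_mb = old_gen_mb * threshold_fraction - base_live_mb
    return math.floor(budget_mb / live_mb_per_user)


def instances_required(target_users, per_instance_users):
    """Instances needed for the target users, rounded up to whole instances."""
    return math.ceil(target_users / per_instance_users)


# Hypothetical model parameters, e.g. taken from a fit of Live Data vs. sessions.
per_instance = users_per_instance(
    old_gen_mb=2048,          # Old Gen size of each app server instance
    threshold_fraction=0.5,   # the 50% starting threshold suggested in the deck
    base_live_mb=224,         # Live Data with no users (model intercept)
    live_mb_per_user=10,      # Live Data added by each user (model slope)
)
needed = instances_required(target_users=4500, per_instance_users=per_instance)
print(f"{per_instance} users per instance, {needed} instances for 4500 users")
```

Changing the target users, the heap size, or the threshold turns the same two functions into a what-if analysis.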
Detecting Poor Memory Usage Patterns and Anticipating Memory Leaks: The Model-based Approach
A New Memory Usage Pattern Emerged After a New Application Release
[Figure: live data size vs. live sessions.]
What is causing this?
Another Live Data Size Benefit: Anticipating Memory Leaks
[Figure: two charts of live data size vs. live sessions; the second highlights high memory usage at low load.]
Based on this evidence, the developers investigated the application and found the actual memory leak. They later asked us to include this analysis as part of the release cycle.
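The "high memory usage at low load" pattern can be detected by comparing each observation with what the capacity model predicts for that load. This is a minimal sketch, assuming a linear model of Live Data vs. live sessions is already available; the function name, the model coefficients, the tolerance, and the samples are all hypothetical.

```python
def flag_suspect_samples(samples, slope, intercept, tolerance_mb):
    """Return the (sessions, live_mb) points well above the model prediction."""
    suspects = []
    for sessions, live_mb in samples:
        expected_mb = slope * sessions + intercept
        if live_mb > expected_mb + tolerance_mb:
            suspects.append((sessions, live_mb))
    return suspects


# Hypothetical model fitted on the previous release: 1 MB per session + 200 MB.
MODEL_SLOPE = 1.0
MODEL_INTERCEPT = 200.0

# Hypothetical observations after the new release: (live sessions, Live Data in MB).
observations = [(100, 300), (500, 700), (50, 900), (1000, 1250)]

suspects = flag_suspect_samples(
    observations, MODEL_SLOPE, MODEL_INTERCEPT, tolerance_mb=100
)
print("Suspect samples (possible leak):", suspects)
```

Running such a check against each new release is one way to make the analysis part of the release cycle, as the developers requested.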
Efficiency: Are your CPUs used for the Business, or by the Garbage Collector? Stop the guessing and start measuring!
All of a Sudden, Something Really Weird Happened
[Figure: CPU utilization and server call rate.]
CPU utilization cut by 5x while doing the same amount of work!
No variation in business volumes, no new application release, no changes in physical infrastructure.
The change: +2 GB of Java heap!
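GC CPU utilization is not available in many Java monitoring tools. How can you measure it?
[Figure: example of a GC log fragment on Oracle JVM (-XX:+PrintGCDetails).]
Sum the user and system CPU times reported by the GC events over the interval (e.g. 300 secs, i.e. 5 min), then divide by the CPU time available in that interval:
$$\mathrm{GarbageCollectorCPU}\,\% = \frac{\sum_{\mathrm{Interval}} \left(\mathrm{CPU}_{\mathrm{user}} + \mathrm{CPU}_{\mathrm{sys}}\right)}{\mathrm{Interval} \times \mathrm{CPUNumber}}$$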
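The formula on this slide can be applied to the `[Times: user=... sys=..., real=... secs]` entries that HotSpot prints with `-XX:+PrintGCDetails`. The sketch below is an illustration under assumptions: the sample lines are made up, the function name is hypothetical, and the result is multiplied by 100 to express it as a percentage.

```python
import re

# Matches the CPU times attached to a GC event, e.g.
#   [Times: user=0.33 sys=0.01, real=0.09 secs]
GC_TIMES = re.compile(r"\[Times: user=([\d.]+) sys=([\d.]+), real=[\d.]+ secs\]")


def gc_cpu_percent(log_lines, interval_secs, cpu_count):
    """GC CPU utilization % over one interval, following the slide's formula."""
    cpu_secs = 0.0
    for line in log_lines:
        for user, sys_time in GC_TIMES.findall(line):
            cpu_secs += float(user) + float(sys_time)
    return 100.0 * cpu_secs / (interval_secs * cpu_count)


# Illustrative GC events, all assumed to fall in the same 5-minute interval.
interval_log = [
    "[GC [ParNew: 235968K->26176K(235968K), 0.0941 secs] "
    "532140K->330231K(2070976K), 0.0943 secs] "
    "[Times: user=0.33 sys=0.01, real=0.09 secs]",
    "[Full GC [CMS: 1200000K->800000K(1835008K), 3.5012 secs] "
    "1400000K->800000K(2070976K), 3.5020 secs] "
    "[Times: user=12.50 sys=0.16, real=3.50 secs]",
]

gc_pct = gc_cpu_percent(interval_log, interval_secs=300, cpu_count=4)
print(f"GC CPU utilization: {gc_pct:.2f}%")
```

A real collector would first bucket the log lines by timestamp into intervals and then call the function once per bucket.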
After Data Collection: GC Was the First Consumer of CPU!
[Figure: total CPU utilization and garbage collector CPU utilization %.]
Almost all of the CPU cycles were used by GC! After cluster expansion: total CPU cut in half, GC CPU cut by 5x!
The garbage collector might be the first consumer of your CPUs, well ahead of the actual application code. Stop the guessing, start measuring it!
Scalability in 2015: Java's Achilles' Heel? How to keep it under control!
Unexplained CPU Utilization Patterns During Memory Stressful Conditions
[Figure: CPU utilization and server call rate.]
High CPU utilization during the night, even though workload is zero after 9 PM. What drives CPU utilization during the night?
Let's Find Out! Linux top During the Anomaly
[Figure: example of Linux top output, thread view (press H once in top).]
One software thread consuming all of its CPU cycles? This is the background thread used by the GC!
[Figure: example of Java thread dump (jstack <PID>).]
Can the Java Garbage Collector Be a Scalability Bottleneck?
The Java Concurrent Mark and Sweep garbage collector (CMS) is concurrent and parallel:
- Concurrent = it performs work without stopping the application threads
- Parallel = it is multi-threaded and scales with the number of CPUs
But we discovered that:
1. Just one CMS background thread is configured by default with up to 4 CPUs. This can be increased via a specific option, but watch out for excessive GC CPU utilization.
2. CMS might "fail" and be forced to single-threaded operation.
3. Even best-in-class GCs still need to stop the application: Amdahl's law applies!
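Amdahl's law, mentioned in the last point, bounds the speedup obtainable from more CPUs when part of the work stays serial, as a single-threaded or stop-the-world GC phase does. A minimal sketch follows; the 90% parallel fraction is a hypothetical figure chosen for illustration.

```python
def amdahl_speedup(parallel_fraction, cpu_count):
    """Speedup over one CPU when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpu_count)


# Hypothetical workload: 90% of the work scales with CPUs, 10% is serial
# (for instance, time spent in a single-threaded GC phase).
for cpus in (1, 4, 16, 64):
    print(f"{cpus:>2} CPUs -> speedup {amdahl_speedup(0.9, cpus):.2f}x")

# However many CPUs are added, the speedup stays below 1 / serial_fraction.
upper_bound = 1.0 / (1.0 - 0.9)
```

With a 10% serial share the speedup can never reach 10x, which is why a GC forced into single-threaded operation limits scalability regardless of the hardware added.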
Conclusions So What Have We Learned?
Key Takeaways
What have we discovered?
- Traditional capacity models might severely overestimate the business capacity of Java applications
- The major consumer of your infrastructure resources might be the garbage collector
- Java memory management can have an impact on your application scalability
- Common monitoring tools might not provide all the metrics you need: the key metrics to look for might not be reported by your typical toolset
Our contribution to close the gap
- An enhanced capacity model that takes Java memory into account and supports what-if analyses, using innovative KPIs
- The need to get visibility into real garbage collection CPU utilization, and how to gather it
- How to control the problem by keeping track of single-threaded problems
Be sure to enable detailed GC logging on all your Java enterprise apps and integrate the KPIs in your CM solution!
Java Memory Stress Translates to Poor Application Performance
[Figure: GC pause time (seconds); application stopped for 66 seconds.]
KEY TAKEAWAY: Excessive GC stress might cause poor user experience or even service failures. You need to monitor it!
Questions?
Contacts
Headquarters: Via Schiaffino 11C, 20158 Milan, Italy, T +39-024951-7001
USA East: 283 Franklin Street, Boston, MA 02110, T +1-617-936-0212
USA West: 425 Broadway Street, Redwood City, CA 94063, T +1-650-226-4274
@moviri, moviricorp, moviri, +moviri