How to set up a reasonable memory limit for Java applications in Kubernetes
This article walks through what we discovered about Java memory usage in Kubernetes and how to set a reasonable memory request/limit based on the Java heap requirement and the actual memory usage of the application.
Context
The trigger for me to look into the memory usage of Java applications in Kubernetes was an increase in OOM events in production. After investigation, the events were not caused by a shortage of JVM heap, so I needed to find out where the non-heap memory was going.
(If you’re not familiar with OOM events, you can check the article “How to alert for Pod Restart & OOMKilled in Kubernetes”.)
Container Metrics
There are several metrics for memory usage in Kubernetes:
container_memory_rss (cadvisor): the amount of anonymous and swap cache memory (includes transparent hugepages).
container_memory_working_set_bytes (cadvisor): the amount of working set memory, which includes recently accessed memory, dirty memory, and kernel memory. The working set is <= “usage” and equals usage minus total_inactive_file.
resident set size: container_memory_rss + file_mapped (file_mapped is accounted only when the memory cgroup is the owner of the page cache).
Kubernetes relies on container_memory_working_set_bytes to OOM-kill a container that exceeds its memory limit, so we’ll use this metric in the following sections.
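As a quick sanity check, here is a minimal PromQL sketch for watching this metric against the configured limit; kube_pod_container_resource_limits comes from kube-state-metrics, and the namespace/container selectors are placeholders to adjust for your environment:

```
# Fraction of the memory limit currently used by each container
sum by (namespace, pod, container) (
  container_memory_working_set_bytes{namespace="my-namespace", container="my-app"}
)
/
sum by (namespace, pod, container) (
  kube_pod_container_resource_limits{namespace="my-namespace", container="my-app", resource="memory"}
)
```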
Heap Usage << Memory Limit
After we noticed several OOMs in production, it was time to figure out the root cause. According to the JVM metrics, the heap size was far below the Kubernetes memory usage. Let’s check a sample application whose memory usage climbed as high as 90% of its memory limit.
The initial and max heap size is 1.5G, set by -XX:InitialHeapSize=1536m -XX:MaxHeapSize=1536m -XX:MaxGCPauseMillis=50.
The Kubernetes memory request and limit are 3G, set in deployment.yaml under the container’s resources section.
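A minimal sketch of what that resources section would look like, assuming both the request and the limit are set to the 3G mentioned above:

```
resources:
  requests:
    memory: 3Gi
  limits:
    memory: 3Gi
```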
In the monitoring dashboard, the Kubernetes memory usage was close to 90% of the memory limit, i.e. about 2.7G.
In other words, the non-heap memory took 2.7G - 1.5G = 1.2G.
A close-up on Java memory
The JVM uses both heap and non-heap memory; let’s take a sample. The data below is from an application started with -XX:InitialHeapSize=1536m -XX:MaxHeapSize=1536m -XX:MaxGCPauseMillis=50.
Heap
Since we set the max size of the heap, we can consider its upper limit fixed at 1.5G. With G1 as our GC algorithm, the heap is divided into Eden, Survivor, and Old regions.
We can check the heap info with jcmd. As we can see, the used heap (593M) is way below the committed size (1.5G), so we are good with the heap usage.
```
bash-4.2$ jcmd 99 GC.heap_info
```
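If you are running this from outside the pod, a sketch using kubectl exec; the pod name, container name, and JVM pid are placeholders:

```
# List the JVMs running in the container to find the pid
kubectl exec -it <pod-name> -c <container-name> -- jcmd

# Show the heap info for that JVM
kubectl exec -it <pod-name> -c <container-name> -- jcmd <pid> GC.heap_info
```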
To get more debugging information, I enabled Native Memory Tracking in the app (switch it on by adding -XX:NativeMemoryTracking=[off | summary | detail] to the Java options); I chose detail to show more details of the memory usage.
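Put together, the launch command used for this experiment would look roughly like the sketch below; app.jar is a placeholder, and note that NMT itself adds some memory and CPU overhead:

```
java -XX:InitialHeapSize=1536m -XX:MaxHeapSize=1536m -XX:MaxGCPauseMillis=50 \
     -XX:NativeMemoryTracking=detail \
     -jar app.jar
```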
After the application had run for 5 days, the memory usage had slowly climbed to 78.29% of the memory limit (3G), namely about 2.35G.
Let’s use jcmd to show where the memory went:
```
bash-4.2$ jcmd 99 VM.native_memory
```
As the output shows, the total committed memory is 2.18G, and the committed heap size is 1.5G, the same as we specified.
For the other sections, Internal took 362M (371646KB), GC took 128M (131736KB), Class took 89M (91739KB), Code took 47M (48115KB), Symbol took 18M (18735KB), Thread took 7M, …
(I didn’t count the Native Memory Tracking section itself, because it is the overhead of the tracking, not the real situation in production.)
Since I had set an NMT baseline earlier, we can run jcmd <process-id> VM.native_memory detail.diff to see which call sites consumed the memory.
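For reference, the baseline/diff workflow looks like this (the pid 99 follows the earlier examples):

```
# Record a baseline to diff against later
bash-4.2$ jcmd 99 VM.native_memory baseline

# Later, show what changed relative to the baseline
bash-4.2$ jcmd 99 VM.native_memory detail.diff
# or a shorter report:
bash-4.2$ jcmd 99 VM.native_memory summary.diff
```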
In the output below (see the full output), I omitted the sections that didn’t increase much. Compared to the baseline, the total committed memory increased by 437M (448201KB), and the Internal section increased the most, by 361M (373425KB).
```
Native Memory Tracking:
```
From the diff you can see that the memory in Internal was consumed by TraceMemoryManagerStats::~TraceMemoryManagerStats(), which is related to GC. So it seems GC creates some bookkeeping data whose size slowly increases; GC + Internal together consumed 493M. Now we know where the non-heap memory goes.
G1 Tuning?
Java 11 uses G1 as the default GC algorithm, CMS (Concurrent Mark Sweep) is deprecated, and the Java 11 documentation mentions:
“The general recommendation is to use G1 with its default settings, eventually giving it a different pause-time goal and setting a maximum Java heap size by using -Xmx if desired.”
So with G1, a complicated configuration is not really necessary: you just state a goal and G1 will try its best to meet it. This also hints at why G1 consumes more and more memory, since it gathers information about the application’s memory behavior to optimize allocation.
To fit a Java application into its Kubernetes memory limit, we need to specify several things:
InitialHeapSize and MaxHeapSize
Setting these limits the heap memory, and setting them to the same value reduces heap resizing. You can set MaxHeapSize either to a static value based on the observed usage, or to a ratio of the memory limit like 50% or 60%, depending on your real usage.
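Both options look roughly like the sketch below; -Xms/-Xmx are shorthand for InitialHeapSize/MaxHeapSize, -XX:MaxRAMPercentage (JDK 10+, backported to 8u191) sizes the heap relative to the container memory limit, and app.jar is a placeholder:

```
# Option 1: static heap size
java -Xms1536m -Xmx1536m -jar app.jar

# Option 2: heap as a ratio of the container memory limit (e.g. 60%)
java -XX:InitialRAMPercentage=60.0 -XX:MaxRAMPercentage=60.0 -jar app.jar
```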
MaxGCPauseMillis
Its default value is 200; G1 will try to balance throughput and pause time based on this value. There are two directions in GC tuning:
“Increase the throughput” means reducing the overall time spent on GC.
“Improve the GC pause time” means doing GC more frequently; for example, G1 will reduce the size of the young generation (Eden and Survivor regions) to trigger young GCs more often.
These two goals conflict, so in most cases we just set our pause-time goal and let G1 configure itself based on it.
Calculate the required memory based on monitoring
As the memory analysis showed, the required memory = the max heap size + JVM non-heap memory (GC, Metaspace, Code, etc.), plus some headroom, since the application might need extra native memory when using JNI APIs; java.util.zip.Inflater, for example, allocates native memory for (de)compression.
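Plugging in the numbers from the NMT output above gives a rough lower bound (thread stacks and JNI allocations add to this):

```
  max heap              1536M
+ Internal               362M
+ GC                     128M
+ Class                   89M
+ Code                    47M
+ Symbol, Thread, ...    ~25M
-----------------------------
≈ 2.2G committed by the JVM, so a 3G limit leaves roughly 0.8G of headroom
```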
It’s hard to give an exact memory limit at first, so we can start with a loose limit that leaves more room for the non-heap memory. After the application has run in production for some time and the metrics are in place, we can adjust the memory limit based on the monitoring data.
To help developers realize whether the memory limit is reasonable, we can set some thresholds for the application’s resource usage; if the app falls outside them, we generate warnings for the developers.
Memory request is too high: mem_usage(p95) / mem_request <= 0.6
Memory request is too low: mem_usage(p95) / mem_request >= 0.9
Memory limit is too low: mem_usage(p95) / mem_limit >= 0.85
The Prometheus query we use computes the memory usage P95 over the last 7 days with quantile_over_time.
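A sketch of such a query, with placeholder namespace/container selectors to adjust for your environment:

```
quantile_over_time(
  0.95,
  container_memory_working_set_bytes{namespace="my-namespace", container="my-app"}[7d]
)
```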
You can put these metrics on the monitoring dashboard, trigger warning-level alerts to the responsible team, and iterate on the memory limit accordingly; once you have more data, I think this can also be automated.
Conclusion
For Java applications, we recommend setting the max heap to either a static value or a reasonable ratio of the memory limit (40% ~ 60%) based on the heap usage, and making sure to leave enough space for GC and other native memory usage.
For other applications, we can set the required memory based on the monitoring data. We should always give the application enough free memory; the three thresholds we set above are a good starting point.