How to debug Kubernetes OOMKilled when the process is not using memory directly

We investigated a memory growth problem some time ago and learned a lot about JVM metrics. It has happened again: several Java applications deployed in Kubernetes showed memory usage increasing gradually until it reached the memory limit. Even after we raised the limit several times, usage still climbed above 90%, and sometimes the container was OOMKilled.

How to set up a reasonable memory limit for Java applications in Kubernetes
How to alert for Pod Restart & OOMKilled in Kubernetes
Use Traffic Control to Simulate Network Chaos in Bare metal & Kubernetes
Implement zero downtime HTTP service rollout on Kubernetes

You might have encountered some 5xx errors during an HTTP service rollout on Kubernetes and wondered how to make rollouts more reliable. This article first explains where these errors come from, then shows how to fix them and achieve zero downtime.

How does Prometheus query work? - Part 1, Step, Query and Range
InfluxDB command cheatsheet

This article is an InfluxDB command cheatsheet covering how to interact with the InfluxDB server and query metrics. The InfluxDB version I tested is v1.7.10.

How to build the smallest Docker image as fast as you can
How to check and monitor SSL certificates expiration with Telegraf

As a developer or operator of a website, you may find that an expired certificate stops your services from working. I’ll introduce how to monitor certificates such as SSL, JKS, and P12 using Telegraf.

Certificates are broadly used for security; they can protect both internal and public service communication. The most common is the TLS certificate, which verifies the identity of an HTTPS service. To increase security, a certificate is valid only for a limited period and then expires. To avoid outages caused by expired certificates, we should rotate them periodically, monitor them, and alert on expiration. Telegraf is a popular metrics collection tool that can implement this.

Practice datacenter failover in production

A distributed system is like the human body: it will have issues and break. There’s a theory that if we deliberately and constantly feed the body problems, it becomes more stable and robust. The same applies to systems: inject issues into datacenters and let them fail over automatically.
