Kubernetes has become the de facto standard for running containerized microservices. It offers extensibility, stability, and insight that are hard to find in other open source solutions. It has become a staple of the DevOps toolchain and shows no signs of slowing down. When companies wish to run Kafka or Elasticsearch, Kubernetes is often the go-to option.

However, there is a problem.

When your Kubernetes cluster grows, your operational challenges begin to scale. Fast. Add Kafka and Elasticsearch to the mix and you’ve got a complex engineering challenge.

Without a sensible set of measures in place, you’re going to find yourself out of touch with the true state of your cluster. So how do you keep up? Here are three things that will help you operate your cluster with confidence.

1: Implement Observability by Default

Observability is easy to define: “Observability is the extent to which you can understand the internal state or condition of a complex system based only on knowledge of its external outputs.” Where monitoring gives you a graph, observability gives you a query box. Everyone understands how important this is, but many teams make an early and fatal mistake: they treat observability as something to add later rather than something every service has from the start.

When services are deployed into a cluster, their metrics must be available out of the box. Every pod should report its CPU, memory, network in, network out, and so on. These metrics form the basis for actionable insights.
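One way to make metrics available "out of the box" is to add them to every manifest before it is applied, so that no team has to remember to opt in. The sketch below assumes a Prometheus setup whose scrape configuration honors the conventional `prometheus.io/*` pod annotations. The helper name `with_default_metrics` and the port value are hypothetical, so adapt them to whatever your own tooling expects.

```python
import copy

# Assumption: your Prometheus scrape config is set up to discover pods
# through these conventional annotations. They do nothing on their own.
DEFAULT_METRICS_ANNOTATIONS = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "9090",   # hypothetical default metrics port
    "prometheus.io/path": "/metrics",
}


def with_default_metrics(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest whose pod template carries
    the default metrics annotations.

    Annotations that the service already sets are left untouched, so a
    team can still override the port or path.
    """
    result = copy.deepcopy(deployment)
    template_meta = (
        result.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
    )
    annotations = template_meta.setdefault("annotations", {})
    for key, value in DEFAULT_METRICS_ANNOTATIONS.items():
        annotations.setdefault(key, value)
    return result


if __name__ == "__main__":
    manifest = {
        "kind": "Deployment",
        "spec": {"template": {"metadata": {"annotations": {"prometheus.io/port": "8080"}}}},
    }
    print(with_default_metrics(manifest)["spec"]["template"]["metadata"]["annotations"])
```

Running a step like this in your deployment pipeline means a service has to opt out of metrics rather than opt in.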

Kafka and Elasticsearch are incredibly complex systems, and these metrics are essential to understanding your cluster’s health. Sudden CPU spikes or sawtooth memory graphs may indicate garbage collection issues. This is just one of the classes of problems that require low-level data.

Implementing observability by default means consistent, clean, powerful, and complete data.

2: Choose Your Instance Types Wisely

Clusters often run on a single instance type. This can be a very large instance type, such as a c5.9xlarge, or a collection of tiny virtual machines. Each option has its drawbacks, and you can reason about them in general terms. If you choose large servers, you may find that you’re spending a great deal of money on resources that are not being effectively consumed. If you pick small servers, you will find that some applications require more memory than a single node can provide. What can you do?

Rather than sticking to one instance type, create a toolkit of instance types. In the past, I have seen “small,” “medium,” and “large” node pools that can be tapped into. This is a good idea because each workload can be scheduled onto the node size that suits it. Whenever you implement your different node pools, assume you’ll need more! If you’re using Infrastructure as Code, document the process for adding a pool, or drive the code from a list so that adding one is a small change.
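Here is a minimal sketch of what driving node pools from a list might look like. The `NodePool` class, the `node-pool` label key, and the sizes are all hypothetical, and the output is a plain list of dictionaries that you would hand to whatever Infrastructure as Code tool you use.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class NodePool:
    name: str
    instance_type: str
    min_size: int
    max_size: int


# Hypothetical pools: adding a new one is a one-line change to this list.
NODE_POOLS = [
    NodePool("small", "m5.large", 1, 10),
    NodePool("medium", "m5.2xlarge", 1, 10),
    NodePool("large", "c5.9xlarge", 0, 5),
]


def render_node_pools(pools: list[NodePool]) -> list[dict]:
    """Validate the pool list and turn it into plain dictionaries.

    Each pool gets a label (an assumed "node-pool" key) so that workloads
    can target it with a node selector.
    """
    names = [pool.name for pool in pools]
    if len(names) != len(set(names)):
        raise ValueError("node pool names must be unique")
    for pool in pools:
        if not 0 <= pool.min_size <= pool.max_size:
            raise ValueError(f"invalid size range for pool {pool.name!r}")
    return [{**asdict(pool), "labels": {"node-pool": pool.name}} for pool in pools]


if __name__ == "__main__":
    for rendered in render_node_pools(NODE_POOLS):
        print(rendered)
```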

Multiple node types will benefit you in numerous ways:

  • Reducing the amount of overhead in wasted resources
  • Hosting small and large applications
  • Creating nodes with different permissions or network access
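To make the first two benefits concrete, here is a rough sketch that picks the smallest pool able to host a workload. The pool names and the `smallest_fitting_pool` helper are hypothetical. The capacities are the nominal vCPU and memory figures for m5.large, m5.2xlarge, and c5.9xlarge; real allocatable capacity on a node is lower once system overhead is reserved.

```python
# (pool name, vCPUs, memory in GiB): nominal figures, used for illustration only.
POOLS = [
    ("small", 2, 8),
    ("medium", 8, 32),
    ("large", 36, 72),
]


def smallest_fitting_pool(cpu_request: float, memory_gib_request: float, pools=POOLS):
    """Return the name of the smallest pool that satisfies both requests,
    or None if no single node is big enough."""
    candidates = [
        pool for pool in pools
        if pool[1] >= cpu_request and pool[2] >= memory_gib_request
    ]
    if not candidates:
        return None
    # "Smallest" here means least memory, then fewest vCPUs.
    return min(candidates, key=lambda pool: (pool[2], pool[1]))[0]


if __name__ == "__main__":
    print(smallest_fitting_pool(1, 4))    # fits on the small pool
    print(smallest_fitting_pool(2, 64))   # only the large pool has enough memory
```

With a single large instance type, the first workload would occupy a sliver of a big node. With a single small type, the second would not fit at all.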

When you’re running Kafka on your Kubernetes cluster, you might want to look into the i3 instance type on AWS. These nodes are storage optimized, which complements Kafka’s blazing-fast write capabilities. However, this performance benefit comes with a risk, which the storage section below covers.

Elasticsearch holds a great deal of data in memory on the data nodes. This means that you’ll need to focus on memory-optimized instances. Elasticsearch has a few different types of nodes. For your data nodes, you can look at the R5 class of instances on AWS. For your master nodes, you can look at compute-optimized instances like C5 while the cluster is small. Once you exceed around 75 nodes, you should also consider R5 instance types for your master nodes.
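The rule of thumb above is simple enough to write down as code, which can be handy if your node pools are generated rather than hand-written. This is only a sketch: the function name is hypothetical, and the 75-node threshold is the approximate figure from the paragraph above, not a hard limit.

```python
# Approximate cluster size above which R5 is worth considering for masters.
MASTER_R5_THRESHOLD = 75


def suggest_instance_family(role: str, cluster_nodes: int) -> str:
    """Suggest an AWS instance family for an Elasticsearch node role."""
    if cluster_nodes < 1:
        raise ValueError("cluster_nodes must be at least 1")
    if role == "data":
        # Data nodes hold a lot in memory, so prefer memory-optimized.
        return "r5"
    if role == "master":
        # Compute-optimized for small clusters, memory-optimized beyond that.
        return "r5" if cluster_nodes > MASTER_R5_THRESHOLD else "c5"
    raise ValueError(f"unknown role: {role!r}")


if __name__ == "__main__":
    print(suggest_instance_family("data", 10))
    print(suggest_instance_family("master", 10))
    print(suggest_instance_family("master", 120))
```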

3: Don’t Forget About Storage

When scaling your cluster, it’s very tempting to consider the instance type to be the most crucial element. This is a dangerous assumption to make because while the instance may be optimized for storage, you’re going to limit your performance if you end up selecting a basic storage solution.

I3 instance types offer up to 25 Gbps of throughput, but only if you use the built-in instance storage, and this comes at a cost. If your i3 node dies, you risk data loss and, while that node is down, you lose access to the data. If you instead choose to attach EBS volumes to your i3 instance, this speed drops to 14 Gbps.
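To put those two numbers side by side, here is a small piece of arithmetic. It takes the 25 and 14 figures from the paragraph above at face value, and the helper names are hypothetical.

```python
LOCAL_STORAGE_GBPS = 25.0  # figure quoted above for built-in storage
EBS_GBPS = 14.0            # figure quoted above once EBS is attached


def gbps_to_gigabytes_per_second(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


def relative_drop(before: float, after: float) -> float:
    """Fraction of throughput lost when going from `before` to `after`."""
    return (before - after) / before


if __name__ == "__main__":
    print(gbps_to_gigabytes_per_second(LOCAL_STORAGE_GBPS))  # 3.125
    print(gbps_to_gigabytes_per_second(EBS_GBPS))            # 1.75
    print(f"{relative_drop(LOCAL_STORAGE_GBPS, EBS_GBPS):.0%}")  # 44%
```

In other words, moving to EBS on these instances gives up a little under half of the quoted throughput in exchange for durability.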

So what should you use for storage?

Unless you’re planning on processing a frightening amount of data, you don’t need to take the risk. Instead, focus on the R5 family of instance types and make use of EBS storage. This offers a great blend of performance and stability. 

You should avoid the gp2 and gp3 EBS options, because they don’t offer the low latency these workloads need. Focus on the io1 and io2 types. They combine high performance with the resilience that EBS offers. If you really need more, you can look at the io2 Block Express option.
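In Kubernetes terms, this choice usually ends up in a StorageClass. The sketch below builds one as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. It assumes the AWS EBS CSI driver is installed in your cluster (provisioner `ebs.csi.aws.com`). The function name, the class name, the IOPS value, and the reclaim policy are illustrative assumptions, not recommendations.

```python
import json


def io_storage_class(name: str, volume_type: str = "io2", iops: int = 4000) -> dict:
    """Build a StorageClass manifest for provisioned-IOPS EBS volumes.

    Assumes the AWS EBS CSI driver is installed in the cluster.
    """
    if volume_type not in {"io1", "io2"}:
        raise ValueError("this sketch only covers the io1 and io2 volume types")
    if iops <= 0:
        raise ValueError("iops must be positive")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "ebs.csi.aws.com",
        # StorageClass parameter values must be strings.
        "parameters": {"type": volume_type, "iops": str(iops)},
        # Delay provisioning until a pod is scheduled, so the volume is
        # created in the same availability zone as the node.
        "volumeBindingMode": "WaitForFirstConsumer",
        # Assumption: keep the volume when the claim is deleted.
        "reclaimPolicy": "Retain",
    }


if __name__ == "__main__":
    print(json.dumps(io_storage_class("fast-io2"), indent=2))
```

Your Kafka or Elasticsearch StatefulSets would then reference the class by name in their volume claim templates.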


There are many more details to consider when you’re scaling your cluster, but these three are among the most important. Think about your instance types, optimize your storage, and bake observability into your cluster by default. These maxims will help you avoid the most common problems and get real value from your cluster.