Different types of scaling in Apache Kafka

Written by Fabio Pardi

Most big data projects will have to deal with scalability at some point, either to scale up or down. Apache Kafka is a remarkable piece of software that connects the world of “big data” and offers different scaling capabilities.

Data is essential in today's world. The more data used, the more disk space is needed. There might be a moment when disk space runs out. Fortunately, Kafka has the capability of scaling up or down.

Are you scaling vertically or horizontally?

Scaling in the IT domain can be accomplished either vertically or horizontally.

  • Scaling vertically (up) means giving more resources to the machine. Scaling a Kafka installation vertically provides a broker with more CPU, RAM, and/or disk.
  • Scaling horizontally (out) means adding brokers to the Kafka cluster without necessarily adding resources to the single brokers.

The following is a detailed analysis of the two solutions and when to apply them:

Scaling vertically

Before adding resources to Kafka, there are a few things to consider. First, it is best to have an evenly distributed cluster regarding resources. In other words, if there are three brokers, it is a good idea to add the same amount of resources to all brokers. Maintaining the same amount of RAM, CPU, disk space, and possibly the same specs (i.e., the same disk speed) is important.

While RAM and disk space are usually the problems, the CPU rarely has issues. If the broker is large with compaction enabled, many topics and partitions, and much used by our clients, then more CPU might be needed. Adding disk space might be necessary if a lot of data is hosted, and there will be more to come.

CloudKarafka has specified a fixed amount of RAM that the JVM allocates instantly, allowing room for the OS to operate.

Scaling vertically has limits, and while it helps, it is also more expensive (large disks typically cost more).

Scaling horizontally

Kafka is meant to run in a cluster. Think of Kafka as a wolf rather than a bear. Kafka shines when there are multiple brokers, in fact, it was designed to scale horizontally. Having three or more brokers not only enables high availability (HA), but is also a good way of spreading the load.

More often than not, administrators take this route when it comes to scaling. This kind of scalability is possible whenever the load can be spread over the brokers. For instance, when multiple topics are present and/or (best practice) when the topics have multiple partitions.

The number of partitions often gives the limitation of this approach. Imagine, for simplicity, serving one big topic with nine partitions. Initially, there are three brokers, and they host three partitions each. When the topic becomes 9 TB, for example, each broker should host 3 TB of space. Next, the consideration would be to scale to 9 brokers hosting 1 TB each.

If the data grows further, there will be no more room to scale horizontally. So scaling must then occur vertically by adding space to the brokers (or looking for other possible solutions, like increasing the number of partitions).

Don't use too many replicas

Having too many replicas creates too much overhead for minimal improvement in HA. Furthermore, it's infrequent for multiple nodes to go down if they are correctly separated geographically (availability zones).

A balance between both vertical and horizontal scaling can sometimes be the best option. CloudKarafka doesn't allow scaling up to many nodes with small instance types since the amount of "chit-chat" between them can consume too many resources.

Scaling out

There might also be a case for reducing the assigned resources, for instance, after an initial load of data when the traffic requirements are lower. Previously, assigning fewer resources to each broker was enough if scaling occurred vertically. If the broker runs on VMs or K8s, the scale-out is simple, but it might not be a viable option if it runs on hardware.

Instead, if scaling was accomplished horizontally, the number of brokers in the cluster can be reduced. This operation is simple (K8s, virtual machines, hardware, CloudKarafka), but the extra step of 'freeing' the brokers before shrinking the cluster must be performed. Freeing means telling Kafka to move the hosted partitions to other brokers that will survive the scale-out process, which is handled automatically in CloudKarafka.

Moving partitions might take some time, depending on the amount of data hosted in the cluster. Checking for In-Sync Replicas (ISR) is an excellent way to see when the partition-moving process is finished.

This process can be automated by creating a "Kafka auto scaler" to save on resources based on the amount of traffic and the load on the brokers.

Conclusions

Kafka is meant to run in a cluster with at least three brokers. It was designed to scale horizontally. In terms of best practices, it is wise not to use too many brokers, especially not on small instance types, when too many resources would be used to communicate between them.

Kafka operates best with margins in CPU and RAM. Rebalancing partitions can be costly, so stay within the edge of what the individual broker can handle. If another broker goes down, the load will increase on the other replicas.

It is easy to add and remove nodes in CloudKarafka, and to select the size of the cluster regarding instance types. Check out the plan page to view available options.

CloudKarafka - Industry Leading Apache Kafka as a Service