Apache Kafka replication factor – What's the perfect number?

Written by Lovisa Johansson

Is there a "perfect number" to choose as Kafka's ultimate replication factor? The answer is it depends on your setup. However, to guarantee high availability, a few tips and tricks can help you along the way.

There are numbers in mathematics known as 'perfect numbers'. This happens if the number equals the sum of its divisors, like 6 and 28. In the Kafka world, things are different. Figuring out how to set the 'perfect number' depends on a few extra considerations.

What is the 'Replication Factor'?

Apache Kafka ensures high data availability by replicating data via the replication factor in Kafka. The replication factor is the number of nodes to which your data is replicated.

When a producer writes data to Kafka, it sends it to the broker designated as the 'Leader' for that topic:partition in the cluster. Such a broker is the entry point to the cluster for the topic's data:partition.

If we use 'replication factor' >1, writes will also propagate to other brokers called 'followers.' This fundamental operation enables Kafka to provide high availability (HA).

If the producer's settings are set to wait for all brokers to acknowledge a write, it will wait until this completes. After that, the producer moves on to the next batch of data.

The acknowledge-value is a producer configuration parameter in Apache Kafka. Read more about Kafka acknowledging values in this article: What does In-Sync Replicas in Apache Kafka Really Mean?

What are min.insync.replicas?

Administrators and topic creators can manage the replication of messages at a global level or per topic. The replication factor and the number of minimum replicas can be defined to be in sync with the leader before any write queries occur.

If there are not enough in-sync replicas, then the broker will refuse to accept writes for that specific topic:partition. With that in mind, it's easy to understand why min.insync.replicas must be less or equal to the replication factor.

Whenever a topic:partition is written with insufficient in-sync replicas, the error message: 'Not enough in-sync replicas' will be received.

What is a replica not in sync?

There can be several reasons why a replica is not in sync. The simplest is that the broker holding the replica is down. When the broker is up again, it will try to catch up, and when it has caught up, it will be marked as in sync.

It is important to consider the Kafka cluster as a whole entity. Brokers can fail, but the cluster should remain available.

What is the perfect number to set for the replication factor?

There is no perfect number for the replication factor, but there are wrong numbers to set.

The first thing to consider is that the replication factor must be equal to, or smaller than, the available brokers. In other words, if you have only one broker, your replication factor cannot be higher than 1, or your topic:partition will not accept writes.

To guarantee high availability, the minimum replication factor must be 3. Are you surprised it isn't 2?

The simple reason is that with three brokers hosting a copy of the same data, one broker is down without causing any issues. If there are only two copies of the data, new writes can still be accepted because, if another broker fails, a copy of the data will survive.

A cluster can not accept new data if it only has a replication factor of 2 after losing one of the brokers. Setting a replication factor of 3 and min.insync.replicas of 2 guarantees that the clients can write data and that acknowledged writes guarantee persistence.

What about a data center flood where all brokers are in the same data center? As one can imagine, data will be unavailable. However, it might still recover the disks and their data when the level becomes normal again.

Additional considerations

  • Remember that a higher replication factor requires more disk space. Data replicates N times instead of only once, which is one of the prices to pay for HA.
  • To run in HA, the minimum number of brokers needed is three. The settings for the topics must have a minimum replication factor of 3 and minimum in-sync replicas of 2. Those are the 'sleep well' minimum settings. More brokers also help distribute the load over more instances, provided the number of topics is chosen properly.
  • If HA is important, consider increasing the number of replicas even more (but not necessarily increasing the min.insync.replicas to the same number!) min.insync.replicas can affect HA negatively if the replication factor is the same as the in-sync replicas when one single broker in the cluster is down.
    E.g., a cluster with three brokers and a min.insync.replicas of 3.; it is not possible to write to a topic, i.e., it is not available when one broker out of 3 is down. In that case, min.insync.replicas guarantees redundancy, but it can result in reduced availability.

Let's continue...

Get started with Apache Kafka

We offer fully managed Apache Kafka clusters with epic performance & superior support

Get a managed Apache Kafka server for FREE

CloudKarafka - Industry Leading Apache Kafka as a Service