Rolling restart of Apache Kafka

Written by Magnus Landerblom

All restarts of Apache Kafka are now performed one broker at a time, with a health check between restarts. This minimizes downtime and keeps your business going during a full cluster restart.

A restart of your Apache Kafka cluster is required for some operations, such as configuration updates, version upgrades, or cluster maintenance. You can now do all of these while keeping the cluster up and running. A rolling restart means that only one Kafka broker is restarted at a time. The rolling restart does not proceed to the next broker until the previous one has started again and is in sync with the cluster. This keeps your cluster online throughout, with no messages lost.

A rolling restart of your Kafka cluster can be triggered via the CloudKarafka control panel.

Your cluster is restarted by you, the customer, when you press the restart button. A rolling restart is also performed during Kafka version upgrades and when you enable or disable the metrics plugin.

Performing a rolling restart

  1. Make sure that the cluster is in a healthy state. Before the rolling restart is started, confirm that no broker has under-replicated partitions.
  2. Restart one broker (do not start with the broker that is the controller). Begin the rolling restart by restarting one broker. The broker that is the active controller is restarted last to minimize controller movement in the cluster.
  3. Ensure that the restarted broker has zero under-replicated partitions. When the broker is back up, wait until it has zero under-replicated partitions.
  4. Restart all brokers in the cluster by repeating steps 2 and 3. Finally, restart the broker that is the controller and wait for its under-replicated partition count to drop to zero.

Please note that all these actions are handled by CloudKarafka automatically when you press the restart button in the control panel.
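The four steps above can be sketched in a few lines of Python. This is only an illustration of the procedure, not CloudKarafka's implementation: `restart_broker` and `under_replicated` are hypothetical callables that you would back with your own tooling, and `under_replicated` is assumed to return the number of under-replicated partitions currently reported for the cluster.

```python
import time
from typing import Callable, Iterable, List


def restart_order(broker_ids: Iterable[int], controller_id: int) -> List[int]:
    """Return the brokers in restart order, with the active controller last."""
    ids = list(broker_ids)
    if controller_id not in ids:
        raise ValueError("controller_id must be one of broker_ids")
    others = sorted(b for b in ids if b != controller_id)
    return others + [controller_id]


def wait_until_in_sync(
    under_replicated: Callable[[], int],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> None:
    """Poll until zero under-replicated partitions are reported."""
    deadline = time.monotonic() + timeout_seconds
    while under_replicated() > 0:
        if time.monotonic() >= deadline:
            raise TimeoutError("cluster still has under-replicated partitions")
        time.sleep(poll_seconds)


def rolling_restart(
    broker_ids: Iterable[int],
    controller_id: int,
    restart_broker: Callable[[int], None],   # hypothetical: restarts one broker
    under_replicated: Callable[[], int],     # hypothetical: current URP count
    poll_seconds: float = 5.0,
) -> List[int]:
    """Restart brokers one at a time and return the order that was used."""
    # Step 1: only start from a healthy cluster.
    if under_replicated() > 0:
        raise RuntimeError("cluster is not healthy; aborting rolling restart")
    order = restart_order(broker_ids, controller_id)
    for broker_id in order:
        # Step 2: restart a single broker (the controller comes last).
        restart_broker(broker_id)
        # Step 3: wait for replication to catch up before moving on.
        wait_until_in_sync(under_replicated, poll_seconds)
    return order
```

Passing the two callables in keeps the ordering and waiting logic independent of how brokers are actually restarted or monitored in a given environment.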

Restart the controller last

Any broker in a cluster can act as the controller, but there is only one active controller at a time. When the broker acting as the controller is restarted, the controller role has to be transferred to another broker, along with the state it manages. This might require some time and resources from your cluster. Restarting the broker that acts as the controller last minimizes controller movement and gives a more efficient restart.

Things to think about before performing a rolling restart

Some cases might affect the systems that integrate with Apache Kafka, even if a by-the-book rolling restart has been performed.

Topics with replication set to one

Partitions that live on only one broker will be offline while that broker restarts, even during a rolling restart, since no other broker holds a copy of the partition.
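To see which partitions are affected, you can look for partitions whose only replica sits on the broker about to be restarted. The sketch below assumes you have already collected the replica assignment into a plain dictionary (topic name, then partition number, then the list of broker ids holding a replica); how you obtain that data, and the topic names used in the example, are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# topic name -> partition number -> broker ids that hold a replica
ReplicaMap = Dict[str, Dict[int, List[int]]]


def partitions_offline_during_restart(
    replicas: ReplicaMap, broker_id: int
) -> List[Tuple[str, int]]:
    """List (topic, partition) pairs whose only replica is on broker_id."""
    return sorted(
        (topic, partition)
        for topic, partitions in replicas.items()
        for partition, brokers in partitions.items()
        # A partition goes offline only if no other broker holds a copy.
        if set(brokers) == {broker_id}
    )


# Hypothetical layout: "orders" is unreplicated, "events" has three replicas.
layout = {"orders": {0: [1], 1: [2]}, "events": {0: [1, 2, 3]}}
print(partitions_offline_during_restart(layout, 1))
```

An empty result for every broker means that no partition depends on a single broker.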

Configured min.insync.replicas

min.insync.replicas decides how many brokers must acknowledge a message when a producer sends it with acks=all. It increases the durability of your data: once the producer receives an acknowledgement, you can be certain that the data is stored on the configured number of brokers. However, this setting might completely stop your service during a rolling restart. If you have a cluster with three nodes and you set min.insync.replicas to three, your producers will require acks from three brokers, which is not possible while one broker is down for its restart. You will not be able to produce any messages until the restart has finished. In these cases, a regular restart (where all brokers are taken down at the same time) is faster than a rolling restart.
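The arithmetic behind this can be checked before a restart. The helper below is a minimal sketch with a hypothetical function name; it assumes the worst case, where every broker that is down holds a replica of the partition and all remaining replicas are in sync.

```python
def can_produce_with_acks_all(
    replication_factor: int,
    min_insync_replicas: int,
    brokers_down: int = 1,
) -> bool:
    """Return True if acks=all writes can still succeed with brokers down."""
    # Worst case: each broker that is down removes one in-sync replica.
    in_sync = max(replication_factor - brokers_down, 0)
    return in_sync >= min_insync_replicas


# Three replicas and min.insync.replicas=3: writes block during the restart.
print(can_produce_with_acks_all(3, 3))
# Three replicas and min.insync.replicas=2: one broker can be down safely.
print(can_produce_with_acks_all(3, 2))
```

Keeping min.insync.replicas at least one below the replication factor leaves room for one broker to be restarted at a time.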
