
Zookeeper and Kafka: story of a friendship

Written by Fabio Pardi

The Kafka ecosystem is changing and Zookeeper is slowly getting out of the picture. This article will give you an overview of the reasons behind this exciting revolution. What role will Zookeeper play in the Kafka Ecosystem in the future?

Since Kafka's inception in 2012, Zookeeper has been mentioned alongside it over and over. Kafka and Zookeeper are two open source projects provided by the Apache Software Foundation, and over the years they have walked together like two good friends.

Over time, some people may not have been aware of the differences between Kafka and Zookeeper, or of the important role Zookeeper has played for Kafka. 'Just install them both, they are needed for the cluster to work,' I once heard. It is not even uncommon to see Kafka and Zookeeper installed on the very same virtual machine, which is handy, but not the optimal architecture.

While Kafka takes care of the data we push to it, Zookeeper operates under the hood and makes sure that everything works 'the right way'. As already covered in greater detail in one of our past articles, Zookeeper is used by Kafka to store the cluster's metadata, such as the full list of active brokers, topic names, the location of partitions and the in-sync replicas (ISR), and it is also used to elect a broker to be the cluster Controller. As a matter of fact, according to its creators, Zookeeper is able to do much more: it is a system made to maintain "configuration information, naming, providing distributed synchronisation, and providing group services" ¹
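To make that friendship concrete, here is a minimal sketch, using the official ZooKeeper Java client, that lists the znodes Kafka traditionally keeps under /brokers/ids, one ephemeral znode per live broker. The connection string localhost:2181 and the class name are assumptions for a local setup:

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListKafkaBrokers {
    public static void main(String[] args) throws Exception {
        // Assumes a Zookeeper ensemble reachable at localhost:2181; adjust as needed.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Kafka registers one ephemeral znode per live broker under /brokers/ids.
        List<String> brokerIds = zk.getChildren("/brokers/ids", false);
        for (String id : brokerIds) {
            // Each znode holds a small JSON payload with the broker's host, port, etc.
            byte[] data = zk.getData("/brokers/ids/" + id, false, null);
            System.out.println("Broker " + id + ": " + new String(data));
        }
        zk.close();
    }
}
```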

Zookeeper does all the above in a distributed way, providing high availability (HA). Many important projects have embraced it over the years, ClickHouse and Hadoop to name just two.

How Zookeeper fits into the Kafka ecosystem in 2023

But over the years, as sometimes happens between good friends, the relationship between Kafka and Zookeeper changed; Kafka realised that it was sometimes a burden to work with Zookeeper.

For instance, topic deletions were slow: the more partitions a cluster hosted, the more painful they became. Brokers were also not informed immediately about some changes, such as when the in-sync replica (ISR) topology changed. ²

All the above problems arose from the simple fact that the metadata was not hosted on Kafka itself but on a separate system, which under certain circumstances delayed some operations terribly; performance deteriorated drastically as the latency between Kafka and Zookeeper increased.

If only Kafka could host the metadata itself, it would not need to refresh the Controller state from Zookeeper, and there would be more resources to allocate to Kafka (although Zookeeper's CPU, RAM and disk usage is usually modest, it still counts). Kafka could then work without external dependencies, with the metadata stored in the brokers themselves. And when a broker restarts, it would only need to pick up the changes generated while it was offline, just as happens for any other topic.

Besides the above great advantages, there are other reasons for simplifying the architecture: fewer things for cluster administrators to learn, for instance, or the possibility of having a single Kafka node able to work independently (please note, this is not recommended in production!).
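As an illustration of that last point, here is a minimal sketch of a server.properties for such a standalone node in the Zookeeper-less mode described in the next section (KRaft), where a single process acts as both broker and controller. The node id, ports and log directory are assumptions for a local test:

```
# Minimal sketch: one node acting as both broker and controller (KRaft mode).
# Ids, ports and paths are assumptions for a local test setup.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
log.dirs=/tmp/kraft-combined-logs
```

Before the first start, the log directory also has to be formatted with a cluster id, which the kafka-storage.sh tool shipped with Kafka takes care of.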

Therefore, KIP-500 was created in 2019 (KIP = Kafka Improvement Proposal).

KIP-500: a real revolution!

Time went by, and the developers and everyone around them worked hard to make the proposal come true. Following KIP-500, Zookeeper is replaced by Controller nodes, which act as processes separate from the Kafka broker, inside a dedicated JVM.

Controller nodes store everything that used to be in Zookeeper. Indeed, they store that data in a topic, replicated N times. The Kafka way! Here N is the number of controllers, an odd number, as it used to be for Zookeeper in order to obtain a quorum. A variant of the Raft algorithm (called KRaft: Kafka Raft) is used to elect a leader among the controllers (called the 'Active Controller'), and it requires a majority of nodes (e.g. in a 3-controller setup, only 1 node can afford to be unreachable while the service is still guaranteed).
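For illustration, one of the three dedicated controllers in such a setup could be configured along these lines; the hostnames (ctrl1 to ctrl3), ports and paths are placeholders, and the other two nodes would use node.id=2 and node.id=3:

```
# Sketch: configuration for one of three dedicated controller nodes.
# Hostnames, ports and paths are placeholders for an example cluster.
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
listeners=CONTROLLER://ctrl1:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT
log.dirs=/var/lib/kafka/metadata
```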

If you want to learn about Raft in greater detail, there is a wonderfully easy-to-follow video made by one of its creators: "Designing for Understandability: The Raft Consensus Algorithm"

Or, if you are more into books than videos, you can read one of the most read dissertations in Computer Science history: Mr Ongaro's ‘Consensus: Bridging Theory and Practice’

KRaft is a dialect of Raft and differs slightly from it, leveraging Kafka's way of working: for instance, changes are pulled from the Active Controller instead of being pushed by it. With this new architecture the nodes periodically pull from the Active Controller, and this mechanism also doubles as a heartbeat to keep track of which nodes are alive.
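Since Kafka 3.3, the state of this quorum can be inspected through the Admin API (the kafka-metadata-quorum.sh tool uses the same mechanism). A minimal sketch, assuming a KRaft cluster with a broker reachable on localhost:9092:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.QuorumInfo;

public class DescribeQuorum {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a KRaft cluster with a broker listening on localhost:9092.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("Active Controller (leader) id: " + quorum.leaderId());
            for (QuorumInfo.ReplicaState voter : quorum.voters()) {
                // A voter lagging far behind the leader's log end offset may be unhealthy.
                System.out.println("Voter " + voter.replicaId()
                        + " at log end offset " + voter.logEndOffset());
            }
        }
    }
}
```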

Similarly to Raft, KRaft is plain easy to understand because it has been designed to be simple. That, in our opinion, is a lesson for everybody: follow the KISS principle! In KRaft, controller data is stored on disk, and controller elections are much faster because the newly elected Controller does not have to load the whole cluster state anew: it is already in sync and ready to take over. That is simply because the standby controllers are just the other nodes participating in the quorum.

Similarly, when a topic is created or deleted, the Controller no longer needs to reload the whole list of topics from Zookeeper as it used to: a new entry is simply appended to the metadata partition.

All the above is a great revolution for Kafka: it simplifies the architecture and speeds up operations! New Kafka clusters can be configured to use the new setup, while old clusters running with Zookeeper need to be migrated. The former are ready to go into production without Zookeeper, while legacy setups are required to upgrade to a ‘bridge release’ first.

The ‘bridge release’ is a very interesting concept: it allows administrators to upgrade to a version that can run in the legacy mode (with Zookeeper) or in the new mode. Once the switchover is performed, it becomes possible to get rid of Zookeeper completely. But be aware! At the time of writing, the bridge release is not yet ready for production.
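To give a flavour of what the switchover involves, here is a sketch of the extra broker-side settings that enable the migration introduced in Kafka 3.4 (KIP-866). The Zookeeper connection string and quorum addresses are placeholders for an example cluster:

```
# Sketch: Kafka 3.4 migration settings on an existing Zookeeper-backed broker.
# Hostnames and ports below are placeholders for an example cluster.
# (Additional settings, e.g. inter.broker.protocol.version, may also be required.)
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.metadata.migration.enable=true
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
```

Once the migration has completed and the brokers have been restarted in KRaft mode, the Zookeeper settings can be dropped entirely.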

Exciting times for all the Kafka lovers! Here is a brief timeline:


April 2021 - Kafka 2.8: KRaft mode released in early access.

September 2021 - Kafka 3.0: KRaft in preview.

October 2022 - Kafka 3.3: this release marks KRaft mode as production ready, for new clusters only.

February 2023 - Kafka 3.4: with this version it is possible to migrate Kafka clusters from Zookeeper to KRaft mode with no downtime; the migration is only for testing and not yet production ready.

Going forward

2023/04: Kafka 3.5 released with both KRaft and ZK support. Upgrades from ZK will be production ready. ZooKeeper mode deprecated.
2023: Additional 3.x releases

2024/xx: Kafka 4.0 released with only KRaft mode supported.

KIP-500 and CloudKarafka

At this stage, CloudKarafka primarily relies on ZooKeeper due to its tested stability and reliability. Even though KIP-500 offers an appealing alternative by eliminating the need for ZooKeeper in Kafka, it is important to remember that this approach is relatively new and not yet as well supported or tested in production environments as ZooKeeper.

However, we do support version 3.4.0, also known as the 'bridge' release, which allows for running Kafka without ZooKeeper.

At CloudKarafka, we are closely monitoring the development of KIP-500 and remain hopeful that we will soon be able to offer a stable and reliable product without the dependency on ZooKeeper.

We hope you enjoyed this article!

About CloudKarafka

CloudKarafka is a trusted Apache Kafka hosting service from 84codes, a Swedish tech company dedicated to simplifying cloud infrastructure for developers. If you have any queries or problems, our support team is on hand 24/7 to help you. Just send an email to support@cloudkarafka.com.

All the best!