Apache Kafka is a publish-subscribe messaging system. To understand what that means, we first need to describe what a messaging system is: a messaging system lets you send messages between processes, applications, and servers.
This article gives a brief understanding of messaging and defines the most important Kafka concepts. We will also show you how to set up your first Apache Kafka instance.
It includes step-by-step instructions that show how to set up a connection, how to publish messages to a topic, and how to subscribe to that topic.
Simply put, Apache Kafka is software where topics can be defined (think of a topic as a category). Applications can connect to this system and transfer messages onto a topic. A message can include any kind of information. It could, for example, carry information about an event that has happened on your website, or it could be a simple text message that is supposed to trigger an event. Another application can connect to the system and process messages from a topic.
A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. Producers are processes that publish data (push messages) into Kafka topics within the broker. A consumer pulls messages off a Kafka topic.
A topic is a category/feed name to which messages are stored and published. Messages are byte arrays that can store any object in any format. As said before, all Kafka messages are organized into topics. If you wish to send a message, you send it to a specific topic, and if you wish to read a message, you read it from a specific topic. Producer applications write data to topics and consumer applications read from topics. Messages published to the cluster stay in the cluster until a configurable retention period has passed. Kafka retains all messages for a set amount of time, and therefore consumers are responsible for keeping track of their own position (offset) in each topic.
Kafka topics are divided into a number of partitions, each of which contains messages in an immutable, ordered sequence. Each message in a partition is assigned a unique offset that identifies it. A topic can also have multiple partition logs, like the click-topic has in the image to the right. This allows multiple consumers to read from a topic in parallel.
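To make the offset idea concrete, here is a minimal in-memory sketch (plain Python, no Kafka client involved, and not the real storage format) of a topic whose partitions are append-only logs, where each message receives the next offset in its partition:

```python
class Topic:
    """Toy model of a Kafka topic: a list of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message; its offset is its position in the partition log."""
        log = self.partitions[partition]
        offset = len(log)
        log.append(message)
        return offset

clicks = Topic("click", num_partitions=3)
print(clicks.append(0, "page_view"))   # offset 0 in partition 0
print(clicks.append(0, "search"))      # offset 1 in partition 0
print(clicks.append(1, "upload"))      # offset 0 in partition 1
```

Note that offsets are only meaningful within a single partition: ordering is guaranteed per partition, not across the whole topic.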
In Kafka, replication is implemented at the partition level. The redundant unit of a topic partition is called a replica. Each partition usually has one or more replicas, meaning that partitions contain messages that are replicated over several Kafka brokers in the cluster. As we can see in the pictures, the click-topic is replicated to Kafka node 2 and Kafka node 3.
The producer can attach a key to a message to control which partition the message goes to. All messages with the same key will arrive at the same partition.
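The usual way a client maps a key to a partition is to hash the key and take the result modulo the partition count; the exact hash function varies by client (the Java client uses murmur2, for example), so crc32 below is just for illustration:

```python
import zlib

def partition_for_key(key: bytes, num_partitions: int) -> int:
    # Hash the key, then map the hash onto one of the partitions.
    # Real clients use their own hash function; crc32 is illustrative.
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition:
assert partition_for_key(b"user-42", 3) == partition_for_key(b"user-42", 3)
print(partition_for_key(b"user-42", 3))
```

This is why keyed messages preserve per-key ordering: every message for a given key goes through the same partition log.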
Partitions allow you to parallelise a topic by splitting the data in a particular topic across multiple brokers.
Every partition has one replica acting as the leader and the rest acting as followers. The leader handles all read and write requests for the partition, while the followers replicate the leader. If the leader server fails, one of the follower servers becomes the new leader by default. When a producer publishes a message to a partition in a topic, it is forwarded to the partition's leader. The leader appends the message to its commit log and increments its message offset. Kafka only exposes a message to a consumer after it has been committed, and each incoming message is appended to the partition's log on the cluster.
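The append-and-commit behaviour can be sketched with a toy model (this is a simplification for intuition, not Kafka's real replication protocol, which involves in-sync replica sets and a high watermark):

```python
class PartitionLeader:
    """Toy model: the leader appends to its log, and a message counts as
    committed once every follower has replicated it."""

    def __init__(self, num_followers):
        self.log = []
        self.followers = [[] for _ in range(num_followers)]

    def publish(self, message):
        offset = len(self.log)
        self.log.append(message)
        # Followers copy the leader's log (synchronously, in this toy model).
        for follower in self.followers:
            follower.append(message)
        return offset

    def committed(self):
        # A consumer only ever sees messages replicated to all followers.
        high_water = min(len(self.log), *(len(f) for f in self.followers))
        return self.log[:high_water]

leader = PartitionLeader(num_followers=2)
leader.publish("click")
print(leader.committed())  # ['click']
```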
Consumers can read messages starting from a specific offset and are allowed to read from any offset point they choose. This allows consumers to join the cluster at any point in time.
Consumers can join a group called a consumer group. A consumer group is the set of consumer processes subscribing to a specific topic. Each consumer in the group is assigned a set of partitions to consume from, so each consumer receives messages from a different subset of the partitions in the topic. Kafka guarantees that a message is only read by a single consumer in the group.
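A simple round-robin assignment of partitions to the consumers in a group could look like the sketch below (Kafka's real assignment strategies are configurable and handled by the broker/client for you; this only illustrates the "each partition goes to exactly one consumer" rule):

```python
def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

print(assign_partitions([0, 1, 2], ["consumer-a", "consumer-b"]))
# {'consumer-a': [0, 2], 'consumer-b': [1]}
```

With two consumers and three partitions, one consumer ends up reading two partitions; with four consumers, one consumer would get no partition at all.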
Consumers pull messages from topic partitions. Different consumers can be responsible for different partitions. Kafka can support a large number of consumers and retain large amounts of data with very little overhead. By using consumer groups, consumers can be parallelised so that multiple consumers read from multiple partitions of a topic, allowing very high message-processing throughput. The number of partitions limits the maximum parallelism of a consumer group, since you cannot have more active consumers than partitions (any extra consumers will sit idle).
Data/messages are never pushed out to consumers; a consumer asks for messages when it is ready to handle them. Consumers will never overload themselves with data or lose any data, since all messages are queued up in Kafka. If a consumer falls behind while processing messages, it can eventually catch up and get back to handling data in real time.
According to the Kafka documentation, the original use case for Kafka was to track website activity, including page views, searches, uploads, and other actions users may take. This kind of activity tracking often requires a very high volume of throughput, since messages are generated for each user action.
In this tutorial, we follow a scenario where we have a simple website. Users can click around, sign in, write blog articles, upload images to articles, and publish those articles. When an event happens in the blog (e.g. when someone logs in, presses a button, or uploads an image to an article), a tracking event and information about the event will be placed into a message, and the message will be placed on a specified Kafka topic. We will have one topic named "click" and one named "upload".
We will set up partitioning based on the user's id. A user with id 0 will map to partition 0, a user with id 1 to partition 1, and so on. Our "click" topic will be split into three partitions (three users) on two different machines.
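For this scenario, mapping a user's id to one of the three partitions is a simple modulo (the helper name below is hypothetical, just matching the scenario described above):

```python
def partition_for_user(user_id: int, num_partitions: int = 3) -> int:
    # All events for the same user end up in the same partition,
    # which preserves per-user ordering of events.
    return user_id % num_partitions

print(partition_for_user(0))  # 0
print(partition_for_user(1))  # 1
print(partition_for_user(4))  # 1 -- user 4 shares partition 1 with user 1
```

With more users than partitions, several users simply share a partition, as the last line shows.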
A lot of good use cases and information can be found in the documentation for Apache Kafka.
Kafka works well as a replacement for more traditional message brokers, like RabbitMQ. Messaging decouples your processes and creates a highly scalable system. Instead of building one large application, it is beneficial to decouple the different parts of your application and have them communicate asynchronously with messages. That way, different parts of your application can evolve independently, be written in different languages, and/or be maintained by separate developer teams. Compared to many messaging systems, Kafka has better throughput, and its built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message-processing applications.
A lot of people today use Kafka as part of a log solution that collects physical log files from servers and puts them in a central place for processing. With Kafka, you can publish an event for everything happening in your application; other parts of the system can subscribe to these events and take appropriate actions.
Here are the important concepts that you need to remember before we dig deeper into Apache Kafka, each explained in one line.
To be able to follow this guide you need to set up a CloudKarafka instance, or you need to download and install Apache Kafka and Zookeeper yourself. CloudKarafka automates every part of the setup: it provides a hosted Kafka solution, meaning that all you need to do is sign up for an account and create an instance. You do not need to set up and install Kafka or deal with cluster handling; CloudKarafka does that for you. CloudKarafka can be used for free with the Developer Duck plan. Go to the plan page, sign up, and create an instance.
When your instance is created, click Details for your instance. Before you start coding you need to ensure that you can set up a secure connection: you either need to download certificates or set up VPC peering to your AWS VPC. This tutorial shows how to get started with the free Developer Duck instance, so that everyone can complete this guide. If you are going to set up a dedicated instance, we recommend you have a look here.
To get started with your free instance, you need to download the certificates (connection environment variables) for the instance. You can find the download button on the instance overview page; it is named Certs, as in the picture above. Press the button and save the given .env file into your project. The file contains the environment variables that you need to use in your project.
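If you prefer not to pull in a dotenv library, loading such a .env file into environment variables can be sketched like this. Treat it as a simplification: it ignores quoting and `export` prefixes, and the variable name in the demo (`CLOUDKARAFKA_BROKERS`) is illustrative, not necessarily what your downloaded file contains:

```python
import os
import tempfile

def load_env_file(path):
    """Read KEY=VALUE lines from a .env file into os.environ.
    Skips blank lines and comments; does not handle quoting or 'export'."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Demonstrate with a throwaway file; in real use, pass your downloaded .env path.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("# connection settings\nCLOUDKARAFKA_BROKERS=host1:9094\n")
    demo_path = f.name
load_env_file(demo_path)
print(os.environ["CLOUDKARAFKA_BROKERS"])  # host1:9094
```

Your Kafka client can then read these values from the environment when building its connection configuration.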
You can now open the Topic view to get an overview of, and set up, the Kafka topics on your server. You are free to decide the number of partitions and replicas, the retention size in bytes, and the retention time in ms. I have created two topics, as in the picture below: vpyo-click and vpyo-upload. The first four letters are a prefix that identifies your specific topics, since you are on a free shared server and other users might create a topic with the same name.
To communicate with Apache Kafka you need a library that understands it: download the client library for the programming language that you intend to use for your applications. A client library is an application programming interface (API) for use in writing client applications. It has several methods that can be used, in this case, to communicate with Apache Kafka, for example when you connect to the Kafka broker (using the given parameters, such as the host name) or when you publish a message to a topic. Both consumers and producers can be written in any language that has a Kafka client written for it.
Sample code will be given in Part 2, starting with Part 2.1 - Ruby, followed by Part 2.2 - Java and Part 2.3 - Python. It is possible to use different programming languages in different parts of the system. The publisher could, for example, be written in Node.js and the subscriber in Python.
Hope this article helped you gain some understanding about Apache Kafka!
Enjoy the service and contact us if you have any questions or feedback!