Part 1: Apache Kafka for beginners - What is Apache Kafka?

Written by Lovisa Johansson

The first part of Apache Kafka for beginners explains what Kafka is - a publish-subscribe messaging system that exchanges data between processes, applications, and servers. It will give you a brief understanding of messaging and distributed logs, and important concepts will be defined. The article will explain the steps to go through when setting up a connection and how you can subscribe to messages from topics.

Publish-subscribe messaging system

Apache Kafka is a publish-subscribe messaging system. To understand that, we first need to describe what a messaging system is: a messaging system lets you send messages between processes, applications, and servers.

Kafka message queue

Simply put, Apache Kafka is software in which topics can be defined (think of a topic as a category).

TABLE OF CONTENTS


  1. Apache Kafka for beginners part 1 - What is Apache Kafka?

    Gives a brief introduction to messaging and defines important Kafka concepts. We will also show you how to set up your first Apache Kafka instance.

  2. Apache Kafka step-by-step coding instructions

    Step-by-step instructions that show how to set up a connection, how to publish to a topic, and how to subscribe to a topic.

Applications may connect to this system and transfer messages onto a topic. A message can include any kind of information; it could, for example, carry information about an event that has happened on your website, or it could be a simple text message meant to trigger an event. Another application may connect to the system and process messages from a topic.

Kafka Broker

A Kafka cluster consists of one or more servers (Kafka brokers), which are running Kafka. Producers are processes that publish data (push messages) into Kafka topics within the broker. A consumer of topics pulls messages off a Kafka topic.

Apache Kafka Getting Started
Apache Kafka Consumer Producer Broker

Kafka Topic

A topic is a category/feed name to which messages are stored and published. Messages are byte arrays that can store any object in any format. As said before, all Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic, and if you wish to read a message you read it from a specific topic. Producer applications write data to topics and consumer applications read from topics. Messages published to the cluster will stay in the cluster until a configurable retention period has passed. Kafka retains all messages for a set amount of time, and therefore, consumers are responsible for tracking their own position (offset) in each topic.
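As a sketch of how per-topic retention can be configured, here is how a topic could be created with the admin API of the confluent-kafka Python client. The broker address, topic name, partition count, and seven-day retention are placeholder assumptions for this example.

```python
# A minimal sketch: create a topic with a retention period, using the
# confluent-kafka Python client. Broker address and topic name are
# placeholders - adjust them to your own cluster.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Retain messages for 7 days (retention.ms is a per-topic setting).
topic = NewTopic(
    "click",
    num_partitions=3,
    replication_factor=1,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)

# create_topics() is asynchronous; it returns one future per topic.
futures = admin.create_topics([topic])
for name, future in futures.items():
    try:
        future.result()  # block until the topic is created (or fails)
        print(f"Topic {name} created")
    except Exception as err:
        print(f"Failed to create topic {name}: {err}")
```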

Apache Kafka Topic

Kafka topic partition

Kafka topics are divided into a number of partitions, each of which contains messages in an immutable sequence. Each message in a partition is assigned a unique offset that identifies it. A topic can also have multiple partition logs, as the click-topic has in the image below. This allows multiple consumers to read from a topic in parallel.

In Kafka, replication is implemented at the partition level. The redundant unit of a topic partition is called a replica. Each partition usually has one or more replicas, meaning that partitions contain messages that are replicated over several Kafka brokers in the cluster. As we can see in the picture, the click-topic is replicated to Kafka node 2 and Kafka node 3.

Apache Kafka Partition

It's possible for the producer to attach a key to a message, which determines the partition the message goes to. All messages with the same key will arrive at the same partition.

Partitions allow you to parallelize a topic by splitting the data in a particular topic across multiple brokers.
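As a sketch of keyed publishing with the confluent-kafka Python client (broker address and topic name are placeholders): both messages below share the key "user-0", so Kafka routes them to the same partition, preserving their order.

```python
# A minimal sketch of publishing keyed messages with confluent-kafka.
# Messages with the same key always land in the same partition.
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Called once the broker has acknowledged (or rejected) the message.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to partition {msg.partition()} at offset {msg.offset()}")

# Both messages share key "user-0", so they go to the same partition.
producer.produce("click", key="user-0", value="page_view", callback=on_delivery)
producer.produce("click", key="user-0", value="button_click", callback=on_delivery)

producer.flush()  # wait for all outstanding messages to be delivered
```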

For every partition, one replica acts as the leader and the rest act as followers. The leader handles all read and write requests for the partition, while the followers replicate it. If the leader fails, one of the followers becomes the new leader by default. When a producer publishes a message to a partition in a topic, it is forwarded to the partition's leader. The leader appends the message to its commit log and increments its message offset. Kafka only exposes a message to consumers after it has been committed, and every message that comes in is appended to the log on the cluster.

Consumers and consumer groups

Consumers can read messages starting from a specific offset and are allowed to read from any offset they choose. This allows consumers to join the cluster at any point in time.
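As a sketch of offset-based reading with the confluent-kafka Python client: instead of letting the group coordinator assign partitions, the consumer below assigns itself partition 0 of the "click" topic and starts at offset 5. The broker address, topic, and offset are placeholders.

```python
# A minimal sketch of reading from an explicit offset with confluent-kafka.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "offset-demo",  # still required, even with manual assignment
})

# Start reading partition 0 of the "click" topic from offset 5.
consumer.assign([TopicPartition("click", 0, 5)])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(f"offset={msg.offset()} value={msg.value().decode()}")

consumer.close()
```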

Apache Kafka Consumer

Consumers can join a group called a consumer group. A consumer group includes the set of consumer processes that subscribe to a specific topic. Each consumer in the group is assigned a set of partitions to consume from, so each receives messages from a different subset of the partitions in the topic. Kafka guarantees that a message is only read by a single consumer in the group.

Consumers pull messages from topic partitions. Different consumers can be responsible for different partitions. Kafka can support a large number of consumers and retain large amounts of data with very little overhead. By using consumer groups, consumers can be parallelized so that multiple consumers read from multiple partitions of a topic, allowing very high message-processing throughput. The number of partitions determines the maximum consumer parallelism, since you cannot have more active consumers than partitions.

Data/messages are never pushed out to consumers; a consumer asks for messages when it is ready to handle them. Consumers will never overload themselves with lots of data or lose any data, since all messages are queued up in Kafka. If a consumer falls behind while processing messages, it can eventually catch up and get back to handling data in real time.
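Below is a minimal sketch of a consumer-group member using the confluent-kafka Python client. The broker address, group id, and topic name are placeholders; running several copies of this script with the same group.id spreads the topic's partitions across them, and each message is delivered to only one member.

```python
# A minimal sketch of a consumer-group member with confluent-kafka.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "click-processors",
    "auto.offset.reset": "earliest",  # where to start with no committed offset
})
consumer.subscribe(["click"])  # partitions are assigned by the group coordinator

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # pull model: ask when ready
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        print(f"partition={msg.partition()} offset={msg.offset()} "
              f"value={msg.value().decode()}")
finally:
    consumer.close()  # commits final offsets and leaves the group
```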

Apache Kafka Example in this tutorial - Website activity tracking

According to the creators of Apache Kafka, the original use case for Kafka was to track website activity - including page views, searches, uploads, and other actions users may take. This kind of activity tracking often requires very high throughput, since messages are generated for each user action.

In this tutorial, we follow a scenario where we have a simple website. Users can click around, sign in, write blog articles, upload images to articles, and publish those articles. When an event happens in the blog (e.g., when someone logs in, presses a button, or uploads an image to an article), a tracking event and information about the event will be placed into a message, and the message will be placed on a specified Kafka topic. We will have one topic named "click" and one named "upload".

We will set up partitioning based on the user's id: a user with id 0 maps to partition 0, a user with id 1 to partition 1, and so on. Our "click" topic will be split into three partitions (three users) on two different machines.

Apache Kafka Web Tracking

  1. A user with user-id 0 clicks on a button on the website.
  2. The web application publishes a message to partition 0 in topic "click".
  3. The message is appended to the partition's commit log and the message offset is incremented.
  4. The consumer can pull messages from the click-topic and display usage monitoring in real time, or it can replay previously consumed messages by setting the offset to an earlier one.
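To make steps 1-2 concrete, here is a minimal sketch using the confluent-kafka Python client that routes each event to the partition derived from the user id. The broker address, topic name, and partition count are assumptions taken from this example scenario.

```python
# A minimal sketch of the tutorial's partitioning scheme: route each event
# to the partition derived from the user id (user 0 -> partition 0, etc.).
# The explicit partition= argument overrides key-based assignment.
from confluent_kafka import Producer

NUM_PARTITIONS = 3  # the "click" topic in this example has three partitions

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_click(user_id: int, event: str) -> None:
    # Map the user id onto one of the topic's partitions.
    partition = user_id % NUM_PARTITIONS
    producer.produce("click", value=event, partition=partition)

publish_click(0, "button_click")  # goes to partition 0
publish_click(1, "page_view")     # goes to partition 1

producer.flush()
```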

Other use cases

A lot of good use cases and information can be found in the documentation for Apache Kafka.

Message queue

Kafka works well as a replacement for more traditional message brokers, like RabbitMQ. Messaging decouples your processes and creates a highly scalable system.

Apache Kafka illustration

Instead of building one large application, it is beneficial to decouple the different parts of your application and let them communicate asynchronously with messages. That way, different parts of your application can evolve independently, be written in different languages, and/or be maintained by separate developer teams. In comparison to many messaging systems, Kafka has better throughput. Its built-in partitioning, replication, and fault tolerance make it a good solution for large-scale message processing applications.

Event streams, tracking and logging

A lot of people today use Kafka as a log solution that collects physical log files from servers and puts them in a central place for processing. With Kafka, you can publish an event for everything happening in your application. Other parts of your system can subscribe to these events and take appropriate actions.

Apache Kafka and server concepts

Here are important concepts that you need to remember before we dig deeper into Apache Kafka, each explained in one line.

  • Producer: Application that sends the messages.
  • Consumer: Application that receives the messages.
  • Message: Information that is sent from the producer to a consumer through Apache Kafka.
  • Connection: A connection is a TCP connection between your application and the Kafka broker.
  • Topic: A Topic is a category/feed name to which messages are stored and published.
  • Topic partition: Kafka topics are divided into a number of partitions, which allows you to split data across multiple brokers.
  • Replica: A replica of a partition is a "backup" of that partition. Replicas never serve client reads or writes; they are used to prevent data loss.
  • Consumer Group: A consumer group includes the set of consumer processes that are subscribing to a specific topic.
  • Offset: The offset is a unique identifier of a record within a partition. It denotes the position of the consumer in the partition.
  • Node: A node is a single computer in the Apache Kafka cluster.
  • Cluster: A cluster is a group of nodes i.e., a group of computers.

Set up an Apache Kafka instance

To be able to follow this guide you need to set up a CloudKarafka instance, or you need to download and install Apache Kafka and Zookeeper. CloudKarafka automates every part of the setup: it provides a hosted Kafka solution, meaning that all you need to do is sign up for an account and create an instance. You do not need to set up and install Kafka or worry about cluster handling; CloudKarafka does that for you. CloudKarafka can be used for free with the plan Developer Duck. Go to the plan page, sign up for a plan, and create an instance.

When your instance is created, click on details for your instance. Before you start coding, you need to ensure that you can set up a secure connection. You can either download certificates, use SASL/SCRAM, or set up VPC peering to your AWS VPC. This tutorial shows how to get started with the free instance, Developer Duck, since everyone should be able to complete this guide. If you are going to set up a dedicated instance, we recommend having a look here.

Apache Kafka Instances
Apache Kafka Free Plan

Get started on the free Apache Kafka plan

To get started with your free instance, you need to download the certificates (connection environment variables) for the instance. You can find the download button on the instance overview page; it is named Certs, as in the picture above. Press the button and save the given .env file in your project. The file contains environment variables that you need to use in your project.

You can also authenticate using SASL/SCRAM. When using SASL/SCRAM you only need to locate the username and password on the "Details" page and use them in your code.
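As a sketch, connecting with SASL/SCRAM from Python with the confluent-kafka client might look like the following. The CLOUDKARAFKA_* environment variable names follow the downloaded .env file, and SCRAM-SHA-256 is assumed as the mechanism; verify both against your instance's "Details" page.

```python
# A minimal sketch of connecting with SASL/SCRAM over TLS, reading the
# credentials from environment variables. Variable names and the SCRAM
# mechanism are assumptions to verify against your own instance.
import os
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": os.environ["CLOUDKARAFKA_BROKERS"],
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "SCRAM-SHA-256",
    "sasl.username": os.environ["CLOUDKARAFKA_USERNAME"],
    "sasl.password": os.environ["CLOUDKARAFKA_PASSWORD"],
})

# On the shared plan, topic names carry your account prefix (e.g. "vpyo-").
topic = os.environ["CLOUDKARAFKA_TOPIC_PREFIX"] + "click"
producer.produce(topic, value="hello from SASL/SCRAM")
producer.flush()
```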

You can now start by opening the Topic view to get an overview and set up your Kafka topics on your server. You are free to decide the number of partitions and replicas, the retention size in bytes, and the retention time in ms. I have created two topics as in the picture below, vpyo-click and vpyo-upload. The first four letters distinguish your specific topics, since you are on a free shared server and other users might otherwise create a topic with the same name.

Apache Kafka Topic

Publish and subscribe messages

To be able to communicate with Apache Kafka you need a library that understands Apache Kafka. You need to download the client library for the programming language that you intend to use for your applications. A client library is an application programming interface (API) for use in writing client applications. It has several methods that can be used, in this case, to communicate with Apache Kafka - for example, to connect to the Kafka broker (using given parameters such as the host name) or to publish a message to a topic. Both consumers and producers can be written in any language that has a Kafka client written for it.

Steps to follow when setting up a connection and publishing/consuming a message:

  1. First of all, we need to set up a secure connection. A TCP connection will be set up between the application and Apache Kafka.
  2. In publisher: Publish a message to a partition on a topic.
  3. In subscriber/consumer: Consume a message from a partition in a topic.
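Ahead of the full walkthroughs in part 2, here is a compact sketch of these three steps with the confluent-kafka Python client, assuming a local unauthenticated broker and a topic named "click" (both placeholders).

```python
# A compact end-to-end sketch of the three steps above.
from confluent_kafka import Consumer, Producer

BROKERS = "localhost:9092"  # placeholder; use your own broker list

# Steps 1-2: connect and publish a message to the "click" topic.
producer = Producer({"bootstrap.servers": BROKERS})
producer.produce("click", value="button_click")
producer.flush()

# Steps 1 and 3: connect and consume the message back.
consumer = Consumer({
    "bootstrap.servers": BROKERS,
    "group.id": "demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["click"])

msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(f"Received: {msg.value().decode()}")
consumer.close()
```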

Sample code

Sample code will be given in part 2, starting with Part 2.1 - Ruby, followed by Part 2.2 - Java, and Part 2.3 - Python. It is possible to use different programming languages on different parts of the system. The publisher could, for example, be written in Node.js and the subscriber in Python.

Hope this article helped you gain some understanding about Apache Kafka! Enjoy the service and contact us if you have any questions or feedback!

Let's continue...

