
Kafka Consumer Groups: From Offsets to Rebalances
Introduction: The necessity of consumer groups
The problem with individual consumers
Imagine you have a single application that needs to read and process a massive, continuous stream of data from a Kafka topic. As the volume of data grows, this single application can no longer keep up. It becomes a bottleneck and falls further and further behind, a problem known as consumer lag. This not only slows down processing but also creates a single point of failure: if this one application fails, data consumption comes to a standstill entirely.
The challenge of scaling
You could try running a second instance of your application, but how do you coordinate it? If both instances simply connect to the same topic, both will try to read and process every single message. This leads to redundant work where the same data is processed twice, leading to inconsistencies and wasted resources. What you need is a way to intelligently distribute the workload and scale horizontally without duplicating work.
What exactly is a consumer group?
Defining a consumer group
A consumer group is a collection of one or more consumer instances that read data from one or more topics in a coordinated manner. All consumers within a group share a common group.id and work together to process the data stream. The basic rule of a consumer group is that, for a given topic, each partition is consumed by only one consumer within the group. This ensures that each message is processed by only one consumer in the group, which prevents redundant work.
The partition assignment
When you start a consumer group, the consumers do not pull messages at random. Instead, Kafka’s Group Coordinator is responsible for assigning topic partitions to the consumers in the group. For example, if you have a topic with three partitions (partition 0, partition 1, partition 2) and a consumer group with two consumers (consumer A and consumer B), the group coordinator could assign partition 0 and partition 1 to consumer A and partition 2 to consumer B. When a third consumer joins the group, the group coordinator will rebalance the partition assignments to ensure that the workload is distributed as evenly as possible. This dynamic partition allocation is key to the scalability and fault tolerance of the group.
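To make this concrete, the following is a minimal sketch of a single group member using the Java client. The broker address, the topic name "orders" and the group id "order-processors" are illustrative assumptions; starting a second copy of this program with the same group.id causes the coordinator to split the topic's partitions between the two instances.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative broker
        // All instances sharing this group.id form one consumer group; the group
        // coordinator splits the topic's partitions among them.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // illustrative topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each record arrives from a partition this instance currently owns.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```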
The role of the partitions and the group coordinator
Partitions as a unit of parallelism
To understand how a consumer group works, you must first understand the role of partitions. A Kafka topic is divided into one or more partitions, which are the basic units of parallelism in Kafka. Each partition is an ordered, immutable sequence of records. When producers write data to a topic, they distribute the messages across its partitions. Consumers, in turn, read data from these partitions. The more partitions a topic has, the more consumers can be active in parallel within a group, which enables a higher degree of horizontal scalability.
The function of the group coordinator
The Group Coordinator is a designated Kafka broker that serves as the central manager for a consumer group. Each consumer in a group sends heartbeats to this coordinator at regular intervals to signal that it is still alive. The coordinator is responsible for managing the state of the group and orchestrating partition assignments. If a new consumer joins the group, an existing consumer leaves it, or a consumer fails to send a heartbeat in time, the group coordinator initiates a rebalance. During this process, partitions are revoked from some consumers and reassigned so that the workload stays evenly distributed among the active members of the group. The group coordinator is also responsible for managing and storing the committed offsets for the group.
Understanding the rebalance process
What is a rebalance?
A rebalance is the process by which Kafka redistributes ownership of partitions among the consumers in a consumer group. It is an important mechanism for ensuring the scalability and fault tolerance of the group. During a rebalance, the group coordinator revokes the current partition assignments and reassigns them to the active members of the group. The aim is always an even distribution of the workload, so that each consumer has a fair share of partitions to read from.
The triggers for a rebalance
Rebalancing is not a continuous process, but is triggered by certain events. The most common triggers are:
- A new consumer instance joins the group, e.g. when you scale your application.
- An existing consumer instance leaves the group properly, e.g. when the application is shut down.
- An existing consumer instance fails unexpectedly and no longer sends heartbeats to the group coordinator. This is why the session.timeout.ms configuration is so important: it defines how long the coordinator waits before it considers a consumer dead and triggers a rebalance.
- The partitions of a topic change, for example when an administrator adds new partitions to a topic.
The effects on consumption
Although rebalances are necessary to keep a consumer group healthy and resilient, they have a temporary impact on data consumption. During a rebalance, consumers stop processing messages from their assigned partitions and wait for the new assignments from the group coordinator. This leads to a brief pause in consumption and a temporary increase in latency. For this reason, it is important to design your applications to minimize unnecessary rebalances and to handle the temporary interruption gracefully.
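One way to handle that interruption deliberately is to register a ConsumerRebalanceListener when subscribing. The sketch below, with the same illustrative broker, topic and group id as before, commits the current progress when partitions are revoked so that the next owner resumes from the right place; this is a common pattern, not the only possible one.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before ownership is taken away: commit progress so the
                // next owner of these partitions resumes from the right position.
                consumer.commitSync();
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called once the new assignment is known: a good place to
                // (re)build any per-partition state.
                System.out.println("Assigned: " + partitions);
            }
        });

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record);
            }
            consumer.commitSync();
        }
    }

    // Placeholder for the application's real processing logic.
    private static void process(ConsumerRecord<String, String> record) {
    }
}
```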
Managing Consumer Offsets: The secret to statefulness
What are offsets?
In Kafka, an offset is a simple integer that uniquely identifies each message within a partition. For a consumer it acts as a bookmark, marking how far it has successfully read and processed in that partition. Each time a consumer reads a message from a partition, its position advances to the next offset. In this way, a consumer tracks its progress through the message stream of a partition. Offsets are key to building resilient applications because they allow a consumer to pause reading and later resume from where it left off without losing data.
The role of committed offsets
A consumer not only keeps track of its current position, but also commits its offsets to a special, compacted internal Kafka topic called __consumer_offsets. This is a critical step because it makes the consumer’s progress durable. When a consumer group rebalances or a consumer instance crashes, the consumer that takes over its partitions can read the last committed offset from the __consumer_offsets topic and start processing from that point. This mechanism ensures that messages are not unnecessarily reprocessed after a failure and that no messages are skipped, thereby preserving the integrity of message consumption.
Committing offsets: the compromise
There are two ways to commit offsets: automatically or manually. When the enable.auto.commit configuration is set to true, the consumer commits offsets automatically at a regular interval (controlled by auto.commit.interval.ms). While this approach is convenient, it can lead to “at-least-once” semantics, because messages may already have been processed but their offsets not yet committed when a crash occurs, so they are processed again after a restart. To gain more control, and to achieve exactly-once processing in certain scenarios, you can disable auto-commit and commit offsets manually after your application has successfully processed a batch of messages. This lets you control exactly when a message is considered “done”.
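A minimal sketch of the manual approach looks like the loop below; it assumes a consumer created with enable.auto.commit set to false (as in the earlier sketches) and a hypothetical handle() method containing the application's processing logic.

```java
// Assumes "consumer" was created with enable.auto.commit=false and is already
// subscribed; handle() is a hypothetical placeholder for the real logic.
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        handle(record);
    }
    // Commit only after the whole batch has been handled. If the process
    // crashes before this line, the batch is re-read on restart
    // (at-least-once) rather than silently skipped.
    consumer.commitSync();
}
```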
The three delivery semantics: At least once, at most once and exactly once
At most once
Delivery “at most once” means that a message is delivered to a consumer either once or not at all. This is the least reliable delivery semantic. It occurs when a consumer commits its offset before the message is processed. If the consumer crashes after the offset has been committed but before the message has been fully processed, the message is lost and will never be consumed again. This approach offers the best performance and the lowest latency, as there is no need to wait for processing to complete before committing. However, it is usually only suitable for use cases where occasional data loss is acceptable, e.g. when capturing non-critical metrics or logs.
At least once
Delivery “at least once” is the default and most common semantics in Kafka. It guarantees that a message is never lost and is delivered at least once. This is achieved by committing the offset only after the consumer has successfully processed the message. If the consumer crashes after processing but before the offset is committed, the offset is not updated; when the consumer restarts or a new consumer takes over, it will read from the last committed offset and process some messages again, resulting in duplicates. As long as your application is idempotent (i.e. it can process the same message multiple times without harmful side effects), this is acceptable, and no data is lost.
Exactly Once
Delivery “exactly once” ensures that each message is delivered to a consumer and processed exactly once, without duplicates and without data loss. This is the most demanding semantic and is usually only achievable for Kafka-to-Kafka workflows, e.g. when a consumer reads from one topic, processes the data and then writes the result to another topic. It is implemented with Kafka’s transactional API, which lets you atomically commit a batch of messages written to an output topic together with the corresponding consumer offsets of the input topic. If the transaction is aborted, neither the output messages nor the consumer offsets are committed, ensuring a consistent state.
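The sketch below shows the general shape of such a read-process-write loop using the Java client's transactional API. The topic names, group id and transactional.id are placeholders, the “processing” is just an illustrative toUpperCase(), and error handling (aborting the transaction on failure) is omitted for brevity.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceRelay {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "relay-group");
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Only read records that belong to committed transactions upstream.
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "relay-1"); // placeholder id
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("input-topic"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic",
                            record.key(), record.value().toUpperCase()));
                    // The committed offset is the next record to read, hence +1.
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1));
                }
                // Output records and input offsets commit atomically: either both
                // become visible, or neither does.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}
```

Because the offsets travel inside the producer’s transaction rather than through the consumer’s own commit, a crash between poll and commitTransaction simply aborts the whole unit of work.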
General challenges and their solution
Too many consumers
A common mistake when scaling a consumer group is to add more consumers than there are partitions in the topic. Remember that within a consumer group, each partition can only be consumed by one consumer at a time. If you have a topic with 10 partitions and a consumer group with 12 consumers, two of those consumers will be idle: they are assigned no partitions and just sit there, consuming resources without contributing to processing. The solution is to make sure that the number of consumers in your group is less than or equal to the number of partitions in the topics you are consuming from.
Consumer Lag
Consumer lag is one of the most important metrics to monitor. It indicates how many messages a consumer is behind the latest message produced to a topic partition. A consumer is “lagging” if it cannot process messages as quickly as they are produced. A high or growing lag indicates that your consumer group is not keeping up with the flow of data. This may be due to slow processing logic, network latency or simply too few consumer instances. To fix the lag, you can either optimize your application’s processing code to make it more efficient, or scale out your consumer group by adding more consumers, up to the number of partitions.
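A rough way to observe lag from inside the application, assuming consumer is an active KafkaConsumer that already has partitions assigned, is to compare the partition end offsets with the consumer's current positions:

```java
// Rough lag check from inside the application; assumes "consumer" is an
// active KafkaConsumer<String, String> that already has partitions assigned.
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());
for (TopicPartition tp : consumer.assignment()) {
    long lag = endOffsets.get(tp) - consumer.position(tp);
    System.out.printf("%s lag=%d%n", tp, lag);
}
```

In practice, lag is usually tracked externally, for example through the consumer’s built-in metrics or the kafka-consumer-groups command-line tool, rather than computed in application code.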
Frequent rebalances
Frequent rebalances can be a significant problem in production, as they cause temporary interruptions in consumption and increase latency. The most common cause is a consumer instance failing to send a heartbeat to the group coordinator within the configured session.timeout.ms. This can be caused by a long garbage-collection pause, a network problem or slow processing blocking the consumer’s poll loop. To address this, you can increase the session.timeout.ms and max.poll.interval.ms settings, which gives the consumer more time to process messages and to signal that it is alive. You should also monitor the health of your application and make sure it has enough resources to avoid unexpected delays.
Important configuration settings for consumer groups
The basics: identity and offset management
When configuring a Kafka consumer, several key settings directly affect how it behaves within a consumer group. The most basic is group.id, a unique string that identifies the consumer group. All consumers that share this group.id are considered part of the same group and coordinate to consume a topic. For offset management, auto.offset.reset is an important setting. It defines what to do when a consumer starts and no committed offset can be found for it in the __consumer_offsets topic. The two common values are earliest, which tells the consumer to start reading at the beginning of the partition, and latest, which tells it to start with the newest messages.
Committing offsets and session management
The enable.auto.commit setting controls whether offsets are automatically committed in the background. If it is set to true, the consumer commits offsets at a regular interval defined by auto.commit.interval.ms. For manual offset control, set it to false. To avoid unnecessary rebalances, two further settings are important: session.timeout.ms and heartbeat.interval.ms. The session.timeout.ms setting determines how long the group coordinator waits for a consumer’s heartbeat before it assumes the consumer has failed and triggers a rebalance. heartbeat.interval.ms is the frequency at which the consumer sends these heartbeats; it should be set to a value well below session.timeout.ms so the consumer can regularly signal that it is alive.
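Taken together, a consumer configuration that touches these settings might look like the sketch below; the broker address, group id and timeout values are illustrative examples rather than recommendations.

```java
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");   // illustrative broker address
props.put("group.id", "order-processors");          // identity of the consumer group
props.put("auto.offset.reset", "earliest");         // where to start when no committed offset exists
props.put("enable.auto.commit", "false");           // commit manually after processing
props.put("session.timeout.ms", "45000");           // how long the coordinator waits for heartbeats
props.put("heartbeat.interval.ms", "15000");        // how often heartbeats are sent (well below the session timeout)
props.put("max.poll.interval.ms", "300000");        // max time allowed between poll() calls
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
```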
Best practices for production environments
Monitoring the most important metrics
In a production environment, it is not enough to simply run a consumer group; you need to monitor its health and performance. The most important metric to track is consumer lag, which indicates how far behind a consumer is in a topic’s partitions. A high or increasing lag is an early warning sign that your consumer group is not keeping up with incoming data. Other important metrics include the rate of messages consumed and, just as importantly, the frequency of rebalances. Frequent rebalances can indicate instability in your system, such as consumers crashing or being too slow to send heartbeats, which leads to higher latency and lower throughput.
The right number of partitions
Choosing the right number of partitions for your topics is a fundamental decision that has a direct impact on the scalability of consumer groups. You cannot have more active consumers in a group than you have partitions. As a rule of thumb, you should start with enough partitions to meet your current and foreseeable scaling requirements. A topic with too few partitions limits the scalability of your consumer group, while a topic with far too many partitions causes unnecessary overhead and resource consumption. Consider the expected throughput and the number of consumers you plan to deploy.
Graceful shutdowns and idempotent processing
To avoid unnecessary rebalances and ensure data integrity, you should always implement graceful shutdowns for your consumer applications. When your application receives a shutdown signal, it should commit its latest offsets and close its consumer. This signals to the group coordinator that the consumer is leaving the group intentionally and triggers an immediate rebalance, without the delay of waiting for a session timeout. Since at-least-once delivery is the common case, your message processing logic should also be idempotent, meaning your application can process the same message multiple times without negative side effects. For example, if you are updating a database record, the update should be keyed on a unique identifier so that applying the same update several times has the same result as applying it once.
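A common shape for such a shutdown in the Java client is a shutdown hook that calls wakeup() to break the consumer out of poll(), followed by a final commit and close(). In this sketch, createConsumer() and process() are illustrative stand-ins for the application's own setup and processing code.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GracefulConsumer {
    public static void main(String[] args) {
        KafkaConsumer<String, String> consumer = createConsumer();
        final Thread mainThread = Thread.currentThread();

        // On SIGTERM/SIGINT, wake the consumer out of poll() and wait for cleanup.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            consumer.wakeup();
            try {
                mainThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));

        try {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);
                }
                consumer.commitSync();
            }
        } catch (WakeupException e) {
            // Expected on shutdown; fall through to the cleanup below.
        } finally {
            // Commit the last processed offsets and leave the group cleanly, so the
            // coordinator can rebalance immediately instead of waiting for
            // session.timeout.ms to expire.
            consumer.commitSync();
            consumer.close();
        }
    }

    // Illustrative setup, same assumptions as the earlier sketches.
    private static KafkaConsumer<String, String> createConsumer() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    // Placeholder for the application's real processing logic.
    private static void process(ConsumerRecord<String, String> record) {
    }
}
```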
Conclusion: The power of groups for scalability and resilience
Why consumer groups are important
Kafka consumer groups are not just an optional feature, but a cornerstone of building robust, scalable and fault-tolerant streaming data applications. By allowing multiple consumer instances to work together under a single group.id, Kafka provides an elegant solution to a fundamental problem: how to process a massive stream of data without creating bottlenecks or single points of failure. The group’s ability to dynamically distribute partitions among its members ensures that your processing capacity can scale horizontally as your data volume grows.
The key messages: scalability and resilience
The key strength of consumer groups lies in their ability to provide both scalability and resilience. Scalability is achieved through parallel processing across multiple consumers, allowing you to increase throughput simply by adding more instances to the group. Resilience is achieved through the rebalance mechanism. If a consumer fails, the group coordinator automatically reassigns its partitions to the remaining healthy consumers, ensuring that data processing continues with minimal interruption. This self-healing makes consumer groups an essential component for any business-critical application that relies on Kafka.
The future of your application
Mastering the concepts of consumer groups, from understanding partitions and offsets to managing rebalances, gives you the ability to build applications that can handle real-world data streams. With this knowledge, you will be able to design systems that are not only efficient today, but can also grow and adapt to future needs, solidifying Kafka’s role as a powerful platform for streaming data.

