Kafka is a rare joy to work with in the distributed data systems space. Tools in this space can be unwieldy, daunting in their complexity, and prone to surprising behavior.
Kafka is simple given its power, and working with it in production over the last two years of my career has been extremely rewarding.
This article shares a bit of what I’ve learned about consuming from Kafka topics and doing work on the data.
I am focusing on just consuming for the moment because it is by far the more complicated and open-ended side of Kafka’s paradigm. Producing messages is just about the easiest thing in the world by design and nature. Fire and forget, who cares who’s listening?
1. Decouple Consumption and Processing
This one is #1 for a reason. The first thing you’ll realize when you start working with a Kafka consumer is that it has basically one purpose — to stream data from the brokers really really fast.
The consumer provides a number of ways to control fetching data, but generally that’s what it is for. The consumer is also responsible for letting Kafka know that you are done processing data (committing offsets) and this is a very important final step in your data-flow.
The meat of the consumer sandwich is what you do with the data in between receiving it from the brokers and acknowledging that you are finished with it.
Many of the simple Hello World examples of Kafka consumption show data manipulation (doing work) right in the consumer’s poll/commit loop (synchronously, even), rely on the auto-commit feature of the high-level Kafka consumer API for time-interval-based message acknowledgment, or both.
These examples are convenient for their simplicity and well suited to simpler operations (like dumb, idempotent sinks). As soon as you start trying to inject any kind of business logic, more complex IO, or maybe re-producing to a new topic, you may find you need more precise control.
This is where decoupling your consumer thread from your “worker threads” becomes especially handy. Caveat: we are assuming you have threads in your runtime or some kind of concurrency abstraction like CSP-style channels.
To spell it out, you consume messages as fast as digitally possible from your consumer and you dump them into another routine that knows how to process them and report back with failures or success asynchronously.
This achieves two key conceptual advantages:
- Your consumer can now continuously fetch for as long as the message processing queue decides not to block (apply back-pressure). This essentially creates a memory buffer of messages waiting to be processed and can be used as a signal for very cool advanced techniques like micro-scaling.
- Your worker processes are now much more pure and easily testable. They do not care where the data is coming from specifically. You can write purely algorithmic transformations and then end with “Success -> Output” or “Failure -> Reason” in an asynchronous fashion, and the actual business logic or algorithm does not concern itself with the consequence of its outcome.
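The hand-off described above can be sketched with nothing but the Python standard library. This is a minimal sketch under stated assumptions, not a real Kafka client: `run_pipeline`, `process`, `num_workers`, and `max_buffered` are hypothetical names, and a plain list stands in for the stream of messages a poll loop would deliver.

```python
import queue
import threading


def run_pipeline(messages, process, num_workers=4, max_buffered=100):
    """Fan messages out to worker threads; return sorted (offset, ok, detail) tuples."""
    work = queue.Queue(maxsize=max_buffered)  # bounded: put() blocks when full
    results = queue.Queue()                   # workers report outcomes here

    def worker():
        while True:
            item = work.get()
            if item is None:                  # sentinel: no more work for this thread
                break
            offset, payload = item
            try:
                results.put((offset, True, process(payload)))   # Success -> Output
            except Exception as exc:
                results.put((offset, False, repr(exc)))         # Failure -> Reason

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The "consumer" side. In a real application this would be the poll loop,
    # and the list index below would be the message offset.
    for offset, payload in enumerate(messages):
        work.put((offset, payload))
    for _ in threads:
        work.put(None)                        # one sentinel per worker
    for t in threads:
        t.join()

    outcomes = []
    while not results.empty():                # safe: all workers have finished
        outcomes.append(results.get())
    return sorted(outcomes)


if __name__ == "__main__":
    for outcome in run_pipeline(["1", "2", "oops", "4"], int, num_workers=2):
        print(outcome)
```

Note that `process` knows nothing about Kafka, which is what makes it easy to test in isolation, and that the bounded `work` queue is what lets the consuming side feel back-pressure.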
There is also a resource bonus, because you are only reserving a single thread (if we’re talking about something like the JVM) for performing network I/O to the Kafka brokers.
How does that happen? Simply because a Kafka consumer uses a single IO thread, can subscribe to an arbitrary set of topics, and fetches in batches across topics and partitions. This is highly efficient network usage.
Compare that with running one consumer thread per topic (or worse yet, per topic and partition). This would be massively wasteful from a resource standpoint on your consumer node, and very excessive on the broker side, as it creates many more incoming connections, not to mention added latency.
It’s a little trickier to reason about this separation and implement it correctly in a fault-tolerant manner. It is, however, well worth it if you are going to do anything more complicated than blind streaming data replication or other types of jobs that do not require per-message guarantees or transformations.
For some good examples on how to do this and a good overview of the options, I recommend taking a look at the original Javadoc for the Kafka consumer interface. While it’s possible to implement this in Golang, Python, and various other Kafka-supported runtimes, you may find the Javadoc has some wisdom not always found in the documentation for the other runtimes.
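One part of the fault-tolerance puzzle is knowing which offset is safe to commit once workers finish out of order. The sketch below is one possible approach, not something taken from the Kafka documentation: `OffsetTracker` and its methods are hypothetical names, it tracks a single partition, and it assumes it is only called from the consumer thread as results come back. It relies on the Kafka convention that the committed offset is the next offset to read.

```python
class OffsetTracker:
    """Tracks completed offsets for one partition and reports what is safe to commit."""

    def __init__(self, start_offset):
        self._next_to_commit = start_offset  # every offset below this is done
        self._done = set()                   # completed offsets at or above it

    def mark_done(self, offset):
        """Record that the message at `offset` has been fully processed."""
        if offset >= self._next_to_commit:
            self._done.add(offset)
        # Advance past every contiguous completed offset.
        while self._next_to_commit in self._done:
            self._done.remove(self._next_to_commit)
            self._next_to_commit += 1

    def committable(self):
        """Offset to commit: the next offset the group should read."""
        return self._next_to_commit


if __name__ == "__main__":
    tracker = OffsetTracker(start_offset=10)
    tracker.mark_done(12)          # finished early; 10 and 11 are still in flight
    print(tracker.committable())   # still 10
    tracker.mark_done(10)
    tracker.mark_done(11)
    print(tracker.committable())   # now 13
```

Committing only the contiguous prefix means a crash re-delivers some messages that were already processed, but never skips one that was not.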
2. Create Back-pressure
Anyone who has built streaming systems will think this is obvious, but I’ve found beginners do not.
Simply put, an unchecked Kafka consumer is a DoS attack on your application waiting to happen.
To be more specific, Kafka is capable of delivering messages over the network at an alarmingly fast rate. Odds are that even the most efficient consumers will reach a point where they cannot keep pace once the volume and rate get high enough. This is especially true if your message workers have a slow operation embedded in them, like an external network call or heavy computation.
A great way to benchmark your consumer is to initiate a replay of a large topic from its beginning on a single instance of your consumer. The message throughput just before it runs out of memory and dies is roughly how much a single instance of your task can handle for a given amount of allocated memory.
Armed with a benchmark like this, you can introduce back-pressure on the Kafka consumer with a bounded blocking queue or a comparable abstraction. Depending on the Kafka version, you either pause the consumer using the API or block the poll loop, so that no additional messages are fetched until your application has caught up.
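The pause-based variant can be sketched as a single-threaded simulation. Everything here is an assumption made for illustration: `FakeConsumer` is a hypothetical stand-in and not a real client API (the Java consumer's `pause` and `resume`, for example, take a collection of partitions), and `high_water` and `low_water` are made-up thresholds you would derive from your own benchmark.

```python
from collections import deque


class FakeConsumer:
    """Hypothetical stand-in for a Kafka client; not a real library API."""

    def __init__(self, messages, max_records=2):
        self._pending = deque(messages)
        self._max_records = max_records
        self.paused = False
        self.polls_while_paused = 0

    def poll(self):
        if self.paused:                      # paused: poll still runs, returns nothing
            self.polls_while_paused += 1
            return []
        count = min(self._max_records, len(self._pending))
        return [self._pending.popleft() for _ in range(count)]

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def exhausted(self):
        return not self._pending


def consume_with_backpressure(consumer, handle, high_water=4, low_water=1):
    """Drain the consumer, pausing fetches while the buffer is full. Returns peak buffer size."""
    buffer = deque()
    peak = 0
    while not consumer.exhausted() or buffer:
        buffer.extend(consumer.poll())       # keep calling poll even while paused
        peak = max(peak, len(buffer))
        if not consumer.paused and len(buffer) >= high_water:
            consumer.pause()                 # stop fetching; the buffer is full enough
        if buffer:
            handle(buffer.popleft())         # stands in for one slow unit of work
        if consumer.paused and len(buffer) <= low_water:
            consumer.resume()                # the workers caught up; fetch again
    return peak


if __name__ == "__main__":
    seen = []
    consumer = FakeConsumer(range(20))
    print("peak buffer size:", consume_with_backpressure(consumer, seen.append))
    print("messages handled:", len(seen))
```

The design point is that the loop never stops calling `poll`; it only stops receiving data, so the buffer stays bounded without the poll loop itself going quiet.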
If through benchmarking you can correlate the rough ratio between queue size and memory usage, you can set the max queue size automatically relative to the memory allocation made for a given consumer instance. This is a safety feature to mitigate the possibility of the task running out of memory. Again, this can also be used as a micro-scaling trigger.
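As a rough illustration of that sizing step, the arithmetic might look like the following. The function name, its parameters, and the numbers in the usage example are all hypothetical; the bytes-per-message figure is whatever your own benchmark suggests, and `queue_fraction` is an assumed share of memory you are willing to give to buffered messages.

```python
def max_queue_size(memory_budget_bytes, bytes_per_queued_message, queue_fraction=0.5):
    """Return how many messages may be buffered within a share of the memory budget."""
    if bytes_per_queued_message <= 0:
        raise ValueError("bytes_per_queued_message must be positive")
    if not 0 < queue_fraction <= 1:
        raise ValueError("queue_fraction must be in (0, 1]")
    usable = memory_budget_bytes * queue_fraction
    # Always allow at least one message so the consumer can make progress.
    return max(1, int(usable // bytes_per_queued_message))


if __name__ == "__main__":
    # Example: 512 MiB allocated, roughly 2 KiB of heap per queued message,
    # half of the allocation reserved for the queue.
    print(max_queue_size(512 * 1024 * 1024, 2048, queue_fraction=0.5))
```

The result would then be passed as the capacity of whatever bounded queue sits between the consumer and the workers.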
A note on BLOCKING the consumer poll loop:
In Kafka versions below 0.10.1, you should not create back-pressure by doing this. Doing so will also block the consumer heartbeat, and if you are using the consumer group management features of Kafka, this will cause the brokers to think your consumer has died entirely. This is not good, as it will continuously trigger rebalances, which in some cases can bring a consumer group to its knees and create a never-ending backlog of re-processed messages. In Kafka 0.10.1 and above, this was corrected and the heartbeat was moved to a separate background thread. Even then, the time between polls is still bounded by max.poll.interval.ms, so a poll loop that blocks for too long will still trigger a rebalance.
For interested readers, Zach Tellman (@ztellman) covers back-pressure (among other things) in this talk examining some of the fundamental characteristics of high-volume data-flow. To summarize with a quote from the talk:
Unbounded queues are fundamentally broken.
We’ll leave it at that for now, but there is much more to learn about Kafka Consumer patterns, tricks and gotchas. Stay tuned for more articles on the topic.