A few months ago I got in touch with Kafka – the message broker Apache Kafka rather than Franz Kafka the novelist.
Apache Kafka is simply a message broker which claims to be:
- Fast – Hundreds of MB r/w per second from thousands of clients
- Central data backbone – Data streams are partitioned and spread over a cluster of machines
- Durable – Messages are persisted on disk and replicated within the cluster
- Fault-tolerant – Cluster-centric design
First cup and sketch
Digging into the documentation was quite straightforward. Even walking through the Quick Start and setting up a first tiny two-node cluster was possible by just following the outlined steps. But Kafka depends on Apache ZooKeeper, so this tool has to be installed first. Besides that, Kafka comes with dozens of configuration parameters to deal with the complexity of a distributed system. A few console tools exist to administrate Kafka, e.g. to create topics, as well as to produce messages and to consume messages from a topic.
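As a sketch of those console tools, the following commands (assuming a 0.8.x installation with a broker on localhost:9092 and ZooKeeper on localhost:2181, and a hypothetical topic name "test") create a topic, then produce and consume a few messages:

```shell
# Create a topic – the 0.8.x admin tools talk to ZooKeeper directly
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 2 --partitions 4 --topic test

# Produce messages: every line typed on stdin becomes one message
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

# Consume the topic from the beginning, printing messages to stdout
bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
  --topic test --from-beginning
```

Note that the producer addresses the brokers directly, while topic creation and the console consumer go through ZooKeeper – a good hint at how central ZooKeeper is to this version of Kafka.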
In version 0.8.x three APIs are provided to send (produce) and receive (consume) messages from Kafka: one to produce messages and two to consume them. The producer API is quite handy. A high-level API provides a simple way to consume messages and hides the complex details, but offers limited control. A simple API provides full control over consuming messages, but the integration code works on a low level. Quite frankly, the Simple API should be named "Complex API". Anyway, with version 0.9.x a new client API will be shipped, which could be easier to use and could work without ZooKeeper.
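To give an impression of the first two, here is a minimal sketch of the 0.8.x producer API and the high-level consumer API. It assumes a broker on localhost:9092, ZooKeeper on localhost:2181, and a hypothetical topic "test"; it needs a running cluster to actually do anything:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaSketch {

    static void produce() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));
        // Send one message with key "key" to topic "test"
        producer.send(new KeyedMessage<String, String>("test", "key", "hello kafka"));
        producer.close();
    }

    static void consume() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // the high-level consumer needs ZooKeeper
        props.put("group.id", "demo-group");

        ConsumerConnector consumer =
            kafka.consumer.Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Request one stream (thread) for the topic and iterate over incoming messages
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            consumer.createMessageStreams(Collections.singletonMap("test", 1));
        for (MessageAndMetadata<byte[], byte[]> mm : streams.get("test").get(0)) {
            System.out.println(new String(mm.message()));
        }
    }
}
```

The consumer side shows why the high-level API is convenient: partition assignment, offsets and rebalancing are all hidden behind the stream iterator – exactly the details the Simple API forces you to handle yourself.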
Log centric approach
Due to its distributed design and messaging concept, Kafka can be used as a central data log, providing data, e.g. user activity on web pages, in real time to several consumers.
Jay Kreps, one of the founders of Kafka at LinkedIn, describes this log approach and further fundamental concepts of Kafka in the following book, post and talk – so I don’t want to repeat them here.
- I Heart Logs: Event Data, Stream Processing, and Data Integration
- The Log: What every software engineer should know about real-time data’s unifying abstraction
- Airbnb Tech Talk: Jay Kreps – Building LinkedIn’s Real-time Pipeline
In the end this approach allows building decoupled systems that can evolve into an event-driven architecture.
The ecosystem around Kafka is growing and supports a number of scenarios for dealing with data, especially big data, in real time.
Later in the evening Camus stopped by and claimed to be the pipeline between Kafka and Apache Hadoop. Indeed Camus, like Kafka a project open-sourced by LinkedIn, runs on a Hadoop cluster. It consumes data out of Kafka topics and pushes it as files into HDFS, one folder per topic.
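Camus runs as a plain MapReduce job, so a run boils down to one command. A minimal sketch, where the jar and properties file names are placeholders for whatever your build and configuration produce:

```shell
# Launch one Camus ingestion run on the Hadoop cluster;
# camus.properties points at the Kafka brokers and the HDFS target paths
hadoop jar camus-example-shaded.jar com.linkedin.camus.etl.kafka.CamusJob \
  -P camus.properties
```

Each run picks up where the previous one left off, since Camus tracks consumed Kafka offsets in HDFS between executions.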
UPDATE Sept. ’15: Camus is superseded by Gobblin.