A few months ago I got in touch with Kafka – the message broker Apache Kafka rather than Franz Kafka the novelist.
Apache Kafka is simply a message broker which claims to be:
- Fast – Hundreds of MB r/w per second from thousands of clients
- Central data backbone – Data streams are partitioned and spread over a cluster of machines
- Durable – Messages are persisted on disk and replicated within the cluster
- Fault-tolerant – Cluster-centric design
First cup and sketch
Digging into the documentation was quite straightforward. Even walking through the Quick Start and setting up a first tiny two-node cluster was possible by just following the outlined steps. But Kafka depends on Apache ZooKeeper, so this tool has to be installed first. Besides that, Kafka comes with dozens of configuration parameters to deal with the complexity of a distributed system. A few console tools exist to administrate Kafka, e.g. to create topics, as well as to produce messages and to consume messages from a topic.
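As a sketch of those console tools, the following commands (assuming a 0.8.x installation with a broker on localhost:9092 and ZooKeeper on localhost:2181, and a hypothetical topic name "test") create a topic, then produce and consume a few messages:

```shell
# Create a topic – the 0.8.x admin tools talk to ZooKeeper directly
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 2 --partitions 4 --topic test

# Produce messages: every line typed on stdin becomes one message
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

# Consume the topic from the beginning, printing messages to stdout
bin/kafka-console-consumer.sh --zookeeper localhost:2181 \
  --topic test --from-beginning
```

Note that the producer addresses the brokers directly, while topic creation and the console consumer go through ZooKeeper – a good hint at how central ZooKeeper is to this version of Kafka.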
In version 0.8.x three APIs are provided to send (produce) and receive (consume) messages from Kafka: one to produce messages and two to consume them. The producer API is quite handy. A high-level API provides a simple way to consume messages and hides the complex details, but offers limited control. A simple API provides full control over consuming messages, but the integration code works on a low level. Quite frankly, the Simple API should be named "Complex API". Anyway, with version 0.9.x a new client API will be shipped, which could be easier to use and could work without ZooKeeper.
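To give an impression of the first two, here is a minimal sketch of the 0.8.x producer API and the high-level consumer API. It assumes a broker on localhost:9092, ZooKeeper on localhost:2181, and a hypothetical topic "test"; it needs a running cluster to actually do anything:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaSketch {

    static void produce() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));
        // Send one message with key "key" to topic "test"
        producer.send(new KeyedMessage<String, String>("test", "key", "hello kafka"));
        producer.close();
    }

    static void consume() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // the high-level consumer needs ZooKeeper
        props.put("group.id", "demo-group");

        ConsumerConnector consumer =
            kafka.consumer.Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Request one stream (thread) for the topic and iterate over incoming messages
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            consumer.createMessageStreams(Collections.singletonMap("test", 1));
        for (MessageAndMetadata<byte[], byte[]> mm : streams.get("test").get(0)) {
            System.out.println(new String(mm.message()));
        }
    }
}
```

The consumer side shows why the high-level API is convenient: partition assignment, offsets and rebalancing are all hidden behind the stream iterator – exactly the details the Simple API forces you to handle yourself.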
Log centric approach
Due to its distributed design and messaging concept, Kafka can be used as a central data log, providing data, e.g. user activity on web pages, in real time to several consumers.
Jay Kreps, one of the founders of Kafka at LinkedIn, describes this log approach and further fundamental concepts of Kafka in the following book, post and talk – so I don’t want to repeat them here.
- I Heart Logs: Event Data, Stream Processing, and Data Integration
- The Log: What every software engineer should know about real-time data’s unifying abstraction
- Airbnb Tech Talk: Jay Kreps – Building LinkedIn’s Real-time Pipeline
In the end this approach allows building decoupled systems that can evolve into an event-driven architecture.
The ecosystem around Kafka is growing and supports a number of scenarios for dealing with data, especially big data, in real time.
Later in the evening Camus stopped by and claimed to be the pipeline between Kafka and Apache Hadoop. Indeed Camus, like Kafka a project open-sourced by LinkedIn, runs on a Hadoop cluster. It consumes data out of Kafka topics and pushes it as files into HDFS, one folder per topic.
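Camus runs as a plain MapReduce job, so a run boils down to one command. A minimal sketch, where the jar and properties file names are placeholders for whatever your build and configuration produce:

```shell
# Launch one Camus ingestion run on the Hadoop cluster;
# camus.properties points at the Kafka brokers and the HDFS target paths
hadoop jar camus-example-shaded.jar com.linkedin.camus.etl.kafka.CamusJob \
  -P camus.properties
```

Each run picks up where the previous one left off, since Camus tracks consumed Kafka offsets in HDFS between executions.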
UPDATE Sept. ’15: Camus is superseded by Gobblin.