Kafka

Kafka, what is it?

  • Originally developed by LinkedIn for internal use; donated to the Apache Software Foundation and open-sourced in 2011

  • Initially developed as a message queue, later extended into a distributed streaming platform

  • Apache Kafka is a publish-subscribe based durable messaging system. A messaging system sends messages between processes, applications, and servers.

  • Kafka acts as the central nervous system that makes streaming data available to applications. It builds real-time data pipelines responsible for data processing and transferring between different systems that need to use it.

  • The key design principles of Kafka were formed based on the growing need for high-throughput architectures that are easily scalable and provide the ability to store, process, and reprocess streaming data.

Setup and Basic Working

For sample code and all the Kafka-related functionalities with Python, refer to the link below. It covers the following functionalities:

  • Installation (creating docker container using a docker-compose file)

  • Creating a topic

  • List all the topics

  • Delete a topic

  • Producing to a topic

  • Consuming from a topic

  • Sample example with a database of 4 students from a school, which includes all functionalities
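As a sketch of the installation step, a minimal docker-compose file might look like the following (image names, versions, and ports are assumptions; adjust them to your setup):

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

With this running (`docker-compose up -d`), a Python client can reach the broker at `localhost:9092`.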

What keywords to know?

Four main parts in a Kafka system

  • Broker: A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. It handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster. There can be one or more brokers in a cluster.

    • Acts as a channel between producers and consumers

  • Zookeeper: Keeps the state of the cluster (brokers, topics, users).

    • It is a high-performance, open-source coordination service for distributed applications, adopted by Kafka. It lets Kafka manage its cluster resources properly.

    • Checks the health of each partition of every topic and maintains the leader information

    • Keeps metadata (brokers, topics, partitions, leader information, access control list (who is allowed to write and who is not), dynamic configuration, quota (client throttling))

  • Producer: Producers are processes that push records into Kafka topics within the broker

  • Consumer: A consumer pulls records off a Kafka topic. It consumes records for further processing
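The broker, producer, and consumer roles can be pictured with a toy in-memory sketch (this is not the Kafka API, just an illustration of the responsibilities; ZooKeeper's bookkeeping is omitted):

```python
class ToyBroker:
    """Stores records per topic in an append-only log, like a Kafka broker."""
    def __init__(self):
        self.topics = {}  # topic name -> list of records

    def append(self, topic, record):
        self.topics.setdefault(topic, []).append(record)

    def read(self, topic, offset):
        """Return all records at or after the given offset."""
        return self.topics.get(topic, [])[offset:]


class ToyProducer:
    """Pushes records into topics on the broker."""
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, record):
        self.broker.append(topic, record)


class ToyConsumer:
    """Pulls records off a topic, remembering its position (offset)."""
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0  # position of the next record to read

    def poll(self, topic):
        records = self.broker.read(topic, self.offset)
        self.offset += len(records)
        return records


broker = ToyBroker()
producer = ToyProducer(broker)
consumer = ToyConsumer(broker)
producer.send("students", {"name": "alice"})
producer.send("students", {"name": "bob"})
print(consumer.poll("students"))  # both records, in order
print(consumer.poll("students"))  # [] -- nothing new yet
```

Note that the producer and consumer never talk to each other directly, only to the broker, which is the loose coupling described later in this page.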

Other components

  • Topic:

    • All Kafka records (messages) are organized into topics.

    • Topics can be thought of as categories.

    • Producer applications write data to topics and consumer applications read from topics.

  • Group id:

    • As said, we can consume messages from the same topic more than once; the group id plays an important role in consuming the same message for different purposes (multiple streams)

    • It is a consumer property; a consumer consumes records from a topic with a particular group id

    • The group id, together with its committed offset for a topic, is an important combination for overall Kafka processing: it determines where each consumer group resumes reading

  • Offset:

    • An offset is the position of a record within a partition; each consumer group tracks (commits) the offset up to which it has consumed

    • When a group has no committed offset, the `auto.offset.reset` setting decides where to start: "earliest" (the beginning of the partition) or "latest" (only records produced from now on)
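The effect of the "earliest" and "latest" settings can be sketched as follows (a simplification of Kafka's `auto.offset.reset` consumer setting; the function name here is made up for illustration):

```python
def starting_offset(committed, log_end_offset, reset_mode):
    """Decide where a consumer group starts reading a partition.

    committed:      the group's last committed offset, or None if it has none
    log_end_offset: offset one past the last record in the partition
    reset_mode:     "earliest" or "latest", used only when nothing is committed
    """
    if committed is not None:
        return committed          # resume where the group left off
    if reset_mode == "earliest":
        return 0                  # start from the beginning of the partition
    if reset_mode == "latest":
        return log_end_offset     # only read records produced from now on
    raise ValueError("unknown reset mode: " + reset_mode)


print(starting_offset(None, 100, "earliest"))  # 0
print(starting_offset(None, 100, "latest"))    # 100
print(starting_offset(42, 100, "latest"))      # 42 -- committed offset wins
```

The key point is that the reset mode only matters the first time a group id sees a topic; after that, the committed offset takes over.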

  • Schema:

    • Kafka itself stores record values as bytes; records are commonly serialized using the Apache Avro format

    • The schema is defined in a `.avsc` file (an Avro schema, written in JSON); this applies regardless of the client language, Java or Python

    • The database schema is its structure described in a formal language supported by the database management system. The term "schema" refers to the organization of data as a blueprint of how the database is constructed
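As an illustration, a schema for the 4-student example mentioned above might be defined in a `.avsc` file like the one below (the file name and field names are assumptions). Since `.avsc` is plain JSON, it can be built and inspected with the standard library:

```python
import json

# Contents of a hypothetical student.avsc file: an Avro record schema is JSON.
student_schema = {
    "type": "record",
    "name": "Student",
    "namespace": "school",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "grade", "type": "string"},
    ],
}

avsc_text = json.dumps(student_schema, indent=2)  # what gets written to student.avsc
parsed = json.loads(avsc_text)
print(parsed["name"], [f["name"] for f in parsed["fields"]])
```

In a real pipeline, a serialization library (e.g. an Avro client) would use this schema to encode records before producing and to decode them after consuming.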

  • Partitions:

    • Partitions are the main concurrency mechanism in Kafka

    • A topic is divided into 1 or more partitions, enabling producer and consumer loads to be scaled

    • A consumer group supports as many consumers as partitions for a topic

    • The consumers are shared evenly across the partitions, allowing for the consumer load to be linearly scaled by increasing both consumers and partitions

    • You can have fewer consumers than partitions (in which case consumers get messages from multiple partitions)

    • If you have more consumers than partitions, some of the consumers will be “starved” and not receive any messages until the number of consumers drops to (or below) the number of partitions

    • Further details of the partition can be found here
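The consumer-to-partition rules above can be sketched with a simple round-robin assignment (a simplification; Kafka's actual assignors are configurable range/round-robin/sticky strategies):

```python
def assign(partitions, consumers):
    """Spread partition ids across consumers round-robin.

    With fewer consumers than partitions, each consumer gets several partitions;
    with more consumers than partitions, the excess consumers get nothing (starved).
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


print(assign(range(4), ["c1", "c2"]))        # {'c1': [0, 2], 'c2': [1, 3]}
print(assign(range(2), ["c1", "c2", "c3"]))  # c3 gets [] -- starved
```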

Architecture

Image Credits: DataCouch YouTube video

Four Core API Architecture Kafka uses:

  • Producer API: The Producer API in Kafka allows an application to publish a stream of records to one or more Kafka topics.

  • Consumer API: An application can subscribe to one or more Kafka topics using the Kafka Consumer API. It also enables the application to process streams of records generated in relation to such topics.

  • Streams API: The Kafka Streams API allows an application to use a stream processing architecture to process data in Kafka. An application can use this API to take input streams from one or more topics, process them using streams operations, and generate output streams to transmit to one or more topics. The Streams API allows you to convert input streams into output streams in this manner.

  • Connect API: The Kafka Connector API connects Kafka topics to applications. This opens up possibilities for constructing and managing the operations of producers and consumers, as well as establishing reusable links between these solutions. A connector, for example, may capture all database updates and ensure that they are made available in a Kafka topic.

Properties:

Time-stamped Messages

Messages are time stamped. If for some reason we want to restart message consumption, we can use the time stamp to start consuming messages again from a particular date; the time stamp is resolved to an offset, and consumption is then controlled by that offset.
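Restarting from a date works because each record's timestamp can be mapped back to an offset (Kafka exposes this lookup as `offsetsForTimes`). It can be pictured as a search over (offset, timestamp) pairs:

```python
import bisect

def offset_for_time(records, target_ts):
    """records: list of (offset, timestamp) pairs, sorted by timestamp.
    Return the offset of the first record at or after target_ts, or None."""
    timestamps = [ts for _, ts in records]
    i = bisect.bisect_left(timestamps, target_ts)
    return records[i][0] if i < len(records) else None


log = [(0, 1000), (1, 1005), (2, 1010), (3, 1020)]
print(offset_for_time(log, 1006))  # 2 -- first record at or after ts 1006
print(offset_for_time(log, 9999))  # None -- no record that recent
```

A consumer would then seek to the returned offset and resume reading from there.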

Retention of Messages

Messages in Kafka are called records. Records published to the cluster stay in the cluster until a configurable retention period has passed. Alternatively, retention can be bounded by size: records stay until the maximum allocated storage is used up.

Because we can run multiple consumers on the same topic with different group ids, we can consume the same messages multiple times for different purposes.

Parallelism

The maximum parallelism at which an application may run is bounded by the maximum number of stream tasks, which itself is determined by the maximum number of partitions of the input topic(s) the application is reading from.

Partitions are the main concurrency mechanism: a topic is divided into 1 or more partitions, enabling consumer and producer loads to be scaled. Specifically, a consumer group supports multiple consumers, as many consumers as there are partitions for a topic.

For example, if your input topic has 5 partitions, then you can run up to 5 application instances. If you run a larger number of app instances than partitions of the input topic, the “excess” app instances will launch but remain idle; however, if one of the busy instances goes down, one of the idle instances will resume the former’s work.

Other Properties

  • Consumers and Producers are loosely coupled (isolated from each other)

Disadvantage

Unlike Celery, Kafka has no built-in ability to prioritize messages or consume messages from the queue based on some criterion. Within each partition, messages are consumed strictly in the order they were written (by offset).

Reference for Further Reading
