Kafka
Kafka, what is it?
Originally developed by LinkedIn for internal use; donated to the Apache Software Foundation in 2011 and open-sourced
Initially developed as a message queue, later used as a distributed streaming platform
Apache Kafka is a publish-subscribe based durable messaging system. A messaging system sends messages between processes, applications, and servers.
Kafka acts as the central nervous system that makes streaming data available to applications. It builds real-time data pipelines responsible for data processing and transferring between different systems that need to use it.
The key design principles of Kafka were formed based on the growing need for high-throughput architectures that are easily scalable and provide the ability to store, process, and reprocess streaming data.
Streaming: Given a sequence of data (a stream), a series of operations (kernel functions) is applied to each element in the stream. Stream processing is a form of dataflow programming
Setup and Basic Working
For sample code and all the Kafka-related functionality with Python, refer to the link below. It covers the following functionalities (a minimal sketch of these operations appears after the list)
Installation (creating docker container using a docker-compose file)
Creating a topic
List all the topics
Delete a topic
Producing to a topic
Consuming from a topic
Sample example with a database of 4 students from a school, which includes all functionalities
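A minimal sketch of these operations, assuming the kafka-python client library and a broker reachable at localhost:9092 from the docker-compose setup; the "students" topic and the four records are illustrative stand-ins for the linked example:

```python
import json

from kafka import KafkaConsumer, KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

BOOTSTRAP = "localhost:9092"  # assumed broker address from the docker-compose setup

# Create a topic
admin = KafkaAdminClient(bootstrap_servers=BOOTSTRAP)
admin.create_topics([NewTopic(name="students", num_partitions=1, replication_factor=1)])

# List all topics
print(admin.list_topics())

# Produce to the topic: four illustrative student records
producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for name in ["Alice", "Bob", "Carol", "Dan"]:
    producer.send("students", {"name": name, "school": "Springfield High"})
producer.flush()

# Consume from the topic
consumer = KafkaConsumer(
    "students",
    bootstrap_servers=BOOTSTRAP,
    group_id="roster-app",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,  # stop iterating once no new records arrive
)
for record in consumer:
    print(record.offset, record.value)

# Delete the topic
admin.delete_topics(["students"])
```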
What keywords to know?
Four main parts in a Kafka system
Broker: A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka. A broker handles all requests from clients (produce, consume, and metadata) and keeps data replicated within the cluster.
Brokers act as a channel between producers and consumers
Zookeeper: Keeps the state of the cluster (brokers, topics, users).
It is a high-performance, open-source, centralized coordination service for distributed applications, adopted by Kafka. It lets Kafka manage its resources properly.
Checks the health of each partition of every topic, and maintains leader information
Keeps metadata (brokers, topics, partitions, leader information, access control lists (who is allowed to write and who is not), dynamic configuration, quotas (client throttling))
Producer: Producers are processes that push records into Kafka topics on the broker
Consumer: A consumer pulls records off a Kafka topic. It consumes records for further processing
Other components
Topic:
All Kafka records (messages) are organized into topics.
Topics can be thought of as categories.
Producer applications write data to topics and consumer applications read from topics.
Group id:
As noted, we can consume messages from the same topic more than once; the group id plays an important role in consuming the same messages for different purposes (multiple streams), as the sketch below shows
It is a consumer property; a consumer consumes records from a topic under a particular group id
Group id together with the offset for a topic is an important combination for overall Kafka processing
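A sketch of this with kafka-python (group ids are hypothetical): because offsets are tracked per group id, each group independently receives every record on the topic.

```python
from kafka import KafkaConsumer

# Two consumers with *different* group ids on the same topic: offsets are
# tracked per (group id, topic, partition), so each group sees every record.
analytics = KafkaConsumer(
    "students",
    bootstrap_servers="localhost:9092",
    group_id="analytics",  # hypothetical group for reporting
    auto_offset_reset="earliest",
)
alerting = KafkaConsumer(
    "students",
    bootstrap_servers="localhost:9092",
    group_id="alerting",   # hypothetical group for notifications
    auto_offset_reset="earliest",
)
# Both consumers will independently receive all records in "students".
```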
Offset:
An offset is a consumer's position within a partition. When a group has no committed offset yet, the auto.offset.reset policy decides where consumption starts; the two values in common use are "earliest" (replay from the beginning of the partition) and "latest" (only new records)
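A minimal sketch of this setting with kafka-python (group id illustrative):

```python
from kafka import KafkaConsumer

# auto_offset_reset only applies when the group has no committed offset yet:
# "earliest" replays the partition from the beginning, "latest" waits for new records.
consumer = KafkaConsumer(
    "students",
    bootstrap_servers="localhost:9092",
    group_id="replay-demo",        # hypothetical group id
    auto_offset_reset="earliest",  # or "latest"
)
```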
Schema:
In Kafka, records are commonly serialized in the Avro format
The schema is defined in an .avsc (Avro schema) file, which is plain JSON, whether the client is written in Java or Python
The database schema is its structure described in a formal language supported by the database management system. The term "schema" refers to the organization of data as a blueprint of how the database is constructed
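For the four-student example, the .avsc schema might look like the dictionary below (fields are illustrative); the sketch serializes and deserializes one record with the fastavro library, one common choice for Avro in Python:

```python
import io

from fastavro import parse_schema, reader, writer

# Illustrative Avro schema, as it would appear in a student.avsc file (JSON)
schema = parse_schema({
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "grade", "type": "int"},
        {"name": "school", "type": "string"},
    ],
})

# Serialize a record to bytes (what a producer would send as the message value)
buf = io.BytesIO()
writer(buf, schema, [{"name": "Alice", "grade": 10, "school": "Springfield High"}])

# Deserialize it back (what a consumer would do with the received bytes)
buf.seek(0)
print(list(reader(buf)))
```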
Partitions:
Partitions are the main concurrency mechanism in Kafka
A topic is divided into 1 or more partitions, enabling producer and consumer loads to be scaled
A consumer group supports as many consumers as partitions for a topic
The consumers are shared evenly across the partitions, allowing for the consumer load to be linearly scaled by increasing both consumers and partitions
You can have fewer consumers than partitions (in which case consumers get messages from multiple partitions)
If you have more consumers than partitions, some of the consumers will be “starved” and not receive any messages until the number of consumers drops to (or below) the number of partitions
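A sketch of scaling with partitions, again with kafka-python (topic name and counts illustrative):

```python
from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient, NewTopic

# A topic with 3 partitions supports up to 3 active consumers per group
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([NewTopic(name="students", num_partitions=3, replication_factor=1)])

# Start this consumer in up to 3 processes with the same group id; Kafka
# spreads the partitions across them, and a 4th instance would sit idle.
consumer = KafkaConsumer(
    "students",
    bootstrap_servers="localhost:9092",
    group_id="roster-app",
)
consumer.poll(timeout_ms=1000)  # join the group and trigger partition assignment
print(consumer.assignment())    # the partitions this instance was given
```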
Architecture
Kafka's architecture uses four core APIs:
Producer API: The Producer API in Kafka allows an application to publish a stream of records to one or more Kafka topics.
Consumer API: An application can subscribe to one or more Kafka topics using the Kafka Consumer API. It also enables the application to process the streams of records published to those topics.
Streams API: The Kafka Streams API allows an application to process data in Kafka using a stream processing architecture. An application can use this API to take input streams from one or more topics, process them with stream operations, and generate output streams to transmit to one or more topics, converting input streams into output streams.
Connect API: The Kafka Connect API connects Kafka topics to external systems. It opens up possibilities for constructing and managing the operation of producers and consumers, and for establishing reusable links between these systems. A connector may, for example, capture all database updates and ensure that they are made available in a Kafka topic.
Properties:
Time-stamped Messages
Messages are timestamped. If we want to restart message consumption for some reason, we can use the timestamp to resume consuming messages from a particular date; the timestamp is translated into an offset, which controls the consumer's position.
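kafka-python exposes this lookup as offsets_for_times; a sketch that resumes consumption from a given date (topic, partition, and date are illustrative):

```python
from datetime import datetime, timezone

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
tp = TopicPartition("students", 0)
consumer.assign([tp])

# Translate a wall-clock date into an offset, then seek to it
ts_ms = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: ts_ms})
if offsets[tp] is not None:  # None if no record exists at or after that time
    consumer.seek(tp, offsets[tp].offset)
```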
Retention of Messages
Messages in Kafka are called records. Records published to the cluster stay in the cluster until a configurable retention period has passed, or, alternatively, until a configurable size limit per partition is reached (see the sketch at the end of this subsection).
We can run multiple consumers on the same topic with different group ids, so the same messages can be consumed multiple times for different purposes
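The two retention limits above are the per-topic configs retention.ms and retention.bytes; a sketch of setting them with kafka-python's admin client (values illustrative):

```python
from kafka.admin import ConfigResource, ConfigResourceType, KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Keep records for 7 days, or until a partition reaches ~1 GiB, whichever comes first
admin.alter_configs([
    ConfigResource(
        ConfigResourceType.TOPIC,
        "students",
        configs={
            "retention.ms": str(7 * 24 * 60 * 60 * 1000),
            "retention.bytes": str(1024 ** 3),
        },
    )
])
```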
Parallelism
The maximum parallelism at which an application may run is bounded by the maximum number of stream tasks, which is itself determined by the maximum number of partitions of the input topic(s) the application is reading from.
Partitions are the main concurrency mechanism: a topic is divided into 1 or more partitions, enabling consumer and producer loads to be scaled. Specifically, a consumer group supports multiple consumers, as many as there are partitions for a topic
For example, if your input topic has 5 partitions, then you can run up to 5 application instances. If you run more app instances than there are partitions of the input topic, the “excess” app instances will launch but remain idle; however, if one of the busy instances goes down, one of the idle instances will resume its work
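A sketch of this scaling behavior with kafka-python, launching several consumer processes under one group id (all names and counts illustrative):

```python
from multiprocessing import Process

from kafka import KafkaConsumer

def run_instance(instance_id: int) -> None:
    # All instances share one group id, so Kafka divides the topic's 5
    # partitions among them; instances beyond 5 would receive no assignment.
    consumer = KafkaConsumer(
        "students",
        bootstrap_servers="localhost:9092",
        group_id="student-app",  # hypothetical application group
    )
    for record in consumer:
        print(f"instance {instance_id} got offset {record.offset}")

if __name__ == "__main__":
    for i in range(5):  # up to one instance per partition
        Process(target=run_instance, args=(i,)).start()
```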
Other Properties
Consumers and Producers are loosely coupled (isolated from each other)
Disadvantage
Unlike Celery, Kafka gives us no built-in ability to prioritize message queues or to consume messages from a queue based on some criterion. Within each partition, messages are consumed strictly in the order they were appended (i.e., by offset)
Reference for Further Reading
Top interview questions, which give a complete understanding of Kafka: