Apache Kafka: A Metaphorical Introduction to Event Streaming for Data Scientists and Data Engineers

Kaushikrch | Last Updated: 08 Dec, 2020

Overview

  • Learn about viewing data as streams of immutable events in contrast to mutable containers
  • Understand how Apache Kafka captures real-time data through event streaming

 

Introduction

Back in 2010, the professional networking site LinkedIn faced inconsistency and latency issues with its existing request-response system. To overcome these, a team of software engineers at the company built a messaging system optimized for handling a continuous flow of data. In late 2010, LinkedIn open-sourced the project.

In July 2011, the Apache Software Foundation accepted it as an incubator project, giving birth to Apache Kafka, which went on to become one of the most widely used event-streaming platforms in the world.


Kafka made it possible to capture real-time data (we will get into how later on). Alongside Kafka, LinkedIn also created Samza to process data streams in real time; Apache added Samza to its project repository in 2013.

I always wondered what thoughts the creators of Kafka had in mind when naming the tool. Jay Kreps, one of the co-creators of Kafka (along with Neha Narkhede and Jun Rao), said:

“I thought that since Kafka was a system optimized for writing, using a writer’s name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.

So basically there is not much of a relationship.” (Narkhede, Shapira, & Palino, 2017, p. 16)

The way Kafka treats the concept of data is entirely different from how we have traditionally thought of data. Though Kreps may be right in saying not to read too much into the name of the tool, I find a lot of similarities between the philosophical underpinnings of the celebrated 20th-century literary figure Franz Kafka's works and the way Apache Kafka treats data.

In this article, I will draw metaphors from Kafka's stories and use them to understand how Apache Kafka works.

 

“The Metamorphosis” (of the data architecture)

In Franz Kafka's famous story The Metamorphosis, the protagonist Gregor Samsa wakes up in the morning to find himself transformed into a giant insect. There is a physical alteration to Gregor Samsa's body. The Metamorphosis deals with the protagonist's transformation and with events that convey the absurdity of life, which remains one of the central ideas in most of Kafka's works.

You must be wondering how this story relates to data architecture in any way. Let me explain.

 

Data as a stream of events

Traditionally, we have thought of data as a collection of values describing objects. Data tells us the current state of an entity, which may be qualitative or quantitative: for example, some characteristics of a customer, a product, or an economy.

Until recently, data mostly lived in a monolithic framework called the database. If the attributes of an entity change, we update the database to reflect the changes.

Thus, traditional relational databases are mutable: they are subject to change, and once changed they remain in the new state until the next update occurs. You can think of updating a database as changing data attributes in place or appending new records.

However, in the world of Apache Kafka, data is not objectified but treated as a stream of events. These streams of events are recorded in append-only log files, which Kafka organizes into topics. For starters, a log is a file that records events sequentially, in the order they occurred.
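As a minimal sketch of what this looks like in practice, the snippet below appends purchase events to a Kafka topic using the kafka-python package; the broker address (localhost:9092), the topic name, and the event payload are all illustrative assumptions.

```python
# A minimal sketch of appending immutable events to a Kafka topic,
# assuming a broker at localhost:9092 and the kafka-python package.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each send() appends one event to the end of the "purchases" log.
producer.send("purchases", {"customer_id": 123, "product_id": 999, "quantity": 1})
producer.flush()  # block until the event is durably written
```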

 

Mutable states and immutable events

For example, a customer identified by customer_id 123 buys one unit of a product with product_id 999. In a traditional relational database, the quantity 1 is recorded in a row matching the customer ID and product ID, reflecting the current state of the database. Suppose the customer changes her mind and buys three units of the same product instead. We must mutate the database for this transactional change by updating the quantity from 1 to 3.

But if we treat data as a stream of events, the log reflects the change of mind as a fact: an immutable event occurring at a specific time. The log appends this new event after the earlier event recording the quantity of 1.

[Figure: a mutable database row versus an immutable stream of events. Source: Kleppmann, M. (2015)]
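The same contrast can be sketched in plain Python, with a dictionary standing in for a mutable database row and a list standing in for the append-only log (the field names and timestamps are purely illustrative).

```python
from datetime import datetime, timezone

# Mutable state: the update overwrites the old value; history is lost.
row = {"customer_id": 123, "product_id": 999, "quantity": 1}
row["quantity"] = 3  # no trace remains that the quantity was ever 1

# Immutable events: each change of mind is appended as a new fact.
log = [
    {"customer_id": 123, "product_id": 999, "quantity": 1,
     "at": datetime.now(timezone.utc)},
    {"customer_id": 123, "product_id": 999, "quantity": 3,
     "at": datetime.now(timezone.utc)},
]

# The current state can always be derived by replaying the log.
current_quantity = log[-1]["quantity"]  # 3
```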

 

Statefulness and Fault Tolerance of Apache Kafka

Returning to the plight of Gregor Samsa, an event at a specific time led to his physical transformation. But his mind remains intact as he worries about being late for work. The metamorphosis of his physical reality is an event, a fact at a particular point in time. But he is also stateful, in the sense that his mind still thinks like a human's.

Just as Gregor accepts his physical transformation without surprise or shock, his mind adapts to his physical needs. The same is true for event streams in Apache Kafka: though it keeps a memory of all the events that take place over a period, you have the option of removing logs older than a configured retention period.

Being stateful in this way ensures that Apache Kafka is fault-tolerant and resilient; if any problem occurs at a future date, you always have the option of replaying the log to return to a previous working state.
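As one concrete example of such a retention policy, a topic can be created with a bounded retention period. The sketch below assumes kafka-python's admin client and a broker at localhost:9092; the topic name and the seven-day limit are arbitrary choices for illustration.

```python
# A sketch of creating a topic whose events expire after seven days,
# assuming a broker at localhost:9092 and the kafka-python package.
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="purchases",        # hypothetical topic name
        num_partitions=3,
        replication_factor=1,
        # Events older than 7 days become eligible for deletion.
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
```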

 

Agile response for real-time data

The advent of a multitude of data sources has transformed the process of decision-making by governments, businesses, and individual agents.

Earlier, data was a source of validation for the decision-making process; that is, strategic decisions were instinctive and experiential, and only then validated by data. Nowadays, data-driven decision making is part of the process itself. Thus, we record data as events taking place within an enterprise. The enterprise analyzes these events, manipulates them, and uses them to generate more data as output.

“Every byte of data has a story to tell, something of importance that will inform the next thing to be done. In order to know what that is, we need to get the data from where it is created to where it can be analyzed.” (Narkhede, Shapira, & Palino, 2017, p. 1)

In this fast-changing, app-oriented world, agility and responsiveness in data transfers are critical. Moving data quickly within internal systems and responding to requests from external servers become imperative. This is where Apache Kafka's pub-sub (publish-subscribe) messaging comes in handy.

 

Publish-Subscribe System of Messaging in Apache Kafka

Consider an example where an e-retailer wants to build an application for the checkout process: a webpage or app where buyers can pay for the products they want to order. When a customer shops online and checks out through the webpage or app, the purchase message should reach the shipment team's system so the product can be shipped to the customer. By itself, this is not a complicated process.

Now, think of a scenario where this event should also trigger a series of other events, such as sending an automated email receipt to the customer, updating the inventory database, and so on. As front-end and back-end services get added and the list of responses to a purchase checkout grows, more integrations need to be built, and this can get messy. The integrations create inter-team dependencies for any changes to the applications, which slows development down.

[Figure: the publish-subscribe process for a purchase checkout. Source: Narkhede, Shapira, & Palino (2017), p. 2]

Apache Kafka decouples these system dependencies, making the hard-wired integrations go away. Instead of transferring events directly to the different servers, the checkout webpage or app broadcasts them. This is what we mean by publishing. The other services, such as inventory, shipment, or email, then subscribe to that stream, and the events trigger them to act accordingly.

The publish-subscribe model uses a messaging service: the publisher streams event messages, and each subscriber chooses to subscribe to the messages it requires.

[Figure: the publish-subscribe model. Retrieved from https://www.confluent.io/what-is-apache-kafka/]
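To make the subscribing side concrete, here is a sketch of one such service. It assumes the checkout page publishes events to a hypothetical "checkout" topic on a broker at localhost:9092 and uses the kafka-python package; the shipment logic is a stand-in print statement.

```python
# A sketch of the shipment service subscribing to the checkout stream,
# assuming a "checkout" topic and the kafka-python package.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "checkout",
    bootstrap_servers="localhost:9092",
    group_id="shipment-service",   # each service subscribes under its own group
    auto_offset_reset="earliest",  # also read events published before startup
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# The inventory and email services would run identical loops under their
# own group IDs, each receiving the same events independently.
for event in consumer:
    print("shipping order:", event.value)  # stand-in for the real handler
```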

 

Producers and Consumers

In Apache Kafka, you can store the streams of events generated by producers durably and reliably for a specified retention period.

It also allows consumers to read and process these streams. Producers and consumers are completely decoupled: producers don't wait for consumers to consume the events, and consumers can start reading from any point in the stream and decide which messages to consume. The logs themselves, however, are always appended to topics sequentially. This property removes the complexity of maintaining complicated routing rules, which makes Kafka fast, efficient, and scalable, all in a distributed manner.
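Because consumption is decoupled from production, a consumer can even rewind and replay the log from any offset. The sketch below assumes the hypothetical "purchases" topic from earlier, a broker at localhost:9092, and the kafka-python package.

```python
# A sketch of replaying a topic from the beginning with kafka-python.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partition = TopicPartition("purchases", 0)  # hypothetical topic, partition 0
consumer.assign([partition])
consumer.seek(partition, 0)  # rewind to the first event ever recorded

# Fetch one batch of records; each record keeps its position in the log.
for batch in consumer.poll(timeout_ms=1000).values():
    for record in batch:
        print(record.offset, record.value)
```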

This reminds me of how Gregor Samsa's mind and body are decoupled. Samsa thinks of going to work or standing upright while inhabiting an insect's body. To make space for his physical existence, Samsa's family members move the furniture out of his room, and Samsa realizes his possessions are being taken away from him.

Through these details, the story suggests that our physical lives shape and direct our mental lives, not the other way around. Analogously, the need for real-time capture of data makes us think of data not as physical realities but as streams of events.

 

End Notes

In this article, we looked at an evolved way of understanding data: as streams of events. This is very different from how data has traditionally been viewed. We also saw why a decoupled architecture is needed for these data events to be published and subscribed to, making data generation, consumption, and update much faster and more efficient. This is what the Apache Kafka framework achieves.

For more technical details on Apache Kafka, feel free to refer to the data engineering resources listed in the references below.

 

References

Narkhede, N., Shapira, G., & Palino, T. (2017). Kafka: The Definitive Guide. Sebastopol, CA: O’Reilly.

IBM Cloud Education (2020). Apache Kafka. https://www.ibm.com/cloud/learn/apache-kafka

Kleppmann, M. (2015). Turning the database inside-out with Apache Samza. https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/

Apache Kafka documentation. Introduction. https://kafka.apache.org/intro
