LinkedIn’s Kafka infrastructure has been reported to handle around 7 trillion messages a day, which works out to tens of millions every second. What makes this possible isn’t just Kafka’s components, but how they’re architected to work in harmony at planetary scale.
In my previous post, I introduced Apache Kafka and explained how it plays a key role in real-time data streaming. As I continue learning more about Kafka, I wanted to dig a bit deeper into how it actually works under the hood—without getting lost in overly complex details.
In this post, I’ll break down Kafka’s architecture in a way that’s easy to understand, especially if you’re just getting started like I am.
“You don’t truly know Kafka until you’ve seen how brokers, partitions, and consumers interact under load—or more importantly, how they fail.”
1. Core Components: The 90-Second Refresher
(For When You “Know Kafka” But Need the Architecture Lens)
Kafka’s power lies in just a handful of moving parts. If you’ve already read about Kafka or played around with it a bit, this is your quick refresher—but this time with a focus on architecture and how things work under the hood.
🔌 Brokers: The Stateless Middlemen
Kafka brokers are the backbone of a Kafka cluster. Think of a broker as a node (a server) that receives, stores, and serves data, but doesn’t manage delivery state for consumers.
- This “stateless” design is deliberate. It means brokers don’t carry baggage like “who has consumed which message.”
- Instead, each consumer tracks its own position (its offset, which it can commit back to Kafka for safekeeping), while the broker simply appends messages to an append-only log.
- Want more throughput? Add more brokers. Kafka’s design allows near-linear scaling.
⚡ Pro Insight: The stateless nature of brokers allows Kafka to scale effortlessly compared to systems that track consumer state (like RabbitMQ).
📚 Topics & Partitions: The Foundation of Kafka’s Data Model
A topic is where messages are sent—like a category or channel for your data.
A partition is a chunk of that topic that helps Kafka scale and maintain order.
Analogy:
Imagine your topic is a book about “User Events.” Each partition is a chapter.
- Events are written in order inside each chapter (partition).
- Messages are never modified once written, and they’re only deleted when the topic’s retention policy expires (more on retention later).
Why partitions matter:
- They define Kafka’s parallelism.
- A topic with 6 partitions can be read by up to 6 consumers simultaneously (within a group).
- More partitions = more throughput, but also more overhead.
💡 Tradeoff: 10 partitions mean 10 consumers can process in parallel, but each partition adds metadata, open files, and complexity.
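To make this concrete, here’s a minimal sketch of creating a topic with a chosen partition count. It assumes a broker at localhost:9092 and uses the kafka-python client; the topic name and the counts are just illustrative.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Connect to the cluster (assumes a broker listening on localhost:9092).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A hypothetical "user-events" topic: 6 partitions for parallelism,
# replication factor 3 for fault tolerance (needs at least 3 brokers;
# drop it to 1 on a single-node dev setup).
admin.create_topics([NewTopic(name="user-events", num_partitions=6, replication_factor=3)])
admin.close()
```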
✍️ Producers: The Writers of the Log
A producer is any application that sends (writes) data to a Kafka topic.
- It decides which partition to write to (often via a key-based hashing function).
- Producers don’t wait for consumers—Kafka stores the data, and it can be read later.
- You can configure how “safe” a write should be using the `acks` setting (we’ll dive into this in the reliability section).
Example: A web app logging user clicks sends each event as a Kafka message to a “clickstream” topic.
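As a rough sketch of what that clickstream producer could look like (using the kafka-python client, a hypothetical “clickstream” topic, and JSON values as an arbitrary choice):

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,                          # user ID becomes the message key
    value_serializer=lambda v: json.dumps(v).encode(),  # event dict -> JSON bytes
)

# Keying by user ID means all of one user's clicks hash to the same
# partition, so that user's events stay in order.
event = {"user": "u-42", "page": "/pricing", "action": "click"}
producer.send("clickstream", key=event["user"], value=event)

producer.flush()  # block until the broker has acknowledged the send
```

Note that the key is what drives partitioning here: Kafka hashes it to pick the partition, which is how per-user ordering is preserved.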
📖 Consumers: The Readers with Memory
Consumers are the apps or services that read data from Kafka.
- They track their own offsets, which represent where they are in the partition.
- Kafka doesn’t delete messages just because they’re read—retention is time-based or size-based, not consumption-based.
🧠 This is different from traditional message queues like RabbitMQ, where once a message is read, it’s gone.
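And a matching consumer sketch, assuming the same setup (the group name and manual offset commits are just one reasonable configuration):

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="click-analytics",       # consumers sharing this ID split the partitions
    auto_offset_reset="earliest",     # if no saved offset, start from the oldest message
    enable_auto_commit=False,         # commit offsets ourselves, after processing
    value_deserializer=lambda b: json.loads(b.decode()),
)

for record in consumer:
    print(f"partition={record.partition} offset={record.offset} value={record.value}")
    consumer.commit()  # save our position; the message itself stays in Kafka
```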
🧭 ZooKeeper (Retiring) vs. KRaft (Modern Kafka)
Kafka has traditionally used ZooKeeper for managing metadata like broker info, cluster health, and topic configs. But Kafka is evolving.
- ZooKeeper is being replaced by KRaft, Kafka’s new built-in consensus layer.
- With KRaft, Kafka doesn’t rely on an external system anymore—everything is self-contained.
🚀 KRaft reduces complexity, improves failover speed, and makes Kafka easier to deploy.
✅ Summary of Core Components
| Component | What It Does | Why It Matters |
|---|---|---|
| Broker | Stores and delivers messages | Stateless = scalable |
| Topic | Logical category for messages | Organizes your data |
| Partition | Subdivision of a topic | Enables parallelism |
| Producer | Writes data to Kafka | Decouples data ingestion |
| Consumer | Reads data from Kafka | Controls consumption |
| ZooKeeper / KRaft | Coordinates brokers | KRaft is the future |
2. Reliability: How Kafka Keeps Your Data Safe
Kafka is built to be reliable—meaning your data won’t disappear even if servers crash. But how does it do this? The answer lies in replication, a clever way of storing multiple copies of your data. Let’s break it down in simple terms.
Kafka’s reputation for reliability doesn’t come from complex algorithms or exotic hardware—it emerges from a surprisingly simple principle: redundancy done right. But here’s the paradox: Kafka’s replication model gives you either bulletproof consistency or high availability—rarely both at once. Understanding this tradeoff is what separates engineers who merely use Kafka from those who master it.
The Leader-Follower Dance
At the core of Kafka’s reliability is its leader-follower replication model. Every partition has one leader broker (handling all reads and writes) and one or more followers that continuously fetch and replicate its data. This seems straightforward—until a broker fails.
When the leader disappears, Kafka doesn’t just promote any follower. It consults the ISR (In-Sync Replicas) list—followers that are fully caught up with the leader. Only these replicas qualify for promotion, ensuring no data loss during failover.
2.1 Replication: The Backup System
Imagine you have an important document. To make sure you never lose it, you save copies on:
- Your laptop (Leader)
- A USB drive (Follower 1)
- Cloud storage (Follower 2)
This is how Kafka works:
- Every message is stored in multiple places (brokers).
- If one broker fails, another takes over automatically.
💡 Key Idea:
Kafka doesn’t just store one copy of your data—it keeps backups (called replicas) to prevent loss.
2.2 Leaders and Followers: Who Does What?
Each partition (a piece of a topic) has:
- 1 Leader (handles all reads and writes)
- N Followers (keep copies of the data)
How it works:
- When a producer sends a message, it goes to the leader.
- The followers replicate it from the leader (and with `acks=all`, the producer waits until they have).
- If the leader crashes, an in-sync follower becomes the new leader.
(Diagram: how leaders and followers work together)
2.3 What Happens When a Broker Fails?
Kafka is designed to handle failures smoothly. Here’s what happens if a broker (server) goes down:
- Detect the failure (Kafka notices a broker is offline).
- Promote a follower (the most up-to-date replica becomes leader).
- Continue operations (producers and consumers reconnect automatically).
⚠️ Important Setting:
- `unclean.leader.election.enable=false` (default) → only up-to-date (in-sync) followers can become leaders, so no data loss.
- `unclean.leader.election.enable=true` → any follower can become leader, at the risk of losing recent messages.
2.4 Durability vs. Speed: The `acks` Tradeoff
Kafka lets you choose how safe you want your messages to be:
| acks Setting | What It Means | When to Use |
|---|---|---|
| acks=0 | “Fire and forget” (no guarantee) | Low-value data (logs, metrics) |
| acks=1 | Leader alone confirms receipt (the default before Kafka 3.0) | Most use cases (good balance) |
| acks=all | Leader + all in-sync replicas confirm (the default since Kafka 3.0) | Critical data (banking, orders) |
Example:
- If you’re tracking website clicks, `acks=1` is fine.
- If you’re processing payments, use `acks=all`.
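Here’s how that choice might look in producer code, assuming kafka-python and made-up topic names:

```python
from kafka import KafkaProducer

# Click tracking: acks=1, the leader's acknowledgement is enough.
clicks = KafkaProducer(bootstrap_servers="localhost:9092", acks=1)
clicks.send("clickstream", b"page-view")

# Payments: acks="all", wait for every in-sync replica before declaring success.
payments = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
payments.send("payments", b"charge:42.00").get(timeout=10)  # .get() surfaces broker errors
```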
2.5 Kafka’s Secret Weapon: The Immutable Log
Unlike a traditional queue that removes messages once they’re consumed, Kafka keeps every message until its retention policy expires (and you can configure it to keep data indefinitely). This means:
- You can replay old messages (useful for debugging).
- Consumers can reset and read from the start (like rewinding time).
Retention Settings:
- `log.retention.hours=168` (default: keep data for 7 days)
- `log.retention.bytes` (optional size cap per partition, e.g. 1073741824 for 1 GB; unlimited by default)
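Because the log is immutable, “rewinding” is just a matter of moving your offset. A minimal replay sketch, assuming kafka-python and the same clickstream topic:

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="replay-debugger",
)

consumer.poll(timeout_ms=1000)   # join the group so partitions get assigned
consumer.seek_to_beginning()     # jump back to the oldest retained message in each partition

for record in consumer:
    print(record.offset, record.value)  # re-reads everything still within retention
```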
3. Scalability: How Kafka Handles Massive Traffic
Kafka’s real superpower isn’t just reliability—it’s the ability to scale seamlessly from a few messages per second to millions. The secret lies in its partitioned, distributed design, which lets it grow horizontally by simply adding more brokers. Let’s break down how this works in a way that’s easy to understand, whether you’re a student learning distributed systems or an engineer designing a data pipeline.
3.1 Partitions: The Secret to Parallel Processing
At the core of Kafka’s scalability are partitions—subdivisions of a topic that allow data to be distributed across multiple brokers. Think of a topic like a book, and partitions like its chapters. If multiple people want to read the book at the same time, they can each take a different chapter rather than waiting for one person to finish the whole thing.
- Each partition is an ordered, immutable sequence of messages.
- Producers write to different partitions in parallel, increasing throughput.
- Consumers read from partitions independently, allowing for high-speed processing.
This design means Kafka can handle near-linear scalability: double the partitions (and consumers), and you can nearly double your throughput.
Example:
A topic with 10 partitions can have up to 10 consumers processing data simultaneously. If you need more speed, you can increase partitions (though there are tradeoffs, which we’ll discuss soon).
🔄 More partitions = more parallelism = higher throughput
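One way to see this in action: records that share a key always hash to the same partition, while different keys spread across partitions. A small sketch, assuming kafka-python and a “user-events” topic that already has several partitions:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092", key_serializer=str.encode)

for user in ["alice", "bob", "carol", "alice"]:
    # .get() waits for the broker and returns metadata about where the record landed.
    meta = producer.send("user-events", key=user, value=b"login").get(timeout=10)
    print(f"key={user} -> partition {meta.partition}, offset {meta.offset}")

# "alice" appears twice and lands in the same partition both times,
# which is what keeps a single user's events in order.
```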
3.2 Consumer Groups: Teamwork Makes the Dream Work
While partitions help with writing data at scale, consumer groups handle reading efficiently. A consumer group is simply a set of consumers working together to process messages from a topic.
Here’s how it works:
- Each partition is assigned to only one consumer in a group.
- If you add more consumers than partitions, the extra consumers sit idle.
- If a consumer fails, its partitions are reassigned to others in the group automatically.
A consumer group is like a team of workers splitting a job:
- Each worker (consumer) handles specific partitions
- No two workers read the same partition (no duplicate processing)
Key Rule:
❗ Number of consumers ≤ Number of partitions
Real-World Case:
Companies like Netflix use consumer groups to process streaming events in real time. If they need more processing power, they simply add more consumers—no code changes required.
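In practice, “adding more consumers” usually just means launching more copies of the same worker with the same group_id. A sketch, assuming kafka-python and a hypothetical “orders” topic with an “order-processors” group:

```python
from kafka import KafkaConsumer

# Run several copies of this script (on one machine or many). Because they
# share a group_id, Kafka splits the topic's partitions among them and
# reassigns automatically if one copy dies.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
)

for record in consumer:
    print(f"handling order from partition {record.partition}: {record.value}")
```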
3.3 The Dark Side of Too Many Partitions
While partitions enable scalability, they aren’t free. Each partition adds:
- Memory and metadata overhead (the controller, via ZooKeeper or KRaft, must track every one).
- File handles (each partition keeps open log files on disk).
- Rebalance delays (adding partitions triggers consumer reassignments).
- Longer recovery time after failures (more leader elections to complete).
Best Practices:
- Start with 3-6 partitions per topic (adjust based on need).
- Monitor consumer lag (if lag grows, consider adding partitions).
- Avoid over-partitioning (too many can slow down your cluster).
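One quick way to eyeball consumer lag from code is to compare each partition’s end offset with the consumer’s current position. A rough sketch, assuming kafka-python (dedicated tools like kafka-consumer-groups.sh or Burrow do this more thoroughly):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="click-analytics",
)

consumer.poll(timeout_ms=1000)                   # join the group, get partitions assigned
partitions = consumer.assignment()
end_offsets = consumer.end_offsets(partitions)   # latest offset in each partition

for tp in partitions:
    lag = end_offsets[tp] - consumer.position(tp)
    print(f"{tp.topic}[{tp.partition}] lag={lag}")
```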
3.4 Adding Brokers: Scaling the Easy Way
The final piece of Kafka’s scalability is brokers—the servers that store and manage data. Adding brokers allows Kafka to:
- Distribute partitions (improving storage and throughput).
- Increase fault tolerance (more replicas mean better durability).
- Handle more concurrent connections (useful for high-traffic systems).
The best part? When brokers are added or removed, partitions can be redistributed across the cluster with no downtime (using Kafka’s partition reassignment tooling, or automatically with tools like Cruise Control), so scaling up or down requires minimal manual effort and throughput grows close to linearly.
Cloud Tip:
AWS MSK and Confluent Cloud let you scale brokers with a slider—no manual work needed.
Summary: Kafka’s Scaling Superpowers
✅ Partitions → Enable parallel processing
✅ Consumer Groups → Distribute work across machines
✅ Dynamic Broker Scaling → Grow with your traffic
4. Bonus: What Is a Consumer Group?
When you first hear the term “Consumer Group”, it might sound like some kind of admin panel or fancy feature. But it’s actually one of Kafka’s simplest—and most powerful—concepts.
A consumer group is just a collection of Kafka consumers that work together to read data from a topic, while ensuring each message is processed only once by the group.
Why Consumer Groups Matter?
Imagine a food delivery app like Uber Eats. Every time someone places an order (message), it needs to be handled by a driver (consumer). Now imagine having hundreds of drivers organized into teams, each assigned to handle a subset of neighborhoods (partitions). That’s how consumer groups work.
Without consumer groups:
- Every consumer gets every message (like shouting into a crowd).
- This is fine for logging or monitoring, but not for processing data in parallel.
With consumer groups:
- Kafka divides the work.
- Each partition of a topic is assigned to only one consumer within the group.
- Messages are evenly distributed, and not duplicated across consumers.
How It Works?
Let’s say you have:
- A topic with 4 partitions
- A consumer group with 2 consumers
Kafka will automatically:
- Assign 2 partitions to each consumer
- Track who’s reading what (via offsets)
- Reassign partitions if one consumer crashes (fault tolerance ✅)
Add a third consumer? Kafka rebalances and redistributes the partitions automatically.
Remove one? The others take over. No manual work needed.
Offset Tracking
Each consumer in a group tracks its offset—the last message it successfully read. This means consumers can:
- Pause and resume without reprocessing everything
- Restart from where they left off
- Even rewind to older messages (for retries or audits)
⚡ Pro Tip:
Offset storage is handled inside Kafka itself (in the internal __consumer_offsets topic), making state tracking reliable and centralized.
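To illustrate the pause/resume/rewind idea, here’s a sketch that reads the committed offset and rewinds about 100 messages for a re-run. It assumes kafka-python, and it assigns a single partition directly rather than joining a group rebalance; the topic, group, and rewind amount are arbitrary:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="audit-replayer",
    enable_auto_commit=False,
)

tp = TopicPartition("orders", 0)
consumer.assign([tp])            # take partition 0 directly, bypassing group rebalancing

committed = consumer.committed(tp) or 0
consumer.seek(tp, max(committed - 100, 0))   # rewind ~100 messages for an audit pass

for record in consumer:
    print(record.offset, record.value)
```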
✅ When to Use Consumer Groups
Use consumer groups when:
- You need to scale message processing across multiple machines
- You want each message handled by exactly one consumer in the group (no duplicated work)
- You need automatic failover and load balancing
5. Recap + Best Practices
As we’ve explored throughout this post, Apache Kafka is a powerful, fault-tolerant distributed system designed for high throughput and reliability in real-time data streaming. From its core components (brokers, producers, consumers, and topics) to its advanced features like partitioning, replication, and consumer groups, Kafka is built to scale and handle enormous amounts of data with ease.
Let’s wrap everything up with a high-level summary and best practices for those working with Kafka in production.
5.1 🔑 Kafka Core Takeaways
- Producers & Consumers: Kafka’s producers write data to topics, while consumers read from them. This decouples the data producers from consumers, making Kafka highly flexible and scalable.
- Partitioning: Topics are split into partitions—each partition can be distributed across different brokers, allowing Kafka to scale horizontally and process large volumes of data in parallel.
- Replication: Kafka’s fault tolerance comes from replicating each partition across multiple brokers. If a broker fails, another one can take over with minimal downtime.
- Consumer Groups: Kafka allows multiple consumers to process data in parallel, with each partition being read by only one consumer in a consumer group.
- Immutability & Durability: Kafka stores messages immutably and guarantees data durability, allowing consumers to reprocess or replay events.
5.2 Kafka Best Practices
While Kafka is robust, it’s important to follow best practices to ensure optimal performance and reliable operations in production:
1. Partitioning: Find the Sweet Spot
- Too few partitions: Limits scalability and parallelism. A single consumer could end up handling too much data.
- Too many partitions: Increases metadata overhead and can slow down your system.
Best Practice: Start with 3–6 partitions per topic and adjust based on your system’s needs. Always monitor consumer lag and partition throughput.
2. Replication: Ensure Durability
- Set acks=all to ensure all in-sync replicas of a partition confirm receipt of data (pair it with min.insync.replicas so the guarantee is meaningful). This protects data integrity but can introduce some latency.
Best Practice: Use acks=1 for normal workloads, but for critical data (e.g., financial transactions), use acks=all to prevent data loss.
3. Monitor and Tune Consumer Groups
- Consumer lag: Always monitor how far behind your consumers are. High lag indicates bottlenecks.
- Scaling consumers: If you have more consumers than partitions, excess consumers will remain idle.
Best Practice: Match the number of consumers to the number of partitions, and add more consumers only when you need to scale horizontally.
4. Manage Broker Failures
- ZooKeeper vs. KRaft: If you’re still using ZooKeeper, consider migrating to KRaft (Kafka Raft mode) to improve scalability, simplify management, and reduce failover time.
Best Practice: Keep unclean.leader.election.enable=false (the default) to avoid promoting out-of-sync followers during a broker failure. This prevents data loss during failover.
5. Configuration & Tuning
- Kafka has many configuration settings (e.g., log.retention.hours, batch.size, linger.ms). Proper tuning ensures efficient data flow and minimizes performance bottlenecks.
Best Practice: Test different configurations based on your use case and workloads, and monitor regularly to catch performance issues early.
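As an example of the kinds of knobs involved, here’s a sketch of producer-side batching settings in kafka-python; the numbers are placeholders to tune against your own workload:

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",              # durability first; relax to 1 if latency matters more
    linger_ms=20,            # wait up to 20 ms to fill a batch before sending
    batch_size=64 * 1024,    # allow up to 64 KB per partition batch
    compression_type="gzip", # trade a little CPU for less network and disk I/O
)
```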
6. Conclusion & Final Thoughts
As someone still learning Kafka, understanding its architecture helped me appreciate why it’s such a popular tool in modern data systems. I wrote this post to not only reinforce what I’ve learned, but to hopefully help others who are also starting their Kafka journey.
If you’re a beginner like me, don’t worry about mastering every detail right away. Focus on the big picture—how data flows, what the main components do, and why Kafka is built the way it is.
And that concludes our deep dive into Kafka architecture! Hopefully, this guide helps you understand Kafka’s core concepts and prepares you to use it effectively in your data engineering and software development projects.
Let me know what you think, and feel free to reach out if you’re also learning Kafka or working with it!