Kafka is easier to understand when you can break it

Apache Kafka is often explained as a set of concepts: topics, partitions, brokers, producers, consumers, replicas, offsets, leaders, followers, and consumer groups.

That works well at the beginning.

Then the first failure enters the conversation.

A broker goes down. A follower starts lagging. The ISR shrinks. A producer uses acks=all. A consumer keeps reading, but only up to the high watermark. A controller election happens. A region becomes unavailable. Suddenly, the system is no longer a static diagram. It is a timeline of decisions.

And this is where Kafka becomes difficult to teach.

Not because the individual concepts are impossible, but because the interesting behavior only appears when they interact.

That is why we built the Kafka Simulator: a browser-based, deterministic model of Kafka you can break safely and replay step by step. It models Apache Kafka 4.3 semantics, needs no backend, and sends no telemetry about your scenarios.

Kafka failures are hard to explain on a whiteboard

Some Kafka questions are easy to ask and surprisingly hard to answer without visualization.

What happens when replication factor is 3, min.insync.replicas is 2, and one broker dies?
What changes when the second broker dies?
Why can a producer still write after the first failure, but start receiving NotEnoughReplicas after the second?
What exactly does acks=all wait for?
Why did the high watermark stop advancing?
Which replica becomes leader after a broker failure?
What does an unclean leader election actually lose?
How do you explain the difference between a healthy cluster, a degraded cluster, and a cluster that is still alive but can no longer satisfy its durability guarantees?

These are the moments where a static diagram starts to fall apart.

Kafka is a distributed system. It has time, order, failure, recovery, and trade-offs. The most important lessons are often hidden in transitions: before and after a failure, before and after a rebalance, before and after a controller election, before and after the ISR changes.
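Two of the questions above, which replica becomes leader and what an unclean election loses, can be sketched in a few lines. This is a simplified illustration rather than broker or simulator code: the function names are hypothetical, and the rule shown (prefer the first assigned replica that is both alive and in the ISR, and fall back to any live replica only when unclean election is allowed) is a reduced version of Kafka's behavior.

```python
from typing import Optional, Sequence, Set


def elect_leader(
    assignment: Sequence[int],
    isr: Set[int],
    live: Set[int],
    unclean_allowed: bool = False,
) -> Optional[int]:
    """Pick a leader for one partition, or None if it must stay offline."""
    # Clean election: first assigned replica that is alive and in the ISR.
    for broker in assignment:
        if broker in live and broker in isr:
            return broker
    # Unclean election: accept any live replica, even one that fell behind.
    if unclean_allowed:
        for broker in assignment:
            if broker in live:
                return broker
    return None


def committed_records_lost(high_watermark: int, new_leader_leo: int) -> int:
    """Committed records the new leader never replicated are lost."""
    return max(0, high_watermark - new_leader_leo)


# Broker 1 led the partition and was the only ISR member when it died.
print(elect_leader([1, 2, 3], isr={1}, live={2, 3}))                        # None
print(elect_leader([1, 2, 3], isr={1}, live={2, 3}, unclean_allowed=True))  # 2
print(committed_records_lost(high_watermark=100, new_leader_leo=90))        # 10
```

With unclean election disabled the partition stays offline until an in-sync replica returns; with it enabled the partition comes back, at the cost of the committed records the new leader never had.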

A simulator for seeing Kafka move

The goal of the simulator is simple: make Kafka behavior visible.

You can change Kafka settings, run a scenario or build your own cluster, break it, and then inspect what happened step by step. Instead of jumping from “healthy” to “failed,” the simulator exposes the timeline in between.

You can pause the scenario.
You can step backward and forward, or scrub to any moment on the timeline.
You can inspect brokers, partitions, replicas, producers, consumers, offsets, ISR, the high watermark, and metrics.
You can open the Why tab and read a plain-language explanation of the current state.
You can open the Metrics tab and see which Kafka metrics move in that situation.

Figure: The producer Inspector panel. Any entity in the simulator can be inspected; here, a producer with acks set to all, idempotence on, and editable retries, retry backoff, linger.ms, compression and partitioner settings.

This is especially useful for teaching failure behavior. In a real Kafka cluster, a failure is noisy, concurrent, and often hard to isolate. In the simulator, the same failure becomes a controlled learning moment.

You can ask: “Why did this produce request fail?”

Then step one event backward.
Then one event forward.
Then inspect the ISR.
Then check the high watermark.
Then compare the producer configuration with the current replica state.

The point is not only to show the final result. The point is to make the path to that result understandable.

The canonical example: `acks=all` and `min.insync.replicas`

One of the simplest and most useful walkthroughs is also one of the best teaching examples. It is the canonical demo on the simulator’s home page, and you can reproduce it hands-on in a free-play sandbox.

Start with:

replication factor: 3
min.insync.replicas: 2
producer acks: all

Figure: The canonical setup in the topic configuration panel: a topic named orders with 3 partitions, replication factor 3 and min.insync.replicas 2.

In a healthy cluster, the producer writes to the leader, followers replicate the record, the high watermark advances, and the record becomes committed.
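The link between follower progress and the high watermark can be sketched as follows. This is a simplified model, not the simulator's or the broker's code: the function name and the dictionary layout are assumptions, and the high watermark is treated simply as the smallest log end offset (LEO) among the in-sync replicas.

```python
from typing import Dict, Set


def high_watermark(log_end_offsets: Dict[int, int], isr: Set[int]) -> int:
    """Smallest log end offset among in-sync replicas (simplified model)."""
    return min(log_end_offsets[broker] for broker in isr)


# Broker id -> log end offset. Broker 1 is the leader; broker 3 is two records behind.
leo = {1: 10, 2: 10, 3: 8}

hw = high_watermark(leo, isr={1, 2, 3})
print(hw)  # 8: offsets 0..7 are committed and visible to consumers

# Once broker 3 catches up, the high watermark advances with it.
leo[3] = 10
print(high_watermark(leo, isr={1, 2, 3}))  # 10
```

Records at or beyond the high watermark exist on the leader but are not yet committed, which is why a consumer reads only up to that point.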

Now kill one broker.

The cluster is degraded, but still writable. There are still two in-sync replicas, so the producer can satisfy acks=all. This is the important boundary: the system is no longer fully healthy, but it can still preserve the configured durability guarantee.

Now kill another broker.

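Only one in-sync replica remains. The leader may still be alive, but the ISR is now smaller than min.insync.replicas=2, so the broker rejects acks=all writes with NotEnoughReplicas. The error is retriable, so the producer keeps retrying until the ISR recovers or its delivery timeout expires.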

Figure: A single-DC cluster with broker-1 and broker-2 down. The surviving leader, broker-3, still holds the data, but with the ISR below min.insync.replicas the acks=all producer can only keep retrying.

That distinction is one of the core lessons of Kafka reliability.

A cluster can be available. A leader can exist. A topic can still have data. But writes may still be rejected because the durability contract cannot be met.
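The boundary in this walkthrough comes down to one comparison, sketched here under simplifying assumptions: the leader is always alive, the function name is hypothetical, and the returned strings mirror the error name used above rather than a real client API.

```python
from typing import Set


def produce_result(isr: Set[int], min_insync_replicas: int, acks: str) -> str:
    """Outcome of one produce request to a partition whose leader is alive."""
    # min.insync.replicas is enforced only for acks=all.
    if acks == "all" and len(isr) < min_insync_replicas:
        return "NOT_ENOUGH_REPLICAS"
    return "OK"


isr = {1, 2, 3}                        # replication factor 3, everything in sync
print(produce_result(isr, 2, "all"))   # OK: healthy

isr.discard(1)                         # first broker dies
print(produce_result(isr, 2, "all"))   # OK: degraded, but still writable

isr.discard(2)                         # second broker dies
print(produce_result(isr, 2, "all"))   # NOT_ENOUGH_REPLICAS
print(produce_result(isr, 2, "1"))     # OK: acks=1 skips the ISR size check
```

The last line shows the trade-off: a producer with acks=1 can still write to the surviving leader, but without the durability guarantee that acks=all was asking for.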

This is exactly the kind of concept that becomes much easier when you can see the ISR, leader, producer request, high watermark, and metric changes together on one screen.

Built for step-by-step learning

Each scenario in the simulator is designed as a navigable sandbox.

You are not watching a fixed animation that disappears after it plays. You can move through the scenario like a debugger.

Figure: The Steps tab. Every scenario is a navigable walkthrough broken into numbered steps, each with a short explanation; click any step to jump the playhead there.

Every event is part of a deterministic, seeded timeline. You can replay it, pause it, step forward, step backward, and inspect state at each moment. This makes it useful not only for demos, but also for workshops, onboarding, debugging discussions, and architecture reviews.
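To make "deterministic, seeded timeline" concrete, here is a minimal sketch. It is not the simulator's engine: the event names, the `build_timeline` function and the `Playhead` class are all hypothetical, and the only point is that a fixed seed yields the same event sequence every time, which is what makes stepping backward and replaying possible.

```python
import random
from typing import List, Tuple

Event = Tuple[int, str, int]  # (tick, event name, broker id)


def build_timeline(seed: int, steps: int, brokers: int = 3) -> List[Event]:
    """Generate a reproducible sequence of events from a seed."""
    rng = random.Random(seed)  # private generator, unaffected by global state
    timeline = []
    for tick in range(steps):
        name = rng.choice(["produce", "fetch", "broker_down", "broker_up"])
        broker = rng.randrange(1, brokers + 1)
        timeline.append((tick, name, broker))
    return timeline


class Playhead:
    """Debugger-style cursor over a non-empty, precomputed timeline."""

    def __init__(self, timeline: List[Event]) -> None:
        self.timeline = timeline
        self.pos = 0

    def scrub(self, pos: int) -> Event:
        # Clamp to the valid range so stepping past either end is harmless.
        self.pos = max(0, min(pos, len(self.timeline) - 1))
        return self.timeline[self.pos]

    def step_forward(self) -> Event:
        return self.scrub(self.pos + 1)

    def step_back(self) -> Event:
        return self.scrub(self.pos - 1)


# Same seed, same timeline: a replay always sees identical events.
assert build_timeline(42, 20) == build_timeline(42, 20)

playhead = Playhead(build_timeline(42, 20))
playhead.scrub(10)
playhead.step_back()
print(playhead.pos)  # 9
```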

The full scenario state is encoded in the URL: the scenario, the cluster configuration, every action you took, the seed, and the position on the timeline. That means a scenario can be shared as a reproducible link — same configuration, same seed, same timeline, same failure moment.
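One way such a shareable link could work is sketched below. The actual URL format is not described here, so everything in this example is an assumption: the `s=` fragment key, the field names in the state dictionary, and the placeholder base URL are hypothetical.

```python
import base64
import json
from urllib.parse import urlsplit


def encode_state(base_url: str, state: dict) -> str:
    """Pack scenario state into a URL fragment as URL-safe base64 JSON."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    token = base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")
    return f"{base_url}#s={token.rstrip('=')}"


def decode_state(url: str) -> dict:
    """Recover the scenario state from a link produced by encode_state."""
    fragment = urlsplit(url).fragment
    if not fragment.startswith("s="):
        raise ValueError("no scenario state in URL")
    token = fragment[len("s="):]
    padded = token + "=" * (-len(token) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


state = {
    "scenario": "free-play",
    "config": {"replication_factor": 3, "min_insync_replicas": 2, "acks": "all"},
    "actions": [["kill_broker", 1], ["kill_broker", 2]],
    "seed": 42,
    "position": 17,
}

link = encode_state("https://example.invalid/playground", state)
assert decode_state(link) == state  # the link alone reproduces the scenario
```

Because the seed and the action list travel with the link, the recipient replays the same timeline rather than downloading a recording of it.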

This makes the simulator useful for explanations such as:

“Open this link and go to the moment where broker 2 dies.”

“Now check the ISR.”

“Now move one step forward and watch the leader election.”

“Now look at the producer error.”

“Now compare that with the metric movement.”

Instead of describing Kafka behavior from memory, you can point to a concrete, inspectable state.

What ships in the 1.0 release

The 1.0 release is a focused, single-DC version of the simulator, themed Fundamentals.

This release concentrates on foundational Kafka learning: topics, partitions, offsets, keys and partitioning, brokers, replicas and leaders, the difference between the log end offset and the high watermark, producer acknowledgements (acks=0, acks=1, the acks trade-off, and the acks=1 durability gap), the consumer fetch loop, partition assignment across group members, and rebalances.

It ships as thirteen guided scenarios, each with a frozen golden trace, plus a free-play sandbox where you can build your own single-DC cluster and experiment — including the acks=all durability walkthrough above.

The goal of the first release is not to expose every scenario we have internally. The goal is to ship a stable, understandable playground that teaches the core mechanics well.

That means the first public version is intentionally smaller than the simulator engine behind it. We would rather release a reliable set of scenarios that explain Kafka clearly than publish every advanced mode before the explanations, edge cases, and visual states are ready.

Figure: The full scenario library behind the simulator, grouped into categories with a scenario count for each: Anatomy, Producers and EOS, Consumer groups, Replication, Tiered storage, Disaster recovery, Schema Registry, Authorization, and Share groups. The 1.0 release ships the Fundamentals; the other packs roll out on a roughly biweekly cadence.

What is coming next

The simulator engine already models far more than the first pack exposes, and new scenario packs land on a roughly biweekly cadence. The changelog tracks what has shipped and what is next.

Upcoming packs add guided scenarios for replication and the min.insync.replicas boundary, delivery semantics and transactions, storage and lifecycle, the controller and quotas, a chaos and failure lab, and multi-DC disaster recovery — active-passive, active-active, stretched 3-DC and 2.5-DC clusters, DC failover, observer promotion, network partitions, slow brokers, and unclean leader elections.

These scenarios are powerful, but they also need to be handled carefully. Multi-DC Kafka behavior is full of trade-offs. It is easy to create a demo that looks impressive but teaches the wrong lesson. We want the advanced scenarios to be solid, explainable, and honest about the assumptions they make — which is why they roll out gradually rather than all at once.

Figure (coming soon): A stretched 2.5-DC cluster after losing a data center. dc-a is dead, dc-b is active with the leader, dc-c is a witness, and an observer in the surviving data center is being promoted to keep the partition writable.

Failure modes we want to make understandable

Kafka reliability is not one feature. It is a set of trade-offs.

The simulator is designed to help explain those trade-offs through concrete failure modes:

broker failures
slow followers
network partitions
ISR shrinkage
leader elections
unclean leader elections
producer retry behavior
consumer position and lag
DC failover
observer promotion
replication lag
recovery after failure

A few of these are already explorable in the first release — consumer position and lag, the acks=1 durability gap, and hands-on broker failures in the free-play sandbox. The rest arrive with the chaos and multi-DC packs.

Figure (coming soon): The Failure Lab panel, with controls to isolate or destroy a whole data center, or kill the KRaft quorum, and watch the cluster react.

The important part is that each failure should answer the same teaching questions:

What changed?
Why did it change?
What is still safe?
What is no longer guaranteed?
Which metric should tell you that something is wrong?

A good simulator should not only show red icons. It should explain the system state behind them.

The Why tab

One of the most important parts of the simulator is the Why tab.

When a scenario reaches an interesting state, the simulator explains why the cluster is behaving that way.

For example, after a broker failure, the visualization may show that a producer is still able to write. The Why tab explains that the ISR still contains enough replicas to satisfy min.insync.replicas.

After a second failure, the producer may start receiving NotEnoughReplicas. The Why tab explains that acks=all requires the configured minimum number of in-sync replicas, and the current ISR is now too small. It also points you straight to the partition or broker that caused it.

Figure: The Why tab on a rejected write. Three partitions are below min.insync.replicas, so acks=all writes are rejected with NOT_ENOUGH_REPLICAS, and the tab suggests how to recover.

This turns a failure from a visual event into a learning event.

The goal is not just to say “this failed.” The goal is to say “this failed because this guarantee could no longer be satisfied.”

Metrics should tell the same story

The simulator also includes a Metrics tab, because Kafka problems are usually diagnosed through metrics in production.

When a follower falls behind, you see it in the ISR-health and under-replicated-partitions readings. When acks=all writes start to retry, the retry count moves and produce throughput drops. When the cluster recovers, those readings settle again. Each metric links back to the event that last moved it, so you can connect a number to the moment it changed.

The metric values in the simulator are educational, not a replacement for production measurement. They are meant to be directionally correct and tied to the scenario state, so learners can connect what they see in the cluster with the kind of signals they would monitor in a real environment.

This matters because Kafka learning often separates architecture from operations. The simulator tries to connect them again.

A broker failure is not only a broker icon turning red. It is also ISR shrinkage, under-replicated partitions, possible leader election, producer behavior changes, and metric movement.
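The way partition state turns into health readings can be sketched as a pair of comparisons. The simulator's own metric names are not listed here, so the keys below are assumptions, loosely modeled on the broker metrics UnderReplicatedPartitions, UnderMinIsrPartitionCount and OfflinePartitionsCount; the `PartitionState` class is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class PartitionState:
    replicas: List[int]      # assigned replicas (replication factor = length)
    isr: Set[int]            # replicas currently in sync
    leader: Optional[int]    # None when the partition is offline


def partition_health(
    partitions: List[PartitionState], min_insync_replicas: int
) -> Dict[str, int]:
    """Count partitions in each unhealthy condition."""
    return {
        # Fewer in-sync replicas than assigned replicas.
        "under_replicated": sum(len(p.isr) < len(p.replicas) for p in partitions),
        # Too few in-sync replicas for acks=all writes to succeed.
        "under_min_isr": sum(len(p.isr) < min_insync_replicas for p in partitions),
        # No leader at all.
        "offline": sum(p.leader is None for p in partitions),
    }


healthy = [PartitionState([1, 2, 3], {1, 2, 3}, leader=1)]
one_down = [PartitionState([1, 2, 3], {2, 3}, leader=2)]
two_down = [PartitionState([1, 2, 3], {3}, leader=3)]

print(partition_health(healthy, 2))   # all zero
print(partition_health(one_down, 2))  # under_replicated=1, under_min_isr=0
print(partition_health(two_down, 2))  # under_replicated=1, under_min_isr=1
```

The first failure moves only the under-replicated count; the second also moves the under-min-ISR count, which is the reading that lines up with rejected acks=all writes.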

Honest simulation, not magic

We want the simulator to be useful, but we also want it to be honest.

It does not run a real Kafka cluster in the browser. It does not simulate operating system scheduling, disk I/O, page cache behavior, GC pauses, real network buffers, TLS handshakes, or byte-exact serialization.

It is a deterministic, educational model of Kafka behavior. It is built to explain ordering, state transitions, failure consequences, and configuration trade-offs. It is not built to predict exact throughput, latency, or production performance.

That distinction matters. A simulator is valuable when it helps you build the right mental model. It becomes dangerous when it pretends to be more exact than it is.

So the simulator includes an explicit model-limitations page. It explains what is modeled, what is approximated, and what is skipped.

Help us make it better

We also created a public repository for reporting bugs and incorrect behavior.

That matters because Kafka is full of edge cases, and simulation bugs are teaching bugs. If a scenario presents the wrong explanation, the wrong state transition, or the wrong failure outcome, we want to know.

The simulator will improve fastest with feedback from people who use Kafka in different ways: platform teams, developers, SREs, trainers, consultants, and anyone who has ever had to explain why a Kafka cluster behaved differently than expected.

If something looks wrong, please report it.

Start with the single-DC playground

The first release is a foundation: a browser-based Kafka simulator focused on single-DC learning. It is designed for safe experimentation. No backend is required. No real cluster is needed. You can break things freely, replay scenarios, share URLs, and inspect every step.

Figure: The free-play cluster setup panel: broker count, KRaft voters, the control plane, racks and a rack-aware placement toggle.

Multi-DC and disaster-recovery scenarios are coming next, including active-passive, active-active, stretched 3-DC, stretched 2.5-DC, DC failover, and observer promotion.

For now, start with the basics. Open the playground, start a single-DC free-play cluster, and:

Set the replication factor to 3.

Set min.insync.replicas=2.

Set acks=all.

Kill a broker.

Then kill another one.

Kafka is easier to understand when you can see it move.

It is even easier when you can break it safely.