Do Kafka metrics have to be so difficult?

Apache Kafka is one of the most widely used pieces of software in IT today. It’s everywhere, and it’s the de facto standard for event streaming: according to the Apache Kafka project, Kafka is used by thousands of organisations and trusted by more than 80% of Fortune 100 companies.

In many organisations, Kafka is the backbone of their systems, and the Kafka ecosystem is constantly growing.

But running a reliable Kafka cluster can be challenging: it has a lot of settings at both broker and topic level, it’s distributed, and obviously we want it to be highly available.

Monitoring is crucial: without a proper setup for collecting metrics, building dashboards, and eventually defining alerts, the operations team is blind.

In this article, I’ll describe the known problems with Kafka metrics and propose a solution which hopefully solves most of them — or maybe even all of them.

Have you ever imported a Grafana Kafka dashboard, only to find that it shows no data? You’ll find the answer below.

What is Kafka, anyway?

I bet that if you are reading this article, you know perfectly well what Kafka is and what it is for. But here’s a quick recap:

Apache Kafka is a distributed event streaming platform. The simplest way to think about it: Kafka is a durable, append-only log that lets many applications write events to it — producers — and many others read those events back — consumers — in real time, at very high throughput.

Events live in topics, which are split into partitions and replicated across multiple brokers, the servers that make up a Kafka cluster. That gives you horizontal scalability and fault tolerance: if a broker dies, another replica takes over, and consumers keep reading without missing a beat.

Today, Kafka sits at the heart of a lot of mission-critical systems: payment pipelines, fraud detection, microservice communication, change data capture from databases, log aggregation, IoT telemetry. When it works well, nobody notices. When it doesn’t, it’s a disaster.

The metrics system inside Kafka — the root of the problem

Did you know Kafka brokers expose metrics through two different metric systems? I didn’t.

Kafka metrics are handled by two separate metrics libraries running in the same JVM.

The older one is Yammer Metrics. It’s been in Kafka since the early days, and many fundamental broker metrics — for example BytesInPerSec, MessagesInPerSec, and UnderReplicatedPartitions — are still handled by it.

The second one is Kafka Metrics, org.apache.kafka.common.metrics, also known as the SPI — Service Provider Interface. It was introduced when the Java clients were created. It’s also used by ecosystem tools like Kafka Streams and Kafka Connect.

This is not just trivia. The official Apache Kafka monitoring documentation says Kafka uses Yammer Metrics for server metrics, while Java clients use Kafka Metrics. Both expose metrics through JMX and can be configured with pluggable stats reporters.

These two systems exist for historical reasons. At some stage, all new metrics were created as SPI metrics, but there was no plan to migrate Yammer metrics to SPI.

Here are the main differences:

| Area | Yammer Metrics MBeans | Kafka Metrics MBeans |
| --- | --- | --- |
| Main use in Kafka | Classic broker/server/controller metrics | Java clients and newer/common broker/controller modules |
| Reporter config | `kafka.metrics.reporters` | `metric.reporters` |
| Reporter interface | `kafka.metrics.KafkaMetricsReporter` | `org.apache.kafka.common.metrics.MetricsReporter` |
| Default JMX exposure | Yammer JMX reporter | `org.apache.kafka.common.metrics.JmxReporter` |
| MBean shape | Metric name is usually part of the `ObjectName`, as `name=...` | `ObjectName` is usually domain + type + tags; metric names are usually attributes |
| Attributes | Generic Yammer attributes like `Value`, `Count`, `MeanRate`, `OneMinuteRate`, percentiles | Kafka metric names as attributes, such as `byte-rate`, `throttle-time`, `connection-count`, `-rate`, `-total` |
| Example style | `kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=my-topic` | `kafka.server:type=Produce,user=alice,client-id=app1` with attributes like `byte-rate`, `throttle-time` |
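
To make the difference in MBean shape concrete, here is a minimal Python sketch that splits the two example `ObjectName` strings from the table and reports where the metric name lives. The function names are made up for illustration, the parsing ignores quoted values, and the "has a `name` key" check is only a heuristic that fits these two examples, not a general rule.

```python
def parse_object_name(object_name):
    """Split a JMX ObjectName into its domain and key properties.

    Simplified: does not handle quoted values that contain commas.
    """
    domain, _, props = object_name.partition(":")
    properties = dict(pair.split("=", 1) for pair in props.split(","))
    return domain, properties


def describe(object_name):
    """Report where the metric name lives for the given MBean."""
    domain, props = parse_object_name(object_name)
    if "name" in props:
        # Yammer style: the metric name is a key property of the ObjectName.
        return f"yammer: {domain}.{props['type']}.{props['name']}"
    # Kafka Metrics style: the ObjectName carries only domain + type + tags;
    # the metric names are attributes on the MBean itself.
    tags = {k: v for k, v in props.items() if k != "type"}
    return f"kafka-metrics: {domain}.{props['type']} tags={sorted(tags)}"


print(describe("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=my-topic"))
print(describe("kafka.server:type=Produce,user=alice,client-id=app1"))
```

Any tool that turns MBeans into flat metric names has to handle both shapes, which is one reason the mapping rules get complicated.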

Kafka metrics setup — the old way

The traditional way to monitor Kafka is to export metrics from JMX with Prometheus JMX Exporter. It’s a library that runs as a Java agent and is configured with a rules file and a port number.

The agent serves the metrics over HTTP on that port, under the /metrics path, and that is the endpoint Prometheus scrapes.

And the rules file? It’s a massive YAML file, with tons of regular expressions. Usually, nobody fully understands the details and just uses it, hoping that whoever wrote it knew what they were doing.

So you have to upload JMX Exporter to all brokers, possibly create or edit the config file with rules — good luck with this — and restart brokers. If everything goes smoothly, you can curl the metrics endpoint and see the metrics. Good!

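Then you scrape these metrics into Prometheus, import some Kafka dashboards into Grafana, and… in most cases, you see “No Data”. Why? Because there is no single standard for the rules. The rules file decides what each Prometheus metric is called, and a dashboard only works with the names produced by the rules file its author used. So the only option is to edit the dashboard or try to find another one.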
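
The “No Data” effect can be reproduced in a few lines of Python. The sketch below is not jmx_exporter itself: the MBean string is only loosely modelled on the form its rule patterns match against, and both rule sets and the dashboard metric name are hypothetical. It shows only that two reasonable rule files give the same MBean attribute two different Prometheus names.

```python
import re

# One MBean attribute, in a simplified form of what a rule pattern sees.
MBEAN = "kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>OneMinuteRate"

# Two hypothetical rule files: (regex pattern, name template).
RULES_A = [(r"kafka\.server<type=(.+), name=(.+)PerSec><>OneMinuteRate",
            r"kafka_server_\1_\2_rate")]
RULES_B = [(r"kafka\.(\w+)<type=(.+), name=(.+)><>(\w+)",
            r"kafka_\1_\2_\3_\4")]


def apply_rules(rules, mbean):
    """Return the metric name produced by the first matching rule, or None."""
    for pattern, template in rules:
        match = re.fullmatch(pattern, mbean)
        if match:
            return match.expand(template).lower()
    return None


name_a = apply_rules(RULES_A, MBEAN)
name_b = apply_rules(RULES_B, MBEAN)
print(name_a)  # kafka_server_brokertopicmetrics_bytesin_rate
print(name_b)  # kafka_server_brokertopicmetrics_bytesinpersec_oneminuterate

# A dashboard written against rule set A finds nothing on a broker using rule set B.
dashboard_metric = name_a
print(dashboard_metric in {name_b})  # False
```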

But there is more: each of these metrics systems can be configured with a custom class, or plugin. Look at the table above — yes, there are two settings. One with the kafka prefix, the other without. One with metric in singular, the other with metrics in plural.

Oh yes.

Hey, it’s 2026, can Kafka use OpenTelemetry?

The short answer is: not directly. But there are several ways to teach Kafka the OpenTelemetry protocol:

JMX Exporter plus the Prometheus receiver in the OpenTelemetry Collector — the same JMX mapping problems as described above.
OpenTelemetry Collector JMX receiver — still JMX-based, requires exposing JMX, and the component is now marked as deprecated, with a recommendation to use a standalone JMX Gatherer Java program instead.
OpenTelemetry Java agent — a good option. It can collect Kafka broker metrics through the JMX Metric Insight module, and the OpenTelemetry demo uses this approach for Kafka. But if you want a custom set of JMX metrics, you still end up with another metrics mapping file.

The solution

What if we had something that addresses all the issues described above?

As mentioned, the metric.reporters and kafka.metrics.reporters settings are just class names. The classes have to implement the interfaces org.apache.kafka.common.metrics.MetricsReporter and kafka.metrics.KafkaMetricsReporter, respectively.

So we had the idea: a library with the following characteristics.

Native OTLP, no JMX hop

Most Kafka observability stacks bolt on jmx_exporter, or something similar, as a JVM agent, then scrape the resulting metrics from an HTTP endpoint, then push them to a collector.

This plugin lives inside the Kafka process and speaks OTLP directly to the collector — one fewer process, one fewer config surface, and no JMX rule YAMLs to maintain.

One plugin, both Kafka registries

Kafka brokers expose metrics through two parallel systems: the Kafka SPI, configured with metric.reporters, and the legacy Yammer/Coda Hale registry.

Several broker-internal signals — UnderReplicatedPartitions, OfflinePartitionsCount, ActiveControllerCount, and the per-topic BrokerTopicMetrics — only register with Yammer.

OtlpMetricReporter attaches to both with a single configuration. The same JAR runs unchanged on clients, where the Yammer side auto-disables.

Fail-safe by design — Kafka is never blocked

Metric callbacks like metricChange and metricRemoval only touch an in-memory ConcurrentHashMap. All I/O happens on a daemon scheduler thread.

If the collector is unreachable, the export call times out, the batch is dropped, and the next tick starts fresh. No retry queue, no unbounded memory, no impact on Kafka produce/fetch latency.
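
The pattern described above can be sketched in Python. This is an illustration of the idea, not the plugin’s code: the class and method names are invented, a plain dict guarded by a lock stands in for Java’s ConcurrentHashMap, and `export` is any callable that may raise (for example on a timeout).

```python
import threading


class FailSafeReporter:
    """Toy reporter: callbacks only touch memory, exporting happens elsewhere."""

    def __init__(self, export, interval_s=10.0):
        self._export = export            # callable(dict); may raise or time out
        self._interval_s = interval_s
        self._metrics = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self.dropped_batches = 0

    def metric_change(self, name, value):
        # Hot path: no I/O, just an in-memory update.
        with self._lock:
            self._metrics[name] = value

    def metric_removal(self, name):
        with self._lock:
            self._metrics.pop(name, None)

    def export_once(self):
        """One tick: snapshot, try to send, drop the batch on any failure."""
        with self._lock:
            batch = dict(self._metrics)
        try:
            self._export(batch)
            return True
        except Exception:
            # No retry queue: the batch is dropped and the next tick starts fresh.
            self.dropped_batches += 1
            return False

    def start(self):
        def loop():
            # Event.wait returns False on timeout, True once stop() is called.
            while not self._stop.wait(self._interval_s):
                self.export_once()
        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()
```

Because a failed batch is never queued, memory use stays bounded by the number of live metrics, whatever state the collector is in.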

Broker context becomes first-class Prometheus labels

Kafka invokes MetricsReporter.contextChange(MetricsContext) with cluster id, node/broker id, and Kafka version.

The plugin captures those values and attaches them as OTLP resource attributes. They surface as labels — kafka_cluster_id, kafka_node_id, or kafka_broker_id — on every series, so by(kafka_cluster_id, kafka_node_id) works in PromQL with zero extra wiring.
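
The following Python sketch shows the kind of name translation that turns a dotted attribute key into a Prometheus label. It rests on assumptions: the dotted keys (`kafka.cluster.id`, `kafka.node.id`) and the sample values are illustrative, and the replacement rule is the common convention for making a name a valid Prometheus label, not a description of what the plugin or the collector does internally.

```python
import re


def to_prometheus_label(attribute_key):
    """Make an attribute key usable as a Prometheus label name.

    Classic label names match [a-zA-Z_][a-zA-Z0-9_]*, so every other
    character becomes an underscore and a leading digit gets a prefix.
    """
    label = re.sub(r"[^a-zA-Z0-9_]", "_", attribute_key)
    return "_" + label if label[:1].isdigit() else label


# Hypothetical resource attributes for one broker.
resource_attributes = {
    "kafka.cluster.id": "abc123",
    "kafka.node.id": "1",
}

labels = {to_prometheus_label(k): v for k, v in resource_attributes.items()}
print(labels)  # {'kafka_cluster_id': 'abc123', 'kafka_node_id': '1'}
```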

Ladies and gentlemen, meet `monedula-metrics-reporter`

monedula-metrics-reporter is an open source library that fulfills these requirements — and more.

It exports Kafka metrics directly over OTLP using gRPC or HTTP, supports Kafka 3.x and 4.x on Java 17+, and can also emit JVM runtime metrics on the same pipeline.

It also includes practical production features like:

metric allow-listing,
custom resource attributes,
TLS and mTLS configuration,
compression,
reporter self-monitoring metrics,
and support for both brokers and clients.

The reporter emits its own health metrics too, for example:

monedula_reporter_export_success_total,
monedula_reporter_export_failure_total,
monedula_reporter_export_duration_ms.

So if the collector pipeline breaks, you don’t just get silence. You get signals that the reporter itself is failing to export.
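
As an illustration of how these counters could be used, here is a small Python sketch that computes the share of failed exports between two scrapes. The function name and the sample values are hypothetical, and the sketch assumes the counters did not reset between the two samples.

```python
def export_failure_ratio(previous, current):
    """Fraction of exports that failed between two counter samples.

    Returns None when no export happened in the interval.
    """
    ok = (current["monedula_reporter_export_success_total"]
          - previous["monedula_reporter_export_success_total"])
    failed = (current["monedula_reporter_export_failure_total"]
              - previous["monedula_reporter_export_failure_total"])
    total = ok + failed
    return failed / total if total > 0 else None


# Hypothetical samples taken one scrape apart.
before = {"monedula_reporter_export_success_total": 100,
          "monedula_reporter_export_failure_total": 0}
after = {"monedula_reporter_export_success_total": 103,
         "monedula_reporter_export_failure_total": 1}

print(export_failure_ratio(before, after))  # 0.25
```

A ratio that stays above zero for several intervals is a sign that the pipeline between the reporter and the collector needs attention.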

The project also ships with an easy-to-use quickstart that demonstrates the full flow with Kafka, the OpenTelemetry Collector, Prometheus, and Grafana. And because dashboards are part of the problem, it includes a curated set of ready-to-use Grafana dashboards that match the metrics produced by the plugin.

You can find it on GitHub, build it locally, and test it with the quickstart. Prebuilt artifacts are coming soon.

If you find a bug or have a suggestion for improvement, feel free to create an issue on GitHub.

Summary

Kafka monitoring is harder than it should be because Kafka exposes metrics through two different metric systems, most production setups still rely on JMX, and every JMX-to-Prometheus mapping creates another compatibility layer between the broker, Prometheus, and Grafana dashboards.

monedula-metrics-reporter takes a different approach: it runs as a Kafka metrics reporter, exports metrics natively over OTLP, handles both Kafka metric registries, and ships with dashboards that match the emitted metrics.

So, do Kafka metrics have to be so difficult?

Hopefully not anymore.