Streaming March 28, 2025

Exactly-Once Streaming into Delta Tables: How It Works

Kafka + Delta Lake + idempotent writes = exactly-once delivery without a distributed transaction manager. This post walks through the checkpoint protocol step by step.

Jordan Park, Software Engineer, DataLynxr

Figure: Exactly-once streaming pipeline into Delta Lake tables with Kafka checkpointing.

The problem with "at-least-once"

Every Kafka consumer tutorial starts with "at-least-once delivery." It sounds like a minor caveat. In practice, it means your data has duplicates, and something downstream has to deduplicate them — either in the query layer (expensive), in the ETL pipeline (fragile), or by accepting that your aggregate metrics are slightly wrong (common).

Exactly-once delivery is the property we actually want: each event consumed from Kafka commits to the destination table exactly one time, regardless of consumer restarts, broker failures, or network partitions. Getting there without a distributed transaction coordinator is the interesting part.

Why exactly-once without a transaction manager is possible

The insight is that Delta Lake (and Apache Iceberg) already give you ACID transactions on the write path. Every commit to a Delta table is atomic — it either fully applies or it doesn't. This means the problem reduces to: make each write to the Delta table idempotent.

If you can write the same batch of Kafka records twice and get the same result in the Delta table, you have exactly-once semantics without a distributed transaction manager. The Kafka offset is the idempotency key: if batch N at offset range [1000, 2000) is already committed to the Delta transaction log, the second attempt to write it is a no-op.
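
To make the "offset as idempotency key" idea concrete, here is a minimal in-memory sketch. `InMemoryTable` and `write_batch` are hypothetical names invented for illustration, not part of Delta Lake or the DataLynxr engine; a real table would keep the committed offsets in its transaction log rather than in a Python dict.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryTable:
    """Hypothetical stand-in for a Delta table: rows plus committed offsets."""
    rows: list = field(default_factory=list)
    # partition -> next offset to read (the exclusive end of the last commit)
    committed_end: dict = field(default_factory=dict)

    def write_batch(self, partition: int, start: int, end: int, records: list) -> bool:
        """Apply the batch covering [start, end) once.

        Returns True if the batch was applied and False if it was a no-op
        because that offset range is already committed.
        """
        last = self.committed_end.get(partition, 0)
        if end <= last:
            return False  # replay of an already committed range
        if start != last:
            raise ValueError(f"gap or overlap: expected start {last}, got {start}")
        # In this sketch, rows and offsets change together, mimicking one atomic commit.
        self.rows.extend(records)
        self.committed_end[partition] = end
        return True


table = InMemoryTable(committed_end={0: 1000})
batch = [f"event-{i}" for i in range(1000, 2000)]
first = table.write_batch(0, 1000, 2000, batch)   # applied
second = table.write_batch(0, 1000, 2000, batch)  # replay, ignored
```

Writing the batch for `[1000, 2000)` a second time leaves the table unchanged, which is the property the rest of the protocol relies on.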

The checkpoint protocol — step by step

Here's how the DataLynxr streaming engine implements this:

1. Consumer reads partition offsets. Before pulling any messages from Kafka, the streaming engine reads the Delta transaction log to find the last committed offset for each partition. This is stored as metadata in the Delta commit file, not in Kafka's consumer group tracking.
2. Messages are pulled for the next micro-batch. The engine requests records from the last committed offset forward, up to a configurable batch size (default: 5,000 records or 2 seconds of data, whichever comes first).
3. Records are converted to Parquet and written to object storage. The batch is serialized, Snappy-compressed, and written as new Parquet files in the Delta table's directory. These files are not yet visible to readers; they're staged but uncommitted.
4. Delta table commit with offset metadata. A single atomic Delta transaction commits the new Parquet files AND records the end offsets for each Kafka partition as metadata in the commit file. Both happen in one atomic write to S3.
5. Kafka consumer offset is advanced. Only after the Delta commit succeeds is the Kafka consumer offset advanced. If the Delta commit fails, offsets are not advanced, and the batch is replayed from the same start offset on the next attempt.
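
The five steps above can be sketched as a single function. This is an illustrative model only: `FakeDeltaTable`, `run_micro_batch`, and the `fail_commit` switch are made-up names, Kafka partitions are plain Python lists, "Parquet files" are lists of records, and the 2-second time limit is left out for brevity.

```python
from dataclasses import dataclass, field


class CommitFailed(Exception):
    """Raised to simulate an S3 write error or crash during the commit."""


@dataclass
class FakeDeltaTable:
    # Each commit holds the files it added and the end offsets it recorded.
    commits: list = field(default_factory=list)

    def last_committed_offsets(self) -> dict:
        return dict(self.commits[-1]["offsets"]) if self.commits else {}

    def commit(self, files: list, offsets: dict) -> None:
        # One append stands in for the atomic write of the commit file.
        self.commits.append({"files": files, "offsets": dict(offsets)})

    def visible_rows(self) -> list:
        return [row for c in self.commits for f in c["files"] for row in f]


def run_micro_batch(topic, table, kafka_offsets, max_records=5000, fail_commit=False):
    """Run one micro-batch; return False when there is nothing left to read."""
    # Step 1: the start offsets come from the table's log, not from Kafka.
    committed = table.last_committed_offsets()
    start = {p: committed.get(p, 0) for p in topic}

    # Step 2: pull records forward from those offsets, up to the batch size.
    batch, end, budget = {}, {}, max_records
    for p, log in topic.items():
        records = log[start[p]:start[p] + budget]
        batch[p] = records
        end[p] = start[p] + len(records)
        budget -= len(records)

    # Step 3: stage "files" that readers cannot see yet.
    staged = [list(records) for records in batch.values() if records]
    if not staged:
        return False

    # Step 4: commit files and end offsets together, or fail with no effect.
    if fail_commit:
        raise CommitFailed("simulated failure before the commit landed")
    table.commit(staged, end)

    # Step 5: only now advance the Kafka consumer offsets.
    kafka_offsets.update(end)
    return True
```

Calling `run_micro_batch` with `fail_commit=True` and then retrying replays the same offset range, because step 1 reads the unchanged log; nothing staged in step 3 ever became visible.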

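Steps 4 and 5 together are the key. If the commit fails (S3 write error, process crash), the engine restarts and reads the Delta log again. It finds the same "last committed offset" and pulls the same batch from Kafka, so nothing is lost and nothing has been written twice. If the process instead crashes after the commit succeeds but before the Kafka offset is advanced, the restart finds the new end offsets in the Delta log and resumes after the committed batch; the Delta log, not the stale Kafka consumer offset, is the source of truth. Finally, the Delta commit uses conditional PUT semantics (via S3 conditional writes) so that only one writer can create a given log version, which prevents a duplicate commit even under concurrent consumer restarts.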
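
A rough sketch of the conditional-write guard, assuming a put-if-absent primitive on the object store (on S3 this would be a conditional PUT, for example with an `If-None-Match: *` precondition). `FakeObjectStore`, `try_commit`, and the JSON layout of the commit body are hypothetical; only the zero-padded `_delta_log/<version>.json` naming follows the Delta log convention.

```python
import json


class PreconditionFailed(Exception):
    """The key already exists, so the conditional write was rejected."""


class FakeObjectStore:
    """In-memory stand-in for an object store with put-if-absent."""

    def __init__(self):
        self.objects = {}

    def put_if_absent(self, key: str, body: str) -> None:
        if key in self.objects:
            raise PreconditionFailed(key)
        self.objects[key] = body


def log_key(version: int) -> str:
    # Delta log entries are named by a 20-digit zero-padded version number.
    return f"_delta_log/{version:020d}.json"


def latest(store: FakeObjectStore):
    """Return (version, offsets) of the newest commit, or (-1, {}) if none."""
    keys = sorted(k for k in store.objects if k.startswith("_delta_log/"))
    if not keys:
        return -1, {}
    version = int(keys[-1].split("/")[1].split(".")[0])
    return version, json.loads(store.objects[keys[-1]])["offsets"]


def try_commit(store, read_version: int, files: list, end_offsets: dict) -> str:
    """Try to create log version read_version + 1 for this batch."""
    body = json.dumps({"files": files, "offsets": end_offsets})
    try:
        store.put_if_absent(log_key(read_version + 1), body)
        return "committed"
    except PreconditionFailed:
        # Someone else created this version first. Re-read the log and check
        # whether the winner already covers the offsets we wanted to commit.
        _, offsets = latest(store)
        if all(offsets.get(p, 0) >= end for p, end in end_offsets.items()):
            return "already-committed"
        return "conflict"


store = FakeObjectStore()
version, _ = latest(store)
# Two restarted consumers race to commit the same offset range.
result_a = try_commit(store, version, ["part-a.parquet"], {"0": 2000})
result_b = try_commit(store, version, ["part-b.parquet"], {"0": 2000})
```

In this sketch the losing writer's staged file (`part-b.parquet`) is never referenced by the log, so readers never see it; cleaning up such orphans is outside the scope of the example.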

Schema registry integration

Most production Kafka topics carry Avro or Protobuf messages registered in Confluent Schema Registry or AWS Glue Schema Registry. The streaming engine fetches the schema on startup and on every schema version change detected in the message header. Schema evolution is handled by Delta's schema evolution rules: adding a new nullable column in the Avro schema appends that column to the Delta table with NULL for all prior rows — no backfill, no pipeline restart required.
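
As a simplified illustration of the "new nullable column" case, the sketch below merges an incoming schema into a table schema and shows how older rows read back with NULL for the new column. `evolve_schema` and `read_row` are hypothetical helpers, and the rules here (reject type changes, reject new non-nullable columns) are assumptions chosen for the example, not Delta's full set of schema evolution rules.

```python
def evolve_schema(table_schema: dict, incoming: dict) -> dict:
    """Merge an incoming schema into the table schema.

    table_schema maps column name -> type name.
    incoming maps column name -> (type name, nullable flag).
    New nullable columns are appended; anything else that differs is rejected.
    """
    merged = dict(table_schema)
    for name, (type_name, nullable) in incoming.items():
        if name in merged:
            if merged[name] != type_name:
                raise ValueError(f"type change on {name!r} is not handled in this sketch")
        elif nullable:
            merged[name] = type_name  # appended after the existing columns
        else:
            raise ValueError(f"new column {name!r} must be nullable")
    return merged


def read_row(stored_row: dict, schema: dict) -> dict:
    """Project a stored row onto the current schema, filling gaps with None."""
    return {name: stored_row.get(name) for name in schema}


table_schema = {"id": "long", "amount": "double"}
incoming = {
    "id": ("long", False),
    "amount": ("double", False),
    "currency": ("string", True),  # field newly added to the Avro schema
}
merged = evolve_schema(table_schema, incoming)
old_row = read_row({"id": 1, "amount": 9.5}, merged)
```

Rows written before the change need no rewrite: they simply lack the new column on disk and surface it as `None` at read time.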

This is where the copy-to-warehouse approach becomes fragile: the ETL pipeline has to detect the new Avro field, translate it to the warehouse schema, and perform an ALTER TABLE — often a blocking operation that delays all downstream analytics until it completes.

Latency profile

At the default configuration (batch size 5,000 records, 2s flush interval), P50 end-to-end latency from Kafka produce to Delta table read-availability is under 2 seconds. P99 is under 5 seconds at 50,000 events/second throughput on a 4-partition topic. Latency scales primarily with batch flush interval — reducing the interval to 500ms reduces P50 latency to under 800ms at the cost of increased S3 write overhead (smaller, more frequent Parquet files requiring more frequent compaction).
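
The trade-off between flush interval and write overhead can be estimated with some back-of-the-envelope arithmetic. The helper below assumes a batch closes at whichever limit is hit first (record count or flush interval) and that each batch produces one commit; both function names are made up for this example, and the numbers are estimates, not measurements.

```python
def effective_batch_seconds(events_per_sec: float,
                            max_records: int = 5000,
                            flush_interval_s: float = 2.0) -> float:
    """Time until a batch closes: record cap or flush interval, whichever is first."""
    fill_time = max_records / events_per_sec
    return min(fill_time, flush_interval_s)


def commits_per_hour(events_per_sec: float,
                     max_records: int = 5000,
                     flush_interval_s: float = 2.0) -> float:
    """Estimated number of commits (and at least that many new files) per hour."""
    return 3600 / effective_batch_seconds(events_per_sec, max_records, flush_interval_s)


# At 1,000 events/second the flush interval closes each batch.
default_rate = commits_per_hour(1_000)                      # 2 s interval
fast_rate = commits_per_hour(1_000, flush_interval_s=0.5)   # 500 ms interval
```

Under these assumptions, cutting the interval from 2 s to 500 ms quadruples the number of commits per hour at a moderate event rate, which is where the extra compaction work comes from. At much higher event rates the record cap can close batches before the interval elapses, so the batch size setting matters as well.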

Compaction is handled automatically by a background process that rewrites small files into larger ones using Z-ordering on configurable columns. This keeps read performance from degrading as the table accumulates micro-batch writes over time.
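
To give a feel for the small-file side of compaction, here is a minimal planning sketch that groups small files into rewrite jobs of roughly a target size. `plan_compaction`, the thresholds, and the greedy grouping are assumptions for illustration; the sketch does not model Z-ordering or how the DataLynxr background process actually selects files.

```python
def plan_compaction(file_sizes: dict, target_bytes: int, small_threshold: int) -> list:
    """Group small files into rewrite jobs of at most target_bytes each.

    file_sizes maps file name -> size in bytes. Files at or above
    small_threshold are left alone. Returns a list of groups (lists of
    file names); groups with a single file are dropped because rewriting
    one file on its own gains nothing.
    """
    small = sorted((size, name) for name, size in file_sizes.items()
                   if size < small_threshold)
    groups, current, current_size = [], [], 0
    for size, name in small:
        # Close the current group when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return [group for group in groups if len(group) > 1]


sizes = {"a": 10, "b": 20, "c": 30, "d": 40, "big": 500}
plan = plan_compaction(sizes, target_bytes=60, small_threshold=100)
```

Each returned group would be rewritten into one larger file and swapped in with a single table commit, so readers see either the old small files or the new large one, never a mix.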

Try it

The streaming connector configuration takes about 15 minutes to set up from scratch. See the connectors documentation for the Kafka connector config reference and the full list of supported source systems (Kinesis and Pulsar connectors follow the same checkpoint protocol described above).