Outbox Replay and Dead-Letter Queue

Summary

Event Hub persists events in a relational database outbox table before forwarding them to the event bus (Transactional Outbox pattern). Each event exists in one of three statuses: NEW, PROCESSED, or FAILED. Scheduler jobs periodically query unprocessed (NEW and FAILED) records and attempt to send them to the event bus. This mechanism ensures that the scheduler automatically retries events when the event bus is temporarily unreachable or when a send fails. A configurable cleanup job removes processed records.

Why It Exists

Three fundamental failure scenarios exist in distributed message delivery:

  1. Transient infrastructure outage — The event bus broker or Schema Registry becomes temporarily unreachable. Events must not be lost, and the system must deliver them once the outage ends.
  2. Serialization error — Corrupted Protobuf data cannot be deserialized. Retrying this event will never succeed.
  3. Table growth — Accumulation of successfully processed records degrades query performance and increases disk usage.

The outbox lifecycle addresses these three problems: the scheduler automatically retries NEW and FAILED records (replay), permanently failing records remain in FAILED status (implicit DLQ), and the cleanup job removes successfully completed records.

Where It Fits in Keymate

The outbox mechanism is Event Hub's reliability layer. It serves as the bridge between the event acceptance flow (Stage 2: Writing to the Outbox) described in the Overview and the event bus delivery flow detailed in the Delivery & Subscription Model.

Boundaries

In scope:

  • Outbox table structure and event data model
  • Event status transitions (NEW → PROCESSED / FAILED)
  • Retry (replay) mechanism and behavior
  • Concurrency control for multi-instance deployments
  • Cleanup job and configuration

Out of scope:

How It Works

Outbox Table Structure

Event Hub persists each accepted event as an outbox record with a unique identifier, the source event ID, source service name, event type, the original event payload (Protobuf binary), and a status field that tracks the event through its lifecycle. Every record must belong to an event type, and Event Hub validates that event data is present before writing.
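The record described above can be sketched as a small data model. This is an illustrative sketch only — the field and class names (`OutboxRecord`, `source_event_id`, and so on) are assumptions, not the actual Event Hub schema:

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class OutboxStatus(Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


@dataclass
class OutboxRecord:
    source_event_id: str        # event ID assigned by the producing service
    source_service: str         # name of the producing service
    event_type: str             # the event type the record belongs to
    payload: bytes              # original event payload (Protobuf binary)
    status: OutboxStatus = OutboxStatus.NEW
    id: uuid.UUID = field(default_factory=uuid.uuid4)  # unique outbox identifier

    def __post_init__(self):
        # Event Hub validates that event data is present before writing,
        # and that every record belongs to an event type.
        if not self.payload:
            raise ValueError("event data must be present")
        if not self.event_type:
            raise ValueError("record must belong to an event type")
```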

Event Status Lifecycle

Each event exists in one of three statuses:

| Status | Description | Transition Condition |
| --- | --- | --- |
| NEW | Event written to outbox, not yet sent to the event bus | Event Hub creates this status when it persists a validated event to the outbox |
| PROCESSED | Event successfully sent to the event bus | Event Hub sets this status after the event bus broker acknowledges the message |
| FAILED | Event bus send failed | Event Hub sets this status after a broker rejection or connection error |
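The transitions in the table above reduce to a single rule: after a send attempt, a non-PROCESSED record moves to PROCESSED on a broker ACK and to FAILED otherwise. A minimal sketch (names are illustrative, not from the Event Hub codebase):

```python
from enum import Enum


class Status(Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


def next_status(current: Status, broker_acked: bool) -> Status:
    """Status after a send attempt: NEW/FAILED -> PROCESSED on ACK,
    NEW/FAILED -> FAILED on rejection or connection error."""
    if current is Status.PROCESSED:
        # PROCESSED is terminal; the cleanup job deletes such records.
        raise ValueError("PROCESSED records are not re-sent")
    return Status.PROCESSED if broker_acked else Status.FAILED
```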

Replay Mechanism (Retry)

Event Hub does not have a separate retry mechanism — retry is a natural consequence of the outbox polling pattern. Scheduler jobs query the outbox table for all unprocessed records (both NEW and FAILED), which means the scheduler automatically retries previously failed events on every cycle. The maximum number of records processed per cycle is configurable through the batch size setting.

When multiple Event Hub instances run simultaneously, the system ensures only one instance processes each event — preventing duplicate delivery while allowing throughput to scale linearly with the number of instances.

After each send attempt, Event Hub determines the status based on the event bus response:

  • If the event bus broker acknowledges the message → Event Hub sets the status to PROCESSED
  • If the event bus broker rejects or is unreachable → Event Hub sets the status to FAILED

Event Hub batch-updates all records (successful and failed) in a single transaction.
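One scheduler cycle — poll NEW and FAILED up to the batch size, attempt each send, then apply all status changes together — can be sketched in a few lines. This is a simulation under assumed names (`BATCH_SIZE`, `run_cycle`, an in-memory list standing in for the outbox table), not the actual implementation:

```python
from enum import Enum


class Status(Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


BATCH_SIZE = 100  # assumed name for the configurable batch size setting


def run_cycle(outbox: list[dict], send) -> None:
    """One scheduler cycle: poll unprocessed records, attempt sends,
    then batch-update every status (simulating a single transaction)."""
    # Polling NEW *and* FAILED is what makes retry automatic:
    # previously failed events are picked up again on every cycle.
    batch = [r for r in outbox if r["status"] is not Status.PROCESSED][:BATCH_SIZE]
    updates = {}
    for record in batch:
        try:
            send(record["payload"])                   # event bus send attempt
            updates[record["id"]] = Status.PROCESSED  # broker acknowledged
        except ConnectionError:
            updates[record["id"]] = Status.FAILED     # rejected or unreachable
    # Successful and failed records are updated together.
    for record in batch:
        record["status"] = updates[record["id"]]
```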

warning

The scheduler retries FAILED records indefinitely. Permanent failures (such as corrupted Protobuf data or invalid schema) cause these records to fail on every scheduler cycle. Event Hub does not have a maximum retry count or a dedicated dead-letter queue (DLQ) mechanism — FAILED records remaining in the outbox table serve as an implicit DLQ.

Concurrency Control

Event Hub supports running multiple instances simultaneously for horizontal scaling. The concurrency control mechanism ensures:

  • Instances do not block each other
  • No two instances process the same event (the system prevents duplicate delivery)
  • Throughput scales linearly as instance count increases
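The document does not name the locking mechanism, but a common way to satisfy all three properties in a relational outbox — an assumption on my part, not a confirmed Event Hub detail — is row-level locking with `SKIP LOCKED`: each instance locks the rows it claims, and concurrent pollers skip locked rows instead of waiting on them.

```python
# Hypothetical polling query (table and column names assumed).
# FOR UPDATE locks the claimed rows for the duration of the transaction;
# SKIP LOCKED makes other instances pass over them without blocking,
# so no two instances claim the same event.
CLAIM_BATCH_SQL = """
SELECT id, payload
FROM outbox
WHERE status IN ('NEW', 'FAILED')
ORDER BY id
LIMIT %(batch_size)s
FOR UPDATE SKIP LOCKED
"""
```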

Cleanup Job (Delete Processed Events)

A scheduled cleanup job periodically deletes records in PROCESSED status from the outbox table. This prevents indefinite table growth and maintains query performance.

The following configuration controls the cleanup job:

| Configuration | Default | Description |
| --- | --- | --- |
| Cleanup interval | Configurable | How often the cleanup job runs |
| Cleanup enabled | false | Toggle for enabling/disabling the cleanup job |

caution

Event Hub disables the cleanup job by default. If you do not enable it in production, the outbox table grows indefinitely. As the number of PROCESSED records increases, polling query performance degrades because the status filter must scan more rows.
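The cleanup job's behavior — delete PROCESSED records only, leave NEW and FAILED untouched — can be sketched as follows. The configuration names are assumptions (the document only says the interval and the enabled flag exist):

```python
from datetime import timedelta

# Assumed configuration names; the real Event Hub default for the
# enabled flag is false — it is set to True here only for illustration.
CLEANUP_ENABLED = True
CLEANUP_INTERVAL = timedelta(hours=1)


def cleanup(outbox: list[dict]) -> int:
    """Delete PROCESSED records; keep NEW and FAILED. Returns deleted count."""
    if not CLEANUP_ENABLED:
        return 0
    before = len(outbox)
    outbox[:] = [r for r in outbox if r["status"] != "PROCESSED"]
    return before - len(outbox)
```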

Error Scenarios and Behaviors

| Scenario | Behavior |
| --- | --- |
| Event bus broker temporarily unreachable | Event Hub catches the connection error instead of crashing, marks the event FAILED, and retries it in the next cycle. |
| Schema Registry unreachable | Protobuf serialization fails because schema registration cannot complete. The event becomes FAILED. |
| Corrupted Protobuf data | Deserialization from binary to Protobuf message fails. The event stays FAILED and the same error repeats every cycle (permanent failure). |
| Database transaction error | The database rolls back the transaction and Event Hub logs the error. The record status remains unchanged and the scheduler retries it in the next cycle. |
| Batch update failure | Even if the event bus send succeeds, a failed status update leaves the record NEW or FAILED. The scheduler sends it again in the next cycle (duplicate delivery risk). |

Diagram

Example Scenario

Scenario

The event bus broker experiences a transient outage. Event Hub writes events arriving during this period to the outbox and automatically delivers them once the outage ends.

Input

  • Actor: External integration service
  • Resource: An event of a given EventHubType
  • Action: Scheduler polling + event bus retry
  • Context:
    • Event bus broker unreachable for 30 seconds

Expected Outcome

  1. The event arrives via gRPC, passes validation, and Event Hub writes it to the outbox with NEW status
  2. On the next scheduler cycle, the scheduler retrieves the record from the outbox. Event bus send fails due to connection error → Event Hub updates the event to FAILED
  3. On subsequent scheduler cycles, the scheduler re-queries the same event (not yet PROCESSED), attempts to send, and fails → FAILED status persists
  4. Once the event bus broker becomes reachable again, the scheduler retrieves the record, sends to the event bus → broker sends ACK → Event Hub updates the event to PROCESSED
  5. The cleanup job (if enabled) eventually deletes the PROCESSED record
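The expected outcome above can be simulated in a few lines. The cycle count and recovery point are assumptions chosen to mirror the scenario (broker down for the first two cycles):

```python
# Simulate the scenario: the broker rejects sends for two scheduler
# cycles, then recovers. The record moves NEW -> FAILED -> FAILED ->
# PROCESSED without any event loss.
record = {"status": "NEW"}
broker_up_at_cycle = 3  # assumed: outage spans the first two cycles

history = []
for cycle in range(1, 5):
    if record["status"] == "PROCESSED":
        break  # nothing left to retry; the cleanup job deletes it later
    broker_up = cycle >= broker_up_at_cycle
    record["status"] = "PROCESSED" if broker_up else "FAILED"
    history.append(record["status"])
```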

Common Misunderstandings

  • "Event Hub uses a DLQ topic" — Event Hub does not create a separate event bus DLQ topic. Records in FAILED status remain in the outbox table and serve as an implicit DLQ. Detecting permanent failures requires monitoring the outbox table.

  • "The cleanup job also deletes FAILED records" — The cleanup job only deletes records in PROCESSED status. FAILED records remain in the table, and the scheduler continues to retry them.

  • "Retry count is limited" — There is no maximum retry count in the current implementation. The scheduler retries FAILED records indefinitely until someone resolves the issue or manually deletes them.

warning

Permanently failing events (e.g., corrupted Protobuf data) accumulate indefinitely in the outbox table. In production, regularly monitor for events in FAILED status and manually investigate persistent failures.

Design Notes / Best Practices

  • Enable the cleanup job in production — Event Hub disables the cleanup job by default. Enable it in production environments. Otherwise, the outbox table grows indefinitely and polling query performance degrades.

  • Tune the batch size to your throughput — The batch size determines the maximum number of records processed per scheduler cycle. Adjust based on your workload — keep in mind that Event Hub updates each batch within a single transaction.

  • Monitor FAILED records — Create an alert that monitors the count and age of events in FAILED status. Records that remain FAILED for extended periods indicate a persistent problem.

  • Plan for routine operations — Typical operational tasks include preventing event loss during event bus cluster maintenance, root cause analysis of long-standing FAILED records, monitoring outbox table size and tuning the cleanup strategy, and running multiple Event Hub pods for horizontal scaling.
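The "Monitor FAILED records" practice above can be sketched as a simple alert rule. The SQL and the threshold are illustrative assumptions (the document does not specify a `created_at` column or an age limit):

```python
# Hypothetical monitoring query (table and column names assumed).
MONITOR_SQL = """
SELECT COUNT(*) AS failed_count,
       MIN(created_at) AS oldest_failure
FROM outbox
WHERE status = 'FAILED'
"""


def should_alert(failed_count: int, oldest_age_minutes: float,
                 max_age_minutes: float = 30) -> bool:
    """Alert when FAILED records exist and the oldest one has been
    failing longer than the threshold — a likely permanent failure."""
    return failed_count > 0 and oldest_age_minutes >= max_age_minutes
```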

Next Step

Continue with Consumer Contracts to review the Protobuf schema structure, validation guarantees, and compatibility rules for event bus consumers.

FAQ

Are FAILED events automatically retried?

Yes. Scheduler jobs query for all unprocessed records, which includes both NEW and FAILED events. The scheduler automatically re-attempts FAILED records on every cycle.

Does Event Hub use a separate DLQ topic?

No. Event Hub does not create a separate event bus DLQ topic. Records in FAILED status remain in the outbox table and serve as an implicit DLQ. Detecting permanent failures requires monitoring the outbox table.

Does the cleanup job also delete FAILED records?

No. The cleanup job only targets records in PROCESSED status. It deletes only records that Event Hub successfully sent to the event bus. FAILED records remain in the table.

Can multiple pods process the same event?

No. Event Hub uses a concurrency-safe polling mechanism that ensures only one instance processes each event. Multiple pods can run simultaneously without processing the same event.