Outbox Replay and Dead-Letter Queue

Summary

Event Hub persists events in a relational database outbox table before forwarding them to the event bus (Transactional Outbox pattern). Each event exists in one of three statuses: NEW, PROCESSED, or FAILED. Scheduler jobs periodically query unprocessed (NEW and FAILED) records and attempt to send them to the event bus. This mechanism ensures that the scheduler automatically retries events when the event bus is temporarily unreachable or when a send fails. A configurable cleanup job removes processed records.

Why It Exists

Three fundamental failure scenarios exist in distributed message delivery:

  1. Transient infrastructure outage — The event bus broker or Schema Registry becomes temporarily unreachable. Events must not be lost, and the system must deliver them once the outage ends.
  2. Serialization error — Corrupted Protobuf data cannot be deserialized. Retrying this event will never succeed.
  3. Table growth — Accumulation of successfully processed records degrades query performance and increases disk usage.

The outbox lifecycle addresses these three problems: the scheduler automatically retries NEW and FAILED records (replay), permanently failing records remain in FAILED status (implicit DLQ), and the cleanup job removes successfully completed records.

Where It Fits in Keymate

The outbox mechanism is Event Hub's reliability layer. It serves as the bridge between the event acceptance flow (Stage 2: Writing to the Outbox) described in the Overview and the event bus delivery flow detailed in the Delivery & Subscription Model.

Boundaries

In scope:

  • Outbox table structure and event data model
  • Event status transitions (NEW → PROCESSED / FAILED)
  • Retry (replay) mechanism and behavior
  • Concurrency control for multi-instance deployments
  • Cleanup job and configuration

Out of scope:

How It Works

Outbox Table Structure

Event Hub persists each accepted event as an outbox record with a unique identifier, the source event ID, source service name, event type, the original event payload (Protobuf binary), and a status field that tracks the event through its lifecycle. Every record must belong to an event type, and Event Hub validates that event data is present before writing.
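The record described above can be sketched as a small data model. This is an illustrative sketch only — the field and class names (`OutboxRecord`, `source_event_id`, and so on) are assumptions, not the actual Event Hub schema:

```python
from dataclasses import dataclass, field
from enum import Enum
import uuid


class OutboxStatus(Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


@dataclass
class OutboxRecord:
    source_event_id: str        # event ID assigned by the producing service
    source_service: str         # name of the producing service
    event_type: str             # the event type the record belongs to
    payload: bytes              # original event payload (Protobuf binary)
    status: OutboxStatus = OutboxStatus.NEW
    id: uuid.UUID = field(default_factory=uuid.uuid4)  # unique outbox identifier

    def __post_init__(self):
        # Event Hub validates that event data is present before writing,
        # and that every record belongs to an event type.
        if not self.payload:
            raise ValueError("event data must be present")
        if not self.event_type:
            raise ValueError("record must belong to an event type")
```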

Event Status Lifecycle

Each event exists in one of three statuses:

| Status | Description | Transition Condition |
| --- | --- | --- |
| NEW | Event written to outbox, not yet sent to the event bus | Event Hub creates this status when it persists a validated event to the outbox |
| PROCESSED | Event successfully sent to the event bus | Event Hub sets this status after the event bus broker acknowledges the message |
| FAILED | Event bus send failed | Event Hub sets this status after a broker rejection or connection error |
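The transitions in the table above reduce to a single rule: after a send attempt, a non-PROCESSED record moves to PROCESSED on a broker ACK and to FAILED otherwise. A minimal sketch (names are illustrative, not from the Event Hub codebase):

```python
from enum import Enum


class Status(Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


def next_status(current: Status, broker_acked: bool) -> Status:
    """Status after a send attempt: NEW/FAILED -> PROCESSED on ACK,
    NEW/FAILED -> FAILED on rejection or connection error."""
    if current is Status.PROCESSED:
        # PROCESSED is terminal; the cleanup job deletes such records.
        raise ValueError("PROCESSED records are not re-sent")
    return Status.PROCESSED if broker_acked else Status.FAILED
```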

Replay Mechanism (Retry)

Event Hub does not have a separate retry mechanism — retry is a natural consequence of the outbox polling pattern. Scheduler jobs query the outbox table for all unprocessed records (both NEW and FAILED), which means the scheduler automatically retries previously failed events on every cycle. The maximum number of records processed per cycle is configurable through the batch size setting.

When multiple Event Hub instances run simultaneously, the system ensures only one instance processes each event — preventing duplicate delivery while allowing throughput to scale linearly with the number of instances.

After each send attempt, Event Hub determines the status based on the event bus response:

  • If the event bus broker acknowledges the message → Event Hub sets the status to PROCESSED
  • If the event bus broker rejects or is unreachable → Event Hub sets the status to FAILED

Event Hub batch-updates all records (successful and failed) in a single transaction.
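One scheduler cycle — poll NEW and FAILED up to the batch size, attempt each send, then apply all status changes together — can be sketched in a few lines. This is a simulation under assumed names (`BATCH_SIZE`, `run_cycle`, an in-memory list standing in for the outbox table), not the actual implementation:

```python
from enum import Enum


class Status(Enum):
    NEW = "NEW"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


BATCH_SIZE = 100  # assumed name for the configurable batch size setting


def run_cycle(outbox: list[dict], send) -> None:
    """One scheduler cycle: poll unprocessed records, attempt sends,
    then batch-update every status (simulating a single transaction)."""
    # Polling NEW *and* FAILED is what makes retry automatic:
    # previously failed events are picked up again on every cycle.
    batch = [r for r in outbox if r["status"] is not Status.PROCESSED][:BATCH_SIZE]
    updates = {}
    for record in batch:
        try:
            send(record["payload"])                   # event bus send attempt
            updates[record["id"]] = Status.PROCESSED  # broker acknowledged
        except ConnectionError:
            updates[record["id"]] = Status.FAILED     # rejected or unreachable
    # Successful and failed records are updated together.
    for record in batch:
        record["status"] = updates[record["id"]]
```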

warning

The scheduler retries FAILED records indefinitely. Permanent failures (such as corrupted Protobuf data or invalid schema) cause these records to fail on every scheduler cycle. Event Hub does not have a maximum retry count or a dedicated dead-letter queue (DLQ) mechanism — FAILED records remaining in the outbox table serve as an implicit DLQ.

Concurrency Control

Event Hub supports running multiple instances simultaneously for horizontal scaling. The concurrency control mechanism ensures:

  • Instances do not block each other
  • No two instances process the same event (the system prevents duplicate delivery)
  • Throughput scales linearly as instance count increases
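The document does not name the locking mechanism, but a common way to satisfy all three properties in a relational outbox — an assumption on my part, not a confirmed Event Hub detail — is row-level locking with `SKIP LOCKED`: each instance locks the rows it claims, and concurrent pollers skip locked rows instead of waiting on them.

```python
# Hypothetical polling query (table and column names assumed).
# FOR UPDATE locks the claimed rows for the duration of the transaction;
# SKIP LOCKED makes other instances pass over them without blocking,
# so no two instances claim the same event.
CLAIM_BATCH_SQL = """
SELECT id, payload
FROM outbox
WHERE status IN ('NEW', 'FAILED')
ORDER BY id
LIMIT %(batch_size)s
FOR UPDATE SKIP LOCKED
"""
```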

Cleanup Job (Delete Processed Events)

A scheduled cleanup job periodically deletes records in PROCESSED status from the outbox table. This prevents indefinite table growth and maintains query performance.

The following configuration controls the cleanup job:

| Configuration | Default | Description |
| --- | --- | --- |
| Cleanup interval | Configurable | How often the cleanup job runs |
| Cleanup enabled | false | Toggle for enabling/disabling the cleanup job |

caution

Event Hub disables the cleanup job by default. If you do not enable it in production, the outbox table grows indefinitely. As the number of PROCESSED records increases, polling query performance degrades because the status filter must scan more rows.
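The cleanup job's behavior — delete PROCESSED records only, leave NEW and FAILED untouched — can be sketched as follows. The configuration names are assumptions (the document only says the interval and the enabled flag exist):

```python
from datetime import timedelta

# Assumed configuration names; the real Event Hub default for the
# enabled flag is false — it is set to True here only for illustration.
CLEANUP_ENABLED = True
CLEANUP_INTERVAL = timedelta(hours=1)


def cleanup(outbox: list[dict]) -> int:
    """Delete PROCESSED records; keep NEW and FAILED. Returns deleted count."""
    if not CLEANUP_ENABLED:
        return 0
    before = len(outbox)
    outbox[:] = [r for r in outbox if r["status"] != "PROCESSED"]
    return before - len(outbox)
```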

Error Scenarios and Behaviors

| Scenario | Behavior |
| --- | --- |
| Event bus broker temporarily unreachable | Event Hub catches the connection error instead of crashing, marks the event FAILED, and retries it in the next cycle. |
| Schema Registry unreachable | Protobuf serialization fails because schema registration cannot complete. The event becomes FAILED. |
| Corrupted Protobuf data | Deserialization from binary to Protobuf message fails. The event stays FAILED and the same error repeats every cycle (permanent failure). |
| Database transaction error | The database rolls back the transaction and Event Hub logs the error. The record status remains unchanged and the scheduler retries it in the next cycle. |
| Batch update failure | Even if the event bus send succeeds, a failed status update leaves the record NEW or FAILED. The scheduler sends it again in the next cycle (duplicate delivery risk). |

Diagram

Example Scenario

Scenario

The event bus broker experiences a transient outage. Event Hub writes events arriving during this period to the outbox and automatically delivers them once the outage ends.

Input

  • Actor: External integration service
  • Resource: An event of a given EventHubType
  • Action: Scheduler polling + event bus retry
  • Context:
    • Event bus broker unreachable for 30 seconds

Expected Outcome

  1. The event arrives via gRPC, passes validation, and Event Hub writes it to the outbox with NEW status
  2. On the next scheduler cycle, the scheduler retrieves the record from the outbox. Event bus send fails due to connection error → Event Hub updates the event to FAILED
  3. On subsequent scheduler cycles, the scheduler re-queries the same event (not yet PROCESSED), attempts to send, and fails → FAILED status persists
  4. Once the event bus broker becomes reachable again, the scheduler retrieves the record, sends to the event bus → broker sends ACK → Event Hub updates the event to PROCESSED
  5. The cleanup job (if enabled) eventually deletes the PROCESSED record
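The expected outcome above can be simulated in a few lines. The cycle count and recovery point are assumptions chosen to mirror the scenario (broker down for the first two cycles):

```python
# Simulate the scenario: the broker rejects sends for two scheduler
# cycles, then recovers. The record moves NEW -> FAILED -> FAILED ->
# PROCESSED without any event loss.
record = {"status": "NEW"}
broker_up_at_cycle = 3  # assumed: outage spans the first two cycles

history = []
for cycle in range(1, 5):
    if record["status"] == "PROCESSED":
        break  # nothing left to retry; the cleanup job deletes it later
    broker_up = cycle >= broker_up_at_cycle
    record["status"] = "PROCESSED" if broker_up else "FAILED"
    history.append(record["status"])
```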

Common Misunderstandings

  • "Event Hub uses a DLQ topic" — Event Hub does not create a separate event bus DLQ topic. Records in FAILED status remain in the outbox table and serve as an implicit DLQ. Detecting permanent failures requires monitoring the outbox table.

  • "The cleanup job also deletes FAILED records" — The cleanup job only deletes records in PROCESSED status. FAILED records remain in the table, and the scheduler continues to retry them.

  • "Retry count is limited" — There is no maximum retry count in the current implementation. The scheduler retries FAILED records indefinitely until someone resolves the issue or manually deletes them.

warning

Permanently failing events (e.g., corrupted Protobuf data) accumulate indefinitely in the outbox table. In production, regularly monitor for events in FAILED status and manually investigate persistent failures.

Design Notes / Best Practices

  • Enable the cleanup job in production — Event Hub disables the cleanup job by default. Enable it in production environments. Otherwise, the outbox table grows indefinitely and polling query performance degrades.

  • Tune the batch size to your throughput — The batch size determines the maximum number of records processed per scheduler cycle. Adjust based on your workload — keep in mind that Event Hub updates each batch within a single transaction.

  • Monitor FAILED records — Create an alert that monitors the count and age of events in FAILED status. Records that remain FAILED for extended periods indicate a persistent problem.

  • Plan for routine operations — Typical operational tasks include preventing event loss during event bus cluster maintenance, root cause analysis of long-standing FAILED records, monitoring outbox table size and tuning the cleanup strategy, and running multiple Event Hub pods for horizontal scaling.
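The "Monitor FAILED records" practice above can be sketched as a simple alert rule. The SQL and the threshold are illustrative assumptions (the document does not specify a `created_at` column or an age limit):

```python
# Hypothetical monitoring query (table and column names assumed).
MONITOR_SQL = """
SELECT COUNT(*) AS failed_count,
       MIN(created_at) AS oldest_failure
FROM outbox
WHERE status = 'FAILED'
"""


def should_alert(failed_count: int, oldest_age_minutes: float,
                 max_age_minutes: float = 30) -> bool:
    """Alert when FAILED records exist and the oldest one has been
    failing longer than the threshold — a likely permanent failure."""
    return failed_count > 0 and oldest_age_minutes >= max_age_minutes
```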

Next Step

Continue with Consumer Contracts to review the Protobuf schema structure, validation guarantees, and compatibility rules for event bus consumers.

FAQ

Are FAILED events automatically retried?

Yes. Scheduler jobs query for all unprocessed records, which includes both NEW and FAILED events. The scheduler automatically re-attempts FAILED records on every cycle.

Does Event Hub use a separate DLQ topic?

No. Event Hub does not create a separate event bus DLQ topic. Records in FAILED status remain in the outbox table and serve as an implicit DLQ. Detecting permanent failures requires monitoring the outbox table.

Does the cleanup job also delete FAILED records?

No. The cleanup job only targets records in PROCESSED status. It deletes only records that Event Hub successfully sent to the event bus. FAILED records remain in the table.

Can multiple pods process the same event?

No. Event Hub uses a concurrency-safe polling mechanism that ensures only one instance processes each event. Multiple pods can run simultaneously without processing the same event.