Outbox Replay and Dead-Letter Queue
Summary
Event Hub persists events in a relational database outbox table before forwarding them to the event bus (Transactional Outbox pattern). Each event exists in one of three statuses: NEW, PROCESSED, or FAILED. Scheduler jobs periodically query unprocessed (NEW and FAILED) records and attempt to send them to the event bus. This mechanism ensures that the scheduler automatically retries events when the event bus is temporarily unreachable or when a send fails. A configurable cleanup job removes processed records.
Why It Exists
Three fundamental failure scenarios exist in distributed message delivery:
- Transient infrastructure outage — The event bus broker or Schema Registry becomes temporarily unreachable. Events must not be lost, and the system must deliver them once the outage ends.
- Serialization error — Corrupted Protobuf data cannot be deserialized. Retrying this event will never succeed.
- Table growth — Accumulation of successfully processed records degrades query performance and increases disk usage.
The outbox lifecycle addresses these three problems: the scheduler automatically retries NEW and FAILED records (replay), permanently failing records remain in FAILED status (implicit DLQ), and the cleanup job removes successfully completed records.
Where It Fits in Keymate
The outbox mechanism is Event Hub's reliability layer. It serves as the bridge between the event acceptance flow (Stage 2: Writing to the Outbox) described in the Overview and the event bus delivery flow detailed in the Delivery & Subscription Model.
Boundaries
In scope:
- Outbox table structure and event data model
- Event status transitions (NEW→PROCESSED/FAILED)
- Retry (replay) mechanism and behavior
- Concurrency control for multi-instance deployments
- Cleanup job and configuration
Out of scope:
- Event bus topic mapping and channel configuration → Delivery & Subscription Model
- gRPC validation rules → Overview
- Consumer-side schema expectations → Consumer Contracts
How It Works
Outbox Table Structure
Event Hub persists each accepted event as an outbox record with a unique identifier, the source event ID, source service name, event type, the original event payload (Protobuf binary), and a status field that tracks the event through its lifecycle. Every record must belong to an event type, and Event Hub validates that event data is present before writing.
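The record shape described above can be sketched as a table definition. This is a minimal illustrative schema, not Event Hub's actual DDL — column names and types are assumptions derived from the description:

```python
import sqlite3

# Illustrative outbox schema; Event Hub's real column names/types may differ.
DDL = """
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY,              -- unique outbox record identifier
    event_id   TEXT NOT NULL,                    -- source event ID
    source     TEXT NOT NULL,                    -- source service name
    event_type TEXT NOT NULL,                    -- every record must belong to an event type
    payload    BLOB NOT NULL,                    -- original Protobuf binary; must be present
    status     TEXT NOT NULL DEFAULT 'NEW'
               CHECK (status IN ('NEW', 'PROCESSED', 'FAILED')),
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE INDEX idx_outbox_status ON outbox (status);  -- supports the polling query
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# The NOT NULL constraint on payload mirrors the "event data must be present" check.
conn.execute(
    "INSERT INTO outbox (event_id, source, event_type, payload) VALUES (?, ?, ?, ?)",
    ("evt-1", "billing-service", "InvoiceCreated", b"\x08\x01"),
)
status = conn.execute("SELECT status FROM outbox WHERE event_id = 'evt-1'").fetchone()[0]
print(status)  # -> NEW
```

New records enter the lifecycle in NEW status via the column default, so the insert path never has to set it explicitly.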
Event Status Lifecycle
Each event exists in one of three statuses:
| Status | Description | Transition Condition |
|---|---|---|
| NEW | Event written to outbox, not yet sent to the event bus | Event Hub creates this status when it persists a validated event to the outbox |
| PROCESSED | Event successfully sent to the event bus | Event Hub sets this status after the event bus broker acknowledges the message |
| FAILED | Event bus send failed | Event Hub sets this status after a broker rejection or connection error |
Replay Mechanism (Retry)
Event Hub does not have a separate retry mechanism — retry is a natural consequence of the outbox polling pattern. Scheduler jobs query the outbox table for all unprocessed records (both NEW and FAILED), which means the scheduler automatically retries previously failed events on every cycle. The maximum number of records processed per cycle is configurable through the batch size setting.
When multiple Event Hub instances run simultaneously, the system ensures only one instance processes each event — preventing duplicate delivery while allowing throughput to scale linearly with the number of instances.
After each send attempt, Event Hub determines the status based on the event bus response:
- If the event bus broker acknowledges the message → Event Hub sets the status to PROCESSED
- If the event bus broker rejects the message or is unreachable → Event Hub sets the status to FAILED
Event Hub batch-updates all records (successful and failed) in a single transaction.
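The poll-send-update cycle can be sketched as follows. This is a simplified single-instance model under stated assumptions: `send_to_bus` stands in for the real event bus client, and the table is reduced to the columns the cycle touches:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload BLOB, status TEXT)")
conn.executemany("INSERT INTO outbox (payload, status) VALUES (?, ?)",
                 [(b"ok", "NEW"), (b"bad", "FAILED"), (b"ok", "NEW"), (b"done", "PROCESSED")])

BATCH_SIZE = 100  # configurable cap on records processed per cycle

def send_to_bus(payload: bytes) -> bool:
    """Stand-in for the event bus send; True means the broker acknowledged."""
    return payload != b"bad"

def run_cycle(conn):
    # Poll all unprocessed records -- NEW and FAILED alike -- so previously
    # failed events are retried automatically on every cycle.
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE status IN ('NEW','FAILED') "
        "ORDER BY id LIMIT ?",
        (BATCH_SIZE,),
    ).fetchall()
    results = [(("PROCESSED" if send_to_bus(p) else "FAILED"), rid) for rid, p in rows]
    # Batch-update every record's status (successes and failures) in one transaction.
    with conn:
        conn.executemany("UPDATE outbox SET status = ? WHERE id = ?", results)

run_cycle(conn)
counts = dict(conn.execute("SELECT status, COUNT(*) FROM outbox GROUP BY status"))
print(counts)  # e.g. {'FAILED': 1, 'PROCESSED': 3}
```

Note how the `bad` record ends the cycle in FAILED status and will simply be picked up again next cycle — there is no separate retry code path.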
The scheduler retries FAILED records indefinitely. Permanent failures (such as corrupted Protobuf data or invalid schema) cause these records to fail on every scheduler cycle. Event Hub does not have a maximum retry count or a dedicated dead-letter queue (DLQ) mechanism — FAILED records remaining in the outbox table serve as an implicit DLQ.
Concurrency Control
Event Hub supports running multiple instances simultaneously for horizontal scaling. The concurrency control mechanism ensures:
- Instances do not block each other
- No two instances process the same event (the system prevents duplicate delivery)
- Throughput scales linearly as instance count increases
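The document does not specify which locking primitive Event Hub uses; common implementations rely on `SELECT ... FOR UPDATE SKIP LOCKED` (PostgreSQL-family databases) or an atomic compare-and-set claim. The sketch below shows the latter, portable variant — each instance only keeps rows whose status it successfully flipped, so two instances can never process the same record:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO outbox (status) VALUES (?)", [("NEW",)] * 4)

def claim_batch(conn, instance: str, limit: int):
    """Claim up to `limit` unprocessed records for one instance.

    The conditional UPDATE is the compare-and-set: it only succeeds if the
    row is still unclaimed, so concurrent instances get disjoint batches.
    """
    with conn:
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM outbox WHERE status IN ('NEW','FAILED') "
            "ORDER BY id LIMIT ?", (limit,))]
        claimed = []
        for i in ids:
            cur = conn.execute(
                "UPDATE outbox SET status = ? WHERE id = ? "
                "AND status IN ('NEW','FAILED')",
                (f"CLAIMED:{instance}", i))
            if cur.rowcount == 1:  # another instance may have claimed it first
                claimed.append(i)
    return claimed

a = claim_batch(conn, "pod-a", 2)
b = claim_batch(conn, "pod-b", 2)
print(a, b)  # disjoint id sets, e.g. [1, 2] [3, 4]
```

Because claims never overlap, adding instances divides the backlog between them — which is how throughput scales with instance count.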
Cleanup Job (Delete Processed Events)
A scheduled cleanup job periodically deletes records in PROCESSED status from the outbox table. This prevents indefinite table growth and maintains query performance.
The following configuration controls the cleanup job:
| Configuration | Default | Description |
|---|---|---|
| Cleanup interval | Configurable | How often the cleanup job runs |
| Cleanup enabled | false | Toggle for enabling/disabling the cleanup job |
Event Hub disables the cleanup job by default. If you do not enable it in production, the outbox table grows indefinitely. As the number of PROCESSED records increases, polling query performance degrades because the status filter must scan more rows.
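The cleanup job itself is a single guarded delete. A minimal sketch, assuming the same illustrative table shape as above (the `CLEANUP_ENABLED` flag mirrors the "Cleanup enabled" toggle, shown switched on here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO outbox (status) VALUES (?)",
                 [("PROCESSED",), ("PROCESSED",), ("FAILED",), ("NEW",)])

CLEANUP_ENABLED = True  # disabled by default in Event Hub; enabled for this sketch

def cleanup(conn) -> int:
    """Delete only PROCESSED records; FAILED and NEW stay for the scheduler."""
    if not CLEANUP_ENABLED:
        return 0
    with conn:
        return conn.execute("DELETE FROM outbox WHERE status = 'PROCESSED'").rowcount

deleted = cleanup(conn)
remaining = [r[0] for r in conn.execute("SELECT status FROM outbox ORDER BY id")]
print(deleted, remaining)  # -> 2 ['FAILED', 'NEW']
```

The status filter is deliberately narrow: deleting FAILED records here would silently discard the implicit DLQ.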
Error Scenarios and Behaviors
| Scenario | Behavior |
|---|---|
| Event bus broker temporarily unreachable | Event Hub catches the connection error, marks the event FAILED, and retries it in the next scheduler cycle. |
| Schema Registry unreachable | Protobuf serialization fails because schema registration cannot complete. The event becomes FAILED. |
| Corrupted Protobuf data | Deserialization from binary to Protobuf message fails. The event stays FAILED and the same error repeats every cycle (permanent failure). |
| Database transaction error | The database rolls back the transaction and Event Hub logs the error. The record status remains unchanged and the scheduler retries it in the next cycle. |
| Batch update failure | Even if the event bus send succeeds, if the status update fails, the record remains NEW or FAILED. The scheduler sends it again in the next cycle (duplicate delivery risk). |
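The last scenario makes delivery effectively at-least-once. A common mitigation — not part of Event Hub itself, shown here as a hypothetical helper — is consumer-side deduplication keyed on the source event ID:

```python
# Hypothetical consumer-side dedup; Event Hub provides no such helper itself.
seen: set[str] = set()  # in production this would be a durable store, not memory

def handle(event_id: str, payload: bytes) -> bool:
    """Process an event at most once; return False for re-delivered duplicates."""
    if event_id in seen:
        return False  # duplicate delivery after a failed batch update: skip it
    seen.add(event_id)
    # ... real processing of `payload` would go here ...
    return True

print(handle("evt-1", b"\x08\x01"))  # -> True  (first delivery is processed)
print(handle("evt-1", b"\x08\x01"))  # -> False (re-delivery is skipped)
```

Keying on the source event ID works because Event Hub carries it in every outbox record, so a re-sent record arrives with the same ID.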
Example Scenario
Scenario
The event bus broker experiences a transient outage. Event Hub writes events arriving during this period to the outbox and automatically delivers them once the outage ends.
Input
- Actor: External integration service
- Resource: An event of a given EventHubType
- Action: Scheduler polling + event bus retry
- Context:
- Event bus broker unreachable for 30 seconds
Expected Outcome
- The event arrives via gRPC, passes validation, and Event Hub writes it to the outbox with NEW status
- On the next scheduler cycle, the scheduler retrieves the record from the outbox. The event bus send fails due to a connection error → Event Hub updates the event to FAILED
- On subsequent scheduler cycles, the scheduler re-queries the same event (not yet PROCESSED), attempts to send, and fails → FAILED status persists
- Once the event bus broker becomes reachable again, the scheduler retrieves the record and sends it to the event bus → the broker sends an ACK → Event Hub updates the event to PROCESSED
- The cleanup job (if enabled) eventually deletes the PROCESSED record
Common Misunderstandings
- "Event Hub uses a DLQ topic" — Event Hub does not create a separate event bus DLQ topic. Records in FAILED status remain in the outbox table and serve as an implicit DLQ. Detecting permanent failures requires monitoring the outbox table.
- "The cleanup job also deletes FAILED records" — The cleanup job only deletes records in PROCESSED status. FAILED records remain in the table, and the scheduler continues to retry them.
- "Retry count is limited" — There is no maximum retry count in the current implementation. The scheduler retries FAILED records indefinitely until someone resolves the issue or manually deletes them.
Permanently failing events (e.g., corrupted Protobuf data) accumulate indefinitely in the outbox table. In production, regularly monitor for events in FAILED status and manually investigate persistent failures.
Design Notes / Best Practices
- Enable the cleanup job in production — Event Hub disables the cleanup job by default. Enable it in production environments; otherwise, the outbox table grows indefinitely and polling query performance degrades.
- Tune the batch size to your throughput — The batch size determines the maximum number of records processed per scheduler cycle. Adjust it based on your workload, keeping in mind that Event Hub updates each batch within a single transaction.
- Monitor FAILED records — Create an alert that monitors the count and age of events in FAILED status. Records that remain FAILED for extended periods indicate a persistent problem.
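The monitoring query behind such an alert is straightforward. A sketch, assuming the illustrative schema from earlier (the `created_at` column name is an assumption — use whatever timestamp column your outbox table actually has):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    status TEXT,
    created_at TEXT DEFAULT (datetime('now')))""")
conn.executemany("INSERT INTO outbox (status, created_at) VALUES (?, ?)",
                 [("FAILED", "2024-01-01 00:00:00"),
                  ("FAILED", "2024-06-01 00:00:00"),
                  ("NEW",    "2024-06-02 00:00:00")])

# The two signals worth alerting on: how many FAILED records exist,
# and how old the oldest one is (a long-stuck record = persistent failure).
count, oldest = conn.execute(
    "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE status = 'FAILED'"
).fetchone()
print(count, oldest)  # -> 2 2024-01-01 00:00:00
```

Exposing these two values as metrics lets an alerting rule fire when either the count or the age crosses a threshold.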
Related Use Cases
- Preventing event loss during event bus cluster maintenance
- Detecting and performing root cause analysis on long-standing FAILED records
- Monitoring outbox table size and optimizing the cleanup strategy
- Running multiple Event Hub pods with horizontal scaling
Next Step
Continue with Consumer Contracts to review the Protobuf schema structure, validation guarantees, and compatibility rules for event bus consumers.
Related Docs
Overview
Event Hub's overall architecture and the Transactional Outbox pattern.
Delivery & Subscription Model
Event bus topic structure, scheduler mechanism, and Protobuf serialization details.
Consumer Contracts
Event schema rules and compatibility expectations for consumers.
Audit & Observability
Audit logs and OpenTelemetry observability during event processing.
FAQ
Are FAILED events automatically retried?
Yes. Scheduler jobs query for all unprocessed records, which includes both NEW and FAILED events. The scheduler automatically re-attempts FAILED records on every cycle.
Does Event Hub use a separate DLQ topic?
No. Event Hub does not create a separate event bus DLQ topic. Records in FAILED status remain in the outbox table and serve as an implicit DLQ. Detecting permanent failures requires monitoring the outbox table.
Does the cleanup job also delete FAILED records?
No. The cleanup job only targets records in PROCESSED status. It deletes only records that Event Hub successfully sent to the event bus. FAILED records remain in the table.
Can multiple pods process the same event?
No. Event Hub uses a concurrency-safe polling mechanism that ensures only one instance processes each event. Multiple pods can run simultaneously without processing the same event.