Outbox & DLQ Handling
Summary
The outbox pattern ensures reliable event delivery by persisting events to a database table before publishing them to subscribers. When delivery fails, events move to a Dead Letter Queue (DLQ) for inspection and replay. This architecture ensures that platform events eventually reach their destinations even when subscribers experience temporary outages.
Why It Exists
Distributed systems face reliability challenges when publishing events:
- Network failures can prevent event delivery
- Subscriber services may be temporarily unavailable
- Database transactions and event publishing can become inconsistent
- Failed events need a recovery path
The outbox pattern solves these problems by treating event publishing as a two-phase process: first persist locally, then deliver asynchronously. The DLQ provides a safety net for events that cannot be delivered after retry attempts.
Where It Fits in Keymate
The outbox and DLQ components operate between event producers and the subscription delivery system.
Events flow from producers through the outbox table to the processor, which delivers them to subscribers. Failed deliveries land in the DLQ where operators can inspect and replay them.
Boundaries
What it covers:
- Outbox event persistence and lifecycle
- Event status management (NEW, PROCESSED, FAILED)
- Dead letter queue storage and querying
- Replay API for failed event recovery
- Audit and telemetry for delivery operations
What it does not cover:
- Subscription management (see Event Subscription Model)
- Event bus infrastructure (see Event Bus Integration)
- Event payload schema design
How It Works
Outbox Pattern
The outbox pattern captures events in a persistent table before delivery:
- Event capture — When a platform action occurs (user created, policy changed), the system writes an event record to the outbox table within the same database transaction as the action
- Asynchronous processing — A separate processor polls the outbox table for new events
- Delivery — The processor delivers events to subscribed endpoints
- Status update — On success, the event status changes to PROCESSED; on failure after retries, it changes to FAILED
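The first two steps above can be sketched in a few lines. This is a minimal illustration, assuming a SQLite store and a hypothetical "users" business table; all table and column names here are illustrative, not the actual schema:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT);
    CREATE TABLE outbox (
        id TEXT PRIMARY KEY,
        event_type TEXT,
        payload TEXT,
        status TEXT DEFAULT 'NEW',
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        processed_at TEXT
    );
""")

def create_user(email):
    """Write the business row and the outbox event in ONE transaction."""
    user_id = str(uuid.uuid4())
    with conn:  # commits both inserts together, or rolls both back
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, email))
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "user.created",
             json.dumps({"user_id": user_id})),
        )
    return user_id

def poll_new_events(limit=10):
    """Asynchronous processor side: fetch events awaiting delivery."""
    return conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE status = 'NEW' ORDER BY created_at LIMIT ?", (limit,)
    ).fetchall()

create_user("ada@example.com")
events = poll_new_events()
print(events[0][1])  # user.created
```

Because both inserts share one transaction, the event can never be recorded without the business change, and vice versa.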
Outbox Event Structure
Each outbox event contains:
| Field | Description |
|---|---|
| id | Unique event identifier (UUID) |
| event_type | Classification of the event (user.created, policy.updated) |
| payload | JSON-formatted event data |
| status | Current state: NEW, PROCESSED, or FAILED |
| created_at | Timestamp when the event was captured |
| processed_at | Timestamp when processing completed (success or failure) |
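The fields above can be mirrored as a small in-memory record. This sketch is illustrative only; the field types and defaults are assumptions, not the actual schema:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class OutboxEvent:
    event_type: str                     # e.g. "user.created"
    payload: dict                       # JSON-serializable event data
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "NEW"                 # NEW | PROCESSED | FAILED
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    processed_at: Optional[datetime] = None  # set when processing completes

evt = OutboxEvent("policy.updated", {"policy_id": "p-42"})
print(evt.status)  # NEW
```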
Event Status Lifecycle
Events start in NEW status. Successful delivery moves them to PROCESSED. Events that fail after the maximum retry attempts move to FAILED status and enter the DLQ.
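The lifecycle reduces to one transition rule. A minimal sketch, assuming a MAX_RETRY setting of 3 (the actual value is configurable, see Retry and Replay below):

```python
MAX_RETRY = 3  # assumed value of the max_retry configuration

def next_status(current, delivered, attempts):
    """Return the event's next status after a delivery attempt."""
    if current != "NEW":
        return current               # PROCESSED and FAILED are terminal
    if delivered:
        return "PROCESSED"           # successful delivery
    if attempts >= MAX_RETRY:
        return "FAILED"              # retries exhausted -> event enters DLQ
    return "NEW"                     # stays NEW and will be retried

print(next_status("NEW", delivered=True, attempts=1))   # PROCESSED
print(next_status("NEW", delivered=False, attempts=3))  # FAILED
```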
Dead Letter Queue
The DLQ stores events that could not be delivered after all retry attempts. Each DLQ entry includes:
- Original event payload
- Subscriber information
- Error message from the last delivery attempt
- Timestamps for tracking
Operators can query the DLQ by:
- Message ID
- Subscriber ID
- Error message content
- Date range
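The four criteria above compose naturally as filters. A sketch over in-memory DLQ entries; the entry field names and sample data are assumptions, not the real DLQ schema:

```python
from datetime import date

# Hypothetical DLQ entries with the fields listed above.
dlq = [
    {"message_id": "m-1", "subscriber_id": "analytics-service",
     "error": "Connection refused", "failed_on": date(2024, 6, 1)},
    {"message_id": "m-2", "subscriber_id": "billing-service",
     "error": "HTTP 500", "failed_on": date(2024, 6, 3)},
]

def query_dlq(subscriber_id=None, error_contains=None, start=None, end=None):
    """Filter DLQ entries by any combination of the supported criteria."""
    results = dlq
    if subscriber_id:
        results = [e for e in results if e["subscriber_id"] == subscriber_id]
    if error_contains:
        results = [e for e in results if error_contains in e["error"]]
    if start:
        results = [e for e in results if e["failed_on"] >= start]
    if end:
        results = [e for e in results if e["failed_on"] <= end]
    return results

print([e["message_id"] for e in query_dlq(subscriber_id="analytics-service")])
# ['m-1']
```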
Retry and Replay
The system provides configurable retry behavior:
| Configuration | Description |
|---|---|
| max_retry | Maximum delivery attempts before moving to DLQ |
| timeout | Seconds to wait for subscriber response |
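A sketch of how the two settings might drive delivery attempts; the configuration keys match the table above, but the surrounding function and values are illustrative:

```python
config = {"max_retry": 3, "timeout": 30}  # seconds to wait per attempt

def deliver_with_retries(send, event):
    """Attempt delivery up to max_retry times; report the final outcome."""
    for attempt in range(1, config["max_retry"] + 1):
        if send(event, timeout=config["timeout"]):
            return "PROCESSED"
    return "FAILED"  # caller then moves the event to the DLQ

# A subscriber that stays down: every attempt fails.
always_down = lambda event, timeout: False
print(deliver_with_retries(always_down, {"id": "evt-1"}))  # FAILED
```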
When events land in the DLQ, operators can replay them:
- Single replay — Retry delivery for one specific event
- Bulk replay — Retry delivery for multiple events matching criteria
Successful replay removes the event from the DLQ and marks it as PROCESSED.
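That replay contract can be sketched as follows. The deliver() callable and entry shape are assumptions; bulk replay would simply apply the same function to every entry matching the query criteria:

```python
def replay(dlq, statuses, message_id, deliver):
    """Retry one DLQ entry; on success, remove it and mark PROCESSED."""
    entry = next(e for e in dlq if e["message_id"] == message_id)
    if deliver(entry["payload"]):
        dlq.remove(entry)                     # successful replay leaves DLQ
        statuses[message_id] = "PROCESSED"
        return True
    return False  # entry stays in the DLQ for a later attempt

dlq = [{"message_id": "m-1", "payload": {"user_id": "u-7"}}]
statuses = {"m-1": "FAILED"}
replay(dlq, statuses, "m-1", deliver=lambda payload: True)
print(statuses["m-1"], len(dlq))  # PROCESSED 0
```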
Delivery Flow
The outbox processor follows this sequence for each event:
- Fetch subscription details for the event type
- Authenticate using the subscription's configured auth method
- Deliver the event payload to the subscriber endpoint
- Log the result to audit and telemetry systems
- Update event status based on delivery result
- On failure, move to DLQ after retry exhaustion
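The six steps above can be sketched as one processing function. The helper callables (lookup_subscription, authenticate, deliver, log_result, set_status, move_to_dlq) are hypothetical collaborators, not real APIs:

```python
def process(event, helpers, max_retry=3):
    """Run one event through the delivery sequence described above."""
    sub = helpers["lookup_subscription"](event["event_type"])  # 1. fetch
    token = helpers["authenticate"](sub)                       # 2. auth
    delivered = False
    for _ in range(max_retry):
        delivered = helpers["deliver"](sub, token, event["payload"])  # 3.
        if delivered:
            break
    helpers["log_result"](event, delivered)                    # 4. audit/telemetry
    if delivered:
        helpers["set_status"](event, "PROCESSED")              # 5. status update
    else:
        helpers["set_status"](event, "FAILED")
        helpers["move_to_dlq"](event)                          # 6. DLQ on exhaustion

# Exercise the failure path with stub helpers: the subscriber is down.
calls = []
helpers = {
    "lookup_subscription": lambda t: {"endpoint": "https://example.test"},
    "authenticate": lambda sub: "token",
    "deliver": lambda sub, tok, payload: False,
    "log_result": lambda e, ok: calls.append(("log", ok)),
    "set_status": lambda e, s: e.update(status=s),
    "move_to_dlq": lambda e: calls.append(("dlq", e["id"])),
}
event = {"id": "evt-9", "event_type": "user.created", "payload": {}}
process(event, helpers)
print(event["status"])  # FAILED
```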
Example Scenario
Scenario
A user creation event fails to deliver to a downstream analytics service. The operator investigates and replays the failed event after the analytics service is restored.
Input
- Actor: Platform operator
- Resource: Failed event in DLQ for analytics-service subscriber
- Action: Query DLQ, identify issue, replay event
- Context: Analytics service was temporarily unavailable
Expected Outcome
- Operator queries the DLQ: GET /dlt-messages?subscriberId=analytics-service. The response shows the failed event with the error message "Connection refused"
- Operator confirms the analytics service is restored
- Operator replays the event: POST /replay with the message ID. The event delivers successfully to the analytics service
- The event is removed from the DLQ and its status is set to PROCESSED
Common Misunderstandings
- Outbox events are immediately delivered — Events are persisted first, then processed asynchronously. There is a brief delay between event creation and delivery.
- DLQ events are lost — DLQ events persist until explicitly replayed or purged. They are not automatically deleted.
- Replay always succeeds — Replay attempts the same delivery flow. If the underlying issue persists, the event returns to the DLQ.
Events in PROCESSED status are eventually purged based on retention policy. Do not rely on the outbox table for long-term event storage or audit purposes.
Design Notes / Best Practices
- Configure retry counts based on typical subscriber recovery times
- Monitor DLQ depth as an indicator of integration health
- Set up alerts for events stuck in FAILED status
- Use bulk replay during incident recovery to clear DLQ backlogs
- Review DLQ error messages to identify recurring integration issues
Query the DLQ by date range after known outages to identify all affected events for bulk replay.
Related Use Cases
- Guaranteed delivery of audit events to compliance systems
- User provisioning to downstream identity stores
- Policy change propagation to enforcement points
- Organization updates to connected applications