Outbox & DLQ Handling

Summary

The outbox pattern ensures reliable event delivery by persisting events to a database table before publishing them to subscribers. When delivery fails, events move to a Dead Letter Queue (DLQ) for inspection and replay. This architecture guarantees that platform events reach their destinations even when subscribers experience temporary outages.

Why It Exists

Distributed systems face reliability challenges when publishing events:

  • Network failures can prevent event delivery
  • Subscriber services may be temporarily unavailable
  • Database transactions and event publishing can become inconsistent
  • Failed events need a recovery path

The outbox pattern solves these problems by treating event publishing as a two-phase process: first persist locally, then deliver asynchronously. The DLQ provides a safety net for events that cannot be delivered after retry attempts.

Where It Fits in Keymate

The outbox and DLQ components operate between event producers and the subscription delivery system.

Events flow from producers through the outbox table to the processor, which delivers them to subscribers. Failed deliveries land in the DLQ where operators can inspect and replay them.

Boundaries

What it covers:

  • Outbox event persistence and lifecycle
  • Event status management (NEW, PROCESSED, FAILED)
  • Dead letter queue storage and querying
  • Replay API for failed event recovery
  • Audit and telemetry for delivery operations

What it does not cover:

How It Works

Outbox Pattern

The outbox pattern captures events in a persistent table before delivery:

  1. Event capture — When a platform action occurs (user created, policy changed), the system writes an event record to the outbox table within the same database transaction as the action
  2. Asynchronous processing — A separate processor polls the outbox table for new events
  3. Delivery — The processor delivers events to subscribed endpoints
  4. Status update — On success, the event status changes to PROCESSED; on failure after retries, it changes to FAILED
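
The four steps above can be sketched with SQLite. The outbox columns follow the event structure described in this page, but the schema details, the `users` table, and the `deliver` callback are illustrative assumptions, not Keymate's actual implementation:

```python
import json
import sqlite3
import uuid

# Minimal outbox-pattern sketch; schema and deliver() interface are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE outbox (
    id TEXT PRIMARY KEY,
    event_type TEXT,
    payload TEXT,
    status TEXT DEFAULT 'NEW',
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    processed_at TEXT)""")

def create_user(name):
    """Step 1: write the action and its event in one transaction."""
    user_id = str(uuid.uuid4())
    with conn:  # both inserts commit (or roll back) together
        conn.execute("INSERT INTO users VALUES (?, ?)", (user_id, name))
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "user.created",
             json.dumps({"id": user_id, "name": name})),
        )
    return user_id

def process_outbox(deliver):
    """Steps 2-4: poll NEW events, deliver, update status."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE status = 'NEW'"
    ).fetchall()
    for event_id, event_type, payload in rows:
        ok = deliver(event_type, json.loads(payload))
        conn.execute(
            "UPDATE outbox SET status = ?, processed_at = CURRENT_TIMESTAMP "
            "WHERE id = ?",
            ("PROCESSED" if ok else "FAILED", event_id),
        )
    conn.commit()
```

Because the business row and the event commit atomically, a crash between the two can never lose an event; the processor can then poll on any schedule.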

Outbox Event Structure

Each outbox event contains:

Field         Description
id            Unique event identifier (UUID)
event_type    Classification of the event (user.created, policy.updated)
payload       JSON-formatted event data
status        Current state: NEW, PROCESSED, or FAILED
created_at    Timestamp when the event was captured
processed_at  Timestamp when processing completed (success or failure)
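
The same fields as a typed record, as one minimal sketch; the Python types and defaults here (uuid4 identifiers, UTC timestamps) are assumptions about details the table leaves open:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class OutboxEvent:
    event_type: str   # e.g. "user.created" or "policy.updated"
    payload: dict     # JSON-formatted event data
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "NEW"  # NEW, PROCESSED, or FAILED
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    processed_at: Optional[datetime] = None  # set when processing completes
```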

Event Status Lifecycle

Events start in NEW status. Successful delivery moves them to PROCESSED. Events that fail after the maximum retry attempts move to FAILED status and enter the DLQ.
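
The lifecycle reduces to a small state machine. A sketch of the allowed transitions, treating a successful replay of a FAILED event as the one route back to PROCESSED (per the Retry and Replay section); how the real system enforces this is not specified here:

```python
# Allowed status transitions, as described in the lifecycle above.
TRANSITIONS = {
    "NEW": {"PROCESSED", "FAILED"},
    "PROCESSED": set(),        # terminal until purged by retention policy
    "FAILED": {"PROCESSED"},   # a successful replay marks it PROCESSED
}

def advance(current, target):
    """Guard a status change against the allowed transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```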

Dead Letter Queue

The DLQ stores events that could not be delivered after all retry attempts. Each DLQ entry includes:

  • Original event payload
  • Subscriber information
  • Error message from the last delivery attempt
  • Timestamps for tracking

Operators can query the DLQ by:

  • Message ID
  • Subscriber ID
  • Error message content
  • Date range
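
The four criteria can be sketched as a filter over in-memory DLQ entries; the field names (`message_id`, `subscriber_id`, `error`, `created_at`) are assumptions about the stored shape:

```python
def query_dlq(entries, message_id=None, subscriber_id=None,
              error_contains=None, start=None, end=None):
    """Filter DLQ entries by the criteria listed above; None means 'any'."""
    results = []
    for e in entries:
        if message_id and e["message_id"] != message_id:
            continue
        if subscriber_id and e["subscriber_id"] != subscriber_id:
            continue
        if error_contains and error_contains not in e["error"]:
            continue
        if start and e["created_at"] < start:   # date-range lower bound
            continue
        if end and e["created_at"] > end:       # date-range upper bound
            continue
        results.append(e)
    return results
```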

Retry and Replay

The system provides configurable retry behavior:

Configuration  Description
max_retry      Maximum delivery attempts before moving to DLQ
timeout        Seconds to wait for subscriber response
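
A sketch of how the two settings interact: up to `max_retry` attempts, each bounded by `timeout` seconds. The backoff between attempts is an assumption; the document only specifies the two settings:

```python
import time

def deliver_with_retry(deliver, event, max_retry=3, timeout=30, backoff=1.0):
    """Attempt delivery up to max_retry times; return (ok, last_error)."""
    last_error = None
    for attempt in range(1, max_retry + 1):
        try:
            deliver(event, timeout=timeout)  # subscriber call, assumed interface
            return True, None
        except Exception as exc:
            last_error = str(exc)
            time.sleep(backoff * 2 ** (attempt - 1))  # assumed exponential backoff
    # Retries exhausted: the caller records the error and moves the event
    # to the DLQ.
    return False, last_error
```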

When events land in the DLQ, operators can replay them:

  • Single replay — Retry delivery for one specific event
  • Bulk replay — Retry delivery for multiple events matching criteria

Successful replay removes the event from the DLQ and marks it as PROCESSED.
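
Single and bulk replay can be sketched over an in-memory DLQ. The `deliver` callback and entry fields are assumptions; the removal and PROCESSED transition on success follow the behavior described above:

```python
def replay_one(dlq, outbox_status, message_id, deliver):
    """Retry delivery for one DLQ entry; remove it on success."""
    entry = next(e for e in dlq if e["message_id"] == message_id)
    if deliver(entry["payload"], entry["subscriber_id"]):
        dlq.remove(entry)                        # removed from the DLQ
        outbox_status[message_id] = "PROCESSED"  # marked as PROCESSED
        return True
    return False  # the entry stays in the DLQ if the issue persists

def replay_bulk(dlq, outbox_status, deliver, matches):
    """Retry delivery for every DLQ entry matching a criteria predicate."""
    for entry in [e for e in dlq if matches(e)]:  # copy: replay mutates dlq
        replay_one(dlq, outbox_status, entry["message_id"], deliver)
```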

Delivery Flow

The outbox processor follows this sequence for each event:

  1. Fetch subscription details for the event type
  2. Authenticate using the subscription's configured auth method
  3. Deliver the event payload to the subscriber endpoint
  4. Log the result to audit and telemetry systems
  5. Update event status based on delivery result
  6. On failure, move to DLQ after retry exhaustion

Example Scenario

Scenario

A user creation event fails to deliver to a downstream analytics service. The operator investigates and replays the failed event after the analytics service is restored.

Input

  • Actor: Platform operator
  • Resource: Failed event in DLQ for analytics-service subscriber
  • Action: Query DLQ, identify issue, replay event
  • Context: Analytics service was temporarily unavailable

Expected Outcome

  • Operator queries DLQ: GET /dlt-messages?subscriberId=analytics-service
  • Response shows the failed event with error message "Connection refused"
  • Operator confirms analytics service is restored
  • Operator replays: POST /replay with message ID
  • Event delivers successfully to analytics service
  • Event removed from DLQ, status set to PROCESSED

Common Misunderstandings

  • Outbox events are immediately delivered — Events are persisted first, then processed asynchronously. There is a brief delay between event creation and delivery.
  • DLQ events are lost — DLQ events persist until explicitly replayed or purged. They are not automatically deleted.
  • Replay always succeeds — Replay attempts the same delivery flow. If the underlying issue persists, the event returns to the DLQ.
Warning

Events in PROCESSED status are eventually purged based on retention policy. Do not rely on the outbox table for long-term event storage or audit purposes.

Design Notes / Best Practices

  • Configure retry counts based on typical subscriber recovery times
  • Monitor DLQ depth as an indicator of integration health
  • Set up alerts for events stuck in FAILED status
  • Use bulk replay during incident recovery to clear DLQ backlogs
  • Review DLQ error messages to identify recurring integration issues
Tip

Query the DLQ by date range after known outages to identify all affected events for bulk replay.

Typical Use Cases

  • Guaranteed delivery of audit events to compliance systems
  • User provisioning to downstream identity stores
  • Policy change propagation to enforcement points
  • Organization updates to connected applications