
Connector Runtime

Summary

The Connector Runtime defines the execution lifecycle, retry strategy, and error recovery mechanisms of Integration Hub's delivery pipeline. When the Outbox Processor fails to deliver an event after retrying, the message is routed to a dead letter topic (DLT). The DLT Processor consumes these failed events, persists them to the relational database, and automatically retries delivery on a configurable schedule. Each retry attempt is recorded in a history log. Operators can inspect failed events and trigger manual replays via a REST API. A configurable cleanup job removes successfully processed DLT events.

Why It Exists

In a distributed delivery system, transient failures are inevitable — subscriber endpoints go down, network partitions occur, authentication tokens expire. The Connector Runtime provides a multi-layered retry strategy:

  1. Immediate retry — The Outbox Processor retries delivery up to a configurable number of times (default: 3) before NACKing
  2. Scheduled retry — The DLT Processor automatically retries failed events on a configurable interval
  3. Manual replay — Operators can inspect and manually retrigger delivery for specific events via REST API

This layered approach ensures that transient failures are recovered automatically while persistent failures are surfaced for operator intervention.
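
The first, immediate-retry layer can be sketched as a bounded loop. The names `deliver_with_immediate_retry`, `send_to_dlt`, and `MAX_IMMEDIATE_RETRIES` below are illustrative, not the actual Outbox Processor API:

```python
# Minimal sketch of the immediate-retry layer (layer 1).
# After MAX_IMMEDIATE_RETRIES failed attempts, the event is NACKed,
# which routes it to the dead letter topic for layer 2.

MAX_IMMEDIATE_RETRIES = 3  # configurable; default is 3

def deliver_with_immediate_retry(event, deliver, send_to_dlt):
    for attempt in range(1, MAX_IMMEDIATE_RETRIES + 1):
        try:
            deliver(event)          # HTTP delivery to the subscriber
            return True             # ACK: delivery succeeded
        except Exception:
            pass                    # transient failure; try again
    send_to_dlt(event)              # NACK: hand off to the DLT
    return False
```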

Where It Fits in Keymate

The Connector Runtime is the reliability and operations layer of the Integration Hub pipeline. It handles events that the Connector & Delivery Model failed to deliver and provides operators with visibility into delivery failures.

Boundaries

In scope:

  • DLT topic consumption
  • DLT event persistence and retry history
  • Scheduled retry mechanism
  • Operator REST API (list, detail, replay)
  • Cleanup of processed DLT events
  • Event status lifecycle (NEW / PROCESSED / FAILED)

Out of scope:

How It Works

Retry Layers

DLT Processor: Event Consumption

The DLT Processor consumes all four DLT topics:

| Consumer Configuration | Value |
| --- | --- |
| Deserializer | Protobuf deserializer |
| Protobuf type | OutboxEvent |
| Offset strategy | earliest |
| Commit mode | Manual ACK/NACK |

At startup, the DLT Processor subscribes to the merged DLT stream, parses each message into an OutboxEvent, and persists it to the database with status NEW.
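
The consumption step might look like the following sketch; `parse_outbox_event`, `db`, and `msg.ack()` are stand-ins for the real deserializer, repository, and manual-commit call:

```python
# Sketch of the DLT consumption step: parse each message into an
# OutboxEvent, persist it with status NEW, then manually ACK.

def handle_dlt_message(msg, parse_outbox_event, db):
    event = parse_outbox_event(msg.value)   # Protobuf bytes -> OutboxEvent
    db.insert({
        "raw": msg.value,                   # raw Protobuf bytes
        "parsed": str(event),               # human-readable representation
        "status": "NEW",
        "retry_count": 0,
    })
    msg.ack()                               # manual commit mode
```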

Data Model

The DLT Processor persists failed events in two related data stores:

DLT Events — The primary store for failed events. Each record contains:

| Field | Description |
| --- | --- |
| Event data | Raw Protobuf bytes of the OutboxEvent and a human-readable parsed representation |
| Status | NEW, FAILED, or PROCESSED |
| Retry counter | Number of retry attempts made |
| Timestamps | Creation and last update timestamps |

Retry History — A complete audit trail of every retry attempt for each DLT event. Each entry records:

| Field | Description |
| --- | --- |
| Error code | HTTP status code or error classification |
| Error message | Error description |
| Error cause | Root cause details |
| Process date | When the retry attempt occurred |

Deleting a DLT event automatically cascades to its retry history records.
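
The relationship between the two stores, including the cascading delete, can be modeled with a small in-memory sketch; the field and class names are illustrative, not the actual schema:

```python
# Illustrative model of the two related stores. Deleting a DLT event
# cascades: its retry-history entries are removed with it.
from dataclasses import dataclass, field

@dataclass
class RetryHistoryEntry:
    event_id: str
    error_code: str      # HTTP status code or error classification
    error_message: str
    process_date: str

@dataclass
class DltStore:
    events: dict = field(default_factory=dict)   # event_id -> status
    history: list = field(default_factory=list)  # audit trail entries

    def delete_event(self, event_id):
        self.events.pop(event_id, None)
        # cascade: drop every history entry for this event
        self.history = [h for h in self.history if h.event_id != event_id]
```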

Scheduled Retry

The DLT Processor runs a scheduled job that retries failed events:

| Configuration | Default | Description |
| --- | --- | --- |
| Retry interval | 3s | How often the retry scheduler runs |
| Maximum retry count | 3 | Maximum retry attempts before marking as FAILED |
| Batch size | 100 | Number of records per retry cycle |
| Delivery timeout | 5s | HTTP delivery timeout for DLT retries |
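
One way these retry defaults might appear in application configuration; the property names below are illustrative, not the actual keys:

```yaml
# Illustrative configuration keys; the real property names may differ.
dlt:
  retry:
    interval: 3s      # how often the retry scheduler runs
    count: 3          # max retry attempts before marking FAILED
    batch-size: 100   # records per retry cycle
  delivery:
    timeout: 5s       # HTTP delivery timeout for DLT retries
```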

The retry flow:

  1. Query retryable events whose status is NEW or FAILED and whose retry count has not been exhausted, using a concurrency-safe locking mechanism
  2. For each event: extract endpoint and auth from the stored OutboxEvent, attempt HTTP delivery
  3. On success: update status to PROCESSED, increment retry counter, record result in retry history
  4. On failure: keep status as FAILED, increment retry counter, record error details in retry history
  5. When the retry counter reaches the maximum retry count: event remains FAILED — no further automatic retries
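
The retry flow above can be sketched as a single scheduler cycle. The locked batch query is assumed to have already produced `events`, and `deliver` stands in for the HTTP delivery call; all names are illustrative:

```python
# Sketch of one scheduled retry cycle over a locked batch of events.

MAX_RETRIES = 3   # maximum retry count; configurable, default 3

def run_retry_cycle(events, deliver, history):
    for event in events:
        if event["retry_count"] >= MAX_RETRIES:
            continue  # exhausted: stays FAILED, manual replay only
        try:
            deliver(event)                       # HTTP delivery attempt
            event["status"] = "PROCESSED"
            history.append((event["id"], "OK"))
        except Exception as exc:
            event["status"] = "FAILED"
            history.append((event["id"], str(exc)))
        event["retry_count"] += 1                # counted either way
```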

Operator REST API

The DLT Processor exposes a REST API for operator inspection and manual replay. It provides three main operations:

| Operation | Description |
| --- | --- |
| List failed events | Returns a paginated list of FAILED events with metadata. Supports filtering, sorting, and search. |
| Get event detail | Returns a single event with its full retry history — every attempt, error code, error message, and timestamp. |
| Manual replay | Accepts a list of event UUIDs and triggers re-delivery. Each replayed event goes through the same HTTP delivery pipeline as scheduled retries. |
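
The manual-replay operation might look like this sketch; `store` and `retry_one` are hypothetical stand-ins for the event repository and the shared delivery pipeline:

```python
# Sketch of manual replay: given a list of event UUIDs, re-attempt
# delivery through the same pipeline as scheduled retries and report
# the per-event result.

def replay_events(uuids, store, retry_one):
    results = {}
    for uuid in uuids:
        event = store.get(uuid)
        if event is None:
            results[uuid] = "NOT_FOUND"
            continue
        results[uuid] = "PROCESSED" if retry_one(event) else "FAILED"
    return results
```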

Cleanup Job

A configurable cleanup job removes PROCESSED DLT events:

| Configuration | Default | Description |
| --- | --- | --- |
| Cleanup interval | 60m | How often the cleanup scheduler runs |
| Cleanup enabled | false | Toggle for auto-delete of processed events |

caution

The cleanup job is disabled by default. Enable it in production to prevent the DLT store from growing indefinitely. Only PROCESSED events are deleted — FAILED events are preserved for operator inspection.
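
The cleanup rule (delete PROCESSED, preserve FAILED, do nothing while disabled) fits in a few lines; this is an illustrative sketch, not the actual job:

```python
# Sketch of the cleanup job: only PROCESSED events are deleted,
# and nothing happens while the toggle is off (the default).

def cleanup(events, enabled=False):
    if not enabled:                 # disabled by default
        return events
    return [e for e in events if e["status"] != "PROCESSED"]
```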

Event Status Lifecycle

Diagram

A DLT event enters the store with status NEW when consumed from a dead letter topic. A failed retry attempt moves it to FAILED; a successful scheduled retry or manual replay moves it to PROCESSED. Only PROCESSED events are eligible for cleanup.

Example Scenario

Scenario

A subscriber endpoint is down for 15 seconds. The DLT Processor captures the failed event and automatically delivers it once the endpoint recovers.

Input

  • Actor: DLT Processor (automated)
  • Resource: Failed OutboxEvent in the user lifecycle DLT
  • Action: Scheduled retry
  • Context:
    • Subscriber endpoint: temporarily unreachable
    • Retry interval: 3s
    • Max retries: 3

Expected Outcome

  1. Outbox Processor fails to deliver → NACK → message routed to DLT topic
  2. DLT Processor consumes the message, persists it with status = 'NEW'
  3. Retry 1 (t+3s): Scheduled job picks up the event, attempts HTTP delivery → 503 → error recorded in retry history, attempt count incremented
  4. Retry 2 (t+6s): Another attempt → 503 → error recorded in retry history, retry counter = 2
  5. Retry 3 (t+9s): Endpoint is back → 200 OK → status updated to PROCESSED, history records success
  6. Cleanup job (if enabled) deletes the PROCESSED record after the next cleanup cycle
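
The timeline above can be checked with a small simulation; the `simulate` helper below is purely illustrative:

```python
# Simulation of the scenario: the endpoint answers 503 twice,
# then 200 on the third scheduled retry.

def simulate(responses, max_retries=3):
    event = {"status": "NEW", "retry_count": 0, "history": []}
    for code in responses:
        if event["retry_count"] >= max_retries or event["status"] == "PROCESSED":
            break
        event["status"] = "PROCESSED" if code == 200 else "FAILED"
        event["history"].append(code)
        event["retry_count"] += 1
    return event
```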

Common Misunderstandings

  • "DLT events are permanently lost after max retries" — Events that exhaust automatic retries remain in the DLT store with FAILED status. They are not deleted. Operators can manually replay them via the REST API at any time.

  • "Manual replay bypasses authentication" — Manual replay goes through the same delivery pipeline, including authentication. The subscriber's auth credentials from the original OutboxEvent envelope are used.

  • "Cleanup job deletes FAILED events" — The cleanup job only deletes PROCESSED events. FAILED events are preserved for operator investigation.

warning

The DLT Processor's scheduled retry interval (default: 3 seconds) is aggressive by default. In production, consider increasing this interval to avoid overwhelming recovering subscriber endpoints with rapid retry attempts.

Design Notes / Best Practices

  • Review DLT events regularly — Use the operator REST API to monitor failed events. Long-standing FAILED events often indicate expired subscriber credentials, decommissioned endpoints, or permanent configuration issues.

  • Use the retry history for root cause analysis — Each DLT event records every retry attempt with error codes and messages. Use this data to distinguish transient failures (timeouts, temporary unavailability) from permanent failures (auth errors, endpoint not found).

  • Enable cleanup in production — Enable the cleanup job to prevent table growth. The default 60-minute interval is suitable for most deployments.

  • Scale DLT Processor independently — The DLT Processor uses a concurrency-safe locking mechanism for retry queries, enabling horizontal scaling. Multiple pods can process DLT events concurrently without conflicts.

tip

The event detail endpoint returns the full replay history for a specific event. This is invaluable for debugging — it shows exactly when each retry was attempted, what error occurred, and whether the final delivery succeeded.

Next Step

Continue with Audit & Observability to learn how Integration Hub audit logs are collected and how to monitor the delivery pipeline with OpenTelemetry.

What happens after all automatic retries are exhausted?

Events that exhaust all automatic retries (default: 3 in Outbox Processor + 3 in DLT Processor) remain in the DLT store with FAILED status. They are not deleted. Operators can manually replay them at any time via the operator REST API with the event UUIDs.

How do I manually replay a failed event?

Use the manual replay endpoint with a list of event UUIDs. The DLT Processor re-attempts HTTP delivery using the original OutboxEvent envelope (including endpoint and auth). The result is recorded in the retry history.

Can I see the retry history for a specific event?

Yes. The event detail endpoint returns the event along with its complete retry history — including every retry attempt, the error code, error message, and timestamp.

Does the cleanup job delete FAILED events?

No. The cleanup job only deletes events in PROCESSED status. FAILED events are preserved in the DLT store for operator inspection and manual replay.