
Connector Runtime

Summary

The Connector Runtime defines the execution lifecycle, retry strategy, and error recovery mechanisms of Integration Hub's delivery pipeline. When the Outbox Processor fails to deliver an event after retrying, the message is routed to a dead letter topic (DLT). The DLT Processor consumes these failed events, persists them to the relational database, and automatically retries delivery on a configurable schedule. Each retry attempt is recorded in a history log. Operators can inspect failed events and trigger manual replays via a REST API. A configurable cleanup job removes successfully processed DLT events.

Why It Exists

In a distributed delivery system, transient failures are inevitable — subscriber endpoints go down, network partitions occur, authentication tokens expire. The Connector Runtime provides a multi-layered retry strategy:

  1. Immediate retry — The Outbox Processor retries delivery up to a configurable number of times (default: 3) before NACKing
  2. Scheduled retry — The DLT Processor automatically retries failed events on a configurable interval
  3. Manual replay — Operators can inspect and manually retrigger delivery for specific events via REST API

This layered approach ensures that transient failures are recovered automatically while persistent failures are surfaced for operator intervention.
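
The first, immediate-retry layer can be sketched as a bounded loop. The names `deliver_with_immediate_retry`, `send_to_dlt`, and `MAX_IMMEDIATE_RETRIES` below are illustrative, not the actual Outbox Processor API:

```python
# Minimal sketch of the immediate-retry layer (layer 1).
# After MAX_IMMEDIATE_RETRIES failed attempts, the event is NACKed,
# which routes it to the dead letter topic for layer 2.

MAX_IMMEDIATE_RETRIES = 3  # configurable; default is 3

def deliver_with_immediate_retry(event, deliver, send_to_dlt):
    for attempt in range(1, MAX_IMMEDIATE_RETRIES + 1):
        try:
            deliver(event)          # HTTP delivery to the subscriber
            return True             # ACK: delivery succeeded
        except Exception:
            pass                    # transient failure; try again
    send_to_dlt(event)              # NACK: hand off to the DLT
    return False
```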

Where It Fits in Keymate

The Connector Runtime is the reliability and operations layer of the Integration Hub pipeline. It handles events that the Connector & Delivery Model failed to deliver and provides operators with visibility into delivery failures.

Boundaries

In scope:

  • DLT topic consumption
  • DLT event persistence and retry history
  • Scheduled retry mechanism
  • Operator REST API (list, detail, replay)
  • Cleanup of processed DLT events
  • Event status lifecycle (NEW / PROCESSED / FAILED)

Out of scope:

How It Works

Retry Layers

DLT Processor: Event Consumption

The DLT Processor consumes all four DLT topics:

| Consumer Configuration | Value |
| --- | --- |
| Deserializer | Protobuf deserializer |
| Protobuf type | OutboxEvent |
| Offset strategy | earliest |
| Commit mode | Manual ACK/NACK |

At startup, the DLT Processor subscribes to the merged DLT stream, parses each message into an OutboxEvent, and persists it to the database with status NEW.
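
The consumption step might look like the following sketch; `parse_outbox_event`, `db`, and `msg.ack()` are stand-ins for the real deserializer, repository, and manual-commit call:

```python
# Sketch of the DLT consumption step: parse each message into an
# OutboxEvent, persist it with status NEW, then manually ACK.

def handle_dlt_message(msg, parse_outbox_event, db):
    event = parse_outbox_event(msg.value)   # Protobuf bytes -> OutboxEvent
    db.insert({
        "raw": msg.value,                   # raw Protobuf bytes
        "parsed": str(event),               # human-readable representation
        "status": "NEW",
        "retry_count": 0,
    })
    msg.ack()                               # manual commit mode
```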

Data Model

The DLT Processor persists failed events in two related data stores:

DLT Events — The primary store for failed events. Each record contains:

| Field | Description |
| --- | --- |
| Event data | Raw Protobuf bytes of the OutboxEvent and a human-readable parsed representation |
| Status | NEW, FAILED, or PROCESSED |
| Retry counter | Number of retry attempts made |
| Timestamps | Creation and last update timestamps |

Retry History — A complete audit trail of every retry attempt for each DLT event. Each entry records:

| Field | Description |
| --- | --- |
| Error code | HTTP status code or error classification |
| Error message | Error description |
| Error cause | Root cause details |
| Process date | When the retry attempt occurred |

Deleting a DLT event automatically cascades to its retry history records.
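
The relationship between the two stores, including the cascading delete, can be modeled with a small in-memory sketch; the field and class names are illustrative, not the actual schema:

```python
# Illustrative model of the two related stores. Deleting a DLT event
# cascades: its retry-history entries are removed with it.
from dataclasses import dataclass, field

@dataclass
class RetryHistoryEntry:
    event_id: str
    error_code: str      # HTTP status code or error classification
    error_message: str
    process_date: str

@dataclass
class DltStore:
    events: dict = field(default_factory=dict)   # event_id -> status
    history: list = field(default_factory=list)  # audit trail entries

    def delete_event(self, event_id):
        self.events.pop(event_id, None)
        # cascade: drop every history entry for this event
        self.history = [h for h in self.history if h.event_id != event_id]
```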

Scheduled Retry

The DLT Processor runs a scheduled job that retries failed events:

| Configuration | Default | Description |
| --- | --- | --- |
| Retry interval | 3s | How often the retry scheduler runs |
| Maximum retry count | 3 | Maximum retry attempts before marking as FAILED |
| Batch size | 100 | Number of records per retry cycle |
| Delivery timeout | 5s | HTTP delivery timeout for DLT retries |
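
One way these retry defaults might appear in application configuration; the property names below are illustrative, not the actual keys:

```yaml
# Illustrative configuration keys; the real property names may differ.
dlt:
  retry:
    interval: 3s      # how often the retry scheduler runs
    count: 3          # max retry attempts before marking FAILED
    batch-size: 100   # records per retry cycle
  delivery:
    timeout: 5s       # HTTP delivery timeout for DLT retries
```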

The retry flow:

  1. Query retryable events whose status is NEW or FAILED and whose retry count has not been exhausted, using a concurrency-safe locking mechanism
  2. For each event: extract endpoint and auth from the stored OutboxEvent, attempt HTTP delivery
  3. On success: update status to PROCESSED, increment retry counter, record result in retry history
  4. On failure: keep status as FAILED, increment retry counter, record error details in retry history
  5. When the retry counter reaches the maximum retry count: event remains FAILED — no further automatic retries
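
The retry flow above can be sketched as a single scheduler cycle. The locked batch query is assumed to have already produced `events`, and `deliver` stands in for the HTTP delivery call; all names are illustrative:

```python
# Sketch of one scheduled retry cycle over a locked batch of events.

MAX_RETRIES = 3   # maximum retry count; configurable, default 3

def run_retry_cycle(events, deliver, history):
    for event in events:
        if event["retry_count"] >= MAX_RETRIES:
            continue  # exhausted: stays FAILED, manual replay only
        try:
            deliver(event)                       # HTTP delivery attempt
            event["status"] = "PROCESSED"
            history.append((event["id"], "OK"))
        except Exception as exc:
            event["status"] = "FAILED"
            history.append((event["id"], str(exc)))
        event["retry_count"] += 1                # counted either way
```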

Operator REST API

The DLT Processor exposes a REST API for operator inspection and manual replay. It provides three main operations:

| Operation | Description |
| --- | --- |
| List failed events | Returns a paginated list of FAILED events with metadata. Supports filtering, sorting, and search. |
| Get event detail | Returns a single event with its full retry history — every attempt, error code, error message, and timestamp. |
| Manual replay | Accepts a list of event UUIDs and triggers re-delivery. Each replayed event goes through the same HTTP delivery pipeline as scheduled retries. |
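
The manual-replay operation might look like this sketch; `store` and `retry_one` are hypothetical stand-ins for the event repository and the shared delivery pipeline:

```python
# Sketch of manual replay: given a list of event UUIDs, re-attempt
# delivery through the same pipeline as scheduled retries and report
# the per-event result.

def replay_events(uuids, store, retry_one):
    results = {}
    for uuid in uuids:
        event = store.get(uuid)
        if event is None:
            results[uuid] = "NOT_FOUND"
            continue
        results[uuid] = "PROCESSED" if retry_one(event) else "FAILED"
    return results
```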

Cleanup Job

A configurable cleanup job removes PROCESSED DLT events:

| Configuration | Default | Description |
| --- | --- | --- |
| Cleanup interval | 60m | How often the cleanup scheduler runs |
| Cleanup enabled | false | Toggle for auto-delete of processed events |

caution

The cleanup job is disabled by default. Enable it in production to prevent the DLT store from growing indefinitely. Only PROCESSED events are deleted — FAILED events are preserved for operator inspection.
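
The cleanup rule (delete PROCESSED, preserve FAILED, do nothing while disabled) fits in a few lines; this is an illustrative sketch, not the actual job:

```python
# Sketch of the cleanup job: only PROCESSED events are deleted,
# and nothing happens while the toggle is off (the default).

def cleanup(events, enabled=False):
    if not enabled:                 # disabled by default
        return events
    return [e for e in events if e["status"] != "PROCESSED"]
```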

Event Status Lifecycle

Diagram

A DLT event enters the store with status NEW when consumed from a dead letter topic. A failed retry attempt moves it to FAILED; a successful scheduled retry or manual replay moves it to PROCESSED. Only PROCESSED events are eligible for cleanup.

Example Scenario

Scenario

A subscriber endpoint is down for 15 seconds. The DLT Processor captures the failed event and automatically delivers it once the endpoint recovers.

Input

  • Actor: DLT Processor (automated)
  • Resource: Failed OutboxEvent in the user lifecycle DLT
  • Action: Scheduled retry
  • Context:
    • Subscriber endpoint: temporarily unreachable
    • Retry interval: 3s
    • Max retries: 3

Expected Outcome

  1. Outbox Processor fails to deliver → NACK → message routed to DLT topic
  2. DLT Processor consumes the message, persists it with status = 'NEW'
  3. Retry 1 (t+3s): Scheduled job picks up the event, attempts HTTP delivery → 503 → error recorded in retry history, attempt count incremented
  4. Retry 2 (t+6s): Another attempt → 503 → error recorded in retry history, retry counter = 2
  5. Retry 3 (t+9s): Endpoint is back → 200 OK → status updated to PROCESSED, history records success
  6. Cleanup job (if enabled) deletes the PROCESSED record after the next cleanup cycle
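
The timeline above can be checked with a small simulation; the `simulate` helper below is purely illustrative:

```python
# Simulation of the scenario: the endpoint answers 503 twice,
# then 200 on the third scheduled retry.

def simulate(responses, max_retries=3):
    event = {"status": "NEW", "retry_count": 0, "history": []}
    for code in responses:
        if event["retry_count"] >= max_retries or event["status"] == "PROCESSED":
            break
        event["status"] = "PROCESSED" if code == 200 else "FAILED"
        event["history"].append(code)
        event["retry_count"] += 1
    return event
```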

Common Misunderstandings

  • "DLT events are permanently lost after max retries" — Events that exhaust automatic retries remain in the DLT store with FAILED status. They are not deleted. Operators can manually replay them via the REST API at any time.

  • "Manual replay bypasses authentication" — Manual replay goes through the same delivery pipeline, including authentication. The subscriber's auth credentials from the original OutboxEvent envelope are used.

  • "Cleanup job deletes FAILED events" — The cleanup job only deletes PROCESSED events. FAILED events are preserved for operator investigation.

warning

The DLT Processor's scheduled retry interval (default: 3 seconds) is aggressive by default. In production, consider increasing this interval to avoid overwhelming recovering subscriber endpoints with rapid retry attempts.

Design Notes / Best Practices

  • Review DLT events regularly — Use the operator REST API to monitor failed events. Long-standing FAILED events often indicate expired subscriber credentials, decommissioned endpoints, or permanent configuration issues.

  • Use the retry history for root cause analysis — Each DLT event records every retry attempt with error codes and messages. Use this data to distinguish transient failures (timeouts, temporary unavailability) from permanent failures (auth errors, endpoint not found).

  • Enable cleanup in production — Enable the cleanup job to prevent table growth. The default 60-minute interval is suitable for most deployments.

  • Scale DLT Processor independently — The DLT Processor uses a concurrency-safe locking mechanism for retry queries, enabling horizontal scaling. Multiple pods can process DLT events concurrently without conflicts.

tip

The event detail endpoint returns the full replay history for a specific event. This is invaluable for debugging — it shows exactly when each retry was attempted, what error occurred, and whether the final delivery succeeded.

Next Step

Continue with Audit & Observability to learn how Integration Hub audit logs are collected and how to monitor the delivery pipeline with OpenTelemetry.

What happens after all automatic retries are exhausted?

Events that exhaust all automatic retries (default: 3 in Outbox Processor + 3 in DLT Processor) remain in the DLT store with FAILED status. They are not deleted. Operators can manually replay them at any time via the operator REST API with the event UUIDs.

How do I manually replay a failed event?

Use the manual replay endpoint with a list of event UUIDs. The DLT Processor re-attempts HTTP delivery using the original OutboxEvent envelope (including endpoint and auth). The result is recorded in the retry history.

Can I see the retry history for a specific event?

Yes. The event detail endpoint returns the event along with its complete retry history — including every retry attempt, the error code, error message, and timestamp.

Does the cleanup job delete FAILED events?

No. The cleanup job only deletes events in PROCESSED status. FAILED events are preserved in the DLT store for operator inspection and manual replay.