Connector Runtime
Summary
The Connector Runtime defines the execution lifecycle, retry strategy, and error recovery mechanisms of Integration Hub's delivery pipeline. When the Outbox Processor fails to deliver an event after retrying, the message is routed to a dead letter topic (DLT). The DLT Processor consumes these failed events, persists them to the relational database, and automatically retries delivery on a configurable schedule. Each retry attempt is recorded in a history log. Operators can inspect failed events and trigger manual replays via a REST API. A configurable cleanup job removes successfully processed DLT events.
Why It Exists
In a distributed delivery system, transient failures are inevitable — subscriber endpoints go down, network partitions occur, authentication tokens expire. The Connector Runtime provides a multi-layered retry strategy:
- Immediate retry — The Outbox Processor retries delivery up to a configurable number of times (default: 3) before NACKing
- Scheduled retry — The DLT Processor automatically retries failed events on a configurable interval
- Manual replay — Operators can inspect and manually retrigger delivery for specific events via REST API
This layered approach ensures that transient failures are recovered automatically while persistent failures are surfaced for operator intervention.
Where It Fits in Keymate
The Connector Runtime is the reliability and operations layer of the Integration Hub pipeline. It handles events that the Connector & Delivery Model failed to deliver and provides operators with visibility into delivery failures.
Boundaries
In scope:
- DLT topic consumption
- DLT event persistence and retry history
- Scheduled retry mechanism
- Operator REST API (list, detail, replay)
- Cleanup of processed DLT events
- Event status lifecycle (`NEW` → `PROCESSED`/`FAILED`)
Out of scope:
- Initial HTTP delivery → Connector & Delivery Model
- Subscription management → Consumer Model
- Topic routing → Event-Driven Distribution
How It Works
Retry Layers
DLT Processor: Event Consumption
The DLT Processor consumes all four DLT topics:
| Consumer Configuration | Value |
|---|---|
| Deserializer | Protobuf deserializer |
| Protobuf type | OutboxEvent |
| Offset strategy | earliest |
| Commit mode | Manual ACK/NACK |
At startup, the DLT Processor subscribes to the merged DLT stream, parses each message into an OutboxEvent, and persists it to the database with status NEW.
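The consume-parse-persist step with manual commit can be sketched as follows. This is a minimal illustration with in-memory stand-ins for the Kafka stream, the Protobuf parser, and the database; all names here are hypothetical, not the actual implementation.

```python
import uuid
from datetime import datetime, timezone

# In-memory stand-in for the relational DLT event store.
dlt_events: dict[str, dict] = {}

def parse_outbox_event(raw: bytes) -> dict:
    # Stand-in for Protobuf parsing of an OutboxEvent; here we just decode text.
    return {"payload": raw.decode("utf-8")}

def on_dlt_message(raw: bytes, ack, nack) -> None:
    """Persist a failed event with status NEW, then ACK; NACK on any error."""
    try:
        event = parse_outbox_event(raw)
        event_id = str(uuid.uuid4())
        dlt_events[event_id] = {
            "raw": raw,                 # raw Protobuf bytes of the OutboxEvent
            "parsed": event,            # human-readable representation
            "status": "NEW",
            "retry_count": 0,
            "created_at": datetime.now(timezone.utc),
            "updated_at": datetime.now(timezone.utc),
        }
        ack()   # manual commit: only after the event is safely persisted
    except Exception:
        nack()  # leave the offset uncommitted so the message is redelivered

# Usage: handle one message from the merged DLT stream.
on_dlt_message(b"user.created#42", ack=lambda: None, nack=lambda: None)
```

Committing only after persistence is what makes the handoff from Kafka to the database at-least-once rather than at-most-once.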
Data Model
The DLT Processor persists failed events in two related data stores:
DLT Events — The primary store for failed events. Each record contains:
| Field | Description |
|---|---|
| Event data | Raw Protobuf bytes of the OutboxEvent and a human-readable parsed representation |
| Status | NEW, FAILED, or PROCESSED |
| Retry counter | Number of retry attempts made |
| Timestamps | Creation and last update timestamps |
Retry History — A complete audit trail of every retry attempt for each DLT event. Each entry records:
| Field | Description |
|---|---|
| Error code | HTTP status code or error classification |
| Error message | Error description |
| Error cause | Root cause details |
| Process date | When the retry attempt occurred |
Deleting a DLT event automatically cascades to its retry history records.
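One way to picture the two stores and the cascade is a pair of records linked by event id. Field names here are illustrative, taken from the tables above rather than from the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RetryHistoryEntry:
    event_id: str
    error_code: str        # HTTP status code or error classification
    error_message: str
    error_cause: str
    process_date: datetime  # when the retry attempt occurred

@dataclass
class DltEvent:
    event_id: str
    raw: bytes             # raw Protobuf bytes of the OutboxEvent
    parsed: str            # human-readable representation
    status: str = "NEW"    # NEW, FAILED, or PROCESSED
    retry_count: int = 0

events: dict[str, DltEvent] = {}
history: list[RetryHistoryEntry] = []

def delete_event(event_id: str) -> None:
    """Deleting a DLT event cascades to its retry history records."""
    events.pop(event_id, None)
    history[:] = [h for h in history if h.event_id != event_id]

# Usage: persist one event with one failed attempt, then delete it.
events["e1"] = DltEvent("e1", b"\x08\x01", "user.created")
history.append(RetryHistoryEntry(
    "e1", "503", "Service Unavailable", "connect timeout",
    datetime.now(timezone.utc)))
delete_event("e1")
```

In the real schema this cascade would typically be a foreign-key `ON DELETE CASCADE`; the helper above just makes the relationship explicit.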
Scheduled Retry
The DLT Processor runs a scheduled job that retries failed events:
| Configuration | Default | Description |
|---|---|---|
| Retry interval | 3s | How often the retry scheduler runs |
| Maximum retry count | 3 | Maximum retry attempts before marking as FAILED |
| Batch size | 100 | Number of records per retry cycle |
| Delivery timeout | 5s | HTTP delivery timeout for DLT retries |
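In a deployment manifest these settings might look roughly like the following. The property names are illustrative only; the actual keys depend on your configuration scheme.

```yaml
# Illustrative only — not the actual property names.
dlt:
  retry:
    interval: 3s      # how often the retry scheduler runs
    max-count: 3      # attempts before an event stays FAILED
    batch-size: 100   # records claimed per retry cycle
  delivery:
    timeout: 5s       # HTTP timeout for DLT retries
```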
The retry flow:
- Query failed events where `status = 'FAILED'` and retry count has not been exhausted, using a concurrency-safe locking mechanism
- For each event: extract endpoint and auth from the stored `OutboxEvent`, attempt HTTP delivery
- On success: update status to `PROCESSED`, increment retry counter, record result in retry history
- On failure: keep status as `FAILED`, increment retry counter, record error details in retry history
- When `replayed >= retry.count`: event remains `FAILED`; no further automatic retries
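The retry flow above can be sketched as one pass of the scheduler over an in-memory store. The store shape, the `deliver` stand-in for HTTP delivery, and the inclusion of `NEW` events in the query (so freshly consumed events get a first attempt) are assumptions of this sketch.

```python
from datetime import datetime, timezone

MAX_RETRIES = 3   # "Maximum retry count" (default 3)
BATCH_SIZE = 100  # records per retry cycle

def retry_cycle(events: list[dict], history: list[dict], deliver) -> None:
    """One pass of the scheduled retry job.

    `deliver(event)` stands in for HTTP delivery using the endpoint and
    auth stored in the OutboxEvent; it returns an HTTP status code.
    """
    # Query retryable events whose retry budget is not exhausted.
    batch = [e for e in events
             if e["status"] in ("NEW", "FAILED")
             and e["retry_count"] < MAX_RETRIES][:BATCH_SIZE]
    for event in batch:
        code = deliver(event)
        event["retry_count"] += 1
        # Success moves the event to PROCESSED; failure keeps it FAILED.
        event["status"] = "PROCESSED" if code == 200 else "FAILED"
        # Every attempt — success or failure — lands in the history.
        history.append({
            "event_id": event["id"],
            "error_code": code,
            "process_date": datetime.now(timezone.utc),
        })

# Usage: an endpoint that fails twice, then recovers on the third cycle.
responses = iter([503, 503, 200])
events = [{"id": "e1", "status": "NEW", "retry_count": 0}]
history: list[dict] = []
for _ in range(3):
    retry_cycle(events, history, deliver=lambda e: next(responses))
```

After the third cycle the event is `PROCESSED` with three history entries, mirroring the example scenario later on this page.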
Operator REST API
The DLT Processor exposes a REST API for operator inspection and manual replay. It provides three main operations:
| Operation | Description |
|---|---|
| List failed events | Returns a paginated list of FAILED events with metadata. Supports filtering, sorting, and search. |
| Get event detail | Returns a single event with its full retry history — every attempt, error code, error message, and timestamp. |
| Manual replay | Accepts a list of event UUIDs and triggers re-delivery. Each replayed event goes through the same HTTP delivery pipeline as scheduled retries. |
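The replay operation can be as thin as re-running delivery for the requested UUIDs. This sketch uses the same in-memory shapes as above; the function name, the outcome values, and the `deliver` stand-in are hypothetical.

```python
def replay(event_ids: list[str], events: dict[str, dict], deliver) -> dict[str, str]:
    """Re-trigger delivery for the given UUIDs; returns a per-event outcome.

    Replays go through the same delivery path as scheduled retries:
    `deliver` stands in for HTTP delivery with the original endpoint/auth.
    """
    outcome = {}
    for event_id in event_ids:
        event = events.get(event_id)
        if event is None:
            outcome[event_id] = "NOT_FOUND"
            continue
        code = deliver(event)
        event["retry_count"] += 1
        event["status"] = "PROCESSED" if code == 200 else "FAILED"
        outcome[event_id] = event["status"]
    return outcome

# Usage: replay one FAILED event against a recovered endpoint.
events = {"e1": {"status": "FAILED", "retry_count": 3}}
result = replay(["e1", "missing"], events, deliver=lambda e: 200)
```

Note that a manual replay can succeed even after the automatic retry budget is exhausted; the retry counter keeps incrementing, but replays are operator-initiated and not gated by the maximum.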
Cleanup Job
A configurable cleanup job removes PROCESSED DLT events:
| Configuration | Default | Description |
|---|---|---|
| Cleanup interval | 60m | How often the cleanup scheduler runs |
| Cleanup enabled | false | Toggle for auto-delete of processed events |
The cleanup job is disabled by default. Enable it in production to prevent the DLT store from growing indefinitely. Only PROCESSED events are deleted — FAILED events are preserved for operator inspection.
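The cleanup rule ("delete `PROCESSED`, preserve `FAILED`") reduces to one filtered delete. A minimal sketch over the same in-memory store, with the enable flag mirroring the default-off behavior:

```python
CLEANUP_ENABLED = True  # disabled by default in the real system

def cleanup(events: dict[str, dict]) -> int:
    """Delete PROCESSED events; FAILED events are always preserved."""
    if not CLEANUP_ENABLED:
        return 0
    doomed = [eid for eid, e in events.items() if e["status"] == "PROCESSED"]
    for eid in doomed:
        del events[eid]  # cascades to retry history in the real schema
    return len(doomed)

# Usage: one processed and one failed event.
events = {
    "ok":  {"status": "PROCESSED"},
    "bad": {"status": "FAILED"},
}
removed = cleanup(events)
```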
Event Status Lifecycle
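The lifecycle can be captured as a tiny state machine. The transition table below is inferred from this page (a retry moves `NEW`/`FAILED` to `PROCESSED` on success or `FAILED` on failure; `PROCESSED` is terminal), not taken from the implementation.

```python
from enum import Enum

class Status(Enum):
    NEW = "NEW"
    FAILED = "FAILED"
    PROCESSED = "PROCESSED"

# Transitions inferred from this page — an assumption of this sketch.
ALLOWED = {
    Status.NEW: {Status.PROCESSED, Status.FAILED},
    Status.FAILED: {Status.PROCESSED, Status.FAILED},
    Status.PROCESSED: set(),  # terminal: only the cleanup job removes it
}

def transition(current: Status, nxt: Status) -> Status:
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```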
Example Scenario
Scenario
A subscriber endpoint is down for 15 seconds. The DLT Processor captures the failed event and automatically delivers it once the endpoint recovers.
Input
- Actor: DLT Processor (automated)
- Resource: Failed `OutboxEvent` in the user lifecycle DLT
- Action: Scheduled retry
- Context:
  - Subscriber endpoint: temporarily unreachable
  - Retry interval: `3s`
  - Max retries: `3`
Expected Outcome
- Outbox Processor fails to deliver → NACK → message routed to DLT topic
- DLT Processor consumes the message, persists it with `status = 'NEW'`
- Retry 1 (t+3s): Scheduled job picks up the event, attempts HTTP delivery → `503` → error recorded in retry history, attempt count incremented
- Retry 2 (t+6s): Another attempt → `503` → history recorded, `replayed = 2`
- Retry 3 (t+9s): Endpoint is back → `200 OK` → status updated to `PROCESSED`, history records success
- Cleanup job (if enabled) deletes the `PROCESSED` record after the next cleanup cycle
Common Misunderstandings
- "DLT events are permanently lost after max retries" — Events that exhaust automatic retries remain in the DLT store with `FAILED` status. They are not deleted. Operators can manually replay them via the REST API at any time.
- "Manual replay bypasses authentication" — Manual replay goes through the same delivery pipeline, including authentication. The subscriber's auth credentials from the original `OutboxEvent` envelope are used.
- "Cleanup job deletes FAILED events" — The cleanup job only deletes `PROCESSED` events. `FAILED` events are preserved for operator investigation.
The DLT Processor's scheduled retry interval (default: 3 seconds) is aggressive by default. In production, consider increasing this interval to avoid overwhelming recovering subscriber endpoints with rapid retry attempts.
Design Notes / Best Practices
- Review DLT events regularly — Use the operator REST API to monitor failed events. Long-standing `FAILED` events often indicate expired subscriber credentials, decommissioned endpoints, or permanent configuration issues.
- Use the retry history for root cause analysis — Each DLT event records every retry attempt with error codes and messages. Use this data to distinguish transient failures (timeouts, temporary unavailability) from permanent failures (auth errors, endpoint not found).
- Enable cleanup in production — Enable the cleanup job to prevent table growth. The default 60-minute interval is suitable for most deployments.
- Scale DLT Processor independently — The DLT Processor uses a concurrency-safe locking mechanism for retry queries, enabling horizontal scaling. Multiple pods can process DLT events concurrently without conflicts.
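The concurrency-safe query can be pictured as an atomic "claim" step: each worker takes a disjoint batch, so two pods never retry the same event. In a relational database this is typically row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED`, which is an assumption here, not a detail from this page); the sketch below simulates it with a process-local lock.

```python
import threading

_claim_lock = threading.Lock()
_claimed: set[str] = set()

def claim_batch(events: list[dict], batch_size: int) -> list[dict]:
    """Atomically claim up to batch_size retryable events for one worker."""
    with _claim_lock:
        batch = []
        for e in events:
            if len(batch) == batch_size:
                break
            if e["id"] not in _claimed and e["status"] == "FAILED":
                _claimed.add(e["id"])
                batch.append(e)
        return batch

def release(batch: list[dict]) -> None:
    """Return claimed events to the pool after the retry attempt finishes."""
    with _claim_lock:
        for e in batch:
            _claimed.discard(e["id"])

# Usage: two "workers" claim from the same store without overlap.
events = [{"id": f"e{i}", "status": "FAILED"} for i in range(5)]
a = claim_batch(events, 3)
b = claim_batch(events, 3)
```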
The event detail endpoint returns the full replay history for a specific event. This is invaluable for debugging — it shows exactly when each retry was attempted, what error occurred, and whether the final delivery succeeded.
Next Step
Continue with Audit & Observability to learn how Integration Hub audit logs are collected and how to monitor the delivery pipeline with OpenTelemetry.
Related Docs
Overview
The four-service architecture and where DLT processing fits.
Connector & Delivery Model
Initial HTTP delivery that precedes DLT routing.
Event-Driven Distribution
Domain topics and their paired DLT topics.
Consumer Model
Subscription configuration that determines delivery targets.
What happens after all automatic retries are exhausted?
Events that exhaust all automatic retries (default: 3 in Outbox Processor + 3 in DLT Processor) remain in the DLT store with FAILED status. They are not deleted. Operators can manually replay them at any time via the operator REST API with the event UUIDs.
How do I manually replay a failed event?
Use the manual replay endpoint with a list of event UUIDs. The DLT Processor re-attempts HTTP delivery using the original OutboxEvent envelope (including endpoint and auth). The result is recorded in the retry history.
Can I see the retry history for a specific event?
Yes. The event detail endpoint returns the event along with its complete retry history — including every retry attempt, the error code, error message, and timestamp.
Does the cleanup job delete FAILED events?
No. The cleanup job only deletes events in PROCESSED status. FAILED events are preserved in the DLT store for operator inspection and manual replay.