Decision Trace & Explainability

Goal

Trace, diagnose, and replay failed platform events using the DLT (Dead Letter Topic) Management module in the Admin Console. By the end of this guide, you will know how to browse the dead-letter event list, drill into a single event to read its full error history, and replay one or more events after the underlying issue has been resolved.

Audience

This guide is intended for platform operators and site-reliability engineers (SREs) who are responsible for monitoring event delivery health and resolving delivery failures across the Keymate platform.

Prerequisites

- Access to the Admin Console with the `dlts:read` permission scope (required for viewing events and event details).
- The `dlts:update` permission scope (required for triggering event replay).
- Familiarity with the event-driven architecture of Keymate. If you are new to this topic, review the Admin Console overview first.
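
The scope requirements can be summarized as a small lookup. This is a minimal sketch, assuming the scopes granted to an operator are available as a set of strings. The action names and the `missing_scopes` helper are hypothetical; only the `dlts:read` and `dlts:update` scope strings come from this guide.

```python
# Illustrative only: the action names are hypothetical. The two scope
# strings are the ones listed in the prerequisites.
REQUIRED_SCOPES = {
    "view_events": {"dlts:read"},
    "view_event_detail": {"dlts:read"},
    "replay_events": {"dlts:update"},
}


def missing_scopes(action: str, granted: set[str]) -> set[str]:
    """Return the scopes still needed to perform the given action."""
    return REQUIRED_SCOPES[action] - granted


read_only = {"dlts:read"}
print(missing_scopes("view_events", read_only))    # set()
print(missing_scopes("replay_events", read_only))  # {'dlts:update'}
```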

Before You Start

What is a Dead Letter Topic?

When the platform publishes an event to a downstream subscriber and delivery fails (for example, because the subscriber endpoint is unreachable, returns an error status code, or times out), the event is moved to a Dead Letter Topic (DLT). The DLT preserves the complete event payload together with the error context (error code, error message, and optional error cause) so that operators can:

- Diagnose: Understand why delivery failed by inspecting the error history timeline.
- Resolve: Fix the root cause (for example, restore the subscriber endpoint or correct the configuration).
- Replay: Re-deliver the event to the original subscriber once the issue is resolved.

Each failed delivery attempt is recorded as a separate entry in the event's history, giving you a chronological view of every retry and its outcome.
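
To make the record shape concrete, here is a minimal sketch of a dead-letter record that keeps the payload and appends one history entry per failed attempt. The class names, the snake_case attribute names, and the sample values are assumptions for illustration; this is not the platform's internal model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FailedAttempt:
    # Error context kept for one failed delivery attempt.
    error_code: str
    error_message: str
    process_date: str                  # ISO 8601 timestamp (assumed format)
    error_cause: Optional[str] = None  # optional, as described above


@dataclass
class DeadLetterRecord:
    payload: dict                      # complete original event payload
    history: list[FailedAttempt] = field(default_factory=list)

    def record_failure(self, attempt: FailedAttempt) -> None:
        # Each failed attempt is appended rather than overwritten, so the
        # history stays a chronological log of retries.
        self.history.append(attempt)


record = DeadLetterRecord(payload={"type": "EXAMPLE_EVENT"})
record.record_failure(FailedAttempt("503", "Service Unavailable", "2024-01-01T10:00:00Z"))
record.record_failure(FailedAttempt("TIMEOUT", "No response", "2024-01-01T10:05:00Z"))
print(len(record.history))  # 2
```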

Steps

1. Navigate to DLT Management

Open the Admin Console and select Observability in the sidebar navigation. Then select DLT Management. This opens the event list at the route `/observability/dlt-management`.

2. Browse and filter DLT events

The list page displays all failed events in a paginated table. By default the table shows 20 events per page, sorted by createdAt in descending order (newest first).

You can interact with the list in the following ways:

Action	How
Search	Type a keyword in the search field. The search applies across event metadata fields such as event type, resource type, and source service.
Sort	Click a column header to sort by `createdAt`, `updatedAt`, or `replayed`. Toggle between ascending and descending order.
Paginate	Use the pagination controls at the bottom of the table to move between pages. The current offset and total row count are displayed.
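
The default ordering and paging behavior can be reproduced locally, for example when working with an exported list of events. This is a minimal sketch that assumes events are dictionaries keyed by the field names used in this guide and that timestamps are ISO 8601 strings in a single format. The `page_of` helper and the sample data are hypothetical and do not call any platform API.

```python
SORTABLE_COLUMNS = {"createdAt", "updatedAt", "replayed"}


def page_of(events, offset=0, limit=20, sort_by="createdAt", descending=True):
    """Return one page of events plus the total row count."""
    if sort_by not in SORTABLE_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by}")
    ordered = sorted(events, key=lambda e: e[sort_by], reverse=descending)
    return ordered[offset:offset + limit], len(ordered)


# 45 synthetic rows; same-format ISO 8601 strings sort chronologically.
events = [
    {
        "id": i,
        "createdAt": f"2024-01-01T00:{i:02d}:00Z",
        "updatedAt": f"2024-01-01T00:{i:02d}:00Z",
        "replayed": 0,
    }
    for i in range(45)
]

first_page, total = page_of(events)
print(len(first_page), total, first_page[0]["id"])  # 20 45 44

last_page, _ = page_of(events, offset=40)
print(len(last_page))  # 5
```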

Each row in the table shows the following fields:

Field	Description
`id`	Unique identifier for the dead-letter record
`eventType`	The type of the original platform event
`operationType`	The operation that triggered the event (for example, `CREATE`, `UPDATE`)
`resourceType`	The type of resource the event pertains to
`sourceService`	The service that originally produced the event
`serviceName`	The name of the service associated with the event
`replayed`	Number of times this event has been replayed
`createdAt`	Timestamp when the event entered the DLT
`updatedAt`	Timestamp of the most recent status change

3. View event detail

Click any row to open the event detail page at `/observability/dlt-management/{id}`. The detail view surfaces the full event context organized into several sections.

Event metadata

The top section shows the following event metadata:

Field	Description
`id`	Dead-letter record identifier
`status`	Current status of the event
`replayed`	Replay count
`headers`	HTTP headers captured at the time of the failed delivery (key-value map)
`createdAt`	When the event entered the DLT
`updatedAt`	When the record was last modified

Parsed event payload

The event section contains the full parsed event structure:

- `eventId`: The original event identifier.
- `createdAt`: When the original event was created.
- `payload`: The inner event body, which includes:
  - `id`: Resource identifier
  - `type`: Event type classifier
  - `error`: Error value associated with the event (may be null)
  - `userId`: The user associated with the event (optional)
  - `realmId`: The realm in which the event occurred (optional)
  - `clientId`: The client application identifier (optional)
  - `eventTime`: Timestamp of the original event
  - `ipAddress`: Source IP address (optional)
  - `sessionId`: Session identifier (optional)
  - `detailsJson`: Arbitrary key-value map with additional event-specific data (optional)
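
When inspecting a copied payload outside the console, it can help to check which fields are present. The sketch below treats every field not marked optional in the list above as required, which is an assumption rather than a documented contract. The `summarize_payload` helper and the sample values are hypothetical.

```python
# Assumption: fields not marked "(optional)" above are always present.
# "error" may be null, but its key is assumed to exist.
REQUIRED_FIELDS = {"id", "type", "error", "eventTime"}
OPTIONAL_FIELDS = {
    "userId", "realmId", "clientId", "ipAddress", "sessionId", "detailsJson",
}


def summarize_payload(payload: dict):
    """Return (missing required fields, optional fields that are present)."""
    keys = set(payload)
    missing = sorted(REQUIRED_FIELDS - keys)
    optional_present = sorted(OPTIONAL_FIELDS & keys)
    return missing, optional_present


sample = {
    "id": "res-1",
    "type": "EXAMPLE_EVENT",
    "error": None,
    "eventTime": "2024-01-01T10:00:00Z",
    "userId": "user-1",
}
print(summarize_payload(sample))  # ([], ['userId'])
```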

Event source

Nested inside the payload, the eventSource object records origin metadata:

Field	Description
`id`	Event source identifier
`createdAt`	Source creation timestamp
`eventType`	Event type at the source
`resourceType`	Resource type at the source
`operationType`	Operation type at the source
`sourceService`	Originating service name

Subscription info

The subscription object describes the delivery target that failed:

Field	Description
`id`	Subscription identifier
`deliveryEndpoint`	The URL the platform attempted to deliver the event to
`subscriberService`	Name of the subscriber service
`callbackProtocol`	Protocol used for delivery (for example, `HTTP`, `HTTPS`)
`requestMethod`	HTTP method used for delivery (for example, `POST`)
`payloadType`	Format of the payload sent to the subscriber
`active`	Whether the subscription is currently active

4. Diagnose failures using the error history

Scroll down to the History section. Each entry in the history array represents a single failed delivery attempt and contains:

Field	Description
`errorCode`	The error code returned (for example, an HTTP status code or internal error code)
`errorMessage`	A human-readable description of the failure
`errorCause`	Additional cause information, if available (optional)
`processDate`	Timestamp of the failed delivery attempt

Review the history entries in chronological order to identify patterns:

- Recurring identical error codes may indicate a persistent infrastructure issue, such as a subscriber service that is down.
- Changing error messages across entries may suggest intermittent connectivity problems or a configuration that was partially corrected.
- The `errorCause` field, when present, often contains stack traces or upstream error details that help narrow down the root cause.
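
The pattern check described above can be sketched as a small helper that orders the history by `processDate` and reports whether the error codes repeat. It assumes `processDate` values are ISO 8601 strings in a single format. The labels `persistent` and `varying` and the sample entries are hypothetical.

```python
from collections import Counter


def summarize_history(history):
    """Classify history entries by how their error codes repeat.

    Returns (label, most common error code).
    """
    ordered = sorted(history, key=lambda h: h["processDate"])
    codes = [h["errorCode"] for h in ordered]
    if not codes:
        return "empty", None
    counts = Counter(codes)
    if len(counts) == 1:
        # Same code on every attempt: likely a persistent issue.
        return "persistent", codes[0]
    # Codes change between attempts: possibly intermittent or partly fixed.
    return "varying", counts.most_common(1)[0][0]


history = [
    {"errorCode": "503", "errorMessage": "Service Unavailable",
     "processDate": "2024-01-01T10:00:00Z"},
    {"errorCode": "503", "errorMessage": "Service Unavailable",
     "processDate": "2024-01-01T10:05:00Z"},
]
print(summarize_history(history))  # ('persistent', '503')
```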

Tip: Cross-reference the `headers` from the event metadata with the Decision Trace Headers reference to correlate authorization-related failures with decision trace identifiers.

5. Replay failed events

After you resolve the root cause of the failure, replay the event to re-deliver it to the original subscriber.

Replay from the list page

1. Select one or more events using the checkboxes in the event list. At least one event must be selected.
2. Click the Replay action button. The platform sends the selected event IDs for reprocessing.
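
The selection rule can be expressed as a guard in front of whatever client submits the replay. This is a minimal sketch: the `build_replay_request` helper and the `ids` field name are hypothetical, and the actual request shape used by the Admin Console is not documented here.

```python
def build_replay_request(selected_ids):
    """Build a replay request body from the selected event IDs."""
    # dict.fromkeys drops duplicates while keeping the selection order.
    unique_ids = list(dict.fromkeys(selected_ids))
    if not unique_ids:
        # Mirrors the rule that at least one event must be selected.
        raise ValueError("select at least one event to replay")
    return {"ids": unique_ids}  # "ids" is a hypothetical field name


print(build_replay_request(["evt-1", "evt-2", "evt-1"]))
# {'ids': ['evt-1', 'evt-2']}
```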

Replay from the detail page

Open the event detail page.
Click the Replay action for the individual event.

Review replay results

After a replay operation completes, the response includes:

Field	Description
`success`	Whether the overall replay operation succeeded
`message`	A summary message describing the outcome
`processedCount`	Number of events that were redelivered
`failedCount`	Number of events that failed redelivery again
`data`	An array of per-event results, each containing the event record, a `processed` boolean, a `message`, and an optional `statusCode`

- If all events are redelivered, a success notification appears with the processed count.
- If some events fail, a warning notification appears showing both the processed count and the failed count.
- If the entire operation fails, an error notification appears.

After replay, the replayed counter on each successfully reprocessed event increments, and the event list refreshes automatically.
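
The three notification outcomes can be derived from the response fields listed above. The sketch below is one plausible mapping, not the console's actual logic: it assumes a partial failure means both counts are greater than zero, and that any other response without `success` set to true counts as a full failure. The `notification_for` helper and the message wording are hypothetical.

```python
def notification_for(response):
    """Map a replay response to a (level, text) pair."""
    processed = response.get("processedCount", 0)
    failed = response.get("failedCount", 0)
    if processed > 0 and failed > 0:
        # Some events were redelivered and some failed again.
        return "warning", f"{processed} replayed, {failed} failed"
    if response.get("success") is True and failed == 0:
        return "success", f"{processed} replayed"
    # Assumption: anything else is treated as a failed operation.
    return "error", response.get("message", "Replay failed")


print(notification_for({"success": True, "processedCount": 2, "failedCount": 0}))
# ('success', '2 replayed')
print(notification_for({"success": False, "processedCount": 1, "failedCount": 1}))
# ('warning', '1 replayed, 1 failed')
```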

Note: Replaying an event requires the `dlts:update` permission scope. If you do not have this permission, the replay action is not available.

Validation Scenario

Scenario

Confirm that you can find a failed event, inspect its error history, replay it, and verify the result.

Expected Result

The replayed event's replayed count increments by one, and the processedCount in the replay response equals the number of events you selected.

How to Verify

1. Open Observability > DLT Management.
2. Use the search field to locate an event by its event type or source service.
3. Click the event to open the detail page. Confirm that the History section displays at least one error entry with the `errorCode`, `errorMessage`, and `processDate` fields populated.
4. Note the current value of the `replayed` field.
5. Click Replay.
6. Verify:
   - UI evidence: A success notification appears. The `replayed` counter on the event detail page increments by one.
   - API evidence: The replay response contains `success: true` and `processedCount: 1`.
   - List evidence: Return to the event list. The `replayed` column for the event reflects the updated count.
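
If you capture the `replayed` value before and after the replay together with the replay response, the checks can be scripted. This is a minimal sketch under that assumption; the `verify_replay` helper is hypothetical and only compares values you have already collected.

```python
def verify_replay(replayed_before, replayed_after, response, selected_count=1):
    """Return a list of failed checks; an empty list means all checks passed."""
    problems = []
    if replayed_after != replayed_before + 1:
        problems.append(
            f"replayed went from {replayed_before} to {replayed_after}, expected +1"
        )
    if response.get("success") is not True:
        problems.append("response success is not true")
    if response.get("processedCount") != selected_count:
        problems.append(
            f"processedCount is {response.get('processedCount')}, "
            f"expected {selected_count}"
        )
    return problems


ok = verify_replay(0, 1, {"success": True, "processedCount": 1})
print(ok)  # []
```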

Troubleshooting

Replay fails again immediately

Symptom: You replay an event and the failedCount in the response is greater than zero.

Cause: The subscriber endpoint is still unreachable, or the underlying issue has not been resolved.

Resolution: Verify that the deliveryEndpoint shown in the subscription section of the event detail is reachable. Check the subscriber service health and network connectivity before attempting another replay.
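
A quick TCP-level probe of the `deliveryEndpoint` can be sketched with the Python standard library. The helper names and the example URL are hypothetical. A successful connection only shows that the host accepts connections on that port from the machine running the check; it does not confirm that the subscriber will return a success status, and the platform may reach the endpoint over a different network path.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(delivery_endpoint):
    """Extract (host, port) from an endpoint URL, applying scheme defaults."""
    parts = urlsplit(delivery_endpoint)
    if parts.hostname is None:
        raise ValueError(f"no host in endpoint: {delivery_endpoint!r}")
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"cannot infer port for scheme {parts.scheme!r}")
    return parts.hostname, port


def is_reachable(delivery_endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint host succeeds."""
    host, port = host_and_port(delivery_endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


print(host_and_port("https://subscriber.example.com/hooks/events"))
# ('subscriber.example.com', 443)
```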

Events do not appear in the list

Symptom: You expect to see failed events, but the DLT event list is empty or does not contain the events you are looking for.

Cause: The search filter or sort order may be hiding the events. The default sort is createdAt descending, so older events appear on later pages.

Resolution: Clear the search field, reset the sort order to the default, and check subsequent pages. If events are still missing, confirm that the subscriber configuration is set up to route failures to the DLT.

Replay action is not available

Symptom: The replay button is disabled or not visible.

Cause: Your account does not have the dlts:update permission scope.

Resolution: Contact your platform administrator to request the dlts:update scope on the dlts resource.

Next Steps

After mastering DLT event tracing and replay, expand your observability workflow:

- Monitor platform health holistically: Visit Alerts, Logs, and Traces to learn how to set up alerts and correlate logs with failed events.
- Understand decision headers: Review the Decision Trace Headers reference for the full list of authorization decision headers that appear in event metadata.
- Manage active sessions: Use Session and Device Monitoring to investigate user sessions related to failed events.

Related Docs

Alerts, Logs, and Traces

Set up alerts and correlate logs across platform services.

Decision Trace Headers

Reference for HTTP headers injected by the authorization gateway.

Admin Console Overview

Architecture and navigation overview for the Admin Console.

Session and Device Monitoring

Monitor and manage active user sessions and devices.
