Decision Trace & Explainability

Goal

Trace, diagnose, and replay failed platform events using the DLT Management module in the Admin Console. By the end of this guide you will know how to browse the dead-letter event list, drill into a single event to read its full error history, and replay one or more events after the underlying issue has been resolved.

Audience

This guide is intended for platform operators and site-reliability engineers (SREs) who are responsible for monitoring event delivery health and resolving delivery failures across the Keymate platform.

Prerequisites

  • Access to the Admin Console with the dlts:read permission scope (required for viewing events and event details).
  • The dlts:update permission scope (required for triggering event replay).
  • Familiarity with the event-driven architecture of Keymate. If you are new to this topic, review the Admin Console overview first.

Before You Start

What is a Dead Letter Topic?

When the platform publishes an event to a downstream subscriber and delivery fails — for example, because the subscriber endpoint is unreachable, returns an error status code, or times out — the event is moved to a Dead Letter Topic (DLT). The DLT preserves the complete event payload together with the error context (error code, error message, and optional error cause) so that operators can:

  1. Diagnose — Understand why delivery failed by inspecting the error history timeline.
  2. Resolve — Fix the root cause (restore the subscriber endpoint, correct configuration, and so on).
  3. Replay — Re-deliver the event to the original subscriber once the issue is resolved.

Each failed delivery attempt is recorded as a separate entry in the event's history, giving you a chronological view of every retry and its outcome.

Steps

1. Navigate to DLT Management

Open the Admin Console and select Observability in the sidebar navigation. Then select DLT Management. This opens the event list at the route /observability/dlt-management.

2. Browse and filter DLT events

The list page displays all failed events in a paginated table. By default the table shows 20 events per page, sorted by createdAt in descending order (newest first).

You can interact with the list in the following ways:

| Action | How |
| --- | --- |
| Search | Type a keyword in the search field. The search applies across event metadata fields such as event type, resource type, and source service. |
| Sort | Click a column header to sort by createdAt, updatedAt, or replayed. Toggle between ascending and descending order. |
| Paginate | Use the pagination controls at the bottom of the table to move between pages. The current offset and total row count are displayed. |
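
The list behaviour described above (keyword search, sortable columns, 20 rows per page, newest first) can be sketched in memory as follows. This is a minimal illustration, not the Admin Console's implementation: `DltRow`, `list_events`, and the choice of which fields the search inspects are assumptions based on the field names in this guide.

```python
from dataclasses import dataclass
from datetime import datetime

# Columns the guide lists as sortable.
SORTABLE = {"createdAt", "updatedAt", "replayed"}


@dataclass
class DltRow:
    """Hypothetical subset of a list row; field names follow this guide."""
    id: str
    eventType: str
    resourceType: str
    sourceService: str
    replayed: int
    createdAt: datetime
    updatedAt: datetime


def list_events(rows, search="", sort_by="createdAt", descending=True,
                offset=0, limit=20):
    """Filter, sort, and slice rows the way the list page is described."""
    if sort_by not in SORTABLE:
        raise ValueError(f"cannot sort by {sort_by!r}")
    needle = search.strip().lower()
    # An empty search string matches every row.
    matched = [
        r for r in rows
        if needle in f"{r.eventType} {r.resourceType} {r.sourceService}".lower()
    ]
    matched.sort(key=lambda r: getattr(r, sort_by), reverse=descending)
    return {
        "total": len(matched),
        "offset": offset,
        "rows": matched[offset:offset + limit],
    }
```

Calling `list_events(rows)` with no arguments reproduces the documented defaults; passing `offset=20` returns the second page.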

Each row in the table shows the following fields:

| Field | Description |
| --- | --- |
| id | Unique identifier for the dead-letter record |
| eventType | The type of the original platform event |
| operationType | The operation that triggered the event (for example, CREATE, UPDATE) |
| resourceType | The type of resource the event pertains to |
| sourceService | The service that originally produced the event |
| serviceName | The name of the service associated with the event |
| replayed | Number of times this event has been replayed |
| createdAt | Timestamp when the event entered the DLT |
| updatedAt | Timestamp of the most recent status change |

3. View event detail

Click any row to open the event detail page at /observability/dlt-management/{id}. The detail view surfaces the full event context organized into several sections.

Event metadata

The top section shows the following event metadata:

| Field | Description |
| --- | --- |
| id | Dead-letter record identifier |
| status | Current status of the event |
| replayed | Replay count |
| headers | HTTP headers captured at the time of the failed delivery (key-value map) |
| createdAt | When the event entered the DLT |
| updatedAt | When the record was last modified |

Parsed event payload

The event section contains the full parsed event structure:

  • eventId — The original event identifier.
  • createdAt — When the original event was created.
  • payload — The inner event body, which includes:
    • id — Resource identifier
    • type — Event type classifier
    • error — Error value associated with the event (may be null)
    • userId — The user associated with the event (optional)
    • realmId — The realm in which the event occurred (optional)
    • clientId — The client application identifier (optional)
    • eventTime — Timestamp of the original event
    • ipAddress — Source IP address (optional)
    • sessionId — Session identifier (optional)
    • detailsJson — Arbitrary key-value map with additional event-specific data (optional)

Event source

Nested inside the payload, the eventSource object records origin metadata:

| Field | Description |
| --- | --- |
| id | Event source identifier |
| createdAt | Source creation timestamp |
| eventType | Event type at the source |
| resourceType | Resource type at the source |
| operationType | Operation type at the source |
| sourceService | Originating service name |

Subscription info

The subscription object describes the delivery target that failed:

| Field | Description |
| --- | --- |
| id | Subscription identifier |
| deliveryEndpoint | The URL the platform attempted to deliver the event to |
| subscriberService | Name of the subscriber service |
| callbackProtocol | Protocol used for delivery (for example, HTTP, HTTPS) |
| requestMethod | HTTP method used for delivery (for example, POST) |
| payloadType | Format of the payload sent to the subscriber |
| active | Whether the subscription is currently active |
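
When triaging several events, it can help to reduce the subscription fields above to a single line per event. The sketch below assumes the detail record is available as a dictionary with `subscription` and `history` as top-level keys; that placement, the `Subscription` type, and the `delivery_summary` helper are illustrative assumptions rather than a documented response shape.

```python
from typing import Any, TypedDict


class Subscription(TypedDict):
    """Subscription fields as listed in this guide."""
    id: str
    deliveryEndpoint: str
    subscriberService: str
    callbackProtocol: str
    requestMethod: str
    payloadType: str
    active: bool


def delivery_summary(record: dict[str, Any]) -> str:
    """Describe the failed delivery target of one dead-letter record."""
    # Assumption: "subscription" and "history" are top-level keys.
    sub: Subscription = record["subscription"]
    state = "active" if sub["active"] else "inactive"
    attempts = len(record.get("history", []))
    return (
        f"{sub['requestMethod']} {sub['deliveryEndpoint']} "
        f"({sub['subscriberService']}, {state}, {attempts} failed attempts)"
    )
```

An inactive subscription in the summary is worth checking before a replay, because the delivery target may have been disabled on purpose.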

4. Diagnose failures using the error history

Scroll down to the History section. Each entry in the history array represents a single failed delivery attempt and contains:

| Field | Description |
| --- | --- |
| errorCode | The error code returned (for example, an HTTP status code or internal error code) |
| errorMessage | A human-readable description of the failure |
| errorCause | Additional cause information, if available (optional) |
| processDate | Timestamp of the failed delivery attempt |

Review the history entries in chronological order to identify patterns:

  • Recurring identical error codes may indicate a persistent infrastructure issue, such as a subscriber service that is down.
  • Changing error messages across entries may suggest intermittent connectivity problems or a configuration that was partially corrected.
  • The errorCause field, when present, often contains stack traces or upstream error details that help narrow down the root cause.

Tip: Cross-reference the headers from the event metadata with the Decision Trace Headers reference to correlate authorization-related failures with decision trace identifiers.
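
The history review described in this step can be sketched as a small helper that orders the entries and labels the pattern. It is a rough heuristic under stated assumptions: `summarize_history` is a hypothetical name, the entries are dictionaries using the field names from the table above, and `processDate` values are ISO 8601 strings in a single time zone so that they sort correctly as text.

```python
from collections import Counter


def summarize_history(history):
    """Label a dead-letter history as empty, single, recurring, or changing."""
    # Assumption: processDate strings sort chronologically as plain text.
    ordered = sorted(history, key=lambda entry: entry["processDate"])
    codes = Counter(entry["errorCode"] for entry in ordered)
    messages = {entry["errorMessage"] for entry in ordered}

    if not ordered:
        pattern = "empty"
    elif len(ordered) == 1:
        pattern = "single"
    elif len(codes) == 1 and len(messages) == 1:
        # Same failure every time: likely a persistent issue.
        pattern = "recurring"
    else:
        # Codes or messages differ: possibly intermittent or partially fixed.
        pattern = "changing"

    # errorCause is optional, so keep only the entries that carry one.
    causes = [entry["errorCause"] for entry in ordered if entry.get("errorCause")]
    return {
        "pattern": pattern,
        "codes": dict(codes),
        "latest_cause": causes[-1] if causes else None,
    }
```

A "recurring" result points toward checking the subscriber's availability first, while "changing" suggests comparing the entries side by side to see when the failure mode shifted.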

5. Replay failed events

After you resolve the root cause of the failure, replay the event to re-deliver it to the original subscriber.

Replay from the list page

  1. Select one or more events using the checkboxes in the event list.
  2. Click the Replay action button.
  3. The platform sends the selected event IDs for reprocessing. At least one event must be selected.
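
The "at least one event" rule in the steps above can be expressed as a small guard that runs before any replay is submitted. The function name `selected_ids_for_replay` is hypothetical, and how the IDs are then sent to the platform is intentionally left out because this guide does not document the request format.

```python
def selected_ids_for_replay(selected):
    """Return the unique selected event IDs, keeping selection order."""
    # dict.fromkeys removes duplicates while preserving first-seen order.
    ids = list(dict.fromkeys(selected))
    if not ids:
        raise ValueError("select at least one event to replay")
    return ids
```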

Replay from the detail page

  1. Open the event detail page.
  2. Click the Replay action for the individual event.

Review replay results

After a replay operation completes, the response includes:

| Field | Description |
| --- | --- |
| success | Whether the overall replay operation succeeded |
| message | A summary message describing the outcome |
| processedCount | Number of events that were redelivered |
| failedCount | Number of events that failed redelivery again |
| data | An array of per-event results, each containing the event record, a processed boolean, a message, and an optional statusCode |

The Admin Console reports the outcome as a notification:

  • If all events are redelivered, a success notification appears with the processed count.
  • If some events fail, a warning notification appears showing both the processed count and the failed count.
  • If the entire operation fails, an error notification appears.
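
The mapping from a replay response to a notification can be sketched as follows. This is an interpretation rather than the console's actual logic: it assumes that `success: false` means the entire operation failed, and the helper names `notification_for` and `failed_items` are made up for illustration.

```python
def notification_for(result):
    """Return (level, text) for a replay response dictionary."""
    processed = result.get("processedCount", 0)
    failed = result.get("failedCount", 0)
    if not result.get("success"):
        # Assumption: success == False means the whole operation failed.
        return "error", result.get("message") or "Replay failed"
    if failed:
        return "warning", f"{processed} replayed, {failed} failed"
    return "success", f"{processed} replayed"


def failed_items(result):
    """List the per-event results whose processed flag is false."""
    return [item for item in result.get("data") or [] if not item.get("processed")]
```

The per-event entries returned by `failed_items` carry their own message and optional statusCode, which is usually the quickest way to see why a specific event failed again.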

After replay, the replayed counter on each successfully reprocessed event increments, and the event list refreshes automatically.

Note: Replaying an event requires the dlts:update permission scope. If you do not have this permission, the replay action is not available.
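
The relationship between the two permission scopes and the available actions can be summarized in a few lines. The scope names come from this guide; the function `available_actions` and the action labels are hypothetical, and the sketch assumes replay depends only on dlts:update.

```python
READ_SCOPE = "dlts:read"
UPDATE_SCOPE = "dlts:update"


def available_actions(scopes):
    """Map granted permission scopes to the DLT Management actions."""
    granted = set(scopes)
    return {
        "view_list": READ_SCOPE in granted,
        "view_detail": READ_SCOPE in granted,
        # Assumption: replay is gated on dlts:update alone.
        "replay": UPDATE_SCOPE in granted,
    }
```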

Validation Scenario

Scenario

Confirm that you can find a failed event, inspect its error history, replay it, and verify the result.

Expected Result

The event's replayed count increments by one, and processedCount in the replay response equals the number of events you selected.

How to Verify

  1. Open Observability > DLT Management.
  2. Use the search field to locate an event by its event type or source service.
  3. Click the event to open the detail page. Confirm that the History section displays at least one error entry with errorCode, errorMessage, and processDate fields populated.
  4. Note the current value of the replayed field.
  5. Click Replay.
  6. Verify:
    • UI evidence: A success notification appears. The replayed counter on the event detail page increments by one.
    • API evidence: The replay response contains success: true and processedCount: 1.
    • List evidence: Return to the event list. The replayed column for the event reflects the updated count.
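
If you record the replayed value before and after the replay, the checks above can be written down as a short function. This is only a sketch of the verification logic: `verify_replay` is a hypothetical helper, and the response is assumed to be a dictionary with the fields described under "Review replay results".

```python
def verify_replay(replayed_before, replayed_after, response, selected_count=1):
    """Return each validation check by name with a pass/fail boolean."""
    return {
        "response_success": response.get("success") is True,
        "processed_matches_selection": response.get("processedCount") == selected_count,
        "replayed_incremented": replayed_after == replayed_before + 1,
    }
```

The scenario passes when every value in the returned dictionary is true, for example `all(verify_replay(2, 3, response).values())`.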

Troubleshooting

Replay fails again immediately

Symptom: You replay an event and the failedCount in the response is greater than zero.

Cause: The subscriber endpoint is still unreachable, or the underlying issue has not been resolved.

Resolution: Verify that the deliveryEndpoint shown in the subscription section of the event detail is reachable. Check the subscriber service health and network connectivity before attempting another replay.
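
A quick way to check basic reachability before another replay is to attempt a TCP connection to the host and port of the deliveryEndpoint. The sketch below uses only the Python standard library; `endpoint_reachable` is a hypothetical helper, and a successful connection shows only that something is listening, not that the subscriber will accept the event.

```python
import socket
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def endpoint_reachable(delivery_endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parts = urlsplit(delivery_endpoint)
    # Fall back to the scheme's default port when the URL has none.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if not parts.hostname or port is None:
        raise ValueError(f"cannot derive host and port from {delivery_endpoint!r}")
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Run the check from a machine on the same network path as the platform where possible, since a result from an operator workstation may not reflect what the platform can reach.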

Events do not appear in the list

Symptom: You expect to see failed events, but the DLT event list is empty or does not contain the events you are looking for.

Cause: The search filter or sort order may be hiding the events. The default sort is createdAt descending, so older events appear on later pages.

Resolution: Clear the search field, reset the sort order to the default, and check subsequent pages. If events are still missing, confirm that the subscriber configuration is set up to route failures to the DLT.

Replay action is not available

Symptom: The replay button is disabled or not visible.

Cause: Your account does not have the dlts:update permission scope.

Resolution: Contact your platform administrator to request the dlts:update scope on the dlts resource.

Next Steps

After mastering DLT event tracing and replay, expand your observability workflow:

  • Monitor platform health holistically — Visit Alerts, Logs, and Traces to learn how to set up alerts and correlate logs with failed events.
  • Understand decision headers — Review the Decision Trace Headers reference for the full list of authorization decision headers that appear in event metadata.
  • Manage active sessions — Use Session and Device Monitoring to investigate user sessions related to failed events.