Decision Trace & Explainability
Goal
Trace, diagnose, and replay failed platform events using the DLT Management module in the Admin Console. By the end of this guide you will know how to browse the dead-letter event list, drill into a single event to read its full error history, and replay one or more events after the underlying issue has been resolved.
Audience
This guide is intended for platform operators and site-reliability engineers (SREs) who are responsible for monitoring event delivery health and resolving delivery failures across the Keymate platform.
Prerequisites
- Access to the Admin Console with the dlts:read permission scope (required for viewing events and event details).
- The dlts:update permission scope (required for triggering event replay).
- Familiarity with the event-driven architecture of Keymate. If you are new to this topic, review the Admin Console overview first.
Before You Start
What is a Dead Letter Topic?
When the platform publishes an event to a downstream subscriber and delivery fails — for example, because the subscriber endpoint is unreachable, returns an error status code, or times out — the event is moved to a Dead Letter Topic (DLT). The DLT preserves the complete event payload together with the error context (error code, error message, and optional error cause) so that operators can:
- Diagnose — Understand why delivery failed by inspecting the error history timeline.
- Resolve — Fix the root cause (restore the subscriber endpoint, correct configuration, and so on).
- Replay — Re-deliver the event to the original subscriber once the issue is resolved.
Each failed delivery attempt is recorded as a separate entry in the event's history, giving you a chronological view of every retry and its outcome.
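As a rough mental model, a dead-letter record can be pictured as the preserved event plus an append-only list of failed attempts. The sketch below is illustrative only: the class names, the record_failure helper, and the sample values are hypothetical, and only the history field names (errorCode, errorMessage, errorCause, processDate) come from this guide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class HistoryEntry:
    """One failed delivery attempt (field names follow this guide)."""
    errorCode: str
    errorMessage: str
    processDate: datetime
    errorCause: Optional[str] = None  # optional, per the guide


@dataclass
class DeadLetterRecord:
    """Hypothetical in-memory model of a dead-letter record."""
    id: str
    payload: dict
    history: list = field(default_factory=list)

    def record_failure(self, code: str, message: str,
                       cause: Optional[str] = None) -> None:
        # Each failed attempt is appended rather than overwritten, so the
        # history stays a chronological log of every retry.
        self.history.append(
            HistoryEntry(code, message, datetime.now(timezone.utc), cause)
        )


# Sample values are made up for illustration.
record = DeadLetterRecord(id="dlt-1", payload={"type": "EXAMPLE_TYPE"})
record.record_failure("503", "Service Unavailable")
record.record_failure("504", "Gateway Timeout", cause="upstream timed out")
```

Keeping attempts in an append-only list is what makes the later "read the history in order" diagnosis step possible.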
Steps
1. Navigate to DLT Management
Open the Admin Console and select Observability in the sidebar navigation. Then select DLT Management. This opens the event list at the route /observability/dlt-management.
2. Browse and filter DLT events
The list page displays all failed events in a paginated table. By default the table shows 20 events per page, sorted by createdAt in descending order (newest first).
You can interact with the list in the following ways:
| Action | How |
|---|---|
| Search | Type a keyword in the search field. The search applies across event metadata fields such as event type, resource type, and source service. |
| Sort | Click a column header to sort by createdAt, updatedAt, or replayed. Toggle between ascending and descending order. |
| Paginate | Use the pagination controls at the bottom of the table to move between pages. The current offset and total row count are displayed. |
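The default view (20 rows per page, newest first) can be pictured with a small client-side sketch. This guide does not describe how the Admin Console implements paging, so the page_of function, its parameters, and the sample rows are assumptions; only the default page size, the sortable fields, and the offset and total values come from the text above.

```python
SORTABLE_FIELDS = {"createdAt", "updatedAt", "replayed"}


def page_of(events, page=1, page_size=20, sort_by="createdAt", descending=True):
    """Return one page of events plus the offset and total row count."""
    if sort_by not in SORTABLE_FIELDS:
        raise ValueError(f"unsupported sort field: {sort_by}")
    if page < 1:
        raise ValueError("page numbers start at 1")
    ordered = sorted(events, key=lambda e: e[sort_by], reverse=descending)
    offset = (page - 1) * page_size
    return {
        "rows": ordered[offset:offset + page_size],
        "offset": offset,
        "total": len(ordered),
    }


# 25 made-up events; ISO 8601 timestamps in one format sort correctly as strings.
events = [
    {"id": i, "createdAt": f"2024-01-{i:02d}T00:00:00Z",
     "updatedAt": f"2024-01-{i:02d}T00:00:00Z", "replayed": 0}
    for i in range(1, 26)
]
first = page_of(events)           # defaults: 20 rows, createdAt descending
second = page_of(events, page=2)  # the remaining 5 rows, oldest last
```

With the defaults, the oldest events land on the last page, which is worth remembering when an expected event seems to be missing.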
Each row in the table shows the following fields:
| Field | Description |
|---|---|
| id | Unique identifier for the dead-letter record |
| eventType | The type of the original platform event |
| operationType | The operation that triggered the event (for example, CREATE, UPDATE) |
| resourceType | The type of resource the event pertains to |
| sourceService | The service that originally produced the event |
| serviceName | The name of the service associated with the event |
| replayed | Number of times this event has been replayed |
| createdAt | Timestamp when the event entered the DLT |
| updatedAt | Timestamp of the most recent status change |
3. View event detail
Click any row to open the event detail page at /observability/dlt-management/{id}. The detail view surfaces the full event context organized into several sections.
Event metadata
The top section shows the following event metadata:
| Field | Description |
|---|---|
| id | Dead-letter record identifier |
| status | Current status of the event |
| replayed | Replay count |
| headers | HTTP headers captured at the time of the failed delivery (key-value map) |
| createdAt | When the event entered the DLT |
| updatedAt | When the record was last modified |
Parsed event payload
The event section contains the full parsed event structure:
- eventId: The original event identifier.
- createdAt: When the original event was created.
- payload: The inner event body, which includes:
  - id: Resource identifier
  - type: Event type classifier
  - error: Error value associated with the event (may be null)
  - userId: The user associated with the event (optional)
  - realmId: The realm in which the event occurred (optional)
  - clientId: The client application identifier (optional)
  - eventTime: Timestamp of the original event
  - ipAddress: Source IP address (optional)
  - sessionId: Session identifier (optional)
  - detailsJson: Arbitrary key-value map with additional event-specific data (optional)
Event source
Nested inside the payload, the eventSource object records origin metadata:
| Field | Description |
|---|---|
| id | Event source identifier |
| createdAt | Source creation timestamp |
| eventType | Event type at the source |
| resourceType | Resource type at the source |
| operationType | Operation type at the source |
| sourceService | Originating service name |
Subscription info
The subscription object describes the delivery target that failed:
| Field | Description |
|---|---|
| id | Subscription identifier |
| deliveryEndpoint | The URL the platform attempted to deliver the event to |
| subscriberService | Name of the subscriber service |
| callbackProtocol | Protocol used for delivery (for example, HTTP, HTTPS) |
| requestMethod | HTTP method used for delivery (for example, POST) |
| payloadType | Format of the payload sent to the subscriber |
| active | Whether the subscription is currently active |
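To see how these sections nest, it can help to picture a detail record as a single structure. The sketch below is an assumption about the overall shape: the top-level key names event and subscription mirror the section names in this guide but are not confirmed, every value is made up, and summarize is a hypothetical helper for building a one-line triage summary.

```python
# All values are invented; only the field names come from this guide.
detail = {
    "id": "dlt-123",
    "status": "FAILED",  # hypothetical status value
    "replayed": 0,
    "headers": {"Content-Type": "application/json"},
    "createdAt": "2024-01-01T10:00:00Z",
    "updatedAt": "2024-01-01T10:05:00Z",
    "event": {  # assumed key for the parsed event payload section
        "eventId": "evt-456",
        "createdAt": "2024-01-01T09:59:58Z",
        "payload": {
            "id": "res-789",
            "type": "EXAMPLE_TYPE",
            "error": None,
            "eventTime": "2024-01-01T09:59:58Z",
            "eventSource": {  # nested inside the payload, per the guide
                "id": "src-1",
                "createdAt": "2024-01-01T09:59:58Z",
                "eventType": "EXAMPLE_TYPE",
                "resourceType": "USER",
                "operationType": "CREATE",
                "sourceService": "example-source",
            },
        },
    },
    "subscription": {  # assumed key for the subscription info section
        "id": "sub-1",
        "deliveryEndpoint": "https://subscriber.example.com/hooks",
        "subscriberService": "example-subscriber",
        "callbackProtocol": "HTTPS",
        "requestMethod": "POST",
        "payloadType": "JSON",
        "active": True,
    },
}


def summarize(detail: dict) -> str:
    """Build a one-line summary: what was sent, from where, to which target."""
    source = detail["event"]["payload"]["eventSource"]
    sub = detail["subscription"]
    return (
        f"{source['eventType']}/{source['operationType']} "
        f"from {source['sourceService']} -> "
        f"{sub['requestMethod']} {sub['deliveryEndpoint']}"
    )


summary = summarize(detail)
```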
4. Diagnose failures using the error history
Scroll down to the History section. Each entry in the history array represents a single failed delivery attempt and contains:
| Field | Description |
|---|---|
| errorCode | The error code returned (for example, an HTTP status code or internal error code) |
| errorMessage | A human-readable description of the failure |
| errorCause | Additional cause information, if available (optional) |
| processDate | Timestamp of the failed delivery attempt |
Review the history entries in chronological order to identify patterns:
- Recurring identical error codes may indicate a persistent infrastructure issue, such as a subscriber service that is down.
- Changing error messages across entries may suggest intermittent connectivity problems or a configuration that was partially corrected.
- The errorCause field, when present, often contains stack traces or upstream error details that help narrow down the root cause.
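The pattern checks above can be sketched as a small helper that reads a history array in chronological order. This is a heuristic illustration, not platform behavior: the triage function, its hint wording, and the sample entries are hypothetical, and it assumes processDate values are ISO 8601 strings in a single format so that they sort correctly as text.

```python
from collections import Counter


def triage(history):
    """Summarize a history array into a rough diagnostic hint."""
    ordered = sorted(history, key=lambda h: h["processDate"])
    codes = Counter(h["errorCode"] for h in ordered)
    messages = {h["errorMessage"] for h in ordered}

    if not ordered:
        hint = "no failed attempts recorded"
    elif len(ordered) == 1:
        hint = "single attempt: not enough history to see a pattern"
    elif len(codes) == 1 and len(messages) == 1:
        hint = "recurring identical error: possibly a persistent issue"
    else:
        hint = "changing errors: possibly intermittent or partially corrected"

    # errorCause is optional, so collect it only where it is present.
    causes = [h["errorCause"] for h in ordered if h.get("errorCause")]
    return {"hint": hint, "codes": dict(codes), "causes": causes}


# Made-up history entries for illustration.
persistent = triage([
    {"errorCode": "503", "errorMessage": "Service Unavailable",
     "processDate": "2024-01-01T10:05:00Z"},
    {"errorCode": "503", "errorMessage": "Service Unavailable",
     "processDate": "2024-01-01T10:00:00Z"},
])
changing = triage([
    {"errorCode": "503", "errorMessage": "Service Unavailable",
     "processDate": "2024-01-01T10:00:00Z"},
    {"errorCode": "401", "errorMessage": "Unauthorized",
     "errorCause": "token rejected by subscriber",
     "processDate": "2024-01-01T10:05:00Z"},
])
```

The hints are deliberately phrased as possibilities; they point you at where to look rather than stating a root cause.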
Cross-reference the headers from the event metadata with the Decision Trace Headers reference to correlate authorization-related failures with decision trace identifiers.
5. Replay failed events
After you resolve the root cause of the failure, replay the event to re-deliver it to the original subscriber.
Replay from the list page
- Select one or more events using the checkboxes in the event list.
- Click the Replay action button.
- The platform sends the selected event IDs for reprocessing. At least one event must be selected.
Replay from the detail page
- Open the event detail page.
- Click the Replay action for the individual event.
Review replay results
After a replay operation completes, the response includes:
| Field | Description |
|---|---|
| success | Whether the overall replay operation succeeded |
| message | A summary message describing the outcome |
| processedCount | Number of events that were redelivered |
| failedCount | Number of events that failed redelivery again |
| data | An array of per-event results, each containing the event record, a processed boolean, a message, and an optional statusCode |
- If all events are redelivered, a success notification appears with the processed count.
- If some events fail, a warning notification appears showing both the processed count and the failed count.
- If the entire operation fails, an error notification appears.
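The three notification outcomes above can be sketched as a mapping from the replay response. The exact rules the Admin Console applies are not documented here, so this is one plausible reading: notification_for is a hypothetical function, and treating "nothing was processed" as the entire operation failing is an assumption.

```python
def notification_for(response: dict):
    """Map a replay response to a (level, text) pair."""
    processed = response.get("processedCount", 0)
    failed = response.get("failedCount", 0)

    # Everything was redelivered.
    if response.get("success") and failed == 0:
        return ("success", f"{processed} event(s) processed")
    # Some events went through and some did not.
    if processed > 0:
        return ("warning", f"{processed} processed, {failed} failed")
    # Assumption: nothing processed means the whole operation failed.
    return ("error", response.get("message", "Replay failed"))


# Made-up responses using the documented field names.
all_ok = notification_for(
    {"success": True, "processedCount": 3, "failedCount": 0})
partial = notification_for(
    {"success": True, "processedCount": 2, "failedCount": 1})
total_failure = notification_for(
    {"success": False, "message": "Subscriber unreachable",
     "processedCount": 0, "failedCount": 3})
```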
After replay, the replayed counter on each successfully reprocessed event increments, and the event list refreshes automatically.
Replaying an event requires the dlts:update permission scope. If you do not have this permission, the replay action is not available.
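The relationship between the two permission scopes and the available actions can be sketched as follows. The available_actions function and the action names are hypothetical; only the scope names and what each one unlocks come from this guide.

```python
READ_SCOPE = "dlts:read"      # viewing events and event details
UPDATE_SCOPE = "dlts:update"  # triggering event replay


def available_actions(scopes):
    """Return the DLT Management actions a set of scopes allows."""
    actions = set()
    if READ_SCOPE in scopes:
        actions.update({"list", "detail"})
    if UPDATE_SCOPE in scopes:
        actions.add("replay")
    return actions


read_only = available_actions({"dlts:read"})
operator = available_actions({"dlts:read", "dlts:update"})
```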
Validation Scenario
Scenario
Confirm that you can find a failed event, inspect its error history, replay it, and verify the result.
Expected Result
The replayed event's replayed count increments by one, and the processedCount in the replay response equals the number of events you selected.
How to Verify
- Open Observability > DLT Management.
- Use the search field to locate an event by its event type or source service.
- Click the event to open the detail page. Confirm that the History section displays at least one error entry with errorCode, errorMessage, and processDate fields populated.
- Note the current value of the replayed field.
- Click Replay.
- Verify:
  - UI evidence: A success notification appears. The replayed counter on the event detail page increments by one.
  - API evidence: The replay response contains success: true and processedCount: 1.
  - List evidence: Return to the event list. The replayed column for the event reflects the updated count.
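If you capture the replayed value before and after the replay along with the replay response, the expected result can be checked mechanically. The verify_replay helper below is a hypothetical sketch; how you obtain the values (from the UI or from a captured response) is up to you.

```python
def verify_replay(replayed_before, replayed_after, response, selected_count=1):
    """Return a list of failed checks; an empty list means the scenario passed."""
    problems = []
    if replayed_after != replayed_before + 1:
        problems.append(
            f"replayed went from {replayed_before} to {replayed_after}, "
            f"expected {replayed_before + 1}"
        )
    if response.get("success") is not True:
        problems.append("response.success is not true")
    if response.get("processedCount") != selected_count:
        problems.append(
            f"processedCount is {response.get('processedCount')}, "
            f"expected {selected_count}"
        )
    return problems


# Made-up values for a single replayed event.
passed = verify_replay(0, 1, {"success": True, "processedCount": 1})
failed = verify_replay(0, 0, {"success": False, "processedCount": 0})
```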
Troubleshooting
Replay fails again immediately
Symptom: You replay an event and the failedCount in the response is greater than zero.
Cause: The subscriber endpoint is still unreachable, or the underlying issue has not been resolved.
Resolution: Verify that the deliveryEndpoint shown in the subscription section of the event detail is reachable. Check the subscriber service health and network connectivity before attempting another replay.
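A quick way to check basic reachability before replaying is to attempt a TCP connection to the host and port in the deliveryEndpoint URL. The endpoint_reachable function below is a hypothetical helper that uses only the Python standard library. It confirms that something accepts connections at that address; it does not confirm that the subscriber will accept the event, so treat a positive result as a necessary condition rather than proof.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(delivery_endpoint: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    parts = urlsplit(delivery_endpoint)
    if not parts.hostname:
        raise ValueError(f"no host in URL: {delivery_endpoint!r}")
    # Fall back to the scheme's well-known port when the URL omits one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

Run it from a machine with the same network access as the platform; for example, endpoint_reachable("https://subscriber.example.com/hooks") with your own endpoint substituted. A result from your workstation may not reflect what the platform can reach.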
Events do not appear in the list
Symptom: You expect to see failed events, but the DLT event list is empty or does not contain the events you are looking for.
Cause: The search filter or sort order may be hiding the events. The default sort is createdAt descending, so older events appear on later pages.
Resolution: Clear the search field, reset the sort order to the default, and check subsequent pages. If events are still missing, confirm that the subscriber configuration is set up to route failures to the DLT.
Replay action is not available
Symptom: The replay button is disabled or not visible.
Cause: Your account does not have the dlts:update permission scope.
Resolution: Contact your platform administrator to request the dlts:update scope on the dlts resource.
Next Steps
After mastering DLT event tracing and replay, expand your observability workflow:
- Monitor platform health holistically — Visit Alerts, Logs, and Traces to learn how to set up alerts and correlate logs with failed events.
- Understand decision headers — Review the Decision Trace Headers reference for the full list of authorization decision headers that appear in event metadata.
- Manage active sessions — Use Session and Device Monitoring to investigate user sessions related to failed events.
Related Docs
Alerts, Logs, and Traces
Set up alerts and correlate logs across platform services.
Decision Trace Headers
Reference for HTTP headers injected by the authorization gateway.
Admin Console Overview
Architecture and navigation overview for the Admin Console.
Session and Device Monitoring
Monitor and manage active user sessions and devices.