Traces & Root Cause Analysis
Goal
Use distributed tracing to follow requests across Keymate platform services, identify which service or operation causes latency, and perform root cause analysis for production issues. By the end of this guide, you will be able to trace any request from the API gateway through authorization, identity, and backend services to pinpoint the source of problems.
Audience
Operators and developers responsible for diagnosing performance issues and production incidents in the Keymate platform.
Prerequisites
- A running Keymate deployment with the observability layer deployed
- Access to the observability dashboard
- Basic understanding of distributed tracing concepts (traces, spans, trace ID)
Before You Start
The platform assigns a trace ID to every request that enters Keymate. As the request flows through services (API gateway → authorization engine → identity provider → platform services), each service adds a span to the trace. The result is a complete picture of the request's journey, with timing information for every step.
The OpenTelemetry pipeline collects all tracing data — you do not need additional instrumentation.
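Services propagate trace context between each other so the collector can stitch spans into one trace; in OpenTelemetry-based pipelines this is typically the W3C `traceparent` HTTP header. As a minimal sketch (the header value below is the example from the W3C spec, not a real Keymate request), here is how to pull the trace ID out of that header so you can paste it into the trace explorer:

```python
# Minimal sketch: parse a W3C traceparent header to recover the trace ID.
# Format: version-traceid-spanid-flags (four hex fields separated by dashes).
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # 32 hex chars; search this in the trace explorer
        "span_id": span_id,     # 16 hex chars; the caller's own span
        "sampled": flags == "01",
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

If a downstream service drops this header, its spans appear as a separate trace (or not at all), which is the usual cause of the "missing spans" symptom covered under Troubleshooting.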
Key Concepts
| Concept | Definition |
|---|---|
| Trace | The complete record of a single request as it flows through multiple services |
| Span | One operation within a trace (e.g., "authorize request", "query database") |
| Trace ID | A unique identifier that connects all spans belonging to the same request |
| Parent span | The span that initiated a child operation |
| Latency | The time a span takes to complete — the gap between start and end |
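The concepts in the table map directly onto span records. The sketch below uses toy data with hypothetical field names (`span_id`, `parent_id`, `start_ms`, `end_ms`), not the platform's actual schema, to show how trace ID, parent links, and latency fit together:

```python
# Toy span records (hypothetical field names) illustrating the concepts above.
# All three spans share one trace; parent_id links each child to its parent.
spans = [
    {"span_id": "a1", "parent_id": None, "name": "gateway: handle request",
     "start_ms": 0, "end_ms": 120},
    {"span_id": "b2", "parent_id": "a1", "name": "authz: authorize request",
     "start_ms": 5, "end_ms": 95},
    {"span_id": "c3", "parent_id": "b2", "name": "authz: query policy db",
     "start_ms": 10, "end_ms": 90},
]

# Latency of a span is simply end minus start.
latency = {s["span_id"]: s["end_ms"] - s["start_ms"] for s in spans}
print(latency)  # {'a1': 120, 'b2': 90, 'c3': 80}
```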
Steps
1. Find the trace
Start from one of these entry points:
| Starting point | How to find the trace |
|---|---|
| Logs | Copy the trace ID from a log entry and search in the trace explorer |
| Metrics | Click through from a latency spike to see example slow traces in that time window |
| Trace explorer | Search by service name, time range, duration, or status code |
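The trace explorer search in the last row boils down to filtering an index of traces by service, duration, and status. A small sketch over toy data (hypothetical field names, not the platform's query API) makes the semantics concrete:

```python
# Hypothetical trace-index entries; mimics searching the trace explorer
# by service name, minimum duration, and status code.
traces = [
    {"trace_id": "t1", "root_service": "api-gateway", "duration_ms": 2150, "status": 500},
    {"trace_id": "t2", "root_service": "api-gateway", "duration_ms": 85,   "status": 200},
    {"trace_id": "t3", "root_service": "identity",    "duration_ms": 2300, "status": 200},
]

def search(traces, service=None, min_duration_ms=0, status=None):
    """Return traces matching every filter that was supplied."""
    return [t for t in traces
            if (service is None or t["root_service"] == service)
            and t["duration_ms"] >= min_duration_ms
            and (status is None or t["status"] == status)]

slow = search(traces, min_duration_ms=2000)
print([t["trace_id"] for t in slow])  # ['t1', 't3']
```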
2. Read the trace waterfall
The trace waterfall shows all spans arranged by time. Each row is a span, indented under its parent.
What to look for:
| Pattern | What it means |
|---|---|
| One span is much longer than others | That operation is the bottleneck |
| Many sequential spans | Operations are running one after another instead of in parallel |
| A span shows an error status | That operation failed and may be the root cause |
| Large gap between parent and child spans | Time spent waiting — check network or queue latency |
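The last pattern, a large gap between parent and child, can be read straight off span timestamps: the child starts long after its parent did. A sketch with toy timings (hypothetical field names):

```python
# Detect a "large gap between parent and child" from span timings (toy data).
spans = {
    "a1": {"parent": None, "start_ms": 0,   "end_ms": 500},
    "b2": {"parent": "a1", "start_ms": 400, "end_ms": 480},  # starts 400 ms late
}

def startup_gap_ms(spans, span_id):
    """Milliseconds between the parent starting and this child starting."""
    parent = spans[span_id]["parent"]
    if parent is None:
        return 0
    return spans[span_id]["start_ms"] - spans[parent]["start_ms"]

print(startup_gap_ms(spans, "b2"))  # 400 -> time spent waiting before the child ran
```

A gap like this points at queueing, connection-pool waits, or network latency rather than work inside either span.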
3. Identify the bottleneck
Follow this systematic approach:
- Find the slowest span — sort by duration to identify the longest operation
- Check if it is a leaf span — if the slowest span has no children, the issue is within that single operation (e.g., a slow database query)
- Check if it is a parent span — if the slowest span has children, the issue is in one of its child operations
- Look at the span attributes — attributes include database queries, HTTP methods, endpoints, and error messages that help diagnose the specific problem
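The steps above amount to walking down from the slowest span until you hit a leaf. A minimal sketch over toy span data (hypothetical field names, durations invented for illustration):

```python
# Systematic bottleneck search: descend from the slowest root span,
# always following the slowest child, until reaching a leaf.
spans = [
    {"id": "a1", "parent": None, "name": "gateway",         "duration_ms": 2100},
    {"id": "b2", "parent": "a1", "name": "authz",           "duration_ms": 1900},
    {"id": "c3", "parent": "b2", "name": "policy db query", "duration_ms": 1800},
    {"id": "d4", "parent": "a1", "name": "identity",        "duration_ms": 50},
]

def slowest_leaf(spans):
    by_parent = {}
    for s in spans:
        by_parent.setdefault(s["parent"], []).append(s)
    current = max(by_parent[None], key=lambda s: s["duration_ms"])  # slowest root
    while current["id"] in by_parent:                               # has children?
        current = max(by_parent[current["id"]], key=lambda s: s["duration_ms"])
    return current

print(slowest_leaf(spans)["name"])  # 'policy db query'
```

In this example the leaf accounts for most of the parent's time, so the database query, not the authorization engine itself, is the operation to investigate.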
4. Common investigation patterns
Slow authorization decisions
Typical root cause: a slow database query in the policy evaluation path. Investigate query performance or data volume.
Authentication latency
Typical root cause: the external identity provider (federation target) is slow to respond. Check network connectivity and the external provider's health.
Cascading failures
Typical root cause: the database is unavailable, causing platform services to time out, which in turn causes the gateway to time out. Fix the database issue first.
5. Correlate with logs and metrics
After identifying the problematic span:
- Check logs for the same service and time window — error messages provide specific context
- Check metrics — resource utilization (CPU, memory, connection pool) may explain why the operation was slow
- Check recent changes — a deployment, configuration change, or traffic increase may correlate with the issue
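Because log entries carry the same trace ID (see Before You Start), the log correlation in the first bullet is a join on trace ID plus time window. A sketch over toy log records (hypothetical field names):

```python
# Correlate: pull log lines that share the problematic trace ID and fall
# inside the slow span's time window (toy records, hypothetical fields).
logs = [
    {"ts_ms": 1050, "trace_id": "t1", "msg": "query exceeded slow-query threshold"},
    {"ts_ms": 1200, "trace_id": "t2", "msg": "request ok"},
    {"ts_ms": 1300, "trace_id": "t1", "msg": "connection pool exhausted"},
]

def logs_for_span(logs, trace_id, start_ms, end_ms):
    return [l for l in logs
            if l["trace_id"] == trace_id and start_ms <= l["ts_ms"] <= end_ms]

for line in logs_for_span(logs, "t1", 1000, 1250):
    print(line["msg"])  # query exceeded slow-query threshold
```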
6. Export traces to external tools
If you use an external trace backend or observability platform, configure trace export through the OpenTelemetry pipeline.
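In OpenTelemetry Collector terms, export usually means adding an exporter to the traces pipeline. The fragment below is a generic Collector-style sketch, not Keymate's actual configuration; the endpoint is a placeholder for your backend:

```yaml
# Sketch of an OpenTelemetry Collector config that forwards traces to an
# external backend over OTLP/gRPC. Endpoint and TLS settings are placeholders.
exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # hypothetical backend address
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

Consult your backend's documentation for the required endpoint, protocol (gRPC vs. HTTP), and authentication headers.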
Validation Scenario
Scenario
An operator investigates why some API requests take over 2 seconds when the normal response time is under 100ms.
Expected Result
- The operator finds slow traces in the trace explorer filtered by duration > 2 seconds
- The trace waterfall reveals that a database query in the authorization engine takes 1.8 seconds
- The span attributes show the specific query and database table involved
- The operator correlates with metrics and finds that database CPU is at 95%, confirming resource contention
How to Verify
- Search for traces with duration > 2000ms in the reported time window
- Verify the waterfall shows span-level timing for all services in the request path
- Confirm span attributes include enough detail to identify the root cause
Troubleshooting
- Traces are incomplete (missing spans). Some services may not propagate the trace context correctly. Verify that all platform services have the OTel SDK and that the service mesh forwards trace context headers.
- No traces for a specific service. Check that the service emits traces and that you configured the collector to receive them. Verify namespace labels for telemetry collection.
- Trace explorer is slow to query. Narrow the time range and add filters (service name, minimum duration) before searching. Broad queries over long time ranges are resource-intensive.
Next Steps
- Logs — Correlate trace findings with detailed log entries
- Metrics — Check resource utilization alongside trace analysis
- Scaling & Performance — Scale components identified as bottlenecks