Traces & Root Cause Analysis

Goal

Use distributed tracing to follow requests across Keymate platform services, identify which service or operation causes latency, and perform root cause analysis for production issues. By the end of this guide, you will be able to trace any request from the API gateway through authorization, identity, and backend services to pinpoint the source of a problem.

Audience

Operators and developers responsible for diagnosing performance issues and production incidents in the Keymate platform.

Prerequisites

  • A running Keymate deployment with the observability layer deployed
  • Access to the observability dashboard
  • Basic understanding of distributed tracing concepts (traces, spans, trace ID)

Before You Start

The platform assigns a trace ID to every request that enters Keymate. As the request flows through services (API gateway → authorization engine → identity provider → platform services), each service adds a span to the trace. The result is a complete picture of the request's journey, with timing information for every step.

The OpenTelemetry pipeline collects all tracing data — you do not need additional instrumentation.
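OpenTelemetry propagates trace context between services in the W3C `traceparent` header. As a rough illustration of how one trace ID ties spans together across service hops (a minimal sketch, not the platform's internal code):

```python
# Sketch: parsing a W3C traceparent header, the mechanism OpenTelemetry
# uses to carry trace context from one service to the next.
# Format: version-traceid-parentid-flags (all hex fields).

def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its components."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,       # "00" for the current spec version
        "trace_id": trace_id,     # 32 hex chars, shared by every span in the trace
        "parent_id": parent_id,   # 16 hex chars, the span that made this call
        "sampled": flags == "01", # whether this trace is being recorded
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # every downstream span carries this same ID
```

Each service that receives this header starts its spans under the same trace ID, which is why a single search in the trace explorer returns the full request path.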

Key Concepts

  • Trace: The complete record of a single request as it flows through multiple services
  • Span: One operation within a trace (e.g., "authorize request", "query database")
  • Trace ID: A unique identifier that connects all spans belonging to the same request
  • Parent span: The span that initiated a child operation
  • Latency: The time a span takes to complete, measured from start to end
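These concepts can be modeled in a few lines. The class below is illustrative only (it is not the platform's actual span schema): a trace is simply the set of spans sharing one trace ID, linked into a tree by parent IDs.

```python
# Illustrative model of the concepts above: trace ID, span, parent, latency.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str             # shared by all spans in the same trace
    span_id: str              # unique within the trace
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def latency_ms(self) -> float:
        """Latency is the gap between start and end."""
        return self.end_ms - self.start_ms

root = Span("abc123", "s1", None, "gateway: handle request", 0.0, 120.0)
child = Span("abc123", "s2", "s1", "authz: evaluate policy", 5.0, 90.0)
print(child.latency_ms)  # 85.0
```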

Steps

1. Find the trace

Start from one of these entry points:

  • Logs: Copy the trace ID from a log entry and search for it in the trace explorer
  • Metrics: Click through from a latency spike to see example slow traces in that time window
  • Trace explorer: Search by service name, time range, duration, or status code
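Conceptually, a trace explorer search is a filter over trace summaries. A hypothetical sketch (field names are illustrative, not the platform's query API):

```python
# Sketch: what a trace explorer search does, expressed as a filter over
# trace summaries. Field names are placeholders for illustration.
traces = [
    {"trace_id": "a1", "service": "gateway", "duration_ms": 2100, "status": "OK"},
    {"trace_id": "b2", "service": "gateway", "duration_ms": 80,   "status": "OK"},
    {"trace_id": "c3", "service": "authz",   "duration_ms": 1900, "status": "ERROR"},
]

def search(traces, service=None, min_duration_ms=0, status=None):
    """Return traces matching every filter that was supplied."""
    return [t for t in traces
            if (service is None or t["service"] == service)
            and t["duration_ms"] >= min_duration_ms
            and (status is None or t["status"] == status)]

print([t["trace_id"] for t in search(traces, min_duration_ms=1000)])  # ['a1', 'c3']
```

Narrow filters (service plus minimum duration) keep queries fast, which also matters for the troubleshooting note on slow explorer queries later in this guide.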

2. Read the trace waterfall

The trace waterfall shows all spans arranged by time. Each row is a span, indented under its parent.

What to look for:

  • One span is much longer than the others: that operation is the bottleneck
  • Many sequential spans: operations run one after another instead of in parallel
  • A span shows an error status: that operation failed and may be the root cause
  • Large gap between parent and child spans: time spent waiting; check network or queue latency
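The "large gap" pattern can be computed mechanically: time inside the parent span not covered by any child is untraced waiting (network, queueing). A small sketch, using illustrative span dictionaries:

```python
# Sketch: detecting gaps before child spans. A gap is parent time that no
# child span accounts for -- usually network or queue latency.
def gaps(parent, children):
    """Return (child name, unexplained ms before it started), in start order."""
    out = []
    cursor = parent["start_ms"]
    for child in sorted(children, key=lambda c: c["start_ms"]):
        out.append((child["name"], child["start_ms"] - cursor))
        cursor = max(cursor, child["end_ms"])
    return out

parent = {"name": "gateway", "start_ms": 0.0, "end_ms": 500.0}
children = [
    {"name": "authz",   "start_ms": 10.0,  "end_ms": 60.0},
    {"name": "backend", "start_ms": 260.0, "end_ms": 480.0},  # 200 ms gap before it
]
print(gaps(parent, children))  # [('authz', 10.0), ('backend', 200.0)]
```

A 200 ms gap like the one above would not show up as a slow span anywhere in the waterfall; it only appears as whitespace between rows, which is why this pattern is easy to miss.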

3. Identify the bottleneck

Follow this systematic approach:

  1. Find the slowest span — sort by duration to identify the longest operation
  2. Check if it is a leaf span — if the slowest span has no children, the issue is within that single operation (e.g., a slow database query)
  3. Check if it is a parent span — if the slowest span has children, the issue is in one of its child operations
  4. Look at the span attributes — attributes include database queries, HTTP methods, endpoints, and error messages that help diagnose the specific problem
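Steps 1 through 3 above amount to walking down the span tree: start at the slowest span and keep descending into the slowest child until you reach a leaf. A minimal sketch over illustrative span records:

```python
# Sketch of steps 1-3: find the slowest span, then descend into its slowest
# child until reaching a leaf -- the operation where the time is actually spent.
spans = [
    {"id": "s1", "parent": None, "name": "gateway",  "duration_ms": 2050},
    {"id": "s2", "parent": "s1", "name": "authz",    "duration_ms": 1900},
    {"id": "s3", "parent": "s2", "name": "db query", "duration_ms": 1800},
    {"id": "s4", "parent": "s1", "name": "identity", "duration_ms": 50},
]

def slowest_leaf(spans):
    """Walk down from the slowest span until one has no children."""
    parent_ids = {s["parent"] for s in spans}
    current = max(spans, key=lambda s: s["duration_ms"])
    while current["id"] in parent_ids:  # has children: descend into slowest one
        children = [s for s in spans if s["parent"] == current["id"]]
        current = max(children, key=lambda s: s["duration_ms"])
    return current

print(slowest_leaf(spans)["name"])  # 'db query'
```

Once you reach the leaf, its attributes (step 4) tell you what the operation actually did.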

4. Common investigation patterns

Slow authorization decisions

Root cause: A slow database query in the policy evaluation path. Investigate query performance or data volume.

Authentication latency

Root cause: The external identity provider (federation target) is slow to respond. Check network connectivity and the external provider's health.

Cascading failures

Root cause: The database is unavailable, causing platform services to time out, which in turn causes the gateway to time out. Fix the database issue first.

5. Correlate with logs and metrics

After identifying the problematic span:

  • Check logs for the same service and time window — error messages provide specific context
  • Check metrics — resource utilization (CPU, memory, connection pool) may explain why the operation was slow
  • Check recent changes — a deployment, configuration change, or traffic increase may correlate with the issue
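Log correlation is a filter on two axes: same service, same time window as the problematic span. A sketch with illustrative field names (not the platform's log schema):

```python
# Sketch: pulling logs that overlap a problematic span's service and time
# window. A small slack accounts for clock skew between services.
span = {"service": "authz", "start_ms": 1000, "end_ms": 2800, "trace_id": "a1"}
logs = [
    {"ts_ms": 1500, "service": "authz",   "trace_id": "a1", "msg": "slow query: policies table"},
    {"ts_ms": 1600, "service": "gateway", "trace_id": "a1", "msg": "upstream pending"},
    {"ts_ms": 9000, "service": "authz",   "trace_id": "zz", "msg": "unrelated"},
]

def logs_for_span(span, logs, slack_ms=100):
    """Logs from the span's service inside its (slightly padded) time window."""
    lo, hi = span["start_ms"] - slack_ms, span["end_ms"] + slack_ms
    return [l for l in logs
            if l["service"] == span["service"] and lo <= l["ts_ms"] <= hi]

print([l["msg"] for l in logs_for_span(span, logs)])  # ['slow query: policies table']
```

When logs also carry the trace ID, filtering on it directly is even more precise than the time-window approach.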

6. Export traces to external tools

If you use an external trace backend or observability platform, configure trace export through the OpenTelemetry pipeline.
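As a rough sketch, exporting traces from an OpenTelemetry Collector to an external backend means adding an exporter and referencing it in the traces pipeline. The endpoint, exporter name, and pipeline contents below are placeholders; adapt them to your deployment's collector configuration:

```yaml
# Illustrative OpenTelemetry Collector fragment -- endpoint and names
# are placeholders, not values from a real Keymate deployment.
exporters:
  otlp/external:
    endpoint: traces.example.com:4317
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/external]
```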

Validation Scenario

Scenario

An operator investigates why some API requests take over 2 seconds when the normal response time is under 100ms.

Expected Result

  • The operator finds slow traces in the trace explorer filtered by duration > 2 seconds
  • The trace waterfall reveals that a database query in the authorization engine takes 1.8 seconds
  • The span attributes show the specific query and database table involved
  • The operator correlates with metrics and finds that database CPU is at 95%, confirming resource contention

How to Verify

  • Search for traces with duration > 2000ms in the reported time window
  • Verify the waterfall shows span-level timing for all services in the request path
  • Confirm span attributes include enough detail to identify the root cause

Troubleshooting

  • Traces are incomplete (missing spans). Some services may not propagate the trace context correctly. Verify that all platform services have the OTel SDK and that the service mesh forwards trace context headers.
  • No traces for a specific service. Check that the service emits traces and that you configured the collector to receive them. Verify namespace labels for telemetry collection.
  • Trace explorer is slow to query. Narrow the time range and add filters (service name, minimum duration) before searching. Broad queries over long time ranges are resource-intensive.

Next Steps

  • Logs — Correlate trace findings with detailed log entries
  • Metrics — Check resource utilization alongside trace analysis
  • Scaling & Performance — Scale components identified as bottlenecks