Traces & Root Cause Analysis
Goal
Use distributed tracing to follow requests across Keymate platform services, identify which service or operation causes latency, and perform root cause analysis for production issues. By the end of this guide, you will be able to trace any request from the API gateway through authorization, identity, and backend services to pinpoint the source of problems.
Audience
Operators and developers responsible for diagnosing performance issues and production incidents in the Keymate platform.
Prerequisites
- A running Keymate deployment with the observability layer deployed
- Access to the observability dashboard
- Basic understanding of distributed tracing concepts (traces, spans, trace ID)
Before You Start
The platform assigns a trace ID to every request that enters Keymate. As the request flows through services (API gateway → authorization engine → identity provider → platform services), each service adds a span to the trace. The result is a complete picture of the request's journey, with timing information for every step.
The OpenTelemetry pipeline collects all tracing data — you do not need additional instrumentation.
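Services propagate trace context between each other so the collector can stitch spans into one trace; in OpenTelemetry-based pipelines this is typically the W3C `traceparent` HTTP header. As a minimal sketch (the header value below is the example from the W3C spec, not a real Keymate request), here is how to pull the trace ID out of that header so you can paste it into the trace explorer:

```python
# Minimal sketch: parse a W3C traceparent header to recover the trace ID.
# Format: version-traceid-spanid-flags (four hex fields separated by dashes).
def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,   # 32 hex chars; search this in the trace explorer
        "span_id": span_id,     # 16 hex chars; the caller's own span
        "sampled": flags == "01",
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
```

If a downstream service drops this header, its spans appear as a separate trace (or not at all), which is the usual cause of the "missing spans" symptom covered under Troubleshooting.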
Key Concepts
| Concept | Definition |
|---|---|
| Trace | The complete record of a single request as it flows through multiple services |
| Span | One operation within a trace (e.g., "authorize request", "query database") |
| Trace ID | A unique identifier that connects all spans belonging to the same request |
| Parent span | The span that initiated a child operation |
| Latency | The time a span takes to complete — the gap between start and end |
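The concepts in the table map directly onto span records. The sketch below uses toy data with hypothetical field names (`span_id`, `parent_id`, `start_ms`, `end_ms`), not the platform's actual schema, to show how trace ID, parent links, and latency fit together:

```python
# Toy span records (hypothetical field names) illustrating the concepts above.
# All three spans share one trace; parent_id links each child to its parent.
spans = [
    {"span_id": "a1", "parent_id": None, "name": "gateway: handle request",
     "start_ms": 0, "end_ms": 120},
    {"span_id": "b2", "parent_id": "a1", "name": "authz: authorize request",
     "start_ms": 5, "end_ms": 95},
    {"span_id": "c3", "parent_id": "b2", "name": "authz: query policy db",
     "start_ms": 10, "end_ms": 90},
]

# Latency of a span is simply end minus start.
latency = {s["span_id"]: s["end_ms"] - s["start_ms"] for s in spans}
print(latency)  # {'a1': 120, 'b2': 90, 'c3': 80}
```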
Steps
1. Find the trace
Start from one of these entry points:
| Starting point | How to find the trace |
|---|---|
| Logs | Copy the trace ID from a log entry and search in the trace explorer |
| Metrics | Click through from a latency spike to see example slow traces in that time window |
| Trace explorer | Search by service name, time range, duration, or status code |
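The trace explorer search in the last row boils down to filtering an index of traces by service, duration, and status. A small sketch over toy data (hypothetical field names, not the platform's query API) makes the semantics concrete:

```python
# Hypothetical trace-index entries; mimics searching the trace explorer
# by service name, minimum duration, and status code.
traces = [
    {"trace_id": "t1", "root_service": "api-gateway", "duration_ms": 2150, "status": 500},
    {"trace_id": "t2", "root_service": "api-gateway", "duration_ms": 85,   "status": 200},
    {"trace_id": "t3", "root_service": "identity",    "duration_ms": 2300, "status": 200},
]

def search(traces, service=None, min_duration_ms=0, status=None):
    """Return traces matching every filter that was supplied."""
    return [t for t in traces
            if (service is None or t["root_service"] == service)
            and t["duration_ms"] >= min_duration_ms
            and (status is None or t["status"] == status)]

slow = search(traces, min_duration_ms=2000)
print([t["trace_id"] for t in slow])  # ['t1', 't3']
```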
2. Read the trace waterfall
The trace waterfall shows all spans arranged by time. Each row is a span, indented under its parent.
What to look for:
| Pattern | What it means |
|---|---|
| One span is much longer than others | That operation is the bottleneck |
| Many sequential spans | Operations are running one after another instead of in parallel |
| A span shows an error status | That operation failed and may be the root cause |
| Large gap between parent and child spans | Time spent waiting — check network or queue latency |
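The last pattern, a large gap between parent and child, can be read straight off span timestamps: the child starts long after its parent did. A sketch with toy timings (hypothetical field names):

```python
# Detect a "large gap between parent and child" from span timings (toy data).
spans = {
    "a1": {"parent": None, "start_ms": 0,   "end_ms": 500},
    "b2": {"parent": "a1", "start_ms": 400, "end_ms": 480},  # starts 400 ms late
}

def startup_gap_ms(spans, span_id):
    """Milliseconds between the parent starting and this child starting."""
    parent = spans[span_id]["parent"]
    if parent is None:
        return 0
    return spans[span_id]["start_ms"] - spans[parent]["start_ms"]

print(startup_gap_ms(spans, "b2"))  # 400 -> time spent waiting before the child ran
```

A gap like this points at queueing, connection-pool waits, or network latency rather than work inside either span.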
3. Identify the bottleneck
Follow this systematic approach:
- Find the slowest span — sort by duration to identify the longest operation
- Check if it is a leaf span — if the slowest span has no children, the issue is within that single operation (e.g., a slow database query)
- Check if it is a parent span — if the slowest span has children, the issue is in one of its child operations
- Look at the span attributes — attributes include database queries, HTTP methods, endpoints, and error messages that help diagnose the specific problem
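The steps above amount to walking down from the slowest span until you hit a leaf. A minimal sketch over toy span data (hypothetical field names, durations invented for illustration):

```python
# Systematic bottleneck search: descend from the slowest root span,
# always following the slowest child, until reaching a leaf.
spans = [
    {"id": "a1", "parent": None, "name": "gateway",         "duration_ms": 2100},
    {"id": "b2", "parent": "a1", "name": "authz",           "duration_ms": 1900},
    {"id": "c3", "parent": "b2", "name": "policy db query", "duration_ms": 1800},
    {"id": "d4", "parent": "a1", "name": "identity",        "duration_ms": 50},
]

def slowest_leaf(spans):
    by_parent = {}
    for s in spans:
        by_parent.setdefault(s["parent"], []).append(s)
    current = max(by_parent[None], key=lambda s: s["duration_ms"])  # slowest root
    while current["id"] in by_parent:                               # has children?
        current = max(by_parent[current["id"]], key=lambda s: s["duration_ms"])
    return current

print(slowest_leaf(spans)["name"])  # 'policy db query'
```

In this example the leaf accounts for most of the parent's time, so the database query, not the authorization engine itself, is the operation to investigate.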
4. Common investigation patterns
Slow authorization decisions
Typical root cause: a slow database query in the policy evaluation path. Investigate query performance or data volume.
Authentication latency
Typical root cause: the external identity provider (federation target) is slow to respond. Check network connectivity and the external provider's health.
Cascading failures
Typical root cause: the database is unavailable, causing platform services to time out, which in turn causes the gateway to time out. Fix the database issue first.
5. Correlate with logs and metrics
After identifying the problematic span:
- Check logs for the same service and time window — error messages provide specific context
- Check metrics — resource utilization (CPU, memory, connection pool) may explain why the operation was slow
- Check recent changes — a deployment, configuration change, or traffic increase may correlate with the issue
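Because log entries carry the same trace ID (see Before You Start), the log correlation in the first bullet is a join on trace ID plus time window. A sketch over toy log records (hypothetical field names):

```python
# Correlate: pull log lines that share the problematic trace ID and fall
# inside the slow span's time window (toy records, hypothetical fields).
logs = [
    {"ts_ms": 1050, "trace_id": "t1", "msg": "query exceeded slow-query threshold"},
    {"ts_ms": 1200, "trace_id": "t2", "msg": "request ok"},
    {"ts_ms": 1300, "trace_id": "t1", "msg": "connection pool exhausted"},
]

def logs_for_span(logs, trace_id, start_ms, end_ms):
    return [l for l in logs
            if l["trace_id"] == trace_id and start_ms <= l["ts_ms"] <= end_ms]

for line in logs_for_span(logs, "t1", 1000, 1250):
    print(line["msg"])  # query exceeded slow-query threshold
```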
6. Export traces to external tools
If you use an external trace backend or observability platform, configure trace export through the OpenTelemetry pipeline.
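In OpenTelemetry Collector terms, export usually means adding an exporter to the traces pipeline. The fragment below is a generic Collector-style sketch, not Keymate's actual configuration; the endpoint is a placeholder for your backend:

```yaml
# Sketch of an OpenTelemetry Collector config that forwards traces to an
# external backend over OTLP/gRPC. Endpoint and TLS settings are placeholders.
exporters:
  otlp:
    endpoint: my-backend.example.com:4317   # hypothetical backend address
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```

Consult your backend's documentation for the required endpoint, protocol (gRPC vs. HTTP), and authentication headers.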
Validation Scenario
Scenario
An operator investigates why some API requests take over 2 seconds when the normal response time is under 100ms.
Expected Result
- The operator finds slow traces in the trace explorer filtered by duration > 2 seconds
- The trace waterfall reveals that a database query in the authorization engine takes 1.8 seconds
- The span attributes show the specific query and database table involved
- The operator correlates with metrics and finds that database CPU is at 95%, confirming resource contention
How to Verify
- Search for traces with duration > 2000ms in the reported time window
- Verify the waterfall shows span-level timing for all services in the request path
- Confirm span attributes include enough detail to identify the root cause
Troubleshooting
- Traces are incomplete (missing spans). Some services may not propagate the trace context correctly. Verify that all platform services have the OTel SDK and that the service mesh forwards trace context headers.
- No traces for a specific service. Check that the service emits traces and that you configured the collector to receive them. Verify namespace labels for telemetry collection.
- Trace explorer is slow to query. Narrow the time range and add filters (service name, minimum duration) before searching. Broad queries over long time ranges are resource-intensive.
Next Steps
- Logs — Correlate trace findings with detailed log entries
- Metrics — Check resource utilization alongside trace analysis
- Scaling & Performance — Scale components identified as bottlenecks