Metrics
Goal
Use the Keymate observability platform to monitor infrastructure and application metrics, create dashboards for key indicators, configure alerts for critical conditions, and detect performance bottlenecks. By the end of this guide, you will have dashboards that show platform health and alerts that notify you before issues impact users.
Audience
Operators responsible for monitoring performance and availability of the Keymate platform.
Prerequisites
- A running Keymate deployment with the observability layer deployed
- Access to the observability dashboard
Before You Start
Keymate emits metrics from all platform components through the OpenTelemetry pipeline. Metrics cover three categories: infrastructure, application, and custom business metrics. The built-in observability platform ships with pre-configured dashboards for the most important indicators.
Metric Categories
| Category | What it measures | Examples |
|---|---|---|
| Infrastructure | Kubernetes and system-level resource usage | CPU utilization, memory usage, disk I/O, network throughput, pod restart count |
| Application | Platform service behavior and performance | Request rate, response latency (p50/p95/p99), error rate, active sessions, authentication throughput |
| Custom business | Domain-specific indicators | Authorization decisions per second, policy evaluation latency, Tenant-specific request volume |
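Because Keymate emits metrics through the OpenTelemetry pipeline, metrics in the custom business category can be recorded with a standard OTel SDK. The sketch below assumes the OpenTelemetry Python SDK; the meter, instrument, and attribute names are illustrative, not Keymate-defined identifiers.

```python
# Minimal sketch: recording custom business metrics with the OpenTelemetry
# Python SDK. Instrument and attribute names are illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("keymate.authz")  # assumed meter name

# Counter: authorization decisions; the per-second rate is derived at query time.
decision_counter = meter.create_counter(
    "authz.decisions",
    unit="1",
    description="Number of authorization decisions",
)

# Histogram: policy evaluation latency, feeds p50/p95/p99 panels.
eval_latency = meter.create_histogram(
    "authz.policy_eval.duration",
    unit="ms",
    description="Policy evaluation latency",
)

def record_decision(tenant: str, allowed: bool, duration_ms: float) -> None:
    attributes = {"tenant": tenant, "decision": "allow" if allowed else "deny"}
    decision_counter.add(1, attributes)
    eval_latency.record(duration_ms, attributes)
```

The counter backs an authorization-decisions-per-second panel (the rate is computed at query time), and the histogram backs the policy evaluation latency percentiles.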
Steps
1. Access the metrics dashboard
Open the observability dashboard. The platform provides pre-built dashboards organized by category:
- Platform overview — high-level health across all services
- Per-service dashboards — detailed metrics for each platform component
- Infrastructure dashboard — Kubernetes node and pod resource usage
2. Monitor the key indicators
Focus on these indicators for day-to-day monitoring:
| Indicator | What it tells you | Alert threshold (example) |
|---|---|---|
| Request error rate | Percentage of requests returning errors (4xx/5xx) | Alert if > 1% over 5 minutes |
| Response latency (p95) | 95th percentile response time | Alert if > 500ms over 5 minutes |
| CPU utilization | Percentage of CPU limit consumed | Alert if > 80% sustained over 10 minutes |
| Memory utilization | Percentage of memory limit consumed | Alert if > 85% sustained over 10 minutes |
| Pod restart count | Number of pod restarts in a time window | Alert if > 3 restarts in 15 minutes |
| Database connection pool | Active vs available database connections | Alert if > 80% pool utilization |
| Certificate expiration | Days until TLS certificate expires | Alert if < 14 days |
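The example thresholds above are evaluated over a window rather than on single samples, so brief spikes do not fire alerts. The sketch below illustrates that windowed semantics; the counts and samples are assumed to come from your metrics backend, and the query mechanism is deliberately left out.

```python
# Sketch of windowed threshold checks matching the example thresholds above.
# Input values are assumed to come from your metrics backend.
from typing import Sequence

def error_rate_breached(errors: int, total: int, threshold: float = 0.01) -> bool:
    """Request error rate > 1% over the window (e.g. 5 minutes)."""
    return total > 0 and (errors / total) > threshold

def sustained_breach(samples: Sequence[float], threshold: float) -> bool:
    """'Sustained' CPU/memory breach: every sample in the window exceeds the
    threshold, so a single short spike does not trigger the alert."""
    return bool(samples) and all(s > threshold for s in samples)
```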
3. Create custom dashboards
Build dashboards tailored to your operational needs:
- Tenant-level dashboard — request volume, error rate, and latency per Tenant
- Authorization performance — policy evaluation latency, decision throughput, cache hit rate
- Deployment health — component versions, replica counts, rollout status
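Derived indicators such as cache hit rate can either be computed at query time from raw hit/miss counters or exposed directly as a metric. As a sketch of the second option, assuming the OpenTelemetry Python SDK and illustrative names:

```python
# Sketch: exposing a derived cache hit ratio for an authorization
# performance dashboard. Names are illustrative, not Keymate identifiers.
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("keymate.authz.cache")

# Raw counts maintained by the service (placeholders in this sketch).
cache_hits = 0
cache_misses = 0

def _hit_ratio(options: CallbackOptions):
    total = cache_hits + cache_misses
    yield Observation(cache_hits / total if total else 0.0)

meter.create_observable_gauge(
    "authz.cache.hit_ratio",
    callbacks=[_hit_ratio],
    description="Fraction of policy cache lookups served from cache",
)
```

Computing the ratio in the dashboard from hit/miss counters instead keeps the raw counts available for other panels; exporting the ratio directly keeps dashboard queries simpler.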
4. Configure alerts
Set up alerts that notify your team before issues impact users.
Alert design principles:
- Alert on symptoms (high error rate), not causes (high CPU) — causes without symptoms do not require immediate action
- Use multiple severity levels: critical (page immediately), warning (investigate soon), informational (review during the next working hours)
- Include runbook links in alert notifications to speed up response
- Avoid alert fatigue — tune thresholds based on observed baselines, not arbitrary values
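How alert rules are expressed depends on your alerting backend, but each rule should carry the condition, the duration it must hold, a severity, and a runbook link. The sketch below shows such rules as plain data; the field names and URLs are illustrative, not a Keymate schema.

```python
# Illustrative alert rule definitions reflecting the principles above:
# symptom-based conditions, explicit severity, and runbook links.
# Field names and URLs are placeholders, not a Keymate schema.
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    condition: str      # backend-specific query or expression
    for_duration: str   # how long the condition must hold before firing
    severity: str       # "critical" | "warning" | "informational"
    runbook_url: str

RULES = [
    AlertRule(
        name="HighRequestErrorRate",
        condition="request error rate > 1%",   # symptom, not cause
        for_duration="5m",
        severity="critical",
        runbook_url="https://runbooks.example.com/high-error-rate",
    ),
    AlertRule(
        name="CertificateExpiringSoon",
        condition="days until TLS certificate expiry < 14",
        for_duration="1h",
        severity="warning",
        runbook_url="https://runbooks.example.com/certificate-rotation",
    ),
]
```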
5. Investigate performance bottlenecks
When metrics indicate a performance issue, use this investigation flow:
- Identify the affected service — which component shows elevated latency or error rate?
- Check resource utilization — is the service CPU-throttled or memory-constrained?
- Check dependencies — is a downstream service (database, cache, external API) slow?
- Correlate with traces — find slow traces for the affected service to identify the specific operation causing the bottleneck
- Correlate with logs — check error logs during the affected time window for additional context
6. Export metrics to external tools
If you use an external metrics backend, dashboarding tool, or observability platform, configure the OTel Collector to export metrics to it. See Export & Tooling Portability for details.
Validation Scenario
Scenario
An operator sets up monitoring for a newly deployed Keymate platform and needs to verify that metrics are flowing and alerts are functional.
Expected Result
- The platform overview dashboard shows metrics from all components
- Per-service dashboards show request rate, latency, and error rate
- Infrastructure dashboard shows CPU, memory, and pod status
- A test alert (e.g., temporarily lower a threshold) fires and reaches the notification channel
How to Verify
- Open each dashboard and confirm data is present and updating
- Trigger a test alert by lowering a threshold temporarily
- Verify the alert notification arrives in the configured channel
Troubleshooting
- No metrics appearing. Verify the telemetry collector is running. Check that you configured the metrics pipeline and that the storage backend is reachable.
- Metrics from one service missing. Verify the service pod has the telemetry sidecar or SDK. Check the collector logs for dropped metrics.
- Alerts not firing. Verify you configured alert rules correctly. Check the alert evaluation interval and notification channel configuration.
- Dashboard shows gaps in data. This usually indicates collector restarts or storage issues during the gap period. Check collector and storage health for that time range.
Next Steps
- Traces & Root Cause Analysis — Drill into specific slow requests
- Scaling & Performance — Scale components based on metric insights
- Logs — Correlate metrics with log entries