Metrics

Goal

Use the Keymate observability platform to monitor infrastructure and application metrics, create dashboards for key indicators, configure alerts for critical conditions, and detect performance bottlenecks. By the end of this guide, you will have dashboards that show platform health and alerts that notify you before issues impact users.

Audience

Operators responsible for monitoring performance and availability of the Keymate platform.

Prerequisites

  • A running Keymate deployment with the observability layer deployed
  • Access to the observability dashboard

Before You Start

Keymate emits metrics from all platform components through the OpenTelemetry pipeline. Metrics cover three categories: infrastructure, application, and custom business metrics. The built-in observability platform ships with pre-configured dashboards for the most important indicators.

Metric Categories

| Category | What it measures | Examples |
| --- | --- | --- |
| Infrastructure | Kubernetes and system-level resource usage | CPU utilization, memory usage, disk I/O, network throughput, pod restart count |
| Application | Platform service behavior and performance | Request rate, response latency (p50/p95/p99), error rate, active sessions, authentication throughput |
| Custom business | Domain-specific indicators | Authorization decisions per second, policy evaluation latency, Tenant-specific request volume |
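
Custom business metrics are emitted through the same OpenTelemetry pipeline as everything else. As an illustration only, here is a sketch of what instrumenting the "authorization decisions" and "policy evaluation latency" indicators could look like with the OpenTelemetry Python SDK; the collector endpoint, meter name, metric names, and attribute keys are all assumptions, not Keymate's actual identifiers.

```python
# Hypothetical sketch: emitting custom business metrics via the
# OpenTelemetry Python SDK. Endpoint and metric names are illustrative.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Push metrics to the collector every 15 seconds over OTLP/gRPC.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=15_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("keymate.authz")  # assumed meter name
decisions = meter.create_counter(
    "authz_decisions_total", unit="1",
    description="Authorization decisions, labeled by tenant and outcome",
)
policy_latency = meter.create_histogram(
    "policy_eval_latency_ms", unit="ms",
    description="Policy evaluation latency",
)

# Inside the request path:
decisions.add(1, {"tenant": "acme", "outcome": "allow"})
policy_latency.record(4.2, {"tenant": "acme"})
```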

Steps

1. Access the metrics dashboard

Open the observability dashboard. The platform provides pre-built dashboards organized by category:

  • Platform overview — high-level health across all services
  • Per-service dashboards — detailed metrics for each platform component
  • Infrastructure dashboard — Kubernetes node and pod resource usage

2. Monitor the key indicators

Focus on these indicators for day-to-day monitoring:

| Indicator | What it tells you | Alert threshold (example) |
| --- | --- | --- |
| Request error rate | Percentage of requests returning errors (4xx/5xx) | Alert if > 1% over 5 minutes |
| Response latency (p95) | 95th percentile response time | Alert if > 500ms over 5 minutes |
| CPU utilization | Percentage of CPU limit consumed | Alert if > 80% sustained over 10 minutes |
| Memory utilization | Percentage of memory limit consumed | Alert if > 85% sustained over 10 minutes |
| Pod restart count | Number of pod restarts in a time window | Alert if > 3 restarts in 15 minutes |
| Database connection pool | Active vs available database connections | Alert if > 80% pool utilization |
| Certificate expiration | Days until TLS certificate expires | Alert if < 14 days |
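
To make the first two indicators concrete, here is a small sketch of how a p95 latency and an error rate are computed from a window of request records and checked against the example thresholds above. The `Request` type and helper names are hypothetical; a metrics backend would compute percentiles from histogram buckets rather than raw samples.

```python
# Hypothetical sketch: evaluating error rate and p95 latency over a
# 5-minute window of request records against the example thresholds.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def p95(latencies: list[float]) -> float:
    # Nearest-rank 95th percentile over raw samples.
    ordered = sorted(latencies)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

def error_rate(requests: list[Request]) -> float:
    errors = sum(1 for r in requests if r.status >= 400)
    return errors / len(requests)

# 100 requests: 97 fast successes, 3 slow server errors.
window = [Request(120, 200)] * 97 + [Request(900, 500)] * 3
assert error_rate(window) > 0.01                   # 3% errors: breaches the 1% threshold
assert p95([r.latency_ms for r in window]) < 500   # p95 is still within the 500ms threshold
```

Note that a low p95 can coexist with a breached error rate, which is why both indicators are monitored independently.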

3. Create custom dashboards

Build dashboards tailored to your operational needs:

  • Tenant-level dashboard — request volume, error rate, and latency per Tenant
  • Authorization performance — policy evaluation latency, decision throughput, cache hit rate
  • Deployment health — component versions, replica counts, rollout status

4. Configure alerts

Set up alerts that notify your team before issues impact users.

Alert design principles:

  • Alert on symptoms (high error rate), not causes (high CPU) — causes without symptoms do not require immediate action
  • Use multiple severity levels: critical (page immediately), warning (investigate soon), informational (review in next working hours)
  • Include runbook links in alert notifications to speed up response
  • Avoid alert fatigue — tune thresholds based on observed baselines, not arbitrary values
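
The "sustained over N minutes" thresholds in the table above can be sketched as follows: the alert fires only when every sample in the evaluation window breaches the threshold, which filters out brief spikes, and the notification carries a severity and a runbook link. Function names and the runbook URL are illustrative assumptions.

```python
# Hypothetical sketch of sustained-threshold evaluation with a severity
# and runbook link attached to the notification.
def evaluate(samples: list[float], threshold: float) -> bool:
    """True only if every sample in the window breaches the threshold."""
    return bool(samples) and all(s > threshold for s in samples)

def notify(metric: str, severity: str, runbook: str) -> str:
    return f"[{severity.upper()}] {metric} breached - runbook: {runbook}"

# CPU utilization sampled each minute over a 10-minute window:
cpu = [0.83, 0.85, 0.88, 0.84, 0.86, 0.87, 0.85, 0.90, 0.86, 0.84]
if evaluate(cpu, threshold=0.80):
    msg = notify("cpu_utilization", "warning",
                 "https://runbooks.example.com/high-cpu")  # placeholder URL
```

A single dip below the threshold resets the condition, so a momentary spike to 95% CPU does not page anyone, while ten minutes of sustained pressure does.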

5. Investigate performance bottlenecks

When metrics indicate a performance issue, use this investigation flow:

  1. Identify the affected service — which component shows elevated latency or error rate?
  2. Check resource utilization — is the service CPU-throttled or memory-constrained?
  3. Check dependencies — is a downstream service (database, cache, external API) slow?
  4. Correlate with traces — find slow traces for the affected service to identify the specific operation causing the bottleneck
  5. Correlate with logs — check error logs during the affected time window for additional context
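
The triage order above can be sketched as a decision function over a metric snapshot. The field names, thresholds, and the 80% "dependency dominates latency" heuristic are all illustrative assumptions, not Keymate metric names.

```python
# Hypothetical sketch of the triage flow: resources first, then
# dependencies, then hand off to traces and logs.
def triage(m: dict) -> str:
    if m.get("cpu_throttled_ratio", 0) > 0.25:
        return "resource: CPU-throttled - raise limits or scale out"
    if m.get("memory_utilization", 0) > 0.85:
        return "resource: memory-constrained - check for leaks or raise limits"
    # If downstream latency accounts for most of the service's p95,
    # suspect the dependency rather than the service itself.
    if m.get("db_latency_p95_ms", 0) > m.get("latency_p95_ms", 0) * 0.8:
        return "dependency: database dominates latency - inspect slow queries"
    return "inconclusive - correlate with traces and logs for the window"
```

Encoding the checks in this order mirrors the investigation flow: rule out the service's own resources before blaming a dependency, and fall back to traces and logs when metrics alone are inconclusive.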

6. Export metrics to external tools

If you use an external metrics backend, metrics dashboard, or observability platform, configure the OTel Collector to export metrics to your tool. See Export & Tooling Portability for details.

Validation Scenario

Scenario

An operator sets up monitoring for a newly deployed Keymate platform and needs to verify that metrics are flowing and alerts are functional.

Expected Result

  • The platform overview dashboard shows metrics from all components
  • Per-service dashboards show request rate, latency, and error rate
  • Infrastructure dashboard shows CPU, memory, and pod status
  • A test alert (e.g., temporarily lower a threshold) fires and reaches the notification channel

How to Verify

  • Open each dashboard and confirm data is present and updating
  • Trigger a test alert by lowering a threshold temporarily
  • Verify the alert notification arrives in the configured channel

Troubleshooting

  • No metrics appearing. Verify the telemetry collector is running. Check that you configured the metrics pipeline and that the storage backend is reachable.
  • Metrics from one service missing. Verify the service pod has the telemetry sidecar or SDK. Check the collector logs for dropped metrics.
  • Alerts not firing. Verify you configured alert rules correctly. Check the alert evaluation interval and notification channel configuration.
  • Dashboard shows gaps in data. This usually indicates collector restarts or storage issues during the gap period. Check collector and storage health for that time range.

Next Steps