Metrics
Goal
Use the Keymate observability platform to monitor infrastructure and application metrics, create dashboards for key indicators, configure alerts for critical conditions, and detect performance bottlenecks. By the end of this guide, you will have dashboards that show platform health and alerts that notify you before issues impact users.
Audience
Operators responsible for monitoring performance and availability of the Keymate platform.
Prerequisites
- A running Keymate deployment with the observability layer deployed
- Access to the observability dashboard
Before You Start
Keymate emits metrics from all platform components through the OpenTelemetry pipeline. Metrics cover three categories: infrastructure, application, and custom business metrics. The built-in observability platform ships with pre-configured dashboards for the most important indicators.
Metric Categories
| Category | What it measures | Examples |
|---|---|---|
| Infrastructure | Kubernetes and system-level resource usage | CPU utilization, memory usage, disk I/O, network throughput, pod restart count |
| Application | Platform service behavior and performance | Request rate, response latency (p50/p95/p99), error rate, active sessions, authentication throughput |
| Custom business | Domain-specific indicators | Authorization decisions per second, policy evaluation latency, Tenant-specific request volume |
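Because Keymate emits metrics through the OpenTelemetry pipeline, metrics in the custom business category can be recorded with a standard OTel SDK. The sketch below assumes the OpenTelemetry Python SDK; the meter, instrument, and attribute names are illustrative, not Keymate-defined identifiers.

```python
# Minimal sketch: recording custom business metrics with the OpenTelemetry
# Python SDK. Instrument and attribute names are illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("keymate.authz")  # assumed meter name

# Counter: authorization decisions; the per-second rate is derived at query time.
decision_counter = meter.create_counter(
    "authz.decisions",
    unit="1",
    description="Number of authorization decisions",
)

# Histogram: policy evaluation latency, feeds p50/p95/p99 panels.
eval_latency = meter.create_histogram(
    "authz.policy_eval.duration",
    unit="ms",
    description="Policy evaluation latency",
)

def record_decision(tenant: str, allowed: bool, duration_ms: float) -> None:
    attributes = {"tenant": tenant, "decision": "allow" if allowed else "deny"}
    decision_counter.add(1, attributes)
    eval_latency.record(duration_ms, attributes)
```

The counter backs an authorization-decisions-per-second panel (the rate is computed at query time), and the histogram backs the policy evaluation latency percentiles.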
Steps
1. Access the metrics dashboard
Open the observability dashboard. The platform provides pre-built dashboards organized by category:
- Platform overview — high-level health across all services
- Per-service dashboards — detailed metrics for each platform component
- Infrastructure dashboard — Kubernetes node and pod resource usage
2. Monitor the key indicators
Focus on these indicators for day-to-day monitoring:
| Indicator | What it tells you | Alert threshold (example) |
|---|---|---|
| Request error rate | Percentage of requests returning errors (4xx/5xx) | Alert if > 1% over 5 minutes |
| Response latency (p95) | 95th percentile response time | Alert if > 500ms over 5 minutes |
| CPU utilization | Percentage of CPU limit consumed | Alert if > 80% sustained over 10 minutes |
| Memory utilization | Percentage of memory limit consumed | Alert if > 85% sustained over 10 minutes |
| Pod restart count | Number of pod restarts in a time window | Alert if > 3 restarts in 15 minutes |
| Database connection pool | Active vs available database connections | Alert if > 80% pool utilization |
| Certificate expiration | Days until TLS certificate expires | Alert if < 14 days |
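The example thresholds above are evaluated over a window rather than on single samples, so brief spikes do not fire alerts. The sketch below illustrates that windowed semantics; the counts and samples are assumed to come from your metrics backend, and the query mechanism is deliberately left out.

```python
# Sketch of windowed threshold checks matching the example thresholds above.
# Input values are assumed to come from your metrics backend.
from typing import Sequence

def error_rate_breached(errors: int, total: int, threshold: float = 0.01) -> bool:
    """Request error rate > 1% over the window (e.g. 5 minutes)."""
    return total > 0 and (errors / total) > threshold

def sustained_breach(samples: Sequence[float], threshold: float) -> bool:
    """'Sustained' CPU/memory breach: every sample in the window exceeds the
    threshold, so a single short spike does not trigger the alert."""
    return bool(samples) and all(s > threshold for s in samples)
```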
3. Create custom dashboards
Build dashboards tailored to your operational needs:
- Tenant-level dashboard — request volume, error rate, and latency per Tenant
- Authorization performance — policy evaluation latency, decision throughput, cache hit rate
- Deployment health — component versions, replica counts, rollout status
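Derived indicators such as cache hit rate can either be computed at query time from raw hit/miss counters or exposed directly as a metric. As a sketch of the second option, assuming the OpenTelemetry Python SDK and illustrative names:

```python
# Sketch: exposing a derived cache hit ratio for an authorization
# performance dashboard. Names are illustrative, not Keymate identifiers.
from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

meter = metrics.get_meter("keymate.authz.cache")

# Raw counts maintained by the service (placeholders in this sketch).
cache_hits = 0
cache_misses = 0

def _hit_ratio(options: CallbackOptions):
    total = cache_hits + cache_misses
    yield Observation(cache_hits / total if total else 0.0)

meter.create_observable_gauge(
    "authz.cache.hit_ratio",
    callbacks=[_hit_ratio],
    description="Fraction of policy cache lookups served from cache",
)
```

Computing the ratio in the dashboard from hit/miss counters instead keeps the raw counts available for other panels; exporting the ratio directly keeps dashboard queries simpler.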
4. Configure alerts
Set up alerts that notify your team before issues impact users.
Alert design principles:
- Alert on symptoms (high error rate), not causes (high CPU) — causes without symptoms do not require immediate action
- Use multiple severity levels: critical (page immediately), warning (investigate soon), informational (review during the next working hours)
- Include runbook links in alert notifications to speed up response
- Avoid alert fatigue — tune thresholds based on observed baselines, not arbitrary values
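How alert rules are expressed depends on your alerting backend, but each rule should carry the condition, the duration it must hold, a severity, and a runbook link. The sketch below shows such rules as plain data; the field names and URLs are illustrative, not a Keymate schema.

```python
# Illustrative alert rule definitions reflecting the principles above:
# symptom-based conditions, explicit severity, and runbook links.
# Field names and URLs are placeholders, not a Keymate schema.
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    condition: str      # backend-specific query or expression
    for_duration: str   # how long the condition must hold before firing
    severity: str       # "critical" | "warning" | "informational"
    runbook_url: str

RULES = [
    AlertRule(
        name="HighRequestErrorRate",
        condition="request error rate > 1%",   # symptom, not cause
        for_duration="5m",
        severity="critical",
        runbook_url="https://runbooks.example.com/high-error-rate",
    ),
    AlertRule(
        name="CertificateExpiringSoon",
        condition="days until TLS certificate expiry < 14",
        for_duration="1h",
        severity="warning",
        runbook_url="https://runbooks.example.com/certificate-rotation",
    ),
]
```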
5. Investigate performance bottlenecks
When metrics indicate a performance issue, use this investigation flow:
- Identify the affected service — which component shows elevated latency or error rate?
- Check resource utilization — is the service CPU-throttled or memory-constrained?
- Check dependencies — is a downstream service (database, cache, external API) slow?
- Correlate with traces — find slow traces for the affected service to identify the specific operation causing the bottleneck
- Correlate with logs — check error logs during the affected time window for additional context
6. Export metrics to external tools
If you use an external metrics backend, dashboarding tool, or observability platform, configure the OTel Collector to export metrics to it. See Export & Tooling Portability for details.
Validation Scenario
Scenario
An operator sets up monitoring for a newly deployed Keymate platform and needs to verify that metrics are flowing and alerts are functional.
Expected Result
- The platform overview dashboard shows metrics from all components
- Per-service dashboards show request rate, latency, and error rate
- Infrastructure dashboard shows CPU, memory, and pod status
- A test alert (e.g., temporarily lower a threshold) fires and reaches the notification channel
How to Verify
- Open each dashboard and confirm data is present and updating
- Trigger a test alert by lowering a threshold temporarily
- Verify the alert notification arrives in the configured channel
Troubleshooting
- No metrics appearing. Verify the telemetry collector is running. Check that you configured the metrics pipeline and that the storage backend is reachable.
- Metrics from one service missing. Verify the service pod has the telemetry sidecar or SDK. Check the collector logs for dropped metrics.
- Alerts not firing. Verify you configured alert rules correctly. Check the alert evaluation interval and notification channel configuration.
- Dashboard shows gaps in data. This usually indicates collector restarts or storage issues during the gap period. Check collector and storage health for that time range.
Next Steps
- Traces & Root Cause Analysis — Drill into specific slow requests
- Scaling & Performance — Scale components based on metric insights
- Logs — Correlate metrics with log entries