Deployment Best Practices
Goal
Apply production-grade best practices to your Keymate deployment. This guide covers high availability, resource management, monitoring activation, backup planning, upgrade strategies, and namespace organization — turning an initial installation into a production-ready platform.
Audience
Platform engineers and operators responsible for running Keymate in production environments.
Prerequisites
- A running Keymate deployment (Helm-based or GitOps-based)
- Administrative access to the Kubernetes cluster
- Familiarity with Kubernetes resource management (requests, limits, PodDisruptionBudgets)
Before You Start
These best practices apply after the initial installation is complete and verified. If you have not installed the platform yet, start with the Pre-Deployment Checklist and the appropriate installation guide.
Steps
1. Configure high availability
Run critical components with multiple replicas to eliminate single points of failure.
Recommended replica counts for production:
| Component | Minimum replicas | Notes |
|---|---|---|
| Identity Provider | 2 | Session affinity recommended |
| Authorization Engine | 2 | Stateless — scales horizontally |
| API Gateway | 2 | Handles all inbound traffic |
| Platform Services | 2 | Per service instance |
| Relational Database | 2 (primary + replica) | Use operator-managed failover |
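With a Helm-based deployment, replica counts are typically set in the values file. A minimal sketch, assuming hypothetical value keys (`identityProvider.replicaCount`, and so on) — check your chart's actual values schema before using these names:

```yaml
# values-production.yaml — replica counts for HA (key names are illustrative)
identityProvider:
  replicaCount: 2
authorizationEngine:
  replicaCount: 2
apiGateway:
  replicaCount: 2
```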
Configure PodDisruptionBudgets to prevent voluntary disruptions from taking all replicas offline at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: identity-provider-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: identity-provider
```
2. Set resource requests and limits
Set explicit CPU and memory requests and limits for every Keymate component. This prevents resource contention and ensures the Kubernetes scheduler places pods on nodes with sufficient capacity.
Principles:
- Set requests to the expected steady-state usage — this reserves the resources
- Set limits to handle peak workloads — this prevents a single component from consuming all node resources
- Never deploy production workloads without resource requests
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
```
Start with conservative limits and adjust based on observed usage through the Metrics dashboard. Over-provisioning wastes resources; under-provisioning causes OOM kills and CPU throttling.
3. Enable monitoring from day one
Do not wait for an incident to set up monitoring. Enable observability immediately after installation.
Minimum monitoring setup:
- Activate the Observability layer during installation
- Configure dashboards for key indicators: request latency, error rate, CPU/memory usage, database connection pool utilization
- Set up alerts for critical conditions: pod restarts, high error rates, certificate expiration, database replication lag
- Verify that logs, metrics, and traces are flowing into the observability platform
If you use external monitoring tools, configure telemetry export through the OpenTelemetry-first model. You can run both the built-in observability stack and your own tools simultaneously.
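As one concrete example of an alert for a critical condition, a Prometheus alert rule for frequent pod restarts might look like the following. This is a sketch that assumes the Prometheus Operator CRDs are installed; the metric, threshold, and rule names are illustrative, not Keymate-specific:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keymate-critical-alerts
spec:
  groups:
    - name: keymate.availability
      rules:
        - alert: PodRestartingFrequently
          # Fires when a container restarts more than 3 times in 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently"
```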
4. Implement a backup strategy
Protect against data loss by establishing regular backups for all persistent data.
What to back up:
| Data | Frequency | Retention |
|---|---|---|
| Identity provider database | Daily + before upgrades | 30 days minimum |
| Authorization engine data | Daily + before upgrades | 30 days minimum |
| Platform configuration | On every change (GitOps handles this automatically) | Full Git history |
| TLS certificates and secrets | On change | Align with certificate lifecycle |
Backup principles:
- Automate backups — teams forget manual backups under pressure
- Store backups in a separate location from the cluster (object storage, off-site storage)
- Test restores regularly — a backup you cannot restore is not a backup
- Document the restore procedure and ensure at least two team members can execute it
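A daily automated backup can be implemented as a Kubernetes CronJob. The sketch below assumes a PostgreSQL identity database; the image, secret, and volume names are placeholders, not actual Keymate resource names:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: identity-db-backup
spec:
  schedule: "0 2 * * *"   # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:16
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backup/identity-$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: identity-db-credentials   # placeholder secret name
              volumeMounts:
                - name: backup-volume
                  mountPath: /backup
          volumes:
            - name: backup-volume
              persistentVolumeClaim:
                claimName: backup-pvc   # placeholder; prefer off-cluster object storage
```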
5. Plan your upgrade strategy
Keymate is a multi-component platform. Upgrading requires coordination across layers.
Upgrade principles:
- Upgrade one layer at a time following the dependency order: infrastructure → data → application → observability
- Read release notes before every upgrade — pay attention to breaking changes and migration requirements
- Test upgrades in a non-production environment before applying to production
- Take database backups before every upgrade
- Monitor the platform closely after each upgrade for unexpected behavior
For GitOps deployments:
Upgrades are Git commits. Promote version changes through your environment pipeline (dev → staging → production) and let ArgoCD apply them.
For Helm deployments:
Use helm upgrade per component in the correct dependency order. Validate each layer before proceeding to the next.
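The dependency-ordered upgrade can be sketched as a sequence of commands. The release and chart names below are illustrative, not the actual Keymate chart names — substitute your own:

```shell
# Upgrade layers in dependency order; validate each layer before the next
helm upgrade keymate-infra keymate/infrastructure -n platform-infra -f values-prod.yaml
kubectl rollout status deploy -n platform-infra --timeout=5m

helm upgrade keymate-data keymate/data -n platform-data -f values-prod.yaml
kubectl rollout status statefulset -n platform-data --timeout=10m

helm upgrade keymate-app keymate/application -n platform-app -f values-prod.yaml
helm upgrade keymate-obs keymate/observability -n platform-obs -f values-prod.yaml
```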
6. Organize namespaces
Use dedicated namespaces to isolate components by function. This improves access control, resource quotas, and operational visibility.
Recommended namespace layout:
| Namespace | Purpose |
|---|---|
| Platform infrastructure | Service mesh, certificate management |
| Platform data | Databases, caches, message brokers |
| Platform application | Identity, authorization, gateway, services |
| Platform observability | Telemetry, dashboards, alerting |
Avoid deploying all Keymate components into a single namespace. Namespace separation enables fine-grained RBAC, independent resource quotas, and cleaner operational visibility.
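Namespace separation lets you attach an independent ResourceQuota to each namespace. A sketch, with illustrative namespace names and quota values — size them from observed usage:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: platform-app-quota
  namespace: platform-app   # illustrative namespace name
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```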
7. Secure secrets management
Production deployments require disciplined secrets handling.
- Store all credentials in Kubernetes Secrets, not in Helm values files or Git repositories
- Use an external secrets operator (e.g., External Secrets Operator, Vault) if your organization requires centralized secrets management
- Rotate database passwords and API keys on a regular schedule
- Audit secret access through Kubernetes audit logs
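If you use the External Secrets Operator, a credential can be synced from a centralized store into a Kubernetes Secret. A sketch, assuming the operator is installed; the store, path, and key names are placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: identity-db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend        # placeholder SecretStore name
    kind: ClusterSecretStore
  target:
    name: identity-db-credentials   # Kubernetes Secret to create
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: keymate/identity-db    # placeholder path in the external store
        property: url
```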
8. Configure network policies
Restrict network traffic to only the communication paths that platform components require.
- Allow inter-namespace traffic only between components that need to communicate
- Block direct external access to data layer services (databases, caches)
- Use the service mesh for mTLS enforcement between all platform services
- Review network policies after adding new components or tenants
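Blocking direct access to the data layer can be expressed as a NetworkPolicy that only admits traffic from the application namespace. The namespace names, pod labels, and port below are illustrative assumptions, not Keymate defaults:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-app-only
  namespace: platform-data   # illustrative data-layer namespace
spec:
  podSelector:
    matchLabels:
      app: relational-database   # illustrative pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: platform-app
      ports:
        - protocol: TCP
          port: 5432
```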
Validation Scenario
Scenario
After applying all best practices, verify that the production deployment meets operational readiness criteria.
Expected Result
- All critical components run with 2+ replicas
- PodDisruptionBudgets protect all stateful services
- Every pod has resource requests and limits
- Monitoring dashboards show metrics, logs, and traces flowing
- Alerts fire correctly for test conditions (e.g., kill a pod, verify alert)
- Database backup runs successfully and a test restore completes
- Network policies block unauthorized traffic paths
How to Verify
- `kubectl get pdb -A` — verify PodDisruptionBudgets exist
- `kubectl top pods -A` — verify resource usage is within limits
- Trigger a test alert and confirm it reaches the notification channel
- Restore a database backup to a test instance and verify data integrity
Troubleshooting
- Pods evicted or OOMKilled. Resource limits are too low. Increase memory limits and review actual usage in the metrics dashboard.
- Rolling update takes too long. The PodDisruptionBudget `minAvailable` may be too high relative to the replica count. Ensure at least one pod can be disrupted at a time.
- Backup job fails. Check storage permissions and available disk space in the backup destination. Verify network access from the cluster to the backup storage.
- Upgrade breaks a component. Roll back to the previous version using `helm rollback` or a Git revert. Check release notes for missed migration steps.
Next Steps
- Production Hardening — Apply security-specific hardening on top of these operational best practices
- Scaling and Performance — Scale beyond the initial high availability setup
- Observability Overview — Deep dive into monitoring, alerting, and telemetry export