Deployment Best Practices

Goal

Apply production-grade best practices to your Keymate deployment. This guide covers high availability, resource management, monitoring activation, backup planning, upgrade strategies, and namespace organization — turning an initial installation into a production-ready platform.

Audience

Platform engineers and operators responsible for running Keymate in production environments.

Prerequisites

  • A running Keymate deployment (Helm-based or GitOps-based)
  • Administrative access to the Kubernetes cluster
  • Familiarity with Kubernetes resource management (requests, limits, PodDisruptionBudgets)

Before You Start

These best practices apply after the initial installation is complete and verified. If you have not installed the platform yet, start with the Pre-Deployment Checklist and the appropriate installation guide.

Steps

1. Configure high availability

Run critical components with multiple replicas to eliminate single points of failure.

Recommended replica counts for production:

| Component | Minimum replicas | Notes |
| --- | --- | --- |
| Identity Provider | 2 | Session affinity recommended |
| Authorization Engine | 2 | Stateless; scales horizontally |
| API Gateway | 2 | Handles all inbound traffic |
| Platform Services | 2 | Per service instance |
| Relational Database | 2 (primary + replica) | Use operator-managed failover |
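Many Helm charts expose replica counts as values; the fragment below is an illustrative sketch, not Keymate's actual values schema (key names are assumptions):

```yaml
# Illustrative values.yaml fragment -- key names depend on your chart.
identityProvider:
  replicaCount: 2
authorizationEngine:
  replicaCount: 2
apiGateway:
  replicaCount: 2
platformServices:
  replicaCount: 2
database:
  instances: 2   # primary + replica, failover managed by the database operator
```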

Configure PodDisruptionBudgets to prevent voluntary disruptions from taking all replicas offline at once:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: identity-provider-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: identity-provider
```

2. Set resource requests and limits

Set explicit CPU and memory requests and limits for every Keymate component. This prevents resource contention and ensures the Kubernetes scheduler places pods on nodes with sufficient capacity.

Principles:

  • Set requests to the expected steady-state usage — this reserves the resources
  • Set limits to handle peak workloads — this prevents a single component from consuming all node resources
  • Never deploy production workloads without resource requests

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"
```
Tip: Start with conservative limits and adjust them based on observed usage in the Metrics dashboard. Over-provisioning wastes resources; under-provisioning causes OOM kills and CPU throttling.

3. Enable monitoring from day one

Do not wait for an incident to set up monitoring. Enable observability immediately after installation.

Minimum monitoring setup:

  • Activate the Observability layer during installation
  • Configure dashboards for key indicators: request latency, error rate, CPU/memory usage, database connection pool utilization
  • Set up alerts for critical conditions: pod restarts, high error rates, certificate expiration, database replication lag
  • Verify that logs, metrics, and traces are flowing into the observability platform
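If your observability stack includes the Prometheus Operator, alerts such as the pod-restart condition above can be expressed as a PrometheusRule. The rule below is a hedged sketch; the namespace, label pattern, and threshold are assumptions, not Keymate defaults:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keymate-pod-restarts
  namespace: platform-observability   # assumed namespace
spec:
  groups:
    - name: keymate.availability
      rules:
        - alert: KeymatePodRestarting
          # Fires when a container restarts more than 3 times in 15 minutes.
          expr: increase(kube_pod_container_status_restarts_total{namespace=~"platform-.*"}[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting repeatedly"
```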

If you use external monitoring tools, configure telemetry export through the OpenTelemetry-first model. You can run both the built-in observability stack and your own tools simultaneously.
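As a sketch of what that export might look like in an OpenTelemetry Collector configuration (the external endpoint and the `otlp/builtin` exporter name are assumptions), you fan each pipeline out to a second exporter alongside the built-in one:

```yaml
# Illustrative Collector fragment: send telemetry to an external backend
# in addition to the built-in stack. The endpoint is a placeholder, and
# otlp/builtin stands in for the exporter the built-in stack already uses.
exporters:
  otlp/external:
    endpoint: otel.example.com:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/builtin, otlp/external]
    metrics:
      receivers: [otlp]
      exporters: [otlp/builtin, otlp/external]
```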

4. Implement a backup strategy

Protect against data loss by establishing regular backups for all persistent data.

What to back up:

| Data | Frequency | Retention |
| --- | --- | --- |
| Identity provider database | Daily + before upgrades | 30 days minimum |
| Authorization engine data | Daily + before upgrades | 30 days minimum |
| Platform configuration | On every change (GitOps handles this automatically) | Full Git history |
| TLS certificates and secrets | On change | Align with certificate lifecycle |

Backup principles:

  • Automate backups — teams forget manual backups under pressure
  • Store backups in a separate location from the cluster (object storage, off-site storage)
  • Test restores regularly — a backup you cannot restore is not a backup
  • Document the restore procedure and ensure at least two team members can execute it
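Backup automation can be as simple as a Kubernetes CronJob. The manifest below is a hedged sketch for a PostgreSQL-style identity database; the image, Secret name, and PVC are assumptions:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: identity-db-backup
  namespace: platform-data
spec:
  schedule: "0 2 * * *"                # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:16       # assumed client image
              command: ["/bin/sh", "-c"]
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backup/identity-$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: identity-db-credentials   # assumed Secret providing DATABASE_URL
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-storage           # assumed PVC
```

Per the principles above, ship the resulting archives off-cluster (for example, sync the volume to object storage) rather than leaving them on a PVC in the same cluster they protect.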

5. Plan your upgrade strategy

Keymate is a multi-component platform. Upgrading requires coordination across layers.

Upgrade principles:

  • Upgrade one layer at a time following the dependency order: infrastructure → data → application → observability
  • Read release notes before every upgrade — pay attention to breaking changes and migration requirements
  • Test upgrades in a non-production environment before applying to production
  • Take database backups before every upgrade
  • Monitor the platform closely after each upgrade for unexpected behavior

For GitOps deployments:

Upgrades are Git commits. Promote version changes through your environment pipeline (dev → staging → production) and let ArgoCD apply them.

For Helm deployments:

Use `helm upgrade` per component in the correct dependency order. Validate each layer before proceeding to the next.
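A hedged sketch of that order, assuming one Helm release per layer (release, chart, and namespace names are placeholders):

```shell
# Placeholders throughout -- substitute your actual release and chart names.
# Validate each layer (pods Ready, dashboards green) before the next command.
helm upgrade keymate-infra  keymate/infrastructure -n platform-infrastructure
helm upgrade keymate-data   keymate/data           -n platform-data
helm upgrade keymate-app    keymate/application    -n platform-application
helm upgrade keymate-observ keymate/observability  -n platform-observability
```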

6. Organize namespaces

Use dedicated namespaces to isolate components by function. This improves access control, resource quotas, and operational visibility.

Recommended namespace layout:

| Namespace | Purpose |
| --- | --- |
| Platform infrastructure | Service mesh, certificate management |
| Platform data | Databases, caches, message brokers |
| Platform application | Identity, authorization, gateway, services |
| Platform observability | Telemetry, dashboards, alerting |
Warning: Avoid deploying all Keymate components into a single namespace. Namespace separation enables fine-grained RBAC, independent resource quotas, and cleaner operational visibility.
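One way to make the separation concrete is a Namespace plus an independent ResourceQuota per layer. The manifest below is a sketch; the names, label, and quota numbers are assumptions to adjust for your cluster:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: platform-data
  labels:
    keymate.io/layer: data        # hypothetical label for RBAC/NetworkPolicy selectors
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: platform-data-quota
  namespace: platform-data
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```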

7. Secure secrets management

Production deployments require disciplined secrets handling.

  • Store all credentials in Kubernetes Secrets, not in Helm values files or Git repositories
  • Use an external secrets operator (e.g., External Secrets Operator, Vault) if your organization requires centralized secrets management
  • Rotate database passwords and API keys on a regular schedule
  • Audit secret access through Kubernetes audit logs
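With the External Secrets Operator, centrally managed credentials can be synced into the cluster instead of being committed anywhere. The ExternalSecret below is a hedged sketch; the store name, Vault path, and namespace are assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: identity-db-credentials
  namespace: platform-application
spec:
  refreshInterval: 1h              # periodic re-sync picks up rotated values
  secretStoreRef:
    name: vault-backend            # assumed ClusterSecretStore pointing at Vault
    kind: ClusterSecretStore
  target:
    name: identity-db-credentials  # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: keymate/identity-db   # assumed Vault path
        property: password
```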

8. Configure network policies

Restrict network traffic to only the communication paths that platform components require.

  • Allow inter-namespace traffic only between components that need to communicate
  • Block direct external access to data layer services (databases, caches)
  • Use the service mesh for mTLS enforcement between all platform services
  • Review network policies after adding new components or tenants
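As one sketch of blocking direct access to the data layer, a NetworkPolicy in the data namespace can admit ingress only from the application namespace (namespace names are assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-data-layer-ingress
  namespace: platform-data
spec:
  podSelector: {}                  # applies to every pod in platform-data
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: platform-application
```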

Validation Scenario

Scenario

After applying all best practices, verify that the production deployment meets operational readiness criteria.

Expected Result

  • All critical components run with 2+ replicas
  • PodDisruptionBudgets protect all stateful services
  • Every pod has resource requests and limits
  • Monitoring dashboards show metrics, logs, and traces flowing
  • Alerts fire correctly for test conditions (e.g., kill a pod, verify alert)
  • Database backup runs successfully and a test restore completes
  • Network policies block unauthorized traffic paths

How to Verify

  • `kubectl get pdb -A` — verify PodDisruptionBudgets exist
  • `kubectl top pods -A` — verify resource usage is within limits
  • Trigger a test alert and confirm it reaches the notification channel
  • Restore a database backup to a test instance and verify data integrity

Troubleshooting

  • Pods evicted or OOMKilled. Resource limits are too low. Increase memory limits and review actual usage in the metrics dashboard.
  • Rolling update takes too long. PodDisruptionBudget minAvailable may be too high relative to replica count. Ensure at least one pod can be disrupted at a time.
  • Backup job fails. Check storage permissions and available disk space in the backup destination. Verify network access from the cluster to the backup storage.
  • Upgrade breaks a component. Roll back to the previous version using `helm rollback` or a Git revert. Check release notes for missed migration steps.

Next Steps