FGA Backend Sync Issues

This guide provides diagnostic steps and resolution paths for synchronization failures between the Keymate platform and FGA Engine backends. Use this guide when authorization decisions return unexpected results or when policy changes do not take effect.

Symptom

Authorization decisions return unexpected results, or changes to policies, permissions, or relationships do not take effect. You may observe:

  • Permission checks return stale results after policy updates
  • Policy expressions do not reflect webhook events from Integration Hub
  • Audit logs are missing or delayed
  • Cache hit/miss patterns indicate sync failures
  • Response headers show increased latency or error status

Likely Causes

  • Webhook processing failures — Event validation errors or malformed payloads
  • Cache unavailability — Distributed cache connection timeouts or network issues
  • Audit collection failures — Audit service unavailable or gRPC connection issues
  • Token exchange failures — Keycloak token endpoint unreachable or misconfigured
  • Network partitions — Connectivity issues between FGA services

How to Diagnose

Check 1: Inspect Response Headers

Examine authorization response headers for sync status:

Header                              Value                      Meaning
Keymate-Decision                    result="allow|deny|error"  Authorization decision
Keymate-Decision-Cache              hit or miss                Cache status
Keymate-Decision-Latency            milliseconds               Total processing time
Keymate-Decision-Authority-Latency  milliseconds               Downstream call time (absent on cache hit)

Diagnosis:

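  • Consistent miss with high Keymate-Decision-Latency → Cache sync issue
  • error result in Keymate-Decision → Check the error code in the response body
  • Keymate-Decision-Authority-Latency missing on a miss → Early pipeline failure (the request failed before the downstream call)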
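
The diagnosis rules above can be sketched as a small helper. This is an illustration only: the guide does not define "high latency", so `HIGH_LATENCY_MS` is a hypothetical threshold, and the function name `diagnose` and the assumption that latency headers carry a plain number are not taken from the product.

```python
import re

# Hypothetical threshold; the guide does not define "high latency".
HIGH_LATENCY_MS = 200.0


def _millis(value: str) -> float:
    """Extract the first number from a header value, or 0.0 if none is present."""
    found = re.search(r"\d+(?:\.\d+)?", value)
    return float(found.group()) if found else 0.0


def diagnose(headers: dict) -> str:
    """Map Keymate-Decision-* response headers to the diagnosis hints in this guide."""
    h = {name.lower(): value.strip() for name, value in headers.items()}
    found = re.search(r'result="?(\w+)"?', h.get("keymate-decision", ""))
    result = found.group(1) if found else None
    cache = h.get("keymate-decision-cache", "").lower()
    latency = _millis(h.get("keymate-decision-latency", ""))

    if result == "error":
        return "error result: check the error code in the response body"
    if cache == "miss" and "keymate-decision-authority-latency" not in h:
        return "miss without authority latency: possible early pipeline failure"
    if cache == "miss" and latency > HIGH_LATENCY_MS:
        return "miss with high latency: possible cache sync issue if it persists"
    return "no sync issue indicated by headers"
```

A single slow miss is normal; the cache sync hint only matters when the same request keeps producing it.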

Check 2: Review Error Codes

Common error codes indicating sync issues:

Code                        Status  Description
TOKEN_INACTIVE              401     Token expired or revoked
TOKEN_EXCHANGE_FAILED       403     Keycloak token exchange failed
PERMISSION_CHECK_FAILED     500     Permission service unavailable
RESOURCE_RESOLUTION_FAILED  400     Access rules not matching
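
For scripted triage, the table can be held as a lookup. In the sketch below, the status and description values are copied from the table; the "next step" hints are one reading of the checks and resolutions in this guide, and the function name `triage` is hypothetical. How the code is extracted from the response body is left out because the body format is not described here.

```python
# Status and description come from the error code table in this guide.
# The third field is a suggested next step, not an official mapping.
ERROR_CODES = {
    "TOKEN_INACTIVE": (401, "Token expired or revoked",
                       "Resolution 4: check token validity"),
    "TOKEN_EXCHANGE_FAILED": (403, "Keycloak token exchange failed",
                              "Resolution 4: Token Exchange Failures"),
    "PERMISSION_CHECK_FAILED": (500, "Permission service unavailable",
                                "Check 3: Verify Service Health"),
    "RESOURCE_RESOLUTION_FAILED": (400, "Access rules not matching",
                                   "Resolution 4: review access rules"),
}


def triage(code: str, status: int) -> str:
    """Describe an error code and flag an HTTP status that differs from the table."""
    entry = ERROR_CODES.get(code)
    if entry is None:
        return f"{code}: not listed in this guide"
    expected_status, description, next_step = entry
    note = "" if status == expected_status else f" (unexpected HTTP status {status})"
    return f"{code}: {description}{note}. Next: {next_step}"
```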

Check 3: Verify Service Health

Check health endpoints for FGA services:

# TODO: replace with actual diagnostic command
curl -f http://<fga-service-host>:<port>/health

Health endpoints return service status:

{
  "status": "UP",
  "checks": [...]
}

If status is not UP, the service is reporting a problem. The checks array may indicate which component is affected.
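
A minimal sketch of reading that response programmatically follows. It assumes each entry in `checks` is an object with `name` and `status` fields, which this guide does not confirm because the array contents are elided. The function names are hypothetical, and the URL passed to `fetch_health` is whatever host and port you substitute in the command above.

```python
import json
import urllib.error
import urllib.request


def fetch_health(url: str, timeout: float = 5.0) -> dict:
    """Fetch a health endpoint and parse its JSON body.

    A non-2xx response is still parsed, since an unhealthy service
    may return its status document with an error status code.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        return json.load(err)


def is_up(health: dict) -> bool:
    """True only when the top-level status is exactly UP."""
    return health.get("status") == "UP"


def failing_checks(health: dict) -> list:
    """Names of checks whose status is not UP (assumed entry shape: name, status)."""
    return [
        check.get("name", "<unnamed>")
        for check in health.get("checks", [])
        if isinstance(check, dict) and check.get("status") != "UP"
    ]
```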

Check 4: Review Audit Logs

Look for these patterns in audit logs:

Log Pattern              Meaning
Webhook success          Event processed successfully
Webhook failed           Event validation or processing error
Audit request failed     Audit collection temporarily unavailable
Cache operation timeout  Distributed cache connectivity issue
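
To get a quick count of these patterns over an exported log file, a sketch such as the following may help. It assumes the patterns appear as literal, case-insensitive substrings of each log line; the actual log format is not specified in this guide, and `count_patterns` is a hypothetical helper name.

```python
from collections import Counter

# Patterns and meanings as listed in the log pattern table in this guide.
PATTERNS = {
    "Webhook success": "Event processed successfully",
    "Webhook failed": "Event validation or processing error",
    "Audit request failed": "Audit collection temporarily unavailable",
    "Cache operation timeout": "Distributed cache connectivity issue",
}


def count_patterns(lines) -> Counter:
    """Count how many lines contain each known pattern (case-insensitive)."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for pattern in PATTERNS:
            if pattern.lower() in lowered:
                counts[pattern] += 1
    return counts
```

A rising count of "Webhook failed" relative to "Webhook success" in the same time window points toward Resolution 1; repeated "Cache operation timeout" lines point toward Check 5.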

Check 5: Check Cache Connectivity

If cache operations are failing:

  1. Verify distributed cache service is running
  2. Check network connectivity between services
  3. Review cache operation retry logs

How to Resolve

Resolution 1: Webhook Event Failures

If policy expression changes are not syncing:

  1. Validate event payload — Ensure all required fields are present:

    • Event source: id, service, type, resource type, operation type, timestamp
    • Payload: id, time, realm, resource, action, auth context
    • Context map: all fields required for the event type
  2. Check supported event types — Only these events are processed:

    • attribute-definition:create
    • attribute-definition:update
    • attribute-definition:delete
  3. Retry the event — If validation passed, retry from Integration Hub
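
The validation in steps 1 and 2 can be sketched as a pre-flight check before retrying. The supported event types are taken from the list above. The key names (`source`, `payload`, `resource_type`, `operation_type`, `auth_context`) are hypothetical spellings of the fields named in step 1, since the wire format is not shown here. The context map is not checked because its required fields depend on the event type.

```python
SUPPORTED_EVENTS = {
    "attribute-definition:create",
    "attribute-definition:update",
    "attribute-definition:delete",
}

# Hypothetical key names for the fields listed in step 1.
REQUIRED_SOURCE = ("id", "service", "type", "resource_type", "operation_type", "timestamp")
REQUIRED_PAYLOAD = ("id", "time", "realm", "resource", "action", "auth_context")


def validate_event(event: dict, event_type: str) -> list:
    """Return a list of problems; an empty list means the event looks retryable."""
    problems = []
    if event_type not in SUPPORTED_EVENTS:
        problems.append(f"unsupported event type: {event_type}")
    for section, required in (("source", REQUIRED_SOURCE), ("payload", REQUIRED_PAYLOAD)):
        body = event.get(section) or {}
        for field in required:
            if body.get(field) in (None, ""):
                problems.append(f"{section}.{field} is missing")
    return problems
```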

Resolution 2: Cache Sync Issues

If permission decisions are stale:

  1. Wait for cache TTL — Permission cache has a short TTL (typically seconds)
  2. Verify cache service — Check distributed cache health
  3. Review retry configuration — Cache operations retry automatically with backoff
  4. Monitor cache headers: Keymate-Decision-Cache should show hit after the initial miss
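
Step 4 can be checked against a series of recorded header values. The sketch below assumes you have collected the Keymate-Decision-Cache value from several identical requests sent in quick succession; `cache_warms_up` is a hypothetical helper name.

```python
def cache_warms_up(observed: list) -> bool:
    """True if at least one hit follows the first miss in the observed sequence.

    `observed` holds Keymate-Decision-Cache values in request order.
    """
    values = [value.strip().lower() for value in observed]
    if "miss" not in values:
        return True  # no miss recorded, so there is nothing to diagnose
    first_miss = values.index("miss")
    return "hit" in values[first_miss + 1:]
```

Because the TTL is short, requests spaced further apart than the TTL will legitimately miss every time, so a False result is only meaningful for closely spaced requests.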

Resolution 3: Audit Collection Failures

If audit logs are missing:

  1. Check audit service — Verify gRPC port is accessible
  2. Review network connectivity — Ensure services can reach audit collector
  3. Wait for retry — Audit failures retry with exponential backoff
  4. Note: Audit failures are non-blocking — authorization continues
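
For step 1, a plain TCP connection attempt is a quick way to confirm the port is reachable from the calling service's network. This sketch only shows that something accepts connections on the port; it does not speak gRPC or confirm that the audit service itself is healthy. The host and port are whatever your deployment uses.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```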

Resolution 4: Token Exchange Failures

If token exchange consistently fails:

  1. Verify Keycloak availability — Check token endpoint is reachable
  2. Check client configuration — Ensure target client exists and accepts exchange
  3. Review access rules — Verify rule patterns match the request
  4. Check token validity — Source token must not be expired
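
For step 4, if the source token is a JWT (an assumption; this guide does not state the token format), its expiry can be read from the `exp` claim without contacting Keycloak. The sketch below decodes the payload without verifying the signature, so it is suitable for diagnostics only and says nothing about revocation.

```python
import base64
import json
import time


def token_expired(token: str, now: float = None) -> bool:
    """Return True if the JWT's exp claim is at or before `now` (default: current time).

    The signature is NOT verified; use this for troubleshooting only.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload_b64 = parts[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    claims = json.loads(base64.urlsafe_b64decode(padded))
    expiry = claims.get("exp")
    if expiry is None:
        raise ValueError("token has no exp claim")
    current = time.time() if now is None else now
    return expiry <= current
```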

Signals to Inspect

  • Logs: Look for webhook success/failure, audit request status, cache timeout patterns
  • Metrics: Monitor cache hit rate, authorization latency, error rate by code
  • Traces: Check distributed traces for latency breakdown across services
  • Audit events: Review authorization decisions and policy change events

Key Metrics to Monitor

Metric                   Description                Alert Threshold
Cache hit rate           Percentage of cache hits   < 80% may indicate issues
Authorization latency    P99 response time          Sudden increase indicates sync delay
Error rate by code       Errors grouped by code     Any sustained increase
Webhook processing time  Event processing duration  Timeout threshold
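
If these metrics are not yet exported by your monitoring stack, the first two can be approximated from sampled response headers. The sketch below computes the cache hit rate from Keymate-Decision-Cache values and a nearest-rank P99 from Keymate-Decision-Latency values. The 80% threshold comes from the table; the function names are hypothetical.

```python
def cache_hit_rate(cache_values: list) -> float:
    """Fraction of samples whose Keymate-Decision-Cache value is hit."""
    if not cache_values:
        raise ValueError("no samples")
    hits = sum(1 for value in cache_values if value.strip().lower() == "hit")
    return hits / len(cache_values)


def p99(latencies_ms: list) -> float:
    """Nearest-rank 99th percentile of the latency samples."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceiling of 0.99 * n
    return ordered[rank - 1]


def hit_rate_alert(rate: float, threshold: float = 0.80) -> bool:
    """True when the hit rate falls below the threshold from the metrics table."""
    return rate < threshold
```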

Escalation Notes

Escalate to the platform team when:

  • Cache service is completely unavailable for more than 5 minutes
  • Webhook events are consistently rejected after payload validation
  • Token exchange failures persist after Keycloak verification
  • Authorization latency exceeds SLA thresholds

Data to attach:

  • Response headers from failed requests
  • Error codes and messages
  • Service health check results
  • Relevant time window for log analysis
  • Cache hit/miss statistics

Danger: If authorization is consistently failing for all users, this may indicate a critical sync failure. Check all FGA service health endpoints and distributed cache connectivity immediately.

Next Step

After resolving sync issues, verify authorization behavior by testing permission checks with the Decision Simulation tool.