How Keymate Migrated 20+ Million Identities to Keycloak

Keymate Team
December 31, 2025

Estimated read: 7-8 minutes

TL;DR

  • Migrating 20+ million identities is not a data copy task — it is a long-running, failure-aware distributed process.
  • Uncontrolled retries and unbounded concurrency can silently turn transient errors into systemic outages.
  • Keymate designed the migration to be observable, resumable, and safe by design, with no silent data loss.
  • Load on Keycloak was carefully controlled using bounded concurrency and backpressure, keeping the system fully utilized without overload.
  • Every record had an explicit success or failure state, allowing safe restarts, replays, and verification.
  • The result: a large-scale IAM migration completed without downtime, guesswork, or operational chaos.

Large-scale IAM migrations are rarely just about moving data. They are about risk management, system behavior under load, and maintaining control when things don’t go as planned.

At Keymate, where we are building an enterprise-grade IAM platform on top of Keycloak, we recently faced exactly this challenge: migrating more than 20 million records from a customer's existing PostgreSQL-backed IAM system into Keymate’s Keycloak-based runtime.

This post is not a checklist or a tuning guide. It's the story of how we designed a migration that could survive real-world conditions, and what we learned along the way.

The Challenge: Scale Without Guesswork

Our customer’s data consisted of more than 20 million identities, spread across multiple IAM domains such as realms, organizations, users, clients, and roles. The source system was already under heavy operational load, so the migration had to be executed within several non-negotiable constraints:

No Uncontrolled Retries

If retries are not bounded, even small failures can escalate quickly: every failed request generates more requests. Too many retries overload downstream systems, hide the root cause, and slow down recovery.
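
To make the idea of a bounded retry concrete, here is a minimal Python sketch using capped exponential backoff with jitter. It is an illustration under stated assumptions, not the migrator's actual code: the function name, the attempt limit, and the delay values are all hypothetical.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when an operation still fails after the allowed attempts."""


def call_with_bounded_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,      # hypothetical limit, not a value from the migration
    base_delay: float = 0.5,    # seconds; hypothetical
    max_delay: float = 30.0,    # cap so waits never grow without bound
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, trying at most `max_attempts` times in total."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # a real migrator would catch narrower types
            last_error = exc
            if attempt == max_attempts:
                break
            # Exponential backoff with full jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

The important property is that the loop always terminates: after the last attempt the failure is surfaced to the caller instead of being retried again, so a persistent error cannot turn into a retry storm.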

No Silent Data Loss

In identity systems, losing even a small percentage of records can lead to security gaps and irreversible trust issues. Missing users, roles, or credentials often surface much later as authentication failures or authorization anomalies that are difficult to trace back to the migration.

The Ability to Resume or Replay Failed Records

Long-running migrations must tolerate interruptions without forcing full restarts or manual recovery. Without resumability, teams are pushed toward risky "all-or-nothing" runs that increase downtime, operational stress, and the likelihood of human error.

Clear Visibility Into Progress and Failure States

Without precise visibility, teams lose the ability to make informed decisions under pressure. Lack of real-time insight turns migration into guesswork, making it impossible to know whether slowing down, stopping, or proceeding is the safest option.

Given these constraints, treating the migration as a one-off data transfer was simply not an option. It had to be approached as a carefully controlled, long-running distributed system, with safety, observability, and recoverability built in from day one.

Designing Migration as a Controlled System

Instead of treating migration as a bulk import, we modeled it as a work-driven process.

At the core of the design was a dedicated work queue in PostgreSQL. Each record to be migrated was transformed into a self-contained work item, carrying:

  • Its source identity
  • The target domain (user, client, etc.)
  • A JSON payload representing the target state
  • Retry metadata and error context

This gave us three critical properties:

  1. Deterministic processing – every record had a clear lifecycle
  2. Back-pressure awareness – throughput could be tuned dynamically
  3. Safe retries – failures were explicit, bounded, and recoverable

Nothing moved forward unless it was acknowledged as successful.

Work queue architecture: records with explicit success/failure states and retry metadata.
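
The lifecycle of such a work item can be sketched in a few lines of Python. This is a simplified in-memory model under our own assumptions: the state names, the field names, and the `max_attempts` budget are hypothetical, and in the real system each item would be a row in the PostgreSQL queue rather than a Python object.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Optional


class State(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"  # terminal: retry budget exhausted


@dataclass
class WorkItem:
    source_id: str                    # identity in the source system
    domain: str                       # target domain, e.g. "user" or "client"
    payload: Dict[str, Any]           # JSON-like target state
    state: State = State.PENDING
    attempts: int = 0                 # retry metadata
    max_attempts: int = 3             # hypothetical retry budget
    last_error: Optional[str] = None  # error context from the latest failure

    def claim(self) -> None:
        """Take the item for processing; only pending items can be claimed."""
        if self.state is not State.PENDING:
            raise ValueError(f"cannot claim item in state {self.state.name}")
        self.state = State.IN_PROGRESS
        self.attempts += 1

    def ack(self) -> None:
        """Mark the item as successfully migrated."""
        self._require_in_progress()
        self.state = State.SUCCEEDED

    def fail(self, error: str) -> None:
        """Record a failure; requeue while attempts remain, else mark FAILED."""
        self._require_in_progress()
        self.last_error = error
        if self.attempts < self.max_attempts:
            self.state = State.PENDING
        else:
            self.state = State.FAILED

    def _require_in_progress(self) -> None:
        if self.state is not State.IN_PROGRESS:
            raise ValueError(f"item is {self.state.name}, not IN_PROGRESS")
```

Because every item is always in exactly one state, a restarted run only needs to pick up the items that are still pending, and the failed ones can be inspected or replayed separately. In PostgreSQL, a claim step like this is commonly built on `SELECT ... FOR UPDATE SKIP LOCKED`; treat that as a general pattern rather than a description of this migration.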

Why We Built a Custom Keycloak Extension

One important architectural decision was not sending migration traffic directly to standard Keycloak endpoints.

Instead, we developed a custom Keycloak extension that acted as a controlled ingestion layer between the migrator and Keycloak itself.

This served multiple purposes:

  • Supporting domain-specific constraints that standard APIs could not express
  • Ensuring consistent validation and transformation during migration
  • Isolating migration behavior from runtime authentication flows
  • Giving us precise control over error semantics and retry decisions

From the migrator’s point of view, this extension was a single, stable contract. From Keycloak’s point of view, it allowed migration to be handled as a first-class internal process, not as external API noise.
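
As an illustration of what "control over error semantics and retry decisions" can look like on the migrator side, the sketch below maps a response status to one of three outcomes. It assumes the ingestion layer is reached over HTTP and that these particular status codes mean "try again later"; both are assumptions made for the example, and the extension's real contract is not described in this post.

```python
from enum import Enum


class Decision(Enum):
    ACK = "ack"        # record migrated, mark the work item as succeeded
    RETRY = "retry"    # transient problem, requeue within the retry budget
    REJECT = "reject"  # permanent problem, record the error and do not retry


# Hypothetical set of statuses treated as transient.
RETRYABLE_STATUSES = frozenset({408, 429, 502, 503, 504})


def decide(status_code: int) -> Decision:
    """Translate a response status into an explicit retry decision."""
    if 200 <= status_code < 300:
        return Decision.ACK
    if status_code in RETRYABLE_STATUSES:
        return Decision.RETRY
    # Everything else (validation errors, conflicts, ...) will not
    # succeed on a second attempt, so retrying would only add load.
    return Decision.REJECT
```

Keeping this mapping in one place means a record is never retried "just in case": each failure is either transient and bounded, or permanent and recorded.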

When Reality Hit: Throughput Was Far Below Expectations

When we started the migration, throughput was significantly lower than expected, at roughly 2 million records per hour.

At the same time, system signals were misleading:

  • CPU utilization was low
  • Memory usage was stable
  • Keycloak instances appeared healthy
  • Database connections were available but under pressure

At one point, even Keycloak’s metrics endpoint became intermittently unresponsive under load.

This was a critical moment. We could have increased parallelism and pushed harder, but that would have meant flying blind.

Instead, we slowed down and treated the migration itself as a system to be debugged.

Compound system pressure: database writes, connection pool, and retry storms.

The Turning Point: Measure, Reduce, Control

What we discovered was a classic distributed systems lesson:

More parallelism does not mean more throughput.

The system was under stress not because of a single bottleneck, but because of compound pressure across layers:

  • Database write amplification
  • Connection pool contention
  • Background maintenance tasks falling behind
  • Retry storms amplifying load

Our response was not one big change, but a series of deliberate, measured adjustments:

  • Reducing excessive concurrency
  • Tightening database connection pools
  • Making database behavior explicit and predictable
  • Isolating migration traffic from normal runtime behavior
  • Introducing smarter queuing and retry boundaries

Each change made the system calmer, and paradoxically, faster.
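
Bounded concurrency with backpressure can be sketched with a fixed worker pool reading from a bounded queue. The Python example below uses `asyncio` purely for illustration; the function name, the worker count, and the queue size are hypothetical and are not the values or the technology used in the migration.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


async def run_bounded(
    items: Iterable[T],
    handler: Callable[[T], Awaitable[None]],
    workers: int = 4,     # hypothetical cap on in-flight requests
    queue_size: int = 8,  # hypothetical buffer; the producer waits when it is full
) -> List[Tuple[T, Exception]]:
    """Process items with at most `workers` handlers running at once.

    Returns the (item, exception) pairs for handlers that raised, so that
    every failure is explicit rather than silently dropped.
    """
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
    failures: List[Tuple[T, Exception]] = []

    async def worker() -> None:
        while True:
            item = await queue.get()
            try:
                await handler(item)
            except Exception as exc:
                failures.append((item, exc))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    try:
        for item in items:
            # Backpressure: put() suspends the producer while the queue is full.
            await queue.put(item)
        await queue.join()  # wait until every queued item has been handled
    finally:
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
    return failures
```

The number of workers caps how many requests hit the target at once, and the bounded queue stops the producer from reading ahead faster than the workers can drain it. Lowering `workers` is the code-level equivalent of "reducing excessive concurrency".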

The Result: Stability First, Speed Follows

After these adjustments, migration throughput rose to a consistent level of well over 10 million records, with:

  • Predictable progress
  • Controlled retry behavior
  • No unbounded failure loops
  • Clear operational visibility

More importantly, the system remained stable and observable throughout the process.

For us, this was the real success metric.

Migration result: stable throughput with observable progress and controlled retries.

What This Means for Keymate

This migration reinforced a core belief behind Keymate:

IAM migration is not a data problem. It's a systems problem.

That's why Keymate is designed not just to authenticate users, but to:

  • Absorb large-scale change safely
  • Expose operational control points
  • Support real-world customer complexity
  • And behave predictably under pressure

If your IAM system cannot migrate safely, it probably won’t operate safely at scale either.

Final Thoughts

Large IAM migrations will always be challenging, but they don't have to be risky.

With the right architecture and clear boundaries, even migrations involving tens of millions of identities become predictable and uneventful, and in critical infrastructure projects, uneventful is exactly what you want.

Want the technical deep-dive?

In an upcoming post, we'll share the low-level engineering lessons behind this migration, including what actually slowed Keycloak down, why fewer connections made things faster, and how database behavior shaped the outcome.


This migration was only possible because of the IAM foundation we describe in our Baseline-First series.

Talk to the Keymate Team

Planning a large-scale IAM migration? Learn how Keymate helps teams migrate safely without downtime.
