Why are large IAM migrations risky?

Because they combine high data volume, strict consistency requirements, and live production systems. Uncontrolled retries, missing observability, or silent failures can quickly turn migrations into outages.

Why was Keycloak chosen as the target system?

Keycloak provides a robust, extensible IAM foundation with strong community support. Keymate builds on top of Keycloak to address enterprise-scale migration, governance, and authorization challenges.

Was this a one-time data copy?

No. The migration was designed as a long-running, resumable, and observable process that could tolerate failures, restarts, and partial replays.

How did you avoid silent data loss?

Every record was tracked with explicit success and failure states, durable retry queues, and verifiable completion guarantees.

How did you control load on Keycloak?

By using bounded concurrency and backpressure-aware workers, ensuring the target system was fully utilized without being overwhelmed.

Can this approach be reused for other migrations?

Yes. The principles described—bounded concurrency, resumability, and full observability—apply to any large-scale data or identity migration.

How Keymate Migrated 20+ Million Identities to Keycloak

Keycloak was just the beginning. Meet Keymate.

Estimated read: 7-8 minutes

TL;DR

Migrating 20+ million identities is not a data copy task — it is a long-running, failure-aware distributed process.
Uncontrolled retries and unbounded concurrency can silently turn transient errors into systemic outages.
Keymate designed the migration to be observable, resumable, and safe by design, with no silent data loss.
Load on Keycloak was carefully controlled using bounded concurrency and backpressure, keeping the system fully utilized without overload.
Every record had an explicit success or failure state, allowing safe restarts, replays, and verification.
The result: a large-scale IAM migration completed without downtime, guesswork, or operational chaos.

The solution is implemented as an open-source migrator application, published at: https://github.com/keymate-io/keymate-migrator

Large-scale IAM migrations are rarely just about moving data. They are about risk management, system behavior under load, and maintaining control when things don’t go as planned.

As an enterprise-grade IAM platform built on top of Keycloak, Keymate recently faced exactly this challenge: migrating more than 20 million records from a customer's existing PostgreSQL-backed IAM system into Keymate’s Keycloak-based runtime.

This post is not a checklist or a tuning guide. It's the story of how we designed a migration that could survive real-world conditions, and what we learned along the way.

The Challenge: Scale Without Guesswork

Our customer’s data consisted of more than 20 million identities, spread across multiple IAM domains such as realms, organizations, users, clients, and roles. The source system was already under heavy operational load, so the migration had to run within several non-negotiable constraints:

No Uncontrolled Retries

If retries are not limited, even small failures can quickly cause serious problems. Too many retries can overload downstream systems, hide the real issue, and slow down recovery.
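
As a minimal sketch of what a bounded retry can look like (not the migrator's actual code), the helper below caps both the number of attempts and the delay between them. The names `run_with_bounded_retries`, `backoff_delay`, and `RetriesExhausted`, and the default limits, are assumptions chosen for illustration.

```python
import time


class RetriesExhausted(Exception):
    """Raised when an operation still fails after the final allowed attempt."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Exponential backoff (base, 2*base, 4*base, ...) that never exceeds `cap`.
    return min(cap, base * (2 ** (attempt - 1)))


def run_with_bounded_retries(operation, max_attempts: int = 5, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempt budget is spent."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            # Wait only if another attempt is still allowed.
            if attempt < max_attempts:
                sleep(backoff_delay(attempt))
    # The failure is surfaced explicitly instead of looping forever.
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

Once the budget is spent, the caller gets an explicit exception it can record, rather than a loop that keeps hammering a struggling downstream system.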

No Silent Data Loss

In identity systems, losing even a small percentage of records can lead to security gaps and irreversible trust issues. Missing users, roles, or credentials often surface much later as authentication failures or authorization anomalies that are difficult to trace back to the migration.

The Ability to Resume or Replay Failed Records

Long-running migrations must tolerate interruptions without forcing full restarts or manual recovery. Without resumability, teams are pushed toward risky "all-or-nothing" runs that increase downtime, operational stress, and the likelihood of human error.
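
One way to picture resumability is a state table that survives a restart. The sketch below uses SQLite from Python's standard library as a stand-in for a real database; the table name `work_items`, its columns, and the status values are hypothetical.

```python
import sqlite3


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS work_items (
               source_id TEXT PRIMARY KEY,
               status    TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def recover_after_restart(conn: sqlite3.Connection) -> int:
    # Items claimed by a worker that died mid-flight go back to 'pending'
    # so they are picked up again instead of being silently skipped.
    with conn:
        cursor = conn.execute(
            "UPDATE work_items SET status = 'pending' WHERE status = 'in_progress'"
        )
    return cursor.rowcount


def remaining(conn: sqlite3.Connection) -> list:
    # Everything not yet acknowledged as successful is still to be processed.
    rows = conn.execute(
        "SELECT source_id FROM work_items "
        "WHERE status IN ('pending', 'failed') ORDER BY source_id"
    )
    return [row[0] for row in rows]
```

Replaying an item that was in flight when the process stopped is only safe if writing it to the target twice has the same effect as writing it once, so a scheme like this presumes idempotent writes on the target side.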

Clear Visibility Into Progress and Failure States

Without precise visibility, teams lose the ability to make informed decisions under pressure. Lack of real-time insight turns migration into guesswork, making it impossible to know whether slowing down, stopping, or proceeding is the safest option.
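
To make "visibility" concrete, here is a small illustrative sketch that turns per-record states into a progress summary. The status names and the `summarize` function are assumptions for this example, not part of the published migrator.

```python
from collections import Counter


def summarize(statuses):
    """Aggregate per-record states into a progress snapshot."""
    counts = Counter(statuses)
    total = sum(counts.values())
    succeeded = counts["succeeded"]
    failed = counts["failed"]
    pending = counts["pending"]
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": failed,
        "pending": pending,
        # Records in a state we did not expect; should stay at zero.
        "unaccounted": total - (succeeded + failed + pending),
        "percent_complete": round(100.0 * succeeded / total, 2) if total else 0.0,
    }
```

A snapshot like this, taken at regular intervals, is enough to tell whether the run is progressing, stalling, or accumulating failures.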

Given these constraints, treating the migration as a one-off data transfer was simply not an option. It had to be approached as a carefully controlled, long-running distributed system, with safety, observability, and recoverability built in from day one.

Designing Migration as a Controlled System

Instead of treating migration as a bulk import, we modeled it as a work-driven process.

At the core of the design was a dedicated work queue in PostgreSQL. Each record to be migrated was transformed into a self-contained work item, carrying:

Its source identity
The target domain (user, client, etc.)
A JSON payload representing the target state
Retry metadata and error context

This gave us three critical properties:

Deterministic processing – every record had a clear lifecycle
Backpressure awareness – throughput could be tuned dynamically
Safe retries – failures were explicit, bounded, and recoverable

Nothing moved forward unless it was acknowledged as successful.
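
The shape of such a work item can be sketched as a small data structure. The field names, the `Status` values, and the attempt limit below are illustrative assumptions; the real schema lives in the migrator repository and may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"        # failed, still eligible for another attempt
    EXHAUSTED = "exhausted"  # retry budget spent, needs review or replay


@dataclass
class WorkItem:
    source_id: str                 # identity of the record in the source system
    domain: str                    # target domain, e.g. "user" or "client"
    payload: dict                  # JSON-like representation of the target state
    status: Status = Status.PENDING
    attempts: int = 0              # retry metadata
    last_error: Optional[str] = None  # error context from the latest failure

    def acknowledge(self) -> None:
        # Only an explicit acknowledgement moves the item to SUCCEEDED.
        self.attempts += 1
        self.status = Status.SUCCEEDED
        self.last_error = None

    def record_failure(self, error: str, max_attempts: int = 5) -> None:
        self.attempts += 1
        self.last_error = error
        if self.attempts >= max_attempts:
            self.status = Status.EXHAUSTED
        else:
            self.status = Status.FAILED
```

Because every transition goes through `acknowledge` or `record_failure`, an item can never leave the pending state without leaving a trace of what happened to it.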

Work queue architecture: records with explicit success/failure states and retry metadata.

Why We Built a Custom Keycloak Extension

One important architectural decision was not sending migration traffic directly to standard Keycloak endpoints.

Instead, we developed a custom Keycloak extension that acted as a controlled ingestion layer between the migrator and Keycloak itself.

This served multiple purposes:

Supporting domain-specific constraints that standard APIs could not express
Ensuring consistent validation and transformation during migration
Isolating migration behavior from runtime authentication flows
Giving us precise control over error semantics and retry decisions

From the migrator’s point of view, this extension was a single, stable contract. From Keycloak’s point of view, it allowed migration to be handled as a first-class internal process, not as external API noise.
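
To illustrate what "control over error semantics and retry decisions" can mean on the migrator side, the sketch below maps a response status to a decision. It assumes, purely for illustration, that the ingestion layer answers with HTTP-style status codes; the `Decision` names and the set of retryable codes are hypothetical and not taken from the extension's actual contract.

```python
from enum import Enum


class Decision(Enum):
    ACKNOWLEDGE = "acknowledge"  # record is done
    RETRY = "retry"              # transient problem, try again within the budget
    REJECT = "reject"            # permanent problem, retrying would not help


# Assumed transient conditions: timeout, throttling, and gateway/availability errors.
RETRYABLE_CODES = frozenset({408, 429, 502, 503, 504})


def decide(status_code: int) -> Decision:
    if 200 <= status_code < 300:
        return Decision.ACKNOWLEDGE
    if status_code in RETRYABLE_CODES:
        return Decision.RETRY
    # Anything else (for example a validation error) is recorded as a failure
    # rather than retried blindly.
    return Decision.REJECT
```

Keeping this mapping in one place means a validation error is never retried by accident, and a throttling response is never treated as permanent.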

When Reality Hit: Throughput Was Far Below Expectations

When we started the migration, throughput was significantly lower than expected: roughly 2 million records per hour.

At the same time, system signals were misleading:

CPU utilization was low
Memory usage was stable
Keycloak instances appeared healthy
Database connections were available but under pressure

At one point, even Keycloak’s metrics endpoint became intermittently unresponsive under load.

This was a critical moment. We could have increased parallelism and pushed harder, but that would have meant flying blind.

Instead, we slowed down and treated the migration itself as a system to be debugged.

Compound system pressure: database writes, connection pool, and retry storms.

The Turning Point: Measure, Reduce, Control

What we discovered was a classic distributed systems lesson:

More parallelism does not mean more throughput.

The system was under stress not because of a single bottleneck, but because of compound pressure across layers:

Database write amplification
Connection pool contention
Background maintenance tasks falling behind
Retry storms amplifying load

Our response was not one big change, but a series of deliberate, measured adjustments:

Reducing excessive concurrency
Tightening database connection pools
Making database behavior explicit and predictable
Isolating migration traffic from normal runtime behavior
Introducing smarter queuing and retry boundaries
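
The idea of reduced, bounded concurrency can be sketched with a fixed pool of workers fed by a bounded queue. This is an illustrative Python `asyncio` example, not the Quarkus and Mutiny implementation; `run_bounded`, `send`, and the default sizes are assumptions.

```python
import asyncio


async def run_bounded(items, send, workers: int = 4, queue_size: int = 16):
    """Process `items` with at most `workers` calls to `send` in flight."""
    queue = asyncio.Queue(maxsize=queue_size)
    results = []

    async def worker():
        while True:
            item = await queue.get()
            try:
                results.append((item, await send(item), None))
            except Exception as error:
                # Failures are recorded explicitly, never dropped.
                results.append((item, None, error))
            finally:
                queue.task_done()

    tasks = [asyncio.create_task(worker()) for _ in range(workers)]
    for item in items:
        # put() waits while the queue is full, which slows the producer
        # down to the pace the workers can sustain (backpressure).
        await queue.put(item)
    await queue.join()

    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results
```

The number of workers is the single knob that caps pressure on the target. Lowering it reduces contention, which is how less parallelism can end up yielding more throughput.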

Each change made the system calmer, and paradoxically, faster.

The Result: Stability First, Speed Follows

After these adjustments, migration throughput increased to a consistent 10+ million records, with:

Predictable progress
Controlled retry behavior
No unbounded failure loops
Clear operational visibility

More importantly, the system remained stable and observable throughout the process.

For us, this was the real success metric.

Migration result: stable throughput with observable progress and controlled retries.

What This Means for Keymate

This migration reinforced a core belief behind Keymate:

IAM migration is not a data problem. It's a systems problem.

That's why Keymate is designed not just to authenticate users, but to:

Absorb large-scale change safely
Expose operational control points
Support real-world customer complexity
And behave predictably under pressure

If your IAM system cannot migrate safely, it probably won’t operate safely at scale either.

Final Thoughts

Large IAM migrations will always be challenging, but they don't have to be risky.

With the right architecture and clear boundaries, even migrations involving tens of millions of identities become predictable and uneventful, and in critical infrastructure projects, uneventful is exactly what you want.

Want the technical deep-dive?

📖 Read Part 2: Keymate's Guide to Reactive Data Migration — The low-level engineering lessons behind this migration, including database-backed work queues, bounded concurrency and reactive execution with Quarkus and Mutiny.

This migration was only possible because of the IAM foundation we describe in our Baseline-First series:

Why Keymate? — Keep identity. Add context, visibility, and governance.
The Limits of RBAC — Why roles alone aren't enough at scale.
Multi-Tenancy & Delegation — Scoped authority without role explosion.
