Keycloak was just the beginning. Meet Keymate.
Large-scale IAM migrations are rarely just about moving data. They are about risk management, system behavior under load, and maintaining control when things don’t go as planned.
At Keymate, where we are building an enterprise-grade IAM platform on top of Keycloak, we recently faced exactly this challenge: migrating more than 20 million records from a customer's existing PostgreSQL-backed IAM system into Keymate’s Keycloak-based runtime.
This post is not a checklist or a tuning guide. It's the story of how we designed a migration that could survive real-world conditions, and what we learned along the way.
Our customer’s data consisted of more than 20 million identities, spread across multiple IAM domains such as realms, organizations, users, clients, and roles. The source system was already under heavy operational load, so the migration had to be executed within several non-negotiable constraints:
If retries are not limited, even small failures can quickly cause serious problems. Too many retries can overload downstream systems, hide the real issue, and slow down recovery.
In identity systems, losing even a small percentage of records can lead to security gaps and irreversible trust issues. Missing users, roles, or credentials often surface much later as authentication failures or authorization anomalies that are difficult to trace back to the migration.
Long-running migrations must tolerate interruptions without forcing full restarts or manual recovery. Without resumability, teams are pushed toward risky "all-or-nothing" runs that increase downtime, operational stress, and the likelihood of human error.
Without precise visibility, teams lose the ability to make informed decisions under pressure. Lack of real-time insight turns migration into guesswork, making it impossible to know whether slowing down, stopping, or proceeding is the safest option.
Given these constraints, treating the migration as a one-off data transfer was simply not an option. It had to be approached as a carefully controlled, long-running distributed system, with safety, observability, and recoverability built in from day one.
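To make the retry constraint above concrete, here is a minimal Python sketch of bounded retries with capped exponential backoff. The function name `call_with_bounded_retries`, the `RetriesExhausted` exception, and the default limits are illustrative assumptions, not details of the actual migrator.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriesExhausted(Exception):
    """Raised when an operation still fails after the allowed attempts."""


def call_with_bounded_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,        # hypothetical default
    base_delay_s: float = 0.5,    # hypothetical default
    max_delay_s: float = 8.0,     # hypothetical default
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, trying at most `max_attempts` times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_attempts:
                # Give up and surface the underlying error instead of hiding it.
                raise RetriesExhausted(
                    f"operation failed after {attempt} attempts"
                ) from exc
            # Exponential backoff, capped so waits stay predictable and the
            # downstream system gets room to recover.
            sleep(min(max_delay_s, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

The hard cap is the point of the sketch: once the attempts are used up, the original exception is chained onto `RetriesExhausted`, so the real failure stays visible rather than being buried under further retries.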
Instead of treating migration as a bulk import, we modeled it as a work-driven process.
At the core of the design was a dedicated work queue in PostgreSQL. Each record to be migrated was transformed into a self-contained work item, carrying:
This gave us three critical properties:
Nothing moved forward unless it was acknowledged as successful.
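The work-item lifecycle described above can be sketched in a few lines of Python. This is an assumption-heavy illustration: it uses the standard library's `sqlite3` as an in-memory stand-in for the PostgreSQL queue, and the table name, column names, and status values (`pending`, `in_progress`, `done`) are hypothetical rather than Keymate's actual schema.

```python
import json
import sqlite3

# Hypothetical schema: one row per record to migrate.
SCHEMA = """
CREATE TABLE work_items (
    id          INTEGER PRIMARY KEY,
    entity_type TEXT NOT NULL,
    payload     TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',
    attempts    INTEGER NOT NULL DEFAULT 0,
    last_error  TEXT
)
"""


def open_queue() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL queue
    conn.execute(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, entity_type: str, payload: dict) -> int:
    cur = conn.execute(
        "INSERT INTO work_items (entity_type, payload) VALUES (?, ?)",
        (entity_type, json.dumps(payload)),
    )
    conn.commit()
    return cur.lastrowid


def claim_batch(conn: sqlite3.Connection, limit: int) -> list:
    """Mark up to `limit` pending items as in progress and return them."""
    rows = conn.execute(
        "SELECT id, entity_type, payload FROM work_items "
        "WHERE status = 'pending' ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    conn.executemany(
        "UPDATE work_items SET status = 'in_progress', attempts = attempts + 1 "
        "WHERE id = ?",
        [(row[0],) for row in rows],
    )
    conn.commit()
    return [(row[0], row[1], json.loads(row[2])) for row in rows]


def acknowledge(conn: sqlite3.Connection, item_id: int) -> None:
    """Only an explicit acknowledgement moves an item to 'done'."""
    conn.execute(
        "UPDATE work_items SET status = 'done', last_error = NULL "
        "WHERE id = ? AND status = 'in_progress'",
        (item_id,),
    )
    conn.commit()


def release(conn: sqlite3.Connection, item_id: int, error: str) -> None:
    """Return a failed item to the queue, keeping the error for diagnosis."""
    conn.execute(
        "UPDATE work_items SET status = 'pending', last_error = ? "
        "WHERE id = ? AND status = 'in_progress'",
        (error, item_id),
    )
    conn.commit()


def counts(conn: sqlite3.Connection) -> dict:
    """Per-status totals, usable as a simple progress signal."""
    return dict(
        conn.execute("SELECT status, COUNT(*) FROM work_items GROUP BY status")
    )
```

Because state lives in the table rather than in the worker, an interrupted run can pick up from whatever is still `pending`. With several concurrent workers on PostgreSQL, the claim step is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED` so that two workers never take the same row; whether the original migrator did so is not stated here.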
One important architectural decision was not sending migration traffic directly to standard Keycloak endpoints.
Instead, we developed a custom Keycloak extension that acted as a controlled ingestion layer between the migrator and Keycloak itself.
This served multiple purposes:
From the migrator’s point of view, this extension was a single, stable contract. From Keycloak’s point of view, it allowed migration to be handled as a first-class internal process, not as external API noise.
When we started the migration, throughput was significantly lower than expected, at roughly 2 million records per hour.
At the same time, system signals were misleading:
At one point, even Keycloak’s metrics endpoint became intermittently unresponsive under load.
This was a critical moment. We could have increased parallelism and pushed harder, but that would have meant flying blind.
Instead, we slowed down and treated the migration itself as a system to be debugged.
What we discovered was a classic distributed systems lesson:
More parallelism does not mean more throughput.
The system was under stress not because of a single bottleneck, but because of compound pressure across layers:
Our response was not one big change, but a series of deliberate, measured adjustments:
Each change made the system calmer, and paradoxically, faster.
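One way to picture "calmer" is a hard cap on how many requests are in flight at once. The Python sketch below is a hypothetical illustration of that idea, not the adjustments actually made: `migrate_with_cap`, `InFlightGauge`, and the `send` callable are invented names, and the right value for `max_in_flight` has to be found by measurement.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class InFlightGauge:
    """Tracks how many calls are running at once and the peak observed."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self) -> "InFlightGauge":
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info) -> None:
        with self._lock:
            self.current -= 1


def migrate_with_cap(
    items: Iterable[T],
    send: Callable[[T], R],
    max_in_flight: int,
    gauge: InFlightGauge,
) -> List[R]:
    """Send every item, never running more than `max_in_flight` calls at once."""

    def guarded(item: T) -> R:
        with gauge:
            return send(item)

    # The pool size is the concurrency cap. Note that map() queues all items
    # up front, so a real migrator would feed it in bounded batches.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(guarded, items))
```

Running the same workload at a few different caps and comparing records per hour against the gauge's peak is a simple way to find the point where extra parallelism stops paying off.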
After these adjustments, migration throughput increased to well over 10 million consistent records, with:
More importantly, the system remained stable and observable throughout the process.
For us, this was the real success metric.
This migration reinforced a core belief behind Keymate:
IAM migration is not a data problem. It's a systems problem.
That's why Keymate is designed not just to authenticate users, but to:
If your IAM system cannot migrate safely, it probably won’t operate safely at scale either.
Large IAM migrations will always be challenging, but they don't have to be risky.
With the right architecture and clear boundaries, even migrations involving tens of millions of identities become predictable and uneventful, and in critical infrastructure projects, uneventful is exactly what you want.
In an upcoming post, we'll share the low-level engineering lessons behind this migration, including what actually slowed Keycloak down, why fewer connections made things faster, and how database behavior shaped the outcome.
This migration was only possible because of the IAM foundation we describe in our Baseline-First series:
Planning a large-scale IAM migration? Learn how Keymate helps teams migrate safely without downtime.