In every horizontally scalable database—Spanner, CockroachDB, YugabyteDB, TiDB, and their descendants—a primary key quietly serves two masters. It is a logical identifier, the application's stable handle for a specific entity. And it is a physical address, the value that determines which node in the cluster stores the row. These two roles are in fundamental tension.
The tension is visible within a single table. A Documents table keyed by DocumentId—a UUID chosen for global uniqueness—scatters documents uniformly across ranges, because range-partitioned databases assign rows to nodes based on key order. But if most transactions access documents belonging to the same tenant, the physical layout is exactly wrong: rows that are always read together are distributed as far apart as the key space allows. The logical choice (UUID for uniqueness) directly conflicts with the physical need (shared prefix for locality). A developer who keys by (TenantId, DocumentId) instead gets intra-table locality but gives up the ability to look up a document by its globally unique ID without a secondary index.
The problem compounds across tables. Consider a multi-tenant SaaS application with tables for Tenants, Users, Documents, and AccessLogs. A transaction that creates a new document checks the tenant's quota, records the author, inserts the document, and appends an access log entry. These four rows live in four tables with four independent key spaces, which means four potentially independent range assignments on four different nodes. A single logical operation becomes a distributed transaction coordinating across multiple nodes, paying network round trips proportional to the number of distinct nodes involved and a multi-node commit protocol at the end. At scale, this overhead dominates: higher tail latencies, lower throughput, more aborts under contention.
Distributed databases have long offered partial remedies for both problems. Spanner's interleaved tables co-locate child rows with their parent by sharing a key prefix. CockroachDB provides zone configurations and table-level locality settings. YugabyteDB supports colocated tables. These work well when access patterns are strictly hierarchical: every transaction walks from parent to child and back. But real applications have many-to-many relationships, peer-to-peer access patterns, and workloads that evolve over time. Interleaving constrains schema design, penalizes independent scans of child tables, and cannot express the cross-table affinities that dominate complex applications. Zone-level and table-level colocation are too coarse to arrange individual rows for optimal locality.
The result is that developers spend enormous effort on key design, denormalization, and application-level sharding, all manual labor aimed at solving a placement problem that the database itself could, in principle, solve automatically. The database has access to the actual workload, the real data distribution, and the live cluster topology. It observes which rows are accessed together in each transaction. It knows, in real time, things that the developer could only guess at schema design time.
What if we let the database use that knowledge?
The idea is deceptively simple. Introduce a new column type—call it UNSTABLE_ROW_ID—whose defining property is that the database may change its value at any time. The database assigns the initial value when a row is inserted. The application can read it within a transaction but must never persist it, cache it, or treat it as durable. It is ephemeral by contract.
This gives the database a degree of freedom it has never had: the ability to move rows by changing their primary keys. If it observes that rows from tables Documents and Tenants are frequently accessed in the same transaction, it can change the UNSTABLE_ROW_ID of one or both rows so that they fall in the same range on the same node. If a range becomes a hotspot, it can scatter rows by assigning new IDs. If a foreign key column is declared as UNSTABLE_ROW_ID and references another table's unstable primary key, the database updates the foreign key atomically whenever the referenced row moves. No application code changes. No foreign key constraint is ever violated.
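As a concrete illustration of that new degree of freedom, here is a minimal sketch in Python, assuming range partitioning over string keys; the toy range map and the helper names (RANGE_MAP, node_for_key, colocating_unstable_id) are hypothetical, not part of any real engine.

    # Illustrative model only: ranges are contiguous key intervals assigned to
    # nodes, and an unstable ID is free to take any value, so minting a new ID
    # inside the anchor row's range moves the row onto the anchor's node.
    import bisect
    import uuid

    # Hypothetical range map: sorted list of (range_start_key, node) pairs.
    RANGE_MAP = [("a", "node-1"), ("g", "node-2"), ("p", "node-3")]

    def node_for_key(key: str) -> str:
        """Range partitioning: a key belongs to the last range whose start <= key."""
        starts = [start for start, _ in RANGE_MAP]
        idx = bisect.bisect_right(starts, key) - 1
        return RANGE_MAP[max(idx, 0)][1]

    def colocating_unstable_id(anchor_key: str) -> str:
        """Mint a new unstable ID in the same range as anchor_key: reuse the
        anchor's prefix and append fresh randomness for uniqueness."""
        return f"{anchor_key}:{uuid.uuid4()}"

    tenant_key = "p:tenant-7"                   # existing Tenants row
    old_doc_key = "c:" + str(uuid.uuid4())      # document initially landed elsewhere
    new_doc_key = colocating_unstable_id(tenant_key)

    assert node_for_key(old_doc_key) != node_for_key(tenant_key)
    assert node_for_key(new_doc_key) == node_for_key(tenant_key)  # now co-located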
This is an opportunistic, best-effort version of the colocation that interleaving provides, without the rigid hierarchy.
But as a user-facing feature, explicit UNSTABLE_ROW_ID columns have a serious ergonomic problem. They require the developer to reason about a new kind of identifier with unfamiliar semantics, to remember which IDs are stable and which are not, and to never accidentally leak an unstable ID into an API response, a log, or a session variable. For advanced users building performance-critical systems, this level of control may be valuable—for instance, reading the unstable ID within a transaction to implement custom routing logic, or controlling which columns participate in placement decisions. It should exist as an escape hatch for such cases. But for the common case, there is a much better interface.
Hidden Unstable Identifiers: The Real Proposal
Exposing unstable identifiers creates a class of bugs that are intermittent, hard to reproduce, and violate reasonable developer expectations. If the database is managing row placement autonomously, the mechanism should be invisible. The developer should not interact with unstable identifiers and, ideally, should not know they exist.
Consider a table declaration like this:
CREATE TABLE Documents (
DocumentId STRING(36),
TenantId STRING(36),
AuthorId STRING(36),
Content BYTES(MAX),
CreatedAt TIMESTAMP,
) PRIMARY KEY (DocumentId),
AUTO_PLACEMENT;

The AUTO_PLACEMENT keyword tells the database: "I want the semantics of a normal table with a normal primary key, but I am willing to let you manage the physical placement of rows independently of my key choice." Behind the scenes, the database transforms this into:
Internal schema:
_unstable_id UNSTABLE_ROW_ID (hidden, actual primary key)
DocumentId STRING(36)
TenantId STRING(36)
AuthorId STRING(36)
Content BYTES(MAX)
CreatedAt TIMESTAMP
PRIMARY KEY (_unstable_id)
UNIQUE INDEX _idx_documentid (DocumentId) -- hidden, enforces declared PK
The application's declared primary key (DocumentId) becomes a unique secondary index. The actual physical primary key is a hidden unstable identifier. Queries filtering on DocumentId transparently use the hidden unique index. Foreign keys from other tables that reference Documents(DocumentId) continue to work at the SQL level; internally, the database adds a hidden UNSTABLE_ROW_ID column to the referencing table for each such constraint, maintaining a direct physical pointer that is updated atomically whenever the referenced row moves.
From the application's perspective, nothing has changed. DocumentId is still the primary key. It is stable, meaningful, and safe for API responses, URLs, and logs. The application writes the same queries, uses the same joins, and observes the same transactional guarantees. Schema changes—adding columns, altering types, creating new indexes—work exactly as they would on a normal table, because the hidden unstable ID is an implementation detail of the storage layer, not a user-visible schema element. Change streams and change data capture emit events keyed by the declared primary key columns, not the hidden unstable ID; migrations that only change the internal identifier produce no externally visible change events, nor do updates to hidden foreign key columns in referencing tables. Backup and restore preserve the hidden IDs to maintain internal consistency; after a restore, the migration logic simply reconverges to the current workload.
A secondary benefit is that AUTO_PLACEMENT decouples the question of whether a table has a primary key from how its rows are physically organized. An AUTO_PLACEMENT table without an explicit primary key is perfectly coherent—the database assigns hidden unstable IDs as internal row identifiers and manages placement autonomously.
There is an obvious objection. Converting the declared primary key into a secondary index means that every point lookup by that key now requires two reads: one to the index to retrieve the hidden unstable ID, and one to the base table to retrieve the row. In systems like Spanner, where secondary indexes are stored separately, these two reads may hit different splits on different nodes. For a table accessed primarily by single-row primary key lookups, this is a real concern.
But the concern is substantially mitigated by initial placement. The same heuristics that drive background migration also inform the choice of unstable ID at insert time. For a brand-new table with no traffic data, structural signals bootstrap the system: the presence of a declared primary key is itself a strong signal that lookups by that key will be common, and the database uses this to assign unstable IDs that place rows near their corresponding unique index entries from the start. This structural reasoning extends beyond the primary key—foreign key declarations, secondary index definitions, and column types all provide schema-level hints about likely access patterns before any traffic arrives. A table with a foreign key to Tenants will very likely be accessed in transactions that also touch Tenants. A covering index on (CustomerId, OrderDate) signals range scans by customer. The system doesn't need traffic data to make reasonable initial decisions; it needs traffic data to make optimal ones.
Beyond initial placement, the hidden unique index entry and the base table row are always accessed together—every lookup through the index reads both. The migration logic observes this perfect co-access pattern and, over time, corrects any rows that have drifted out of alignment. The two reads collapse into two reads on the same node, which in many storage engines means two reads in the same process with no network hop.
More generally, for tables that are primarily accessed through a different secondary index—looking up orders by customer ID rather than order ID—the base rows gravitate toward that index instead. The migration logic tracks which access paths carry the most traffic and places rows accordingly.
For workloads dominated by single-table point lookups on the declared primary key with no multi-table transactions, AUTO_PLACEMENT is the wrong tool—the unique index overhead is paid without any cross-table benefit to offset it. The feature targets tables involved in multi-table transactions, where the coordination cost across nodes exceeds the cost of an additional local index lookup.
The mechanism operates continuously in three phases: observing co-access patterns, validating candidate migrations against sampled traces, and executing the moves themselves.
The database samples transaction execution to record which rows from which tables are accessed together. Even a modest sample rate provides a statistically useful signal about which ranges and nodes participate in the same transaction. The key metric is the average number of distinct nodes per transaction, or equivalently, the average cross-node coordination cost.
The sampling itself can be biased to maximize the information value of each sample. Transactions that involve more nodes than average are more likely to represent optimization opportunities and can be sampled at higher rates. Queries targeting splits or tables with known poor locality can similarly be oversampled. This concentrates the observation budget on the transactions that matter most for placement decisions. It also makes the system self-focusing: as migrations improve placement and the number of multi-node transactions decreases, the biased sampling naturally shifts toward the remaining problem areas without any explicit mechanism to deprioritize already-optimized regions.
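A minimal sketch of what such biased sampling could look like, assuming each committed transaction reports the set of nodes it touched; the constants and the moving-average update are illustrative, not tuned values.

    import random

    BASE_RATE = 0.01   # baseline sampling probability
    MAX_RATE = 0.5     # cap so sampling never dominates
    avg_nodes = 1.0    # running estimate of nodes per transaction

    def maybe_sample(nodes_touched: set) -> bool:
        """Sample more aggressively the more a transaction exceeds the average fan-out."""
        global avg_nodes
        n = len(nodes_touched)
        avg_nodes = 0.99 * avg_nodes + 0.01 * n  # exponential moving average
        # Transactions touching more nodes than average carry more placement signal.
        rate = min(MAX_RATE, BASE_RATE * max(1.0, n / avg_nodes))
        return random.random() < rate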
Additional signals beyond sampled traces include declared foreign key relationships (strong static hints about co-access), query execution traces from the database's tracing infrastructure, and the current split-to-node mapping. The sampling and signal collection operate at multiple granularities. Coarse-grained patterns emerge quickly: within minutes, the system might observe that 60% of transactions touching Documents also touch Tenants, which is enough to inform split-to-node assignment—prefer placing Documents ranges on nodes that also host Tenants ranges. Fine-grained per-row signals accumulate over longer horizons: over days, the system learns that doc-42 specifically is almost always accessed with tenant-7, enabling a targeted migration of that individual row.
Given observed access patterns, the database identifies candidate migrations: specific rows whose unstable IDs, if changed, would reduce cross-node coordination. The critical design question is how to avoid oscillation, wasted work, and unpredictable performance changes from naive heuristic-driven optimization. The answer is empirical validation using the same sampled traces that drive observation.
For a candidate migration, the system performs a routing simulation: given a recorded trace (table Documents, key doc-42, read; table Tenants, key tenant-7, read; ...) and the proposed new range assignment for doc-42, compute how many nodes would have been involved. If the simulated traces consistently show fewer nodes per transaction, and the improvement exceeds a configurable significance threshold, the migration proceeds.
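A sketch of the routing simulation, under the assumption that a trace is simply the list of (table, key) pairs a transaction touched and that key-to-node lookup is available as a function; the names and the five percent threshold are illustrative.

    from typing import Callable

    Trace = list[tuple[str, str]]   # (table, key) pairs accessed by one transaction

    def nodes_in_transaction(trace: Trace, locate: Callable[[str, str], str]) -> int:
        """How many distinct nodes this transaction would touch under `locate`."""
        return len({locate(table, key) for table, key in trace})

    def candidate_is_beneficial(traces: list[Trace],
                                current: Callable[[str, str], str],
                                proposed: Callable[[str, str], str],
                                min_improvement: float = 0.05) -> bool:
        """Replay recorded traces against both assignments; accept the candidate
        only if the reduction in average nodes per transaction is significant."""
        before = sum(nodes_in_transaction(t, current) for t in traces) / len(traces)
        after = sum(nodes_in_transaction(t, proposed) for t in traces) / len(traces)
        return (before - after) / before >= min_improvement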
Because candidates may interact—migrating row A toward row B is beneficial only if row B hasn't itself been migrated elsewhere—the system evaluates and applies candidates incrementally, re-evaluating remaining candidates against the post-migration state or discarding and regenerating them as fresh trace data arrives, rather than evaluating a batch against a stale snapshot.
After performing a migration, the system continues simulating in reverse: for newly sampled traces, compute what would have happened if the migration had not occurred. If the post-migration traces show degradation—because the workload shifted, the candidate was part of a multimodal pattern, or the benefit was illusory—the migration is rolled back. This creates a feedback loop that converges toward genuinely beneficial placements and self-corrects when conditions change.
The robustness of this loop depends on calibration choices that are important for production deployability. The sample size question is a natural fit for sequential hypothesis testing, a statistical framework designed for "observe until confident" decisions that controls error rates without requiring a fixed sample size. The observation window question—how long to wait before rolling back, given that diurnal or weekly patterns may make a migration look temporarily harmful—has a natural solution in maintaining multiple windows at different time scales (minutes, hours, days) and requiring positive signal across all of them before acting or rolling back.
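One possible shape for the multi-window check, with illustrative window lengths, sample minimums, and thresholds rather than values from any production system.

    import time
    from collections import deque

    class WindowedBenefit:
        """Benefit samples (e.g. nodes saved per sampled transaction) kept over one horizon."""
        def __init__(self, horizon_seconds: float):
            self.horizon = horizon_seconds
            self.samples = deque()                  # (timestamp, nodes_saved) pairs

        def record(self, nodes_saved: float) -> None:
            now = time.time()
            self.samples.append((now, nodes_saved))
            while self.samples and now - self.samples[0][0] > self.horizon:
                self.samples.popleft()

        def mean(self):
            if len(self.samples) < 30:              # not enough evidence yet
                return None
            return sum(v for _, v in self.samples) / len(self.samples)

    # Minutes, hours, and days; a migration is judged harmful only when every
    # window has enough data and all of them agree.
    windows = [WindowedBenefit(h) for h in (600, 6 * 3600, 3 * 86400)]

    def should_roll_back() -> bool:
        means = [w.mean() for w in windows]
        return all(m is not None and m < 0 for m in means)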
The actual row movement is a transaction: read the row at its old key, write it at the new key, delete the old entry, and update any foreign key references in other tables. This transaction is atomic—no concurrent transaction can ever observe a broken foreign key or a dangling reference. If the migration transaction conflicts with a concurrent application transaction, it yields—losing contention resolution by policy—and either retries later or abandons the candidate.
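In pseudocode terms, against a hypothetical key-value transaction API (db.transaction, txn.read/write/delete, and the _hidden_fk column name are assumptions, not a real client library), the move might look like this.

    def migrate_row(db, table: str, old_key: str, new_key: str, referencing_fks) -> bool:
        """Atomically move a row to a new unstable key and repoint its references.
        referencing_fks lists (referencing_table, referencing_key) pairs whose
        hidden pointer column currently stores old_key."""
        try:
            with db.transaction(priority="low") as txn:   # yields to foreground work
                row = txn.read(table, old_key)
                if row is None:
                    return False                          # row moved or deleted meanwhile
                txn.write(table, new_key, row)            # write under the new key
                txn.delete(table, old_key)                # remove the old entry
                for ref_table, ref_key in referencing_fks:
                    ref_row = txn.read(ref_table, ref_key)
                    ref_row["_hidden_fk"] = new_key       # repoint the hidden pointer
                    txn.write(ref_table, ref_key, ref_row)
            return True
        except db.ConflictError:
            return False   # lost to a foreground transaction by policy; retry or abandon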
To minimize impact on foreground workload, migration transactions run at low priority: they yield CPU and I/O bandwidth to application work, use relaxed commit timing optimized for throughput rather than latency, and lose deadlock resolution against application transactions. Migrations are biased toward periods of low load, spreading the cost across off-peak windows.
Initial placement of new rows benefits from the same intelligence without waiting for background migration. When inserting a row into an AUTO_PLACEMENT table, the database uses all available signals—foreign key targets, the table's dominant access patterns, current split-to-node assignments, and structural schema information—to assign an unstable ID that provides good placement from the first transaction. If the table has a declared primary key, the unstable ID is chosen so the row lands near the corresponding unique index entry. If the table has foreign keys to other tables, the ID biases toward the referenced rows. If there are no anchors at all, the system picks a split that minimizes the expected number of nodes for the transactions most likely to touch this row.
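A sketch of that insert-time decision as a simple scoring pass over candidate splits; the anchor list, weights, and split names are illustrative stand-ins for foreign key targets, index entries, and the current split map.

    from collections import defaultdict

    def choose_split(candidate_splits, anchors, split_of) -> str:
        """anchors: (anchor_key, weight) pairs such as the row's unique index entry
        or the rows its foreign keys reference; split_of maps anchor keys to splits."""
        score = defaultdict(float)
        for anchor_key, weight in anchors:
            split = split_of.get(anchor_key)
            if split is not None:
                score[split] += weight
        # With no signal at all, fall back to the first candidate.
        return max(candidate_splits, key=lambda s: score[s]) if score else candidate_splits[0]

    # A new Documents row whose primary-key index entry sits on split-2 and whose
    # TenantId foreign key points at a tenant row on split-5.
    split = choose_split(
        ["split-1", "split-2", "split-5"],
        [("idx:doc-42", 0.5), ("tenants/tenant-7", 1.0)],
        {"idx:doc-42": "split-2", "tenants/tenant-7": "split-5"},
    )
    assert split == "split-5"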
Secondary indexes on stable columns—including the hidden unique index that enforces the declared primary key of an AUTO_PLACEMENT table—have entries whose positions are determined by application-visible values. The database cannot reorder them. They are, in effect, anchors in the optimization landscape: fixed points around which movable rows arrange themselves.
A Documents table with a unique index on DocumentId and a secondary index on TenantId has two sets of anchors. Each index lives in its own sorted key space, split into ranges assigned to nodes. The rows in the base table, keyed by hidden unstable IDs, are free to move. The migration logic positions each row to minimize the expected cross-node cost for the transactions that access it, weighted by frequency.
When a table has multiple secondary indexes pulling a row in different directions—an Orders table indexed by both CustomerId and ProductId—the row cannot be close to both index entries if they reside on different nodes. The system places the row to minimize total weighted cost. Suppose order ord-99 has its CustomerId index entry on node 3 and its ProductId index entry on node 7. If 80% of transactions accessing ord-99 arrive through the CustomerId index and 20% through ProductId, the migration logic places the base row on node 3. The 20% of transactions arriving through ProductId pay a hop from node 7 to node 3, but the 80% arriving through CustomerId read locally. The weighted average cost is minimized.
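The same decision written out as a small sketch; the traffic shares and node names come from the example above, and the function name is illustrative.

    def best_node(index_traffic: dict, index_node: dict) -> str:
        """index_traffic: fraction of this row's accesses arriving via each index;
        index_node: node hosting that index's entry for this row. Place the base
        row on the node that minimizes the expected cross-node hops."""
        candidates = set(index_node.values())
        def expected_hops(candidate: str) -> float:
            return sum(share for idx, share in index_traffic.items()
                       if index_node[idx] != candidate)
        return min(candidates, key=expected_hops)

    # ord-99: 80% of accesses arrive through CustomerId (entry on node 3),
    # 20% through ProductId (entry on node 7).
    assert best_node({"CustomerId": 0.8, "ProductId": 0.2},
                     {"CustomerId": "node-3", "ProductId": "node-7"}) == "node-3"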
This means that the less frequently used index will consistently incur cross-node lookups. If an operator runs a nightly batch analytics job scanning through the ProductId index, that job will be slower than it would be on a traditionally clustered table, and the migration logic will correctly decline to optimize for it because the nightly batch is vastly outnumbered by real-time CustomerId lookups. This is the right behavior, but it must be visible: the observability tooling should surface per-index cross-node rates and explain why a particular access path is slower—"this index's cross-node rate is 73% because placement is optimized for index CustomerId." Without this, the operator sees a slow batch job and has no way to understand the cause.
The split-to-node assignment algorithm can amplify these benefits further. By preferring assignments where a secondary index range and the base table ranges it most frequently references land on the same node, the system reduces cross-node reads without moving any rows. For AUTO_PLACEMENT tables, the system can also prefer finer-grained splits for secondary indexes, increasing assignment granularity and making it more likely that index entries and their corresponding base rows share a node.
AUTO_PLACEMENT tables still need stable values for their declared primary keys. When these are randomly generated—UUIDs, random integers—the values scatter across the key space of the hidden unique index. Since index entries are anchors that cannot be moved, a badly placed anchor forces the migration logic to choose between co-locating the base row with its own index entry and co-locating it with related rows from other tables. A well-placed anchor aligns these goals, reducing the tension the optimizer must resolve.
The database can help with a general-purpose function—AUTO_ID(table.column)—that generates an identifier whose type matches the specified column. When placement-relevant context is available—foreign key relationships, secondary index co-access patterns, current split-to-node assignments—the function biases the generated value to be lexicographically near keys that the database believes will be co-accessed. When inserting a Documents row for tenant tenant-7, AUTO_ID(Documents.DocumentId) returns a UUID whose prefix biases the new document's unique index entry toward the region where tenant-7's data is concentrated. The anchor starts life in a good position, reducing the work the migration logic must do later. Because the function takes the target column as an argument, it can be polymorphic—returning a UUID, an INT64, or a STRING as appropriate—and it has the context needed to make a meaningful placement decision.
When no context is available—a table with no traffic data, no foreign keys, and no secondary indexes—the function falls back to a uniformly random value, which is the same as what the developer would have generated without it. AUTO_ID never makes placement worse; at best it provides meaningful locality bias, and at worst it is neutral. Even when locality bias is applied, dedicating a small prefix of the identifier to placement leaves more than enough entropy in the remaining bits for uniqueness in any practical deployment.
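A sketch of such a generator for a UUID column, with the two-byte locality prefix and the hint format as illustrative assumptions; anything resembling a real AUTO_ID implementation would also consult the live placement metadata to pick the prefix.

    import secrets
    import uuid

    def auto_uuid(placement_prefix: bytes = None) -> uuid.UUID:
        """Generate a UUID-shaped identifier. With no hint, it is a plain random
        UUID; with a hint, the first bytes are borrowed from the hinted region so
        the index entry sorts near co-accessed keys, and the rest stays random."""
        if placement_prefix is None:
            return uuid.uuid4()
        prefix = placement_prefix[:2]                  # dedicate 2 of 16 bytes to locality
        body = secrets.token_bytes(16 - len(prefix))   # 112+ random bits remain for uniqueness
        return uuid.UUID(bytes=prefix + body)

    print(auto_uuid())                                 # no context: uniform random
    print(auto_uuid(placement_prefix=b"\x7a\x11"))     # sorts into the hinted region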
The function is useful beyond AUTO_PLACEMENT tables. For traditionally clustered tables where the developer has chosen a random key for distribution purposes, AUTO_ID provides a middle ground: the key is stable and distributed enough to avoid hotspots, but biased toward a region that improves locality. This is weaker than full AUTO_PLACEMENT—the key never changes—but it costs nothing at runtime and requires no schema changes.
A practical concern arises when a row in an AUTO_PLACEMENT table is referenced by a large number of foreign keys in other tables. Moving such a row requires atomically updating every referencing foreign key, which may involve thousands or millions of rows and exceed normal transaction size limits.
The preferred approach depends on the scale of the fan-out. For moderate fan-out (thousands of references), the simplest strategy is to relax the mutation limit for system-initiated migrations, since these are internal operations with well-understood semantics. The migration completes atomically in a single transaction. For larger fan-out, the migration can proceed in phases: the old key becomes a forwarding pointer that redirects lookups to the new key while referencing rows are updated in background batches. Forwarding pointers add read-time cost—every access through the old key pays a redirect—but for moderate-duration migrations this cost is transient; the system should prefer this approach only when the expected locality benefit clearly exceeds the temporary redirect overhead. For very large fan-out, where even a phased migration of the referenced row would be prohibitively expensive, the optimization runs in reverse: the referencing rows migrate toward the heavily-referenced row instead. This last approach is almost always preferred in extreme cases because it distributes the migration cost across many small, cheap operations rather than concentrating it in one enormous one.
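The choice between these strategies could be as simple as a threshold check on the observed fan-out; the cutoffs below are illustrative, not recommendations.

    def migration_strategy(reference_count: int,
                           single_txn_limit: int = 20_000,
                           phased_limit: int = 1_000_000) -> str:
        """Pick how to move a heavily referenced row based on its fan-out."""
        if reference_count <= single_txn_limit:
            return "single-transaction"    # relax the mutation limit, move atomically
        if reference_count <= phased_limit:
            return "forwarding-pointer"    # phased move; old key redirects meanwhile
        return "reverse-migration"         # move the many referencing rows instead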
The observation infrastructure built for migration—sampling transactions, tracking co-access patterns, measuring cross-node costs—can identify a pattern that migration alone cannot optimally serve: reference data that is read frequently from many nodes but updated rarely. Dimension tables, configuration tables, feature flag catalogs—these are accessed from transactions originating across the cluster, and no single placement is optimal because every node needs them.
For such data, the right strategy is not to move rows but to replicate them. The database could maintain, on each node that would benefit, a materialized subset of the hottest rows that node specifically needs (and possibly only the columns most frequently read), functioning as a locally served read cache. A transaction that hits the local cache avoids a network hop for that read. Not every node receives a copy of every cached row—each node caches only the rows its transactions actually access, keeping the replication footprint proportional to actual demand rather than cluster size.
The consistency model for such a cache requires care. Before a write to a cached row can proceed, the writing transaction first invalidates the row's cache entry on every node that holds a copy, forcing all subsequent transactions on those nodes to read from the authoritative location. Transactions that previously read from the cache check at commit time whether any of their cached reads have been invalidated; if so, the transaction aborts and retries with a remote read. This ensures that no committed transaction ever depends on stale cached data, preserving the database's consistency guarantees.
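A sketch of that protocol with caches, copy holders, and the commit-time check modeled in-process; in a real engine these would be per-node structures coordinated over RPCs.

    class CachedRow:
        def __init__(self, value):
            self.value = value
            self.valid = True

    class NodeCache:
        """Per-node materialized subset of hot, rarely written rows."""
        def __init__(self):
            self.entries = {}

        def read(self, key):
            entry = self.entries.get(key)
            return entry if entry and entry.valid else None   # miss if invalidated

    def write_row(key, new_value, authoritative_store, holder_caches):
        # Step 1: invalidate every cached copy before the write proceeds, so no
        # later read on those nodes can observe the stale value.
        for cache in holder_caches:
            if key in cache.entries:
                cache.entries[key].valid = False
        # Step 2: apply the write at the authoritative location.
        authoritative_store[key] = new_value

    def commit_ok(cached_reads) -> bool:
        # A reading transaction re-checks its cached reads at commit time; any
        # invalidated entry forces an abort and a retry with a remote read.
        return all(entry.valid for entry in cached_reads)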
The cost is that writes to cached rows become more expensive: the writing transaction's latency increases proportionally to the number of nodes holding copies, because it must wait for invalidation acknowledgments before proceeding. Read latency for other transactions is unaffected—they either hit a valid cache or are transparently redirected to the authoritative copy. But the rows selected for caching are, by definition, those with extremely skewed read-to-write ratios—configuration data read thousands of times per second and updated once a day, dimension tables that change during nightly ETL jobs. For such data, the rare write paying extra latency is a negligible cost against the aggregate savings of eliminating thousands of network hops per second on reads.
The decision framework is the same as for migration—simulate, validate, promote or demote—applied to a different point on the read/write spectrum.
Automatic optimization of physical data layout is only deployable in production if operators can understand what the system is doing and why. This is not a mitigation for a downside—it is a hard prerequisite for the feature to be trusted.
When a query becomes slower, the first question an operator asks is "what changed?" If the answer involves background row migrations, the operator must be able to discover this quickly. The system needs detailed, queryable logs of every migration: which row, from which key to which key, when, and what evidence justified the decision. It needs attribution tooling that connects a performance change to a specific migration or set of migrations. It needs the ability to preview proposed migrations, to pause migration for specific tables, and to pin specific rows or ranges against movement. Per-index cross-node rates, as discussed in the secondary index section, are one example of the kind of derived metric that makes the system's placement decisions legible.
Aggregate metrics should expose the current value of the optimization target (average nodes per transaction), the volume and outcome of migrations (beneficial, neutral, rolled back), and the estimated benefit of pending candidates. If the beneficial-to-wasted ratio drops or the optimization metric plateaus, the system should adaptively reduce migration aggressiveness, and operators should be alerted.
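One simple way to implement that backoff is to scale a migration budget by the recent beneficial-to-wasted ratio; the constants here are illustrative.

    def next_migration_budget(current_budget: int, beneficial: int, wasted: int,
                              target_ratio: float = 4.0) -> int:
        """Shrink the per-interval migration budget when too much recent work was
        wasted (rolled back or obsolete); recover slowly when migrations pay off."""
        if beneficial + wasted == 0:
            return current_budget
        ratio = beneficial / max(wasted, 1)
        if ratio < target_ratio:
            return max(1, current_budget // 2)   # back off quickly when churn is high
        return current_budget + 1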
Two costs are inherent to the approach and must be paid regardless of workload.
Write amplification. Every migration is a write: delete old key, insert new key, update foreign keys. If access patterns change rapidly, the database might perform migrations that are obsolete before they complete. The throttling mechanism and empirical validation filter out marginally beneficial candidates, but some wasted work is unavoidable. The system should track the ratio of beneficial to wasted migrations and reduce aggressiveness when the ratio is poor.
Implementation complexity. Hidden columns, automatic index creation, atomic multi-key changes, background scheduling, routing simulation, and the feedback loop all add substantial complexity to the storage engine. This is the strongest argument for introducing the feature incrementally: AUTO_PLACEMENT tables with hidden unstable IDs are a clean interface boundary behind which the machinery can be built and hardened independently.
Two categories of workload are better served by existing mechanisms.
Stable hierarchical access patterns. For workloads with a clear, stable parent-child access pattern, interleaved tables remain the better choice. They provide guaranteed colocation with no background processing, no write amplification, and no convergence delay. AUTO_PLACEMENT is for the cases where interleaving's constraints are too rigid—many-to-many relationships, multiple access patterns, evolving workloads—not a replacement for it.
Single-table point lookup workloads. For tables accessed almost exclusively by single-row lookups on the declared primary key, with few or no multi-table transactions, the unique index overhead is a net loss. These tables should remain traditionally clustered.
For tables in between—primarily accessed by key lookups but occasionally part of multi-table transactions—the decision is less clear-cut and may depend on the specific ratio of single-table to multi-table operations. Observability tooling can help here: per-table metrics on cross-node transaction participation can inform whether enabling AUTO_PLACEMENT would be beneficial, and the system could surface recommendations based on observed traffic patterns.
The argument for hidden unstable identifiers is not specific to any single database. It applies to every system that range-partitions data by primary key and distributes ranges across nodes—virtually every horizontally scalable database built in the last fifteen years.
Today, developers accept as a fact of life that choosing a primary key is a physical layout decision with lasting performance consequences, that normalized schemas and performant schemas are often in conflict, and that changing access patterns may require schema redesigns. AUTO_PLACEMENT tables challenge each of these assumptions. They propose that the database—which continuously observes the actual workload—is better positioned to make placement decisions than a developer writing a schema months before the workload materializes.
This is not unprecedented. The same argument justified cost-based query optimizers over hint-driven execution, buffer pool management over application-controlled caching, and automatic statistics collection over manual ANALYZE. In each case, the database took on a responsibility that had previously required manual effort and domain-specific guesswork, and the result was better performance for the common case with manual overrides for the exceptional one.
But row placement differs from these precedents in an important respect: it involves durable physical changes, not per-query decisions. A bad query plan is forgotten when the query finishes; a bad row migration persists until it is reversed. This is precisely why the empirical validation and rollback mechanism is not optional but central to the design. It transforms what would otherwise be a fragile heuristic into a self-correcting feedback system that converges on beneficial placements and retreats from harmful ones. The system earns trust not by being right on every decision, but by detecting and undoing its mistakes faster than they accumulate.
The primitive is a hidden, database-managed row identifier that decouples logical identity from physical location. The mechanism is continuous, empirically validated, incremental migration with low-priority scheduling and automatic rollback. The interface is a single keyword on a CREATE TABLE statement. Databases that adopt this approach will offer their users something that careful key engineering alone struggles to provide when workloads evolve, shift, and surprise: continuous, automatic adaptation of physical data layout to the actual workload as it exists today, not as it was predicted months or years ago. Those that do not will continue to ask their users to solve, at schema design time, an optimization problem that is dynamic, multidimensional, and fundamentally unknowable in advance.