Module tiered

Two-tier storage backend with size-based routing and redirect tombstones.

TieredStorage routes objects to a high-volume or long-term backend based on size and maintains redirect tombstones so that reads never need to probe both backends. See the crate-level documentation for the high-level motivation, and the TieredStorage struct docs for routing and tombstone semantics.

§Cross-Tier Consistency

A single logical object may span both backends: a tombstone in the high-volume (HV) backend pointing to a payload in the long-term (LT) backend. Mutations keep the two in sync through compare-and-swap on the high-volume backend (see HighVolumeBackend::compare_and_write). Each operation reads the current HV revision, performs its work, then atomically swaps the HV entry only if the revision is still current, rolling back on conflict.
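The compare-and-swap idea can be sketched with an in-memory map standing in for the HV backend. The names `HvEntry`, `Hv`, and `compare_and_swap` are hypothetical and chosen for illustration; this is not the signature of HighVolumeBackend::compare_and_write, and the idempotency handling described under Idempotency is omitted here.

```rust
use std::collections::HashMap;

/// Hypothetical HV entry: inline data, or a tombstone naming an LT revision key.
#[derive(Clone, Debug, PartialEq)]
enum HvEntry {
    Inline(Vec<u8>),
    Tombstone(String),
}

/// In-memory stand-in for the high-volume backend (an assumption for illustration).
#[derive(Default)]
struct Hv {
    rows: HashMap<String, HvEntry>,
}

impl Hv {
    /// Replace the entry at `key` with `new` only if it still equals `expected`.
    /// `None` means "absent". Returns whether the swap was applied.
    fn compare_and_swap(
        &mut self,
        key: &str,
        expected: Option<&HvEntry>,
        new: Option<HvEntry>,
    ) -> bool {
        if self.rows.get(key) != expected {
            // The entry changed since the caller read it; the caller rolls back.
            return false;
        }
        match new {
            Some(entry) => {
                self.rows.insert(key.to_string(), entry);
            }
            None => {
                self.rows.remove(key);
            }
        }
        true
    }
}
```

A caller that observed "absent" and tries to install a tombstone succeeds only if the key is still absent; a second caller holding the same stale observation gets `false` and discards its own work.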

§Revision Keys

Every large-object write stores its payload at a revision key in the long-term backend: {original_key}/{uuid}. The UUID suffix is random (no monotonicity is guaranteed), so each write targets a distinct LT path regardless of whether another write to the same logical key is in progress. The tombstone in HV then points to this specific revision. Because each writer owns its own LT blob, the compare-and-swap on the tombstone becomes an atomic pointer swap: the winner’s revision is committed and the loser can safely delete its own blob without affecting the winner.

See new_long_term_revision for the key construction.
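As a rough illustration of the `{original_key}/{uuid}` layout, the sketch below builds a revision key from a logical key and a unique suffix. Both `revision_key` and `random_suffix` are hypothetical helpers, not the real new_long_term_revision; in particular, `random_suffix` is a standard-library stand-in that produces 128 randomly keyed bits as hex, not a real UUID.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Join the logical key and a unique suffix as `{original_key}/{suffix}`.
fn revision_key(original_key: &str, suffix: &str) -> String {
    format!("{original_key}/{suffix}")
}

/// Stand-in for a random UUID (an assumption for illustration only).
/// Each `RandomState` carries fresh random keys, so hashing nothing with
/// two of them yields two unpredictable 64-bit values.
fn random_suffix() -> String {
    let a = RandomState::new().build_hasher().finish();
    let b = RandomState::new().build_hasher().finish();
    format!("{a:016x}{b:016x}")
}
```

Two writers targeting the same logical key would call `revision_key(key, &random_suffix())` independently and, with overwhelming probability, end up at distinct LT paths, which is what lets the loser delete its own blob without touching the winner's.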

§Compare-and-Swap

All mutating operations follow a common pattern: read the current revision, perform the upload, atomically swap the revision (the commit point), and clean up the now-unreferenced LT blob in the background.

§Large-Object Write (> 1 MiB)

  1. Read HV to capture the current revision (existing tombstone target, or absent).
  2. Write payload to LT at a unique revision key.
  3. Compare-and-swap in HV: write a tombstone pointing to the new revision, only if the current revision still matches step 1.
    • OK — schedule background deletion of the old LT blob, if any.
    • Conflict — another writer won the race; schedule background deletion of our new LT blob.
    • Error — reload the tombstone and delete the unreferenced blob or blobs.
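The steps above can be sketched against in-memory maps. This is a minimal model under stated assumptions: `Store`, `Pending`, `stage_large`, and `commit_large` are hypothetical names, background deletion is modeled as a `cleanup` queue, and the Error branch (an ambiguous CAS failure) is not modeled. The write is split in two so that a competing writer can be interleaved between steps 2 and 3.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum HvEntry {
    Inline(Vec<u8>),
    Tombstone(String), // LT revision key
}

/// In-memory stand-in for both backends plus a background-deletion queue.
#[derive(Default)]
struct Store {
    hv: HashMap<String, HvEntry>,
    lt: HashMap<String, Vec<u8>>,
    cleanup: Vec<String>, // LT keys scheduled for deletion
}

/// State carried from staging to commit.
struct Pending {
    key: String,
    observed: Option<HvEntry>, // HV entry seen in step 1 (None = absent)
    new_rev: String,           // LT revision key written in step 2
}

impl Store {
    /// Steps 1 and 2: capture the current HV entry, then upload to a unique LT key.
    fn stage_large(&mut self, key: &str, suffix: &str, payload: Vec<u8>) -> Pending {
        let observed = self.hv.get(key).cloned();
        let new_rev = format!("{key}/{suffix}");
        self.lt.insert(new_rev.clone(), payload);
        Pending { key: key.to_string(), observed, new_rev }
    }

    /// Step 3: install the tombstone only if HV still matches step 1.
    /// Returns true if this writer's revision was committed.
    fn commit_large(&mut self, p: Pending) -> bool {
        if self.hv.get(&p.key) != p.observed.as_ref() {
            // Conflict: another writer won; discard our own blob.
            self.cleanup.push(p.new_rev);
            return false;
        }
        self.hv.insert(p.key, HvEntry::Tombstone(p.new_rev));
        if let Some(HvEntry::Tombstone(old)) = p.observed {
            // OK: the previous revision is now unreferenced.
            self.cleanup.push(old);
        }
        true
    }
}
```

If two writers stage revisions `k/a` and `k/b` from the same observation and the first commits, the second commit returns `false` and queues `k/b` for deletion, leaving the tombstone pointing at `k/a`.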

§Small-Object Write (≤ 1 MiB)

  1. Write inline to HV, skipping the write if a tombstone is present.
    • OK — done; the object is stored entirely in HV.
    • Tombstone present — a large object already occupies this key; continue:
  2. Compare-and-swap in HV: replace the tombstone with inline data, only if the tombstone’s revision still matches.
    • OK — schedule background deletion of the old LT blob.
    • Conflict — another writer won the race; they will clean up the LT blob and we have no new LT blob to clean up.
    • Error — reload the tombstone and delete the unreferenced blob if the write went through.
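A sketch of the small-object path, again over an in-memory map. The names `Store`, `put_inline_unless_tombstone`, `replace_tombstone`, and `write_small` are hypothetical, the `cleanup` queue stands in for background deletion, and the Error branch is not modeled.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum HvEntry {
    Inline(Vec<u8>),
    Tombstone(String), // LT revision key
}

#[derive(Default)]
struct Store {
    hv: HashMap<String, HvEntry>,
    cleanup: Vec<String>, // LT keys scheduled for deletion
}

impl Store {
    /// Step 1: write inline unless a tombstone occupies the key.
    /// Returns the revision of the tombstone that blocked the write, if any.
    fn put_inline_unless_tombstone(&mut self, key: &str, data: &[u8]) -> Option<String> {
        if let Some(HvEntry::Tombstone(rev)) = self.hv.get(key) {
            return Some(rev.clone());
        }
        self.hv.insert(key.to_string(), HvEntry::Inline(data.to_vec()));
        None
    }

    /// Step 2: replace the tombstone with inline data if its revision still matches.
    fn replace_tombstone(&mut self, key: &str, observed_rev: &str, data: &[u8]) -> bool {
        match self.hv.get(key) {
            Some(HvEntry::Tombstone(rev)) if rev.as_str() == observed_rev => {}
            // Conflict: the winner cleans up; we created no LT blob of our own.
            _ => return false,
        }
        self.hv.insert(key.to_string(), HvEntry::Inline(data.to_vec()));
        self.cleanup.push(observed_rev.to_string());
        true
    }

    /// Both steps together; a lost race in step 2 is not reported as an error.
    fn write_small(&mut self, key: &str, data: &[u8]) {
        if let Some(rev) = self.put_inline_unless_tombstone(key, data) {
            let _ = self.replace_tombstone(key, &rev, data);
        }
    }
}
```

Writing to a fresh key finishes in step 1 with nothing queued for cleanup, while writing over a tombstone swaps in the inline data and queues the old LT revision.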

§Delete

  1. Delete from HV if the entry is not a tombstone.
    • OK — done; there is no LT data to clean up.
    • Tombstone present — a large object is stored here; continue:
  2. Compare-and-swap in HV: remove the tombstone, only if its revision still matches.
    • OK — schedule background deletion of the LT blob.
    • Conflict — another writer won the race; they will clean up.
    • Error — reload the tombstone and delete the unreferenced blob if the write went through.

Tombstone removal is the commit point for deletes. If the subsequent LT cleanup fails, an orphan blob remains but the object is already unreachable through the normal read path.
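The delete flow and its commit point can be modeled the same way. In this sketch `Store`, `delete_unless_tombstone`, `remove_tombstone`, `delete`, and `get` are hypothetical names, the `cleanup` queue stands in for background deletion, and the Error branch is not modeled. The point it illustrates is that once the tombstone is gone, the object is unreachable even though the LT blob still exists until cleanup runs.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
enum HvEntry {
    Inline(Vec<u8>),
    Tombstone(String), // LT revision key
}

#[derive(Default)]
struct Store {
    hv: HashMap<String, HvEntry>,
    lt: HashMap<String, Vec<u8>>,
    cleanup: Vec<String>, // LT keys scheduled for deletion
}

impl Store {
    /// Step 1: delete from HV unless the entry is a tombstone.
    /// Returns the revision of the tombstone that blocked the delete, if any.
    fn delete_unless_tombstone(&mut self, key: &str) -> Option<String> {
        if let Some(HvEntry::Tombstone(rev)) = self.hv.get(key) {
            return Some(rev.clone());
        }
        self.hv.remove(key);
        None
    }

    /// Step 2: remove the tombstone only if its revision still matches.
    fn remove_tombstone(&mut self, key: &str, observed_rev: &str) -> bool {
        match self.hv.get(key) {
            Some(HvEntry::Tombstone(rev)) if rev.as_str() == observed_rev => {}
            _ => return false, // Conflict: the winner cleans up.
        }
        self.hv.remove(key); // commit point
        self.cleanup.push(observed_rev.to_string());
        true
    }

    fn delete(&mut self, key: &str) {
        if let Some(rev) = self.delete_unless_tombstone(key) {
            let _ = self.remove_tombstone(key, &rev);
        }
    }

    /// Read path: inline data, or follow the tombstone into LT.
    fn get(&self, key: &str) -> Option<&Vec<u8>> {
        match self.hv.get(key)? {
            HvEntry::Inline(data) => Some(data),
            HvEntry::Tombstone(rev) => self.lt.get(rev),
        }
    }
}
```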

§Last-Writer-Wins

Concurrent mutations on the same key are inherently a race. Even a write that returns Ok may be immediately overwritten by another caller — there is no ordering guarantee and objectstore cannot provide a read-your-writes promise.

CAS conflicts are therefore not errors: the losing writer’s data is cleaned up and Ok is returned, because the result is indistinguishable from having succeeded a moment earlier and then been overwritten.
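One way to picture this mapping from CAS outcome to caller-visible result is the sketch below. `CasOutcome`, `BackendError`, and `finish_write` are hypothetical names, and the recovery described for the Error branch (reloading the tombstone) is reduced here to propagating the error.

```rust
/// Hypothetical result of the compare-and-swap step.
#[derive(Debug, PartialEq)]
enum CasOutcome {
    Applied,
    Conflict,
}

/// Hypothetical backend failure.
#[derive(Debug, PartialEq)]
struct BackendError(String);

/// Decide what to clean up and what to report after a large-object CAS.
/// Both `Applied` and `Conflict` map to `Ok(())`; only a backend failure
/// surfaces as an error.
fn finish_write(
    cas: Result<CasOutcome, BackendError>,
    new_blob: &str,
    old_blob: Option<&str>,
    cleanup: &mut Vec<String>,
) -> Result<(), BackendError> {
    match cas? {
        // We won: the previous revision, if any, is now unreferenced.
        CasOutcome::Applied => cleanup.extend(old_blob.map(str::to_string)),
        // We lost: our own blob is unreferenced, but the caller still sees Ok.
        CasOutcome::Conflict => cleanup.push(new_blob.to_string()),
    }
    Ok(())
}
```

From the caller's point of view the two successful branches are indistinguishable, which matches the argument that losing the race is equivalent to winning it and being overwritten immediately afterwards.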

§Idempotency

compare_and_write is idempotent: if the row is already in the target state, it returns true without re-applying the mutation. This is critical for retry safety. If the server commits a write but the response is lost, a retry sees the already-mutated state and still returns true — so callers do not mistakenly treat a successful commit as a lost race and clean up data that was actually persisted.
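The retry behavior can be illustrated with a simplified model in which each row holds only a revision string. The type `Hv` and this `compare_and_write` signature are assumptions made for the sketch, not the actual HighVolumeBackend::compare_and_write API.

```rust
use std::collections::HashMap;

/// In-memory stand-in: key -> current revision (a missing key means "absent").
#[derive(Default)]
struct Hv {
    rows: HashMap<String, String>,
}

impl Hv {
    /// Move `key` from `expected` to `target`, where `None` means "absent".
    /// Returns true if the row is in the target state afterwards because of
    /// this call or an earlier identical one.
    fn compare_and_write(
        &mut self,
        key: &str,
        expected: Option<&str>,
        target: Option<&str>,
    ) -> bool {
        let current = self.rows.get(key).map(String::as_str);
        if current == target {
            // Already in the target state: treat as a retry of a committed write.
            return true;
        }
        if current != expected {
            // Neither the expected nor the target state: a genuine conflict.
            return false;
        }
        match target {
            Some(rev) => {
                self.rows.insert(key.to_string(), rev.to_string());
            }
            None => {
                self.rows.remove(key);
            }
        }
        true
    }
}
```

Calling `compare_and_write("k", None, Some("k/rev-1"))` twice returns `true` both times, so a writer whose first response was lost does not conclude that it lost the race and delete `k/rev-1`. A different writer targeting `k/rev-2` from the same stale expectation still gets `false`.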

§Structs

TieredStorage
Two-tier storage backend that routes objects by size.
TieredStorageConfig
Configuration for TieredStorage.