chunks - the content-addressed data plane

chunks are the atomic unit of storage in roost. every file in a version is a list of chunk digests; every chunk is an immutable blob of bytes keyed by its sha-256 hash. this doc is the end-to-end contract for chunking, hash format, storage layout, upload flow, download flow, cross-roost mount, referrer lookup, retry behavior, and error taxonomy.

Last updated: 2026-05-01
Status: normative for all roost v2 uploads, downloads, gc, and rollback paths.

versions.md - version publish schema; chunks are embedded in version.files[].chunks[] as { hash, size }.
web/openapi.yaml - rendered route reference.
quickstart.md - first public API smoke workflow before you move on to roost publishing.

1. chunk shape

fixed size: every chunk is exactly 4 MiB (4,194,304 bytes) except the last chunk of each file, which may be 1..4,194,304 bytes.
algorithm: fixed-size chunking hashed with sha-256 only in v2. content-defined chunking (fastcdc / rabin) is explicitly deferred to v3.
hash encoding: lowercase hex, exactly 64 chars, matching ^[0-9a-f]{64}$.
wire format: every public chunk endpoint accepts and returns the bare 64-hex digest. do not add an algorithm prefix in request bodies, query strings, or path parameters.
version format: the same bare 64-hex digest is stored in version.files[].chunks[].hash.
zero-byte files have chunks: [] and size: 0. the empty chunk is never stored in r2.
ordering: chunks within a file are ordered by file offset. concatenation in order reproduces the file exactly; no gaps, no overlaps.

file bytes  --------------------------------------------------------------->
            |-- chunk[0] --|-- chunk[1] --|-- chunk[2] --|-- chunk[3] --|
            |  4,194,304   |  4,194,304   |  4,194,304   |   812,345    |
            |  64-hex hash |  64-hex hash |  64-hex hash |  64-hex hash |
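
the layout rules above can be sketched as a small offset calculator. this is an illustrative helper, not part of the roost api; the name chunk_spans is hypothetical.

```python
from typing import Iterator, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB (4,194,304 bytes)


def chunk_spans(file_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) for each chunk of a file that is `file_size` bytes long.

    Every chunk is CHUNK_SIZE bytes except the last, which holds whatever
    remains. A zero-byte file yields nothing.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    offset = 0
    while offset < file_size:
        size = min(CHUNK_SIZE, file_size - offset)
        yield offset, size
        offset += size


# the file from the diagram: three full chunks plus an 812,345-byte tail
spans = list(chunk_spans(3 * CHUNK_SIZE + 812_345))
```

because spans are contiguous and ordered by offset, concatenating the chunks in order covers the file with no gaps and no overlaps.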

2. storage layout

chunks are stored in cloudflare r2 at a per-site, sharded, content-addressed key:

project-content/{siteId}/{hash[0:2]}/{hash}

segment	source	purpose
`project-content`	fixed bucket prefix	separates chunks from version bodies
`{siteId}`	owlette site that owns the chunk	tenant isolation
`{hash[0:2]}`	first two hex chars of the chunk hash	avoids hot prefixes in r2's keyspace
`{hash}`	full 64-hex sha-256	the content address

example:

project-content/kiosk-fleet-01/4e/4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce

chunk paths never appear inside the version. the agent reconstructs them from siteId and hash. third-party clients normally do not construct them at all; they receive signed urls that embed the full path.
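
for clients that do need the key (for example when debugging), a minimal sketch of the reconstruction follows. the function name chunk_key is hypothetical; the key template and the digest pattern are the ones given in this doc.

```python
import re

# bare lowercase 64-hex digest, as required by the chunk shape rules
HASH_RE = re.compile(r"^[0-9a-f]{64}$")


def chunk_key(site_id: str, chunk_hash: str) -> str:
    """Rebuild the r2 object key for a chunk from its site and digest."""
    if not HASH_RE.fullmatch(chunk_hash):
        raise ValueError("chunk hash must be lowercase 64-char hex")
    if not site_id:
        raise ValueError("site_id is required")
    shard = chunk_hash[:2]  # first two hex chars form the shard segment
    return f"project-content/{site_id}/{shard}/{chunk_hash}"
```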

3. per-site isolation

chunks are scoped to the siteId they were uploaded under. two different sites cannot dedup against each other's content, even if they independently upload byte-identical files. this is enforced at the storage-key level ({siteId} is part of the r2 path) and at the api level.

the client never asks "does roost have this chunk?". it asks "does site X have this chunk?". every chunk-plane endpoint takes siteId in the body or query string. a missing or malformed siteId is validation_failed.

4. dedup flow (upload)

the chunker never ships bytes the site already has. the upload dance is a three-step round trip per batch:

POST /api/chunks/check finds missing hashes for a site.
POST /api/chunks/upload-urls mints signed r2 PUT urls for those missing hashes.
the client uploads each chunk directly to r2.

step 1 - batch existence check

POST /api/chunks/check
content-type: application/json

{
  "siteId": "kiosk-fleet-01",
  "hashes": [
    "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
    "18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4",
    "4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce"
  ]
}

scope: site:<siteId>:write.
up to 1000 hashes per call; larger batches are rejected with validation_failed.
response lists only the digests missing from the site's cas namespace. everything else is already reusable.

{
  "missing": [
    "4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce"
  ]
}
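
since each call accepts at most 1000 hashes, a client with a larger file set has to split its digests into batches. a minimal sketch of building the request bodies, assuming the hypothetical helper name check_bodies; dropping duplicate digests before batching is an assumption made here to save round trips, not documented server behavior.

```python
from typing import Iterator, List

MAX_HASHES_PER_CALL = 1000  # documented per-call cap


def check_bodies(site_id: str, hashes: List[str]) -> Iterator[dict]:
    """Yield JSON-ready bodies for POST /api/chunks/check, <= 1000 hashes each."""
    # drop repeated digests while keeping first-seen order (client-side choice)
    unique = list(dict.fromkeys(hashes))
    for start in range(0, len(unique), MAX_HASHES_PER_CALL):
        yield {
            "siteId": site_id,
            "hashes": unique[start:start + MAX_HASHES_PER_CALL],
        }
```

the union of the missing arrays across all batches is the set to pass on to step 2.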

step 2 - mint signed put urls for the missing set

POST /api/chunks/upload-urls
content-type: application/json

{
  "siteId": "kiosk-fleet-01",
  "hashes": [
    "4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce"
  ]
}

scope: site:<siteId>:write.
signed urls carry a 60-minute ttl. the expiry is returned as expiresAt in rfc 3339 utc form.
urls are per-hash. one request can mint many urls at once, capped at 1000 hashes.
this route does not implement a response cache. retry by requesting fresh signed urls for the hashes that still need uploading.

{
  "urls": {
    "4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce": "https://owlette-prod.r2.cloudflarestorage.com/project-content/kiosk-fleet-01/4e/4e0740856...?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=..."
  },
  "expiresAt": "2026-04-22T16:30:00.000Z"
}
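
a client may want to decide whether a minted url is still worth using before starting a PUT. the sketch below parses expiresAt and applies a safety margin; the helper names and the 30-second margin are assumptions. r2 remains the enforcer of expiry, so this check is only a hint that avoids obviously doomed requests.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_expires_at(value: str) -> datetime:
    """Parse an rfc 3339 utc timestamp such as 2026-04-22T16:30:00.000Z."""
    # datetime.fromisoformat only accepts a trailing "Z" from python 3.11 on,
    # so rewrite it as an explicit utc offset for older interpreters
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def is_usable(
    expires_at: str,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(seconds=30),
) -> bool:
    """Return True if the url should still be valid `margin` from now."""
    current = now if now is not None else datetime.now(timezone.utc)
    return current + margin < parse_expires_at(expires_at)
```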

step 3 - put the bytes directly to r2

PUT <signed url>
content-type: application/octet-stream

<exactly size bytes of chunk content>

bytes travel directly from the client to r2 with no roost intermediary.
retry policy is per-chunk. if the PUT fails, retry that single chunk. if the signed url has expired, re-mint only that hash via POST /api/chunks/upload-urls.
clients should upload chunks concurrently. 8-16 parallel PUTs is typical.
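
one way to sketch the concurrent, per-chunk upload loop is a thread pool that reports which hashes failed so only those are retried or re-minted. upload_all is a hypothetical name, and put_chunk is a caller-supplied function (hash, signed url) that performs the actual PUT with whatever http client is in use and raises on failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def upload_all(
    urls: Dict[str, str],
    put_chunk: Callable[[str, str], None],
    max_workers: int = 8,
) -> List[str]:
    """Call put_chunk(hash, url) for every entry; return the hashes that raised."""
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map each future back to its hash so failures stay per-chunk
        futures = {pool.submit(put_chunk, h, url): h for h, url in urls.items()}
        for future in as_completed(futures):
            if future.exception() is not None:
                failed.append(futures[future])
    return failed
```

the returned list can be fed back into POST /api/chunks/upload-urls when the failure was an expired url, or retried against the same url otherwise.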

5. download flow

POST /api/chunks/download-urls
content-type: application/json

{
  "siteId": "kiosk-fleet-01",
  "hashes": [
    "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
    "18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4"
  ]
}

for small batches, use the GET form with repeated hash query parameters:

GET /api/chunks/download-urls?siteId=kiosk-fleet-01&hash=2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6&hash=18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4
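
the GET form can be built with the standard library by passing the parameters as a list of pairs, so that hash repeats once per digest. the function name download_urls_path is hypothetical.

```python
from typing import List
from urllib.parse import urlencode


def download_urls_path(site_id: str, hashes: List[str]) -> str:
    """Build the path and query string for GET /api/chunks/download-urls."""
    # a list of pairs keeps order and allows the repeated `hash` key
    params = [("siteId", site_id)] + [("hash", h) for h in hashes]
    return "/api/chunks/download-urls?" + urlencode(params)
```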

scope: agent token for the same site, or site:<siteId>:read.
signed urls carry a 15-minute ttl.
cross-site requests are rejected by the site auth and scope checks. there is no out-of-band way to resolve a digest to another tenant's storage key.

{
  "urls": {
    "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6": "https://owlette-prod.r2.cloudflarestorage.com/project-content/kiosk-fleet-01/2e/2e7d2c03...?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
    "18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4": "https://owlette-prod.r2.cloudflarestorage.com/project-content/kiosk-fleet-01/18/18ac3e73...?X-Amz-Algorithm=AWS4-HMAC-SHA256&..."
  },
  "expiresAt": "2026-04-22T15:45:00.000Z"
}

clients then issue GET against each signed url. the agent re-verifies each chunk's sha-256 as it writes to the local content store; a mismatch aborts the sync and the chunk is fetched again.
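
third-party clients can apply the same check the agent does. a minimal sketch, assuming hypothetical names ChunkMismatch and verify_chunk; it only covers the comparison, not the refetch policy.

```python
import hashlib


class ChunkMismatch(Exception):
    """Raised when downloaded bytes do not hash to the requested digest."""


def verify_chunk(expected_hash: str, data: bytes) -> bytes:
    """Return `data` unchanged if its sha-256 matches, else raise ChunkMismatch."""
    actual = hashlib.sha256(data).hexdigest()  # lowercase hex, same wire format
    if actual != expected_hash:
        raise ChunkMismatch(f"expected {expected_hash}, got {actual}")
    return data
```

calling verify_chunk before writing to the local store keeps corrupted bytes from ever being recorded under a digest they do not match.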

6. cross-roost mount

POST /api/chunks/{digest}/mount records a cross-roost chunk reference without moving bytes. {digest} is the bare 64-hex chunk hash.

POST /api/chunks/2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6/mount
content-type: application/json

{
  "siteId": "kiosk-fleet-01",
  "from": "roost_lobby_td",
  "to": "roost_lobby_td_v2"
}

scope: site:<siteId>:write.
siteId, from, and to can be supplied in the JSON body. query parameters with the same names are also accepted.
from and to must be different roost ids in the same site.
the chunk must already exist under project-content/{siteId}/... and both roost documents must exist.
bytes moved: zero. firestore records or updates sites/{siteId}/chunk_referrers/{digest}/entries/mount_{from}_{to}.
retry behavior: the same (digest, siteId, from, to) upserts the same referrer entry; it is retry-safe but not backed by a request replay cache.
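
the retry safety comes from the referrer entry path being a pure function of the request inputs. a sketch of that derivation, using the path template quoted above; the function name mount_entry_path is hypothetical, and this is a client-side illustration rather than the server's code.

```python
def mount_entry_path(site_id: str, digest: str, from_roost: str, to_roost: str) -> str:
    """Derive the firestore referrer entry path for a mount request."""
    if from_roost == to_roost:
        raise ValueError("from and to must be different roost ids")
    entry_id = f"mount_{from_roost}_{to_roost}"
    return f"sites/{site_id}/chunk_referrers/{digest}/entries/{entry_id}"
```

two requests with the same inputs resolve to the same entry, so a retried mount updates one record instead of creating a second.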

successful mounts return 201:

{
  "digest": "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
  "siteId": "kiosk-fleet-01",
  "from": "roost_lobby_td",
  "to": "roost_lobby_td_v2",
  "mounted": true,
  "zeroByte": true
}

7. referrer query

GET /api/chunks/{digest}/referrers returns recorded mount and version-publish referrers for a chunk.

GET /api/chunks/2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6/referrers?siteId=kiosk-fleet-01&page_size=25

{
  "digest": "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
  "siteId": "kiosk-fleet-01",
  "referrers": [
    {
      "entryId": "mount_roost_lobby_td_roost_lobby_td_v2",
      "source": "mount",
      "roostId": null,
      "fromRoostId": "roost_lobby_td",
      "toRoostId": "roost_lobby_td_v2",
      "versionId": null,
      "versionNumber": null,
      "fileCount": null,
      "pathCount": null,
      "totalBytes": null,
      "referencedAt": "2026-04-22T15:30:00.000Z",
      "createdAt": null,
      "createdBy": null,
      "mountedAt": "2026-04-22T15:30:00.000Z",
      "mountedBy": "user_123"
    }
  ],
  "items": [
    {
      "entryId": "mount_roost_lobby_td_roost_lobby_td_v2",
      "source": "mount",
      "roostId": null,
      "fromRoostId": "roost_lobby_td",
      "toRoostId": "roost_lobby_td_v2",
      "versionId": null,
      "versionNumber": null,
      "fileCount": null,
      "pathCount": null,
      "totalBytes": null,
      "referencedAt": "2026-04-22T15:30:00.000Z",
      "createdAt": null,
      "createdBy": null,
      "mountedAt": "2026-04-22T15:30:00.000Z",
      "mountedBy": "user_123"
    }
  ],
  "next_page_token": "",
  "nextPageToken": ""
}

canonical pagination parameters are page_size and page_token; legacy aliases limit and cursor are also accepted.
default page size is 50 and max page size is 200.
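
a paging loop over the referrer list might look like the sketch below. fetch_page is a caller-supplied function (page_size, page_token) that performs the GET and returns the decoded JSON; treating an empty next_page_token as the end of the list is an assumption based on the example response above.

```python
from typing import Callable, Iterator, Optional

MAX_PAGE_SIZE = 200  # documented maximum


def iter_referrers(
    fetch_page: Callable[[int, Optional[str]], dict],
    page_size: int = 50,
) -> Iterator[dict]:
    """Yield every referrer entry, following next_page_token until it is empty."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError("page_size must be between 1 and 200")
    token: Optional[str] = None
    while True:
        page = fetch_page(page_size, token)
        yield from page.get("referrers", [])
        token = page.get("next_page_token") or None
        if token is None:  # assumed end-of-list marker
            return
```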

8. retry behavior

operation	retry behavior
`POST /api/chunks/check`	safe to retry; it recomputes missing hashes from r2 metadata.
`POST /api/chunks/upload-urls`	safe to retry by minting fresh signed urls. the route does not cache request replays.
`PUT <signed r2 url>`	safe to retry with the same bytes while the signed url is valid; otherwise mint a new url.
`POST` or `GET /api/chunks/download-urls`	safe to retry; it mints fresh signed download urls.
`POST /api/chunks/{digest}/mount`	retry-safe for the same `(digest, siteId, from, to)` because the referrer entry id is deterministic.

9. reference chunker - pseudocode

both snippets below stream-read a file in fixed 4-MiB blocks, compute sha-256 per block, and yield (hash, size, offset). neither loads the whole file into memory; at most a chunk or two is buffered at a time.

python (cpython 3.9+)

import hashlib
from pathlib import Path
from typing import Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB

def chunk_file(path: Path) -> Iterator[dict]:
    """
    stream-read `path` in 4 MiB blocks; yield {"hash": hex, "size": int, "offset": int}
    for each chunk. the last chunk may be 1..CHUNK_SIZE bytes. a zero-byte file yields
    no chunks.
    """
    offset = 0
    with path.open("rb") as f:
        while True:
            block = f.read(CHUNK_SIZE)
            if not block:
                return
            digest = hashlib.sha256(block).hexdigest()
            yield {"hash": digest, "size": len(block), "offset": offset}
            offset += len(block)


def file_to_version_entry(path: Path, rel_path: str) -> dict:
    chunks = list(chunk_file(path))
    return {
        "path": rel_path,
        "size": sum(c["size"] for c in chunks),
        "chunks": [{"hash": c["hash"], "size": c["size"]} for c in chunks],
    }

node (20+)

import { createReadStream } from 'node:fs';
import { createHash } from 'node:crypto';

const CHUNK_SIZE = 4 * 1024 * 1024; // 4 MiB

/**
 * stream-read `path` in 4 MiB blocks; yield { hash, size, offset } per chunk.
 * last chunk may be 1..CHUNK_SIZE bytes. zero-byte files yield nothing.
 */
export async function* chunkFile(path) {
  let offset = 0;
  let pending = Buffer.alloc(0);

  for await (const buf of createReadStream(path, { highWaterMark: CHUNK_SIZE })) {
    pending = pending.length === 0 ? buf : Buffer.concat([pending, buf]);

    while (pending.length >= CHUNK_SIZE) {
      const block = pending.subarray(0, CHUNK_SIZE);
      pending = pending.subarray(CHUNK_SIZE);
      const hash = createHash('sha256').update(block).digest('hex');
      yield { hash, size: block.length, offset };
      offset += block.length;
    }
  }

  if (pending.length > 0) {
    const hash = createHash('sha256').update(pending).digest('hex');
    yield { hash, size: pending.length, offset };
  }
}

export async function fileToVersionEntry(path, relPath) {
  const chunks = [];
  for await (const c of chunkFile(path)) {
    chunks.push({ hash: c.hash, size: c.size });
  }
  return {
    path: relPath,
    size: chunks.reduce((n, c) => n + c.size, 0),
    chunks,
  };
}

notes applicable to both implementations:

read buffers must be at least 4 MiB, or the reader must accumulate smaller reads until it has a full chunk.
the final chunk is whatever remains at eof.
hashing and i/o must not be interleaved with other writers on the same file; take a read lock or copy the file first.
the version's files[].size should equal sum(chunks[].size). current agents reject mismatches when parsing versions, but the publish API does not currently enforce this invariant.
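
because the publish api does not enforce the size invariant, a client can check it locally before publishing. a minimal sketch, assuming the hypothetical name validate_entry and the entry shape produced by the reference chunker; it also checks the fixed-size rule from the chunk shape section.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB


def validate_entry(entry: dict) -> None:
    """Raise ValueError if a version file entry breaks the size rules."""
    chunks = entry["chunks"]
    if entry["size"] != sum(c["size"] for c in chunks):
        raise ValueError("files[].size does not equal sum(chunks[].size)")
    last_index = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        if i < last_index and chunk["size"] != CHUNK_SIZE:
            raise ValueError(f"chunk {i} is not exactly 4 MiB")
        if i == last_index and not 1 <= chunk["size"] <= CHUNK_SIZE:
            raise ValueError("last chunk must be 1..4,194,304 bytes")
```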

10. error taxonomy

every error response follows rfc 7807 application/problem+json with standard extensions (code, docsUrl, requestId, and optional field-level errors). the code field is the stable contract; match on that, not on detail prose.

code	status	endpoints	meaning
`validation_failed`	400	all chunk routes	malformed bare digest, missing or invalid `siteId`, empty `hashes[]`, more than 1000 hashes, invalid `from` / `to`, `from` equal to `to`, or invalid pagination.
`not_found`	404	auth/site guard, `POST /api/chunks/{digest}/mount`	site not found or not accessible; mount digest is not stored for the site; source or target roost does not exist.
`precondition_failed`	412	chunk-dependent publish/finalize paths	a version references chunk hashes that are not present in the site's r2 namespace. upload the missing chunks and retry the publish.

universal errors (unauthorized 401, token_expired 401, scope_insufficient 403, rate_limited 429, internal_error 500) are documented in the top-level api conventions and are not repeated here.

example validation_failed for a malformed digest:

{
  "type": "https://owlette.app/problems/validation-failed",
  "title": "validation failed",
  "status": 400,
  "detail": "field hashes contains malformed hash entries (must be lowercase 64-char hex sha-256)",
  "code": "validation_failed",
  "errors": {
    "hashes": [
      "malformed entries: bad-digest..."
    ]
  },
  "docsUrl": "https://owlette.app/docs/api/errors#validation_failed",
  "requestId": "req_01HW..."
}
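
matching on code rather than detail can be sketched as a small dispatcher over the decoded problem document. the returned action labels are hypothetical and exist only for illustration; the code values are the ones listed in this section.

```python
import json


def next_action(problem_json: str) -> str:
    """Map a problem+json body to a coarse client action, keyed on `code`."""
    problem = json.loads(problem_json)
    code = problem.get("code")
    if code == "validation_failed":
        return "fix_request"            # do not retry unchanged
    if code == "precondition_failed":
        return "upload_missing_chunks"  # then retry the publish
    if code == "not_found":
        return "check_site_or_roost"
    if code in ("rate_limited", "internal_error"):
        return "retry_later"
    return "unknown"
```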

11. operational notes

signed-url expiry is enforced by r2, not roost. clients should treat expiresAt from the api response as the authoritative deadline rather than deriving one from a locally assumed ttl.
retries during the upload session should respect Retry-After on 429 and use exponential backoff on 5xx from r2.
partial uploads are not representable. r2 PUT is atomic per object; retry the whole chunk.
a chunk with zero referrers across all roosts in a site becomes eligible for deletion after the gc grace period. during that window, a mount can still resurrect it.
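
the retry guidance above can be sketched as a delay calculator. the name retry_delay, the 0.5-second base, and the 30-second cap are assumptions; only the numeric-seconds form of Retry-After is handled, and production clients usually add jitter.

```python
from typing import Optional


def retry_delay(
    attempt: int,
    status: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
) -> Optional[float]:
    """Return seconds to wait before retry number `attempt` (0-based), or None.

    None means the status is not one this sketch treats as retryable.
    """
    if status == 429 and retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # http-date form: fall through to exponential backoff
    if status == 429 or 500 <= status <= 599:
        return min(cap, base * (2 ** attempt))
    return None
```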
