chunks - the content-addressed data plane
chunks are the atomic unit of storage in roost. every file in a version is a list of chunk digests; every chunk is an immutable blob of bytes keyed by its sha-256 hash. this doc is the end-to-end contract for chunking, hash format, storage layout, upload flow, download flow, cross-roost mount, referrer lookup, retry behavior, and error taxonomy.
Last updated: 2026-05-01
Status: normative for all roost v2 uploads, downloads, gc, and rollback paths.
related:
- versions.md - version publish schema; chunks are embedded in version.files[].chunks[] as { hash, size }.
- web/openapi.yaml - rendered route reference.
- quickstart.md - first public API smoke workflow before you move on to roost publishing.
1. chunk shape
- fixed size: every chunk is exactly 4 MiB (4,194,304 bytes) except the last chunk of each file, which may be 1..4,194,304 bytes.
- algorithm: sha-256 only in v1. content-defined chunking (fastcdc / rabin) is explicitly deferred to v3.
- hash encoding: lowercase hex, exactly 64 chars, matching ^[0-9a-f]{64}$.
- wire format: every public chunk endpoint accepts and returns the bare 64-hex digest. do not add an algorithm prefix in request bodies, query strings, or path parameters.
- version format: the same bare 64-hex digest is stored in version.files[].chunks[].hash.
- zero-byte files have chunks: [] and size: 0. the empty chunk is never stored in r2.
- ordering: chunks within a file are ordered by file offset. concatenation in order reproduces the file exactly; no gaps, no overlaps.
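the shape rules above can be checked mechanically. the following is a minimal sketch, not roost code: the helper names is_valid_digest and expected_chunk_sizes are hypothetical and exist only for illustration.

```python
import re
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, the fixed chunk size described above
DIGEST_RE = re.compile(r"[0-9a-f]{64}")


def is_valid_digest(value: str) -> bool:
    """True only for a bare, lowercase, 64-char hex digest (no algorithm prefix)."""
    # fullmatch is stricter than ^...$ because it also rejects a trailing newline.
    return DIGEST_RE.fullmatch(value) is not None


def expected_chunk_sizes(file_size: int) -> List[int]:
    """Chunk sizes, in file-offset order, for a file of `file_size` bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, tail = divmod(file_size, CHUNK_SIZE)
    sizes = [CHUNK_SIZE] * full
    if tail:
        sizes.append(tail)  # the last chunk may be 1..CHUNK_SIZE bytes
    return sizes  # a zero-byte file yields an empty list
```

for a 13,395,257-byte file this yields three full chunks followed by one 812,345-byte chunk.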
file bytes --------------------------------------------------------------->
|-- chunk[0] --|-- chunk[1] --|-- chunk[2] --|-- chunk[3] --|
| 4,194,304 | 4,194,304 | 4,194,304 | 812,345 |
| 64-hex hash | 64-hex hash | 64-hex hash | 64-hex hash |
2. storage layout
chunks are stored in cloudflare r2 at a per-site, sharded, content-addressed key:
project-content/{siteId}/{hash[0:2]}/{hash}
| segment | source | purpose |
|---|---|---|
| project-content | fixed bucket prefix | separates chunks from version bodies |
| {siteId} | owlette site that owns the chunk | tenant isolation |
| {hash[0:2]} | first two hex chars of the chunk hash | avoids hot prefixes in r2's keyspace |
| {hash} | full 64-hex sha-256 | the content address |
example:
project-content/kiosk-fleet-01/4e/4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce
chunk paths never appear inside the version. the agent reconstructs them from siteId and hash. third-party clients normally do not construct them at all; they receive signed urls that embed the full path.
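key reconstruction from siteId and hash can be sketched as below. the function name chunk_key is hypothetical, and the rule that a site id may not contain "/" is an assumption made here to keep the key well-formed, not a documented constraint.

```python
import re

DIGEST_RE = re.compile(r"[0-9a-f]{64}")


def chunk_key(site_id: str, chunk_hash: str) -> str:
    """Build the r2 object key for a chunk from its site id and bare digest."""
    # Assumption: a site id is a single, non-empty path segment.
    if not site_id or "/" in site_id:
        raise ValueError("site_id must be a non-empty path segment")
    if DIGEST_RE.fullmatch(chunk_hash) is None:
        raise ValueError("chunk_hash must be a bare lowercase 64-hex digest")
    shard = chunk_hash[:2]  # first two hex chars spread keys across prefixes
    return f"project-content/{site_id}/{shard}/{chunk_hash}"
```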
3. per-site isolation
chunks are scoped to the siteId they were uploaded under. two different sites cannot dedup against each other's content, even if they independently upload byte-identical files. this is enforced at the storage-key level ({siteId} is part of the r2 path) and at the api level.
the client never asks "does roost have this chunk?". it asks "does site X have this chunk?". every chunk-plane endpoint takes siteId in the body or query string. a missing or malformed siteId is validation_failed.
4. dedup flow (upload)
the chunker never ships bytes the server already has. the upload dance is a three-step round trip per batch:
1. POST /api/chunks/check finds missing hashes for a site.
2. POST /api/chunks/upload-urls mints signed r2 PUT urls for those missing hashes.
3. the client uploads each chunk directly to r2.
step 1 - batch existence check
POST /api/chunks/check
content-type: application/json
{
"siteId": "kiosk-fleet-01",
"hashes": [
"2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
"18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4",
"4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce"
]
}
- scope: site:<siteId>:write.
- up to 1000 hashes per call; larger batches are rejected with validation_failed.
- response lists only the digests missing from the site's cas namespace. everything else is already reusable.
{
"missing": [
"4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce"
]
}
step 2 - mint signed put urls for the missing set
POST /api/chunks/upload-urls
content-type: application/json
{
"siteId": "kiosk-fleet-01",
"hashes": [
"4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce"
]
}
- scope: site:<siteId>:write.
- signed urls carry a 60-minute ttl. the expiry is returned as expiresAt in rfc 3339 utc form.
- urls are per-hash. one request can mint many urls at once, capped at 1000 hashes.
- this route does not implement a response cache. retry by requesting fresh signed urls for the hashes that still need uploading.
{
"urls": {
"4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce": "https://owlette-prod.r2.cloudflarestorage.com/project-content/kiosk-fleet-01/4e/4e0740856...?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=..."
},
"expiresAt": "2026-04-22T16:30:00.000Z"
}
step 3 - put the bytes directly to r2
PUT <signed url>
content-type: application/octet-stream
<exactly size bytes of chunk content>
- bytes travel client to r2 with no roost intermediary.
- retry policy is per-chunk. if the PUT fails, retry that single chunk. if the signed url has expired, re-mint only that hash via POST /api/chunks/upload-urls.
- clients should upload chunks concurrently. 8-16 parallel PUTs is typical.
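the three steps can be tied together roughly as follows. this is a sketch under stated assumptions: the callables check, mint_urls, and put are hypothetical stand-ins for the two api calls and the r2 PUT, injected so the flow stays independent of any http library. uploads run sequentially here for clarity; a real client would likely run the PUTs concurrently.

```python
from typing import Callable, Dict, Iterator, List

MAX_HASHES_PER_CALL = 1000  # documented cap for check and upload-urls


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_missing(
    site_id: str,
    chunks: Dict[str, bytes],
    check: Callable[[str, List[str]], List[str]],
    mint_urls: Callable[[str, List[str]], Dict[str, str]],
    put: Callable[[str, bytes], None],
) -> List[str]:
    """Upload only the chunks the site lacks; return the hashes that were sent."""
    uploaded: List[str] = []
    for batch in batched(sorted(chunks), MAX_HASHES_PER_CALL):
        missing = check(site_id, batch)        # step 1: which hashes are absent?
        if not missing:
            continue                           # everything in this batch dedups
        urls = mint_urls(site_id, missing)     # step 2: signed PUT urls
        for chunk_hash in missing:             # step 3: bytes go straight to r2
            put(urls[chunk_hash], chunks[chunk_hash])
            uploaded.append(chunk_hash)
    return uploaded
```

injecting the transport also makes the dedup behavior easy to exercise with in-memory fakes.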
5. download flow
POST /api/chunks/download-urls
content-type: application/json
{
"siteId": "kiosk-fleet-01",
"hashes": [
"2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
"18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4"
]
}
for small batches, use the GET form with repeated hash query parameters:
GET /api/chunks/download-urls?siteId=kiosk-fleet-01&hash=2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6&hash=18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4
- scope: agent token for the same site, or site:<siteId>:read.
- signed urls carry a 15-minute ttl.
- cross-site requests are rejected by the site auth and scope checks. there is no out-of-band way to resolve a digest to another tenant's storage key.
{
"urls": {
"2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6": "https://owlette-prod.r2.cloudflarestorage.com/project-content/kiosk-fleet-01/2e/2e7d2c03...?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
"18ac3e7343f016890c510e93f935261169d9e3f565436429830faf0934f4f8e4": "https://owlette-prod.r2.cloudflarestorage.com/project-content/kiosk-fleet-01/18/18ac3e73...?X-Amz-Algorithm=AWS4-HMAC-SHA256&..."
},
"expiresAt": "2026-04-22T15:45:00.000Z"
}
clients then issue GET against each signed url. the agent re-verifies each chunk's sha-256 as it writes to the local content store; a mismatch aborts the sync and the chunk is fetched again.
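the verify-then-refetch behavior can be approximated as below. this is a sketch, not the agent's actual code: fetch is a hypothetical zero-argument callable that downloads the chunk bytes from its signed url, and the attempt count of 2 is an arbitrary choice.

```python
import hashlib
from typing import Callable


class ChunkMismatch(Exception):
    """Raised when downloaded bytes never hash to the expected digest."""


def fetch_verified(expected_hash: str, fetch: Callable[[], bytes], attempts: int = 2) -> bytes:
    """Download a chunk and return it only if its sha-256 matches `expected_hash`."""
    for _ in range(attempts):
        data = fetch()
        if hashlib.sha256(data).hexdigest() == expected_hash:
            return data  # safe to write into the local content store
    raise ChunkMismatch(f"chunk {expected_hash} failed verification {attempts} time(s)")
```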
6. cross-roost mount
POST /api/chunks/{digest}/mount records a cross-roost chunk reference without moving bytes. {digest} is the bare 64-hex chunk hash.
POST /api/chunks/2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6/mount
content-type: application/json
{
"siteId": "kiosk-fleet-01",
"from": "roost_lobby_td",
"to": "roost_lobby_td_v2"
}
- scope: site:<siteId>:write.
- siteId, from, and to can be supplied in the JSON body. query parameters with the same names are also accepted.
- from and to must be different roost ids in the same site.
- the chunk must already exist under project-content/{siteId}/... and both roost documents must exist.
- bytes moved: zero. firestore records or updates sites/{siteId}/chunk_referrers/{digest}/entries/mount_{from}_{to}.
- retry behavior: the same (digest, siteId, from, to) upserts the same referrer entry; it is retry-safe but not backed by a request replay cache.
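mounts are retry-safe because the referrer entry id is derived purely from the request. the derivation can be sketched as follows; the function names are hypothetical, and only the path template itself comes from this document.

```python
def mount_entry_id(from_roost: str, to_roost: str) -> str:
    """Deterministic referrer entry id for a mount between two roosts."""
    if not from_roost or not to_roost:
        raise ValueError("from and to must be non-empty roost ids")
    if from_roost == to_roost:
        raise ValueError("from and to must be different roost ids")
    return f"mount_{from_roost}_{to_roost}"


def mount_referrer_path(site_id: str, digest: str, from_roost: str, to_roost: str) -> str:
    """Firestore document path that a mount records or updates."""
    entry_id = mount_entry_id(from_roost, to_roost)
    return f"sites/{site_id}/chunk_referrers/{digest}/entries/{entry_id}"
```

repeating the same call produces the same path, so a retried mount overwrites its own entry instead of creating a duplicate.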
successful mounts return 201:
{
"digest": "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
"siteId": "kiosk-fleet-01",
"from": "roost_lobby_td",
"to": "roost_lobby_td_v2",
"mounted": true,
"zeroByte": true
}
7. referrer query
GET /api/chunks/{digest}/referrers returns recorded mount and version-publish referrers for a chunk.
GET /api/chunks/2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6/referrers?siteId=kiosk-fleet-01&page_size=25
{
"digest": "2e7d2c03a9507ae265ecf5b5356885a53393a2029d241394997265a1a25aefc6",
"siteId": "kiosk-fleet-01",
"referrers": [
{
"entryId": "mount_roost_lobby_td_roost_lobby_td_v2",
"source": "mount",
"roostId": null,
"fromRoostId": "roost_lobby_td",
"toRoostId": "roost_lobby_td_v2",
"versionId": null,
"versionNumber": null,
"fileCount": null,
"pathCount": null,
"totalBytes": null,
"referencedAt": "2026-04-22T15:30:00.000Z",
"createdAt": null,
"createdBy": null,
"mountedAt": "2026-04-22T15:30:00.000Z",
"mountedBy": "user_123"
}
],
"items": [
{
"entryId": "mount_roost_lobby_td_roost_lobby_td_v2",
"source": "mount",
"roostId": null,
"fromRoostId": "roost_lobby_td",
"toRoostId": "roost_lobby_td_v2",
"versionId": null,
"versionNumber": null,
"fileCount": null,
"pathCount": null,
"totalBytes": null,
"referencedAt": "2026-04-22T15:30:00.000Z",
"createdAt": null,
"createdBy": null,
"mountedAt": "2026-04-22T15:30:00.000Z",
"mountedBy": "user_123"
}
],
"next_page_token": "",
"nextPageToken": ""
}
- canonical pagination parameters are page_size and page_token; legacy aliases limit and cursor are also accepted.
- default page size is 50 and max page size is 200.
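a client-side paging loop might look like the sketch below. fetch_page is a hypothetical callable that takes the current page_token (None for the first page) and returns the decoded json body. treating an empty next_page_token as "no more pages" is an assumption based on the example response above, which shows an empty token on a single-page result.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_referrers(fetch_page: Callable[[Optional[str]], Dict]) -> Iterator[dict]:
    """Yield every referrer for a chunk, following page tokens until exhausted."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        # The response carries the same entries under "referrers" and "items".
        yield from page.get("referrers") or page.get("items") or []
        # Both spellings of the token appear in the response; accept either.
        token = page.get("next_page_token") or page.get("nextPageToken")
        if not token:
            return
```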
8. retry behavior
| operation | retry behavior |
|---|---|
| POST /api/chunks/check | safe to retry; it recomputes missing hashes from r2 metadata. |
| POST /api/chunks/upload-urls | safe to retry by minting fresh signed urls. the route does not cache request replays. |
| PUT <signed r2 url> | safe to retry with the same bytes while the signed url is valid; otherwise mint a new url. |
| POST or GET /api/chunks/download-urls | safe to retry; it mints fresh signed download urls. |
| POST /api/chunks/{digest}/mount | retry-safe for the same (digest, siteId, from, to) because the referrer entry id is deterministic. |
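the PUT row is the only one where a retry may need a new input (a fresh url). a sketch of that per-chunk loop follows. the exception types UrlExpired and TransientPutError, the put and remint callables, and the limit of 4 attempts are all hypothetical; how an expired url is actually detected depends on the http client and on r2's response.

```python
from typing import Callable, Optional


class UrlExpired(Exception):
    """Hypothetical signal that the signed url is no longer valid."""


class TransientPutError(Exception):
    """Hypothetical signal for a retryable failure (timeout, 5xx)."""


def put_chunk_with_retry(
    chunk_hash: str,
    data: bytes,
    url: str,
    put: Callable[[str, bytes], None],
    remint: Callable[[str], str],
    max_attempts: int = 4,
) -> int:
    """PUT one chunk, re-minting its url on expiry; return the attempts used."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            put(url, data)  # same bytes every time, so the retry is safe
            return attempt
        except UrlExpired as exc:
            last_error = exc
            url = remint(chunk_hash)  # re-mint only this hash
        except TransientPutError as exc:
            last_error = exc  # keep the url, it is still valid
    raise RuntimeError(f"chunk {chunk_hash} failed after {max_attempts} attempts") from last_error
```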
9. reference chunker - pseudocode
both snippets below stream-read a file in fixed 4-MiB blocks, compute sha-256 per block, and yield (hash, size, offset). neither loads the file into memory.
python (cpython 3.9+)
import hashlib
from pathlib import Path
from typing import Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB


def chunk_file(path: Path) -> Iterator[dict]:
    """
    stream-read `path` in 4 MiB blocks; yield {"hash": hex, "size": int, "offset": int}
    for each chunk. the last chunk may be 1..CHUNK_SIZE bytes. a zero-byte file yields
    no chunks.
    """
    offset = 0
    with path.open("rb") as f:
        while True:
            # a buffered binary read returns fewer than CHUNK_SIZE bytes only at eof
            block = f.read(CHUNK_SIZE)
            if not block:
                return
            digest = hashlib.sha256(block).hexdigest()
            yield {"hash": digest, "size": len(block), "offset": offset}
            offset += len(block)


def file_to_version_entry(path: Path, rel_path: str) -> dict:
    chunks = list(chunk_file(path))
    return {
        "path": rel_path,
        "size": sum(c["size"] for c in chunks),
        "chunks": [{"hash": c["hash"], "size": c["size"]} for c in chunks],
    }
node (20+)
import { createReadStream } from 'node:fs';
import { createHash } from 'node:crypto';

const CHUNK_SIZE = 4 * 1024 * 1024; // 4 MiB

/**
 * stream-read `path` in 4 MiB blocks; yield { hash, size, offset } per chunk.
 * last chunk may be 1..CHUNK_SIZE bytes. zero-byte files yield nothing.
 */
export async function* chunkFile(path) {
  let offset = 0;
  let pending = Buffer.alloc(0);
  for await (const buf of createReadStream(path, { highWaterMark: CHUNK_SIZE })) {
    pending = pending.length === 0 ? buf : Buffer.concat([pending, buf]);
    while (pending.length >= CHUNK_SIZE) {
      const block = pending.subarray(0, CHUNK_SIZE);
      pending = pending.subarray(CHUNK_SIZE);
      const hash = createHash('sha256').update(block).digest('hex');
      yield { hash, size: block.length, offset };
      offset += block.length;
    }
  }
  if (pending.length > 0) {
    const hash = createHash('sha256').update(pending).digest('hex');
    yield { hash, size: pending.length, offset };
  }
}

export async function fileToVersionEntry(path, relPath) {
  const chunks = [];
  for await (const c of chunkFile(path)) {
    chunks.push({ hash: c.hash, size: c.size });
  }
  return {
    path: relPath,
    size: chunks.reduce((n, c) => n + c.size, 0),
    chunks,
  };
}
notes applicable to both implementations:
- read buffers must be at least 4 MiB, or the reader must accumulate smaller reads until it has a full chunk.
- the final chunk is whatever remains at eof.
- hashing and i/o must not be interleaved with other writers on the same file; take a read lock or copy the file first.
- the version's files[].size should equal sum(chunks[].size). current agents reject mismatches when parsing versions, but the publish API does not currently enforce this invariant.
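because the publish API does not enforce the size invariant, a client may want to check it before publishing. a minimal sketch, assuming a hypothetical helper name check_file_entry and the { path, size, chunks } entry shape produced by the reference chunkers:

```python
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB


def check_file_entry(entry: dict) -> List[str]:
    """Return human-readable problems; an empty list means the entry is consistent."""
    problems: List[str] = []
    chunks = entry.get("chunks", [])
    total = sum(c["size"] for c in chunks)
    if entry.get("size") != total:
        problems.append(f"size {entry.get('size')} != sum of chunk sizes {total}")
    for index, chunk in enumerate(chunks):
        size = chunk["size"]
        is_last = index == len(chunks) - 1
        # Every chunk is exactly CHUNK_SIZE except the last, which may be shorter.
        ok = (1 <= size <= CHUNK_SIZE) if is_last else (size == CHUNK_SIZE)
        if not ok:
            problems.append(f"chunk {index} has invalid size {size}")
    return problems
```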
10. error taxonomy
every error response follows rfc 7807 application/problem+json with standard extensions (code, docsUrl, requestId, and optional field-level errors). the code field is the stable contract; match on that, not on detail prose.
| code | status | endpoints | meaning |
|---|---|---|---|
| validation_failed | 400 | all chunk routes | malformed bare digest, missing or invalid siteId, empty hashes[], more than 1000 hashes, invalid from / to, from equal to to, or invalid pagination. |
| not_found | 404 | auth/site guard, POST /api/chunks/{digest}/mount | site not found or not accessible; mount digest is not stored for the site; source or target roost does not exist. |
| precondition_failed | 412 | chunk-dependent publish/finalize paths | a version references chunk hashes that are not present in the site's r2 namespace. upload the missing chunks and retry the publish. |
universal errors (unauthorized 401, token_expired 401, scope_insufficient 403, rate_limited 429, internal_error 500) are documented in the top-level api conventions and are not repeated here.
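matching on code rather than detail can be sketched as below. the action names returned here (fix_request and so on) are purely illustrative labels invented for this example; only the code values come from this document and the api conventions.

```python
import json

# Hypothetical mapping from stable error codes to client-side actions.
ACTIONS = {
    "validation_failed": "fix_request",
    "not_found": "check_site_digest_or_roost",
    "precondition_failed": "upload_missing_chunks_then_retry",
    "rate_limited": "wait_then_retry",
}


def classify_problem(body: str) -> str:
    """Pick a client action from a problem+json body by its `code` field."""
    problem = json.loads(body)
    # `detail` is prose and may change; `code` is the stable contract.
    return ACTIONS.get(problem.get("code"), "unhandled")
```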
example validation_failed for a malformed digest:
{
"type": "https://owlette.app/problems/validation-failed",
"title": "validation failed",
"status": 400,
"detail": "field hashes contains malformed hash entries (must be lowercase 64-char hex sha-256)",
"code": "validation_failed",
"errors": {
"hashes": [
"malformed entries: bad-digest..."
]
},
"docsUrl": "https://owlette.app/docs/api/errors#validation_failed",
"requestId": "req_01HW..."
}
11. operational notes
- signed-url expiry is enforced by r2, not roost. clients should trust expiresAt from the api response, not local time.
- retries during the upload session should respect Retry-After on 429 and use exponential backoff on 5xx from r2.
- partial uploads are not representable. r2 PUT is atomic per object; retry the whole chunk.
- a chunk with zero referrers across all roosts in a site becomes eligible for deletion after the gc grace period. during that window, a mount can still resurrect it.
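the retry guidance above (Retry-After on 429, exponential backoff on 5xx) could be computed along these lines. this is a sketch under assumptions: the base delay, the 30-second cap, and the use of full jitter are choices made for this example rather than documented roost behavior, and only the delta-seconds form of Retry-After is parsed.

```python
import random
from typing import Callable, Optional


def retry_delay(
    status: int,
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> Optional[float]:
    """Seconds to wait before retrying, or None if the status is not retryable.

    `attempt` is zero-based: 0 for the first retry, 1 for the second, and so on.
    """
    ceiling = min(cap, base * (2 ** attempt))
    if status == 429:
        if retry_after is not None:
            try:
                return max(0.0, float(retry_after))  # honor the server's hint
            except ValueError:
                pass  # an HTTP-date Retry-After is not handled in this sketch
        return ceiling
    if 500 <= status <= 599:
        return ceiling * rng()  # full jitter: uniform in [0, ceiling)
    return None
```

passing rng explicitly keeps the function deterministic under test while defaulting to random jitter in normal use.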