Defines all module-level constants (system volume name/mount/size, sync file
names, history dir conventions, ChainStore hkey, manifest schema version,
failure stage strings) and the SyncManager class skeleton with
NotImplementedError stubs for every public method called out in the plan:
resolve_container_path, _write_json_atomic, history readers/writers,
claim_request, make_archive, publish_snapshot, fetch_latest,
validate_manifest, extract_archive, apply_snapshot, _retire_previous_cid.
The module parses cleanly; no behaviour yet.
Path helpers (system_volume_host_root, volume_sync_dir, history_root) derive
locations from owner.get_data_folder() / _get_instance_data_subfolder() so
the manager has no hardcoded plugin assumptions beyond the documented owner
interface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Concrete helpers in SyncManager (no plugin lifecycle yet):
resolve_container_path(): six-rule chokepoint — absolute, covered, fixed-size
backed (host_root under fixed_volumes/mounts/), not the system volume,
no '..' segments, resolved host path stays within host_root.
_write_json_atomic(): tmp + os.replace, creates parent dir, fsyncs the
payload, removes the tmp on failure so we never leak orphans.
History helpers: append_sent / append_received / latest_sent /
latest_received / update_history_deletion. Filenames are
'<10-digit-version>__<12-char-cid>.json' so lexical sort = chronological.
The 'deletion' sub-record defaults to {None,None,None} on append and is
mutated in place via atomic write when a snapshot is superseded.
22 unit tests cover happy paths, rejection paths for every rule, atomic
write semantics, history ordering, and missing-file deletion-update.
All pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
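For orientation, a minimal sketch of the six-rule resolver described above.
The shape of the `volumes` mapping (mount point to a dict with 'host_root'
and 'fixed_size') and the constant name are assumptions made for
illustration; the real SyncManager reads its volume table from the owner.

```python
import os

SYSTEM_VOLUME_MOUNT = "/r1en_system"  # assumed constant name


def resolve_container_path(container_path, volumes):
    """Map a container-absolute path to a host path, enforcing the six rules."""
    # Rule: absolute container path only.
    if not container_path.startswith("/"):
        raise ValueError("path must be container-absolute")
    # Rule: no '..' segments anywhere.
    if ".." in container_path.split("/"):
        raise ValueError("path traversal not allowed")
    # Rule: must be covered by a declared volume (longest mount point wins).
    mount = max(
        (m for m in volumes
         if container_path == m or container_path.startswith(m.rstrip("/") + "/")),
        key=len, default=None)
    if mount is None:
        raise ValueError("path not covered by any volume")
    # Rule: the system volume is a control-plane channel, never a sync target.
    if mount == SYSTEM_VOLUME_MOUNT:
        raise ValueError("system volume may not be synced")
    vol = volumes[mount]
    # Rule: only fixed-size loopback-backed volumes (host_root under fixed_volumes/mounts/).
    if not vol.get("fixed_size") or "fixed_volumes/mounts/" not in vol["host_root"]:
        raise ValueError("volume is not fixed-size backed")
    # Rule: the resolved host path must stay inside host_root.
    host_root = os.path.realpath(vol["host_root"])
    host_path = os.path.realpath(
        os.path.join(host_root, container_path[len(mount):].lstrip("/")))
    if host_path != host_root and not host_path.startswith(host_root + os.sep):
        raise ValueError("resolved path escapes the volume")
    return host_path
```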
claim_request(): atomic os.replace of request.json -> .processing, JSON
parse, top-level shape validation, per-path resolve_container_path check.
On any failure it writes request.json.invalid (with the original body, or
raw_body for malformed JSON) plus response.json (error shape) and deletes
the .processing file. Returns (archive_paths, metadata) on success, None
otherwise.
_fail_request(): shared helper that writes the .invalid + response.json pair
so the artifacts stay consistent across all failure stages.
make_archive(): tarfile-based gzip with member names = container-absolute
paths. Re-runs resolve_container_path on each entry as defence in depth and
raises FileNotFoundError if the host path is missing. Output goes to
owner.get_output_folder() with a pid+timestamp-suffixed filename.
extract_archive(): two-pass: validate every member first (so unmapped
members abort the entire extract before any write), then atomic per-file
write via tmp + os.replace. Skips symlink/hardlink members for safety.
Member names from tarfile are stripped of their leading '/' by POSIX
default, so we re-prepend it before resolving.
19 new unit tests (41 total) cover the claim happy path, every validation
failure shape (malformed JSON, wrong type, missing/empty archive_paths,
metadata not an object, traversal, unmounted, non-fixed-size, system
volume, links), the archive round-trip, and the non-existent host path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
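A hedged sketch of the two-pass extract described above. The `resolve`
callable stands in for resolve_container_path and the function signature is
an assumption, not the shipped code.

```python
import os
import tarfile
import tempfile


def extract_archive(archive_path, volumes, resolve):
    """Two-pass extract: validate every member, then write each file atomically."""
    with tarfile.open(archive_path, "r:gz") as tar:
        # Keep only regular files and directories: symlink/hardlink members are skipped.
        members = [m for m in tar.getmembers() if m.isfile() or m.isdir()]
        # Pass 1: every member must map to a host path; one bad member aborts
        # the whole extract before anything is written.
        resolved = []
        for m in members:
            container_path = "/" + m.name.lstrip("/")  # tarfile strips the leading '/'
            resolved.append((m, resolve(container_path, volumes)))
        # Pass 2: atomic per-file write via tmp + os.replace so readers never
        # observe a half-written file.
        for m, host_path in resolved:
            if m.isdir():
                os.makedirs(host_path, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(host_path), exist_ok=True)
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(host_path))
            try:
                with os.fdopen(fd, "wb") as out:
                    out.write(tar.extractfile(m).read())
                os.replace(tmp, host_path)
            except Exception:
                if os.path.exists(tmp):
                    os.unlink(tmp)  # never leak a tmp file on failure
                raise
```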
…/retire)
publish_snapshot(): full provider flow as four staged operations
(archive_build, r1fs_upload, chainstore_publish, then history+response).
Each stage has its own _fail_request shape with the matching stage string.
On success: writes response.json{ok}, clears any prior request.json.invalid,
deletes .processing, and best-effort retires the prior CID. The archive
tmp file is always removed (success or fail) via try/finally.
fetch_latest(): chainstore_hsync (non-fatal on error) + chainstore_hget.
validate_manifest(): runs the manifest's archive_paths through
resolve_container_path against the consumer's volumes; returns the list of
unmappable entries so the caller can decide between "apply" and "skip".
apply_snapshot(): pre-flight via validate_manifest, then r1fs.get_file →
extract_archive → append history → write last_apply.json → retire prior
CID with cleanup_local_files=True (consumer-only — drops the local R1FS
download too). Failure modes are all non-fatal (no last_apply written so
the consumer-side app sees nothing landed; history not advanced).
_retire_previous_cid(): finds the most recent prior un-retired entry whose
cid differs from the latest, calls r1fs.delete_file, updates the prior
entry's deletion sub-record. Never raises — deletion failure must not
roll back the new publish/apply.
20 new tests (61 total): orchestrator happy paths, every failure stage
with the right artifact shape, ChainStore ack=False non-fatal,
two-snapshot retirement (sender + receiver sides), failed-retirement audit
trail, layout misalignment skip, end-to-end provider→consumer round-trip
through shared FakeR1FS + FakeChainStore. All pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
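A minimal sketch of the retirement step, written as a free function so it is
self-contained. The field names inside the deletion sub-record, the
`rewrite_entry` callback, and the exact delete_file signature are assumptions
standing in for the manager's internals; the never-raise behaviour is the
point being illustrated.

```python
import time


def retire_previous_cid(r1fs, history_entries, latest_cid,
                        rewrite_entry, cleanup_local_files=False, log=print):
    """Best-effort retirement of the previously published/applied snapshot.

    history_entries is assumed newest-first; each entry carries 'cid' and a
    'deletion' sub-record whose fields default to None. Never raises: a failed
    deletion must not roll back the publish/apply that just succeeded.
    """
    try:
        prior = next(
            (e for e in history_entries
             if e["cid"] != latest_cid and e["deletion"].get("deleted_at") is None),
            None)
        if prior is None:
            return
        # Drop the superseded snapshot from R1FS; consumer-side, also drop the local copy.
        r1fs.delete_file(prior["cid"], cleanup_local_files=cleanup_local_files)
        prior["deletion"] = {"deleted_at": time.time(), "by": "sync", "error": None}
        rewrite_entry(prior)  # atomic rewrite of the mutated history record
    except Exception as exc:
        # Audit trail only; the new snapshot stands regardless.
        log(f"retiring previous CID failed (non-fatal): {exc}")
```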
…lume
The mixin is the integration layer between SyncManager and the plugin
lifecycle. Public surface (called by ContainerAppRunnerPlugin in the next
step):
* _configure_system_volume(): provisions the always-on /r1en_system
fixed-size loopback (idempotent across restarts via fixed_volume.provision),
tracks it in self._fixed_volumes for cleanup parity, mkdir -p's the
volume-sync subdir, and binds it into self.volumes.
* _inject_sync_env_vars(): R1_SYSTEM_VOLUME / R1_VOLUME_SYNC_DIR /
  R1_SYNC_REQUEST_FILE are always set; R1_SYNC_TYPE / R1_SYNC_KEY only when
  SYNC.ENABLED, so apps that branch on role can detect whether sync is on.
* _sync_provider_tick / _sync_consumer_tick: throttled by POLL_INTERVAL.
Drive stop_container -> work -> start_container INLINE (not via a
StopReason -> _restart_container, which would unmount the loopback).
Validation failures don't disturb the container; execution failures
still restart it.
* _sync_initial_consumer_block(): blocks consumer's first start_container
until ChainStore has a record (bounded by INITIAL_SYNC_TIMEOUT;
0=forever).
* _recover_stale_processing(): renames orphan request.json.processing back
to request.json so a crash mid-publish doesn't leave a request stuck.
21 unit tests against a fake plugin that records stop/start ordering: env
injection across enabled/disabled, role helpers, throttle, full provider
flow with success and r1fs failure, validation failure does not stop
container, full consumer flow, misalignment skip, already-applied no-op,
stale .processing recovery (and don't-clobber rule when both files exist).
All pass alongside the 61 sync_manager tests (82 total).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
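A sketch of the provider tick's shape, under stated assumptions. The
stop_container/start_container calls, the POLL_INTERVAL throttle, and the
inline (no StopReason) rule come from the description above; the class
scaffolding and exact signatures are invented for illustration.

```python
import time


class ProviderTickSketch:
    """Illustrative shape only; the real methods live on the plugin mixin."""

    def __init__(self, sync_manager, poll_interval=30):
        self.sync_manager = sync_manager
        self.poll_interval = poll_interval
        self._last_sync_check = 0.0

    def stop_container(self): ...
    def start_container(self): ...

    def _sync_provider_tick(self):
        now = time.time()
        if now - self._last_sync_check < self.poll_interval:
            return                      # throttled by POLL_INTERVAL
        self._last_sync_check = now

        claim = self.sync_manager.claim_request()
        if claim is None:
            return                      # no request, or validation failed: container untouched
        archive_paths, metadata = claim

        # Stop -> work -> start INLINE; a StopReason would route through
        # _restart_container and unmount the loopback under us.
        self.stop_container()
        try:
            self.sync_manager.publish_snapshot(archive_paths, metadata)
        finally:
            self.start_container()      # execution failures still restart the app
```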
Integration changes in container_app_runner.py + mixins/__init__.py:
* MRO: insert _SyncMixin before _ContainerUtilsMixin so its
_configure_system_volume / _inject_sync_env_vars / tick methods are
available to the plugin.
* _CONFIG: new SYNC block (ENABLED/KEY/TYPE/POLL_INTERVAL/
INITIAL_SYNC_TIMEOUT). System volume itself is hardcoded — no config.
* __reset_vars: initialize _last_sync_check + _sync_manager.
* on_init:
- _configure_system_volume() after the existing _configure_*_volumes
chain so /r1en_system is always provisioned;
- _recover_stale_processing() so a crash mid-publish doesn't strand
a request in .processing;
- _validate_sync_config() (logs a clear error and disables SYNC if
KEY/TYPE are bad — the system volume keeps working);
- _inject_sync_env_vars() right after _setup_env_and_ports() in the
non-semaphored branch.
* _restart_container: same _configure_system_volume + recovery +
_inject_sync_env_vars sequence so a full restart (e.g. for image
update) recreates the volume + env vars cleanly.
* _handle_initial_launch:
- _inject_sync_env_vars() after _setup_env_and_ports in the
semaphored branch;
- _sync_initial_consumer_block() before the very first
start_container so consumer pods boot on populated state.
* _perform_additional_checks: drives the provider/consumer tick INLINE
(return None — must NOT use a StopReason because that routes through
_restart_container which unmounts the loopback before our work).
All 82 sync-only unit tests still pass. The 10 pre-existing failures in
the rest of the test_*.py suite are unrelated (test env doesn't have
docker-py installed; same failure on master).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
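The _perform_additional_checks wiring boils down to a few lines; this is a
sketch only, and the _is_sync_enabled / _sync_is_provider helper names are
assumptions (the commit above only says "role helpers" exist).

```python
def _perform_additional_checks(self):
    """Drive the sync ticks inline from the plugin's periodic check hook."""
    if self._is_sync_enabled():          # assumed helper over the SYNC config block
        if self._sync_is_provider():     # assumed role helper
            self._sync_provider_tick()
        else:
            self._sync_consumer_tick()
    # Returning None is deliberate: any StopReason would route through
    # _restart_container, which unmounts the loopback before the work ran.
    return None
```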
E2E surfaced this on the first deploy: flask-files-explorer runs as a
non-root user, the system volume's loop mount is root-owned (because
_resolve_image_owner() returned (None, None) for this image), and the app's
POST /create against /r1en_system/volume-sync/request.json got 'Permission
denied'.
The system volume is purpose-built as an app-writable control-plane channel
between the container and CAR. There's no isolation gain in restricting the
volume-sync subdir to root: the volume is per-CAR-instance, and the app
already owns the container.
Chmod both the mount root and the volume-sync/ subdir to 0o777 after mkdir
so any container user can write requests.
Existing 82 unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
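The fix amounts to two chmods after the mkdir; a small sketch, assuming the
host-side root is handed in as a parameter (the real code derives it via
system_volume_host_root).

```python
import os


def open_up_sync_dir(system_volume_host_root):
    """Make the control-plane channel writable by any container user."""
    sync_dir = os.path.join(system_volume_host_root, "volume-sync")
    os.makedirs(sync_dir, exist_ok=True)
    os.chmod(system_volume_host_root, 0o777)  # mount root
    os.chmod(sync_dir, 0o777)                 # volume-sync/ subdir
```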
… read
E2E surfaced two more permission issues:
1. _write_json_atomic uses tempfile.mkstemp which defaults to 0o600.
CAR runs as root, but the app inside the container is typically a
non-root user (e.g. flask-files-explorer's appuser). After
os.replace the response.json / last_apply.json / request.json.invalid
files were unreadable from inside the container. Now chmod 0o666
before replace.
2. extract_archive preserves the tar member's mode, but the new file
is root-owned (CAR's identity). If the source mode was something
restrictive, the app couldn't read it. Now we max() the source
mode against 0o644 for files and 0o755 for directories so the app
can always cat / traverse what CAR landed.
82 unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
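Both fixes in one self-contained sketch: the chmod before the atomic swap,
and the mode floor on extracted members. The helper names are illustrative,
not the shipped ones.

```python
import json
import os
import tempfile


def write_json_readable(path, payload):
    """Atomic JSON write that stays readable for a non-root container user.

    mkstemp defaults to 0o600 and CAR writes as root, so chmod 0o666 before
    os.replace keeps response.json / last_apply.json readable inside the app.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())      # payload is on disk before the swap
        os.chmod(tmp, 0o666)
        os.replace(tmp, path)
    except Exception:
        if os.path.exists(tmp):
            os.unlink(tmp)             # never leak the tmp file
        raise


def landed_mode(tar_mode, is_dir):
    """Floor the tar member's mode so the app can always read/traverse it."""
    return max(tar_mode, 0o755 if is_dir else 0o644)
```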
…_volume
E2E found this: after a CAR restart, the data volume (declared via
FIXED_SIZE_VOLUMES) appeared empty inside the container even though
fixed_volume.provision had just chowned and mounted it.
Cause: _configure_system_volume was calling fixed_volume.cleanup_stale_mounts
on the shared <plugin_data>/fixed_volumes/ root. That scan iterates EVERY
meta/*.json file under the root, including the appdata.json that
_FixedSizeVolumesMixin._configure_fixed_size_volumes() (which runs FIRST)
just provisioned. cleanup_stale_mounts saw the active mount, treated it as
'stale', and umount + losetup -d'd it. Then provision() of the system volume
ran but never re-mounted appdata. The container started with an empty bind
from an unmounted host path.
Fix: don't repeat the stale-mount sweep; the previous configure step already
did it. Add an explicit comment so a future maintainer doesn't re-introduce
the call thinking it's defensive.
82 unit tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The decision the consumer needs is content-identity: 'is this bundle
the one I just applied?'. CID is a content-addressed hash that answers
that exactly. Version was a CAR-side timestamp that was indirectly
serving as identity, but it imported a class of failure modes (clock
skew, multi-provider non-monotonic ordering) that simply don't apply
to a CID comparison.
Two coupled changes:
* _sync_consumer_tick: skip when record.cid == latest_received().cid
instead of new_version <= last_version. Different cid → apply,
regardless of version metadata.
* _latest_in: sort by mtime, not by filename. Filenames stay
version-prefixed for chronological browsability under normal
operation, but 'what did I last apply?' is fundamentally an
insert-order question. Without this, a back-dated record would
write a history file with a lex-smaller filename than the previous
entry, latest_received would still return the older one, and the
consumer would re-apply the back-dated record on every tick
forever.
version is kept everywhere as informational metadata (record schema +
history entries + response.json + last_apply.json + filename prefixes)
so wire compat is preserved and human-readable logs still say
'applying v1714742400 ...'. Only the comparison logic changed.
Tests updated:
* test_skips_already_applied_version → test_skips_when_record_cid_matches_last_apply
* Added test_applies_when_cid_differs_even_if_version_lower (covers the
clock-skew failure mode the old code couldn't survive).
* test_latest_picks_highest_version → test_latest_picks_most_recently_written
83 unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
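The two coupled changes reduce to very little code; a sketch under the
assumption that history entries are '<version>__<cid>.json' files and the
ChainStore record is a dict with a 'cid' key.

```python
import glob
import os


def latest_in(history_dir):
    """Most recently written history entry, not the lexicographically largest.

    Sorting by mtime makes 'what did I last apply?' an insert-order question,
    so a back-dated version can never shadow a newer apply.
    """
    files = glob.glob(os.path.join(history_dir, "*.json"))
    return max(files, key=os.path.getmtime) if files else None


def should_apply(record, last_received_cid):
    """Content identity decides: a different CID is applied regardless of version."""
    return record["cid"] != last_received_cid
```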
The volume-sync feature was scattered across two locations because the
codebase has a 'mixins go in mixins/' convention while the manager is
not a mixin:
before:
extensions/business/container_apps/
├── sync_manager.py ← module-level constants + SyncManager
└── mixins/
└── sync_mixin.py ← _SyncMixin (lifecycle integration)
After this refactor the entire feature lives in one folder and a reader
can scan it without jumping around:
extensions/business/container_apps/
└── sync/
├── __init__.py ← re-exports the public API
├── constants.py ← file names, hkeys, schema version,
│ stage labels — pure data
├── manager.py ← SyncManager + host-side path helpers
└── mixin.py ← _SyncMixin
Module renames:
- sync_manager.py → sync/manager.py
- mixins/sync_mixin.py → sync/mixin.py
- (new) → sync/constants.py (extracted from manager)
- (new) → sync/__init__.py (re-exports)
Import-site updates:
- container_app_runner.py: pull _SyncMixin from .sync, not .mixins
- mixins/__init__.py: drop _SyncMixin export (note added pointing
to the new location)
- tests/test_sync_manager.py and tests/test_sync_mixin.py: import
from .sync (the package) rather than the deleted modules
Tradeoff: this one feature breaks the 'mixins live in mixins/'
convention. Worth it because everything related to sync — constants,
helpers, the manager class, and the mixin — is now reachable with
'cd sync/'. The mixins/__init__.py docstring points the next reader at
the new location so it's not surprising.
Verification: all 83 unit tests pass (test_sync_manager + test_sync_mixin
exercise the resolver, atomic JSON, history I/O, claim_request, archive
roundtrip, orchestrators, env injection, ticks, recovery — all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
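A plausible shape for the re-export module; the exact symbol list beyond the
names mentioned in this refactor is an assumption.

```python
# sync/__init__.py
from . import constants            # file names, hkeys, schema version, stage labels
from .manager import SyncManager   # SyncManager + host-side path helpers
from .mixin import _SyncMixin      # plugin lifecycle integration

__all__ = ["constants", "SyncManager", "_SyncMixin"]
```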