
Feature/CAR volume sync #399

Open

toderian wants to merge 11 commits into develop from feature/car-volume-sync

Conversation

toderian (Contributor) commented May 4, 2026

No description provided.

toderian and others added 10 commits May 3, 2026 18:40
Defines all module-level constants (system volume name/mount/size, sync file
names, history dir conventions, ChainStore hkey, manifest schema version,
failure stage strings) and the SyncManager class skeleton with NotImplementedError
stubs for every public method called out in the plan: resolve_container_path,
_write_json_atomic, history readers/writers, claim_request, make_archive,
publish_snapshot, fetch_latest, validate_manifest, extract_archive,
apply_snapshot, _retire_previous_cid. Module parses cleanly; no behaviour yet.

Path helpers (system_volume_host_root, volume_sync_dir, history_root) derive
locations from owner.get_data_folder() / _get_instance_data_subfolder() so
the manager has no hardcoded plugin assumptions beyond the documented owner
interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Concrete helpers in SyncManager (no plugin lifecycle yet):

resolve_container_path(): six-rule chokepoint — absolute, covered, fixed-size
backed (host_root under fixed_volumes/mounts/), not the system volume,
no '..' segments, resolved host path stays within host_root.
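The six rules can be sketched as a single validation chokepoint. This is an illustrative reconstruction, not the real CAR code: the `volumes` mapping, the `fixed_volumes/mounts/` marker, and the `/r1en_system` constant are assumptions taken from the descriptions in this PR.

```python
import os

SYSTEM_VOLUME_MOUNT = "/r1en_system"  # assumed constant name

def resolve_container_path(container_path, volumes):
    """Map a container-absolute path to a host path, enforcing the six
    rules described above. `volumes` maps container mount -> host_root;
    the structure is illustrative."""
    # Rule 1: must be absolute.
    if not container_path.startswith("/"):
        raise ValueError("path must be absolute")
    # Rule 2: no '..' segments anywhere in the raw path.
    if ".." in container_path.split("/"):
        raise ValueError("'..' segments are not allowed")
    # Rule 3: never target the system volume itself.
    if container_path == SYSTEM_VOLUME_MOUNT or container_path.startswith(SYSTEM_VOLUME_MOUNT + "/"):
        raise ValueError("system volume is off-limits")
    for mount, host_root in volumes.items():
        if container_path == mount or container_path.startswith(mount + "/"):
            # Rule 4: only fixed-size backed volumes qualify.
            if "/fixed_volumes/mounts/" not in host_root + "/":
                raise ValueError("volume is not fixed-size backed")
            rel = container_path[len(mount):].lstrip("/")
            host_path = os.path.realpath(os.path.join(host_root, rel))
            real_root = os.path.realpath(host_root)
            # Rules 5-6: the resolved host path must stay within host_root.
            if host_path != real_root and not host_path.startswith(real_root + os.sep):
                raise ValueError("escapes host_root after resolution")
            return host_path
    # Rule: the path must be covered by some configured volume.
    raise ValueError("path not covered by any volume")
```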

_write_json_atomic(): tmp + os.replace, creates parent dir, fsyncs the
payload, removes the tmp on failure so we never leak orphans.
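The tmp + os.replace pattern described above can be sketched as follows; this is a minimal standalone version, not the actual `_write_json_atomic` body.

```python
import json
import os
import tempfile

def write_json_atomic(path, payload):
    """Write JSON so readers only ever see the old file or the complete
    new one, never a partial write."""
    parent = os.path.dirname(path) or "."
    os.makedirs(parent, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(json.dumps(payload))
            f.flush()
            os.fsync(f.fileno())  # payload is durable before the rename
        os.replace(tmp, path)    # atomic on POSIX
    except BaseException:
        # never leak an orphaned tmp file on failure
        try:
            os.unlink(tmp)
        except OSError:
            pass
        raise
```

The tmp file lives in the destination directory so the final `os.replace` never crosses a filesystem boundary.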

History helpers: append_sent / append_received / latest_sent /
latest_received / update_history_deletion. Filenames are
'<10-digit-version>__<12-char-cid>.json' so lexical sort = chronological.
The 'deletion' sub-record defaults to {None,None,None} on append and is
mutated in place via atomic write when a snapshot is superseded.
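The filename convention is what makes a plain lexical sort chronological under normal operation. A small sketch of the format (the helper name is hypothetical):

```python
def history_filename(version, cid):
    """'<10-digit-version>__<12-char-cid>.json', as described above.
    Zero-padding the version keeps lexical order == numeric order."""
    return f"{int(version):010d}__{cid[:12]}.json"
```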

22 unit tests cover happy paths, rejection paths for every rule, atomic
write semantics, history ordering, and missing-file deletion-update.
All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claim_request(): atomic os.replace of request.json -> .processing,
JSON parse, top-level shape validation, per-path resolve_container_path
check. On any failure writes request.json.invalid (with the original
body or raw_body for malformed JSON) and response.json (error shape),
deletes the .processing file. Returns (archive_paths, metadata) on
success or None.

_fail_request(): shared helper that writes the .invalid + response.json
pair so the artifacts stay consistent across all failure stages.
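The claim step relies on `os.replace` being atomic, so even two concurrent ticks cannot both win the rename. A simplified sketch, assuming the file names above and omitting the `.invalid`/`response.json` failure artifacts:

```python
import json
import os

def claim_request(sync_dir):
    """Atomically claim request.json by renaming it to .processing,
    then parse it. Returns the body on success, None otherwise."""
    req = os.path.join(sync_dir, "request.json")
    processing = req + ".processing"
    try:
        os.replace(req, processing)  # atomic claim; only one winner
    except FileNotFoundError:
        return None  # nothing to do this tick
    try:
        with open(processing) as f:
            return json.load(f)
    except json.JSONDecodeError:
        # the real code writes request.json.invalid + response.json here
        os.unlink(processing)
        return None
```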

make_archive(): tarfile-based gzip with member names = container-absolute
paths. Re-runs resolve_container_path on each entry as defence in depth
and raises FileNotFoundError if the host path is missing. Output goes to
owner.get_output_folder() with a pid+timestamp-suffixed filename.
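A reduced sketch of the archive step, assuming a plain mapping of container path to host path (the real code derives this via resolve_container_path):

```python
import os
import tarfile
import time

def make_archive(entries, out_dir):
    """Build a gzip tar whose member names are the container-absolute
    paths, written to a pid+timestamp-suffixed file. Note that tarfile
    strips the leading '/' from arcnames when adding."""
    out = os.path.join(out_dir, f"snapshot_{os.getpid()}_{int(time.time())}.tar.gz")
    with tarfile.open(out, "w:gz") as tar:
        for container_path, host_path in entries.items():
            if not os.path.exists(host_path):
                raise FileNotFoundError(host_path)
            tar.add(host_path, arcname=container_path)
    return out
```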

extract_archive(): two-pass — validate every member first (so unmapped
members abort the entire extract before any write), then atomic per-file
write via tmp + os.replace. Skips symlink/hardlink members for safety.
Member names from tarfile are stripped of leading '/' by POSIX default,
so we re-prepend before resolving.
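The two-pass shape can be sketched like this. `resolver` stands in for resolve_container_path; the signature and the surrounding details are assumptions for illustration.

```python
import os
import tarfile
import tempfile

def extract_archive(archive, resolver):
    """Pass 1 validates every member before anything is written, so one
    unmappable member aborts the whole extract. Pass 2 lands each file
    atomically via tmp + os.replace. Link members are skipped."""
    with tarfile.open(archive, "r:gz") as tar:
        members = [m for m in tar.getmembers() if m.isfile()]  # skip links/dirs
        # Pass 1: resolve every member; re-prepend the '/' tarfile stripped.
        targets = {m.name: resolver("/" + m.name.lstrip("/")) for m in members}
        # Pass 2: atomic per-file writes.
        for m in members:
            dest = targets[m.name]
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest))
            with os.fdopen(fd, "wb") as f:
                f.write(tar.extractfile(m).read())
            os.replace(tmp, dest)
```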

19 new unit tests (41 total) cover the claim happy path, every validation
failure shape (malformed JSON, wrong type, missing/empty archive_paths,
metadata not an object, traversal, unmounted, non-fixed-size, system
volume, links), the archive round-trip, and the non-existent host path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/retire)

publish_snapshot(): full provider flow as four staged operations
(archive_build, r1fs_upload, chainstore_publish, then history+response).
Each stage has its own _fail_request shape with the matching stage string.
On success: writes response.json{ok}, clears any prior request.json.invalid,
deletes .processing, and best-effort retires the prior CID. The archive
tmp file is always removed (success or fail) via try/finally.

fetch_latest(): chainstore_hsync (non-fatal on error) + chainstore_hget.

validate_manifest(): runs the manifest's archive_paths through
resolve_container_path against the consumer's volumes; returns the list of
unmappable entries so the caller can decide between "apply" and "skip".

apply_snapshot(): pre-flight via validate_manifest, then r1fs.get_file →
extract_archive → append history → write last_apply.json → retire prior
CID with cleanup_local_files=True (consumer-only — drops the local R1FS
download too). Failure modes are all non-fatal (no last_apply written so
the consumer-side app sees nothing landed; history not advanced).

_retire_previous_cid(): finds the most recent prior un-retired entry whose
cid differs from the latest, calls r1fs.delete_file, updates the prior
entry's deletion sub-record. Never raises — deletion failure must not
roll back the new publish/apply.

20 new tests (61 total): orchestrator happy paths, every failure stage
with the right artifact shape, ChainStore ack=False non-fatal,
two-snapshot retirement (sender + receiver sides), failed-retirement audit
trail, layout misalignment skip, end-to-end provider→consumer round-trip
through shared FakeR1FS + FakeChainStore. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lume

The mixin is the integration layer between SyncManager and the plugin
lifecycle. Public surface (called by ContainerAppRunnerPlugin in the next
step):

  * _configure_system_volume(): provisions the always-on /r1en_system
    fixed-size loopback (idempotent across restarts via fixed_volume.provision),
    tracks it in self._fixed_volumes for cleanup parity, mkdir -p's the
    volume-sync subdir, and binds it into self.volumes.
  * _inject_sync_env_vars(): R1_SYSTEM_VOLUME / R1_VOLUME_SYNC_DIR /
    R1_SYNC_REQUEST_FILE are always set; R1_SYNC_TYPE / R1_SYNC_KEY only
    when SYNC.ENABLED, so apps that branch on role can do so.
  * _sync_provider_tick / _sync_consumer_tick: throttled by POLL_INTERVAL.
    Drive stop_container -> work -> start_container INLINE (not via a
    StopReason -> _restart_container, which would unmount the loopback).
    Validation failures don't disturb the container; execution failures
    still restart it.
  * _sync_initial_consumer_block(): blocks consumer's first start_container
    until ChainStore has a record (bounded by INITIAL_SYNC_TIMEOUT;
    0=forever).
  * _recover_stale_processing(): renames orphan request.json.processing back
    to request.json so a crash mid-publish doesn't leave a request stuck.
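The recovery rule above, including the don't-clobber case the tests mention, reduces to a few lines. A sketch with an assumed function name:

```python
import os

def recover_stale_processing(sync_dir):
    """Rename an orphaned request.json.processing back to request.json,
    but never clobber a newer request.json if both files exist."""
    req = os.path.join(sync_dir, "request.json")
    processing = req + ".processing"
    if os.path.exists(processing) and not os.path.exists(req):
        os.replace(processing, req)
```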

21 unit tests against a fake plugin that records stop/start ordering: env
injection across enabled/disabled, role helpers, throttle, full provider
flow with success and r1fs failure, validation failure does not stop
container, full consumer flow, misalignment skip, already-applied no-op,
stale .processing recovery (and don't-clobber rule when both files exist).
All pass alongside the 61 sync_manager tests (82 total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Integration changes in container_app_runner.py + mixins/__init__.py:

  * MRO: insert _SyncMixin before _ContainerUtilsMixin so its
    _configure_system_volume / _inject_sync_env_vars / tick methods are
    available to the plugin.
  * _CONFIG: new SYNC block (ENABLED/KEY/TYPE/POLL_INTERVAL/
    INITIAL_SYNC_TIMEOUT). System volume itself is hardcoded — no config.
  * __reset_vars: initialize _last_sync_check + _sync_manager.
  * on_init:
      - _configure_system_volume() after the existing _configure_*_volumes
        chain so /r1en_system is always provisioned;
      - _recover_stale_processing() so a crash mid-publish doesn't strand
        a request in .processing;
      - _validate_sync_config() (logs a clear error and disables SYNC if
        KEY/TYPE are bad — the system volume keeps working);
      - _inject_sync_env_vars() right after _setup_env_and_ports() in the
        non-semaphored branch.
  * _restart_container: same _configure_system_volume + recovery +
    _inject_sync_env_vars sequence so a full restart (e.g. for image
    update) recreates the volume + env vars cleanly.
  * _handle_initial_launch:
      - _inject_sync_env_vars() after _setup_env_and_ports in the
        semaphored branch;
      - _sync_initial_consumer_block() before the very first
        start_container so consumer pods boot on populated state.
  * _perform_additional_checks: drives the provider/consumer tick INLINE
    (return None — must NOT use a StopReason because that routes through
    _restart_container which unmounts the loopback before our work).

All 82 sync-only unit tests still pass. The 10 pre-existing failures in
the rest of the test_*.py suite are unrelated (test env doesn't have
docker-py installed; same failure on master).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
E2E surfaced this on the first deploy: flask-files-explorer runs as a
non-root user, the system volume's loop mount is root-owned (because
_resolve_image_owner() returned (None, None) for this image), and the
app's POST /create against /r1en_system/volume-sync/request.json got
'Permission denied'.

The system volume is purpose-built as an app-writable control-plane
channel between the container and CAR. There's no isolation gain in
restricting the volume-sync subdir to root: the volume is per-CAR-
instance, and the app already owns the container. Chmod both the
mount root and the volume-sync/ subdir to 0o777 after mkdir so any
container user can write requests.

Existing 82 unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… read

E2E surfaced two more permission issues:

  1. _write_json_atomic uses tempfile.mkstemp which defaults to 0o600.
     CAR runs as root, but the app inside the container is typically a
     non-root user (e.g. flask-files-explorer's appuser). After
     os.replace the response.json / last_apply.json / request.json.invalid
     files were unreadable from inside the container. Now chmod 0o666
     before replace.

  2. extract_archive preserves the tar member's mode, but the new file
     is root-owned (CAR's identity). If the source mode was something
     restrictive, the app couldn't read it. Now we max() the source
     mode against 0o644 for files and 0o755 for directories so the app
     can always cat / traverse what CAR landed.
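Both fixes are small. A sketch of each, with hypothetical helper names; the real CAR code applies these inside _write_json_atomic and extract_archive:

```python
import json
import os
import tempfile

FILE_FLOOR, DIR_FLOOR = 0o644, 0o755

def landed_mode(tar_mode, is_dir):
    """max() the tar member's mode against a readable floor, as the
    commit describes, so the in-container app can always read files
    and traverse directories that CAR landed."""
    return max(tar_mode, DIR_FLOOR if is_dir else FILE_FLOOR)

def write_json_world_readable(path, payload):
    """mkstemp defaults to 0o600; widen to 0o666 before the rename so a
    non-root container user can read the result."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    os.chmod(tmp, 0o666)  # os.chmod sets the mode exactly, ignoring umask
    os.replace(tmp, path)
```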

82 unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_volume

E2E found this: after a CAR restart, the data volume (declared via
FIXED_SIZE_VOLUMES) appeared empty inside the container even though
fixed_volume.provision had just chowned and mounted it.

Cause: _configure_system_volume was calling fixed_volume.cleanup_stale_mounts
on the shared <plugin_data>/fixed_volumes/ root. That scan iterates EVERY
meta/*.json file under the root, including the appdata.json that
_FixedSizeVolumesMixin._configure_fixed_size_volumes() (which runs FIRST)
just provisioned. cleanup_stale_mounts saw the active mount, treated it
as 'stale', and umount + losetup -d'd it. Then provision() of the system
volume ran but never re-mounted appdata. The container started with an
empty bind from an unmounted host path.

Fix: don't repeat the stale-mount sweep — the previous configure step
already did it. Add an explicit comment so a future maintainer doesn't
re-introduce the call thinking it's defensive.

82 unit tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The decision the consumer needs is content-identity: 'is this bundle
the one I just applied?'. CID is a content-addressed hash that answers
that exactly. Version was a CAR-side timestamp that was indirectly
serving as identity, but it imported a class of failure modes (clock
skew, multi-provider non-monotonic ordering) that simply don't apply
to a CID comparison.

Two coupled changes:

  * _sync_consumer_tick: skip when record.cid == latest_received().cid
    instead of new_version <= last_version. Different cid → apply,
    regardless of version metadata.
  * _latest_in: sort by mtime, not by filename. Filenames stay
    version-prefixed for chronological browsability under normal
    operation, but 'what did I last apply?' is fundamentally an
    insert-order question. Without this, a back-dated record would
    write a history file with a lex-smaller filename than the previous
    entry, latest_received would still return the older one, and the
    consumer would re-apply the back-dated record on every tick
    forever.
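The mtime-based selection can be sketched as below (illustrative name; the real helper is _latest_in):

```python
import os

def latest_in(history_dir):
    """Return the most recently WRITTEN history file, by mtime, rather
    than the lexically largest filename; None if the dir is empty.
    'What did I last apply?' is an insert-order question."""
    entries = [os.path.join(history_dir, n)
               for n in os.listdir(history_dir) if n.endswith(".json")]
    if not entries:
        return None
    return max(entries, key=os.path.getmtime)
```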

version is kept everywhere as informational metadata (record schema +
history entries + response.json + last_apply.json + filename prefixes)
so wire compat is preserved and human-readable logs still say
'applying v1714742400 ...'. Only the comparison logic changed.

Tests updated:
  * test_skips_already_applied_version → test_skips_when_record_cid_matches_last_apply
  * Added test_applies_when_cid_differs_even_if_version_lower (covers the
    clock-skew failure mode the old code couldn't survive).
  * test_latest_picks_highest_version → test_latest_picks_most_recently_written

83 unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
toderian self-assigned this May 4, 2026
The volume-sync feature was scattered across two locations because the
codebase has a 'mixins go in mixins/' convention while the manager is
not a mixin:

  before:
    extensions/business/container_apps/
    ├── sync_manager.py             ← module-level constants + SyncManager
    └── mixins/
        └── sync_mixin.py           ← _SyncMixin (lifecycle integration)

After this refactor the entire feature lives in one folder and a reader
can scan it without jumping around:

    extensions/business/container_apps/
    └── sync/
        ├── __init__.py             ← re-exports the public API
        ├── constants.py            ← file names, hkeys, schema version,
        │                             stage labels — pure data
        ├── manager.py              ← SyncManager + host-side path helpers
        └── mixin.py                ← _SyncMixin

Module renames:
  - sync_manager.py        →  sync/manager.py
  - mixins/sync_mixin.py   →  sync/mixin.py
  - (new)                  →  sync/constants.py    (extracted from manager)
  - (new)                  →  sync/__init__.py     (re-exports)

Import-site updates:
  - container_app_runner.py: pull _SyncMixin from .sync, not .mixins
  - mixins/__init__.py:      drop _SyncMixin export (note added pointing
                             to the new location)
  - tests/test_sync_manager.py and tests/test_sync_mixin.py: import
    from .sync (the package) rather than the deleted modules

Tradeoff: this one feature breaks the 'mixins live in mixins/'
convention. Worth it because everything related to sync — constants,
helpers, the manager class, and the mixin — is now reachable with
'cd sync/'. The mixins/__init__.py docstring points the next reader at
the new location so it's not surprising.

Verification: all 83 unit tests pass (test_sync_manager + test_sync_mixin
exercise the resolver, atomic JSON, history I/O, claim_request, archive
roundtrip, orchestrators, env injection, ticks, recovery — all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>