feat: add Maestro replay compatibility #581

Open
thymikee wants to merge 2 commits into main from codex/maestro-replay-compat

@thymikee (Member) commented May 21, 2026

Summary

Adds a focused Maestro replay compatibility path for existing YAML flows, including command mapping, variable/runScript handling, flow-control shims, and replay-only runtime helpers for scroll/tap behavior. Also adds the iOS non-hittable selector tap backdoor needed for hidden RN E2E controls and documents the supported replay subset.

Updates #558.

Touched 34 files. Scope is replay compatibility plus the narrow iOS runner tap fallback required by that workflow. Device utility commands such as permissions, mock location, airplane mode, orientation, and recording are intentionally left unsupported for now.

Validation

Verified with pnpm format, focused replay/help tests, pnpm build, and git diff --check. pnpm check:unit outside the sandbox passed all but one unrelated Android .aab install test that timed out in the full run; rerunning that exact test passed. Earlier validation on this branch also passed pnpm build:xcuitest for iOS and macOS. Manual Bluesky replay comparison produced Agent Device and Maestro recordings during validation; Maestro still fails the selected Bluesky flow earlier than Agent Device because its precondition does not reach the target screen.

github-actions Bot commented May 21, 2026

PR Preview Action v1.8.1

🚀 View preview at
https://callstackincubator.github.io/agent-device/pr-preview/pr-581/

Built to branch gh-pages at 2026-05-22 07:51 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@thymikee force-pushed the codex/maestro-replay-compat branch from 6794225 to baaee00 on May 21, 2026 19:36

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6794225256

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +145 to +147

    typeof value.direction === 'string'
      ? readScrollPositionalsFromDirectionSwipe(value.direction)[0]
      : 'down';

P1: Preserve scrollUntilVisible direction instead of inverting it

The scrollUntilVisible mapper currently derives direction via readScrollPositionalsFromDirectionSwipe(...)[0], which flips UP/DOWN/LEFT/RIGHT. That inversion is correct for Maestro swipe gestures, but scrollUntilVisible.direction is already a scroll direction; e.g. DOWN should keep scrolling toward lower content. With this mapping, flows that specify a direction scroll the opposite way and can consistently time out before finding elements.
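A minimal sketch of the distinction, not the PR's actual code: both helper names below are hypothetical, and the only assumption is that a swipe gesture maps to the opposite scroll direction while scrollUntilVisible.direction is passed through unchanged.

```typescript
type Direction = 'up' | 'down' | 'left' | 'right';

const DIRECTIONS: readonly Direction[] = ['up', 'down', 'left', 'right'];

// Hypothetical helper: a swipe moves content the opposite way,
// so a swipe UP scrolls toward lower content (scroll "down").
function scrollDirectionFromSwipe(swipe: Direction): Direction {
  const inverse: Record<Direction, Direction> = {
    up: 'down',
    down: 'up',
    left: 'right',
    right: 'left',
  };
  return inverse[swipe];
}

// Hypothetical helper: scrollUntilVisible.direction is already a scroll
// direction, so it is normalized but never inverted. Defaults to 'down'.
function scrollDirectionFromScrollUntilVisible(raw: unknown): Direction {
  if (typeof raw !== 'string') return 'down';
  const normalized = raw.toLowerCase();
  const match = DIRECTIONS.find((d) => d === normalized);
  if (!match) {
    throw new Error(`Unsupported scrollUntilVisible direction: ${raw}`);
  }
  return match;
}
```

Keeping the two mappings in separate functions makes it harder to reuse the swipe inversion for a value that is already a scroll direction.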

Useful? React with 👍 / 👎.

Comment on lines +231 to +233

    if (!conditionResponse.ok) {
      return { ok: true, data: { skipped: true, condition: mode, selector } };
    }

P1: Fail runFlow.when on runtime errors instead of silently skipping

When evaluating runFlow.when, any failed `is` response is treated as a skipped condition (ok: true). This conflates expected predicate-false outcomes with real runtime failures (for example, transport or session errors), so a broken condition check can silently bypass required subflow steps and let replay continue as a false pass. The handler should skip only on genuine condition-miss cases and propagate operational errors.
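One way the handler could separate the two cases is sketched below. The response shape, the CONDITION_NOT_MET code, and the function name are all assumptions made for illustration; the project's real types and error codes may differ.

```typescript
// Hypothetical response shapes, not the project's real types.
type ErrorInfo = { code: string; message: string };
type ConditionResponse = { ok: true } | { ok: false; error: ErrorInfo };
type WhenResult =
  | { ok: true; data: { skipped: boolean } }
  | { ok: false; error: ErrorInfo };

// Assumed error code meaning "the predicate evaluated to false".
const CONDITION_NOT_MET = 'CONDITION_NOT_MET';

function resolveWhenCondition(response: ConditionResponse): WhenResult {
  if (response.ok) {
    return { ok: true, data: { skipped: false } };
  }
  // Only a genuine predicate miss skips the subflow.
  if (response.error.code === CONDITION_NOT_MET) {
    return { ok: true, data: { skipped: true } };
  }
  // Transport or session failures propagate so replay cannot pass falsely.
  return { ok: false, error: response.error };
}
```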

Useful? React with 👍 / 👎.

@thymikee force-pushed the codex/maestro-replay-compat branch from baaee00 to 1ea51f9 on May 22, 2026 05:53
Member Author

Code review

Overview

This reworks the Maestro replay compatibility layer: it removes the previously-mapped device/utility commands (setPermissions, setLocation, setOrientation, setAirplaneMode, startRecording/stopRecording, killApp, literal assertTrue) and adds a runtime command layer (__maestro* shims dispatched at replay time), scrollUntilVisible, percentage point taps, direction-based swipe, runFlow.when.visible/notVisible, runScript (JS execution + http.post/json/output), launchApp launch arguments + iOS-sim clearState, and an iOS "non-hittable selector tap" backdoor for RN E2E controls. The split into flow-control.ts / points.ts / run-script.ts / runtime-commands.ts / session-replay-maestro-runtime.ts is clean and the runtime tests are substantial.

Findings

Security / decisions to make consciously

S1 — runScript executes arbitrary JS + network requests at parse time, via vm (not a sandbox). (src/compat/maestro/run-script.ts, executeRunScript)

  • executeRunScript runs the referenced .js through vm.runInNewContext(...) and intentionally exposes json() and http.post. Node's docs are explicit that vm is not a security mechanism, so this should not be described as sandboxed.
  • It runs during parse/convert (it takes a MaestroParseContext and writes results into context.env synchronously), not when the step is reached during replay. Combined with runFlow: <file> (recursive parse) and repeat, merely parsing a flow can execute JS and fire http.post to arbitrary URLs before any device action — and a runScript under repeat.times: N executes N×.
  • Recommendation: (a) document honestly that runScript runs untrusted user JS with full Node + network capability and is not sandboxed; (b) consider deferring execution to replay-runtime so a dry parse/validate is side-effect-free; (c) consider an explicit opt-in flag. Path handling itself (absolute or baseDir-relative only) is fine.
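A rough sketch of recommendations (b) and (c) together, assuming parsing can emit a deferred step instead of executing the script. Every name here (DeferredRunScript, allowRunScript, parseRunScript, replayRunScript) is hypothetical and does not come from the PR.

```typescript
// Hypothetical types and flag; not the PR's API.
type DeferredRunScript = { kind: 'runScript'; file: string };

interface ReplayOptions {
  allowRunScript: boolean; // assumed explicit opt-in flag
}

// Parsing only records the step: no file I/O, no JS execution, no network.
function parseRunScript(file: string): DeferredRunScript {
  return { kind: 'runScript', file };
}

// Execution happens when replay reaches the step, and only with opt-in.
// `execute` stands in for whatever actually runs the script.
function replayRunScript(
  step: DeferredRunScript,
  options: ReplayOptions,
  execute: (file: string) => Record<string, string>,
): Record<string, string> {
  if (!options.allowRunScript) {
    throw new Error(`runScript is disabled; refusing to execute ${step.file}`);
  }
  return execute(step.file);
}
```

With this split, a dry parse or validate pass stays side-effect-free, and a runScript under repeat executes once per iteration actually replayed rather than during conversion.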

S2 — iOS non-hittable selector tap: bounds check is looser than its comment. (RunnerTests+Interaction.swift, hasTappableFrame)
hasTappableFrame accepts when appFrame.isEmpty || appFrame.intersects(frame), then taps at frame.midX/midY. A partially-intersecting element (mostly off-screen) can be tapped at a point outside the visible app frame. It's gated behind the replay-only allowNonHittableSelectorTap flag (set only by tapOn), so blast radius is limited, but consider clamping the tap point into the app frame, or requiring the center (not just any intersection) to be inside it.
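The two suggested tightenings can be sketched as plain geometry. This is TypeScript for illustration only (the real check lives in Swift), and the Rect/Point types and function names are assumptions. Option B is a variation on clamping: it taps the center of the visible overlap rather than the raw frame midpoint.

```typescript
interface Rect { x: number; y: number; width: number; height: number }
interface Point { x: number; y: number }

const center = (r: Rect): Point => ({ x: r.x + r.width / 2, y: r.y + r.height / 2 });

const contains = (r: Rect, p: Point): boolean =>
  p.x >= r.x && p.x <= r.x + r.width && p.y >= r.y && p.y <= r.y + r.height;

// Option A: accept only frames whose center lies inside the app frame.
function centerIsTappable(appFrame: Rect, frame: Rect): boolean {
  return contains(appFrame, center(frame));
}

// Option B: tap the center of the overlap between the two frames,
// or return null when there is no visible overlap at all.
function visibleTapPoint(appFrame: Rect, frame: Rect): Point | null {
  const left = Math.max(appFrame.x, frame.x);
  const top = Math.max(appFrame.y, frame.y);
  const right = Math.min(appFrame.x + appFrame.width, frame.x + frame.width);
  const bottom = Math.min(appFrame.y + appFrame.height, frame.y + frame.height);
  if (right <= left || bottom <= top) return null;
  return { x: (left + right) / 2, y: (top + bottom) / 2 };
}
```

For an app frame of 390×844 and an element at (100, 800) sized 100×200, the raw midpoint (150, 900) falls below the app frame; option A rejects the element and option B taps at (150, 822).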

Issues to verify

I1 — Android launchArgs are silently dropped. (src/core/interactors/android.ts, open)
convertLaunchApp maps Maestro arguments/launchArguments generically (no platform gate), but the Android interactor's open forwards only options?.activity; launchArgs are ignored. A cross-platform flow using launchApp.arguments on Android appears to succeed while the arguments vanish. Either wire launch args through openAndroidApp or fail loudly / document them as iOS-only (mirroring how clearState is explicitly iOS-sim-only).
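If wiring the arguments through is out of scope, the "fail loudly" option could look roughly like this. The guard name and the OpenOptions shape are hypothetical; only activity and launchArgs are taken from the description above.

```typescript
type Platform = 'ios' | 'android';

// Assumed option shape for illustration.
interface OpenOptions {
  activity?: string;
  launchArgs?: string[];
}

// Hypothetical guard: reject launch arguments on Android instead of
// silently ignoring them.
function assertLaunchArgsSupported(platform: Platform, options: OpenOptions): void {
  const count = options.launchArgs?.length ?? 0;
  if (platform === 'android' && count > 0) {
    throw new Error(
      `launchApp arguments are not supported on Android (received ${count})`,
    );
  }
}
```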

I2 — clearState messaging is self-contradictory; verify the iOS data-wipe is correct. (clearIosSimulatorAppState in src/platforms/ios/apps.ts, dispatched from dispatch.ts)
The iOS-sim path resolves simctl get_app_container ... data and calls fs.rm on every entry, then relaunches without reinstall. Worth confirming this reliably produces clean state and removes nothing simctl launch expects. Separately, the help/docs text still says the subset works "without state-reset side effects" while clearState now does reset iOS-sim state; align the wording.

I3 — pressEnter "fetch failed" recovery can report a non-submit as success. (session-replay-maestro-runtime.ts)
On a type ['\n'] failure whose message contains fetch failed, it issues a snapshot and, if that succeeds, returns { ok: true, recovered: true }. Success is inferred from transport recovery, not from any UI state change, and keys off substring-matching the error text (brittle — prefer an error code). Reasonable as a pragmatic shim, but worth a comment on the risk.

I4 — runHttpRequestSync blindly JSON.parse(result.stdout). (run-script.ts)
If the spawned node -e process times out or emits non-JSON, JSON.parse throws an opaque Unexpected token and the child's stderr is discarded. Check exitCode and include stderr in the surfaced error.
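A sketch of the suggested check, assuming the spawned process result exposes exitCode, stdout, and stderr. The ChildResult shape and the parseChildJson name are illustrative, not the PR's code.

```typescript
// Assumed result shape; exitCode is null when the child was killed (e.g. timeout).
interface ChildResult {
  exitCode: number | null;
  stdout: string;
  stderr: string;
}

function parseChildJson(result: ChildResult): unknown {
  // Surface a failed or timed-out child before attempting to parse.
  if (result.exitCode !== 0) {
    throw new Error(
      `http helper exited with code ${result.exitCode}: ${result.stderr.trim()}`,
    );
  }
  try {
    return JSON.parse(result.stdout);
  } catch {
    // Replace the opaque "Unexpected token" with actionable context.
    throw new Error(
      `http helper emitted non-JSON output: ${result.stdout.slice(0, 200)}; ` +
        `stderr: ${result.stderr.trim()}`,
    );
  }
}
```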

I5 — runFlow.when round-trip + fractional trace-step keys. (flow-control.ts sessionActionToBatchStep, runtime batchStepToSessionAction)
when-gated blocks are wrapped into batchSteps and re-expanded via invokeReplayAction. Nested __maestro* runtime commands inside a when block do appear to re-dispatch correctly, but there's no test for a runtime command nested inside when.visible (only tapOn→find). Also, the step + index/1000 trace-step scheme will collide if a when block holds ≥1000 steps; this is astronomically unlikely, but worth a comment.
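The collision is easy to see with the arithmetic written out. The sketch below assumes the key is computed exactly as step + index/1000; traceStepId is a hypothetical collision-free alternative, not something in the PR.

```typescript
// Assumed reading of the fractional scheme described above.
const traceStepKey = (step: number, index: number): number => step + index / 1000;

// Nested index 1000 of step 1 lands exactly on the key of step 2, index 0.
const collides = traceStepKey(1, 1000) === traceStepKey(2, 0); // true: both are 2

// Hypothetical alternative: a composite string key cannot overflow into
// the next step, whatever the nested index is.
const traceStepId = (step: number, index: number): string => `${step}:${index}`;
const distinct = traceStepId(1, 1000) !== traceStepId(2, 0); // true
```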

Nits

  • invokeMaestroScrollUntilVisible doesn't validate the direction positional at runtime (only the map form is validated at parse). Internal-only, low risk.
  • extractMaestroVisibleTextQuery returns null (disables fuzzy matching) for any non-pure label/text/id-equal selector — correct, but a one-line comment on why mixed selectors opt out would help.
  • tapFlags gained maestroOptional but doubleTap/longPress don't, though Maestro allows optional on those too — noting the asymmetry, not in scope.
  • Removing setPermissions/startRecording etc. breaks any external examples referencing them; make sure the supported matrix in #558 (Track Maestro flow compatibility for replay --maestro) is updated (the in-repo replay-e2e.md already is).

Test coverage & CI

Coverage of the new mapping/runtime surface is good: replay-flow.test.ts covers tapOn→__maestroTapOn (+ the allowNonHittableSelectorTap flag only on selector taps), percentage point taps, scrollUntilVisible, direction swipe, openLink appId prefixing, launchApp args + clearState, and removed-command rejections. session-replay-vars.test.ts covers scrollUntilVisible probe/scroll/timeout, fuzzy tapOn, optional native-label fallback, percentage point resolution, tapOn retry, pressEnter transport recovery, and runFlow.when.visible run/skip.

Gaps worth filling:

  • runScript: only one happy-path mapping test (output→type, exercising json() + ${output.x}). No test for http.post, none for a throwing/failing script, none asserting the parse-time/non-sandboxed behavior.
  • Android launchArgs drop (I1) — untested.
  • clearIosSimulatorAppState (I2) — untested.
  • iOS non-hittable fallback has no Swift-level test; hasTappableFrame partial-intersection (S2) is untested (only the TS handler forwarding allowNonHittableTap is asserted).

CI: not failing. All completed checks pass; the unstable/pending state at review time was due to two still-in-progress checks (a Smoke Tests job and "Analyze (swift)" CodeQL) plus the umbrella CodeQL run being neutral (informational), not a red check.

Summary

Solid, well-factored compatibility work with strong runtime tests. The item warranting an explicit maintainer decision is runScript (S1) — it runs arbitrary user JS with network access at parse time and shouldn't be called sandboxed; consider deferring to runtime and/or an opt-in flag, and document the capability honestly. The most concrete functional bug is Android launchArgs being silently dropped (I1), and the "no state-reset side effects" wording is now inconsistent with clearState (I2). None of these block the approach; they're about hardening behavior and aligning the documented contract.

Reviewed with Claude Code.


Generated by Claude Code
