
4-atari-hard: Go-Explore (exploration phase) on Montezuma's Revenge + benchmark #132

Open: dnddnjs wants to merge 1 commit into master from ai/montezuma-go-explore



dnddnjs (Contributor) commented Jun 8, 2026

Go-Explore phase 1 (exploration only) on Montezuma's Revenge: the archive plus emulator-restore paradigm, reported side by side with the PPO+RND row.
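For readers new to the paradigm, here is a minimal sketch of an archive plus save/restore loop on a toy environment. It is not the code in this PR: `ToyChain`, `explore`, the uniform cell selection, and the "higher score, then shorter trajectory" replacement rule are all illustrative assumptions.

```python
import random


class ToyChain:
    """Hypothetical deterministic stand-in for an emulator with save/restore."""

    def __init__(self, length=20):
        self.length = length
        self.pos = 0
        self.farthest = 0

    def reset(self):
        self.pos, self.farthest = 0, 0

    def clone_state(self):
        return (self.pos, self.farthest)

    def restore_state(self, state):
        self.pos, self.farthest = state

    def step(self, action):
        # Action 1 moves right, anything else moves left; positions are clamped.
        delta = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + delta))
        reward = 1 if self.pos > self.farthest else 0
        self.farthest = max(self.farthest, self.pos)
        return reward


def explore(env, iterations=200, steps_per_visit=10, repeat_prob=0.9, seed=0):
    rng = random.Random(seed)
    env.reset()
    # cell -> (emulator state, action trajectory from reset, score)
    archive = {env.pos: (env.clone_state(), [], 0)}
    for _ in range(iterations):
        # Assumption: uniform cell selection (real selection rules are weighted).
        cell = rng.choice(sorted(archive))
        state, trajectory, score = archive[cell]
        env.restore_state(state)  # "return" to the cell without replaying it
        trajectory = list(trajectory)
        action = rng.randrange(2)
        for _ in range(steps_per_visit):
            # Repeated random actions: keep the last action with high probability.
            if rng.random() > repeat_prob:
                action = rng.randrange(2)
            score += env.step(action)
            trajectory.append(action)
            known = archive.get(env.pos)
            better = (
                known is None
                or score > known[2]
                or (score == known[2] and len(trajectory) < len(known[1]))
            )
            if better:
                archive[env.pos] = (env.clone_state(), list(trajectory), score)
    return archive
```

Each archive entry keeps the full action list from reset, which is what makes a later replay check possible.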

Best end-of-episode score: 31,000 at 500M agent steps (~5.5h, Mac Studio M4 Max, 12 explorer processes, no neural network). Single seed. Replay-verified: re-executing the stored 5,336-action trajectory from reset reproduces exactly 31,000.
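The replay check can be sketched as follows, on a toy deterministic environment rather than ALE. `CountdownEnv`, `replay_score`, and `verify_demo` are hypothetical names for illustration; the PR's actual verification script may differ.

```python
class CountdownEnv:
    """Hypothetical deterministic episode that ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 100 * action
        done = self.t >= self.horizon
        return reward, done


def replay_score(env, actions):
    """Re-execute a stored action list from reset and sum the rewards."""
    env.reset()
    total, done = 0, False
    for i, action in enumerate(actions):
        if done:
            raise ValueError(f"episode ended before action {i}")
        reward, done = env.step(action)
        total += reward
    return total, done


def verify_demo(env, actions, claimed_score):
    # An end-of-episode score only counts if the replay actually terminates
    # and the summed reward matches the claim exactly.
    total, done = replay_score(env, actions)
    return done and total == claimed_score
```

The check only works because the environment is deterministic: the same actions from reset must always give the same score.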

Protocol notes (also in the README block):

  • Deterministic ALE (no sticky actions, frameskip 4, fixed seed). Restore-based exploration requires this, so the result is not comparable to the sticky-action RL rows.
  • Score = end-of-episode score of the best trajectory found by search, not an RL policy score; the paper's robustification phase is not run here.
  • Reference: the Nature paper's exploration-phase mean without domain knowledge is 24,758 at the same 2B-frame budget (500M agent steps × frameskip 4; 50+ seeds vs. our single seed). Rooms found: 24.
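The budget arithmetic behind "the same 2B-frame budget", plus a rough throughput estimate, can be checked as below. The throughput figures are approximate because the wall-clock time is only quoted as ~5.5h, and splitting the rate evenly across the 12 explorers is an assumption.

```python
AGENT_STEPS = 500_000_000
FRAMESKIP = 4
WALL_CLOCK_HOURS = 5.5  # approximate, as quoted above
EXPLORERS = 12


def emulator_frames(agent_steps, frameskip):
    # Each agent step advances the emulator by `frameskip` frames.
    return agent_steps * frameskip


def steps_per_second(agent_steps, hours):
    return agent_steps / (hours * 3600)


frames = emulator_frames(AGENT_STEPS, FRAMESKIP)
rate = steps_per_second(AGENT_STEPS, WALL_CLOCK_HOURS)
print(f"{frames:,} frames")
print(f"{rate:,.0f} agent steps/s total, {rate / EXPLORERS:,.0f} per explorer")
```

This gives 2,000,000,000 frames and roughly 25k agent steps per second in total, or about 2.1k per explorer process.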

W&B (full metrics history + gameplay video): https://wandb.ai/rlcode/rl-atari-hard-go-explore/runs/m6ox4l3m

(Single-seed diagnostic run; merge is a human decision.)

… benchmark

Go-Explore phase 1 (Ecoffet et al. 2019 / Nature 2021), no neural net:
an archive of downscaled-frame cells (11x8, 9 gray levels), emulator
state save/restore to return to frontier cells, repeated random actions
to explore from them. 12 explorer processes over raw gymnasium ALE
(envpool exposes no clone API, hence the separate env_go_explore.py).
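
A minimal sketch of a downscaled-frame cell key, assuming a grayscale frame given as a list of rows of 0..255 integers, area averaging over integer block boundaries, and even quantization into 9 levels. The function name `cell_key` and these exact rounding choices are assumptions for illustration, not taken from env_go_explore.py.

```python
def cell_key(frame, width=11, height=8, levels=9):
    """Map a grayscale frame to a hashable cell: width*height values in 0..levels-1."""
    rows, cols = len(frame), len(frame[0])
    key = []
    for r in range(height):
        r0, r1 = r * rows // height, (r + 1) * rows // height
        for c in range(width):
            c0, c1 = c * cols // width, (c + 1) * cols // width
            # Average the pixels that fall into this output cell.
            block = [frame[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            mean = sum(block) // len(block)
            # Quantize 0..255 into `levels` gray levels (0..8 by default).
            key.append(mean * levels // 256)
    return tuple(key)
```

Frames that differ only by small pixel changes map to the same key, so the archive stays small enough to index with a plain dict.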

Result: best end-of-episode score 31,000 at 500M agent steps (~5.5h on
a Mac Studio M4 Max), single seed, replay-verified (re-executing the
stored 5,336-action demo from reset reproduces the score exactly).
Deterministic protocol (no sticky actions) -- a trajectory-search
result, not an RL policy score; see the README caveat.