Skip to content

Fix path-traversal vulnerability in emergency P2P checkpoint service#3105

Closed
YuvalElbar6 wants to merge 1 commit into
google:mainfrom
YuvalElbar6:main
Closed

Fix path-traversal vulnerability in emergency P2P checkpoint service#3105
YuvalElbar6 wants to merge 1 commit into
google:mainfrom
YuvalElbar6:main

Conversation

@YuvalElbar6
Copy link
Copy Markdown
Contributor

A malicious or compromised peer on the P2P network could supply a manifest whose rel_path contained '..' segments or an absolute path, causing P2PNode.fetch_shard_from_peer() to write attacker-controlled bytes outside the staging directory (e.g. a .pth file in site-packages, yielding persistent RCE on the training host).

  • Add _safe_path_join() which joins a peer-supplied relative path onto a base directory only if the resolved result stays inside that base. Resolution goes through os.path.realpath so symlink-escape attempts are caught as well.
  • Apply the helper on both sides of the wire:
    • Client: fetch_shard_from_peer() validates every manifest entry against stage_dir and aborts the whole fetch on any unsafe entry.
    • Server: handle_download() replaces the substring '..' check with the same resolve-based containment check against self.directory.
  • Log every rejection with peer and request context.
  • Add regression tests for the helper and both call sites.

Reported via the Google OSS VRP.

@linxiulei
Copy link
Copy Markdown

Hi, thanks for your contribution!

Since Orbax runs on top of JAX, and JAX's coordination service doesn't enforce strong authentication, the Orbax runtime generally assumes it's operating in a secure network environment. In practice, if there were a "compromised peer," we would already be facing much larger systemic risks.

That said, we always welcome security enhancements. Could you please clean up the empty commits in the PR so we can get it merged?

@YuvalElbar6
Copy link
Copy Markdown
Contributor Author

Thanks for the context on the threat model — agreed it's a defense-in-depth hardening rather than a critical fix. I've cleaned up the merge commits; the PR is now a single commit on top of main.
@linxiulei

@YuvalElbar6
Copy link
Copy Markdown
Contributor Author

Hi @orbax-dev, @linxiulei this PR is now one commit can we merge it?

@YuvalElbar6 YuvalElbar6 force-pushed the main branch 2 times, most recently from 862ab75 to 3a54a83 Compare May 11, 2026 19:22
@orbax-dev
Copy link
Copy Markdown
Collaborator

Hey @YuvalElbar6, we don't have external contributions properly set up, so your commit would just get overwritten next time we sync from internal. Waiting for @linxiulei to send me an internal version of this change, at which point we can merge this PR.

Copy link
Copy Markdown

@linxiulei linxiulei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments are informative if you have no objections and they will be in the final merge.

self.enter_context(
mock.patch.object(service, '_ThreadingTCPServer', autospec=True)
)
server = service._ThreadingTCPServer.return_value
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    mock_server_cls = self.enter_context(
        mock.patch.object(service, '_ThreadingTCPServer', autospec=True)
    )
    server = mock_server_cls.return_value

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resloved


A malicious peer returns a manifest whose ``rel_path`` tries to escape the
staging directory (e.g., writing a ``.pth`` file into site-packages).
``fetch_shard_from_peer`` must abort before any ``download`` call.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    Args:
      mock_download: Mock for download.
      mock_request: Mock for request.
      unused_mock_time: Unused mock for time.
      unused_mock_move: Unused mock for move.
      unused_mock_rmtree: Unused mock for rmtree.

to be consistent with code style.

mock_request,
unused_mock_time,
unused_mock_move,
unused_mock_rmtree,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@YuvalElbar6
Copy link
Copy Markdown
Contributor Author

Yes I have no objection,
When I will get to home I will fix it

A malicious or compromised peer on the P2P network could supply a
manifest whose rel_path contained '..' segments or an absolute path,
causing P2PNode.fetch_shard_from_peer() to write attacker-controlled
bytes outside the staging directory (e.g. a .pth file in site-packages,
yielding persistent RCE on the training host).

- Add _safe_path_join() which joins a peer-supplied relative path onto
  a base directory only if the resolved result stays inside that base.
  Resolution goes through os.path.realpath so symlink-escape attempts
  are caught as well.
- Apply the helper on both sides of the wire:
  * Client: fetch_shard_from_peer() validates every manifest entry
    against stage_dir and aborts the whole fetch on any unsafe entry.
  * Server: handle_download() replaces the substring '..' check with
    the same resolve-based containment check against self.directory.
- Log every rejection with peer and request context.
- Add regression tests for the helper and both call sites.

Reported via the Google OSS VRP.
@YuvalElbar6
Copy link
Copy Markdown
Contributor Author

Hi @linxiulei I fiixed the comments

@YuvalElbar6 YuvalElbar6 requested a review from linxiulei May 12, 2026 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants