
fix: add default writer_batch_size to prevent pyarrow offset overflow #1059

Open
dieutx wants to merge 2 commits into PrimeIntellect-ai:main from dieutx:fix/pyarrow-offset-overflow

Conversation


@dieutx dieutx commented Mar 24, 2026

Summary

Closes #230.

  • Datasets with large strings (e.g. serialized test cases in coding datasets like DeepCoder's lcbv5 subset) cause pyarrow offset overflow during dataset.map() calls because the default batch size produces Arrow arrays exceeding the 2 GB limit.
  • Sets writer_batch_size=200 as a default in Environment.__init__ (propagated to all internal .map() calls via map_kwargs) and in the standalone format_dataset / load_example_dataset utilities.
  • Users can still override writer_batch_size by passing it explicitly in map_kwargs.
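For intuition on the failure mode: Arrow's default string arrays index into a single data buffer with 32-bit offsets, so the cumulative string bytes written in one batch must stay below 2**31 - 1. A back-of-the-envelope sketch (the per-row size is an illustrative assumption, not a measurement of the lcbv5 subset):

```python
# Why a smaller writer_batch_size avoids the 32-bit offset overflow in
# Arrow string arrays. The per-row size is an illustrative assumption.
INT32_MAX = 2**31 - 1      # largest offset a 32-bit Arrow string array can hold
row_bytes = 5 * 2**20      # assume ~5 MiB of serialized test cases per row
default_batch = 1000       # datasets' default writer_batch_size
small_batch = 200          # the default this PR introduces

print(default_batch * row_bytes > INT32_MAX)  # True  -> overflow risk
print(small_batch * row_bytes > INT32_MAX)    # False -> fits within the limit
```

Rows much larger than this assumption would need an even smaller batch size, which is why the override path matters.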

Changes

  • verifiers/envs/environment.py -- copy map_kwargs to avoid mutating the shared mutable default, then setdefault("writer_batch_size", 200).
  • verifiers/utils/data_utils.py -- same default in format_dataset(), and explicit writer_batch_size=200 in load_example_dataset()'s .map() call.

Test plan

  • Existing tests pass (uv run pytest tests/test_env_group.py tests/test_message_utils_multimodal.py)
  • uv run ruff check and uv run ruff format clean

Note

Medium Risk
Changes default HuggingFace Dataset.map() batching across environment dataset formatting and example dataset utilities, which could affect performance/memory and any callers relying on previous implicit defaults.

Overview
Prevents pyarrow ArrowInvalid: offset overflow on datasets with very large string fields by defaulting HuggingFace .map() writer_batch_size to 200.

Environment now copies map_kwargs (avoiding mutation of the default dict) and applies the default via setdefault, while format_dataset() and load_example_dataset() apply the same default (including an explicit writer_batch_size=200 on the example preprocessing .map()). Documentation adds an FAQ explaining the error and how to override writer_batch_size when needed.

Written by Cursor Bugbot for commit e08dc65.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread: verifiers/envs/environment.py
dieutx added 2 commits March 24, 2026 22:24
…PrimeIntellect-ai#230)

Datasets with large strings (e.g. serialized test cases in coding
datasets like DeepCoder) cause pyarrow offset overflow during
dataset.map() calls. Set writer_batch_size=200 by default so pyarrow
writes smaller batches and stays within the 2GB Arrow array limit.

Addresses review feedback: documents the new writer_batch_size=200
default and how users can override it via map_kwargs.
dieutx force-pushed the fix/pyarrow-offset-overflow branch from acc51fe to e08dc65 on March 24, 2026 at 15:24


Development

Successfully merging this pull request may close these issues.

pyarrow offset overflow for datasets with large strings
