
fix: add default writer_batch_size to prevent pyarrow offset overflow #1059

Open
dieutx wants to merge 2 commits into PrimeIntellect-ai:main from dieutx:fix/pyarrow-offset-overflow

Conversation


@dieutx dieutx commented Mar 24, 2026

Summary

Closes #230.

  • Datasets with large strings (e.g. serialized test cases in coding datasets like DeepCoder's lcbv5 subset) cause pyarrow offset overflow during dataset.map() calls because the default batch size produces Arrow arrays exceeding the 2 GB limit.
  • Sets writer_batch_size=200 as a default in Environment.__init__ (propagated to all internal .map() calls via map_kwargs) and in the standalone format_dataset / load_example_dataset utilities.
  • Users can still override writer_batch_size by passing it explicitly in map_kwargs.
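For intuition on the failure mode: Arrow's default string arrays index into a single data buffer with 32-bit offsets, so the cumulative string bytes written in one batch must stay below 2**31 - 1. A back-of-the-envelope sketch (the per-row size is an illustrative assumption, not a measurement of the lcbv5 subset):

```python
# Why a smaller writer_batch_size avoids the 32-bit offset overflow in
# Arrow string arrays. The per-row size is an illustrative assumption.
INT32_MAX = 2**31 - 1      # largest offset a 32-bit Arrow string array can hold
row_bytes = 5 * 2**20      # assume ~5 MiB of serialized test cases per row
default_batch = 1000       # datasets' default writer_batch_size
small_batch = 200          # the default this PR introduces

print(default_batch * row_bytes > INT32_MAX)  # True  -> overflow risk
print(small_batch * row_bytes > INT32_MAX)    # False -> fits within the limit
```

Rows much larger than this assumption would need an even smaller batch size, which is why the override path matters.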

Changes

  • verifiers/envs/environment.py -- copy map_kwargs to avoid mutating the shared mutable default, then setdefault("writer_batch_size", 200).
  • verifiers/utils/data_utils.py -- same default in format_dataset(), and explicit writer_batch_size=200 in load_example_dataset()'s .map() call.

Test plan

  • Existing tests pass (uv run pytest tests/test_env_group.py tests/test_message_utils_multimodal.py)
  • uv run ruff check and uv run ruff format clean

Note

Medium Risk
Changes default HuggingFace Dataset.map() batching across environment dataset formatting and example dataset utilities, which could affect performance/memory and any callers relying on previous implicit defaults.

Overview
Prevents pyarrow ArrowInvalid: offset overflow on datasets with very large string fields by defaulting HuggingFace .map() writer_batch_size to 200.

Environment now copies map_kwargs (avoiding mutation of the default dict) and applies the default via setdefault, while format_dataset() and load_example_dataset() apply the same default (including an explicit writer_batch_size=200 on the example preprocessing .map()). Documentation adds an FAQ explaining the error and how to override writer_batch_size when needed.

Written by Cursor Bugbot for commit e08dc65.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread: verifiers/envs/environment.py
dieutx added 2 commits March 24, 2026 22:24
…PrimeIntellect-ai#230)

Datasets with large strings (e.g. serialized test cases in coding
datasets like DeepCoder) cause pyarrow offset overflow during
dataset.map() calls. Set writer_batch_size=200 by default so pyarrow
writes smaller batches and stays within the 2GB Arrow array limit.

Addresses review feedback: documents the new writer_batch_size=200
default and how users can override it via map_kwargs.
dieutx force-pushed the fix/pyarrow-offset-overflow branch from acc51fe to e08dc65 on March 24, 2026 at 15:24


Development

Successfully merging this pull request may close these issues.

pyarrow offset overflow for datasets with large strings
