extract_answer: prefer boxed{N} extraction, fall back to legacy tags #4150

Open
py4 wants to merge 1 commit into main from pr/extract-answer-boxed

py4 commented Jun 11, 2026

Collaborator

Summary

extract_answer (used by the built-in check_numbers reward) returned the raw <answer> content. Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit the final answer as \boxed{N}, and the maxtext GSM8K chat template itself asks for {solution_start_token}\boxed{}{solution_end_token}. The extractor therefore returned the literal string \boxed{42}, which math_verify cannot match against a bare numeric gold like 42. The result is ~0% accuracy on Qwen3/GSM8K even when the model's numeric answer is correct: a silent scoring failure rather than a model failure.

This PR makes the extractor consistent with maxtext's own chat template.

Strategy (priority order)

  1. If <answer>...</answer> is present, scope to the last block's content; otherwise use the full response.
  2. Inside the scope, extract the last \boxed{N} via a brace-balanced scan (handles nested LaTeX like \boxed{\frac{1}{2}}), with a permissive regex fallback.
  3. If no \boxed is found, fall back to the legacy {solution_start_token}...{solution_end_token} regex, so recipes that emit plain-text answers are unaffected.

Step 3 is what keeps this backward compatible.
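For reviewers who want to see what step 2 involves, here is a minimal sketch of a brace-balanced scan for the last \boxed{...}. It is not the code in this PR: the name `extract_last_boxed` is hypothetical, the permissive regex fallback is omitted, and retrying an earlier \boxed when the last one is unbalanced is an assumption of this sketch rather than documented behavior.

```python
from typing import Optional

BOXED_OPEN = "\\boxed{"


def extract_last_boxed(text: str) -> Optional[str]:
  # Hypothetical helper: return the content of the last balanced \boxed{...}.
  start = text.rfind(BOXED_OPEN)
  while start != -1:
    content_start = start + len(BOXED_OPEN)
    depth = 1  # we are already inside the opening brace of \boxed{
    pos = content_start
    while pos < len(text) and depth > 0:
      if text[pos] == "{":
        depth += 1
      elif text[pos] == "}":
        depth -= 1
      pos += 1
    if depth == 0:
      # pos is one past the matching closing brace.
      return text[content_start : pos - 1].strip()
    # Unbalanced (e.g. truncated output): try the previous \boxed{ instead.
    # This retry is an assumption of the sketch.
    start = text.rfind(BOXED_OPEN, 0, start)
  return None


print(extract_last_boxed(r"\boxed{\frac{1}{2}}"))      # \frac{1}{2}
print(extract_last_boxed(r"\boxed{1} so \boxed{ -7 }"))  # -7
print(extract_last_boxed("no boxed answer here"))       # None
```

Counting braces rather than using a single regex is what lets nested LaTeX such as \frac{1}{2} survive intact; a non-greedy `\{(.*?)\}` pattern would stop at the first closing brace.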

Tests

tests/post_training/unit/extract_answer_test.py (10 cases, cpu_only + post_training):

Boxed extraction:

  • inside <answer> tags, without tags, nested LaTeX, multiple boxed (last wins), whitespace stripping, negatives, and <answer>-tag scoping over a \boxed that appears in <reasoning>.

Legacy fallback (no \boxed):

  • plain-text answer inside <answer> tags still extracts; last-answer-wins preserved; no-answer returns FALLBACK_ANSWER.

All 10 executed and passed against the real function (Ran 10 tests ... OK).

Files

| File | Change |
| --- | --- |
| src/maxtext/trainers/post_train/rl/utils_rl.py | extract_answer: boxed extraction + legacy fallback (+40/-4) |
| tests/post_training/unit/extract_answer_test.py | new, 10 cases |

Checklist

  • Pyink-clean (--pyink-indentation=2 --line-length=122)
  • Backward compatible: legacy plain-text answers still extract via the tier-3 fallback
  • No effect on non-RL paths
  • Unit test covering boxed extraction + legacy fallback, verified passing

codecov Bot commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 82.60870% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/maxtext/trainers/post_train/rl/utils_rl.py | 82.60% | 2 Missing and 2 partials ⚠️ |

py4 force-pushed the pr/extract-answer-boxed branch from 906e76f to df2aecf, June 15, 2026 18:14
Comment thread src/maxtext/trainers/post_train/rl/utils_rl.py Outdated
Modern reasoning models (Qwen3, DeepSeek-R1, etc.) emit `\boxed{N}`
inside `<answer>...</answer>` (or with no answer tags at all). The
legacy regex returned the raw `<answer>` content (e.g. `\boxed{42}`
as a string), which math_verify cannot match against a bare numeric
gold like "42". Result: ~0% accuracy on Qwen3/GSM8K even when the
model's numeric answer is correct.

New strategy (priority order):
  1. If a `{solution_start_token}...{solution_end_token}` block is
     present (default `<answer>...</answer>`), use the last block's
     content as the search scope; otherwise use the full response.
  2. Inside the scope, extract the last `\boxed{N}` via brace-balanced
     scan + permissive regex fallback.
  3. If no `\boxed` is found, fall back to the same configured
     solution-tag regex (backward-compat for recipes that emit
     plain-text answers).

Both the scoping (step 1) and the plain-text fallback (step 3) reuse
get_answer_fallback_regex, so the solution tags have a single source
of truth in solution_start_token / solution_end_token.
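
The single-source-of-truth idea for steps 1 and 3 can be sketched as follows. This is an illustration only: `build_solution_regex`, `last_solution_block`, and `search_scope` are hypothetical stand-ins, and the actual signature and pattern of get_answer_fallback_regex are not shown here, so nothing below should be read as its real implementation.

```python
import re
from typing import Optional


def build_solution_regex(start_token: str, end_token: str) -> "re.Pattern[str]":
  # Hypothetical stand-in for a regex derived from the configured tokens.
  # re.escape keeps tokens with regex metacharacters literal.
  return re.compile(re.escape(start_token) + r"(.*?)" + re.escape(end_token), re.DOTALL)


def last_solution_block(response: str, pattern: "re.Pattern[str]") -> Optional[str]:
  # Step 3 (plain-text fallback): content of the last tagged block, or None.
  blocks = pattern.findall(response)
  return blocks[-1].strip() if blocks else None


def search_scope(response: str, pattern: "re.Pattern[str]") -> str:
  # Step 1 (scoping): last tagged block if present, else the full response.
  block = last_solution_block(response, pattern)
  return block if block is not None else response


pattern = build_solution_regex("<answer>", "</answer>")
response = "<reasoning>\\boxed{1}</reasoning><answer>\\boxed{2}</answer>"
print(search_scope(response, pattern))  # \boxed{2}
```

Because both functions take the same compiled pattern, changing the tokens in one place changes the scoping and the fallback together.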
py4 force-pushed the pr/extract-answer-boxed branch from df2aecf to cddd7dc, June 16, 2026 19:09