FormGym

FormGym is a benchmark for evaluating language model agents on end-to-end form completion. Given an unfilled form image and source persona data, agents must place correct text values into the appropriate fields using an editor API.

We find that field localization is the primary bottleneck for current models and introduce FieldFinder, a fine-tuned Florence-2 tool that enables zero-shot VLAs to accurately locate input fields, improving accuracy from ≤3% (baseline) to 23% (Claude + FieldFinder).

Paper: FormGym: Doing Paperwork with Agents (EACL 2026)

Datasets

FormGym includes four datasets spanning scanned and digital documents:

| Dataset | Domain | Forms (train/test) | Fields (train/test) | Language | Source |
| --- | --- | --- | --- | --- | --- |
| Auto Loans (AL) | Financial | - / 10 | - / 886 | English | Manually annotated |
| FUNSD | Tobacco industry | 155 / 39 | 2,246 / 577 | English | Scanned documents |
| XFUND | Common Crawl | 1,112 / 100 | 19,559 / 1,950 | 7 languages | Scanned documents |
| Form-NLU | Australian financial | 442 / 66 | 3,661 / 476 | English | Digital filings |

Task Overview

Action space:

| Action | Parameters | Description |
| --- | --- | --- |
| `PlaceText` | `cx, cy, value` | Place text at normalized coordinates (0–1) |
| `DeleteText` | `x, y` | Remove text at a location |
| `SignOrInitial` | `x, y, value` | Add a signature or initials |
| `Terminate` | - | End the episode |

With FieldFinder enabled (--study_condition ours), coordinate-based actions are replaced with field-name-based actions (PlaceWithLocalizer, SignOrInitialWithLocalizer).
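
The normalized coordinates taken by `PlaceText` can be mapped to pixel positions as in the sketch below. This is an illustration only, not the repository's implementation: the dataclass mirrors the `cx, cy, value` parameters from the table, while `to_pixels` and the example image size are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceText:
    """Hypothetical stand-in for the PlaceText action (cx, cy, value)."""
    cx: float  # normalized horizontal center, 0-1
    cy: float  # normalized vertical center, 0-1
    value: str


def to_pixels(action: PlaceText, width: int, height: int) -> tuple[int, int]:
    """Map normalized (cx, cy) to integer pixel coordinates on a width x height image."""
    if not (0.0 <= action.cx <= 1.0 and 0.0 <= action.cy <= 1.0):
        raise ValueError("coordinates must be normalized to the range 0-1")
    return round(action.cx * width), round(action.cy * height)


if __name__ == "__main__":
    action = PlaceText(cx=0.25, cy=0.5, value="Jane Doe")
    print(to_pixels(action, width=1000, height=1400))
```

Keeping coordinates normalized means the same action is valid regardless of the resolution at which a form image is rendered.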

Evaluation modes:

| Setting | `--task` | Description |
| --- | --- | --- |
| One-shot | `oneshot` | Agent places all text in a single turn |
| Iterative | `iterative` | Agent gets up to `--max_turns` rounds with feedback |
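
The difference between the two settings can be pictured as a loop over turns. The sketch below is a simplified model under stated assumptions, not the code in `tasks.py`: the agent is a plain callable, actions are strings, and the feedback message format is made up for illustration.

```python
from typing import Callable

# Hypothetical agent interface: feedback from the previous turn -> list of action names.
Agent = Callable[[str], list[str]]


def run_episode(agent: Agent, task: str, max_turns: int = 3) -> list[list[str]]:
    """Run one episode: 'oneshot' allows a single turn, 'iterative' up to max_turns."""
    if task not in ("oneshot", "iterative"):
        raise ValueError(f"unknown task: {task}")
    turns = 1 if task == "oneshot" else max_turns
    history: list[list[str]] = []
    feedback = ""
    for turn in range(turns):
        actions = agent(feedback)
        history.append(actions)
        if "Terminate" in actions:  # the agent chose to end the episode early
            break
        # Assumed feedback format; the benchmark's real feedback is not described here.
        feedback = f"turn {turn + 1}: applied {len(actions)} action(s)"
    return history


if __name__ == "__main__":
    scripted = lambda feedback: ["PlaceText"]
    print(len(run_episode(scripted, "oneshot")))
    print(len(run_episode(scripted, "iterative", max_turns=5)))
```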

Profile source (--profile_source):

| Source | Description |
| --- | --- |
| `text` | User information as plain text (default) |
| `image` | Information from a completed source document image (Auto Loans only) |

Study conditions (--study_condition):

| Condition | Description |
| --- | --- |
| `baseline` | Coordinate-based actions (`PlaceText`) |
| `ours` | FieldFinder-based actions (`PlaceWithLocalizer`) |

Metric: Field accuracy, the percentage of fields for which the agent places the correct value and the center of the placed text falls within the field's bounding box.
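
A minimal sketch of how such a metric could be computed is shown below. It is not the benchmark's scoring code: the `Field` and `Placement` types are hypothetical, and exact string matching after trimming whitespace is an assumption about how "correct value" is decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Ground-truth field: bounding box in normalized coordinates plus expected value."""
    x0: float
    y0: float
    x1: float
    y1: float
    expected: str


@dataclass(frozen=True)
class Placement:
    """Text placed by the agent, located by the center of the text."""
    cx: float
    cy: float
    value: str


def field_accuracy(fields: list[Field], placements: list[Placement]) -> float:
    """Return the percentage of fields that contain a correctly valued placement."""
    if not fields:
        return 0.0
    correct = 0
    for f in fields:
        for p in placements:
            inside = f.x0 <= p.cx <= f.x1 and f.y0 <= p.cy <= f.y1
            # Assumed matching rule: exact match after trimming whitespace.
            if inside and p.value.strip() == f.expected.strip():
                correct += 1
                break  # count each field at most once
    return 100.0 * correct / len(fields)
```

Under this definition a correct value placed outside the box scores nothing, which is why localization errors dominate the results.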

Setup

Installation

# Clone the repo
git clone https://github.com/mtoles/form-filler.git
cd form-filler

# Install dependencies (ideally inside a fresh virtual environment)
pip install -r requirements.txt

# Preprocess PDFs to PNGs and generate target files
python preprocess/preprocess.py

Preparing converted datasets (FUNSD, Form-NLU, XFUND)

For FUNSD, Form-NLU, and XFUND, additional preprocessing is required:

python preprocess/process_form-nlu.py

These datasets require processed annotation files in tool/dataset/processed/.

FieldFinder setup (optional, for `--study_condition ours`)

FieldFinder requires two things: content-aware-fill built from source into tool/content-aware-fill/, and a fine-tuned Florence-2 checkpoint placed in tool/checkpoints/.

Supported Models

API models

| `--model_type` | `--model_name` | Provider |
| --- | --- | --- |
| `gpt` | `gpt-5` | OpenAI |
| `gpt` | `gpt-5-mini` | OpenAI |
| `anthropic` | `claude-sonnet-4-20250514` | Anthropic |

Set the OPENAI_API_KEY or ANTHROPIC_API_KEY environment variable for the provider you use.
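
A quick pre-flight check can catch a missing key before a run starts. The helper below is a hypothetical convenience, not part of the repository; it only assumes the mapping from `--model_type` to environment variable implied by the table above.

```python
import os
from typing import Mapping, Optional

# Environment variable expected for each API backend, per the table above.
REQUIRED_KEYS = {"gpt": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def missing_api_key(model_type: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the missing key for model_type, or None if nothing is missing."""
    env = os.environ if env is None else env
    key = REQUIRED_KEYS.get(model_type)  # other model types are assumed to need no API key
    if key is not None and not env.get(key):
        return key
    return None


if __name__ == "__main__":
    missing = missing_api_key("gpt")
    if missing:
        print(f"Please set {missing} before running main.py")
```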

Local models (vLLM)

| `--model_type` | `--model_name` | Notes |
| --- | --- | --- |
| `hf` | `aria` | Aria 25B |
| `hf` | `llava` | Llava 7B |
| `hf` | `molmo` | Molmo 7B |
| `hf` | `qwen_vl` | Qwen-VL |
| `hf` | `deepseek_vl2` | DeepSeek-VL2 |
| `hf` | `gemma3` | Gemma 3 |
| `hf` | `mllama` | MLlama |

Local models require a CUDA GPU and vLLM. Use --download_dir to set the HF model cache location.

Usage

API models

# GPT-5 on Auto Loans (one-shot, baseline)
python main.py \
  --model_type gpt \
  --model_name gpt-5 \
  --task oneshot \
  --domain al \
  --chosen_file_ids al_0_0 \
  --study_condition baseline \
  --profile_source text

# Claude on Auto Loans (iterative, multiple forms)
python main.py \
  --model_type anthropic \
  --model_name claude-sonnet-4-20250514 \
  --task iterative \
  --domain al \
  --chosen_file_ids al_0_0 al_1_0 al_2_0 \
  --study_condition baseline \
  --profile_source text \
  --max_turns 5

# GPT-5 on FUNSD
python main.py \
  --model_type gpt \
  --model_name gpt-5 \
  --task oneshot \
  --domain funsd \
  --chosen_file_ids 82092117 \
  --study_condition baseline

Local models (vLLM)

# Molmo 7B on Auto Loans
python main.py \
  --model_type hf \
  --model_name molmo \
  --task oneshot \
  --domain al \
  --chosen_file_ids al_0_0 \
  --study_condition baseline \
  --download_dir /path/to/hf_cache

With FieldFinder (experimental condition)

python main.py \
  --model_type gpt \
  --model_name gpt-5 \
  --task oneshot \
  --domain al \
  --chosen_file_ids al_0_0 \
  --study_condition ours

Document transfer (image profile source, Auto Loans only)

python main.py \
  --model_type gpt \
  --model_name gpt-5 \
  --task oneshot \
  --domain al \
  --chosen_file_ids al_1_0 \
  --study_condition baseline \
  --profile_source image \
  --source_doc_id al_0_0
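
To run the same form under several conditions, the commands above can be assembled programmatically. The sketch below is a hypothetical wrapper (the `build_command` helper is not part of the repository); it uses only flags that appear in the examples above and prints the commands rather than executing them.

```python
import shlex
import sys


def build_command(model_type, model_name, task, domain, file_ids,
                  study_condition="baseline", extra=None):
    """Assemble the argv list for main.py from the flags shown in the examples above."""
    cmd = [sys.executable, "main.py",
           "--model_type", model_type,
           "--model_name", model_name,
           "--task", task,
           "--domain", domain,
           "--chosen_file_ids", *file_ids,
           "--study_condition", study_condition]
    cmd += list(extra or [])  # e.g. ["--max_turns", "5"]
    return cmd


if __name__ == "__main__":
    for condition in ("baseline", "ours"):
        cmd = build_command("gpt", "gpt-5", "oneshot", "al", ["al_0_0"], condition)
        # Pass cmd to subprocess.run(cmd, check=True) to actually launch the run.
        print(shlex.join(cmd))
```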

CLI Reference

| Argument | Type | Description |
| --- | --- | --- |
| `--model_type` | str | Model backend: `gpt`, `anthropic`, `hf`, `scripted` |
| `--model_name` | str | Model identifier (e.g., `gpt-5`, `claude-sonnet-4-20250514`) |
| `--task` | str | `oneshot` or `iterative` |
| `--domain` | str | `al`, `cr`, `funsd`, `form-nlu`, `xfund` |
| `--chosen_file_ids` | str+ | Space-separated list of document IDs |
| `--study_condition` | str | `baseline` (coordinates) or `ours` (FieldFinder) |
| `--profile_source` | str | `text` (default) or `image` (AL only) |
| `--max_turns` | int | Max rounds for iterative mode (default: 3) |
| `--user_idx` | int | User profile index (default: 0) |
| `--source_doc_id` | str | Source document for image profile transfer |
| `--gt_coordinates` | flag | Pass ground-truth coordinates to the model |
| `--draw_grid` | flag | Overlay coordinate grid on form images |
| `--download_dir` | str | Download directory for HF models |
| `--use_short_dataset` | bool | Use short dataset splits (default: True) |
| `--note` | str | Note saved with results |
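
For readers who want to see how these arguments fit together, the table can be mirrored with `argparse`. This is an illustrative reconstruction from the table, not the parser in `main.py`; defaults not listed above are left unset, and `--use_short_dataset` is omitted because the table does not say how its boolean value is parsed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (sketch, not the real CLI)."""
    parser = argparse.ArgumentParser(description="FormGym runner (illustrative sketch)")
    parser.add_argument("--model_type", choices=["gpt", "anthropic", "hf", "scripted"])
    parser.add_argument("--model_name")
    parser.add_argument("--task", choices=["oneshot", "iterative"])
    parser.add_argument("--domain", choices=["al", "cr", "funsd", "form-nlu", "xfund"])
    parser.add_argument("--chosen_file_ids", nargs="+")  # one or more document IDs
    parser.add_argument("--study_condition", choices=["baseline", "ours"])
    parser.add_argument("--profile_source", choices=["text", "image"], default="text")
    parser.add_argument("--max_turns", type=int, default=3)
    parser.add_argument("--user_idx", type=int, default=0)
    parser.add_argument("--source_doc_id")
    parser.add_argument("--gt_coordinates", action="store_true")
    parser.add_argument("--draw_grid", action="store_true")
    parser.add_argument("--download_dir")
    parser.add_argument("--note")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model_type", "anthropic", "--task", "iterative",
         "--chosen_file_ids", "al_0_0", "al_1_0", "--max_turns", "5"]
    )
    print(args.chosen_file_ids, args.max_turns, args.profile_source)
```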

Output

Results are saved to results/<model_name>/<domain>/<task>/<study_condition>/<profile_source>/u<user_idx>/<date>/<time>/:

results.md — summary metrics (average accuracy, cost, token usage)
history.jsonl — full run data per document
images/ — visualizations of form state at each turn
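
Because history.jsonl is a JSON Lines file, a run can be loaded for analysis with a few lines of Python. The reader below is a generic sketch: the `read_history` helper is hypothetical, and since the keys inside each record are not documented here, it makes no assumption about them.

```python
import json
from pathlib import Path


def read_history(run_dir):
    """Load every JSON object from history.jsonl in a results directory."""
    records = []
    with open(Path(run_dir) / "history.jsonl", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    import sys

    # Usage: python read_history.py results/<model_name>/.../<date>/<time>
    for record in read_history(sys.argv[1]):
        print(sorted(record))  # show which keys each record carries
```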

FieldFinder Training

To train FieldFinder from scratch:

Download the FUNSD dataset to tool/dataset/FUNSD from https://guillaumejaume.github.io/FUNSD/download/
Build content-aware-fill from source to tool/content-aware-fill

Preprocess training data:

python tool/process_annotations_and_images.py

Train:
```
python tool/train_florence.py
```

Citation

@inproceedings{toles2025formgym,
  title={FormGym: Doing Paperwork with Agents},
  author={Toles, Michael and others},
  booktitle={Proceedings of EACL},
  year={2026}
}
