Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

Code and datasets for Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [paper].

This code builds on the code from What's "up" with vision-language models? Investigating their struggle with spatial reasoning [paper][code].

Datasets

The code to load and evaluate each dataset is in dataset_zoo/aro_datasets.py. The question-answering data is in prompt/.

Method: ScalingVis and AdaptVis

Setting up the environment

git clone https://github.com/shiqichen17/AdaptVis.git
cd AdaptVis
mkdir data
mkdir output
pip install -r requirements.txt

Downloading the data

All data lives in whatsup_vlms/data, which is also where models are stored as they are downloaded.

For all datasets, setting --download=True (when running python main_aro.py or when instantiating the dataset directly) downloads the data JSONs and images if the files don't already exist.
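The "download only if the files don't already exist" behavior can be pictured with a minimal sketch. This is not the repository's implementation: `ensure_downloaded` and `fake_fetch` are hypothetical names introduced here for illustration, and the real download logic may differ.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_downloaded(target: Path, fetch: Callable[[Path], None]) -> bool:
    """Call fetch(target) only when target is missing.

    Returns True if a download was triggered, False if the file was already there.
    Hypothetical helper, not part of this repository.
    """
    if target.exists():
        return False
    # Create the parent directory before handing the path to the downloader.
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(target)
    return True


def fake_fetch(path: Path) -> None:
    # Stand-in for a real download: writes an empty JSON placeholder.
    path.write_text("{}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        annotations = Path(tmp) / "data" / "annotations.json"
        print(ensure_downloaded(annotations, fake_fetch))  # file missing, so it is fetched
        print(ensure_downloaded(annotations, fake_fetch))  # file present, so nothing happens
```

Because the check is on file existence, rerunning with --download=True should be harmless once the data is in place.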

You can also download the data directly from this Google Drive link. Alternatively, you can download from HuggingFace datasets here.

Running experiments with scaling_vis and adapt_vis

You can quickly run an example with:

bash run.sh

Arguments

All parameter choices are indicated in run.sh.

Argument	Example	Description
`dataset`	`Controlled_Images_A`	The dataset to evaluate. Options include `Controlled_Images_A`, `Controlled_Images_B`, and others.
`model`	`llava1.5`	The model to use.
`method`	`scaling_vis`	The evaluation method. Choose `"scaling_vis"` or `"adapt_vis"`.
`weight`	`1.2`	Coefficient for ScalingVis. Choose from `[0, 0.5, 0.8, 1.2, 1.5, 2.0]`.
`weight1`	`0.5`	First coefficient for AdaptVis. Choose from `[0.5, 0.8]`.
`weight2`	`1.2`	Second coefficient for AdaptVis. Choose from `[1.2, 1.5, 2.0]`.
`threshold`	`0.3`	Threshold for AdaptVis.
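To sweep over the coefficient values listed in the table, a small script can generate one command per configuration. The sketch below is illustrative only: `build_command` and `sweep` are hypothetical helpers, and the `--name value` flag spelling is an assumption based on the argument names above. Check run.sh for the exact invocation.

```python
import itertools
import shlex

# Value sets copied from the argument table.
SCALING_WEIGHTS = [0, 0.5, 0.8, 1.2, 1.5, 2.0]
ADAPT_WEIGHT1 = [0.5, 0.8]
ADAPT_WEIGHT2 = [1.2, 1.5, 2.0]


def build_command(dataset, model, method, **params):
    """Return a shell command string for main_aro.py (flag spelling is assumed)."""
    args = {"dataset": dataset, "model": model, "method": method, **params}
    parts = ["python", "main_aro.py"]
    for name, value in args.items():
        parts += [f"--{name}", str(value)]
    return shlex.join(parts)


def sweep(dataset="Controlled_Images_A", model="llava1.5", threshold=0.3):
    """List commands for every ScalingVis weight and every AdaptVis weight pair."""
    commands = [
        build_command(dataset, model, "scaling_vis", weight=w)
        for w in SCALING_WEIGHTS
    ]
    for w1, w2 in itertools.product(ADAPT_WEIGHT1, ADAPT_WEIGHT2):
        commands.append(
            build_command(
                dataset, model, "adapt_vis",
                weight1=w1, weight2=w2, threshold=threshold,
            )
        )
    return commands


if __name__ == "__main__":
    for command in sweep():
        print(command)
```

With the value sets above this yields 6 ScalingVis runs and 6 AdaptVis runs (2 values of `weight1` times 3 values of `weight2`) at a fixed threshold.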

Citation

If you use this code or data, please consider citing our paper:

@misc{chen2025spatialreasoninghardvlms,
      title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas}, 
      author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
      year={2025},
      eprint={2503.01773},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.01773}, 
}
