🌟🔥 [ICLR 2026] From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation
- [2026.1.25] EG-GRPO has been accepted by ICLR 2026! 🎉🎉
Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. In this project, we present a systematic entropy-based analysis and derive several insightful findings. Based on these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates the optimization budget according to token-level uncertainty.
Our analysis reveals three main findings:
- Exploration vs. Exploitation: CoT increases the diversity of the generative space, whereas RL progressively focuses generation toward high-reward regions.
- Entropy–Reward Coupling: The final reward exhibits a strong negative correlation with both the mean and variance of image-token entropy.
- CoT Entropy Controls Image Quality: Lower-entropy textual CoTs lead to more stable and higher-quality image generation.
Motivated by these findings, EG-GRPO applies a bonus to high-entropy tokens to encourage structured exploration and excludes low-entropy tokens from reward-driven updates to preserve stability. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
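For intuition, here is a minimal sketch of the token-level shaping this implies. It is not the repository's training code: the percentile thresholds mirror the hyperparameters noted in the training section below, while `bonus` and the function names are illustrative placeholders.

```python
import torch

def entropy_guided_advantage(entropy, advantage, high_q=0.5, low_q=0.2, bonus=0.01):
    """Reshape per-token advantages by entropy (illustrative sketch).

    entropy:   (seq_len,) per-token predictive entropy of the policy.
    advantage: (seq_len,) group-relative advantage (e.g., from GRPO).
    Tokens above the `high_q` entropy percentile receive an exploration
    bonus; tokens below the `low_q` percentile are masked out of the update.
    """
    high_thr = torch.quantile(entropy, high_q)  # 50th percentile: start of high-entropy region
    low_thr = torch.quantile(entropy, low_q)    # 20th percentile: end of low-entropy region

    shaped = advantage.clone()
    shaped[entropy >= high_thr] += bonus        # encourage structured exploration
    shaped[entropy <= low_thr] = 0.0            # exclude stable tokens from reward-driven updates
    return shaped
```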
Clone the repository:
```bash
git clone git@github.com:minebetter/EG-GRPO.git
```
Create a conda environment:
```bash
conda create -n eg-grpo python=3.10
conda activate eg-grpo
```
Please follow the official instructions here to install both PyTorch and TorchVision dependencies.
Install additional dependencies:
```bash
cd src
pip install -r requirements.txt
```
Note: The versions specified in requirements.txt are recommended but not mandatory.
Make sure to install from our repo; we have made some necessary modifications to support training with ZeRO-3.
Install GroundingDINO if you want to use the Object Detector reward:
```bash
cd src/eg-grpo/src/utils/GroundingDINO
pip install -e .
```
Install LLaVA if you want to use the ORM reward:
```bash
cd src/eg-grpo/src/utils/LLaVA-NeXT
pip install -e ".[train]"
```
Install hpsv2 if you want to use the HPS reward:
```bash
cd src/eg-grpo/src/utils/HPSv2
pip install -e .
```
Please download the reward models you need for training.
```bash
mkdir reward_weight
cd reward_weight
```
- Download the HPS checkpoint from this link by
```bash
wget https://huggingface.co/xswu/HPSv2/resolve/main/HPS_v2.1_compressed.pt
```
- Download the GIT checkpoint from this link by
```bash
huggingface-cli download microsoft/git-large-vqav2 --repo-type model --local-dir git-large-vqav2
```
- Download the GroundingDINO checkpoint from this link by
```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```
- Download the ORM checkpoint from this link by
```bash
huggingface-cli download CaraJ/ORM-T2I-R1 --repo-type model --local-dir ORM-T2I-R1
```
We compare entropy and reward across three settings: Janus-Pro, Janus-Pro + CoT (no training), and T2I-R1.
Since Janus-Pro does not include a text-level CoT component, use the following command to generate images and record the corresponding entropy and reward statistics.
```bash
cd src/eg-grpo/src/analysis
python batch_inference_janus_analysis.py \
    --model_path YOUR_MODEL_CKPT \
    --data_path test_data.txt \
    --save_dir YOUR_OUTPUT_IMAGE_PATH \
    --mark_dir YOUR_OUTPUT_DATA_PATH \
    ...
```
For Janus-Pro + CoT (no training) and T2I-R1, use the command below to run the analysis experiments.
```bash
cd src/eg-grpo/src/analysis
python batch_inference_analysis.py \
    --model_path YOUR_MODEL_CKPT \
    --data_path test_data.txt \
    --reasoning_prompt_path YOUR_REASONING_PROMPT_PATH \
    --save_dir YOUR_OUTPUT_IMAGE_PATH \
    --mark_dir YOUR_OUTPUT_DATA_PATH \
    ...
```
Notes:
- Observation: When analyzing the relationship between entropy and reward, it is important to control all other variables except the one of interest. For example, when studying how reward correlates with text entropy, you should keep the image-entropy standard deviation low to minimize confounding effects from the model’s image-generation alignment capability.
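For concreteness, here is a minimal sketch of the statistics this analysis tracks, assuming access to the per-step logits of the generated image tokens (the function name and tensor shapes are illustrative, not the repository's API):

```python
import torch
import torch.nn.functional as F

def image_token_entropy_stats(logits: torch.Tensor):
    """Mean/std of per-token entropy over an image-token sequence.

    logits: (seq_len, vocab_size) pre-softmax scores for each generated image token.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # (seq_len,) per-token entropy in nats
    return entropy.mean().item(), entropy.std().item()
```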
Launch EG-GRPO training with:
```bash
cd src
bash scripts/run_eg_grpo.sh
```
Notes:
- Parameters:
  - `reward_funcs`: The options are `hps`, `git`, `gdino`, and `orm`. You can choose whatever combination you need for training (see the composition sketch after these notes). Make sure to substitute the correct checkpoint and config paths in `run_eg_grpo.sh`.
- Hyperparameters:
  - We use the 50th percentile as the lower threshold of the high-entropy region and the 20th percentile as the boundary of the low-entropy region.
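As a hedged illustration of composing several reward functions: the actual wiring lives in the training scripts, and `compose_rewards`, `RewardFn`, and the weighting scheme below are our assumptions, not the repository's API.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical signature: each reward maps (image, prompt) -> float score.
RewardFn = Callable[[object, str], float]

def compose_rewards(rewards: Dict[str, RewardFn],
                    selected: List[str],
                    weights: Optional[Dict[str, float]] = None) -> RewardFn:
    """Combine the selected reward functions into one weighted average."""
    weights = weights or {name: 1.0 for name in selected}
    total_w = sum(weights[name] for name in selected)

    def combined(image: object, prompt: str) -> float:
        return sum(weights[name] * rewards[name](image, prompt)
                   for name in selected) / total_w

    return combined

# e.g., compose_rewards(rewards, selected=["hps", "gdino"]) mirrors
# selecting hps and gdino in reward_funcs.
```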
Once you have trained the model yourself, you can run inference using the following command:
```bash
cd src/eg-grpo/src/infer
python batch_inference.py \
    --model_path YOUR_MODEL_CKPT \
    --data_path test_data.txt \
    --reasoning_prompt_path YOUR_REASONING_PROMPT_PATH \
    --save_dir YOUR_OUTPUT_PATH
```
We evaluate the performance of our method using T2I-CompBench and WISE.
Please refer to the official repositories of these benchmarks for detailed evaluation protocols.
We modify the `reward_gdino` implementation to enforce stricter penalties when the model generates more objects than required. The original version is located at `EG-GRPO/src/eg-grpo/src/utils/reward_gdino.py`, and the revised version can be found at `EG-GRPO/src/eg-grpo/src/utils/reward_gdino_strict.py`.
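For intuition, the "strict" behavior could be paraphrased as below. This is our illustrative sketch, not the shipped `reward_gdino_strict.py`; it assumes the detector output has been reduced to per-class counts, and `detected`, `required`, and `surplus_penalty` are hypothetical names.

```python
def strict_object_reward(detected: dict, required: dict, surplus_penalty: float = 0.5) -> float:
    """Score object presence, penalizing surplus detections (illustrative).

    detected: class -> number of boxes found by the detector.
    required: class -> number of objects the prompt asks for.
    """
    score = 0.0
    for cls, need in required.items():
        have = detected.get(cls, 0)
        score += min(have, need) / need          # credit for matched objects
        if have > need:                          # stricter penalty for extras
            score -= surplus_penalty * (have - need)
    return max(score / len(required), 0.0)
```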
The layout and presentation of this README are inspired by the project page of T2I-R1.

