🌟🔥 [ICLR 2026] From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

💥 News

  • [2026.1.25] EG-GRPO has been accepted by ICLR 2026! 🎉🎉

👀 Exploring the interaction between CoT's exploration and RL's optimization

Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. In this project, we present a systematic entropy-based analysis and derive several insightful findings. Based on these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates the optimization budget according to token uncertainty.

🔑 Key Insights

Our analysis reveals three main findings:

  • Exploration vs. Exploitation: CoT increases the diversity of the generative space, whereas RL progressively focuses generation toward high-reward regions.
  • Entropy–Reward Coupling: The final reward exhibits a strong negative correlation with both the mean and variance of image-token entropy.
  • CoT Entropy Controls Image Quality: Lower-entropy textual CoTs lead to more stable and higher-quality image generation.
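
The entropy statistics underlying the second finding can be sketched as follows. This is a minimal illustration with hypothetical tensor shapes, not the repo's own analysis code, which records these statistics per generated image:

```python
import torch

def token_entropy_stats(logits):
    """Per-token entropy statistics of image-token logits.

    logits: (seq_len, vocab_size) raw scores for each generated image token.
    Returns the mean and variance of the per-token entropy, the two
    statistics whose negative correlation with reward is reported above.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Shannon entropy of each token's predictive distribution: -sum p * log p
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return entropy.mean().item(), entropy.var().item()

# Sanity check: uniform logits give maximal entropy log(vocab_size)
# and zero variance across tokens.
mean_h, var_h = token_entropy_stats(torch.zeros(16, 1024))
```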

Motivated by these findings, we propose EG-GRPO: it grants a bonus to high-entropy tokens to encourage structured exploration, and excludes low-entropy tokens from reward-driven updates to preserve stability. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.
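
The entropy-guided token split can be sketched as below. The percentile boundaries and the bonus weight here are illustrative placeholders; the actual values are configured in the training scripts:

```python
import torch

def eg_grpo_token_split(entropy, low_pct=0.2, high_pct=0.5, bonus=0.01):
    """Split tokens by entropy percentile, as described above.

    Tokens below the low-percentile threshold are masked out of the
    reward-driven update; tokens above the high-percentile threshold
    receive an additional exploration bonus.
    """
    low_thr = torch.quantile(entropy, low_pct)
    high_thr = torch.quantile(entropy, high_pct)
    update_mask = (entropy >= low_thr).float()            # exclude low-entropy tokens
    entropy_bonus = bonus * (entropy > high_thr).float()  # bonus high-entropy tokens
    return update_mask, entropy_bonus

# Example: 10 tokens with linearly increasing entropy.
h = torch.linspace(0.1, 1.0, 10)
mask, b = eg_grpo_token_split(h)
```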

💪 Get Started

Installation

Clone the repository:

   git clone git@github.com:minebetter/EG-GRPO.git

Create a conda environment:

   conda create -n eg-grpo python=3.10
   conda activate eg-grpo

Please follow the official PyTorch installation instructions to install both the PyTorch and TorchVision dependencies.

Install additional dependencies:

   cd src
   pip install -r requirements.txt

Note: The versions specified in requirements.txt are recommended but not mandatory.

Set up the Reward Model Environment

Make sure to install the reward models from our repo; we make some necessary modifications to support training with DeepSpeed ZeRO-3.

Install GroundingDINO if you want to use the Object Detector reward:

   cd src/eg-grpo/src/utils/GroundingDINO
   pip install -e .

Install LLaVA if you want to use the ORM reward:

   cd src/eg-grpo/src/utils/LLaVA-NeXT
   pip install -e ".[train]"

Install hpsv2 if you want to use the HPS reward:

   cd src/eg-grpo/src/utils/HPSv2
   pip install -e .

Prepare Reward Model Checkpoints

Please download the reward model checkpoints you need for training:

   mkdir reward_weight
   cd reward_weight
   # HPSv2 reward checkpoint
   wget https://huggingface.co/xswu/HPSv2/resolve/main/HPS_v2.1_compressed.pt
   # GIT VQA reward model
   huggingface-cli download microsoft/git-large-vqav2 --repo-type model --local-dir git-large-vqav2
   # GroundingDINO checkpoint (Object Detector reward)
   wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
   # ORM reward checkpoint
   huggingface-cli download CaraJ/ORM-T2I-R1 --repo-type model --local-dir ORM-T2I-R1

🔬 Analysis

We compare entropy and reward across three settings: Janus-Pro, Janus-Pro + CoT (no training), and T2I-R1.
Since Janus-Pro does not include a text-level CoT component, use the following command to generate images and record the corresponding entropy and reward statistics.

cd src/eg-grpo/src/analysis
python batch_inference_janus_analysis.py \
  --model_path YOUR_MODEL_CKPT \
  --data_path test_data.txt \
  --save_dir YOUR_OUTPUT_IMAGE_PATH \
  --mark_dir YOUR_OUTPUT_DATA_PATH \
  ...

For Janus-Pro + CoT (no training) and T2I-R1, use the command below to run the analysis experiments.

cd src/eg-grpo/src/analysis
python batch_inference_analysis.py \
  --model_path YOUR_MODEL_CKPT \
  --data_path test_data.txt \
  --reasoning_prompt_path YOUR_REASONING_PROMPT_PATH \
  --save_dir YOUR_OUTPUT_IMAGE_PATH \
  --mark_dir YOUR_OUTPUT_DATA_PATH \
  ...

Notes:

  • Observation: When analyzing the relationship between entropy and reward, it is important to control all other variables except the one of interest. For example, when studying how reward correlates with text entropy, you should keep the image-entropy standard deviation low to minimize confounding effects from the model’s image-generation alignment capability.
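
The controlled-variable filtering described above can be sketched as follows. The record fields (`img_entropy_std`, `text_entropy`, `reward`) are hypothetical names for illustration; the analysis scripts write their own statistics format:

```python
def correlate_text_entropy(records, img_std_cap=0.05):
    """Pearson correlation between text-CoT entropy and reward, computed
    only over samples whose image-entropy standard deviation is low, to
    reduce confounding from image-generation alignment capability."""
    kept = [r for r in records if r["img_entropy_std"] < img_std_cap]
    xs = [r["text_entropy"] for r in kept]
    ys = [r["reward"] for r in kept]
    n = len(kept)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy check: a perfectly linear negative relation among the low-variance
# samples yields a correlation of -1; the high-variance sample is filtered out.
recs = [{"img_entropy_std": 0.01, "text_entropy": t, "reward": 1.0 - t}
        for t in (0.1, 0.2, 0.3, 0.4)]
recs.append({"img_entropy_std": 0.9, "text_entropy": 5.0, "reward": 5.0})
r = correlate_text_entropy(recs)
```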

🚀 Training

cd src
bash scripts/run_eg_grpo.sh

Notes:

  • Parameters:
    • reward_funcs: The options are hps, git, gdino, and orm; you can choose any combination you need for training. Make sure to substitute the correct checkpoint and config paths in run_eg_grpo.sh.
  • Hyperparameters:
    • We use the 50th percentile as the lower bound of the high-entropy region and the 20th percentile as the upper bound of the low-entropy region.

💫 Inference

After training the model yourself, you can run inference using the following command:

   cd src/eg-grpo/src/infer
   python batch_inference.py \
     --model_path YOUR_MODEL_CKPT \
     --data_path test_data.txt \
     --reasoning_prompt_path YOUR_REASONING_PROMPT_PATH \
     --save_dir YOUR_OUTPUT_PATH

📈 Evaluation

We evaluate the performance of our method using T2I-CompBench and WISE.
Please refer to the official repositories of these benchmarks for detailed evaluation protocols.

📒 Notes

We modify the reward_gdino implementation to enforce stricter penalties when the model generates more objects than required. The original version is located at EG-GRPO/src/eg-grpo/src/utils/reward_gdino.py, and the revised version can be found at EG-GRPO/src/eg-grpo/src/utils/reward_gdino_strict.py.
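
The stricter over-generation penalty can be illustrated with a toy count-based reward. The function name and penalty weight here are hypothetical; see reward_gdino_strict.py for the actual implementation:

```python
def strict_object_reward(detected, required, penalty=0.5):
    """Reward object-count matching, penalizing surplus objects explicitly.

    Missing objects reduce the recall term as before; each detection beyond
    the required count additionally subtracts a fixed penalty, making
    over-generation strictly worse than an exact match.
    """
    recall = min(detected, required) / required
    surplus = max(detected - required, 0)
    return recall - penalty * surplus

# An exact match scores 1.0; extra objects push the reward below full recall.
exact = strict_object_reward(2, 2)
over = strict_object_reward(3, 2)
under = strict_object_reward(1, 2)
```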

📌 Acknowledgements

The layout and presentation of this README are inspired by the project page of T2I-R1.
