feat: add JoyImage edit plus #14032
Hi @tangyanf, thanks for the PR! It does not appear to link an issue it fixes. If this PR addresses an existing issue, please add a closing keyword (e.g. |
🤗 Serge says:
This PR adds the JoyImage Edit Plus model and pipeline. There are several blocking issues that need to be addressed before merging.
Blocking — Debug artifacts left in production code
Multiple torch.save() calls, a print() statement, and a commented-out exit(0) are left in pipeline_joyimage_edit_plus.py. These will write files to the user's working directory and print to stdout during every inference call.
Blocking — einops dependency
Per .ai/models.md: "No new mandatory dependency without discussion (e.g. einops). Optional deps guarded with is_X_available() and a dummy in utils/dummy_*.py." The pipeline directly imports from einops import rearrange — this is the only non-comment usage of einops in src/diffusers/. The rearrange calls should be rewritten with native PyTorch (reshape, permute, unflatten).
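As an illustration of the kind of rewrite meant here, the sketch below replaces a patchify-style rearrange with reshape and permute. The pattern string, the function names patchify and unpatchify, and the tensor layout are assumptions chosen for the example; they are not taken from the PR's pipeline code.

```python
import torch


def patchify(x: torch.Tensor, p: int) -> torch.Tensor:
    """Native equivalent of the (assumed) einops call
    rearrange(x, "b c (h p) (w q) -> b (h w) (c p q)", p=p, q=p).
    """
    b, c, height, width = x.shape
    h, w = height // p, width // p
    x = x.reshape(b, c, h, p, w, p)   # split H into (h, p) and W into (w, q)
    x = x.permute(0, 2, 4, 1, 3, 5)   # reorder to b h w c p q
    return x.reshape(b, h * w, c * p * p)


def unpatchify(tokens: torch.Tensor, c: int, h: int, w: int, p: int) -> torch.Tensor:
    """Inverse of patchify: b (h w) (c p q) -> b c (h p) (w q)."""
    b = tokens.shape[0]
    x = tokens.reshape(b, h, w, c, p, p)
    x = x.permute(0, 3, 1, 4, 2, 5)   # reorder to b c h p w q
    return x.reshape(b, c, h * p, w * p)
```

The same recipe applies to any rearrange pattern: split grouped axes with reshape (or unflatten), reorder with permute, then merge with reshape.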
Blocking — sglang integration code in model forward
The transformer's forward method contains sglang-specific code: list-unwrapping for "SglangXvideo CFG branches" (lines 272-276) and a try: from sglang... fallback (lines 279-287). Per .ai/AGENTS.md: "No defensive code, unused code paths, or legacy stubs — do not add fallback paths, safety checks, or configuration options 'just in case'." This code doesn't belong in the diffusers model — the pipeline always passes the required arguments.
Blocking — Missing dummy objects
JoyImageEditPlusTransformer3DModel, JoyImageEditPlusPipeline, and JoyImageEditPlusPipelineOutput are not registered in dummy_pt_objects.py / dummy_torch_and_transformers_objects.py. This will cause ImportError when torch/transformers are not installed.
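For context, a dummy object is a placeholder class that raises a helpful ImportError when a backend is missing, instead of the import itself failing. The sketch below mirrors the shape of the entries in the dummy files, but DummyObject and requires_backends here are simplified stand-ins written for this example, not the real helpers from diffusers.utils.

```python
import importlib.util


def requires_backends(obj, backends):
    """Simplified stand-in: raise ImportError if any backend is not installed."""
    name = obj.__name__ if hasattr(obj, "__name__") else obj.__class__.__name__
    missing = [b for b in backends if importlib.util.find_spec(b) is None]
    if missing:
        raise ImportError(f"{name} requires the following backends: {', '.join(missing)}")


class DummyObject(type):
    """Simplified stand-in metaclass: unknown class attributes trigger the backend check."""

    def __getattr__(cls, key):
        if key.startswith("_"):
            raise AttributeError(key)
        requires_backends(cls, cls._backends)


class JoyImageEditPlusPipeline(metaclass=DummyObject):
    # Placeholder exported when torch or transformers is unavailable.
    _backends = ["torch", "transformers"]

    def __init__(self, *args, **kwargs):
        requires_backends(self, ["torch", "transformers"])

    @classmethod
    def from_config(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])

    @classmethod
    def from_pretrained(cls, *args, **kwargs):
        requires_backends(cls, ["torch", "transformers"])
```

In the repository these entries are normally generated rather than hand-written, so the sketch is only meant to show what the missing registrations would look like.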
Blocking — Missing tests
No test files were added for the new model or pipeline.
Blocking — Hardcoded device_type="cuda" in torch.autocast
torch.autocast(device_type="cuda", ...) is hardcoded in two places in the pipeline. This will fail on MPS, XPU, and other non-CUDA devices.
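A minimal sketch of the device-agnostic form, assuming the pipeline has a torch.device (or a tensor with a .device) available at the call site. The helper names autocast_for and matmul_under_autocast are hypothetical and exist only for this example.

```python
import torch


def autocast_for(device, dtype=torch.bfloat16):
    """Build an autocast context keyed on the actual device type, not a literal "cuda"."""
    device = torch.device(device)
    return torch.autocast(device_type=device.type, dtype=dtype)


def matmul_under_autocast(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # The device type is read from the input tensor, so the same code path
    # serves CPU, CUDA, and other backends that support autocast.
    with autocast_for(a.device):
        return a @ b
```

Whether a given backend supports the chosen dtype under autocast is a separate question that this sketch does not address.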
Non-blocking — Inlined scheduler sigma math
Per .ai/pipelines.md gotcha #3, the pipeline manually computes shifted sigmas and temporarily overrides self.scheduler.shift — this is exactly what FlowMatchEulerDiscreteScheduler does with its shift config. The scheduler should own this logic.
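For reference, the static time shift that FlowMatchEulerDiscreteScheduler applies when dynamic shifting is disabled can be written in a few lines. The function name shift_sigmas is hypothetical, and the PR's inlined computation may differ in detail; this sketch only shows the formula the scheduler is understood to own.

```python
def shift_sigmas(sigmas, shift):
    """Apply sigma' = shift * sigma / (1 + (shift - 1) * sigma) to each sigma.

    shift == 1 leaves the schedule unchanged; larger values push intermediate
    sigmas toward 1 while keeping the endpoints 0 and 1 fixed.
    """
    return [shift * s / (1 + (shift - 1) * s) for s in sigmas]
```

Because the scheduler already performs this step from its shift config, the pipeline can pass plain sigmas and avoid mutating scheduler state during the call.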
Non-blocking — Unused imports and parameters
import inspect in transformer_joyimage_edit_plus.py is unused. The enable_denormalization parameter is declared in prepare_latents and __call__ but never read. retrieve_timesteps is duplicated from the existing pipeline without a # Copied from annotation.
serge v0.1.0 · model: claude-opus-4-6 · 29 LLM turns · 50 tool calls · 190.2s · 1602502 in / 7369 out tokens
- Remove einops dependency: replace rearrange with reshape/permute
- Remove sglang-specific code from transformer forward
- Remove unused import inspect from transformer
- Fix hardcoded device_type="cuda" to use device.type
- Simplify scheduler sigma math: delegate to retrieve_timesteps
- Remove unused enable_denormalization parameter
- Fix callback latents variable binding
- Fix output_type="pt" to return stacked tensor
- Set return_dict default to True in transformer forward
- Add dummy objects for JoyImageEditPlus classes
- Add transformer and pipeline test files
Force-pushed: 6f2763a to 8a911e5
Description
We are the JoyAI Team, and this is the Diffusers implementation for the JoyAI-Image-Edit-Plus model.
GitHub Repository: [https://github.com/jd-opensource/JoyAI-Image]
Hugging Face Model: [https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus-Diffusers]
Original opensource weights: [https://huggingface.co/jdopensource/JoyAI-Image-Edit-Plus]
Fixes #14049
Model Overview
JoyAI-Image-Edit-Plus extends JoyAI-Image-Edit with multi-image editing capabilities. While JoyAI-Image-Edit operates on a single reference image, Edit-Plus accepts multiple reference images as input and performs instruction-guided editing across them, enabling tasks such as subject composition, style transfer from multiple sources, and multi-view consistent editing.
It combines an 8B Multimodal Large Language Model (MLLM) with a 16B Multimodal Diffusion Transformer (MMDiT), supporting variable-resolution reference images that are independently encoded and jointly denoised.
Key Features
and dog images).