leejet · May 16, 2026
diff --git a/README.md b/README.md
@@ -133,9 +133,11 @@ API and command-line option may change frequently.***
 ## Performance
 
 If you want to improve performance or reduce VRAM/RAM usage, please refer to [performance guide](./docs/performance.md).
+For runtime and parameter backend placement, see the [backend selection guide](./docs/backend.md).
 
 ## More Guides
 
+- [Backend selection](./docs/backend.md)
 - [SD1.x/SD2.x/SDXL](./docs/sd.md)
 - [SD3/SD3.5](./docs/sd3.md)
 - [FLUX.1-dev/FLUX.1-schnell](./docs/flux.md)

diff --git a/docs/backend.md b/docs/backend.md
@@ -0,0 +1,122 @@
+# Backend selection
+
+`stable-diffusion.cpp` has two backend assignments:
+
+- `--backend` selects the runtime backend used to execute model graphs.
+- `--params-backend` selects the backend used to allocate model parameters.
+
+If `--params-backend` is not set, each module's parameters are allocated on that module's runtime backend.
+
+## Syntax
+
+A backend assignment can be a single backend name:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend cpu
+```
+
+This applies to every module that does not have a more specific assignment.
+
+Assignments can also target individual modules:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend te=cpu,vae=cuda0,diffusion=vulkan0
+```
+
+The same syntax is used for parameter placement:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend te=cpu,vae=cpu
+```
+
+Module names are case-insensitive. Hyphens and underscores in module names are ignored, so `clip_vision`, `clip-vision`, and `clipvision` are equivalent.
+
+`all=`, `default=`, and `*=` can be used to set the default backend inside a mixed assignment:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend all=cuda0,te=cpu
+```
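The assignment syntax described above could be split roughly as in the sketch below. This is not the parser from this patch: `BackendAssignment` and `parse_backend_assignment` are hypothetical names, and the rule that a repeated key is overridden by its last occurrence is an assumption. Case folding, whitespace trimming, and module aliases are left out.

```cpp
#include <map>
#include <sstream>
#include <string>

// Illustrative result type (hypothetical, not the project's own).
struct BackendAssignment {
    std::string default_backend;                    // empty means "not specified"
    std::map<std::string, std::string> per_module;  // module name -> backend name
};

// Splits a comma-separated assignment such as "all=cuda0,te=cpu".
// A token without '=' and the keys "all", "default" and "*" set the
// default backend. Assumption: when a key repeats, the last one wins.
BackendAssignment parse_backend_assignment(const std::string& spec) {
    BackendAssignment result;
    std::istringstream stream(spec);
    std::string token;
    while (std::getline(stream, token, ',')) {
        if (token.empty()) {
            continue;  // tolerate stray commas
        }
        const std::string::size_type eq = token.find('=');
        if (eq == std::string::npos) {
            result.default_backend = token;  // bare backend name
            continue;
        }
        const std::string key   = token.substr(0, eq);
        const std::string value = token.substr(eq + 1);
        if (key == "all" || key == "default" || key == "*") {
            result.default_backend = value;
        } else {
            result.per_module[key] = value;
        }
    }
    return result;
}
```

With this sketch, `parse_backend_assignment("all=cuda0,te=cpu")` yields a default of `cuda0` and a single per-module entry mapping `te` to `cpu`.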
+
+## Modules
+
+| Module | Purpose | Accepted names |
+| --- | --- | --- |
+| `diffusion` | UNet, DiT, MMDiT, Flux, Wan, Qwen Image, and other diffusion models | `diffusion`, `model`, `unet`, `dit` |
+| `te` | Text encoders and conditioners | `te`, `clip`, `text`, `textencoder`, `textencoders`, `conditioner`, `cond`, `llm`, `t5`, `t5xxl` |
+| `clip_vision` | CLIP vision encoder | `clip_vision`, `clipvision`, `clip-vision`, `vision` |
+| `vae` | VAE and TAE | `vae`, `firststage`, `autoencoder`, `tae` |
+| `controlnet` | ControlNet | `controlnet`, `control` |
+| `photomaker` | PhotoMaker ID encoder and PhotoMaker LoRA | `photomaker`, `photomakerid`, `pmid`, `photo` |
+| `upscaler` | ESRGAN upscaler | `upscaler`, `esrgan`, `hires` |
+
+`te` is the preferred module name for text encoders. `clip` is kept as an accepted alias because many existing commands and model names use CLIP terminology.
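To make the alias table concrete, here is a minimal lookup sketch. The alias lists are copied from the table above; the function names, the use of `std::map`, and the empty-string result for unknown names are assumptions for illustration, not the patch's actual code.

```cpp
#include <cctype>
#include <map>
#include <string>

// Lower-cases a name and drops '-' and '_' so that equivalent spellings
// such as "clip_vision" and "Clip-Vision" compare equal.
std::string normalize_module_name(const std::string& name) {
    std::string out;
    out.reserve(name.size());
    for (unsigned char c : name) {
        if (c == '-' || c == '_') {
            continue;
        }
        out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Returns the module for an accepted name, or an empty string if the
// name is not in the table (hypothetical error convention).
std::string canonical_module(const std::string& name) {
    static const std::map<std::string, std::string> aliases = {
        {"diffusion", "diffusion"}, {"model", "diffusion"}, {"unet", "diffusion"}, {"dit", "diffusion"},
        {"te", "te"}, {"clip", "te"}, {"text", "te"}, {"textencoder", "te"}, {"textencoders", "te"},
        {"conditioner", "te"}, {"cond", "te"}, {"llm", "te"}, {"t5", "te"}, {"t5xxl", "te"},
        {"clipvision", "clip_vision"}, {"vision", "clip_vision"},
        {"vae", "vae"}, {"firststage", "vae"}, {"autoencoder", "vae"}, {"tae", "vae"},
        {"controlnet", "controlnet"}, {"control", "controlnet"},
        {"photomaker", "photomaker"}, {"photomakerid", "photomaker"}, {"pmid", "photomaker"}, {"photo", "photomaker"},
        {"upscaler", "upscaler"}, {"esrgan", "upscaler"}, {"hires", "upscaler"},
    };
    const auto it = aliases.find(normalize_module_name(name));
    return it == aliases.end() ? std::string() : it->second;
}
```

Because keys are normalized before lookup, the table only needs the hyphen-free, underscore-free, lower-case spelling of each alias.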
+
+## Backend names
+
+Backend names are resolved against the GGML backend device list. Matching is case-insensitive and accepts exact names or unique prefixes. Common values include:
+
+- `cpu`
+- `cuda0`
+- `vulkan0`
+- `metal`
+
+The special values `auto`, `default`, and an empty backend name select the default backend. The default preference is GPU, then integrated GPU, then CPU.
+
+The special value `gpu` selects the first GPU backend, falling back to the first integrated GPU backend.
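The resolution rules in this section can be sketched over a plain device list. `Device`, `DeviceKind`, and `resolve_backend` are hypothetical stand-ins rather than GGML or project types, and two details are assumptions: an exact match takes priority over a prefix match, and an ambiguous prefix is treated as "not found" (returned as `-1`).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class DeviceKind { Cpu, Gpu, IntegratedGpu };

// Hypothetical stand-in for one entry of the backend device list.
struct Device {
    std::string name;
    DeviceKind kind;
};

std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Index of the first device of the given kind, or -1 if there is none.
int first_of_kind(const std::vector<Device>& devices, DeviceKind kind) {
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i].kind == kind) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Returns the index of the selected device, or -1 if nothing matches.
int resolve_backend(const std::vector<Device>& devices, const std::string& requested) {
    const std::string want = to_lower(requested);
    if (want.empty() || want == "auto" || want == "default") {
        // Default preference: GPU, then integrated GPU, then CPU.
        int index = first_of_kind(devices, DeviceKind::Gpu);
        if (index < 0) index = first_of_kind(devices, DeviceKind::IntegratedGpu);
        if (index < 0) index = first_of_kind(devices, DeviceKind::Cpu);
        return index;
    }
    if (want == "gpu") {
        const int index = first_of_kind(devices, DeviceKind::Gpu);
        return index >= 0 ? index : first_of_kind(devices, DeviceKind::IntegratedGpu);
    }
    int prefix_match = -1;
    int prefix_count = 0;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const std::string name = to_lower(devices[i].name);
        if (name == want) {
            return static_cast<int>(i);  // exact match wins (assumption)
        }
        if (name.compare(0, want.size(), want) == 0) {
            prefix_match = static_cast<int>(i);
            ++prefix_count;
        }
    }
    // A prefix only counts when exactly one device name starts with it.
    return prefix_count == 1 ? prefix_match : -1;
}
```

On a machine that exposes both `CUDA0` and `CUDA1`, this sketch would reject `cuda` as ambiguous while still accepting `cuda0`.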
+
+## Runtime backend vs. parameter backend
+
+The runtime backend controls where graph execution runs. The parameter backend controls where model weights are allocated.
+
+For example:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend cuda0 --params-backend cpu
+```
+
+This runs all modules on `cuda0`, but stores parameters in CPU RAM. During execution, parameters are moved to the runtime backend as needed.
+
+Per-module assignments can be mixed:
+
+```shell
+sd-cli -m model.safetensors -p "a cat" --backend diffusion=cuda0,te=cpu,vae=cpu --params-backend diffusion=cuda0,te=cpu,vae=cpu
+```
+
+This keeps text encoding and VAE execution on CPU while the diffusion model runs on GPU, and allocates each module's parameters on the same device that executes it.
+
+## Backend sharing and lifetime
+
+Backends are managed by `SDBackendManager`.
+
+Within one manager, backend instances are cached by resolved backend device name. If multiple modules request the same backend, they share the same `ggml_backend_t`.
+
+For example:
+
+```shell
+--backend te=cpu,vae=cpu
+```
+
+uses one shared CPU backend for both `te` and `vae` runtime execution.
+
+Runtime and parameter assignments also share the same backend cache. If `--backend diffusion=cuda0` and `--params-backend diffusion=cuda0` resolve to the same device, both use the same backend instance.
+
+`SDBackendManager` owns the backend instances and frees them when the context or upscaler is destroyed. Model runners receive non-owning runtime and parameter backend pointers and do not free them.
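The sharing and ownership model described in this section can be illustrated with a small cache keyed by resolved device name. This is only a sketch of the idea, not `SDBackendManager` itself: `FakeBackend` stands in for a real backend handle, and `BackendCache` is a hypothetical name.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Stand-in for a real backend handle (hypothetical).
struct FakeBackend {
    std::string device_name;
};

// Owns at most one backend per resolved device name and hands out
// non-owning pointers. All backends are released when the cache is destroyed.
class BackendCache {
public:
    FakeBackend* get(const std::string& resolved_name) {
        auto it = backends_.find(resolved_name);
        if (it == backends_.end()) {
            std::unique_ptr<FakeBackend> created(new FakeBackend{resolved_name});
            it = backends_.emplace(resolved_name, std::move(created)).first;
        }
        return it->second.get();  // callers must not free this pointer
    }

    std::size_t size() const { return backends_.size(); }

private:
    std::map<std::string, std::unique_ptr<FakeBackend>> backends_;
};
```

Two requests for the same resolved name return the same pointer, which mirrors how runtime and parameter assignments for one device end up on a single backend instance.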
+
+## Compatibility flags
+
+The older CPU placement flags are still supported:
+
+- `--clip-on-cpu`
+- `--vae-on-cpu`
+- `--control-net-cpu`
+- `--offload-to-cpu`
+
+`--clip-on-cpu`, `--vae-on-cpu`, and `--control-net-cpu` affect runtime backend assignment only when `--backend` is not set. They map to `te=cpu`, `vae=cpu`, and `controlnet=cpu`.
+
+`--offload-to-cpu` affects parameter backend assignment only when `--params-backend` is not set. It is equivalent to:
+
+```shell
+--params-backend cpu
+```
+
+Explicit `--backend` and `--params-backend` assignments are preferred for new commands.
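The precedence between the older flags and the new options can be summarized with the sketch below. It assumes that "not set" means an empty string and expresses the legacy flags as an equivalent assignment string. The patch may apply them differently internally, and `LegacyFlags`, `effective_backend`, and `effective_params_backend` are hypothetical names.

```cpp
#include <string>

// Hypothetical bundle of the older CPU placement flags.
struct LegacyFlags {
    bool clip_on_cpu     = false;
    bool vae_on_cpu      = false;
    bool control_net_cpu = false;
    bool offload_to_cpu  = false;
};

// An explicit --backend value wins; otherwise the legacy runtime flags
// are translated into per-module assignments.
std::string effective_backend(const std::string& backend, const LegacyFlags& flags) {
    if (!backend.empty()) {
        return backend;
    }
    std::string spec;
    auto append = [&spec](const char* entry) {
        if (!spec.empty()) {
            spec += ',';
        }
        spec += entry;
    };
    if (flags.clip_on_cpu) append("te=cpu");
    if (flags.vae_on_cpu) append("vae=cpu");
    if (flags.control_net_cpu) append("controlnet=cpu");
    return spec;
}

// An explicit --params-backend value wins; otherwise --offload-to-cpu
// behaves like "--params-backend cpu".
std::string effective_params_backend(const std::string& params_backend, const LegacyFlags& flags) {
    if (!params_backend.empty()) {
        return params_backend;
    }
    return flags.offload_to_cpu ? "cpu" : "";
}
```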
diff --git a/examples/cli/main.cpp b/examples/cli/main.cpp
@@ -749,7 +749,9 @@ int main(int argc, const char* argv[]) {
                                                      ctx_params.offload_params_to_cpu,
                                                      ctx_params.diffusion_conv_direct,
                                                      ctx_params.n_threads,
-                                                     gen_params.upscale_tile_size));
+                                                     gen_params.upscale_tile_size,
+                                                     ctx_params.backend.c_str(),
+                                                     ctx_params.params_backend.c_str()));
 
         if (upscaler_ctx == nullptr) {
             LOG_ERROR("new_upscaler_ctx failed");

diff --git a/examples/common/common.cpp b/examples/common/common.cpp
@@ -380,6 +380,14 @@ ArgOptions SDContextParams::get_options() {
          "--upscale-model",
          "path to esrgan model.",
          &esrgan_path},
+        {"",
+         "--backend",
+         "runtime backend assignment, e.g. cpu or clip=cpu,vae=cuda0,diffusion=vulkan0",
+         &backend},
+        {"",
+         "--params-backend",
+         "parameter backend assignment, e.g. cpu or diffusion=cpu,clip=cpu",
+         &params_backend},
     };
 
     options.int_options = {
@@ -676,6 +684,8 @@ std::string SDContextParams::to_string() const {
         << "  sampler_rng_type: " << sd_rng_type_name(sampler_rng_type) << ",\n"
         << "  offload_params_to_cpu: " << (offload_params_to_cpu ? "true" : "false") << ",\n"
         << "  max_vram: " << max_vram << ",\n"
+        << "  backend: \"" << backend << "\",\n"
+        << "  params_backend: \"" << params_backend << "\",\n"
         << "  enable_mmap: " << (enable_mmap ? "true" : "false") << ",\n"
         << "  control_net_cpu: " << (control_net_cpu ? "true" : "false") << ",\n"
         << "  clip_on_cpu: " << (clip_on_cpu ? "true" : "false") << ",\n"
@@ -751,6 +761,8 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
         chroma_t5_mask_pad,
         qwen_image_zero_cond_t,
         max_vram,
+        backend.c_str(),
+        params_backend.c_str(),
     };
     return sd_ctx_params;
 }

diff --git a/examples/common/common.h b/examples/common/common.h
@@ -110,14 +110,16 @@ struct SDContextParams {
     rng_type_t sampler_rng_type = RNG_TYPE_COUNT;
     bool offload_params_to_cpu  = false;
     float max_vram              = 0.f;
-    bool enable_mmap            = false;
-    bool control_net_cpu        = false;
-    bool clip_on_cpu            = false;
-    bool vae_on_cpu             = false;
-    bool flash_attn             = false;
-    bool diffusion_flash_attn   = false;
-    bool diffusion_conv_direct  = false;
-    bool vae_conv_direct        = false;
+    std::string backend;
+    std::string params_backend;
+    bool enable_mmap           = false;
+    bool control_net_cpu       = false;
+    bool clip_on_cpu           = false;
+    bool vae_on_cpu            = false;
+    bool flash_attn            = false;
+    bool diffusion_flash_attn  = false;
+    bool diffusion_conv_direct = false;
+    bool vae_conv_direct       = false;
 
     bool circular   = false;
     bool circular_x = false;

diff --git a/include/stable-diffusion.h b/include/stable-diffusion.h
@@ -206,6 +206,8 @@ typedef struct {
     int chroma_t5_mask_pad;
     bool qwen_image_zero_cond_t;
     float max_vram;  // GiB budget for graph-cut segmented param offload (0 = disabled, -1 = auto free VRAM minus 1 GiB)
+    const char* backend;
+    const char* params_backend;
 } sd_ctx_params_t;
 
 typedef struct {
@@ -427,7 +429,9 @@ SD_API upscaler_ctx_t* new_upscaler_ctx(const char* esrgan_path,
                                         bool offload_params_to_cpu,
                                         bool direct,
                                         int n_threads,
-                                        int tile_size);
+                                        int tile_size,
+                                        const char* backend,
+                                        const char* params_backend);
 SD_API void free_upscaler_ctx(upscaler_ctx_t* upscaler_ctx);
 
 SD_API sd_image_t upscale(upscaler_ctx_t* upscaler_ctx,

diff --git a/src/anima.hpp b/src/anima.hpp
@@ -526,10 +526,10 @@ namespace Anima {
         AnimaNet net;
 
         AnimaRunner(ggml_backend_t backend,
-                    bool offload_params_to_cpu,
+                    ggml_backend_t params_backend,
                     const String2TensorStorage& tensor_storage_map = {},
                     const std::string prefix                       = "model.diffusion_model")
-            : GGMLRunner(backend, offload_params_to_cpu) {
+            : GGMLRunner(backend, params_backend) {
             int64_t num_layers    = 0;
             std::string layer_tag = prefix + ".net.blocks.";
             for (const auto& kv : tensor_storage_map) {

diff --git a/src/auto_encoder_kl.hpp b/src/auto_encoder_kl.hpp
@@ -664,13 +664,13 @@ struct AutoEncoderKL : public VAE {
     AutoEncoderKLModel ae;
 
     AutoEncoderKL(ggml_backend_t backend,
-                  bool offload_params_to_cpu,
+                  ggml_backend_t params_backend,
                   const String2TensorStorage& tensor_storage_map,
                   const std::string prefix,
                   bool decode_only       = false,
                   bool use_video_decoder = false,
                   SDVersion version      = VERSION_SD1)
-        : decode_only(decode_only), VAE(version, backend, offload_params_to_cpu) {
+        : decode_only(decode_only), VAE(version, backend, params_backend) {
         if (sd_version_is_sd1(version) || sd_version_is_sd2(version)) {
             scale_factor = 0.18215f;
             shift_factor = 0.f;

diff --git a/src/clip.hpp b/src/clip.hpp
@@ -469,13 +469,13 @@ struct CLIPTextModelRunner : public GGMLRunner {
     std::vector<float> attention_mask_vec;
 
     CLIPTextModelRunner(ggml_backend_t backend,
-                        bool offload_params_to_cpu,
+                        ggml_backend_t params_backend,
                         const String2TensorStorage& tensor_storage_map,
                         const std::string prefix,
                         CLIPVersion version = OPENAI_CLIP_VIT_L_14,
                         bool with_final_ln  = true,
                         bool force_clip_f32 = false)
-        : GGMLRunner(backend, offload_params_to_cpu) {
+        : GGMLRunner(backend, params_backend) {
         bool proj_in = false;
         for (const auto& [name, tensor_storage] : tensor_storage_map) {
             if (!starts_with(name, prefix)) {