A hands-on implementation of Stable Diffusion v1.4 inference with custom DDIM sampling, classifier-free guidance, and inpainting — built piece by piece to understand what's actually happening under the hood.
Full walkthrough in `in-painting.ipynb`.
Mask-based latent blending: the original image is preserved outside the mask, and new content is diffused inside it.
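The blending idea can be sketched in a few lines. This is a minimal illustration rather than the notebook's actual code: the function name `blend_latents`, the mask convention (1 = region to regenerate, 0 = region to keep), and the toy tensor shapes are all assumptions. In a full sampling loop the original latents would typically be re-noised to the current timestep before each blend.

```python
import numpy as np

def blend_latents(denoised, original, mask):
    """Use freshly denoised latents where mask == 1, original latents where mask == 0.

    Hypothetical helper: `mask` broadcasts over the channel axis.
    """
    return mask * denoised + (1.0 - mask) * original

# Toy latents shaped like a tiny (channels, height, width) tensor.
rng = np.random.default_rng(0)
original = rng.standard_normal((4, 8, 8))
generated = rng.standard_normal((4, 8, 8))

# Regenerate only the central 4x4 patch.
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0

blended = blend_latents(generated, original, mask)
```

Outside the mask, `blended` equals `original` exactly; inside it, the values come from `generated`.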
Generate images from text prompts using a custom-built sampling pipeline, with working inpainting on top.
- DDIM Sampler — complete noise scheduling and denoising loop
- Classifier-Free Guidance — custom conditional/unconditional steering
- VAE Interface — latent encoding/decoding with proper scaling
- CLIP Text Pipeline — tokenization and embedding extraction
- Inpainting Logic — mask-based latent blending (`in-painting.ipynb`)
- UNet Backbone — `UNet2DConditionModel` from 🤗 Diffusers (pre-trained weights)
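As a rough sketch of how the first two components fit together, the snippet below combines classifier-free guidance with a single deterministic DDIM update (eta = 0). The names `apply_cfg` and `ddim_step`, the toy shapes, and the example `alpha_bar` values are assumptions for illustration, not the repository's API. NumPy arrays stand in for the UNet's noise predictions.

```python
import numpy as np

def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the unconditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from timestep t to the previous one."""
    # Estimate the clean latent implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the previous timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps

# Toy check: build x_t from a known clean latent and known noise.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))
noise = rng.standard_normal((4, 8, 8))
alpha_bar_t, alpha_bar_prev = 0.5, 0.8
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

# With identical conditional and unconditional predictions, guidance is a no-op.
eps = apply_cfg(noise, noise, guidance_scale=7.5)
x_prev = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
```

When the noise prediction is exact, the step lands on the same clean latent at the lower noise level, which makes it a handy sanity check for a hand-written sampler. For the VAE item, Stable Diffusion v1 latents are conventionally multiplied by 0.18215 after encoding and divided by it before decoding.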
Initial work focused on injecting weights into a fully custom UNet architecture. 684/686 layers loaded successfully, but architectural mismatches (GEGLU vs GELU activations, upsampling order) prevented coherent outputs. Rather than paper over the issue, the pragmatic call was to use the proven Diffusers UNet as a stable backbone while keeping every other component custom — quality without sacrificing what was learned.
See `stable-diffusion.ipynb` for that experiment.
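To illustrate why a GEGLU/GELU mismatch matters, here is a small sketch of the two feed-forward projections. It is not the code from the notebook: the function names, dimensions, and the tanh approximation of GELU are assumptions. GEGLU projects to twice the output width and uses one half to gate the other, so weights trained for it compute something different when run through a plain GELU layer.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (an assumption; exact GELU uses erf)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ff_gelu(x, w, b):
    # Plain projection followed by GELU: output width equals w.shape[1].
    return gelu(x @ w + b)

def ff_geglu(x, w, b):
    # GEGLU: project to 2 * dim_out, split into value and gate, gate with GELU.
    value, gate = np.split(x @ w + b, 2, axis=-1)
    return value * gelu(gate)

dim_in, dim_out = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((2, dim_in))
w = rng.standard_normal((dim_in, 2 * dim_out))  # GEGLU-shaped weight
b = np.zeros(2 * dim_out)

out_geglu = ff_geglu(x, w, b)  # shape (2, dim_out)
out_gelu = ff_gelu(x, w, b)    # shape (2, 2 * dim_out)
```

The same weight tensor loads into both layers without a shape error at this point, yet the outputs differ in width and value, which is one way a checkpoint can load almost completely and still produce incoherent images.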
Prompt: "an astronaut riding a horse" — 35 steps each
| My Custom UNet (weight injection attempt) | Diffusers UNet (final pipeline) |
|---|---|
| Garbled / incoherent output | Coherent, prompt-following output |
- ✅ Text-to-image generation
- ✅ Configurable steps and guidance scale
- ✅ Custom DDIM sampling loop
- ✅ Inpainting with custom masks (`in-painting.ipynb`)
```bash
python inference.py -c "your prompt" -s 50 -g 7.5
```

<img width="512" height="512" alt="__results___9_12" src="https://github.com/user-attachments/assets/c5168e58-7eee-426e-99fd-4c7c80b13542" />
