
fix(akv_video): preserve encoder logvar in KVAE video VAE (#13652) #13657

Open

Anai-Guo wants to merge 1 commit into huggingface:main from Anai-Guo:fix-akv-video-logvar

Conversation

@Anai-Guo

Summary

Fixes Issue 1 in #13652.

AutoencoderKLKVAEVideo was discarding the encoder log-variance, so latent_dist.logvar was always zero and sample_posterior=True sampled from the wrong distribution. The static-image variant AutoencoderKLKVAE already keeps the full encoder output and lets DiagonalGaussianDistribution split it into mean/logvar — this PR aligns the video variant with that path.
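For context, the sample_posterior path mentioned above follows the usual diffusers VAE forward pattern, roughly as in this generic sketch (encode_latents is a hypothetical helper name, not the actual AutoencoderKLKVAEVideo code):

import torch
from diffusers import AutoencoderKLKVAEVideo

# Hypothetical helper illustrating the standard diffusers VAE forward pattern.
def encode_latents(vae: AutoencoderKLKVAEVideo, x: torch.Tensor, sample_posterior: bool) -> torch.Tensor:
    posterior = vae.encode(x).latent_dist
    # Under the bug, sample() effectively drew from N(mean, I) because logvar was zeroed.
    return posterior.sample() if sample_posterior else posterior.mode()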

Root cause

KVAECachedEncoder3D outputs 2 * z_channels per chunk (mean and logvar concatenated), matching the standard KL-VAE convention. But _encode() was discarding the second half:

# old _encode(): keep only the mean half, silently drop the logvar half
sample, _ = torch.chunk(l, 2, dim=1)
latent.append(sample)

…and encode() then padded the missing half with zeros before constructing the posterior:

# old encode(): fabricate an all-zero logvar half
h_double = torch.cat([h, torch.zeros_like(h)], dim=1)
posterior = DiagonalGaussianDistribution(h_double)

So checkpoint logvar weights were silently ignored, posterior sampling used std = exp(0.5 * 0) = 1.0 everywhere, and parity with the upstream KVAE 3D VAE was broken.
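To make the consequence concrete, here is a minimal sketch of the old zero-pad path (the tensor shape is made up, and the import path assumes a recent diffusers layout):

import torch
from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution

# Stand-in for the mean half that _encode() used to return: (B, z_channels, T, H, W).
h = torch.randn(1, 4, 2, 8, 8)

# Old path: pad the missing logvar half with zeros.
h_double = torch.cat([h, torch.zeros_like(h)], dim=1)
posterior = DiagonalGaussianDistribution(h_double)

# logvar == 0 everywhere, so std = exp(0.5 * 0) = 1.0 regardless of the checkpoint.
assert torch.all(posterior.logvar == 0)
assert torch.all(posterior.std == 1.0)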

Fix

Mirror the autoencoder_kl_kvae.py pattern: keep the full encoder output and let DiagonalGaussianDistribution do the chunk-and-clamp itself.

# _encode(): keep full encoder output
latent.append(self.encoder(chunk, cache))

# encode(): no zero-pad
posterior = DiagonalGaussianDistribution(h)

DiagonalGaussianDistribution.__init__ does self.mean, self.logvar = torch.chunk(parameters, 2, dim=1), so this is functionally identical to the static-image variant.
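For reference, the relevant behavior can be paraphrased as follows (a sketch of diffusers' DiagonalGaussianDistribution, not a verbatim copy; the clamp bounds match the assertions in the verification snippet below):

import torch

class DiagonalGaussianSketch:
    def __init__(self, parameters: torch.Tensor):
        # Split the concatenated (mean, logvar) tensor along the channel dim.
        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
        # Clamp logvar for numerical stability.
        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self) -> torch.Tensor:
        # Reparameterized draw: mean + std * eps.
        return self.mean + self.std * torch.randn_like(self.mean)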

Verification

The reproduction snippet from #13652 now shows non-zero logvar and posterior.mean matching the raw encoder mean half:

import torch
from diffusers import AutoencoderKLKVAEVideo

model = AutoencoderKLKVAEVideo(
    ch=32, ch_mult=(1, 2), num_res_blocks=1, z_channels=4, temporal_compress_times=2
).eval()
x = torch.randn(1, 3, 3, 16, 16)

with torch.no_grad():
    raw = model.encoder(x, model._make_encoder_cache())
    raw_mean, raw_logvar = raw.chunk(2, dim=1)
    posterior = model.encode(x).latent_dist

assert torch.allclose(posterior.mean, raw_mean, atol=1e-5)
assert torch.allclose(posterior.logvar, torch.clamp(raw_logvar, -30.0, 20.0), atol=1e-5)

posterior.sample() shape and dtype are unchanged; the only behavioral change is that posterior.logvar, posterior.std, and stochastic samples now reflect the actual encoder output instead of constant zeros, as the quick check below illustrates.
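Continuing the snippet above, two posterior draws should now differ with a spread governed by the learned logvar rather than a fixed unit std (assuming diffusers' DiagonalGaussianDistribution, which exposes .std):

with torch.no_grad():
    s1 = posterior.sample()
    s2 = posterior.sample()

# Before the fix both draws were mean + 1.0 * eps; now their spread tracks
# exp(0.5 * logvar) from the checkpoint weights.
assert not torch.allclose(s1, s2)
assert torch.allclose(posterior.std, torch.exp(0.5 * posterior.logvar))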

🤖 Generated with Claude Code

github-actions bot added the models and size/S (PR with diff < 50 LOC) labels on Apr 30, 2026
