
fix(akv_video): preserve encoder logvar in KVAE video VAE (#13652) #13657

Open

Anai-Guo wants to merge 1 commit into huggingface:main from Anai-Guo:fix-akv-video-logvar

Conversation

@Anai-Guo

Summary

Fixes Issue 1 in #13652.

AutoencoderKLKVAEVideo was discarding the encoder log-variance, so latent_dist.logvar was always zero and sample_posterior=True sampled from the wrong distribution. The static-image variant AutoencoderKLKVAE already keeps the full encoder output and lets DiagonalGaussianDistribution split it into mean/logvar — this PR aligns the video variant with that path.
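For context, the sample_posterior path mentioned above follows the usual diffusers VAE forward pattern, roughly as in this generic sketch (encode_latents is a hypothetical helper name, not the actual AutoencoderKLKVAEVideo code):

import torch
from diffusers import AutoencoderKLKVAEVideo

# Hypothetical helper illustrating the standard diffusers VAE forward pattern.
def encode_latents(vae: AutoencoderKLKVAEVideo, x: torch.Tensor, sample_posterior: bool) -> torch.Tensor:
    posterior = vae.encode(x).latent_dist
    # Under the bug, sample() effectively drew from N(mean, I) because logvar was zeroed.
    return posterior.sample() if sample_posterior else posterior.mode()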

Root cause

KVAECachedEncoder3D outputs 2 * z_channels per chunk (mean and logvar concatenated), matching the standard KL-VAE convention. But _encode() was discarding the second half:

# old _encode(): keep only the mean half, silently drop the logvar half
sample, _ = torch.chunk(l, 2, dim=1)
latent.append(sample)

…and encode() then padded the missing half with zeros before constructing the posterior:

# old encode(): fabricate an all-zero logvar half
h_double = torch.cat([h, torch.zeros_like(h)], dim=1)
posterior = DiagonalGaussianDistribution(h_double)

So checkpoint logvar weights were silently ignored, posterior sampling used std = exp(0.5 * 0) = 1.0 everywhere, and parity with the upstream KVAE 3D VAE was broken.
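To make the consequence concrete, here is a minimal sketch of the old zero-pad path (the tensor shape is made up, and the import path assumes a recent diffusers layout):

import torch
from diffusers.models.autoencoders.vae import DiagonalGaussianDistribution

# Stand-in for the mean half that _encode() used to return: (B, z_channels, T, H, W).
h = torch.randn(1, 4, 2, 8, 8)

# Old path: pad the missing logvar half with zeros.
h_double = torch.cat([h, torch.zeros_like(h)], dim=1)
posterior = DiagonalGaussianDistribution(h_double)

# logvar == 0 everywhere, so std = exp(0.5 * 0) = 1.0 regardless of the checkpoint.
assert torch.all(posterior.logvar == 0)
assert torch.all(posterior.std == 1.0)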

Fix

Mirror the autoencoder_kl_kvae.py pattern: keep the full encoder output and let DiagonalGaussianDistribution do the chunk-and-clamp itself.

# _encode(): keep full encoder output
latent.append(self.encoder(chunk, cache))

# encode(): no zero-pad
posterior = DiagonalGaussianDistribution(h)

DiagonalGaussianDistribution.__init__ does self.mean, self.logvar = torch.chunk(parameters, 2, dim=1), so this is functionally identical to the static-image variant.
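For reference, the relevant behavior can be paraphrased as follows (a sketch of diffusers' DiagonalGaussianDistribution, not a verbatim copy; the clamp bounds match the assertions in the verification snippet below):

import torch

class DiagonalGaussianSketch:
    def __init__(self, parameters: torch.Tensor):
        # Split the concatenated (mean, logvar) tensor along the channel dim.
        self.mean, self.logvar = torch.chunk(parameters, 2, dim=1)
        # Clamp logvar for numerical stability.
        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self) -> torch.Tensor:
        # Reparameterized draw: mean + std * eps.
        return self.mean + self.std * torch.randn_like(self.mean)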

Verification

The reproduction snippet from #13652 now shows non-zero logvar and posterior.mean matching the raw encoder mean half:

import torch
from diffusers import AutoencoderKLKVAEVideo

model = AutoencoderKLKVAEVideo(
    ch=32, ch_mult=(1, 2), num_res_blocks=1, z_channels=4, temporal_compress_times=2
).eval()
x = torch.randn(1, 3, 3, 16, 16)

with torch.no_grad():
    raw = model.encoder(x, model._make_encoder_cache())
    raw_mean, raw_logvar = raw.chunk(2, dim=1)
    posterior = model.encode(x).latent_dist

assert torch.allclose(posterior.mean, raw_mean, atol=1e-5)
assert torch.allclose(posterior.logvar, torch.clamp(raw_logvar, -30.0, 20.0), atol=1e-5)

posterior.sample() shape and dtype are unchanged; the only behavioral change is that posterior.logvar, posterior.std, and stochastic samples now reflect the actual encoder output instead of constant zeros, as the quick check below illustrates.
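Continuing the snippet above, two posterior draws should now differ with a spread governed by the learned logvar rather than a fixed unit std (assuming diffusers' DiagonalGaussianDistribution, which exposes .std):

with torch.no_grad():
    s1 = posterior.sample()
    s2 = posterior.sample()

# Before the fix both draws were mean + 1.0 * eps; now their spread tracks
# exp(0.5 * logvar) from the checkpoint weights.
assert not torch.allclose(s1, s2)
assert torch.allclose(posterior.std, torch.exp(0.5 * posterior.logvar))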

🤖 Generated with Claude Code

github-actions bot added the models and size/S (PR with diff < 50 LOC) labels on Apr 30, 2026
