
FlowInOne


Unified image-to-image generation via multimodal flow matching.



FlowInOne introduces a unified framework for image-to-image generation by encoding diverse multimodal inputs—such as sketches, text, layout primitives, and symbolic instructions—into a shared 2D visual latent space. This enables a single flow matching model to generate photorealistic images conditioned on fused visual prompts, eliminating the need for modality-specific decoders or alignment losses.

By learning an isomorphic mapping from non-visual semantics (e.g., "remove grass") into denoisable visual representations, FlowInOne achieves semantic-preserving visual grounding and geometry-aware flow propagation. The system advances research in unified visual representation learning, offering a foundation for future multimodal generative models.


Quick Start

```shell
pip install flowinone
```

```python
from PIL import GifImagePlugin, AvifImagePlugin

# Load an animated GIF (n_frames and is_animated are properties, not methods)
gif = GifImagePlugin.GifImageFile("animation.gif")
print(f"Frames: {gif.n_frames}, Animated: {gif.is_animated}")

# Read an AVIF image
avif = AvifImagePlugin.AvifImageFile("image.avif")
avif.load()
```
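The plugin classes can also be reached through Pillow's generic loader. A minimal, self-contained sketch using the standard `Image.open` / `ImageSequence` API (the in-memory two-frame GIF is illustrative only):

```python
import io
from PIL import Image, ImageSequence

# Build a tiny two-frame GIF in memory so the example is self-contained
buf = io.BytesIO()
frames = [Image.new("RGB", (8, 8), c) for c in ("red", "blue")]
frames[0].save(buf, format="GIF", save_all=True, append_images=frames[1:])
buf.seek(0)

# Image.open dispatches to the matching plugin (here GifImageFile)
gif = Image.open(buf)
print(f"Frames: {gif.n_frames}, Animated: {gif.is_animated}")
rgb_frames = [f.convert("RGB") for f in ImageSequence.Iterator(gif)]
```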

What Can You Do?

Feature 1: Multimodal Input Encoding

Encode heterogeneous inputs (sketches, text, layout) into a shared visual latent space using PIL-based decoders and custom flow propagation.

```python
from PIL import BmpImagePlugin, GbrImagePlugin

# Load BMP and GBR files as visual primitives
bmp = BmpImagePlugin.BmpImageFile("sketch.bmp")
gbr = GbrImagePlugin.GbrImageFile("brush.gbr")
# These are processed into the shared latent space
```
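What the projection into the shared latent space might look like is sketched below; `encode_to_latent` is a hypothetical helper for illustration, not part of the released API:

```python
import numpy as np
from PIL import Image

def encode_to_latent(img: Image.Image, size=(64, 64)) -> np.ndarray:
    """Hypothetical projection: grayscale, resize, scale to [0, 1]."""
    arr = np.asarray(img.convert("L").resize(size), dtype=np.float32)
    return arr / 255.0

# Stand-in for a decoded sketch.bmp
sketch = Image.new("L", (32, 32), 128)
latent = encode_to_latent(sketch)
```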

Feature 2: Flow Matching on Unified Visual Prompts

Generate photorealistic images from fused visual prompts using a single flow matching model, without modality-specific pipelines.

```python
from PIL import DcxImagePlugin

# Multi-page DCX file as layout input
dcx = DcxImagePlugin.DcxImageFile("layout.dcx")
frame_count = dcx.n_frames  # tell() returns the current frame, not the count
dcx.seek(1)  # Navigate layout frames
```
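At sampling time, flow matching reduces to integrating a learned velocity field from a noise latent toward an image latent. A toy fixed-step Euler sketch; the closed-form `v` below is a stand-in for the trained model, not FlowInOne's actual network:

```python
import numpy as np

def euler_sample(v_field, x0, steps=100):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + dt * v_field(x, t)
    return x

# Toy velocity field: straight-line flow toward a target latent x1
x1 = np.ones((4, 4))
v = lambda x, t: (x1 - x) / max(1.0 - t, 1e-6)
x = euler_sample(v, np.zeros((4, 4)))
```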

Architecture

FlowInOne uses PIL’s modular image plugin system to ingest and decode diverse input formats into a unified tensor representation. Each input modality (e.g., sketch, text, layout) is processed through its respective ImageFile subclass (e.g., GifImageFile, AvifImageFile) into a common 2D latent space.

This latent space is then used to condition a single flow matching model that generates target images. The architecture avoids modality-specific decoders by projecting all inputs into a shared, denoisable visual domain where flow propagation respects both geometry and semantics.

```mermaid
graph LR
    A[Sketch] -->|BmpImageFile| D[Visual Latent Space]
    B[Text] -->|GdImageFile| D
    C[Layout] -->|DcxImageFile| D
    D --> E[Flow Matching Model]
    E --> F[Photorealistic Output]
```
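The fusion step can be pictured as stacking each decoded modality's latent into one conditioning tensor. A minimal sketch, assuming every modality has already been projected to the same HxW latent (the shapes and the channel-stacking choice are illustrative, not the shipped architecture):

```python
import numpy as np

# Each decoded modality becomes a 64x64 latent (random stand-ins here)
sketch_lat = np.random.rand(64, 64).astype(np.float32)
text_lat = np.random.rand(64, 64).astype(np.float32)
layout_lat = np.random.rand(64, 64).astype(np.float32)

# The fused visual prompt that conditions the flow matching model
prompt = np.stack([sketch_lat, text_lat, layout_lat])
```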

API Reference

Key classes from the PIL plugin ecosystem used in FlowInOne:

```
class GifImagePlugin.GifImageFile(ImageFile.ImageFile)
    @property
    def n_frames(self) -> int
    @property
    def is_animated(self) -> bool
    def data(self) -> bytes | None

class AvifImagePlugin.AvifImageFile(ImageFile.ImageFile)
    def load(self) -> Image.core.PixelAccess | None
    def seek(self, frame: int) -> None

class DcxImagePlugin.DcxImageFile(PcxImageFile)
    def seek(self, frame: int) -> None
    def tell(self) -> int

class GdImageFile.GdImageFile(ImageFile.ImageFile)
# opened via a module-level function, not a static method:
GdImageFile.open(fp: StrOrBytesPath | IO[bytes], mode: str = "r") -> GdImageFile

class ContainerIO.ContainerIO(IO[AnyStr])
    def read(self, n: int = -1) -> AnyStr
    def seek(self, offset: int, mode: int = io.SEEK_SET) -> int
    def write(self, b: AnyStr) -> NoReturn  # always raises NotImplementedError
```
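`ContainerIO` treats a byte range of a larger stream as an independent file-like object, which is useful for container formats that embed sub-images. A small self-contained sketch with an in-memory stream:

```python
import io
from PIL import ContainerIO

# Wrap a region of a larger stream as an independent file-like object
raw = io.BytesIO(b"HEADERpayload-bytesFOOTER")
region = ContainerIO.ContainerIO(raw, offset=6, length=13)
data = region.read()  # reads only the 13-byte window after the header
```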

Research Background

FlowInOne is inspired by recent advances in flow matching and multimodal representation learning. It builds on the idea of semantic-to-visual isomorphism, where non-visual instructions are mapped into a denoisable visual latent space compatible with diffusion-like generation.

While similar in spirit to models like FLUX and Stable Diffusion, FlowInOne eliminates modality-specific components by using visual encoding as a universal interface. This approach draws from research on perceptual alignment, neural rendering, and unified latent spaces.

Testing

FlowInOne includes 1195 test files ensuring robustness across input modalities and edge cases in image decoding and latent projection. Tests are located in the GitHub repository under /tests.

Run tests locally:

```shell
pytest tests/
```

Contributing

Contributions are welcome! Please open issues or PRs on GitHub. Ensure all new code includes tests and adheres to the existing API patterns.

Citation

```bibtex
@software{young_flowinone_2024,
  author = {Young, Andrew},
  title = {FlowInOne: Unified Image-to-Image Generation via Multimodal Flow Matching},
  url = {https://github.com/Lumi-node/flowinone},
  year = {2024},
  publisher = {Automate Capture Research}
}
```

License

MIT – see LICENSE for details.