Question: would a shape-aware preprocessor be a useful contrib/ tool? #4659

@rjamesy

## Question: would a shape-aware preprocessor be useful as a `contrib/` tool?

Hi zstd team,

I wanted to ask whether this kind of contribution would be of interest before opening a PR.

I have built a small open-source preprocessor library called **STRATA**:

https://github.com/rjamesy/strata

It is MIT-licensed and exposes a few reversible, shape-aware transforms:

- 2D predictors
- YCoCg-R colour decorrelation
- full octahedral cube rotation
- radial reordering for 3D voxels
- an auto-selector that keeps raw input as a candidate, so the selected output is never worse than the underlying codec alone
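
To make the "never worse" property of the auto-selector concrete, here is a minimal sketch of the idea. It is not STRATA's code: the names `pack`, `unpack`, and `CANDIDATES` are hypothetical, a plain byte-delta stands in for the real shape-aware transforms, and `zlib` from the Python standard library stands in for zstd. The one-byte tag is also an assumption of this sketch, so here the bound is "raw codec output plus one byte".

```python
import zlib


def delta_encode(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, value in enumerate(data):
        out[i] = (value - prev) & 0xFF
        prev = value
    return bytes(out)


def delta_decode(data: bytes) -> bytes:
    """Invert delta_encode by accumulating the differences (mod 256)."""
    out = bytearray(len(data))
    prev = 0
    for i, diff in enumerate(data):
        prev = (prev + diff) & 0xFF
        out[i] = prev
    return bytes(out)


# Hypothetical candidate table: tag byte -> (forward, inverse).
# Tag 0 is the raw input, so the selector can always fall back to it.
CANDIDATES = {
    0: (lambda b: b, lambda b: b),
    1: (delta_encode, delta_decode),
}


def pack(data: bytes, level: int = 1) -> bytes:
    """Compress every candidate and keep the smallest, prefixed by its tag."""
    best_tag, best_payload = None, None
    for tag, (forward, _inverse) in CANDIDATES.items():
        payload = zlib.compress(forward(data), level)
        if best_payload is None or len(payload) < len(best_payload):
            best_tag, best_payload = tag, payload
    return bytes([best_tag]) + best_payload


def unpack(blob: bytes) -> bytes:
    """Read the tag, decompress, then apply the matching inverse transform."""
    _forward, inverse = CANDIDATES[blob[0]]
    return inverse(zlib.decompress(blob[1:]))


if __name__ == "__main__":
    ramp = bytes(range(256)) * 64  # smooth ramp with wrap-around
    blob = pack(ramp)
    assert unpack(blob) == ramp
    print("selected tag:", blob[0], "packed size:", len(blob))
```

Because the raw input is always one of the candidates, the selected payload cannot be larger than what the codec produces on the raw input; the cost is one extra compression pass per candidate at encode time.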

The result that prompted the question is that, for some shape-aware data, `STRATA-preprocess + zstd -1` beats `raw input + zstd -22` on both speed and ratio.

Examples:

```text
27 MB RGB photo:
YCoCg-R + zstd -1
= 463x faster, 5.4% smaller than raw zstd -22

Smooth 2D heightmap:
2D predictor + zstd -1
= 4.2x faster, 44.5% smaller than raw zstd -22

64^3 byte volume:
cube rotation + radial reordering + zstd -22
= 14.4% smaller than raw zstd -22, at roughly the same speed
```

The mechanism is not zstd-specific. The transforms expose structure that is hard for a byte-stream compressor to infer from raw input, especially when the redundancy is smoothness rather than exact repeated byte sequences.
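
As a rough illustration of that mechanism, the sketch below builds a synthetic smooth heightmap and compares compressed sizes before and after a simple 2D predictor. Everything here is an assumption for illustration rather than STRATA's implementation: the predictor is `left + up - upper_left`, the surface is a made-up sine/cosine product, the function names are hypothetical, and `zlib` from the standard library is used as the byte-stream codec in place of zstd.

```python
import math
import zlib


def make_heightmap(width: int, height: int) -> bytes:
    """Synthetic smooth surface, one byte per sample, stored row-major."""
    return bytes(
        int(127.5 + 100.0 * math.sin(x / 17.0) * math.cos(y / 23.0))
        for y in range(height)
        for x in range(width)
    )


def _neighbours(buf, x: int, y: int, width: int):
    """Return (left, up, upper_left) samples, using 0 outside the grid."""
    left = buf[y * width + x - 1] if x > 0 else 0
    up = buf[(y - 1) * width + x] if y > 0 else 0
    upper_left = buf[(y - 1) * width + x - 1] if x > 0 and y > 0 else 0
    return left, up, upper_left


def predict_forward(pixels: bytes, width: int, height: int) -> bytes:
    """Store each sample as its residual against left + up - upper_left (mod 256)."""
    out = bytearray(len(pixels))
    for y in range(height):
        for x in range(width):
            left, up, upper_left = _neighbours(pixels, x, y, width)
            i = y * width + x
            out[i] = (pixels[i] - (left + up - upper_left)) & 0xFF
    return bytes(out)


def predict_inverse(residuals: bytes, width: int, height: int) -> bytes:
    """Rebuild samples in raster order from the already-decoded neighbours."""
    out = bytearray(len(residuals))
    for y in range(height):
        for x in range(width):
            left, up, upper_left = _neighbours(out, x, y, width)
            i = y * width + x
            out[i] = (residuals[i] + left + up - upper_left) & 0xFF
    return bytes(out)


if __name__ == "__main__":
    W, H = 128, 128
    raw = make_heightmap(W, H)
    residuals = predict_forward(raw, W, H)
    assert predict_inverse(residuals, W, H) == raw  # the pass is lossless
    print("raw       ->", len(zlib.compress(raw, 1)), "bytes")
    print("residuals ->", len(zlib.compress(residuals, 1)), "bytes")
```

On a smooth surface, neighbouring samples differ by small amounts but rarely repeat exactly, so the raw bytes give an LZ-style matcher little to work with. The residuals cluster around zero, which both the match finder and the entropy stage handle well.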

The repo includes a reproduction script:

    python3 bench/preprocess_demo.py

It writes a CSV covering the tested codec/dataset combinations.

My questions:

1. Would this kind of shape-aware preprocessor be a plausible fit for `contrib/`?

   I am not proposing changes to zstd core. The transforms are separate reversible passes, roughly 30–300 LOC each, and the tool would emit zstd-compressible bytes.

2. Is the design preference for `contrib/` to remain fully byte-stream-general, or are optional shape-aware tools acceptable there?

   I saw `contrib/seekable_format` and `contrib/pzstd`, so I wanted to ask rather than assume the boundary.

3. Are there specific data classes the team would want benchmarked before considering something like this?

   My current corpus is small: terrain, CT-style volumes, RGB photos, and sensor CSV.

I am also happy to keep this as an external library if that is the better fit. I mainly wanted to check before opening a PR.

Cheers,
Richard
