## Question: would a shape-aware preprocessor be useful as a `contrib/` tool?
Hi zstd team,
I wanted to ask whether this kind of contribution would be of interest before opening a PR.
I have built a small open-source preprocessor library called **STRATA**:
https://github.com/rjamesy/strata
It is MIT-licensed and exposes a few reversible, shape-aware transforms:
- 2D predictors
- YCoCg-R colour decorrelation
- full octahedral cube rotation
- radial reordering for 3D voxels
- an auto-selector that keeps raw input as a candidate, so the selected output is never worse than the underlying codec alone
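
For readers unfamiliar with YCoCg-R, the sketch below shows the standard integer lifting form of the transform and its exact inverse. It is illustrative only and not taken from the STRATA source; the function names `rgb_to_ycocg_r` and `ycocg_r_to_rgb` are hypothetical.

```python
from typing import Tuple


def rgb_to_ycocg_r(r: int, g: int, b: int) -> Tuple[int, int, int]:
    """Forward lossless YCoCg-R using integer lifting steps."""
    co = r - b
    t = b + (co >> 1)  # Python's >> floors, also for negative values
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg


def ycocg_r_to_rgb(y: int, co: int, cg: int) -> Tuple[int, int, int]:
    """Exact inverse: undo the lifting steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Every step is an integer add, subtract, or shift, so the round trip is exact. The chroma channels need one extra bit of range, since Co and Cg can be negative for 8-bit input.
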
The result that prompted the question is that, for some structured data such as images, heightmaps, and voxel volumes, `STRATA-preprocess + zstd -1` beats `raw input + zstd -22` on both speed and ratio.
Examples:
```text
27 MB RGB photo:
  YCoCg-R + zstd -1
  = 463x faster, 5.4% smaller than raw zstd -22

Smooth 2D heightmap:
  2D predictor + zstd -1
  = 4.2x faster, 44.5% smaller than raw zstd -22

64^3 byte volume:
  cube rotation + radial reordering + zstd -22
  = 14.4% smaller than raw zstd -22, at roughly the same speed
```
The mechanism is not zstd-specific. The transforms expose structure that is hard for a byte-stream compressor to infer from raw input, especially when the redundancy is smoothness rather than exact repeated byte sequences.
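
To illustrate the smoothness point, here is a minimal sketch of a reversible 2D plane predictor (left + up - up-left) with a selector that keeps the raw bytes as a candidate. Assumptions: `zlib` from the Python standard library stands in for zstd so the example has no dependencies, a synthetic surface replaces real terrain, and all function names are hypothetical rather than STRATA's API.

```python
import zlib


def make_surface(w: int, h: int) -> bytes:
    """Synthetic smooth 8-bit surface (assumption: stands in for a heightmap)."""
    return bytes((x * x + y * y) // 64 & 0xFF for y in range(h) for x in range(w))


def _neighbours(buf, i: int, w: int):
    """Left, up, and up-left samples, treating out-of-image samples as 0."""
    x = i % w
    left = buf[i - 1] if x else 0
    up = buf[i - w] if i >= w else 0
    up_left = buf[i - w - 1] if x and i >= w else 0
    return left, up, up_left


def predict_plane(data: bytes, w: int) -> bytes:
    """Store each sample as its residual against left + up - up_left (mod 256)."""
    out = bytearray(len(data))
    for i, v in enumerate(data):
        left, up, up_left = _neighbours(data, i, w)
        out[i] = (v - (left + up - up_left)) & 0xFF
    return bytes(out)


def unpredict_plane(resid: bytes, w: int) -> bytes:
    """Inverse pass: rebuild samples in scan order from already decoded ones."""
    out = bytearray(len(resid))
    for i, d in enumerate(resid):
        left, up, up_left = _neighbours(out, i, w)
        out[i] = (d + left + up - up_left) & 0xFF
    return bytes(out)


def pick_best(data: bytes, w: int, level: int = 1):
    """Compress every candidate, raw included, and return the smallest."""
    candidates = {"raw": data, "plane": predict_plane(data, w)}
    sizes = {name: len(zlib.compress(buf, level)) for name, buf in candidates.items()}
    best = min(sizes, key=sizes.get)
    return best, sizes


if __name__ == "__main__":
    surface = make_surface(128, 128)
    print(pick_best(surface, 128))
```

Because `raw` is always among the candidates, the selected size can never exceed the codec-only size. A real container would also need a small tag recording which transform was chosen so the decoder can invert it.
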
The repo includes a reproduction script, `python3 bench/preprocess_demo.py`, which writes a CSV covering the tested codec/dataset combinations.
My questions:
- Would this kind of shape-aware preprocessor be a plausible fit for `contrib/`?
  I am not proposing changes to zstd core. The transforms are separate reversible passes, roughly 30–300 LOC each, and the tool would emit zstd-compressible bytes.
- Is the design preference for `contrib/` to remain fully byte-stream-general, or are optional shape-aware tools acceptable there?
  I saw `contrib/seekable_format` and `contrib/pzstd`, so I wanted to ask rather than assume the boundary.
- Are there specific data classes the team would want benchmarked before considering something like this?
  My current corpus is small: terrain, CT-style volumes, RGB photos, and sensor CSV.
I am also happy to keep this as an external library if that is the better fit. I mainly wanted to check before opening a PR.
Cheers,
Richard