
feat: Add checksum-only API to ByteStorage (decouple integrity from compression) #13

@27Bslash6

Summary

ByteStorage currently couples LZ4 compression with xxHash3-64 integrity checking. All store()/retrieve() operations require both features enabled together. This prevents using the Rust xxHash3 implementation for integrity-only use cases.

Current State

// StorageEnvelope::new() requires ALL features
#[cfg(all(feature = "compression", feature = "checksum"))]
pub fn new(data: Vec<u8>, format: String) -> Result<Self, ByteStorageError>

// No checksum-only path exists

The Python Arrow/Orjson serializers bypass ByteStorage entirely and use their own Blake3 checksums because:

  1. LZ4 compression is ineffective on Arrow IPC (columnar) and JSON (already compact)
  2. ByteStorage offers no way to get just the xxHash3 checksum without the compression overhead

Proposed Change

Add a checksum-only API that provides xxHash3-64 integrity without compression:

// Option A: New feature-gated methods
#[cfg(feature = "checksum")]
impl ByteStorage {
    pub fn checksum(&self, data: &[u8]) -> [u8; 8];
    pub fn verify_checksum(&self, data: &[u8], expected: &[u8; 8]) -> bool;
}

// Option B: Separate IntegrityChecker struct
pub struct IntegrityChecker;
impl IntegrityChecker {
    pub fn compute(data: &[u8]) -> [u8; 8];
    pub fn verify(data: &[u8], expected: &[u8; 8]) -> bool;
}
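To make the shape of Option B concrete, here is a minimal sketch that compiles with the standard library only. It is an illustration under stated assumptions, not the proposed implementation: `placeholder_hash64` is a hypothetical stand-in (FNV-1a) and is not xxHash3-64, and the little-endian byte order of the returned checksum is also an assumption. A real version would presumably delegate to an XXH3 implementation such as `xxh3_64` from the `xxhash-rust` crate.

```rust
// Sketch of "Option B": a stateless integrity checker with no dependency
// on the compression feature.
//
// ASSUMPTION: `placeholder_hash64` is FNV-1a, used only so this sketch
// builds without external crates. It is NOT xxHash3-64. Real code would
// swap in an XXH3 implementation (e.g. `xxhash_rust::xxh3::xxh3_64`).

/// Stand-in 64-bit hash (FNV-1a). Replace with xxHash3-64 in real code.
fn placeholder_hash64(data: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

pub struct IntegrityChecker;

impl IntegrityChecker {
    /// Computes an 8-byte checksum over `data`.
    /// ASSUMPTION: the 64-bit hash is serialized little-endian.
    pub fn compute(data: &[u8]) -> [u8; 8] {
        placeholder_hash64(data).to_le_bytes()
    }

    /// Returns true when `data` hashes to `expected`.
    pub fn verify(data: &[u8], expected: &[u8; 8]) -> bool {
        Self::compute(data) == *expected
    }
}
```

A serializer would call `IntegrityChecker::compute(&payload)` when writing and `IntegrityChecker::verify(&payload, &stored)` when reading, without touching LZ4 at any point. Option A would expose the same two operations as methods on `ByteStorage` behind `#[cfg(feature = "checksum")]`.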

Benefits

  1. Consistency: All serializers use the same xxHash3-64 algorithm via Rust FFI
  2. Performance: Arrow/Orjson get roughly 18x faster checksums (about 36 GB/s vs about 2 GB/s for Blake3)
  3. Space: 8-byte checksums vs 32-byte Blake3 (24 bytes saved per item)
  4. No wasted cycles: Skip LZ4 where compression is ineffective
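The space figure follows from the digest sizes alone, so it scales linearly with the number of stored items. A small sketch of that arithmetic (the function name and the item counts are hypothetical, chosen only for illustration):

```rust
// Digest sizes taken from the issue text: Blake3 emits 32 bytes,
// xxHash3-64 emits 8 bytes.
const BLAKE3_DIGEST_BYTES: u64 = 32;
const XXH3_64_DIGEST_BYTES: u64 = 8;

/// Checksum bytes saved when `item_count` items carry an 8-byte
/// checksum instead of a 32-byte one (hypothetical helper).
pub fn checksum_bytes_saved(item_count: u64) -> u64 {
    item_count * (BLAKE3_DIGEST_BYTES - XXH3_64_DIGEST_BYTES)
}
```

For example, one million cached items would save 24,000,000 bytes of checksum overhead.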

Current Workaround

Use the xxhash Python package directly in the Arrow/Orjson serializers (called "Option B" in the related discussion, which is separate from the options listed under Proposed Change). This provides algorithm consistency without Rust changes, but adds a Python dependency.

Context

  • Related discussion: xxHash3 migration in ByteStorage (2025-12-05)
  • Affected files: arrow_serializer.py and orjson_serializer.py (both currently use Blake3)
  • Architecture doc: strategy/saas-protocol-v1.0.md

Acceptance Criteria

  • Checksum-only API available without enabling the compression feature
  • PyO3 bindings expose checksum functions
  • Documentation updated
  • Benchmark comparing Python xxhash vs Rust FFI overhead


Labels: enhancement (New feature or request)
