CSHARP-6037: Replace SharpCompress with DeflateStream for zlib compression#1997
CSHARP-6037: Replace SharpCompress with DeflateStream for zlib compression#1997damieng wants to merge 10 commits into
Conversation
…ib compression Removes the SharpCompress 0.30.1 dependency (which has a known moderate vulnerability and dropped .NET platform targets we support) by implementing the RFC 1950 zlib framing using System.IO.Compression.DeflateStream via a new ZlibStream wrapper class. DeflateStream uses the same native zlib backend as SharpCompress with no performance regression. Header validation and Adler-32 verification on decompress are preserved.
|
Thanks for addressing this. I was about to report the vulnerability with SharpCompress but great to see you have it in hand already. Noted about the backwards compatibility requirement. |
|
hello @damieng just to be sure which PR to follow. i also see that it's still in draft. or if Mongodb.Driver can confirm or not what's being written in the other PR ? |
| return decompressed; | ||
| } | ||
|
|
||
| private static uint UpdateAdler32(uint adler, byte[] buf, int offset, int count) |
There was a problem hiding this comment.
We need unit tests for this method.
There was a problem hiding this comment.
The Adler values are covered in the zlib tests. If we want to indepdentely validate the Adelr32 we'll need to make this internal or public.
| private static MemoryStream ReadAndDecompress(Stream input) | ||
| { | ||
| byte[] compressed; | ||
| using (var ms = new MemoryStream()) |
There was a problem hiding this comment.
Do we really need this copy of steam? It seems like we can combine this one with MemoryStream instantiated on line 116.
There was a problem hiding this comment.
all of this needs to be tested on a real aspnetcore app such as a WebApplicationFactory where aspnetcore will throw at any attempt to use SyncIO for stream
this method isn't async i suspect it won't work
There was a problem hiding this comment.
The limitation with WebApplicationFactory and async specifically relates to ASP.NET Request and Response. If we were shipping a general-purpose public ZlibStream that would definitely be a scenario we'd want to support.
In our internal case we only run this on ByteBufferStream, an in-memory buffer that gets populated by NetworkStream (which supports sync and async) on the wire side, and is then consumed by
CompressedMessageBinaryEncoder (which is already only sync today the same as all our IMessageEncoder implementations).
| _stream.WriteByte((byte)(_adler32 >> 24)); | ||
| _stream.WriteByte((byte)(_adler32 >> 16)); | ||
| _stream.WriteByte((byte)(_adler32 >> 8)); | ||
| _stream.WriteByte((byte)_adler32); |
There was a problem hiding this comment.
I would say it's surprising to write something to the stream in dispose. I would prefer to have this code in Close method.
There was a problem hiding this comment.
you probably meant CloseAsync or even FlushAsync if this needs to write stuff ?
There was a problem hiding this comment.
Close for sync code-path and CloseAsync for async code-path. Flush/FlushAsync might be used to clean the buffer for some reasons, but more data could be expected.
There was a problem hiding this comment.
Writing the trailer on Dispose matches the BCL pattern (DeflateStream, GZipStream, CryptoStream all do the same). DeflateStream's Dispose calls PurgeBuffers.
Overriding Close() goes against Microsoft's guidance which recommends instead putting it in dispose.
| // Replacement for SharpCompress.Compressors.Deflate.ZlibStream. | ||
| // Implements RFC 1950 (zlib framing) using System.IO.Compression.DeflateStream for the | ||
| // inner RFC 1951 (deflate) payload, with manual Adler-32 checksum handling. | ||
| internal sealed class ZlibStream : Stream |
There was a problem hiding this comment.
For modern targets we have built-in ZLibStream. Should we use it and keep this implementation for net472 only?
There was a problem hiding this comment.
I think we should do that in a subsequent update otherwise we have double the surface area changing with two implementations potentially in a patch release. If/when we can drop 4.7.2 I'd like it if we can go full-native with the .NET 6 one and drop this work here entirely.
There was a problem hiding this comment.
If/when we can drop 4.7.2 I'd like it if we can go full-native with the .NET 6 one and drop this work here entirely.
you can already go native for anything above net6.0
#if !NET6_0_OR_GREATER
// ENTIRE FILE
#endifI feel like this must 100% what should be done
you would also
#if NET6_0_OR_GREATER
using System.IO.Compression
#else
using The.Other.Namespace
#endifThere was a problem hiding this comment.
I feel like it's becomming urgent to also stop targetting net6.0 and start targetting at least 10.0 as mentionned previously
as for support for netfx, that would be a breaking change
There was a problem hiding this comment.
I wonder if mongo starts targeting net8.0, net9.0, net10.0 explicitly
then you can release a nuget package for these TFM by removing SharpCompress there
and I wonder if consumer on net10.0 would still have the CVE warning.
In the meanwhile another PR could deal with supporting ZLib for Netfx
There was a problem hiding this comment.
pretty sure it's probably the best alternative to rely on native when native is supported, and use #if for legacy
There was a problem hiding this comment.
We'll have to leave the targetting conversations for a different place - that's not something that can be decided in a security patch release.
There is some benefit to providing .NET 6 or greater conditionals and using the built-in. I didn't initially want double the surface area but have come around enough to try it in another commit.
There was a problem hiding this comment.
cool to see the native code being used for existing modern TFM, thx a lot
(I wonder if net10.0 resolve to ns2.0 or net6.0)
|
can we please concider targeting explicitly net10.0 ? and start looking at net11.0 explicit tfm ? there's many changes that adds more support for Zlib so it would be cool to know how to anticipate changes ahead of time to avoid possible breaking changes that api would force/offer its 6 months ahead |
|
in a more personnal comment netstandard2.1 imply runtime that would both support it and that are not already "modern dotnet" |
ZlibStream now defers all I/O until the first Read/Write — the constructor no longer touches the underlying stream. Decompression streams output through DeflateStream incrementally instead of buffering the full decompressed payload into a MemoryStream, lowering peak memory for large messages. Adler-32 is accumulated as bytes are produced and validated when DeflateStream signals EOF. Also adds explicit rejection of zlib preset dictionaries (FDICT) with a clear error rather than letting the DICTID bytes be misinterpreted as deflate data, plus a test covering that case.
- ZlibCompressor: map zlibCompressionLevel 7-9 to CompressionLevel.SmallestSize on net6.0+ (zlib level 9) to preserve high-compression user intent. Older TFMs lack SmallestSize and continue to fall back to Optimal. - ZlibStream.UpdateAdler32: switch to the standard zlib NMAX (5552) batched- modulo form. Roughly halves the CPU cost vs the per-byte modulo loop, restoring parity with SharpCompress's Adler32 on the hot path. - ZlibStream.Dispose: null-guard _compressDeflate so a failed init (e.g. header write threw on a broken underlying stream) doesn't surface an NRE on top of the original exception. - ZlibStream.EnsureDecompressInitialized: raise the minimum length floor from 7 to 8 — RFC 1951's shortest deflate payload (an empty stored block) is "03 00", two bytes. - CompressorsTests: annotate Zlib_should_generate_expected_compressed_bytes explaining why the expected bytes differ from the pre-CSHARP-6037 baseline (SharpCompress's mid-stream sync-flush vs DeflateStream's single final block) and pointing to the inbound-compat test.
The new UpdateAdler32 batches its deferred modulo every NMAX (5552) bytes, but the existing __testMessage is only ~110 bytes long so no test exercised the chunk-boundary path. Add a round-trip over a >16KB payload with varied byte values that straddles three NMAX windows.
…gaps Adds three tests that exercise paths the existing suite missed: - Zlib_roundtrip_should_handle_empty_input: round-trip with zero bytes. - Zlib_decompress_read_should_handle_nonzero_offset_and_partial_reads: drives ZlibStream.Read directly with offset>0 and small counts to verify the offset is propagated correctly into UpdateAdler32, and that calling Read past EOF returns 0 without re-validating the trailer. - Zlib_compress_level_zero_should_emit_no_compression_header: pins the FLEVEL=0 header bytes (0x78 0x01) for level 0 so the header constant can't silently swap with the FLEVEL=2 default (both round-trip fine). The empty-input test surfaced a real bug: System.IO.Compression.DeflateStream emits zero bytes for empty input (SharpCompress emitted "03 00"), so our compress path was producing a 6-byte header+Adler-only stream with no deflate payload — malformed per RFC 1951 and rejected by our own decompress with the length floor raised to 8. Fix in ZlibStream.Dispose: when no Write ever happened, emit "03 00" (empty final deflate block) before the Adler-32 trailer to produce a well-formed RFC 1950 empty zlib stream. Split lazy init into EnsureHeaderWritten (always needed at dispose) and the existing EnsureCompressInitialized (only when there's actual data to deflate).
| return n; | ||
| } | ||
|
|
||
| public override void Write(byte[] buffer, int offset, int count) |
There was a problem hiding this comment.
If we intentionally support only sync code path, should we override async methods and throw there?
There was a problem hiding this comment.
this would be possible to have async path and would be used in IAsyncDisposable
assuming we want DisposeAsync, but I can't really tell as i'm not knowing in depth this repo.
It could possibly be a small performance boost
There was a problem hiding this comment.
Nothing in the driver would be calling the async methods - we would need to change a lot more than just the zlibstream to light them up. Might be worth it at some point but well beyond the scope of this piece of work.
Given this is not a general-purpose Zlib implementation but rather something we only use internally in a very specific path throwing on the Async doesn't get us anything. We completely control how this is used and there is no async path into it.
| { | ||
| // RFC 1950 headers: CMF=0x78 (deflate, 32K window) + FLG satisfying (CMF*256+FLG)%31==0. | ||
| // FLEVEL bits in FLG are informational only (RFC 1950 §2.2) and do not affect decompression. | ||
| private static readonly byte[] s_defaultHeader = { 0x78, 0x9C }; // FLEVEL=2 |
There was a problem hiding this comment.
Minor: rename to __defaultHeader and __noCompressionHeader
| using (var ms = new MemoryStream()) | ||
| { | ||
| _stream.CopyTo(ms); | ||
| compressed = ms.ToArray(); |
There was a problem hiding this comment.
As far as I understood we need this copy only to limit size of the stream. Right? Taking into account that this stream could be relatively big, can we avoid this copy somehow or at least use some pool? Can we implement something like a size limited stream? The stream which wraps another stream but restrict it's size?
Before this commit, EnsureDecompressInitialized read the entire compressed payload into a MemoryStream and then copied it again via ToArray() before DeflateStream produced a single output byte — peak allocation ~2-3N for an N-byte compressed wire message. This was a real regression versus the pre-PR SharpCompress-based code, which streamed the zlib stream end-to-end with ~constant working memory. Introduce a private nested TrailerReservingStream that wraps the input and always holds back the last 4 bytes (the Adler-32 trailer) while streaming everything before that to DeflateStream. Lazy: 4-byte priming happens on the first Read; the held-back trailer is exposed once source EOFs. EnsureDecompress now reads only the 2-byte zlib header up front, validates it, then constructs the TrailerReservingStream — no whole-input buffering. The Read trailer-validation block pulls one extra byte from TrailerReserving on EOF: DeflateStream can detect BFINAL=1 inside bytes it already buffered internally and signal EOF without ever asking our source for another byte; the forced pull makes the source observe EOF so the held-back 4 bytes become the validated trailer. Any byte produced by that pull means trailing data after the deflate payload, which we now reject. Peak input-side allocation is now ~8KB reusable workspace + 4-byte trailer, independent of message size — matching the pre-PR SharpCompress profile. Tests: - Zlib_decompress_should_handle_one_byte_at_a_time_input: round-trips through a source that returns one byte per Read, exercising the n<=4 rotation branch of TrailerReservingStream under pathological fragmentation. - Zlib_decompress_should_stream_without_buffering_entire_input: 1MB round-trip to confirm streaming works at realistic wire-message scale.
Summary
Removes the
SharpCompress 0.30.1dependency fromMongoDB.Driver. This package has a known moderate severity vulnerability (GHSA-6c8g-7p36-r338) and the fixed version drops .NET targets we support.Zlib compression support is replaced with a thin
ZlibStreamwrapper aroundSystem.IO.Compression.DeflateStream, implementing the same RFC 1950 framing (2-byte header + deflate payload + Adler-32 checksum) that SharpCompress provided.Header validation and Adler-32 checksum verification on decompress are preserved.
Compression level mapping
System.IO.Compressiondoes not expose zlib's full 0–9 integer level range. The mapping is:Levels 2–8 are rarely configured explicitly in practice. Level 9 is the most notable change. All output remains valid RFC 1950 zlib and interoperates correctly with the server.
Approach
This is preferrable to updating SharpCompress as we use a tiny fraction of its functionality but are flagged when a vulnerability appears even in functionality we don't use such as this latest one in archive reading. The fix is in a version that has dropped support for .NET targets we still support in this 3.x release.
Given how trivial the actual functionality we require is it makes no sense to continue with this dependency for the sake of adding a 2-byte header and Adler32 checksum over a DeflateStream.
Superceeds #1996