Hacker News
Gzip and Brotli Compression Level Estimator (paulcalvano.com)
62 points by BonoboIO on Sept 9, 2022 | hide | past | favorite | 17 comments


Slightly off topic, but is there a term for the idea that, in order to adequately estimate something, you'd basically have to do all the work involved in actually doing the thing?

I guess compression is something that would come under that?


It is actually possible to estimate the compression level without compressing everything again, because different levels use different strategies which can be identified in some cases. zlib in particular has three strategies [1], and preflate [2] leverages this to store the deflate stream reconstruction data in relatively few bits.

[1] https://github.com/madler/zlib/blob/21767c6/deflate.c#L134-L...

[2] https://github.com/deus-libri/preflate/blob/master/preflate_...


You need to scan the entire stream, though, to get an idea of what kind of data it is. I suppose you could sample the first x%.


If you only see the static Huffman tree across the whole compressed deflate stream, that gives you an upper limit on the compression level, and you don't actually need to decompress (i.e. undo LZ77) anything. If you meant that you still have to do work proportional to the input length, then yes, of course, but that is very different from actual (de)compression.


A block can end at any time, and another Huffman block with another table can start. There is no way to know that without decoding all symbols.

Decoding the Huffman codes is by far the most expensive operation, so you don't save much.

To decode match lengths you also need to decode all symbols.


I think you have missed the original context. The very problem proposed by the OP is to estimate the compression level of a given compressed stream. For zlib and gzip there are 9 possible levels (excluding the uncompressed level 0 for zlib), so the naive approach is to decompress the input and compress it again at each of the nine levels. I said that we can instead do a partial decompression to collect statistics and rule some levels out, so that we need to recompress far fewer than nine times. You still need to do the full decompression to be able to recompress anyway, but that wasn't the point at all.
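The naive nine-recompressions approach described above, sketched in Python (the helper name is made up here; it assumes the stream was produced by the same zlib build with default settings):

```python
import zlib

def guess_zlib_level(compressed: bytes):
    # Decompress once, recompress at each of the nine levels, and report
    # which levels reproduce the stream byte-for-byte.
    plain = zlib.decompress(compressed)
    return [lvl for lvl in range(1, 10)
            if zlib.compress(plain, lvl) == compressed]

original = zlib.compress(b"hello world, " * 500, 4)
# Neighbouring levels may produce identical output for simple inputs,
# so the result is a candidate set rather than a single level.
print(4 in guess_zlib_level(original))  # True
```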


There is no way to "rule some levels out". The levels differ only in the quality of the LZ matches. Slower levels search through more potential matches -- you cannot see that from a compressed stream.

You can compare the output from each level and that will be your guess, assuming of course that zlib was used in the first place.


> The levels only differ in the quality of the lz matches. Slower levels search further potential matches - you cannot see that from a compressed stream.

If you see a match that would not be visible at a given compression level, or conversely if you don't see a match that should have been selected at a given compression level, then it is clear that that level couldn't have been used to produce that compressed stream. I think preflate actually does this, maintaining multiple hash tables over the course of decompression.

That said, I see in hindsight where my explanation was slightly off (sorry!). A bitstream composed only of blocks with the static Huffman table can be produced in multiple ways. One such case is of course a very specific input for which it is more beneficial not to use dynamic Huffman tables (that said, zlib's tallying is AFAIK greedy, so this should be very rare). Another case is Z_FIXED, which was actually what I had in mind when writing that---I should have said compression parameters instead. But I stand by the claim that many parameter combinations can be excluded via partial or full decompression, and for practical purposes this is enough to count as "estimation".


> If you see a match that would not be visible from given compression level[...]

Yes, but you'd need to reconstruct the hash table and the link table. For that you'd need to decompress fully. You could compare matches, but you are so far along at this point you might as well just decompress and recompress, comparing the output bit by bit.


I think we're talking past each other. I was talking about estimating how much space uncompressed data would take up if you compressed it. Not the inverse.


That's compression ratio estimation, not compression level (or parameter) estimation. I thought you were talking about the latter, as the former is virtually impossible (some archivers do sampling as you suggested, but it's only an opportunistic heuristic).
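The sampling heuristic mentioned above might look like this (function name and thresholds are made up for illustration; it is only opportunistic because a prefix need not be representative of the whole file):

```python
import zlib

def estimate_ratio(data: bytes, sample_frac=0.05, min_sample=4096):
    # Compress only a prefix sample and treat its ratio as an estimate
    # for the whole input.
    n = max(min_sample, int(len(data) * sample_frac))
    sample = data[:n]
    return len(zlib.compress(sample)) / len(sample)

data = (b"A" * 1000 + b"some mixed text ") * 500
ratio = estimate_ratio(data)
full = len(zlib.compress(data)) / len(data)
print(round(ratio, 3), round(full, 3))  # close only for homogeneous data
```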


Huh, only "based on the content length"? As if the contents don't matter? I thought this would be some interesting work that makes a good size estimate in 2 ms based on the contents, so you don't have to spend seconds or minutes compressing large files to see which compression method works best for them.

Why, then, not just let me enter a file size that I'm curious about, if that's the only input it uses? Heck, why not just show a result graph with size on the X axis so you don't need to enter anything at all but just get the info the tool has to offer at a glance?


A few things are missing:

1. Compression time

2. De-compression time

3. Memory allocation during compression

4. Memory allocation during de-compression


For Gzip, memory usage during decompression is bounded statically by 32K + constant decompressor overhead; for Zstandard, it’s 128M (tunable) on non-ultra levels (mentioned in the header); I don’t know about Brotli but it probably has something in that vein as well.


Brotli LZ77 window is at most 16 MiB (RFC 7932, section 9.1), unless you are using the non-standard Large Window Brotli extension in which case it is virtually unlimited.


It would be interesting to see the time tradeoff for each compression level.


Programming language: C

Thank you, but I have enough nightmares with C.



