Hacker News
Gzip and Brotli Compression Level Estimator (paulcalvano.com)
62 points by BonoboIO on Sept 9, 2022 | hide | past | favorite | 17 comments


Slightly off topic, but is there a term for the idea that, in order to adequately estimate something, you'd basically have to do all the work involved in actually doing the thing?

I guess compression is something that would come under that?


It is actually possible to estimate the compression level without compressing everything again, because different levels use different strategies which can be identified in some cases. zlib in particular has three strategies [1], and preflate [2] leverages this to store the deflate stream reconstruction data in relatively few bits.

[1] https://github.com/madler/zlib/blob/21767c6/deflate.c#L134-L...

[2] https://github.com/deus-libri/preflate/blob/master/preflate_...


You need to scan the entire stream, though, to get an idea of what kind of data it is. I suppose you could sample the first x%.


If you only see the static Huffman tree across the whole compressed deflate stream, that gives you an upper limit on the compression level, and you don't actually need to decompress (i.e. undo LZ77) anything. If you meant that you still have to do work proportional to the input length, then yes, of course, but that is very different from actual (de)compression.


A block can end at any time, and another Huffman block with another table can start. There is no way to know that without decoding all symbols.

Decoding the Huffman codes is by far the most expensive operation, so you don't save much.

To decode match lengths you also need to decode all symbols.


I think you have missed the original context. The very problem proposed by the OP is to estimate the compression level of a given compressed stream. For zlib and gzip there are 9 possible levels (excluding the uncompressed level 0 for zlib), so the naive approach is to decompress the input and compress it again at each of the nine levels. I said that we can instead do a partial decompression to collect statistics and rule some levels out, so that we need to recompress far fewer than nine times. You still need to do the full decompression to be able to recompress anyway, but that wasn't the point at all.
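The naive nine-recompressions approach described above, sketched in Python (the helper name is made up here; it assumes the stream was produced by the same zlib build with default settings):

```python
import zlib

def guess_zlib_level(compressed: bytes):
    # Decompress once, recompress at each of the nine levels, and report
    # which levels reproduce the stream byte-for-byte.
    plain = zlib.decompress(compressed)
    return [lvl for lvl in range(1, 10)
            if zlib.compress(plain, lvl) == compressed]

original = zlib.compress(b"hello world, " * 500, 4)
# Neighbouring levels may produce identical output for simple inputs,
# so the result is a candidate set rather than a single level.
print(4 in guess_zlib_level(original))  # True
```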


There is no way to "rule some levels out". The levels differ only in the quality of the LZ matches. Slower levels search through more potential matches -- you cannot see that from a compressed stream.

You can compare the output from each level and that will be your guess, assuming of course that zlib was used in the first place.


> The levels only differ in the quality of the lz matches. Slower levels search further potential matches - you cannot see that from a compressed stream.

If you see a match that would not be visible at a given compression level, or conversely if you don't see a match that should have been selected at a given compression level, then it is clear that that level couldn't have been used to produce that compressed stream. I think preflate actually does this, maintaining multiple hash tables over the course of decompression.

That said, I see in hindsight where my explanation was slightly off (sorry!). A bitstream composed only of blocks with the static Huffman table can be produced in multiple ways. One such case is of course a very specific input for which it is more beneficial not to use dynamic Huffman tables (that said, zlib's tallying is AFAIK greedy, so this should be very rare). Another case is Z_FIXED, which was actually what I had in mind when writing that---I should have said compression parameters instead. But I stand by the claim that many parameter combinations can be excluded via partial or full decompression, and for practical purposes this is enough to count as "estimation".


> If you see a match that would not be visible from given compression level[...]

Yes, but you'd need to reconstruct the hash table and the link table. For that you'd need to decompress fully. You could compare matches, but you are so far along at this point you might as well just decompress and recompress, comparing the output bit by bit.


I think we're talking past each other. I was talking about estimating how much space uncompressed data would take up if you compressed it. Not the inverse.


That's compression ratio estimation, not compression level (or parameter) estimation. I thought you were talking about the latter, as the former is virtually impossible (some archivers do sampling as you suggested, but it's only an opportunistic heuristic).
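The sampling heuristic mentioned above might look like this (function name and thresholds are made up for illustration; it is only opportunistic because a prefix need not be representative of the whole file):

```python
import zlib

def estimate_ratio(data: bytes, sample_frac=0.05, min_sample=4096):
    # Compress only a prefix sample and treat its ratio as an estimate
    # for the whole input.
    n = max(min_sample, int(len(data) * sample_frac))
    sample = data[:n]
    return len(zlib.compress(sample)) / len(sample)

data = (b"A" * 1000 + b"some mixed text ") * 500
ratio = estimate_ratio(data)
full = len(zlib.compress(data)) / len(data)
print(round(ratio, 3), round(full, 3))  # close only for homogeneous data
```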


Huh, only "based on the content length"? As if the contents don't matter? I thought this would be some interesting work that makes a good size estimate in 2 ms based on the contents, so you don't have to spend seconds or minutes compressing large files to see which compression method works best for them.

Why, then, not just let me enter a file size that I'm curious about, if that's the only input it uses? Heck, why not just show a result graph with size on the X axis so you don't need to enter anything at all but just get the info the tool has to offer at a glance?


A few things are missing:

1. Compression time

2. De-compression time

3. Memory allocation during compression

4. Memory allocation during de-compression


For Gzip, memory usage during decompression is bounded statically by 32K + constant decompressor overhead; for Zstandard, it’s 128M (tunable) on non-ultra levels (mentioned in the header); I don’t know about Brotli but it probably has something in that vein as well.


Brotli LZ77 window is at most 16 MiB (RFC 7932, section 9.1), unless you are using the non-standard Large Window Brotli extension in which case it is virtually unlimited.


It would be interesting to see the time tradeoff for each compression level.


Programming language: C

Thank you, but I have enough nightmares with C.



