
Tar doesn't use any sort of index like zip does, so to extract the specified file the server side would need to parse through possibly the entire file just to see if the requested file is there, and then start streaming it. Requests for files that aren't in the tar archive would be prohibitively expensive.
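As a rough sketch of what that scan looks like (Python's stdlib tarfile, with hypothetical names; not the implementation discussed here):

  import tarfile

  # Sketch: tar has no central index, so finding one member means walking
  # the 512-byte headers from the start of the archive (and decompressing
  # everything before it if the archive is gzip-compressed).
  def extract_one(archive_path, wanted_name):
      with tarfile.open(archive_path, mode="r:*") as tf:
          for member in tf:  # sequential scan
              if member.isfile() and member.name == wanted_name:
                  return tf.extractfile(member).read()
      return None  # a file that isn't there still costs a full scan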

There are definitely ways to do it without those problems, though. They just wouldn't be quite as simple as the approach used for supporting zip.



You could pre-index them I suppose. Though even that would only work with a subset of compression methods or no compression.
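A minimal sketch of that kind of pre-indexing for an uncompressed tar (hypothetical helper, not an existing feature):

  import tarfile

  # One sequential pass records where each member's data starts, so later
  # reads can seek straight to it instead of re-scanning the archive.
  def index_tar(path):
      index = {}
      with tarfile.open(path, mode="r:") as tf:  # uncompressed tar only
          for member in tf:
              if member.isfile():
                  index[member.name] = (member.offset_data, member.size)
      return index

  # Serving then becomes: seek(offset); read(size) - no full scan needed.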


We considered TAR, but indexing requires reading back and decompressing the entire archive.

This may be feasible for small TAR files, and for a single PutObject you could index while uploading. However, for multipart objects the parts can arrive in any order, so you are forced to read the object back. This would lead to unpredictable response times.

Compare that to reading the directory of a zip, which even on big files is maybe a couple of megabytes at most.
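For comparison, a sketch of how little work locating a zip's directory takes (plain zip assumed, no zip64 and no archive comment):

  import struct

  # The End Of Central Directory record sits in the last 22 bytes and
  # points at the central directory, so the whole "index" is reachable
  # with two seeks/range reads.
  def read_central_directory(path):
      with open(path, "rb") as f:
          f.seek(-22, 2)                     # EOCD record
          eocd = f.read(22)
          assert eocd[:4] == b"PK\x05\x06"
          cd_size, cd_offset = struct.unpack("<II", eocd[12:20])
          f.seek(cd_offset)
          return f.read(cd_size)             # a few MB even for huge zips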

Add to that that tar.gz requires you to decompress from the start to reach any given offset. You could recompress while indexing, but an object store mutating your data is IMO a no-no.


S3 is "eventually consistent", so I don't think indexing in the background would be such a big deal. But yeah, like I said this would only work for no-compression or those schemes that are seekable (not gzip).

In any case it is definitely a lot more work than ZIP.


No, S3, like MinIO, has read-after-write consistency.

So indexing would block either writes or reads until it is done. We do block when doing the zip indexing, but that is much more lightweight - and we limit the ZIP directory to 100MB. That way we don't risk long-blocking index operations.


I see. Indeed that is a potentially long time to block.


IIRC gzip can't handle this, but bzip2 can; a guy I know wrote an offline Wikipedia app for the original iPhone and had to crunch things down a lot. He used bzip2 because you can skip ahead to a chunk without having to process the previous or subsequent chunks.

Then he just had to write some code to index article names based on which chunk(s) they were in, and boom, random-access compressed archive.
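A toy version of that scheme (bz2 from the Python stdlib; the chunking and index layout here are made up for illustration):

  import bz2, json

  # Pack articles into independently compressed chunks and keep an index
  # of title -> chunk number; reading one article only decompresses its
  # own chunk, never the whole archive.
  def build_chunks(articles, per_chunk=100):
      titles, chunks, index = list(articles), [], {}
      for i in range(0, len(titles), per_chunk):
          batch = {t: articles[t] for t in titles[i:i + per_chunk]}
          for t in batch:
              index[t] = len(chunks)
          chunks.append(bz2.compress(json.dumps(batch).encode()))
      return chunks, index

  def read_article(chunks, index, title):
      batch = json.loads(bz2.decompress(chunks[index[title]]))
      return batch[title]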


This is basically exactly what we do. We created a cloud optimised tar (cotar) [1] by building a hash index of the files inside the tar.

I work on serving tiled geospatial data [2] (Mapbox vector tiles) as slippy maps, where we serve millions of small (mostly <100KB) files to our users. Our data only changes weekly, so we precompute all the tiles and store them in a tar file in S3.

We compute an index for the tar file, then use S3 range requests to serve the tiles to our users. This means we can generally fetch a tile from S3 with 2 requests (or 1 if the index is cached), typically in ~20-50ms.
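In rough outline (not the actual cotar code; the URL and numbers below are made up), serving a tile once the index has given you an offset and size looks like:

  import urllib.request

  # One HTTP Range request pulls just the tile's bytes out of the tar
  # object, without downloading the whole archive.
  def fetch_tile(url, offset, size):
      req = urllib.request.Request(url)
      req.add_header("Range", "bytes=%d-%d" % (offset, offset + size - 1))
      with urllib.request.urlopen(req) as resp:
          return resp.read()

  # tile = fetch_tile("https://example-bucket.s3.amazonaws.com/tiles.tar",
  #                   offset=123456, size=48213)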

Full coverage of the world with Mapbox vector tiles is around 270M tiles and a ~90GB tar file, which can be computed from OpenStreetMap data [3].

> Though even that would only work with a subset of compression methods or no compression.

We compress the individual files as a workaround. There are options for indexing a compressed (gzip) tar file, but the benefits of a compressed tar vs compressed files are small for our use case.

[1] https://github.com/linz/cotar (or wip rust version https://github.com/blacha/cotar-rs) [2] https://github.com/linz/basemaps or https://basemaps.linz.govt.nz [3] https://github.com/onthegomap/planetiler


Why not upload those files separately, or in ZIP format?


> Why not upload those files separately,

Doing S3 PUT requests for 260M files every week would cost around $1300 USD/week, which was too much for our budget.
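(Back-of-the-envelope, assuming S3's standard rate of roughly $0.005 per 1,000 PUT requests:)

  260_000_000 / 1_000 * 0.005  # ~1300 USD per weekly refresh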

> or in ZIP format?

We looked at zips, but due to the way the header (well, the central file directory) is laid out, finding a specific file inside the zip would require the system to download most of the CFD.

The zip CFD is basically a list of header entries that vary in size (46 bytes plus the file name, extra field, and comment lengths). To find a specific file you have to iterate the CFD until you find the entry you want.

Assuming you have a smallish archive (~1 million files), the CFD for the zip would be somewhere on the order of 50MB+ (depending on filename length).
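A sketch of that linear scan over central directory entries (simplified; real zips may also need zip64 handling):

  import struct

  # Entries are variable length (46 fixed bytes plus name/extra/comment),
  # so there is no way to jump straight to a particular file's entry.
  def iter_cd_names(cd):
      pos = 0
      while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
          name_len, extra_len, comment_len = struct.unpack_from("<HHH", cd, pos + 28)
          yield cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace")
          pos += 46 + name_len + extra_len + comment_len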

Using a hash index you know exactly where in the index to look for the header entry, so you can use a single range request to load just that entry:

  offset = hash(file_name) % slot_count
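A fleshed-out sketch of that lookup (illustrative layout, not cotar's exact on-disk format):

  import hashlib, struct

  SLOT = struct.calcsize("<QQQ")  # each slot: hash, data offset, size

  def find_entry(index_bytes, slot_count, file_name):
      # Hash the name, jump straight to its slot, then linear-probe on
      # collisions; the returned offset/size feed an S3 range request.
      h = int.from_bytes(hashlib.sha256(file_name.encode()).digest()[:8], "little") or 1
      slot = h % slot_count
      for _ in range(slot_count):
          slot_hash, offset, size = struct.unpack_from("<QQQ", index_bytes, slot * SLOT)
          if slot_hash == h:
              return offset, size
          if slot_hash == 0:
              break  # empty slot: file not in the archive
          slot = (slot + 1) % slot_count
      return None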
Another file format gaining popularity recently is PMTiles [1], which uses a tree index; however, it is specifically for tiled geospatial data.

[1] https://github.com/protomaps/PMTiles


Nice tools!

When it is server-side, reading a 50MB CFD is a small task. And once it is read, we can store the zipindex for even faster access.

We made 'zipindex' purposely to be a sparse, compact, but still reasonably fast representation of the CFD - just enough to be able to serve the file. Typically it is around an 8:1 reduction on the CFD, but of course it depends a lot on your file names, as you say (the index is zstandard compressed).

Access time from fully compressed data to a random file entry is around 100ms with 1M files. Obviously if you keep the index in memory, it is much less. This time is pretty much linear in the number of files, which is why we recommend aiming for 10K files per archive; that keeps the impact pretty minimal.
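As a toy illustration of the "just enough to serve the file" idea (not the zipindex format; zlib stands in for zstandard since it is in the stdlib):

  import json, zlib

  # Keep only what serving needs per entry - name, header offset,
  # compressed size, method - and compress the whole index.
  def build_sparse_index(entries):
      # entries: [(name, header_offset, compressed_size, method), ...]
      return zlib.compress(json.dumps(entries).encode())

  def lookup(index_blob, name):
      for entry in json.loads(zlib.decompress(index_blob)):
          if entry[0] == name:
              return entry
      return None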


You mean the cost of the PUT requests becomes significant. That makes sense since AWS doesn't charge for incoming bandwidth. Thanks!



