
I've been experimenting with taking this self-description paradigm even further, in a file format I put together for ephemeral data in my search engine.

Since I ended up building a custom library for this, I wanted to solve the portability problem by making the format stupidly simple to reverse engineer. The convention I settled on is that each column (and each supporting column) is its own file, with a file name that describes its format and role.

So a real-world production table looks like this if you ls in the directory (omitting a few columns for brevity):

  combinedId.0.dat.s64le.bin
  documentMeta.0.dat.s64le.bin
  features.0.dat.s32le.bin
  size.0.dat.s32le.bin
  termIds.0.dat-len.varint.bin
  termIds.0.dat.s64le[].zstd
  termMetadata.0.dat-len.varint.bin
  termMetadata.0.dat.s8[].zstd

The design goal is that, based on nothing but the ls output, someone who has never seen the code of the library producing the files should be able to trivially write code that reads them.
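To make that concrete, here is a rough sketch of what such a reader might start from, going only off the listing above. Everything about the naming scheme is a guess on my part, not a documented spec: I'm assuming the pattern is <column>.<part>.<role>.<type>.<encoding>, that s32le/s64le mean signed little-endian integers, that "[]" marks an array-valued column, and that ".bin" means uncompressed. The names parse_name, ColumnFile and decode_fixed are made up for illustration; varint and zstd decoding are left out.

```python
import re
import struct
from typing import NamedTuple

# Assumed file name layout: <column>.<part>.<role>.<type>[[]].<encoding>
NAME_RE = re.compile(
    r"(?P<column>\w+)\.(?P<part>\d+)\.(?P<role>dat(?:-len)?)"
    r"\.(?P<type>\w+?)(?P<array>\[\])?\.(?P<encoding>bin|zstd)"
)

# struct format codes for the fixed-width types seen in the listing
# (assumption: "s" = signed, "le" = little-endian).
FIXED_WIDTH = {"s8": "<b", "s32le": "<i", "s64le": "<q"}


class ColumnFile(NamedTuple):
    column: str     # e.g. "termIds"
    part: int       # e.g. 0
    role: str       # "dat" or "dat-len"
    type: str       # e.g. "s64le", "varint"
    is_array: bool  # True when the type carries a "[]" suffix
    encoding: str   # "bin" or "zstd"


def parse_name(filename: str) -> ColumnFile:
    """Split a column file name into its (assumed) components."""
    m = NAME_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognized column file name: {filename!r}")
    return ColumnFile(
        column=m["column"],
        part=int(m["part"]),
        role=m["role"],
        type=m["type"],
        is_array=m["array"] is not None,
        encoding=m["encoding"],
    )


def decode_fixed(raw: bytes, type_name: str) -> list[int]:
    """Decode an uncompressed buffer of fixed-width integers."""
    fmt = FIXED_WIDTH[type_name]
    return [value for (value,) in struct.iter_unpack(fmt, raw)]
```

With that, reading one of the plain fixed-width columns would be something like decode_fixed(open("size.0.dat.s32le.bin", "rb").read(), parse_name("size.0.dat.s32le.bin").type), assuming the guesses above hold.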


Internally the design of Vortex is very similar. The file consists of a whole bunch of "messages" (your files), which then have some metadata attached, and the read logic decides which messages it needs when.


Do you have a deeper writeup of this anywhere?


Not yet, but I will compile one at some point. I'm in the middle of moving right now so I don't quite have the time to sit down and finish the write-up...



