"By way of simple argument: would anyone advise using a source control system that only kept the last few hundreds of commits? Why do we treat our data differently?"
What? Come on, really? My database - only the latest version of all the data that gets updated frequently - holds terabytes of data. My whole git repo with full history is about 500 MB, and that's excluding large assets. It's a completely different ball game. Sure, if I could buy 100 TB disks cheaply and set things up so I could access data across hundreds of them in real time, then there wouldn't be a problem. Obviously, I can't do that.
As the author of the blog post points out: the little thing called reality gets in the way of such nice and perfect arguments and ideas.
When talking about architecture at this level, you need to take a longer view. Once upon a time, a gigabyte database was mammoth beyond thought. Also, you're taking it as given that your data has its present storage footprint even though alternative architectures might have very different properties. It's worth noting that a purely additive database can be very aggressive about compression.
But also, petabyte datasets are downright cheap to deal with these days. They'll look more like that half-gigabyte source code repo sooner than you're considering.
We're obviously coming from different backgrounds here - do you work at Google or somewhere similar?
'Downright cheap' is a Heroku instance or two. 'Affordable' is 20 or so servers with a few TB of drive space each in a cluster with a few backups. Petabytes of data need a huge number of servers and dedicated personnel to look after them - I don't know how that can be regarded as 'downright cheap'.
If you're in a position to handle petabytes of data, you're probably going to roll your own data center with custom software à la Google's Bigtable. This definitely isn't the market these little startups are targeting, as far as I can see. The OP's critique is very well founded.
> petabyte datasets are downright cheap to deal with these days
Are you from the future? In no universe is a petabyte database remotely close to being cheap, even for large organizations. At this point I am seriously questioning your industry knowledge.
You're right, but the vast majority of databases aren't that big. You are very explicitly advised against using something like Datomic if you think you will need to purge data at regular intervals.
"but the vast majority of databases aren't that big" - probably because they don't keep all copies of everything forever. Even a basic blog system would grow huge based on spam comments alone.
But you don't have to keep redundant copies around forever. Immutability means nothing in the past can change, which means it's perfectly fine for future entries to reference old values. New rows can essentially be deltas, much like version control systems.
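The point about delta-based rows can be sketched concretely. Below is a minimal, hypothetical append-only store (the class and field names are mine, not from any real product): the first write of a key stores the full value, and every later write stores only the changed fields plus a reference to the previous immutable entry - which is safe precisely because nothing in the past can change.

```python
class AppendOnlyStore:
    """Toy append-only store where later versions are deltas on earlier ones."""

    def __init__(self):
        self.log = []      # immutable log of entries, never rewritten
        self.latest = {}   # key -> index of that key's most recent entry

    def put(self, key, value):
        """Append a new version; store only the fields that changed."""
        prev_idx = self.latest.get(key)
        if prev_idx is None:
            entry = {"key": key, "base": None, "delta": dict(value)}
        else:
            prev = self.get(key)
            # Deltas here only record changed/added fields (removals are
            # ignored to keep the sketch short).
            delta = {k: v for k, v in value.items() if prev.get(k) != v}
            entry = {"key": key, "base": prev_idx, "delta": delta}
        self.log.append(entry)
        self.latest[key] = len(self.log) - 1

    def get(self, key, at=None):
        """Reconstruct a value by replaying deltas back from the base entry."""
        idx = self.latest[key] if at is None else at
        chain = []
        while idx is not None:
            entry = self.log[idx]
            chain.append(entry["delta"])
            idx = entry["base"]
        value = {}
        for delta in reversed(chain):
            value.update(delta)
        return value
```

Note that `get(key, at=old_index)` gives you any historical version for free - the same structural-sharing trick version control systems use.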
I disagree that Datomic necessarily grows huge. I'm just saying don't use it if you know you're going to be storing terabytes and don't have enough disk space to handle that without truncating data at regular intervals.
Git is not append-only, though. It has a stop-the-world GC which compacts data and removes objects not referenced by any current branch, tag, or similar object. So even if you want to keep the history, you may not want to keep everything.
It's important to make a distinction between semantic and physical datasets. Git is append-only (or more properly, strictly accumulative in an information sense) from a semantic perspective. From a user's perspective we don't really care what the packfiles etc. look like. Git will also forget timelines (the transitive closure of a reference) you don't care about if you tell it to.
So it's important to distinguish purely persistent data structures in the sense of Okasaki et al. from append-only datasets in the semantic sense. Git still qualifies as the latter even though its implementation uses mutability as appropriate.
This is a good example of where the OP's critique is painting with too broad a brush. If we think from first principles: indexes are derived state, not primary state. The value log is the ultimate authority; the index can be regenerated at will. There is a huge design space of indexing over append-only logs that remains largely unexplored.
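To make "indexes are derived state" concrete, here's a tiny sketch (my own illustrative names, not any particular product's API): the log is the only authority, and the key-to-latest-value index is a pure function of it that can be thrown away and rebuilt at any time.

```python
# The append-only value log: the sole primary state.
log = [
    ("put", "a", 1),
    ("put", "b", 2),
    ("put", "a", 3),   # a later entry supersedes the earlier one for key "a"
]

def build_index(log):
    """Derive the key -> latest-value index purely from the log."""
    index = {}
    for op, key, value in log:
        if op == "put":
            index[key] = value   # later entries win
    return index

index = build_index(log)   # disposable: delete it and regenerate at will
```

Because the index is derived, a crashed or corrupted index never loses data; you just replay the log.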
I personally strongly believe databases will move emphatically in this direction because the benefits are so great. Given the typical tiered web application, wouldn't you like it if, when you get an exception notification, instead of just a text blob of a stack trace you could resume a continuation of the state of the system as the user saw it at the time of the error? Navigate forward and backward in time at will, even with nonlinear jumps? Edit the source code of the past and then replay toward the future to see if it resolves or reproduces the exception? There are huge opportunities here.
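The time-travel idea falls out naturally once all state changes live in an event log. A minimal sketch, with hypothetical event names of my own choosing: replaying a prefix of the log reconstructs the state at any past moment, and "editing the past" is just patching an event and replaying forward.

```python
def replay(events, apply_event, upto=None):
    """Fold events into state; stop early to view any past moment."""
    state = {}
    for i, event in enumerate(events):
        if upto is not None and i >= upto:
            break
        apply_event(state, event)
    return state

def apply_event(state, event):
    """Illustrative handler: events are ("set", key, value) tuples."""
    kind, key, value = event
    if kind == "set":
        state[key] = value

events = [
    ("set", "cart", []),
    ("set", "cart", ["book"]),
    ("set", "user", "ada"),
]

# State exactly as the user saw it after the second event:
past = replay(events, apply_event, upto=2)

# "Edit the past" and replay toward the future to reproduce or resolve a bug:
patched = list(events)
patched[1] = ("set", "cart", ["pen"])
future = replay(patched, apply_event)
```

A production system would need continuations, side-effect capture, and so on - this only shows the core replay mechanic.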
I will admit it's not immediately clear how to implement a database that reflects these ideas. It's going to take some experimentation and learning, and that will involve some failed experiments. What bugged me most about the OP was the attitude of "we're done, shut up with this other stuff." We are emphatically not done. It's ironic to note that at the moment MySQL was born, it faced the same FUD-style criticism.
Your comment triggered a thought of mine again: what if we had a perfect data store? What would it look like? I think it would have these properties:
- instant access to the most current data from anywhere in the world (no need for replication)
- infinite storage space (no need for gc)
- instant access to any value in the dataset (any joins are ok, indexing not needed)
Obviously we are not there, but suppose we had such a tool: I think going the append-only way would make sense, and data deletion would only happen proactively (e.g. for legal or privacy reasons).
Also, a good database is in fact trying to mimic this perfect storage, and some of its users are hitting the boundaries. Anyway, when you feel that a database is not perfect in this sense and need to adjust your usage, it means you've met a boundary - one which may not be there anymore in a few years.
Another angle is that maybe there is a physical limit to data transfer that will stop the evolution of datastores, a bit like the speed of light. Or would the speed of light already put noticeable boundaries on this hypothetical perfect datastore?
> wouldn't you like it if when you get an exception notification instead of just a text blob of a stack trace you can resume a continuation of the state of the system as the user saw it at the time of the error?
This, along with capturing headers (perhaps via webserver logs) as well as POST data, would actually be incredibly powerful. There is a certain difficult class of bugs that this would greatly help mitigate.
> What bugged me most about the OP was the attitude of "we're done, shut up with this other stuff."
The OP didn't say that at all, but instead observed (in a manner that you have quite overwhelmingly failed to refute, relying instead upon emotives and non sequiturs, betraying a bizarre and completely unsupported defensiveness that seems to be some variation of "but it's new, man! It's new!") that everything old is new again, over and over again.
People have been trying these things for literally decades, and many of the current products that lead categories have some elements of those designs. If someone is pitching a very old idea as (r)evolutionary, it is worthy of discussion, whether that offends your sensibilities or not.
And it's odd that you mentioned MySQL, given that yes, it did face those "criticisms" - because it made the same old mistakes that so many products before it made. FUD? Do you think that since then Oracle became more like MySQL, or the opposite? I'll answer that for you -- MySQL abandoned all of the supposed "advantages" that were born largely in ignorance and adopted the designs of the products that came before.
Well said!