Hey really cool! Tcpdump co-author here. If you're interested in the origin story of the tcpdump language, bpf, and pcap, I had fun putting together a talk a number of years ago... https://www.youtube.com/watch?v=XHlqIqPvKw8
For sure... the use of JSON here seems orthogonal to Jamie's point of constructing complex nested values in SQL with a single backend query. Modern SQLs support all this as native types (presuming the result can fit in a homogeneous relational type, which it can in this case), e.g., Jamie's query can be written with a record expression (duckdb dialect):
with theTitle as (
  from 'title.parquet'
  where tconst = 'tt3890160'
),
principals as (
  select array_agg({id: principal.nconst, name: primaryName, category: category})
  from 'principal.parquet' principal, 'person.parquet' person
  where principal.tconst = (from theTitle select tconst)
    and person.nconst = principal.nconst
),
characters as (
  select array_agg(c.character) as characters, p.u.name
  from 'principal_character.parquet' c
  join (select unnest((from principals)) as u) p
    on c.character is not null and u.id = c.nconst and c.tconst = (select tconst from theTitle)
  group by p.u
)
select {
  title: (select primaryTitle from theTitle),
  director: list_transform(
    list_filter((from principals), lambda elem: elem.category = 'director'),
    lambda elem: elem.name),
  writer: list_transform(
    list_filter((from principals), lambda elem: elem.category = 'writer'),
    lambda elem: elem.name),
  genres: (select genres from theTitle),
  characters: (select array_agg({name: name, characters: characters}) from characters)
} as result
And if you query typeof on the result, you'll get:
STRUCT(
  title VARCHAR,
  director VARCHAR[],
  writer VARCHAR[],
  genres VARCHAR,
  characters STRUCT(
    "name" VARCHAR,
    characters VARCHAR[]
  )[]
)
Decoding sum types into Go interface values is tricky enough on its own, but it gets even harder when you have recursive data structures, as in an abstract syntax tree (AST). The article doesn't address this. Since nothing existing handled it, we built a little package called "unpack" as part of the SuperDB project.
And somewhat ironically here, SuperDB not only implements sum-type decoding of JSON in package unpack, but it also implements native sum types in a superset of JSON that we call Super JSON (with a query language that understands how to rip and stitch sum types for columnar analytics... work in progress).
Necessity is the mother of invention. MapReduce-based systems were developed because the state-of-the-art RDBMS systems of that age could not scale to the needs of the Googles/Yahoos/Facebooks during the phenomenal growth spurt of the early Web. The novelty here was the tradeoffs they made to scale out and up using the compute and storage footprints available at the time.
"We thought of that" vs "we built it and made it work".
MapReduce was never built to compete with RDBMS systems. It was built to compete with batch-scheduled distributed data processing, typically where there was no index. It was also built to build indices (the search index), not really to use them during any of the three phases. And it was built to be reliable in the face of cheap machines (bad RAM and disk).
Google built MR because it was in an existential crisis: it couldn't build a new index for the search engine, and the freshness and size of the index were important for early search engines. The previous tools would crash partway through due to the cheap hardware that Google bought. If Google had based search indexing on an RDBMS, they would not exist today.
Now, Google did use RDBMSes: they ran MySQL at scale. It wasn't unheard-of for mapreduces to run against MySQL (typically doing a query to get a bunch of records, and then mapping over those records).
I worked on mapreduce later in its life (long after it was mature), which used all sorts of tricks to extend the MapReduce paradigm as far as possible, but ultimately nearly everything got replaced by Flume, which is effectively a computational superset of what MR can do.
I think the paper was pulled because Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at. See the original paper for what they proposed as good use cases: counting word occurrences in a large corpus (far larger than the storage limits of Postgres and others at the time), distributed grep (without an index), counting unique items (where the number of items is larger than the capacity of a database at the time), reversing a graph (converting (source, target) pairs to (target, [source, source, source])), term vectors, the inverted index (the original use case for building the search index), and distributed sort. None of the RDBMSes of that day could handle the scale of the web. That's all.
> Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at
I like this comment because it gets to the heart of a misunderstanding. I'd further correct it to say "for something it wasn't trying to be good at". DeWitt and Stonebraker just didn't understand why anyone would want this, and I can see why: change was coming faster than it ever did, from many angles. Let's travel back in time to see why:
The decade after mapreduce appeared - when I came of age as a programmer - was a fascinating time of change:
The backdrop is the post-dotcom bubble when the hype cycle came to a close, and the web market somewhat consolidated in a smaller set of winners who now were more proven and ready to go all in on a new business model that elevates doing business on the web above all else, in a way that would truly threaten brick and mortar.
Alongside that we have CPU manufacturers struggling with escalating clock speeds and jamming more transistors into a single die to keep up with Moore's law and consumer demand, which leads to the first commodity dual and multi core CPUs.
But I remember that most non-scientific software just couldn't make use of multiple CPUs or cores effectively yet. So we were ripe for a programming model that engineers who'd never heard of Lamport could actually understand and work with: threads and locks and socket programming in C and C++ were a rough proposition, and MPI was certainly a thing, but the scientific computing people working on supercomputers, grids, and Beowulf clusters were not the same people as the dotcom engineers using commodity hardware.
Companies pushing these boundaries wanted to do things that traditional DBMSes could not offer at a certain scale, at least not cheaply enough. The RDBMS vendors and priesthood countered that it's hard to offer that while also offering ACID and everything else a database provides, which was certainly not wrong: it's hard to support an OLAP use case with the OLTP-style System-R-ish design that dominated the market in those days. This was some of the most complicated and sophisticated software ever made, imbued with near-magical qualities from decades of academic research hardened by years of industrial use.
Then there were data-warehouse-style solutions: "appliances" locked into a specific and expensive combination of hardware and software, optimized to work well together and also to extract millions and billions of dollars from the Fortune 500s that could afford them.
So the ethos at the booming post-dotcoms definitely was "do we really need all this crap that's getting in our way?", and we would soon find out. Couching it in formalism and calling it "mapreduce" made it sound fancier than what it really was: some glue that made it easy for engineers to declaratively define how to split work into chunks, shuffle them around and assemble them again across many computers, without having to worry about the pedestrian details of the glue in between. A corporate drone didn't have to understand /how/ it worked, just how to fill in the blanks for each step properly: a much more viable proposition than thousands of engineers writing software together that involves finicky locks and semaphores.
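For anyone who never filled in those blanks, the shape of the model can be sketched in a few lines of single-process Go (a toy, obviously: real MapReduce distributes each phase across machines and handles failures; the function names here are made up for illustration):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pair is the (key, value) record flowing between phases.
type pair struct{ key, value string }

// mapPhase: the user fills in how one input split becomes pairs.
// Here: word count, emitting (word, "1") for each word.
func mapPhase(doc string) []pair {
	var out []pair
	for _, w := range strings.Fields(doc) {
		out = append(out, pair{w, "1"})
	}
	return out
}

// shuffle: the framework groups all values by key.
func shuffle(pairs []pair) map[string][]string {
	groups := make(map[string][]string)
	for _, p := range pairs {
		groups[p.key] = append(groups[p.key], p.value)
	}
	return groups
}

// reducePhase: the user fills in how to fold one key's values.
// Here: each value is a "1", so the count is just the length.
func reducePhase(key string, values []string) int {
	return len(values)
}

func main() {
	docs := []string{"the quick fox", "the lazy dog"}
	var all []pair
	for _, d := range docs {
		all = append(all, mapPhase(d)...)
	}
	counts := map[string]int{}
	for k, vs := range shuffle(all) {
		counts[k] = reducePhase(k, vs)
	}
	// Print in a deterministic order.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, counts[k])
	}
}
```

The user writes only the two blanks (map and reduce); everything else, including the shuffle, partitioning, retries, and data movement, belongs to the framework.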
The DBMS crowd thumbed their noses at this because it was truly SO primitive and wasteful compared to the sophisticated mechanisms built to preserve efficiency that dated back to the 70s: indexes, access patterns, query optimizers, optimized storage layouts. What they didn't get was that every million dollars you didn't spend on what was essentially the space shuttle of computer software - fabulously expensive and complicated - could now buy a /lot/ more cheapo computing power duct taped together. The question was how to leverage that. Plus, with things changing at the pace that they did back then, last year's CPU could be obsolete by next year, so how well could the vendors building custom hardware even keep up with that, after you paid them their hefty fees? The value proposition was "it's so basic that it will run on anything, and it's future proof" - the democratization aspect could be hard to understand for an observer at that point, because the tidal wave hadn't hit yet.
What came was the start of a transition from datacenters to rack mounts in colos, and from dedicated hosts to virtualization and, very soon after, the first programmable commodity clouds: why settle for an administered unixlike timesharing environment when you can manage everything yourself and don't have to ask for permission? Why deal with buying and maintaining hardware? This lowered the barrier for smaller companies and startups who previously couldn't afford access to such things, nor the markets that required them, which unleashed what can only be described as a hunger for anything that could leverage that model.
So it's not so much that worse was better, but that worse was briefly more appropriate for the times. "Do we really need all this crap that's getting in our way?" really took hold for a moment, and programmers were willing to dump anything and everything that was previously sacred if they thought it'd buy them scalability: schemas and complex queries, to start.
Soon after, people started figuring out how to maintain all the benefits they'd gained (democratized massively parallel commodity computing) while bringing back some of the good stuff from the past. Only 2 years later, Google itself published the BigTable paper, which described a more sophisticated storage mechanism that optimized accesses better and, admittedly, was tailored for a different use case, but could work in conjunction with mapreduce. Academia and the VLDB / CIDR crowd were more interested now.
Some years after that came the papers for F1 and Spanner, which added back in a SQL-like query engine, transactions, secondary indexes, etc. on top of a similar distributed model in the context of WAN-distributed datacenters. Everyone preached the end of nosql and document databases, whitepapers were written about "newsql", and frustrated veterans complained about yet another fad cycle where what was old was new again.
Of course that's not what happened: the story here was how a software paradigm failed to adapt to the changing hardware climate and business needs, so capitalism had its guts ripped apart and slowly reassembled in a more context-applicable way. Instead of storage engines we got so many things it's hard to keep up with, but leveldb comes to mind as an ancestor. For locks we got chubby and zookeeper. For log structures we got kafka and its ilk. For query optimizer engines we got presto. For in-memory storage we got arrow. We got a Cambrian explosion of all kinds of variations and combinations of these, but eventually the market started to settle again and now we're in a new generation of "no, really, our product can do it all". It's the lifecycle of unbundling and rebundling. It will happen again. Very curious what will come next.
It's worth noting, in addition to looking down upon MapReduce, Stonebraker and the academic crowd were equally unimpressed by other commodity hardware scale-out practices of the time – including the practical-minded RDBMS sharding and caching strategies used by all the booming massive-scale startups.
In 2011, in an interview with GigaOm, Stonebraker famously called Facebook's use of sharded MySQL and Memcached "a fate worse than death" [1]. He also claimed the company should redesign their entire infrastructure, seemingly without clarifying what specific problem he thought needed to be solved.
The reader comments on that post are also quite interesting in tone.
Edit to add a disclosure: I joined Facebook's MySQL team a couple years after this, and quite enjoyed their database architecture, which certainly colors my opinion of this topic.
Thanks! :-) It's been brewing in my head for a while and I've written out shorter snippets of similar ideas over the course of the past few years.
I've taken some liberties and likely made some mistakes, but it's aping the kind of "history of technology in its social context" that I love to read about.
My recollection of the time is that lots of people thought they needed to use MapReduce for their "big data", but their data was like 100GB of logs they wanted to run an O(N) analysis on.
Apologies if our examples in the article weren't rich or clear enough. The beauty of the error type is that you can wrap any value in an error and make the error as rich as you'd like, even stacking errors from different stages of an ingest pipeline so you can see the lineage of an error alongside data that wasn't subject to errors. e.g., imagine an error like this:
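Something along these lines (a made-up illustration sketched for this comment, not output from an actual pipeline; the field names are hypothetical):

```
error({
  message: "divide by zero",
  stage: "metrics",
  on: {metric: "requests_per_user", requests: 85, n: 0}
})
```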
... and you can quickly deduce that your "metrics" stage is dividing by "n" even when n is 0, so you can correct your logic and then repair the errors in place once the bug in the ingest pipeline is fixed.
My point is that you know the real source of the error, so there's no point to these terse messages. In the example in the article, you know at the point of conversion that you got a float, but instead you need to do a dance to recover the original value. Here it's a divide by zero - you could tell me the name of the variable too.
There's no need to repeat the infamous MSSQL "String or binary data would be truncated" message saga here - there's no reason you can't give much more verbose errors by default.
Your observations are definitely valid. Structured errors were introduced in Zed some time ago, but as you highlight, they're not yet being generated in all the places they could be and with appropriate detail. There were a couple existing to-do items in the Zed backlog to make these enhancements in packages where it would have a high impact, but we've now created an additional Epic https://github.com/brimdata/zed/issues/4608 to gather these up and remind ourselves to address this in a comprehensive way. Thanks for helping to shine a light on this!
Yes, great point! The idea is that you can fix up a problem with data in place, while you're updating your ingest pipeline to handle whatever is causing the problem. You can do a transform on the errors into clean data, delete the errors, and commit the changes atomically. In the meantime, queries and searches can still run on the data that isn't problematic and even if there are errors inside of a hierarchical value, queries can be run on the portions of the value that are clean and intact while the errors are being addressed.
Hey, thanks for bringing up storage efficiency. This is an often-cited objection and actually a big motivation of the Zed project. In Zed, data is organized around a self-describing type system rather than schemas, and unlike the relational model, heterogeneous data can land side-by-side at the storage layer. And unlike NoSQL JSON-based systems, Zed is comprehensively typed, which allows us to form vectors for heterogeneous data and get OLAP-style analytics performance even with heterogeneous data. We're working on the vector analytics layer as we speak and hope to have an "MVP" of the Zed lake with vector support in the coming months.
TL;DR Zed stores data efficiently based on types not schemas so first-class error values of type "error" fit in anywhere with ease.