
"""Now if you've designed your system to be resilient to failures, and you have because you are using cheap whitebox servers which you know are going to fail randomly. You know that you can deliver against your service level SLA with up to 15 machines down (85 out of a nominal 100 running)."""

All of this is great information and certainly quite useful, but it is all strongly premised on that final "you know", which assumes a universe in which you are able to capacity plan pretty accurately ahead of time.

Imagine instead a world where it is entirely possible that, on the same day, and all over the United States, a new Harry Potter 7 trailer starts playing at the theaters, causing hundreds of thousands of families to suddenly decide that tonight is a great night to not only boot up Netflix, but to all watch the exact same movie.

(For the record, this sort of thing has totally happened to me a few times: suddenly you find out that jailbreaking the iPhone was mentioned on CNN, or a new tool is released from secret development onto that team's hundred-thousand-plus-follower Twitter feed, leading to a sudden multi-x spike in traffic.)

Now, not only do we suddenly have an unexpected spike in load, but that spike is all targeting the exact same data. In this kind of situation, first off, if you really are sitting on racks at 85% capacity, you are probably screwed: no one is watching that movie tonight.

Luckily, we aren't talking about a world with a bunch of racks: we are talking about Amazon EC2, and we hopefully have an auto-scaling group set up to automatically increase the number of servers we have operating.
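
To make "auto-scaling group" concrete, here is a minimal sketch using boto3; every name, size, and threshold below is invented for illustration, and the launch configuration is assumed to already exist:

    # Hypothetical auto-scaling group with a CPU-based target-tracking
    # policy; all names and numbers here are made up.
    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="streaming-edge",        # hypothetical
        LaunchConfigurationName="streaming-edge-lc",  # assumed to exist
        MinSize=85,
        MaxSize=300,
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )

    # Add/remove instances to hold average CPU near 60%.
    autoscaling.put_scaling_policy(
        AutoScalingGroupName="streaming-edge",
        PolicyName="cpu-target-60",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,
        },
    )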

However, while that is spinning up, we may have a small set of nodes (and yes: these nodes may even be S3 nodes run by Amazon) that are /freaking out/ as they are the blessed 3-5 computers that are storing our mirrors of Harry Potter 6. 3-5 has, throughout the entire previous history of the service, been "safely more than adequate" to handle the load of any given movie (and even a random failure of one of the servers), but today they are running "a tad slow".
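
To put rough numbers on how fast "safely more than adequate" flips (every figure below is invented for illustration):

    # Back-of-the-envelope: per-replica load on the nodes holding one
    # movie. All numbers are invented.
    replicas = 4                # the "blessed 3-5 computers"
    capacity_per_node = 2_000   # concurrent streams one node can serve
    typical_viewers = 3_000     # a normal night for any single title
    spike_viewers = 40_000      # everyone watching the same movie tonight

    for viewers in (typical_viewers, spike_viewers):
        per_node = viewers / replicas
        print(f"{viewers} viewers -> {per_node:.0f}/node "
              f"({per_node / capacity_per_node:.0%} of capacity)")

At 3,000 viewers each replica sits at 38% of capacity, comfortable even if one of them dies; at 40,000 each replica is asked for five times what it can serve, and no amount of resilience on the other 95 machines helps.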

Now, even here, maybe we are set up to scale: maybe the system is designed to auto-scale the data as well, and is copying the movie to new servers as we contemplate this scenario (a situation that, unfortunately, puts a bit more load on these servers, although hopefully only marginally more).

However, as the service went from "safely over-provisioned" to "at capacity" within minutes (everyone sitting down at 7pm to watch a movie with their family now that the game is over, or whatever), the new machines are still copying the data over, and the old ones are now sufficiently slow that, from raw number reports, they look like "stragglers", with (due to the wonder of statistics) one of them randomly looking slightly worse than the others; maybe even sufficiently worse to hit some "unhealthy" threshold, and Doctor Monkey comes along and kills it.

This is the worst possible moment and the worst possible server to inflict that damage on. Even if Doctor Monkey marks the entire world as "degraded" at this point and refuses to do anything else until there is manual intervention, it has just drastically increased the load on a small set of servers that was already "at capacity", possibly fatally (possibly so high that the copy operation spawning more of these servers grinds to a halt).
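
A toy model of that interaction (the health checker and every number here are mine, not Netflix's):

    # Toy model: a latency-threshold health checker meets a uniformly
    # overloaded replica set. All thresholds and numbers are invented.
    import random

    random.seed(1)
    replicas = 4
    total_load = 40_000        # streams across the replica set
    unhealthy_ms = 900         # "kill it" latency threshold

    while replicas > 1:
        per_node = total_load / replicas
        # Every node is equally overloaded; random noise alone decides
        # which one "looks worst" on the dashboard.
        latencies = [per_node / 10 * random.uniform(0.9, 1.1)
                     for _ in range(replicas)]
        print(f"{replicas} replicas, {per_node:.0f} streams/node, "
              f"worst latency {max(latencies):.0f}ms")
        if max(latencies) <= unhealthy_ms:
            break
        replicas -= 1          # Doctor Monkey "helps"
        # Survivors absorb the dead node's share: per-node load jumps
        # by n/(n-1), making the next kill even more likely.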

This is the specific kind of thing that causes cloud collapse scenarios like the one that killed Amazon EBS a few months ago: due to a busy network, servers prematurely decided that their replica pairs were "offline", causing attempts to find new buddies, causing further network congestion, leading to an even higher likelihood that servers would appear disconnected.
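
The shape of that loop fits in a dozen lines (this is a caricature of the feedback, not of the actual EBS protocol, and every constant is invented):

    # Caricature of a re-mirroring storm: timeouts look like dead
    # peers, "dead" peers re-mirror to new buddies, and re-mirroring
    # is itself network traffic. All constants are invented.
    network_load = 0.91        # fraction of network capacity in use
    healthy_pairs = 1000

    for tick in range(4):
        # Past ~90% utilization heartbeats start missing deadlines,
        # and a slow peer is indistinguishable from a dead one.
        timeout_rate = min(1.0, max(0.0, (network_load - 0.90) * 5))
        declared_dead = int(healthy_pairs * timeout_rate)
        healthy_pairs -= declared_dead
        # Each "dead" pair hunts for a new buddy and copies its volume,
        # so the loop feeds on its own output.
        network_load += declared_dead * 0.001
        print(f"tick {tick}: load={network_load:.2f}  "
              f"newly 'dead'={declared_dead}  healthy={healthy_pairs}")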

So yes: it may be possible to design a system that is able to tell "unhealthy" from "unpredictably mis-provisioned", but given how fine a line that is, I can totally understand why moe chose the terminology "magic enterprise pixie dust" to describe the solution that minds that particular gap.


