Why does it matter (in fully-general theory, which is what we're discussing here...

ants_a · on May 2, 2022

That's the availability part. If a system is unable to make progress it is not available.

di4na · on May 2, 2022

Because you cannot differentiate a slow node from a dead node. People expect different responses to these.
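A rough sketch of why that is, assuming a plain timeout-based failure detector (the function name `is_suspected` and the timings are made up for illustration): the observer only ever sees "no heartbeat since time T", and that observation is identical for a crashed node and for a live node whose heartbeat is delayed.

```python
def is_suspected(last_heartbeat_at: float, now: float, timeout: float) -> bool:
    """Return True if no heartbeat has arrived within `timeout` seconds."""
    return now - last_heartbeat_at > timeout


# Hypothetical scenario: both nodes were last heard from at t=0.
# One crashed right afterwards; the other is alive, but its next
# heartbeat is still stuck somewhere in the network.
crashed_node_last_seen = 0.0
slow_node_last_seen = 0.0

# The observer checks at t=10 with a 5-second timeout.
crashed_suspected = is_suspected(crashed_node_last_seen, now=10.0, timeout=5.0)
slow_suspected = is_suspected(slow_node_last_seen, now=10.0, timeout=5.0)

# Same inputs, same verdict: the detector has no way to tell them apart.
print(crashed_suspected, slow_suspected)
```

Raising the timeout only moves the problem: a longer wait means fewer false suspicions but a slower reaction to a node that really is gone.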

derefr · on May 2, 2022

Big assumption that a distributed system has to serve “people” and have “responses.”

A distributed system might be, for example, the ACH system: all batch, all store-and-forward, no replies flowing down the line, only processed message response batches dropped off “eventually” in outboxes.

Or, for another example: any workload manager, whether on an HPC cluster or your local Kubernetes node. No synchronous workload execution; just async batch scheduler enqueue with later best-effort status querying.

di4na · on May 2, 2022

Note that ACH expects a response in under 3 days, so blocking forever does not work. Because guess what: people expect an answer.

Saying to people "a system somewhere is blocked for possibly forever, so too bad we cannot do your thing" is our reality. Our systems exist for their impact on people.

Otherwise they are art... which also exists for its impact on people.

derefr · on May 3, 2022

It's not necessarily that "we cannot do your thing." Just that "we cannot do your thing using your lock. To get around this, simply make a new resource, to get a new lock."

Think of how in e.g. an IaaS control plane, when you delete a VM, it may take an arbitrarily-long time before you can create another VM with the same ID. (Maybe forever!) But you can always create a VM with a different ID, that otherwise fulfills all the same purposes (e.g. has the old instance's IP, FQDN, etc.) The old ID essentially has a distributed lock on its use, with an unbounded release time — and that's perfectly fine for the use-case.

For an example of fail-stalled being not only practical but preferred, consider tag-out locking systems (exclusive-access locks used to prevent machines from being turned on while maintenance is being performed on them.) If there was a digital lock of that type, you wouldn't want to ever automatically time it out. A human put that lock there, to keep them alive. They'll take it off when they're done. If you really suspect someone forgot to unlock the tag-out lock, you can always go and check with the lock's acquirer. But if you can't get in contact with them, you can't know that they don't still have their hands up in the gears of the machine. And in this case, failing to auto-restart the assembly line (until the "partition" is over and you can just ask the maintenance worker why they're still holding the lock) is worth much less than said maintenance worker's life.
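To make that concrete, here's a minimal sketch of what such a digital tag-out lock might look like, assuming a single in-memory lock (the `TagOutLock` class and its method names are hypothetical, not from any real system). The point is what's missing: there is no TTL, lease, or expiry anywhere, so the only way the lock goes away is the holder releasing it.

```python
class TagOutLock:
    """Exclusive lock with no timeout: only the acquirer can release it."""

    def __init__(self) -> None:
        self._holder = None  # name of the worker holding the lock, if any

    def acquire(self, worker: str) -> bool:
        """Take the lock if it is free; never steal it from a current holder."""
        if self._holder is not None:
            return False
        self._holder = worker
        return True

    def release(self, worker: str) -> None:
        """Release the lock; refuse unless the caller is the current holder."""
        if self._holder != worker:
            raise PermissionError(f"{worker} does not hold this lock")
        self._holder = None

    def can_start_machine(self) -> bool:
        """The machine may run only while nobody holds the tag-out lock."""
        return self._holder is None


lock = TagOutLock()
lock.acquire("maintenance_worker")

# However long we wait, and whether or not we can reach the holder,
# the answer stays "no" until the holder releases the lock themselves.
print(lock.can_start_machine())
```

If the holder is unreachable, the line stays stopped. That's the fail-stalled behavior, and here it's the safe one.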

jfim · on May 2, 2022

I can't come up with a good reason as to why one would want to fail stalled. In what scenario would one want to have a distributed lock that fails in that way?