Crash-Only Software

Crash-Only Software - Crazy?

Crashing is bad. If something must crash (because we are fallible) we want it to crash early. Static types are a joy specifically because they kill our expectations of success before we run the program, rather than during execution.

Infrequently used code paths are bad. They may or may not be correct, but if nobody has triggered them recently, how would you know? Perhaps you have extensive code coverage, you know what mutation testing is and decided you like it. Perhaps you brush your teeth twice daily and floss as well.

So what happens if we accept a bad thing wholeheartedly, bathe in it and make it a necessary part of our software? We get crash-only software, where crash handling is the only path. As far as I’m aware, Crash-Only software conceptually started in 2003 because of George Candea and Armando Fox. Their original premise was simple - maybe it is faster to do an ugly crash than a graceful reboot? (This seems to come from a similar mindset to mine when I yank the USB drive without disconnecting it, which has yet to burn me…) In their case, even in 2003, crashing was faster, and didn’t result in untoward data loss.

Okay, so it’s quick, you say, but my software is special. And it probably is, which is why this is only one philosophy among many, and one that is hard to find information on. Crash-Only does ask you to accept a mountain of design unorthodoxy, and one hopes that in the end the tradeoff is worth it.

What does that unorthodoxy look like in practice?

The Central Conceit(s)

  1. Systems should only crash, or initialize from a crash. One on-ramp, one off-ramp.
  2. If at all possible, the system should not crash itself. Prefer kill -9 from outside over some sort of manual intervention. It is probably permissible to have a supervisor-style exit, but we want quick and dirty.
  3. Because we intend to murder our darlings, state should either be ephemeral (and we really couldn’t care less about what it was 1 second ago) - or persisted sufficiently that yanking the power cord is a viable solution to our problems.
  4. Lease everything.
  5. Timeouts everywhere - We’d really prefer to be crashed, or OK. If we’re unsure which state we’re in, we might have to contemplate subjective states of ‘not quite feeling well but okay’… nope kill it with fire.
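
To make points 4 and 5 a little more concrete, here is a minimal sketch of a lease with a built-in timeout. Everything in it (LeaseTable, acquire, holder_of, the resource and worker names) is made up for illustration, and a real version would presumably keep the leases in the persistent store rather than in memory.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float  # seconds on whatever clock the table was given


class LeaseTable:
    """Hypothetical in-memory lease table, for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._leases: Dict[str, Lease] = {}

    def acquire(self, resource: str, holder: str, ttl: float) -> bool:
        """Grant the lease if it is free, expired, or already held by `holder`."""
        now = self._clock()
        current = self._leases.get(resource)
        if (
            current is not None
            and current.expires_at > now
            and current.holder != holder
        ):
            return False  # someone else holds a live lease
        self._leases[resource] = Lease(holder, now + ttl)
        return True

    def holder_of(self, resource: str) -> Optional[str]:
        """Return the live holder, or None once the lease has lapsed."""
        current = self._leases.get(resource)
        if current is None or current.expires_at <= self._clock():
            return None
        return current.holder
```

The point of the design is that a crashed holder needs no cleanup code: nobody releases anything, the lease simply lapses and the next worker picks the resource up.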

Okay what does this buy me?

These constraints force several beneficial outcomes. One, we know your ‘error handling’ code is being used, violently and often; it isn’t in the scary part of the codebase that was last touched by an intern in 2011. Two, you’ve had to explicitly mark what state you care about and what can be discarded, which means you actually understand your system’s invariants. Three, you’ve likely decoupled components that don’t strictly need another component to run, because tight coupling means cascading failures when you’re crashing regularly.

The result is a system that’s simpler to reason about precisely because you can’t rely on graceful degradation or complex error recovery. You either work or you don’t; congestive failure is not an option.

State and Control Theory - A digression

I’ve been trying to read up on control theory, and I think Crash-Only might have something to recommend itself in that perspective, too. In control theory, you care about observability (can I see what state I’m in?) and controllability (can I drive the system to a desired state?). We have states, views of those states, some ways to nudge them, and hopefully our states, views, and nudges work out into a reliable system.

In a crash-happy system, component state should be simpler. Like Let-it-crash, we’re taking a dim view of unexpected errors. Our feedback loops might be clearer, or at least able to pessimize onto “it isn’t running correctly” - and our runtime feedback mechanisms should devolve into two things: normal interaction, and kill it with fire.
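
As a rough sketch of that two-state feedback loop: the observation is "did it finish cleanly before the deadline?", and the only control action is a hard kill followed by a restart. The function names (python_worker, run_once, supervise) and the retry policy are my own assumptions, not anything from the original paper.

```python
import subprocess
import sys
from typing import List


def python_worker(source: str) -> List[str]:
    """Hypothetical helper: command line for a throwaway Python worker."""
    return [sys.executable, "-c", source]


def run_once(cmd: List[str], deadline_s: float) -> str:
    """Run one attempt and collapse every outcome into 'ok' or 'crashed'."""
    proc = subprocess.Popen(cmd)
    try:
        code = proc.wait(timeout=deadline_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: no graceful shutdown path
        proc.wait()  # reap the child so it does not linger as a zombie
        return "crashed"
    return "ok" if code == 0 else "crashed"


def supervise(cmd: List[str], deadline_s: float, max_restarts: int) -> int:
    """Restart until an attempt is 'ok'; return attempts used, or -1 if we gave up."""
    for attempt in range(1, max_restarts + 2):
        if run_once(cmd, deadline_s) == "ok":
            return attempt
    return -1
```

Note that a hang, a non-zero exit, and a segfault all look identical to the supervisor. That is deliberate: there is no "not quite feeling well" branch to get wrong.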

Of course, none of this works without the right infrastructure.

Some necessary tools

You’ll need a database (in crash-only land it’s called a crash-only store) - because updates to important state need to be atomic. The application itself should be as stateless as possible, other than trivial data you can regenerate from the database.
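
If a full database is overkill, the usual trick for atomic state on a plain filesystem is write-to-temp, fsync, then rename. This is a minimal sketch under that assumption; save_state and load_state are hypothetical names, and a properly durable version would also fsync the containing directory, which I have left out.

```python
import contextlib
import json
import os
import tempfile
from typing import Any, Dict


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Replace the file at `path` so readers see the old or the new state, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # get the bytes on disk before the rename
        # Atomic on POSIX because the temp file is in the same directory.
        os.replace(tmp_path, path)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise


def load_state(path: str) -> Dict[str, Any]:
    """Startup is recovery: read whatever the last completed save left behind."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}  # first boot looks the same as any other boot
```

A kill -9 in the middle of save_state leaves at worst a stray temp file next to an intact old state file, so load_state needs no special "was I shut down cleanly?" branch.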

Your requests will have to carry their own context. Time-to-live, idempotency flags, everything needed to retry them if a component crashes mid-processing. No hidden session information.
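
Here is one way that could look, as a sketch rather than a prescription. The Request and Handler classes and their fields are invented for the example, the in-memory dict stands in for a table in the crash-only store, and uppercasing the payload is a placeholder for real work.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class Request:
    payload: str
    deadline: float  # absolute time after which nobody wants the answer
    idempotency_key: str = field(default_factory=lambda: uuid.uuid4().hex)


class Handler:
    """Hypothetical handler; `_results` stands in for persistent storage."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._results: Dict[str, str] = {}
        self.executions = 0

    def handle(self, req: Request) -> str:
        if self._clock() >= req.deadline:
            raise TimeoutError("request expired before processing")
        if req.idempotency_key in self._results:
            # A retry after a crash: replay the recorded answer, don't redo the work.
            return self._results[req.idempotency_key]
        self.executions += 1
        result = req.payload.upper()  # placeholder for the real side effect
        self._results[req.idempotency_key] = result
        return result
```

Because the deadline and the key travel with the request, whoever retries it after a crash needs no memory of the first attempt.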

A note about why we crash

Normal variation in a program is fine. Crashes are still for broken invariants, but we’ve designed the system so that it isn’t apocalyptic to eagerly use asserts in production.
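
A small sketch of that distinction, with made-up names (require, withdraw) and an arbitrary exit code of 70. The die parameter exists only so the behaviour can be exercised without actually killing the interpreter.

```python
import os
import sys
from typing import Callable, Optional


def require(
    condition: bool, message: str, die: Callable[[int], None] = os._exit
) -> None:
    """Crash immediately on a broken invariant.

    Unlike the `assert` statement, this check is not stripped by `python -O`.
    """
    if not condition:
        sys.stderr.write(f"invariant violated: {message}\n")
        sys.stderr.flush()
        die(70)  # os._exit skips finally blocks and atexit handlers


def withdraw(balance: int, amount: int) -> Optional[int]:
    """Return the new balance, or None if the request is rejected."""
    # Broken invariant: our own stored state is impossible, so crash.
    require(balance >= 0, "stored balance is negative")
    if amount > balance:
        return None  # normal variation: say no and keep running
    return balance - amount
```

An overdrawn request is just Tuesday; a negative stored balance means something upstream is already wrong, and restarting from persisted state is the recovery path we actually exercise.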

2006 LWN article discussing Crash-Only (seemingly written as if people would have heard of it by then)

A wonderful 2016 RootConf Presentation by Antoine Grondin