Premise

I’ve recently been thinking about different methods to improve my code quality. In this case, quality equates to reducing the number of failure states (crashes) a program has. One option I’ve been considering is programming through what I affectionately call Expected Failure. Or EF (pronounced F!)

Edit: It was pointed out that this is very similar to defensive programming, and that’s a very good point. If you want a more full-fledged explanation of defensive programming, there are a variety of posts about it; in some ways this is a very small subset of it. (Check out https://medium.com/web-engineering-vox/the-art-of-defensive-programming-6789a9743ed4 for a breakdown.)

Software is often optimistically written with the expectation that the functions you call will succeed. We then build on top of these successful calls, producing a tower of call dependencies. This means that a lot of functionality depends on the calls at the bottom of the stack succeeding; if they fail, the whole tower of functionality comes crumbling down.

What do I mean by fail? I mean calls that have the potential to return in an error state. For example, malloc() returns NULL if it fails to allocate your block of memory, and many Vulkan API calls return VK_SUCCESS only when they succeed (and an error code otherwise).
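To make that concrete, here’s a minimal C++ sketch of the malloc() case; the Vulkan version is analogous (compare the returned VkResult against VK_SUCCESS before moving on):

#include <cstdio>
#include <cstdlib>

int main() {
    // malloc() returns NULL when it can't satisfy the request,
    // so the returned pointer must be checked before use.
    void* block = std::malloc(64 * 1024);
    if (block == nullptr) {
        std::fprintf(stderr, "malloc failed: out of memory\n");
        return EXIT_FAILURE;
    }

    // ... use the block ...

    std::free(block);
    return EXIT_SUCCESS;
}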

Instead, I wonder how software would look if we expected those calls to fail. In my field (game development) I see the following scenario quite often: “When code succeeds, do X; when it fails, scream and either try to recover (rare) or assert with no follow-up”. This is “fine”; often we can’t or don’t want to add recovery code due to the large amount of complexity involved, for performance reasons, or other reasons. (I’m lazy.)

The first step in a potential solution could be to expect our calls to fail. This means that even for calls where you think “this can fail, but it shouldn’t”, you change your mindset to “this will fail”. Once you have that mindset, you can make clear decisions about how you will address that potential failure.

I don’t suggest adding a large amount of recovery code for every potential fail state. Instead, I hope that having this “Expected Failure” mindset will lead to a clearer understanding of the code’s failure states and a clearer decision on how each failure will be handled.

In some regards, this is simply a mindset that puts value on thinking clearly about a failure instead of treating it as a nuisance. Once you’ve thought clearly through a fail state, determined that it’s valid to even have that fail state, and made a conscious choice on how it is to be handled, then I consider that a success.

Potential Solutions

At the cost of potentially overstretching the idea, I suggest that once we’ve determined that a call will fail, we can make the decision on how to handle it. I wonder if a form of rough classification would be beneficial. Something as simple as “Benign, Problematic, and Critical”.

Benign suggests that the failure is simply from a misplaced assumption in an API. Something as simple as this:

AddEntry("foo");

... code here

Entry* entry = FindEntry("foo");
assert(entry != nullptr);

Now assume that the call later gets changed to:

if (chance() <= 99)
    AddEntry("foo");

// ... code here

Entry* entry = FindEntry("foo");
assert(entry != nullptr);

Now this might seem silly, but add a dash of complexity and a large codebase and you’ve got a 1-in-100 assert and crash. Admittedly, this example is quite contrived, but it is enough to illustrate my point for this case.

This code smells fairly strongly of “FindEntry shouldn’t fail”. If instead we thought of it as “FindEntry will fail”, we could consider rewriting our code to look like this:

EntryKey key;
if (chance() <= 99)
    key = AddEntry("foo");

// GetEntry can only be called with a key, so the dependency
// on AddEntry is now explicit.
Entry* entry = GetEntry(key);
assert(entry != nullptr);

This doesn’t remove our problem completely; we could still pass an uninitialized key to GetEntry. But now the dependency is clear, and it’s harder to make the earlier mistake.
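One way to go further, as a sketch assuming C++17 (EntryKey, AddEntry, GetEntry, and chance() are hypothetical stand-ins for the example above): wrap the key in std::optional so the “uninitialized key” state is explicit in the type, and the compiler pushes us to decide what happens when AddEntry never ran.

#include <cassert>
#include <cstdlib>
#include <optional>

// Hypothetical stand-ins for the API in the example above.
struct Entry {};
struct EntryKey { int id = 0; };

static Entry g_entry;
EntryKey AddEntry(const char*) { return EntryKey{1}; }
Entry* GetEntry(EntryKey key) { return key.id == 1 ? &g_entry : nullptr; }
int chance() { return std::rand() % 100 + 1; }

int main() {
    std::optional<EntryKey> key;
    if (chance() <= 99)
        key = AddEntry("foo");

    // The "no key" state is now explicit in the type; we must
    // consciously handle the case where AddEntry never ran.
    if (key) {
        Entry* entry = GetEntry(*key);
        assert(entry != nullptr);
    }
    return 0;
}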

Problematic failures suggest that a system cannot function correctly once a call fails. These types of errors can be handled through various means; one method could be to build laterally instead of vertically. Instead of building systems on top of other systems, they could live independently, communicating with other systems only if those systems happen to be there and functioning. If this method would add too much complexity, then maybe deciding that a failed call simply ends the execution of the program would be acceptable.
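As a rough sketch of what “building laterally” might look like (AudioSystem and its API are made up for illustration): the game keeps running even when an optional subsystem fails to come up, because callers check for its presence instead of assuming it.

#include <cstdio>
#include <memory>

// Hypothetical subsystem; the names are illustrative only.
class AudioSystem {
public:
    // Returns nullptr on failure rather than a half-constructed system.
    static std::unique_ptr<AudioSystem> Create(bool deviceAvailable) {
        if (!deviceAvailable)
            return nullptr;  // expected failure: no audio device
        return std::unique_ptr<AudioSystem>(new AudioSystem());
    }
    void PlayFootstep() { std::puts("footstep.wav"); }
private:
    AudioSystem() = default;
};

int main() {
    // The rest of the game doesn't sit on top of audio; it only
    // talks to it if it happens to be there and functioning.
    auto audio = AudioSystem::Create(/*deviceAvailable=*/false);
    if (audio)
        audio->PlayFootstep();
    else
        std::puts("running without audio");
    return 0;
}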

Critical failures are just that: critical. They imply that something went very wrong and we can’t recover from it. For these, it is most likely appropriate to end the execution of the program.
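A minimal sketch of that choice, assuming a hypothetical LoadCoreAssets() call: log what went wrong, then end execution deliberately rather than limping on.

#include <cstdio>
#include <cstdlib>

// Hypothetical: returns false when data the game can't run
// without is missing or corrupt.
bool LoadCoreAssets() { return false; }

int main() {
    if (!LoadCoreAssets()) {
        // Critical failure: we've made the conscious choice that
        // there's no sensible recovery, so we stop here, loudly.
        std::fprintf(stderr, "fatal: core assets failed to load\n");
        return EXIT_FAILURE;
    }
    // ... run the game ...
    return 0;
}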

These solutions are merely meant as an exercise in considering that the calls “will fail”, not that they “might”. Whether they fit well for your problem is up to you to decide.

Potential Criticisms

What about productivity? Wouldn’t this additional step create more work for the programmer?

It might, but it also might not. Once it’s part of a regular workflow, I don’t imagine that the small amount of time spent deciding how a call’s failure should be handled would be significant. I also wonder if making that decision up front might save time for the next programmer to modify the system and for the users of the system. If a 5% increase in time for the current programmer saves 5% of the time of the maintainer, testers, and users, then that seems like a valid trade-off.

There’s also the question of morale: if this makes a programmer feel less productive, then they might benefit from another strategy to improve their quality (if that’s their goal). I don’t propose that this is one size fits all, or even one size fits many.

Handling/recovering from every error would add a lot of complexity to the software.

It doesn’t have to. This is up to the programmer’s discretion. This strategy simply promotes conscious consideration of the failure states of a program. The resolution of those failure states is a much larger and more difficult problem.

Does this strategy even work?

I have absolutely no idea. It’s my goal to try it out and see what kind of results I get from it. Even if you don’t end up using it, maybe it will get you thinking about how you could improve the quality of your code. I would love to hear about it!

Conclusion

Is this a good mindset to have when programming? I don’t know. I’ve only recently started trying it out and I look forward to seeing where it leads me. At worst, I’ll know it doesn’t work well, at best my code quality will have improved!

Have you read something similar or even better? Hit me up! Always happy to hear about new ways to challenge my productivity and quality.

Further Reading