Handling Failures From First Principles

Oct 22, 2022

Keywords: Failure, Failure Tolerance, Failure Transparence, Failure Detection, Failure Mitigation, Application-level vs Platform-level, Failure Presentation, Transient, Intermittent, Permanent, Recovery, Forward Recovery, Backward Recovery, Repair, Manual Repair, Auto Repair

This blog post presents a blueprint for a principled failure handling strategy that guarantees correctness and completeness while maximizing the chance of success even in the presence of failure.

You will learn about the taxonomy to holistically reason about failures and a customizable strategy to correctly handle failures.

Introduction

Your business processes are the value-generating entities of your business. Therefore, a business process execution should run successfully to completion, even in the presence of failure.

Figure 1. A business process illustrated as a sequence of steps

For this post, I will use an e-commerce business process as an example. A business process is a sequence of steps called actions; here, one action is charging a customer’s credit card.

To maintain consistency, after a business process is executed, the system must be in a state equivalent to the business process having executed either exactly once or not at all:

In the absence of failure, the system trivially transitions into a state equivalent to the business process executing exactly once.
In the presence of failure, the system has to take mitigating steps in order to ensure the transition into a state equivalent to the business process executing either exactly once or not at all.

Here, the action Charge may raise one of two exceptions:

an InsufficientFundsException, indicating that the charge could not be processed due to the customer’s credit, or
a ConnectionException, indicating that a connection to the payment provider could not be established.

Neither exception provides any further details.

function Charge(...) throws InsufficientFundsException, ConnectionException
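As a rough Python sketch of that signature: the two exception classes mirror the ones named above and deliberately carry no details, while the function name `charge` and its parameters (`balance_cents`, `amount_cents`, `provider_reachable`) are hypothetical stand-ins for a real payment provider call.

```python
class InsufficientFundsException(Exception):
    """The charge could not be processed due to the customer's credit."""


class ConnectionException(Exception):
    """A connection to the payment provider could not be established."""


def charge(balance_cents: int, amount_cents: int, provider_reachable: bool = True) -> int:
    """Hypothetical stand-in for the Charge action; returns the remaining balance."""
    if not provider_reachable:
        # Assumption: reachability is passed in so the sketch stays self-contained.
        raise ConnectionException()
    if amount_cents > balance_cents:
        raise InsufficientFundsException()
    return balance_cents - amount_cents
```

Note that the caller learns only which of the two failures occurred, not why it occurred; the rest of the post works with exactly that amount of information.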

If Charge fails, some steps of the business process will already have been executed and some will not have been executed yet, threatening the consistency of the system.

Failure handling consists of failure detection as well as failure mitigation. In this blog post, I will focus on failure mitigation. In order to correctly mitigate a failure, one has to correctly classify the failure.

Classifying Failures

Failures may be classified along different dimensions. This blog post will focus on two orthogonal dimensions, the spatial dimension and the temporal dimension.

Spatial Dimension

On the one hand, failures may be classified by where they occur. I will assume a layered architectural style: Components are arranged in a layered fashion where components at a higher layer can make a downcall to components at a lower layer, generally expecting a response. Less frequently, components at a lower layer can make an upcall to components at a higher layer, generally via a previously registered callback. [1]

Figure 2. Application-level vs Platform-level

The End-to-End Argument states that in a layered system, looking from the top down, failure handling should be implemented in the lowest layer at which failure detection and failure mitigation can be implemented correctly and completely.

Application and Platform

If we can only correctly and completely detect and mitigate a failure at the application level, we classify the failure as an application-level failure.

In our example, InsufficientFundsException is an application-level failure that can only be detected and mitigated at the application level.

However, if we can correctly and completely detect and mitigate a failure at the platform level, we classify the failure as a platform-level failure. In that case, the failure itself, failure detection, and failure mitigation may be transparent to the application level.

In our example, ConnectionException is a platform-level failure that may be detected and mitigated at the platform level, transparent to the application level.
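The spatial classification can be sketched as a small lookup in Python. The names `Level`, `APPLICATION_LEVEL`, and `classify` are hypothetical. Treating every failure that is not unambiguously application-level as platform-level follows the rule this post states under Platform-level Failure Mitigation.

```python
from enum import Enum


class Level(Enum):
    APPLICATION = "application"
    PLATFORM = "platform"


class InsufficientFundsException(Exception):
    pass


class ConnectionException(Exception):
    pass


# Failures that only the application can detect and mitigate correctly.
APPLICATION_LEVEL = (InsufficientFundsException,)


def classify(failure: Exception) -> Level:
    """Return the layer responsible for mitigating the given failure."""
    if isinstance(failure, APPLICATION_LEVEL):
        return Level.APPLICATION
    # Anything not unambiguously application-level goes to the platform first.
    return Level.PLATFORM
```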

Temporal Dimension

On the other hand, failures may be classified by when they occur and how often we expect them to occur.

Transient, Intermittent, and Permanent

A transient failure is a failure that “comes and goes”. Formally, the defining characteristic of a transient failure is that the probability of a failure t2 occurring after a failure t1 occurred is the same as the probability of a failure t2 occurring on its own.

Figure 3.1. Past failures do not increase the probability of future failures.

In our example, ConnectionException may be a transient failure if the cause of the failure is a router restart.

An intermittent failure is a failure that “comes, sticks around, and goes”. Formally, the defining characteristic of an intermittent failure is that the probability of a failure t2 occurring after a failure t1 occurred is higher than the probability of a failure t2 occurring on its own.

Figure 3.2. Past failures do increase the probability of future failures.

In our example, ConnectionException may be an intermittent failure if the cause of the failure is a router with an out-of-date routing table.

A permanent failure is a failure that “comes and sticks around”. Formally, the defining characteristic of a permanent failure is that the probability of a failure t2 occurring after a failure t1 occurred is 100%.

Figure 3.3. Past failures result in future failures.

In our example, ConnectionException may be a permanent failure if caused by the client or server presenting an invalid certificate when initiating the connection.
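The three definitions can be placed side by side as conditional probabilities. This reads t1 and t2 as the events that the earlier and the later failure occur, which is one interpretation of the wording above rather than notation taken from a reference.

```latex
\begin{align*}
\text{transient:}    \quad & P(t_2 \mid t_1) = P(t_2) \\
\text{intermittent:} \quad & P(t_2 \mid t_1) > P(t_2) \\
\text{permanent:}    \quad & P(t_2 \mid t_1) = 1
\end{align*}
```

In the transient case the two events are statistically independent, which is why an immediate retry is as likely to succeed as the original attempt was.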

By definition, transient failures and intermittent failures are auto-repair failures; that is, the underlying cause of the failure does resolve without manual intervention.

Permanent failures are manual-repair failures; that is, the underlying cause of the failure does not resolve without manual intervention.

Furthermore, manual-repair failures may be consumer-side, meaning we can repair the cause of the failure ourselves, or provider-side, meaning someone else must repair the cause of the failure for us.

Failure Mitigation

With this failure classification we are able to construct an ideal failure handling strategy, that is, a failure handling strategy that achieves the ideal outcome in the presence of a given failure and its cause.

On the highest level, failure handling can be classified as backward recovery or forward recovery.

Figure 4. Forward Recovery vs Backward Recovery

Backward failure recovery refers to failure mitigation strategies that transition the system from the intermediary state to a state that is equivalent to the initial state (move the process backward). As a rule of thumb, backward failure recovery does not require repairing the underlying cause of the failure.

Backward failure recovery is a common application-level failure mitigation strategy in the form of compensation.
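A minimal sketch of compensation in Python, assuming each step is supplied together with an action that undoes it. The names `Step` and `run_with_compensation` are hypothetical, and a real implementation would also have to handle a compensation that itself fails.

```python
from typing import Callable, List, Tuple

# A step is a pair of (action, compensation).
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> bool:
    """Run actions in order; on failure, undo the completed ones in reverse."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Move the process backward to a state equivalent to the initial state.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

The step that failed is not compensated, because it is assumed to have had no effect. That assumption does not hold for every action and is worth checking per step.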

Forward failure recovery refers to failure mitigation strategies that transition the system from the intermediary state to a state that is equivalent to the final state (move the process forward). As a rule of thumb, forward failure recovery requires repairing the underlying cause of the failure.

Forward failure recovery is a common platform-level failure mitigation strategy in the form of retries.

Putting it all together

This section provides a blueprint for an ideal failure handling strategy, one which maximizes the probability of a business process execution completing successfully in the presence of a failure.

Application-level Failure Mitigation

If a failure can unambiguously be classified as an application-level failure, the application is responsible for mitigating that failure.

In our example, InsufficientFundsException is an application-level failure that may be addressed by backward recovery, such as compensating steps that have already occurred, or by forward recovery, such as asking the customer for an alternate payment method.
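Both options can be combined, as in the following Python sketch: try forward recovery with each available payment method, and fall back to backward recovery only when none succeeds. The function `charge_with_fallback` and its parameters are hypothetical, and the list of payment methods stands in for actually asking the customer.

```python
from typing import Callable, Iterable


class InsufficientFundsException(Exception):
    pass


def charge_with_fallback(
    charge: Callable[[str], None],
    payment_methods: Iterable[str],
    compensate: Callable[[], None],
) -> bool:
    """Return True if some payment method was charged, False after compensating."""
    for method in payment_methods:
        try:
            charge(method)
            return True  # forward recovery: the process can continue
        except InsufficientFundsException:
            continue  # try the next payment method
    compensate()  # backward recovery: undo the steps already taken
    return False
```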

Platform-level Failure Mitigation

If a failure cannot unambiguously be classified as an application-level failure, or if it can unambiguously be classified as a platform-level failure, the platform is responsible for mitigating the failure.

First, we assume a failure is a platform-level, transient, auto-repair failure; to mitigate the failure we will issue an immediate retry.

Next, we assume a failure is a platform-level, intermittent, auto-repair failure; to mitigate the failure, we will schedule multiple retries with a backoff strategy.
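These first two stages can be sketched in Python as one immediate retry followed by retries with exponential backoff. The function name, its parameters, the default of three backoff retries, and the doubling delay are assumptions for illustration; the post does not prescribe a particular backoff strategy.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ConnectionException(Exception):
    pass


def retry_with_backoff(
    action: Callable[[], T],
    *,
    backoff_retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try once, retry immediately, then retry with exponentially growing delays."""
    try:
        return action()
    except ConnectionException:
        pass
    # A delay of 0 is the immediate retry (assume transient);
    # the remaining delays are the backoff retries (assume intermittent).
    delays = [0.0] + [base_delay * 2 ** i for i in range(backoff_retries)]
    last: Optional[ConnectionException] = None
    for delay in delays:
        if delay:
            sleep(delay)
        try:
            return action()
        except ConnectionException as exc:
            last = exc
    assert last is not None  # the loop always runs at least once
    raise last
```

Passing `sleep` as a parameter keeps the sketch testable without waiting. When every retry fails, the last ConnectionException propagates so that the next stage of mitigation can take over.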

Next, we assume a failure is a platform-level, permanent, manual-repair failure; to allow for mitigation, we will suspend the process, repair the underlying condition of the failure, and resume the process to retry.

If these mitigation efforts are ultimately unsuccessful, we have to elevate the failure to an application-level failure, abandoning failure transparence and presenting the failure to the application.
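The last two stages, suspend, repair, and resume followed by elevation, might look like the Python sketch below. Everything here is hypothetical: `repair` stands in for suspending the process and waiting for a manual fix (it returns True once the cause is repaired), and `EscalatedFailure` is an invented wrapper for presenting the failure to the application. A real platform would persist the suspended process rather than block in a loop.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class ConnectionException(Exception):
    pass


class EscalatedFailure(Exception):
    """Hypothetical wrapper: a platform-level failure elevated to the application."""


def run_with_manual_repair(
    action: Callable[[], T],
    repair: Callable[[Exception], bool],
    max_repairs: int = 1,
) -> T:
    """Retry the action after each successful repair; elevate when repair is exhausted."""
    repairs_left = max_repairs
    while True:
        try:
            return action()
        except ConnectionException as exc:
            # Stop when no repair attempts remain or the repair itself fails.
            if repairs_left == 0 or not repair(exc):
                raise EscalatedFailure("platform-level mitigation exhausted") from exc
            repairs_left -= 1
```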

Summary

Failure handling consists of failure detection and failure mitigation. Failure mitigation is either forward recovery or backward recovery. Failures may be classified on a spatial as well as a temporal dimension. On the spatial dimension, failures are application-level or platform-level. On the temporal dimension, failures are transient, intermittent, or permanent. Transient and intermittent failures repair themselves, while permanent failures require manual repair. Without further information about a failure, you start mitigation assuming the failure is a platform-level transient failure and elevate your understanding to intermittent, then permanent, and eventually application-level as you try to mitigate the failure.

References

[1] M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.