Paper Summary: Fundamentals of Fault-Tolerant Distributed Computing
Felix C. Gärtner. 1999. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Comput. Surv. 31, 1 (March 1999), 1–26.
Keywords: Failure, Failure Detection, Failure Mitigation, Failure Tolerance, Failure Transparence
This paper explores the very foundations of failure, failure tolerance, and failure transparence.
In Fundamentals of fault-tolerant distributed computing in asynchronous environments, Felix C. Gärtner presents a formal model to define important terms such as fault, fault tolerance, and redundancy. This leads to four distinct kinds of fault tolerance and to two main phases in achieving them, detection and mitigation. Instead of a vague and fuzzy notion of failure and failure tolerance, the paper draws a clear and concise mental model — you will gain the foundations to reason about failure and failure tolerance on a new level.
Note
We avoid the term fault or error and use the term failure to denote the fact that a system has not behaved according to its specification.
System Model
A System Model defines the domain of discourse, that is, the universe and its inhabitants, we use to reason. Here, we will adopt a System Model defined by Leslie Lamport. In [1] Leslie Lamport defines Definition and Execution:
- The Definition of a software system or component can be modeled as a state action machine.
- The Execution of a software system or component can be modeled as a state action behavior.
Definition and Execution are the very foundations of software engineering, the Definition corresponds to Code, the Execution corresponds to Computation. Therefore, you may think of a Definition as a generator of Executions and of an Execution as an instance of a Definition.
State Action Machine
A state action machine, the definition, is specified by a set S of states, a set I of initial states so I ⊆ S, and a next-state relation N on S so N ⊆ S ×A×S
State Action Behavior
A state action behavior, the execution, is specified by a trace σ i.e. a sequence of steps, where a step represents a transition from state sₙ to state sₘ performed by action aₙ
In summary, the definition of a system may be understood as a generator for the possible executions of a system.
Correctness, Safety, and Liveness
In [2] Leslie Lamport defines the Correctness of a systems in terms of Safety and Liveness Properties:
- Informally, a Safety Property states that something bad will never happen.
- Informally, a Liveness Property states that something good will eventually happen.
A system is correct if every possible execution of the system meets its Safety and its Liveness Property.
Failure
A formal approach to defining Failure is based on the observation that a system changes its state as a result of two transition types:
- Normal Transitions
- Failure Transitions
Informally, a failure is an unwanted but nevertheless possible state transition in an execution.
1. State ∈ Legal State, Transition ∈ Normal Transition
If a system that is currently in a legal state sₙ executes a normal transition, then the system will transition to a legal state sₘ again.
2. State ∈ Legal State, Transition ∈ Failure Transition
If a system that is currently in a legal state sₙ (repeatedly) executes a failure transition, then the system will transition to an illegal state sₘ.
3. State ∈ Illegal State, Transition ∈ Normal Transition
If a system that is currently in a failure state sₙ (repeatedly) executes a normal transition, then the system will transition to a legal state sₘ.
The sequence of transitions that lead from an illegal state to a legal state is also called the recovery transitions.
4. State ∈ Illegal State, Transition ∈ Failure Transition
If a system that is currently in a failure state sₙ (repeatedly) executes a failure transition, then the system will transition to an illegal state sₘ again.
Failure Tolerance
Informally, failure tolerance is the ability of a system to behave in a well-defined manner when failures occur:
Masking Failure Tolerance
If a system guarantees both safety and liveness in the presence of failure then the system guarantees Masking Fault Tolerance.
Masking Failure Tolerance is the most desirable form of Failure Tolerance: Masking Failure Tolerance amounts to Failure Transparency!
Non-Masking Failure Tolerance
If a system guarantees liveness but does not gurantee safety in the presence of failure then the system guarantees Non-Masking Failure Tolerance.
Informally speaking, the system does not guarantee not to make any mistakes but the system does guarantees to make progress.
Fail-Safe Failure Tolerance
If a system guarantees safety but does not guarantee liveness in the presence of failure then the system guarantees Fail-Safe Failure Tolerance.
Informally speaking, the system does guarantee not to make any mistakes but the system does not guarantee to make progress.
Example
The following example illustrates a simple Process Definition: The process has one variable x with the possible values {0, 1, 2, 3}.
process
var x ∈ {0, 1, 2, 3} init 1
transitions
// normal transition
❶ x = 1 → x := 2
❷ x = 2 → x := 1
// normal, repair transition
❸ x = 0 → x := 1
// failure transition
❹ x ≠ 0 → x := 0
endend
We define the the process’ legal states as x ∈ {1, 2}, its illegal state as x ∈ {0}, and its untolerated state as x ∈ {3}; accordingly the process normal transitions are { ❶, ❷, ❸} and its failure transition is {❹}.
By definition, the correctness specification of the process is
- Safety. The value of x is either 1 or 2
- Liveness. There is a point in the future where x will be one and there is a point in the future where x will be 2.
There is a single failure transition ❹ that may transition the system from a legal state x ∈ {1, 2} to an illegal state x ∈ {0}.
Additionally, there is a single normal ❸ that may transition the system from an illegal state x ∈ {0} to a legal state x ∈ {1}, so ❸ is a recovery transition.
In legal states, the process will guarantee both safety and liveness properties; however in an illegal state the process does not guarantee its safety, only its liveness — the system is Non-Masking Failure tolerant.
Conclusion
Intuitively, a Failure may be defined as an undesired but nevertheless possible state transition of a system while Failure Tolerance may be defined as the ability to maintain some form of correctness in the presence of failure.
(*) Notes
- Flaviu Christian remarks that “what one person calls a failure, a second person calls a fault, and a third person might call an error.”
Flaviu Cristian. 1991. Understanding fault-tolerant distributed systems. Commun. ACM 34, 2 (Feb. 1991), 56–78.
References
- [1] Leslie Lamport, 2008, Computation and State Machines
- [2] L. Lamport, “Proving the Correctness of Multiprocess Programs,” in IEEE Transactions on Software Engineering, vol. SE-3, no. 2, pp. 125–143, March 1977, doi: 10.1109/TSE.1977.229904.