H. Garcia-Molina and K. Salem. 1987. Sagas. In Proceedings of the 1987 ACM SIGMOD international conference on management of data (SIGMOD ‘87). Association for Computing Machinery, New York, NY, USA, 249–259.
Keywords: Long Running Processes, Sagas, Transactions, Compensation, Failure Mitigation, Failure Tolerance, Failure Transparency
Although the original paper on Sagas was written 35 years ago, the problem is still just as relevant today. This post explores the origins of the concept of Sagas, long-running processes that span multiple database transactions.
In their paper Sagas, Hector Garcia-Molina and Kenneth Salem introduce the concept of sagas to implement “long-running” operations.
Molina and Salem refer to processes that consist of a single step as “short-running” or “short-lived” processes and processes that consist of multiple steps as “long-running” or “long-lived”. Therefore, short-running or long-running is not a statement about physical time but a statement about logical time with each step being a clock tick (Figure 1).
In an ecosystem that largely rejects the idea of distributed transactions or holding locks for an extended period of time, any long-running process cannot simply be mapped to a long-running, distributed transaction. Instead, we need a different mechanism to compose long-running processes from short-running, non-distributed transactions.
The paper defines a process P as a sequence of steps a, b, etc, where each step is a database transaction. To use more familiar terms, you can think of a process as a function and the sequence of steps as a sequence of API calls to the database.
We refer to the sequence of steps a, b, etc as a trace:
trace(P) = a • b • …
The authors identify that the trace, that is, the sequential composition of transactions is not closed under atomicity (Figure 2). Here, the term atomicity denotes that an operation happens entirely or not at all.
In other words: A sequence of two transactions is not a transaction!
Trivially, if a process consists of a single transaction, the process exhibits the atomicity guarantees of a transaction (Figure 3).
However, if a process consists of multiple transactions, the process does not exhibit the atomicity guarantees of a transaction: The process may execute partially (Figure 4).
However, unlike other transactions, a process’ transactions are related and should be executed as a unit, any partial executions of a process are undesirable.
The paper discusses a travel reservation application, where a trip consists of multiple flights potentially on different airlines. Here, booking a trip involves multiple transactions, one per trip leg. As travelers, we want all transactions to be executed as a unit: We either want to book all flights or no flights, booking some flights is unacceptable.
The paper introduces Sagas, a protocol for implementing long-running processes. The Saga protocol ensures that a process is atomic, that is, a process executes observably equivalent to completely or not at all.
To mitigate partial execution, a process may implement either forward or backward recovery :
Forward failure recovery refers to failure mitigation strategies that transition the system from the intermediary state to a state that is equivalent to the final state (moving the process forward).
Backward failure recovery refers to failure mitigation strategies that transition the system from the intermediary state to a state that is equivalent to the initial state (moving the process backward).
Backward Recovery & Compensation
To enable backward recovery, each process should register a compensating step for each step (Figure 6).
The compensating step undoes — from a semantic point of view — the step, but does not necessarily return the database to the state that existed when the step began. Compensation will restore the world to a state which is an acceptable approximation to the state that it had before the start of the transaction.
For example, coming back to the travel reservation application, if a trip consists of two flights and reserving the first flight succeeded yet reserving the second flight failed, the saga needs to cancel the first reservation to return the database to an acceptable approximation to the state it had before the start of booking the trip.
So in the absence of failure, a process P produces the desired trace
trace(P) = s₁ • s₂ • … sₙ • ✔️
while in the presence of failure after step sᵢ, the process P produces the less desirable yet correct trace
trace(P) = s₁ • s₂ • … sᵢ • × • cᵢ … • c₂ • c₁ • ✔️
When a failure interrupts a Saga, a Saga must be able to recover its current state and either perform forward or backward recovery. Therefore, a Saga must be able to record the beginning of a Saga, the completion of a Saga, and record checkpoints and compensations.
If you own the database you may record checkpoints and compensations transactionally with each step. If you cannot record checkpoints and compensations transactionally with each step, you have to devise a mechanism to record checkpoints compensations reliably.
Although the paper was written more than 35 years ago, the problem and Sagas as the proposed solution are still relevant today. However, attention has shifted from composing database transactions to composing service invocations.
Keep in mind that services do not exhibit transactional properties: Services may not be atomic, after a service invocation, and if the service crashes, you may not know if the service took effect or not.
Therefore, you should register a compensating action before invoking a service and design compensation logic to be correct whether the service took effect or not.
Sagas remain relevant, despite significant advances in theory and practice since the paper’s publication. The basic concept of a Saga as a protocol for implementing atomicity in a distributed environment across transactions or services the techniques described in the paper continue to be widely used today.
-  A trace semantics for long-running transactions, M. Butler, C. A. R. Hoare & C. Ferreira, 2005
-  Handling Failures From First Principles, Dominik Tornow, 2022