Paper Summary: End-to-End Arguments in System Design

Dominik Tornow

J. H. Saltzer, D. P. Reed, and D. D. Clark. 1984. End-to-end arguments in system design. ACM Transactions on Computer Systems 2, 4 (Nov. 1984), 277–288.

Key Words: Function, Completeness, Correctness, Application Layer, Platform Layer, Failure, Failure Detection, Failure Mitigation.

The End-to-End Argument is a great, thought-provoking paper that explores the limits of pushing functionality from the application layer into the platform layer: Which functions can you push into the platform, increasing reusability, maturity, and performance, while still maintaining both completeness and correctness, especially in the presence of failure? The End-to-End Argument will help you decide where each function belongs in your architecture.

In their 1984 paper “End-to-End Arguments in System Design”, Saltzer, Reed, and Clark present a design principle that helps guide the placement of functions among the modules of a distributed system. In this paper, the term functions refers to functionality, not a particular function definition in a programming language. Similarly, the term modules refers to layers, not a particular organizational construct in a programming language.

Saltzer, Reed, and Clark assume a layered architectural style. Components are arranged in a layered fashion where components at a higher layer can make a downcall to components at a lower layer, generally expecting a response. Less frequently, components at a lower layer can make an upcall to components at a higher layer, generally via a previously registered callback. [1]
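The downcall and upcall relationship can be sketched in a few lines of Python. This is only an illustration of the layered style, not code from the paper: the class names `PlatformLayer` and `ApplicationLayer` and the methods `send`, `deliver`, and `register_callback` are hypothetical.

```python
from typing import Callable, List, Optional


class PlatformLayer:
    """Lower layer (hypothetical): serves downcalls and may issue upcalls."""

    def __init__(self) -> None:
        self._on_receive: Optional[Callable[[bytes], None]] = None

    def register_callback(self, callback: Callable[[bytes], None]) -> None:
        # The higher layer registers its callback ahead of time.
        self._on_receive = callback

    def send(self, data: bytes) -> bool:
        # Target of a downcall: the caller expects a response.
        # Here the response simply reports whether there was anything to send.
        return len(data) > 0

    def deliver(self, data: bytes) -> None:
        # Upcall: the lower layer invokes the previously registered callback.
        if self._on_receive is not None:
            self._on_receive(data)


class ApplicationLayer:
    """Higher layer (hypothetical): makes downcalls and receives upcalls."""

    def __init__(self, platform: PlatformLayer) -> None:
        self.platform = platform
        self.inbox: List[bytes] = []
        platform.register_callback(self.inbox.append)

    def transmit(self, data: bytes) -> bool:
        return self.platform.send(data)  # downcall
```

The application depends on the platform directly, whereas the platform knows the application only through the callback it was handed.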

For this blog post, I limit our discussion to two layers. I will refer to the top layer as the application layer and the lower layer as the platform layer.

The End-to-End Argument

The End-to-End Argument states that some functions may “completely and correctly be implemented only” on the application level: implementing these functions completely and correctly on the platform level is not possible. This impossibility is rooted in the fact that the application layer has total information, whereas the platform layer may have only partial information. Informally, the platform layer lacks context.

However, the End-to-End Argument does not preclude providing a partial, incomplete implementation of a function on the platform level, or duplicating a function there. Such an implementation serves strictly as an optimization, not as a guarantee of completeness and correctness.

In addition, the paper stresses that the End-to-End Argument is a guideline that helps in application and platform design analysis; however, identifying the endpoints to which the End-to-End Argument should be applied requires a nuanced analysis of application requirements.

Example

The significance of the End-to-End Argument is most apparent when reasoning about layers and failures: Is a layer able to detect a failure? If a layer is able to detect a failure, what should that layer do? Should the layer mitigate the failure? Should the layer present the failure to the next higher layer?

Reliable File Transfer

The paper discusses several examples; for brevity, I will focus on the example of Reliable File Transfer.

The objective is to copy a file from computer A’s storage to computer B’s storage without damage, with the knowledge that failures can occur at various points along the way.

A popular file transfer implementation is to transfer the file in chunks, for example to meet data transmission size restrictions or to increase data transmission concurrency:

  • At the source, the application layer splits the data into chunks before handing each chunk downwards to the platform layer for transfer.
  • At the target, the platform layer receives a chunk before handing the chunk upwards to the application layer for assembly.
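The two application-layer steps above, splitting at the source and assembly at the target, can be sketched as follows. This is a minimal sketch under assumptions: the function names, the `(index, chunk)` tuple shape, and the tiny `CHUNK_SIZE` are all hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Tuple

CHUNK_SIZE = 4  # hypothetical value, kept tiny for illustration


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Source application layer: yield (index, chunk) pairs."""
    for index, offset in enumerate(range(0, len(data), size)):
        yield index, data[offset:offset + size]


def assemble(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Target application layer: order chunks by index and concatenate them."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return b"".join(chunk for _, chunk in ordered)
```

Carrying the index with each chunk lets the target reassemble the file even if the platform layer delivers chunks out of order, which concurrent transmission makes likely.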

So now the question arises: Can you implement file transfer completely and correctly by limiting failure detection and failure mitigation to the platform layer, or do you (also) need failure detection and failure mitigation on the application layer?

Failure Detection and Mitigation

While the platform layer may indeed detect transmission failures of chunks via checksums on chunks and mitigate those failures via retransmission of chunks, only the application layer can detect assembly failures of files via checksums on files and mitigate those failures via retransmission of files.
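To make the distinction concrete, here is a hedged sketch in which every chunk passes its platform-level check and the assembled file is still wrong. The choice of CRC-32 for chunks and SHA-256 for the file is an assumption made for this example, as are all function names; the paper does not prescribe particular checksums.

```python
import hashlib
import zlib
from typing import Tuple

Frame = Tuple[bytes, int]  # (chunk, CRC-32 of the chunk); hypothetical wire format


def platform_send(chunk: bytes) -> Frame:
    """Platform layer at the source: attach a per-chunk checksum."""
    return chunk, zlib.crc32(chunk)


def platform_receive(frame: Frame) -> bytes:
    """Platform layer at the target: verify the per-chunk checksum."""
    chunk, crc = frame
    if zlib.crc32(chunk) != crc:
        raise IOError("chunk corrupted in transit")
    return chunk


def file_digest(data: bytes) -> str:
    """Application layer: end-to-end checksum over the whole file."""
    return hashlib.sha256(data).hexdigest()


original = b"end-to-end arguments"
expected = file_digest(original)

chunks = [original[i:i + 5] for i in range(0, len(original), 5)]
# Every chunk passes the platform-level check.
received = [platform_receive(platform_send(c)) for c in chunks]

# Simulated assembly fault: the first two chunks are swapped.
faulty = b"".join([received[1], received[0]] + received[2:])
correct = b"".join(received)

app_detects_fault = file_digest(faulty) != expected
app_accepts_correct = file_digest(correct) == expected
```

The platform layer reports no failure in either case because it only ever sees individual chunks. Only the comparison against `expected`, which requires knowledge of the whole file, distinguishes the faulty assembly from the correct one.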

Failure Presentation

While the platform layer may indeed try to mitigate transmission failures via retransmissions, it cannot retry forever: to avoid an infinite loop, the platform layer eventually has to present repeated transmission failures to the application layer.
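A bounded retry loop is one common way to express this. The sketch below is illustrative only: `TransmissionError`, `send_with_retries`, the `max_attempts` default, and the `make_flaky_sender` test helper are hypothetical names, not part of the paper or of any particular library.

```python
from typing import Callable


class TransmissionError(Exception):
    """Raised by the platform layer once it gives up on a chunk."""


def send_with_retries(
    send_once: Callable[[bytes], bool],
    chunk: bytes,
    max_attempts: int = 3,
) -> int:
    """Try to send a chunk; return the number of attempts that were needed.

    After max_attempts failures, the failure is presented to the caller
    (the application layer) instead of being retried indefinitely.
    """
    for attempt in range(1, max_attempts + 1):
        if send_once(chunk):
            return attempt
    raise TransmissionError(f"chunk not delivered after {max_attempts} attempts")


def make_flaky_sender(failures: int) -> Callable[[bytes], bool]:
    """Helper for illustration: a sender that fails a fixed number of times."""
    remaining = [failures]

    def send_once(chunk: bytes) -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return send_once
```

With `make_flaky_sender(2)` the call succeeds on the third attempt and the application never learns of the two failures. With `make_flaky_sender(5)` the platform gives up and raises, and the application must decide what to do next, for example retry the whole file later or report the error to the user.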

In summary, although the platform layer implements partial failure detection and mitigation, ultimately only the application layer is able to implement total failure detection and mitigation: only the application layer can determine whether a file transfer succeeded or failed, and how to handle a failure.

Conclusion

The End-to-End Argument states that some functions may “completely and correctly be implemented only” on an application level, even though the End-to-End Argument does not preclude partially implementing functions on a platform level as an optimization.

For example, failure detection and mitigation of a file transfer can (and should) happen on both the application level and the platform level, but only the application layer can ensure completeness and correctness of the transfer.

Types of Optimization

  • Reusability & Maturity. Even though some functions may completely and correctly be implemented only on the application layer, duplicating functionality on the platform layer may aid correctness. Some functions are complex and therefore error-prone; encapsulating these functions in the platform layer enables us to take advantage of their maturity and only “fill in the gap” of these functions in the application layer.
  • Performance. If the application layer detects a failure in the transmission of the file, the application layer may mitigate that failure by retransmitting the file. However, if the platform layer detects a failure in the transmission of a chunk, the platform layer may mitigate that failure by retransmitting only the chunk. We may be able to avoid retransmitting the file if retransmission of the chunk is successful.
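The performance point can be made tangible with some back-of-the-envelope arithmetic. The sketch below rests on simplifying assumptions that are mine, not the paper's: the file size is a multiple of the chunk size, each retransmission succeeds on the first try, and the function name `bytes_resent` is hypothetical.

```python
def bytes_resent(file_size: int, chunk_size: int, failed_chunks: int, chunk_level: bool) -> int:
    """Bytes retransmitted after failed_chunks chunks were damaged in transit.

    chunk_level=True:  the platform layer resends only the damaged chunks.
    chunk_level=False: only the application layer notices, so the whole
                       file is resent once.
    """
    if failed_chunks == 0:
        return 0
    if chunk_level:
        return failed_chunks * chunk_size
    return file_size


MIB = 1024 * 1024
file_size = 1024 * MIB  # a 1 GiB file (illustrative)
chunk_size = MIB        # sent in 1 MiB chunks (illustrative)

with_platform_retry = bytes_resent(file_size, chunk_size, failed_chunks=1, chunk_level=True)
without_platform_retry = bytes_resent(file_size, chunk_size, failed_chunks=1, chunk_level=False)
```

Under these assumptions, a single damaged chunk costs 1 MiB of retransmission with platform-level retries and 1 GiB without them. The end-to-end check on the file is still required for correctness; the platform-level retry only makes the expensive path rarer.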

References

[1] M. van Steen and A.S. Tanenbaum, Distributed Systems, 3rd ed., distributed-systems.net, 2017.
