The Role of Distributed State, John Ousterhout, 1990.
This winter break, I’m running a reading group with some friends for foundational papers in computer systems. For week 1, we read “The Role of Distributed State” and “In Search of an Understandable Consensus Algorithm”, without initially realising that John Ousterhout was an author for both papers. This was serendipitous because both papers complement each other, and we have an opportunity to see how Ousterhout’s ideas in 1990 develop and come into play in 2014, two-and-a-half decades later.
To summarise this paper, Ousterhout discusses the upsides and downsides of distributed state, then applies these ideas by evaluating and comparing the NFS and Sprite distributed file systems. He concludes that distributed state comes with unavoidable trade-offs, and system designers should be cognisant of their options and design accordingly. While state is essential for achieving performant systems, we should also strive to achieve the system’s goals with as little distributed state as possible.
The paper starts off with a definition of distributed state as “information retained in one place that describes something, or is determined by something, somewhere else in the system”. This is perhaps the weakest point of the paper, because this definition is nebulous and thus not very helpful for thinking about distributed state. Ousterhout’s main point here is that there’s a lot of state in a distributed system that’s not distributed state. For example: the values stored in registers on a machine, or the stack of a running process on a machine.
For a more useful and perhaps more memorable definition, I would describe distributed state as “information which may exist in more than one place that is used by the distributed system to achieve its cooperative goals”. So, for example, cached data from another computer, knowledge about other nodes in a cluster, information about data items spread out across the system etc.
Ousterhout then describes the benefits of distributed state:
- Performance. Distributed state involves caching locally some information that exists elsewhere in the system. This removes the continued cost of communication.
- Cooperation [1]. For multiple machines (or multiple components of the same machine) to work together, they need to be aware of common goals and have a means of coordinating effort.
- Reliability. Replication of state in multiple places gives us redundancy, which gives us fault tolerance because the system doesn’t break down if a copy of some replicated data is lost.
However, distributed state also comes with some costs:
- Inconsistency. State that exists in multiple places cannot be updated simultaneously. There are a few options to deal with this. A system could be pessimistic and not allow inconsistencies to happen; it could be lazy and fix inconsistencies only when needed; or it could be optimistic and accept that inconsistencies will happen. A system will usually involve a mix of consistency requirements, and different components of the system will use different strategies for dealing with inconsistency.
- Crash sensitivity. Distributed systems don’t achieve fault tolerance by default. In fact, instead of a single point of failure, they can often have multiple points of failure, making them more fragile. To avoid this, state must be replicated in a way that masks failures. Specifically, there must be the properties of discoverability [2] (replacement replicas must be available after a primary replica fails), verifiability (the system must be able to determine which replicas are up-to-date because failures can occur during windows of inconsistency), and recoverability (the system must be able to heal and bring replicas into consistency when a machine fails).
- Overheads. Maintaining consistency requires communication, which takes time, and maintaining fault tolerance requires redundancy, which takes space. Moreover, the cost depends on the distributed algorithm used — the degree of sharing, the rate of change (for example, write-heavy workloads), the level of consistency, the degree of replication. At some point, distributed state might be so expensive to maintain that it would be better to centralise the state [3].
- Complexity. Distributed state is hard to reason about. There are all the issues of consistency and fault tolerance [4] to grapple with, and concurrency is hard to debug [5]. There are very real costs to complexity: a lot of effort goes into getting a system right, and fixing errors with it, rather than optimising the system. This is why understandability is an important property for systems, and leads to many of the design decisions in Raft.
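The three strategies for dealing with inconsistency can be illustrated with a toy sketch of a single value replicated from a primary into client caches. The classes and the version probe are my own framing, not the paper’s:

```python
class Primary:
    """The authoritative copy of a value, plus the caches holding replicas."""
    def __init__(self):
        self.value, self.version = 0, 0
        self.caches = []

    def write(self, value, pessimistic):
        if pessimistic:
            # Pessimistic: invalidate every cached copy *before* the write
            # lands, so no reader can ever observe a stale value.
            for cache in self.caches:
                cache.value = None
        # Optimistic: just write; cached copies go stale until repaired.
        self.value, self.version = value, self.version + 1


class Cache:
    """A remote copy of the primary's value."""
    def __init__(self, primary):
        self.primary = primary
        self.value, self.version = primary.value, primary.version
        primary.caches.append(self)

    def read(self):
        # Lazy: detect and repair staleness only at use time, via a cheap
        # version probe. A purely optimistic reader would skip the probe
        # and accept whatever copy it holds.
        if self.value is None or self.version != self.primary.version:
            self.value, self.version = self.primary.value, self.primary.version
        return self.value
```

The pessimistic path pays a round of invalidation on every write; the lazy path pays a probe on every read; the optimistic path pays nothing until an inconsistency actually bites.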
Following this, Ousterhout describes the NFS file system. He describes it as “stateless” but immediately qualifies that this is a misnomer. Here’s a summary of features of NFS:
- Distributed state is kept to a minimum
- “Stateless”, but only in the sense that correctness depends entirely on the server’s disk (write-through caching of disk blocks in main memory is permitted only to improve performance)
- Remote Procedure Calls (RPCs), including read, write, and lookup of a file inside a directory, are idempotent (except for the mkdir function call)
- Distributed state is almost entirely client-side, and servers only maintain authorisation data. Clients keep track of file identifiers, file data, file attributes, and name translations (cached results from lookups)
The idempotency of NFS RPCs and the absence of buffering writes on the client-side or in the server’s memory makes NFS a simple system to reason about. Server crashes appear only as delays, and result in no loss of state.
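The payoff of idempotency can be sketched as a hypothetical at-least-once RPC loop: because an NFS-style write names its file and offset explicitly, replaying it leaves the same bytes on disk, so the client can simply retry until the server answers. The helper names here are mine, not NFS’s:

```python
import time

def write_rpc(server, file_id, offset, data):
    """Positional write: applying it twice leaves the same bytes behind."""
    existing = server.get(file_id, b"")
    buf = bytearray(existing.ljust(offset + len(data), b"\0"))
    buf[offset:offset + len(data)] = data
    server[file_id] = bytes(buf)
    return "ok"

def call_with_retry(rpc, *args, attempts=5, delay=0.01):
    # At-least-once delivery: retry until the server responds. This is
    # safe only because the RPC is idempotent; a crashed server shows up
    # to the caller as nothing worse than a delay.
    for _ in range(attempts):
        try:
            return rpc(*args)
        except ConnectionError:   # server down or unreachable: wait, retry
            time.sleep(delay)
    raise TimeoutError("server did not recover in time")
```

If `write_rpc` were instead “append”, the retry loop would silently duplicate data, which is exactly why NFS avoids non-idempotent operations where it can.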
However, NFS has three weaknesses: performance, consistency, and semantics. Performance suffers because writes go through the write-through cache and are only as fast as the disk. Moreover, edited data tends to be edited again very soon, so the lack of write buffering results in an extremely high volume of disk writes.
Consistency suffers because servers don’t keep track of which clients have files open, so concurrent writes can happen, and stale reads happen because clients aren’t informed of writes. Clients solve this issue by polling the server for the last-modified-time of a file, but this has performance costs and still allows for periods of inconsistency.
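The polling scheme can be sketched roughly as follows. This is a hypothetical client cache of my own, with an assumed freshness window; the real NFS attribute cache behaves similarly but with more moving parts:

```python
import time

POLL_INTERVAL = 3.0  # seconds; an assumed freshness window, not NFS's actual value

class NfsClientCache:
    """Lazy consistency: trust the cache briefly, then re-check mtime."""
    def __init__(self, server):
        self.server = server
        self.data = {}   # file_id -> (mtime, contents, last_checked)

    def read(self, file_id):
        now = time.monotonic()
        entry = self.data.get(file_id)
        if entry:
            mtime, contents, last_checked = entry
            if now - last_checked < POLL_INTERVAL:
                return contents          # trust the cache; possibly stale
            if self.server.mtime(file_id) == mtime:
                # Cheap validation: only a timestamp crosses the wire.
                self.data[file_id] = (mtime, contents, now)
                return contents
        # Cache miss, or the file changed on the server: fetch everything.
        mtime, contents = self.server.fetch(file_id)
        self.data[file_id] = (mtime, contents, now)
        return contents
```

The window is the whole problem: any write that lands within `POLL_INTERVAL` of another client’s read goes unnoticed until the next poll.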
Finally, semantics suffers due to the need to maintain statelessness and idempotency. Features that conflict with these properties can either be implemented in a cumbersome way (Ousterhout gives the example of implementing file locks, which have to be kept on disk to preserve statelessness), or they have to violate one of the properties (for example, mkdir breaks idempotency). In some cases, features are simply not implemented (which was the final fate of file locks).
In contrast with NFS, Sprite keeps track of more distributed state, namely:
- Servers maintain information, in main memory, about clients that are reading or writing to files
- Servers maintain modified file blocks in main memory and flush only after the data has aged for 30 seconds
- Clients retain modified file blocks in main memory and pass on the dirty data to servers only after it has aged for 30 seconds, or when file data is requested by another client
These pieces of state allow Sprite to maintain consistency between concurrent clients and prevent stale reads. For example, if a client writes to a file, the server is able to notify all clients with the file open to stop caching file data.
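The callback mechanism can be sketched roughly like this. Class and method names are my own; real Sprite tracks rather more per-file state than a single open mode:

```python
class SpriteServer:
    """Tracks which clients have which files open, and in what mode."""
    def __init__(self):
        self.open_files = {}   # file_id -> {client: mode}

    def open(self, client, file_id, mode):
        holders = self.open_files.setdefault(file_id, {})
        holders[client] = mode
        writers = [c for c, m in holders.items() if m == "w"]
        if writers and len(holders) > 1:
            # Concurrent access involving a writer: call back every
            # holder and make them send reads/writes through the server.
            for c in holders:
                c.disable_caching(file_id)
            return False       # not cacheable
        return True            # sole user: safe to cache locally


class SpriteClient:
    def __init__(self):
        self.cacheable = {}

    def open(self, server, file_id, mode):
        self.cacheable[file_id] = server.open(self, file_id, mode)

    def disable_caching(self, file_id):
        self.cacheable[file_id] = False
```

Note that this only works because the server holds state about its clients; the stateless NFS server has no one to call back.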
Additionally, consistency improves performance because we remove the need to sync everyone all the time (i.e. NFS’ polling issue). Moreover, since dirty blocks are buffered, they can accumulate changes before being flushed.
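The delayed-write policy can be sketched as an age-based buffer. Names and structure are mine; real Sprite operates on file blocks and ages data on both the client and the server:

```python
import time

AGE_LIMIT = 30.0   # seconds a dirty block may sit in memory, per the paper

class WriteBuffer:
    """Buffers dirty blocks, flushing them once aged or on demand."""
    def __init__(self, flush_to_server):
        self.flush_to_server = flush_to_server
        self.dirty = {}   # block_id -> (data, first_dirtied_at)

    def write(self, block_id, data):
        # Overwrites coalesce: repeated edits to the same block cost one
        # eventual flush instead of one disk write per edit.
        first = self.dirty.get(block_id, (None, time.monotonic()))[1]
        self.dirty[block_id] = (data, first)

    def tick(self):
        # Called periodically: flush blocks older than the age limit.
        now = time.monotonic()
        for block_id, (data, first) in list(self.dirty.items()):
            if now - first >= AGE_LIMIT:
                self.flush_to_server(block_id, data)
                del self.dirty[block_id]

    def demand_flush(self, block_id):
        # Another client asked for this data: flush it early.
        if block_id in self.dirty:
            data, _ = self.dirty.pop(block_id)
            self.flush_to_server(block_id, data)
```

The coalescing in `write` is where the performance win over NFS’s write-through cache comes from; `demand_flush` is what keeps the laziness from being visible to other clients.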
However, state introduces complexity and can lead to frustratingly subtle race conditions. Ousterhout gives the example of client A opening a file, but the return message that the file is cacheable is delayed. Meanwhile, client B opens the file to write to it, so the server sends an update to client A that the file is no longer cacheable, which arrives out of order with the first return message. So client A first sees that the file is not cacheable, and then that it is. More state solves this issue.
Another issue is with data loss: since data is buffered in main memory, it can be lost on crashes. Ousterhout points out that only the last minute of data is lost, and since main memory can be flushed on crashes, only crashes due to power loss pose any real issue. Whether this is acceptable depends on the application and its tolerance for data loss.
Additionally, a crashed server loses information about clients and their open files. Here Ousterhout remarks that “distributed state was not just the cause of the recovery problem, but also the solution”, and introduces the “reopen” protocol that allows clients to reopen files to help the server reconstruct its state. The problem is that this can lead to a recovery storm: when a server comes back online, all clients simultaneously reopen all their files, which overloads the server and causes some requests to time out, prompting the clients to resend their reopen requests and exacerbating the problem. Most clients thus take a few attempts before they can reopen their files, but one can imagine that this problem might become more intense with more clients, and may even lead to the server crashing once more. And thus we enter the world of metastable failures.
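A common mitigation for such storms (my addition here, not something Sprite did) is to randomise retry timing so that reopen attempts spread out instead of arriving in synchronised waves. A minimal sketch of exponential backoff with “full jitter”:

```python
import random

def reopen_delays(max_attempts=5, base=1.0, cap=60.0, rng=random.random):
    """Yield the delay before each retry: capped exponential backoff
    with full jitter (a uniform draw in [0, backoff))."""
    for attempt in range(max_attempts):
        backoff = min(cap, base * (2 ** attempt))
        yield rng() * backoff   # each client lands somewhere different
```

With jitter, the server sees a trickle of reopens rather than a wall of them, so each attempt is likelier to succeed and the storm damps out instead of feeding itself.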
The final issue is that distributed state comes with memory overheads. At the time the paper was written, Sprite’s state information occupied about 15-20% as much space as was used for caching file data. This is sizeable, and could become more of a concern as the system scales.
Ousterhout’s paper gives us a good overview of the challenges faced by distributed systems — performance, consistency, fault tolerance, complexity — and gives us the sobering conclusion that the trade-offs of distributed state are inevitable. Distributed systems are, by definition, systems spread out across space, and hence performance is only achievable with distributed state to cache information and cut down the cost of communicating over this space. Since we can’t do away with distributed state, we have to work with the next best option and deliberately minimise the amount of state used by the system.
[1] Ousterhout calls this coherence, but I believe coherence is simply a property of the system that allows it to achieve cooperation.

[2] My terms.

[3] From my observations, this is the strategy used in most scalable distributed systems. Distributed state is costly to scale, so we “centralise” it. However, centralisation doesn’t mean to store information on a single machine, it just means to store it in a single logical entity that could comprise multiple machines (usually kept in sync with something like Raft or Paxos). This gives us the simplicity and performance of centralisation on a logical layer, along with the fault tolerance of decentralisation on a physical layer.

[4] Ousterhout jokes (then again, this may not be a joke) that “one of the best things about a centralized system is that the whole system stops whenever any component stops”.

[5] For an example, see this post on deadlocks.