Last week The Climate Corporation invited Kyle Kingsbury — inventor of Jepsen and distributed-systems-expert-about-town — to lead a workshop on distributed computing theory and practice at our San Francisco office.
Kyle’s work focuses on the necessity of theory to distributed computing practice: a distributed system is almost impossible to get right on the first try, and you need all the theoretical assistance you can get. You can see empirical applications of distributed computing theory to real-world systems in his series of Jepsen blog posts.
This post will follow a rough outline of Kyle’s workshop, starting with theoretical definitions and concepts, and ending with applying those definitions and concepts to discuss and evaluate distributed computing practice. Enjoy!
What is a Distributed System?
Kyle began the discussion with the most fundamental question: what does it mean when we call a system “distributed”? His definition — any system made up of parts that interact slowly or unreliably — is familiar, but by noting that the definition of “slow” is highly dependent on context, he opens up the “distributed” label to a wide range of systems that don’t immediately appear to be distributed. This could include:
- NUMA architectures
- making plans via SMS
- point-of-sale terminals
- mobile applications
- communicating threads or processes on a single computer
- and of course, computers communicating via TCP/UDP over a LAN or the internet.
If the inter-part communication is, say, an order of magnitude slower than the intra-part communication, then you have a distributed system.
Since all networks in the real world take time to propagate information (and that information can be arbitrarily delayed or lost), understanding timing (and uncertainty of timing) is crucial to being able to reason about the behavior of the system. An important tool Kyle introduced for that purpose is the light cone diagram (by analogy to relativity and spacetime).
Because operations take time to execute and return a result, their ordering is ambiguous in this example. Thread 2 can know the time it initiated the read and the time it received the result, but it can’t know the time the read operation was applied to the memory — the read could return either the old or the new value.
We want to be able to define an ordering for events. However, keeping track of time in a distributed system is difficult. There are a number of types of clocks used for ordering events in a distributed system.
Wall clocks. Nodes track “actual” (human) time, usually by talking with an NTP (Network Time Protocol) server. In practice, though, synchronization isn’t precise enough: hardware clocks drift, the system clock can jump wildly, and POSIX time is not monotonic.
Lamport clocks. Each node has a counter that it increments with each state transition. The counter is included with every message and, on receiving a message, a node sets its counter to the maximum of its own value and the one in the message. This gives a partial ordering of events: if event b is causally dependent on event a, then C(a) < C(b).
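A Lamport clock fits in a few lines. This is a minimal sketch (class and method names are my own, not from the workshop):

```python
class LamportClock:
    """A per-node logical counter for partially ordering events."""

    def __init__(self):
        self.counter = 0

    def tick(self):
        # Local event: increment before assigning a timestamp.
        self.counter += 1
        return self.counter

    def send(self):
        # Stamp an outgoing message with a fresh counter value.
        return self.tick()

    def receive(self, message_counter):
        # Adopt the larger of the two counters, then count the
        # receive itself as an event.
        self.counter = max(self.counter, message_counter) + 1
        return self.counter
```

Any causally dependent event (e.g., receiving a message) is guaranteed a larger timestamp than the event it depends on.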
Vector clocks. Vector clocks generalize Lamport clocks. The clock is a vector of counters, one for each node. When executing an operation, a node increments its counter in the vector, and again includes the clock with every message to other nodes. When receiving a message, a node takes the pairwise maximum of its clock and the one in the message. Orders more events than Lamport clocks.
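The vector-clock rules above can be sketched as plain functions over dicts mapping node name to counter (a hypothetical encoding, not from the workshop):

```python
def vc_increment(clock, node):
    # A node bumps its own entry before each local event or send.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(local, received):
    # On receive: pairwise maximum over all nodes in either clock.
    nodes = set(local) | set(received)
    return {n: max(local.get(n, 0), received.get(n, 0)) for n in nodes}

def vc_happens_before(a, b):
    # a -> b iff every entry of a is <= b's and at least one is
    # strictly less; otherwise the events are concurrent.
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))
```

Two clocks where neither happens-before the other represent concurrent events — exactly the case a Lamport clock cannot distinguish.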
Dotted version vectors. “Better vector clocks.” Still a partial ordering, but orders more events.
GPS & Atomic clocks. Much better than NTP. They give a globally distributed total order with uncertainty windows on the scale of milliseconds, which means you can perform one operation (that’s allowed to conflict) per uncertainty window. But they’re really expensive.
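The “one operation per uncertainty window” idea can be sketched as a commit-wait: if the clock error is bounded by ε, waiting out the window before acknowledging guarantees the timestamp is in the past on every node. This is a toy illustration (the ε value and function names are made up, and real systems like Spanner do considerably more):

```python
import time

EPSILON = 0.007  # assumed clock-uncertainty bound: 7 ms


def commit(apply_write):
    """Apply a write, then wait out the uncertainty window before acking."""
    ts = time.time()  # true time lies within [ts - EPSILON, ts + EPSILON]
    apply_write(ts)
    # Sleep past the whole window so ts is definitely in the past
    # everywhere before we acknowledge the commit.
    time.sleep(2 * EPSILON)
    return ts
```

The cost is explicit: every conflicting commit pays the full uncertainty window in latency, which is why shrinking that window with GPS and atomic clocks is worth real money.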
As software developers, we expect the systems we program against to behave correctly. But there are many possible definitions of correct behavior, which we call consistency models.
Each consistency model defines which sequences of operations on a state machine are valid. In the diagram below, a parent-child relationship between two models means any valid sequence of operations in the parent is valid in the child. That is, requirements get stricter as you move up the tree, which makes it easier to reason about the system in question but incurs performance and availability penalties.
Kyle himself has an excellent write-up of these consistency models on his blog, but I’ll give a brief description of a few here.
Linearizability – There appears to be a single global state where operations are applied linearly.
Sequential consistency – Writes appear in the same order to all readers. (But a read may not reflect the “current” state.)
Causal consistency – Only the order of *causally related* operations is linear. Unrelated (concurrent) operations may appear in any order to readers.
Linearizability/Strong serializability is the “gold standard” of consistency models. But there are performance and availability tradeoffs as you move up the tree: it is impossible to have total availability in a linearizable (or even sequentially consistent) system.
Convergent Replicated Data Types
CRDTs are a class of eventually consistent data types that work by supporting an idempotent, commutative and associative merge operation. The operation merges the histories of two (diverging) versions of the data structure. Some thorough and accessible examples of CRDTs can be found in Kyle’s project here.
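The canonical starter CRDT is a grow-only counter: each node increments only its own slot, and merge takes the pairwise maximum. A minimal sketch (not taken from Kyle’s project):

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merge = pairwise max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node, amount=1):
        # Each node only ever touches its own slot.
        self.counts[node] = self.counts.get(node, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Idempotent, commutative, and associative, so replicas can
        # exchange state in any order, any number of times, and converge.
        nodes = set(self.counts) | set(other.counts)
        return GCounter({n: max(self.counts.get(n, 0), other.counts.get(n, 0))
                         for n in nodes})
```

Because merge is a join (pairwise max), re-delivering or reordering merges can never lose increments — the property that makes CRDTs safe under eventual consistency.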
To achieve stricter consistency models in a distributed system, it is necessary to achieve consensus on certain facts among a majority of nodes. The Paxos algorithm is the notoriously difficult to understand gold standard of consensus algorithms, introduced by Leslie Lamport in his 1998 paper, The Part-Time Parliament.
Since its publication, it has inspired a family of similar algorithms with optimizations for performance or accessibility, including Multi-Paxos, Fast Paxos, and Generalized Paxos.
In Search of an Understandable Consensus Algorithm (Ongaro & Ousterhout, 2014) is a recent publication describing the Raft consensus algorithm that, like it says on the tin, is intended to be (more) easily understood than Paxos, while retaining similar guarantees and performance characteristics.
A Pattern Language
Kyle’s number one piece of advice for building a distributed system is don’t do it if you don’t have to. Maybe your problem can be solved by getting a bigger box, or tolerating some downtime.
Number two is to use someone else’s distributed system. Pay Amazon to do it, or at least use battle-tested software.
Number three is don’t fail. You can build really reliable hardware and networks at the cost of moving slowly, buying expensive hardware and hiring really good talent.
But, if your system runs long enough, it will eventually fail, so accept it gracefully.
Take frequent backups. Done correctly, they give you sequential consistency. When taking a backup of a database, understand the transactional semantics. (For example, copying the filesystem state of a MySQL database may not capture a consistent transactional state, and the filesystem may change during the backup. It’s better to use mysqldump, which preserves the database’s transactional semantics.)
Use immutable values when possible. Data that never changes is trivial to store because it does not require coordination.
If you require mutable values, use mutable identities instead: pointers to immutable values. You can fit a huge number of mutable pointers in a linearizable RDBMS while the immutable values themselves live in S3 or other eventually consistent stores. This is similar to how Datomic works.
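The pattern can be sketched with two in-memory dicts standing in for the two stores (a toy illustration under assumed names — real code would talk to S3 and a database):

```python
import hashlib

blob_store = {}  # stand-in for S3: immutable, content-addressed values
pointers = {}    # stand-in for a linearizable RDBMS: name -> content hash


def put_value(data: bytes) -> str:
    # Immutable values need no coordination: the key is the content
    # hash, so concurrent writers of the same bytes can't conflict.
    key = hashlib.sha256(data).hexdigest()
    blob_store[key] = data
    return key


def update(name: str, data: bytes):
    # Only the tiny pointer write needs linearizability.
    pointers[name] = put_value(data)


def read(name: str) -> bytes:
    return blob_store[pointers[name]]
```

All the coordination cost is concentrated in the small pointer table; the bulk of the data rides on cheap, eventually consistent storage.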
Some operations can return fractional results or tolerate fractional availability. (E.g., search results, video or voice call quality, analytics, map tiles.) This lets you gracefully degrade the app experience while trying to recover, rather than suddenly crashing. Feature flags that turn off application features in response to system health metrics can be used to the same end.
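A health-driven feature flag can be as simple as a threshold check per feature. A hypothetical sketch (the metric names and thresholds are invented for illustration):

```python
def enabled_features(health, flags):
    """Return the subset of features still enabled given system health."""
    # Each flag names a health metric and the minimum value below
    # which the feature is switched off.
    return {feature for feature, (metric, minimum) in flags.items()
            if health.get(metric, 0.0) >= minimum}


FLAGS = {
    "search_suggestions": ("search_cluster_health", 0.8),
    "hd_video": ("bandwidth_headroom", 0.5),
}
```

As a metric degrades, individual features wink out while the core of the application keeps serving — graceful degradation rather than a crash.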
Service owners should write client libraries for their services, and the libraries should include mock I/O to allow consumers to test locally without having to set up (and keep in sync) a local instance of the dependency. Include the library version in the request headers so you can track uptake of new versions and go after the slowpokes.
Particularly important when moving from monolithic to distributed architectures is to test everything, all the time. This includes correctness invariants, of course, but also performance invariants. Simulate network failures and performance-test against large data sets. Test against production data sets, since they always differ from test data.
Always bound the size of queues in your application, or you can get in a situation where work takes arbitrarily long to complete. Monitor enqueue and dequeue rates on the queue (throughput) as well as the time it takes a message to be worked after being enqueued (latency).
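With Python’s standard library, a bounded queue with enqueue-to-dequeue latency measurement looks roughly like this (the backpressure policy — shedding load on a full queue — is one choice among several):

```python
import queue
import time

work = queue.Queue(maxsize=1000)  # bounded: never grows without limit


def enqueue(item, timeout=0.05):
    try:
        # Timestamp at enqueue so the consumer can measure latency.
        work.put((time.monotonic(), item), timeout=timeout)
        return True
    except queue.Full:
        # Backpressure: shed load (or push back on the producer)
        # instead of letting work pile up unboundedly.
        return False


def work_one(handle):
    enqueued_at, item = work.get()
    latency = time.monotonic() - enqueued_at  # time spent queued
    handle(item, latency)
    work.task_done()
```

Throughput is the enqueue/dequeue rate over time; the `latency` value here is the queueing delay the text recommends monitoring.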
Thanks to Sebastian Galkin for organizing the event, Dylan Pittman for his notes and diagrams, and of course, Kyle Kingsbury for his time, energy and expertise.