I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.
It is worth noting that I am not classically trained in Distributed Systems. I am mostly self-taught via independent study and on-the-job experience. I do have a B.S. in Computer Science from Cornell, but I focused mostly on graphics and security in my specialization classes. My love of Distributed Systems, and my education in it, came once I entered industry. The moral of this story is that you don’t need formal academic training to learn Distributed Systems and excel at them.
Books on Theory & Background
- Introduction to Reliable and Secure Distributed Programming: This book is an excellent introduction to the fundamentals of distributed computing. It definitely takes an academic approach, but it is a good place to start to understand the terminology and challenges in the field.
- Replication: Theory and Practice: This book is a summary of 30 years of distributed systems research on replication up to 2007. It’s a great starter and contains all the references to the original work. Each chapter is incredibly dense, and led me down multiple paper rabbit holes.
Papers
This is by no means an exhaustive list, but these are the papers I keep coming back to, and they have significantly shaped the way I think about Distributed Systems.
- Time, Clocks, and the Ordering of Events in Distributed Systems
- Impossibility of Distributed Consensus with One Faulty Process
- Unreliable Failure Detectors for Reliable Distributed Systems
- CAP Twelve Years Later: How the Rules Have Changed
- Harvest, Yield and Scalable Tolerant Systems
- Dynamo: Amazon’s Highly Available Key-value Store
- The Chubby Lock Service for Loosely-Coupled Distributed Systems
- Fallacies of Distributed Computing
A Note on Reading Papers
I start with the Abstract. If I find it interesting, I’ll proceed to the Introduction, then the Conclusion. Only then, if I am incredibly interested in the implementation or details, will I read the whole thing. The References are also a gold mine: they cite related and foundational work. Reading papers is often a recursive process. I’ll start on one, find a concept I’m unfamiliar with or don’t understand, read the referenced paper, and so on. This often results in going down paper rabbit holes, and one time resulted in me reading a dissertation from the 1980s, but it is a great way to learn.
I also highly recommend Michael Bernstein’s blog post “Should I Read Papers?” for more on the motivations and how to read an academic paper.
Blog Posts & Talks
Below is a list of some of my favorite blog posts and talks that shaped how I think about building Distributed Systems. Most of these are old, but I keep coming back to them, and still find them relevant today.
- Notes on Distributed Systems for Young Bloods by Jeff Hodges
- Jepsen Blog Posts by Kyle Kingsbury
- Everything Will Flow: Distributed Queues & Backpressure by Zach Tellman
- Bad As I Wanna Be: Coordination and Consistency in Distributed Systems by Peter Bailis
Learning from Industry
The art of building, operating, and running distributed systems in industry is orthogonal to the theory of Distributed Systems. I truly believe that the best way to learn about Distributed Systems is to get hands-on experience working on one.
Post-mortems are another great source of information. Large tech companies, like Amazon, Netflix, Google, and Microsoft, often publish a post-mortem after a major outage. These are usually pretty dry to read, but contain some hard-learned lessons.