Resources for Getting Started with Distributed Systems

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

It is worth noting that I am not classically trained in Distributed Systems. I am mostly self-taught via independent study and on-the-job experience. I do have a B.S. in Computer Science from Cornell, but I focused mostly on graphics and security in my specialization classes. My love of Distributed Systems, and my education in it, came once I entered industry. The moral of this story is that you don’t need formal academic training to learn distributed systems and excel at them.

Books on Theory & Background

Introduction to Reliable and Secure Distributed Programming: This book is an excellent introduction to the fundamentals of distributed computing. It definitely takes an academic approach, but it is a good place to start to understand the terminology and challenges in the field.
Replication: Theory and Practice: This book is a summary of 30 years of distributed systems research on replication up to 2007. It’s a great starter and contains all the references to the original work. Each chapter is incredibly dense and led me down multiple paper rabbit holes.

Papers

This is by no means an exhaustive list, but I keep coming back to these papers, and they have significantly shaped the way I think about Distributed Systems.

Time, Clocks, and the Ordering of Events in Distributed Systems
Impossibility of Distributed Consensus with One Faulty Process
Unreliable Failure Detectors for Reliable Distributed Systems
CAP Twelve Years Later: How the Rules Have Changed
Harvest, Yield, and Scalable Tolerant Systems
Dynamo: Amazon’s Highly Available Key-value Store
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Fallacies of Distributed Computing

A Note on Reading Papers

I start with the Abstract. If I find it interesting, I’ll proceed to the Introduction, then the Conclusion. Only then, if I am incredibly interested in the implementation or details, will I read the whole thing. The References are also a gold mine, since they cite related and foundational work. Reading papers is often a recursive process: I’ll start on one, find a concept I’m unfamiliar with or don’t understand, read the referenced paper, and so on. This often results in going down paper rabbit holes, and one time resulted in me reading a dissertation from the 1980s, but it is a great way to learn.

I also highly recommend Michael Bernstein’s blog post “Should I Read Papers?” for more on the motivations and how to read an academic paper.

Blog Posts & Talks

Below is a list of some of my favorite blog posts and talks that shaped how I think about building Distributed Systems. Most of these are old, but I keep coming back to them, and still find them relevant today.

Notes on Distributed Systems for Young Bloods by Jeff Hodges
Jepsen Blog Posts by Kyle Kingsbury
Everything Will Flow: Distributed Queues & Backpressure by Zach Tellman
Bad As I Wanna Be: Coordination and Consistency in Distributed Systems by Peter Bailis

Learning from Industry

The art of building, operating, and running distributed systems in industry is orthogonal to the theory of Distributed Systems. I truly believe that the best way to learn about Distributed Systems is to get hands-on experience working on one.

In addition, post mortems are a great source of information. Large tech companies, like Amazon, Netflix, Google, and Microsoft, often publish a post mortem after a major outage. These are usually pretty dry to read, but they contain some hard-learned lessons.
