Resources for Getting Started with Distributed Systems

Posted on 13.02.2024 By caiti335

I’m often asked how to get started with Distributed Systems, so this post documents my path and some of the resources I found most helpful. It is by no means meant to be an exhaustive list.

It is worth noting that I am not classically trained in Distributed Systems. I am mostly self-taught via independent study and on-the-job experience. I do have a B.S. in Computer Science from Cornell, but I focused mostly on graphics and security in my specialization classes. My love of Distributed Systems, and my education in it, came once I entered industry. The moral of this story is that you don’t need formal academic training to learn Distributed Systems and excel at it.

Books on Theory & Background

Introduction to Reliable and Secure Distributed Programming: This book is an excellent introduction to the fundamentals of distributed computing. It definitely takes an academic approach, but it is a good place to start to understand the terminology and challenges in the field.

Replication: Theory and Practice: This book is a summary of 30 years of distributed systems research on replication up to 2007. It’s a great starter and contains all the references to the original work. Each chapter is incredibly dense, and led me down multiple paper rabbit holes.

Papers

This is by no means an exhaustive list, but these are papers I keep coming back to, and they have significantly shaped the way I think about Distributed Systems.
Time, Clocks, and the Ordering of Events in Distributed Systems
Impossibility of Distributed Consensus with One Faulty Process
Unreliable Failure Detectors for Reliable Distributed Systems
CAP Twelve Years Later: How the Rules Have Changed
Harvest, Yield, and Scalable Tolerant Systems
Dynamo: Amazon’s Highly Available Key-value Store
The Chubby Lock Service for Loosely-Coupled Distributed Systems
Fallacies of Distributed Computing

A Note on Reading Papers

I start with the Abstract. If I find it interesting, I’ll proceed to the Introduction, then the Conclusion. Only then, if I am incredibly interested in the implementation or details, will I read the whole thing. The References are also a gold mine: they cite related and foundational work.

Oftentimes reading papers is a recursive process. I’ll start on one paper, find a concept I’m unfamiliar with or don’t understand, read the referenced paper, and so on. This often results in going down paper rabbit holes, and one time resulted in me reading a dissertation from the 1980s, but it is a great way to learn.

I also highly recommend Michael Bernstein’s blog post “Should I Read Papers?” for more on the motivations for reading academic papers and how to read one.

Blog Posts & Talks

Below is a list of some of my favorite blog posts and talks that shaped how I think about building Distributed Systems. Most of these are old, but I keep coming back to them and still find them relevant today.

Notes on Distributed Systems for Young Bloods by Jeff Hodges
Jepsen Blog Posts by Kyle Kingsbury
Everything Will Flow: Distributed Queues & Backpressure by Zach Tellman
Bad As I Wanna Be: Coordination and Consistency in Distributed Systems by Peter Bailis

Learning from Industry

The art of building, operating, and running distributed systems in industry is orthogonal to the theory of Distributed Systems.
I truly believe that the best way to learn about Distributed Systems is to get hands-on experience working on one. Post mortems are another great source of information. Large tech companies, like Amazon, Netflix, Google, and Microsoft, often publish a post mortem after a major outage. These are usually pretty dry to read, but they contain some hard-learned lessons.