Clients are Jerks: aka How Halo 4 DoSed the Services at Launch & How We Survived

At 3am PST on November 5th, 2012, I sat fidgeting at my desk at 343 Industries, watching graphs of metrics stream across my machine.  Halo 4 was officially live in New Zealand, and the number of concurrent users began to climb gradually as midnight gamers came online and began to play.  Two hours later, at 5am, Australia came online and we saw another noticeable spike in concurrent users.

With AAA video games, especially multiplayer games, week one is when you see the most concurrent users.  As with blockbuster movies, large marketing campaigns, trade shows, worldwide release dates, and press all converge to create excitement around launch.  Everyone wants to see the movie or play the game with their friends the first week it is out.  The energy around a game launch is intoxicating.  However, running the services powering that game is terrifying.  There is nothing like production data, and we were about to get a lot of it over the next few days.  To be precise, Halo 4 saw 4 million unique users in the first week, who racked up 31.4 million hours of gameplay.

At midnight PST on November 6th I stood in a parking lot outside a Microsoft Store in Seattle, surrounded by 343i team members and fans who had come out to celebrate the launch with us and get the game at midnight.  I checked in with the on-call team: Europe and the East Coast of the US had also come online smoothly.  In addition, the real-time Cheating & Banning system I had written in a month and a half before launch had already caught and banned 3 players who had modded their Xboxes in the first few hours.  I was beyond thrilled.  Everything was going according to plan, so after a few celebratory beers I headed back into the office to take over the graveyard shift and continue monitoring the services.  The next 48 hours were critical and likely when we would see our peak traffic.

As the East Coast of the United States started playing Halo after work on launch day, we hit higher and higher numbers of concurrent users.  Suddenly one of our APIs related to Cheating & Banning started hitting an abnormally high failure rate and began to affect other parts of the Statistics Service.  As the owner of the Halo 4 Statistics Service and the Cheating & Banning Service, I OK’d throwing the kill switch on the API and then began digging in.

The game was essentially DoSing us.  We were receiving 10x the number of expected requests to our service on this particular API, due to a bug in the client which reported suspicious activity for almost all online players.  The increased number of requests caused us to blow through our IOPS limit in Azure Storage, which correctly throttled and rejected our exorbitant number of operations.  Each rejected operation caused the request from the game to fail, and the game would then retry the request three times, creating a retry storm that only exacerbated the attack.
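To get a feel for how retries compound an overload like this, here is a back-of-the-envelope sketch.  It assumes that "retry three times" means up to three extra attempts after the first failure, and that every attempt fails independently with the same probability; both are simplifications for illustration, not measurements from the Halo 4 services, and the function names are made up.

```python
def expected_attempts(failure_prob: float, max_retries: int = 3) -> float:
    """Average number of attempts a client makes per logical request.

    Attempt k (0-based) only happens if the previous k attempts all failed,
    which has probability failure_prob ** k under the independence assumption.
    """
    return sum(failure_prob ** k for k in range(max_retries + 1))


def load_multiplier(request_multiplier: float, failure_prob: float,
                    max_retries: int = 3) -> float:
    """Offered load relative to the expected load.

    request_multiplier is how many times more logical requests arrive than
    planned (10 in the story above); retries are multiplied on top of that.
    """
    return request_multiplier * expected_attempts(failure_prob, max_retries)


if __name__ == "__main__":
    for p in (0.0, 0.5, 1.0):
        print(f"failure_prob={p}: {load_multiplier(10, p)}x expected load")
```

Under these assumptions, once storage is rejecting essentially everything, each logical request turns into four attempts, so 10x the expected requests becomes roughly 40x the expected load.  The worse the service gets, the harder the clients push.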

Game over, right?  Wrong.  Halo 4 had no major outages during launch week, the time when games are notorious for having outages.  The Halo 4 Services survived because they were architected for maximum availability and graceful degradation.  The core APIs and components of the Halo Services necessary to play the game were explicitly called out, and extra measures were taken to protect them.  We had a game plan to survive launch, which involved sacrificing everything that was not one of those core components if necessary.  Our team took full ownership of our core services’ availability: we did not just anticipate failure, we expected it.  We backed up our backups for statistics data, requiring multiple separate storage services to fail before data loss would occur; built in kill switches for non-essential features; and had a healthy distrust of our clients.

The kill switch I mentioned earlier saved the services from the onslaught of requests made by the game.  We had built a dynamically configurable switch into our routing layer, which could be tuned per API.  Throwing the kill switch essentially re-routed traffic to a dummy handler, which returned a 200 and either dropped the data on the floor or logged it to a storage account for later analysis.  This stopped the retry storm, stabilized the service, and alleviated the pressure on the storage accounts used for Cheating & Banning.  In addition, the Cheating & Banning service continued to function correctly because we had more reliable data coming in via game events on a different API.
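A per-API kill switch of this kind can be sketched in a few lines.  This is not the Halo 4 routing layer; the class, method, and API names below are hypothetical, and an in-memory list stands in for the storage account that received the logged data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Response = Tuple[int, str]            # (HTTP status, body)
Handler = Callable[[dict], Response]


@dataclass
class KillSwitchRouter:
    """Hypothetical routing layer with a dynamically configurable switch per API."""
    handlers: Dict[str, Handler] = field(default_factory=dict)
    # API name -> "drop" or "log"; an absent key means the switch is off.
    switches: Dict[str, str] = field(default_factory=dict)
    # Stand-in for the storage account used for later analysis.
    dead_letter: List[Tuple[str, dict]] = field(default_factory=list)

    def register(self, api: str, handler: Handler) -> None:
        self.handlers[api] = handler

    def throw_switch(self, api: str, mode: str = "drop") -> None:
        if mode not in ("drop", "log"):
            raise ValueError(f"unknown kill switch mode: {mode!r}")
        self.switches[api] = mode

    def reset_switch(self, api: str) -> None:
        self.switches.pop(api, None)

    def route(self, api: str, payload: dict) -> Response:
        mode: Optional[str] = self.switches.get(api)
        if mode is not None:
            # Dummy handler: never touch the real backend, always claim success.
            if mode == "log":
                self.dead_letter.append((api, payload))
            return 200, "OK"
        handler = self.handlers.get(api)
        if handler is None:
            return 404, "unknown API"
        return handler(payload)


if __name__ == "__main__":
    router = KillSwitchRouter()
    # Simulate a backend that is being throttled by storage.
    router.register("report_suspicious", lambda payload: (503, "storage throttled"))
    print(router.route("report_suspicious", {"player": 1}))
    router.throw_switch("report_suspicious", mode="log")
    print(router.route("report_suspicious", {"player": 2}))
```

The important design choice is that the switch lives in front of the real handler and can be flipped at runtime, so a misbehaving API can be taken out of the request path without a deploy.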

The game clients were being jerks (a bug in the code caused an increase in requests), so I had no qualms about lying to them (sending back an HTTP 200 and then promptly dropping the data on the floor), especially since this API was not one of the critical components for playing Halo 4.  In fact, had we not built in the ability to lie to the clients, we most certainly would have had an outage at launch.

But the truth is that the game devs I worked closely with over countless tireless hours leading up to launch weren’t jerks, and they weren’t incompetent.  In fact, they were some of the best in the industry.  We all wanted a successful launch, so how did our own in-house client end up DoSing the services?  The answer is priorities.  The client developers for Halo 4 had a very different set of priorities: gameplay, graphics, and peer-to-peer networking were at the forefront of their minds and resource allocations, not how many requests per second they were sending to the services.

Client priorities are often very different from those of the services they consume, even for in-house clients.  This is true for games, websites, mobile apps, etc.  In fact, it is not limited to pure clients; it is even true for microservices communicating with one another.  These priority differences manifest in a multitude of ways: sending too much data on a request, sending too many requests, asking for too much data or for an expensive query to be run, etc.  The list goes on and on, because the developers consuming your service are often focused on a totally different problem and not on your failure modes and edge cases.  In fact, one of the major benefits of SOA and microservices is that they abstract away the details of a service’s execution to reduce the complexity one developer has to think about at any given time.

Bad client behavior happens all over the place, not just in games.  Astrid Atkinson just said in her Velocity Conf talk, “Google’s biggest DoS attacks always comes from ourselves.”  In addition, I’m currently working on fixing a service at Twitter which completely trusts internal clients, allowing them to make exorbitant requests.  These requests result in the service failing and a developer getting paged with no means of remediating the problem, and they were the inspiration for finally writing this post.  Misbehaving clients are common in all stacks, and are not the bug.  The bug is the implicit assumption that because the clients are internal they will use the API in the way it was designed to be used.

Implicit assumptions are the killer of any Distributed System.

Truly robust, reliable services must plan for bad client behavior and explicitly enforce assumptions.  Implicitly assuming that your clients will “do the right thing” makes your services vulnerable.  Instead, explicitly set limits and enforce them, either manually via alerting, monitoring, and operational runbooks, or automatically via backpressure and flow control.  The Halo 4 launch was successful because we did not implicitly trust our clients; instead, we assumed they were jerks.
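One common way to enforce an explicit limit automatically is a per-client token bucket.  The sketch below is a generic illustration of that technique, not something the Halo 4 or Twitter services are known to have used; the class names, rates, and client IDs are all made up, and the clock is injectable only so the behavior can be checked deterministically.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows `burst` requests at once, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerClientLimiter:
    """Keeps one bucket per client so a single noisy client cannot starve the rest."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, client_id: str) -> int:
        """Return the HTTP status to answer with: 200 to proceed, 429 when over the limit."""
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[client_id] = bucket
        return 200 if bucket.allow() else 429


if __name__ == "__main__":
    limiter = PerClientLimiter(rate_per_sec=1.0, burst=2.0)
    print([limiter.check("game-client") for _ in range(3)])
```

Whether to answer an over-limit client with an honest 429 or with a dishonest 200 depends on how that client reacts to failure.  A client that blindly retries on any error, like the one in this story, will turn every rejection into more load.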

Many thanks to Ines Sombra for reviewing early drafts.

You should follow me on Twitter here