Clients are Jerks: aka How Halo 4 DoSed the Services at Launch & How We Survived

At 3am PST on November 5th, 2012, I sat fidgeting at my desk at 343 Industries watching graphs of metrics stream across my machine. Halo 4 was officially live in New Zealand, and the number of concurrent users gradually increased as midnight gamers came online and began to play.  Two hours later, at 5am, Australia came online and we saw another noticeable spike in concurrent users.

With AAA video games, especially multiplayer games, week one is when you see the most concurrent users.  As with blockbuster movies, large marketing campaigns, trade shows, worldwide release dates, and press coverage all converge to create excitement around launch.  Everyone wants to see the movie or play the game with their friends the first week it is out.  The energy around a game launch is intoxicating.  However, running the services powering that game is terrifying.  There is nothing like production data, and we were about to get a lot of it over the next few days.  To be precise, Halo 4 saw 4 million unique users in the first week, who racked up 31.4 million hours of gameplay.

At midnight PST on November 6th I stood in a parking lot outside a Microsoft Store in Seattle, surrounded by 343i team members and fans who had come out to celebrate the launch with us and get the game at midnight.  I checked in with the on-call team: Europe and the East Coast of the US had also come online smoothly.  In addition, the real-time Cheating & Banning system I had written in the month and a half before launch had already caught and banned three players who had modded their Xboxes within the first few hours; I was beyond thrilled.  Everything was going according to plan, so after a few celebratory beers I headed back into the office to take over the graveyard shift and continue monitoring the services.  The next 48 hours were critical and likely when we would see our peak traffic.

As the East Coast of the United States started playing Halo after work on launch day, we hit higher and higher numbers of concurrent users.  Suddenly one of our APIs related to Cheating & Banning was hitting an abnormally high failure rate and starting to affect other parts of the Statistics Service.  As the owner of the Halo 4 Statistics Service and the Cheating & Banning Service, I OK'd throwing the kill switch on the API and then began digging in.

The game was essentially DoSing us.  We were receiving 10x the expected number of requests to this particular API, due to a bug in the client which reported suspicious activity for almost all online players.  The increased number of requests caused us to blow through our IOPS limit in Azure Storage, which correctly throttled and rejected our exorbitant number of operations.  That caused the requests from the game to fail, and the game would then retry each failed request three times, creating a retry storm that only exacerbated the attack.
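To make the amplification concrete, here is a minimal C# sketch of the pattern. The reporter type and endpoint are hypothetical and this is not the actual Halo 4 client code; it simply illustrates how an unconditional retry loop multiplies load on a service that is already throttling, and what a capped, jittered backoff looks like by comparison.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class SuspiciousActivityReporter
{
    private static readonly HttpClient Client = new HttpClient();

    // The launch-day behavior in spirit: every failed call is immediately
    // retried, so a throttled backend sees up to 4x the traffic it is
    // already failing to serve.
    public static async Task<bool> ReportNaivelyAsync(string payload)
    {
        for (var attempt = 0; attempt < 4; attempt++) // original call + 3 retries
        {
            var response = await Client.PostAsync(
                "https://example.invalid/suspicious-activity", // hypothetical endpoint
                new StringContent(payload));

            if (response.IsSuccessStatusCode) return true;
            // No backoff, no jitter, no circuit breaker.
        }
        return false;
    }

    // A friendlier client caps retries, backs off exponentially, and adds
    // jitter so a fleet of consoles does not hammer the service in lock-step.
    public static async Task<bool> ReportWithBackoffAsync(string payload)
    {
        var rng = new Random();
        for (var attempt = 0; attempt < 4; attempt++)
        {
            var response = await Client.PostAsync(
                "https://example.invalid/suspicious-activity",
                new StringContent(payload));

            if (response.IsSuccessStatusCode) return true;

            var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt))
                        + TimeSpan.FromMilliseconds(rng.Next(0, 1000));
            await Task.Delay(delay);
        }
        return false;
    }
}
```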

Game over, right?  Wrong.  Halo 4 had no major outages during launch week, the period notorious for games to have outages.  The Halo 4 Services survived because they were architected for maximum availability and graceful degradation.  The core APIs and components of the Halo Services necessary to play the game were explicitly called out, and extra measures were taken to protect them.  We had a game plan to survive launch, which involved sacrificing everything that was not one of those core components if necessary.  Our team took full ownership of our core service’s availability; we did not just anticipate failure, we expected it.  We backed up our backups for statistics data, requiring multiple separate storage services to fail before data loss could occur; built in kill switches for non-essential features; and maintained a healthy distrust of our clients.

The kill switch I mentioned earlier saved the services from the onslaught of requests made by the game.  We had built a dynamically configurable switch into our routing layer, which could be tuned per API.  By throwing the kill switch, we essentially re-routed traffic to a dummy handler which returned a 200 and either dropped the data on the floor or logged it to a storage account for later analysis.  This stopped the retry storm, stabilized the service, and alleviated the pressure on the storage accounts used for Cheating & Banning.  In addition, the Cheating & Banning service continued to function correctly because we had more reliable data coming in via game events on a different API.
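Here is a minimal sketch of that idea, assuming a hypothetical routing class and a dynamic configuration source that refreshes the per-API mode out-of-band; it is an illustration, not the actual Halo 4 routing layer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public enum ApiMode { Normal, DropAndLog, Drop }

public class KillSwitchRouter
{
    // Refreshed from dynamic configuration, so flipping a switch for a
    // single API requires no code deployment.
    private readonly ConcurrentDictionary<string, ApiMode> _modes =
        new ConcurrentDictionary<string, ApiMode>();

    public void SetMode(string api, ApiMode mode) => _modes[api] = mode;

    public async Task<int> RouteAsync(
        string api,
        string body,
        Func<string, Task<int>> realHandler,  // the normal API handler
        Func<string, Task> archiveForLater)   // e.g. append to cheap blob storage
    {
        if (!_modes.TryGetValue(api, out var mode)) mode = ApiMode.Normal;

        switch (mode)
        {
            case ApiMode.DropAndLog:
                await archiveForLater(body);  // keep the data for later analysis
                return 200;                   // tell the client all is well so it stops retrying
            case ApiMode.Drop:
                return 200;                   // drop the data on the floor
            default:
                return await realHandler(body);
        }
    }
}
```

Returning a 200 from the dummy handler is the key design choice: it is what breaks the retry loop on the client side while the real backend is left alone.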

The game clients were being jerks (a bug in the code caused an increase in requests), so I had no qualms about lying to them (sending back an HTTP 200 and then promptly dropping the data on the floor), especially since this API was not one of the critical components for playing Halo 4.  In fact, had we not built in the ability to lie to the clients, we most certainly would have had an outage at launch.

But the truth is the game devs I worked closely with over countless tireless hours leading up to launch weren’t jerks, and they weren’t incompetent.  In fact, they were some of the best in the industry.  We all wanted a successful launch, so how did our own in-house client end up DoSing the services? The answer is priorities.  The client developers for Halo 4 had a much different set of priorities: gameplay, graphics, and peer-to-peer networking were at the forefront of their minds and resource allocations, not how many requests per second they were sending to the services.

Client priorities are often very different from those of the services they consume, even for in-house clients.  This is true for games, websites, mobile apps, and so on.  In fact, it is not limited to pure clients; it is even true for microservices communicating with one another.  These priority differences manifest in a multitude of ways: sending too much data on a request, sending too many requests, asking for too much data or for an expensive query to be run, and so on. The list goes on and on, because the developers consuming your service are often focused on a totally different problem, not on your failure modes and edge cases.  In fact, one of the major benefits of SOA and microservices is to abstract away the details of a service’s execution to reduce the complexity one developer has to think about at any given time.

Bad client behavior happens all over the place, not just in games. As Astrid Atkinson said in her recent Velocity Conf talk, “Google’s biggest DoS attacks always comes from ourselves.”  In addition, I’m currently working on fixing a service at Twitter which completely trusts internal clients, allowing them to make exorbitant requests.  These requests result in the service failing, a developer getting paged with no means of remediating the problem, and the inspiration for finally writing this post.  Misbehaving clients are common in all stacks, and they are not the bug.  The bug is the implicit assumption that because the clients are internal they will use the API in the way it was designed to be used.

Implicit assumptions are the killer of any Distributed System.

Truly robust, reliable services must plan for bad client behavior and explicitly enforce assumptions.  Implicitly assuming that your clients will “do the right thing” makes your services vulnerable.  Instead, explicitly set limits and enforce them, either manually via alerting, monitoring, and operational runbooks, or automatically via backpressure and flow control.  The Halo 4 launch was successful because we did not implicitly trust our clients; instead, we assumed they were jerks.
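For example, here is a minimal sketch of one way to enforce an explicit per-client limit in code rather than trusting callers to behave; the class name, quota numbers, and client IDs are illustrative and not taken from the Halo services.

```csharp
using System;
using System.Collections.Concurrent;

public class TokenBucketLimiter
{
    private class Bucket { public double Tokens; public DateTime LastRefill; }

    private readonly ConcurrentDictionary<string, Bucket> _buckets =
        new ConcurrentDictionary<string, Bucket>();
    private readonly double _capacity;
    private readonly double _refillPerSecond;

    public TokenBucketLimiter(double capacity, double refillPerSecond)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
    }

    // Returns false when the caller has exceeded its quota; the service should
    // shed that request (e.g. HTTP 429) instead of falling over trying to serve it.
    public bool TryAcquire(string clientId)
    {
        var bucket = _buckets.GetOrAdd(clientId,
            _ => new Bucket { Tokens = _capacity, LastRefill = DateTime.UtcNow });

        lock (bucket)
        {
            var now = DateTime.UtcNow;
            var elapsed = (now - bucket.LastRefill).TotalSeconds;
            bucket.Tokens = Math.Min(_capacity, bucket.Tokens + elapsed * _refillPerSecond);
            bucket.LastRefill = now;

            if (bucket.Tokens < 1) return false;
            bucket.Tokens -= 1;
            return true;
        }
    }
}
```

A routing layer could call TryAcquire on every request and reject the ones that fail the check, which is backpressure the client cannot opt out of.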

Much thanks to Ines Sombra for reviewing early drafts

You should follow me on Twitter here

Orleans Preview & Halo 4


On Wednesday at Build 2014, Microsoft announced the preview release of Orleans.  Orleans is a runtime and programming model for building distributed systems, based on the actor model.  It was created by the eXtreme Computing Group inside Microsoft Research, and was first deployed into production by 343 Industries (my team!) as a core component of the Halo Services built in Azure.

I am beyond excited that Orleans is now available for the rest of the development community to play with.  While I am no longer at 343 Industries and Microsoft, I still think it is one of the coolest pieces of tech I have used to date.  In addition, getting Orleans and Halo into production was truly a labor of love, and a collaborative effort between the Orleans team and the Halo Services team.

In the summer of 2011, the services team at 343 Industries began partnering with the Orleans team in Microsoft Research to design and implement our new services.  We worked side by side with the eXtreme Computing Group, spending afternoons pair programming, providing feedback on the programming model, and identifying what other features we needed to ship the Halo 4 Services.  Working with the eXtreme Computing Group was an amazing experience: they are brilliant developers who were great to work with, always open to feedback and super helpful with bug fixes and new feature requests.

Orleans was the perfect solution for the user-centric nature of the Halo Services.  Because we required high throughput and low latency, we needed stateful services.  The location transparency of actors provided by Orleans and the asynchronous “single-threaded” programming model made developing scalable, reliable, and fault-tolerant services easy.  Developers working on features only had to concentrate on the feature code, not message passing, fault tolerance, concurrency issues, or distributed resource management.
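To give a flavor of the programming model, here is a toy grain written against the public Orleans API as it looks today (the preview-era API differs in details such as base-class and factory names). The grain type and the statistics it tracks are made up for illustration; this is not code from the Halo services.

```csharp
using System.Threading.Tasks;
using Orleans;

public interface IPlayerStatsGrain : IGrainWithGuidKey
{
    Task RecordMatch(int kills, int deaths);
    Task<int> GetTotalKills();
}

public class PlayerStatsGrain : Grain, IPlayerStatsGrain
{
    private int _totalKills;
    private int _totalDeaths;

    // Orleans processes one request at a time per grain, so this state can be
    // mutated without locks, and the runtime decides which silo the grain
    // lives on (location transparency).
    public Task RecordMatch(int kills, int deaths)
    {
        _totalKills += kills;
        _totalDeaths += deaths;
        return Task.CompletedTask;
    }

    public Task<int> GetTotalKills() => Task.FromResult(_totalKills);
}
```

A caller obtains a reference with grainFactory.GetGrain<IPlayerStatsGrain>(playerId) and awaits its methods as if it were a local object; the runtime handles placement, activation, and messaging.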

By the fall of 2011, a few months after our partnership began, Orleans was first deployed into production to replace the existing Halo Reach presence system and power the real-time Halo Waypoint ATLAS experience.  The new presence service, built on Orleans in Azure, had parity with the old service (presence updates every 30 seconds) plus the ability to push updates every second, providing real-time views of players in a match on a connected ATLAS second screen.

After proving out the architecture, Orleans, and Azure, the Halo team moved into full production mode, rewriting and improving upon the existing Halo Services, including Statistics Processing, Challenges, Cheating & Banning, and Title Files.

On November 6, 2012, Halo 4 was released to the world, and the new Halo Services went from a couple hundred users to hundreds of thousands of users in the span of a few hours.  So if you want to see Orleans in action, go play Halo 4 or check out Halo Waypoint; both of those experiences are powered by Orleans and Azure.

Now here is the fun part: Orleans has been opened up for preview by the .NET team.  You can go and download the SDK.  In addition, a variety of samples and documentation are available on CodePlex (I know, it’s not GitHub, sad times, but the samples are great).  I spent Wednesday night playing around with the samples and getting them up and running.

I highly recommend checking Orleans out and providing feedback to the .NET team.  Stay tuned to the .NET blog for more info, and feel free to ask me any questions you may have.  In addition, I have a few blog posts in the works to help share some of the knowledge I gained while building the Halo 4 Services using Orleans!

References

More Halo, Azure, Orleans Goodness

You should follow me on Twitter here

Flexible Security Policies

Last weekend I made a rather bold statement on Twitter.

This sparked off a conversation in 140-character installments, during which I found it difficult to fully convey my point. This is my attempt at clarity.

In the services world, taking dependencies on third-party services is increasingly necessary, especially as more services move into the cloud. However, regardless of third-party failures, I firmly believe that YOU own your service’s availability.

I’d like to examine a single point of failure that communication with most third-party services has: SSL certificates. Most services implement an all-or-nothing policy regarding SSL certificates. Either a certificate meets the criteria (correct host name, acceptable signature algorithm, valid date, etc.) or it does not. However, this black-and-white policy forces your service to have a single point of failure outside of your control.

Implementing a flexible policy, one that can be updated on the fly without deploying new code, reduces the damage this single point of failure can cause. Instead of having calls fail until the certificate issue has been resolved by the third party, a flexible policy gives the DevOps team the ability to weigh the security risk of accepting the bad certificate against the business cost of having downtime.

Some examples where this may be appropriate:

  • An SSL certificate has expired: If the certificate recently expired and there is no evidence that the private key has been compromised, the risk of accepting SSL traffic from this endpoint is low. The communication is still encrypted, and likely secure. Accepting the expired certificate for a few hours or days might be worthwhile to avoid service degradation.
  • An SSL certificate has a weak signature algorithm: If the certificate has recently been renewed with a weaker signature algorithm than expected, the communication is still encrypted. Accepting the weaker certificate for a few hours while a new certificate is rolled out may be acceptable to avoid downtime.

If the security policy is flexible, I may decide to accept a recently expired SSL certificate for a configurable duration of time, allowing my services to stay up. Conversely, I may analyze the security risk and decide that I cannot tolerate it for any period of time. In that case I will choose a degraded service experience and continue to reject the certificate.

In either scenario the power is in my hands. I am owning the availability of my service. I am making a conscious decision on whether to be available or not. With the standard security policy there is no option. My service’s availability is not in my control.

An ideal solution would allow the security policy to be adjusted for a single certificate. Critical- or error-level logs would still be created as long as the certificate did not meet the default security standards. In addition, the decision to accept the certificate should be revisited periodically by the DevOps and business teams until a valid SSL certificate is provided by the third-party service.  There should not be a blanket policy; instead, each certificate failure should be evaluated on a case-by-case basis.
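As a concrete illustration, here is a minimal C# sketch of such a policy using the classic .NET ServicePointManager validation hook. The per-thumbprint exception list, how it gets populated from dynamic configuration, and the logging are assumptions made for the example, not a production-ready implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public static class FlexibleCertPolicy
{
    // Thumbprint -> time until which this specific known-bad certificate is
    // tolerated. Populated from dynamic configuration after a human signs off.
    private static readonly Dictionary<string, DateTime> TemporaryExceptions =
        new Dictionary<string, DateTime>();

    public static void Install()
    {
        ServicePointManager.ServerCertificateValidationCallback = Validate;
    }

    private static bool Validate(object sender, X509Certificate certificate,
        X509Chain chain, SslPolicyErrors errors)
    {
        if (errors == SslPolicyErrors.None)
            return true; // normal case: the certificate is fine

        var cert2 = certificate as X509Certificate2;
        if (cert2 != null &&
            TemporaryExceptions.TryGetValue(cert2.Thumbprint, out var until) &&
            DateTime.UtcNow < until)
        {
            // Still log at error level so the exception stays visible and is
            // revisited until the third party ships a valid certificate.
            Console.Error.WriteLine(
                $"Accepting known-bad certificate {cert2.Thumbprint} until {until:u} ({errors})");
            return true;
        }

        return false; // default posture: reject anything unexpected
    }
}
```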

I am by no means advocating that services should blindly ignore SSL certificate failures, or that third-party services should not be held responsible when failures occur. I am instead advocating for the ability to make the decision for myself and update our security policy on the fly if needed. The goal of a flexible policy is not to blindly tolerate security risks, but to provide the ability to make trade-offs in real time: service availability vs. security risk.

You should follow me on Twitter here