2015: A Year in Review

2015 has been a whirlwind of a year, which started off in a new city, with a new job as the Tech Lead of Observability at Twitter.  The year was full of travel spanning 10 states, 3 different countries, and 2 continents.  This year I also had numerous opportunities to share my experiences with programming and distributed systems via talks, blog posts, podcasts, and articles.  Below is the recap.

Articles

Interviews & Podcasts

Programming Committees

Talks

Orleans: A Framework for Cloud Computing

Presented at Papers We Love SF: Video & Slides [February 19th 2015]

Abstract

Caitie McCaffrey stops by and talks about the Orleans: Distributed Virtual Actors for Programmability and Scalability paper by Bernstein, Bykov, Geller, Kliot, and Thelin.

Orleans is a runtime and programming model for building scalable distributed systems, based on the actor model.  The Orleans programming model introduces the abstraction of Virtual Actors.  Orleans allows applications to obtain high performance, reliability, and scalability.  This technology was developed by the eXtreme Computing Group at Microsoft Research and was a core component of the Azure Services that powered Halo 4, the award-winning video game.

Distributed Systems Track at Goto Chicago: Neha Narula, Caitie McCaffrey, Chris Meiklejohn, Kyle Kingsbury

Building the Halo 4 Services with Orleans

Presented at Qcon London: Video & Slides [March 5th 2015]

Abstract

Halo 4 is a first-person shooter on the Xbox 360, with fast-paced, competitive gameplay. To complement the code on disc, a set of services were developed to store player statistics, display player presence information, deliver daily challenges, modify playlists, catch cheaters and more. As of June 2013 Halo 4 had 11.6 million players, who played 1.5 billion games, logging 270 million hours of gameplay.

Orleans, Distributed Virtual Actors for Programmability & Scalability, is an actor framework & runtime for building high scale distributed systems. It came from the eXtreme computing group in Microsoft Research, and is now Open Source on Github.

For Halo 4, 343 Industries built and deployed a new set of services built from the ground up to support high demand, low latency, and high availability using Orleans and running in Windows Azure. This talk will give an overview of Orleans, the challenges faced when building the Halo 4 services, and why the Actor Model and Orleans in particular were utilized to solve these problems.

Architecting & Launching the Halo 4 Services

Presented as the Closing Keynote of SRECon15: Video & Slides [March 17th 2015]

Abstract

The Halo 4 services were built from the ground up to support high demand, low latency, and high availability.  In addition, video games have unique load patterns where the majority of the traffic and sales occurs within the first few weeks after launch, making this a critical time period for the game and supporting services. Halo 4 went from 0 to 1 million users on day 1, and 4 million users within the first week.

This talk will discuss the architectural challenges faced when building these services and how they were solved using Windows Azure and Project Orleans. In addition, we’ll discuss the path to production, some of the difficulties faced, and the tooling and practices that made the launch successful.

On stage during Strange Loop 2015 at the Peabody Opera House

The Saga Pattern

Presented at Craft Conf 2015 & Goto: Chicago 2015 Video & Slides [April 23rd 2015 & May 12th 2015]

Abstract

As we build larger, more complex applications and solutions that need to do collaborative processing, the traditional ACID transaction model using coordinated 2-phase commit is often no longer suitable. More frequently we have long-lived transactions or must act upon resources distributed across various locations and trust boundaries. The Saga Pattern is a useful model for long-lived activities and distributed transactions without coordination.

Sagas split work into a set of transactions whose effects can be reversed even after the work has been performed or committed. If a failure occurs, compensating transactions are performed to roll back the work. At its core the Saga is a failure management pattern, making it particularly applicable to distributed systems.

In this talk, I’ll discuss the fundamentals of the Saga Pattern, and how it can be applied to your systems. In addition we’ll discuss how the Halo 4 Services successfully made use of the Saga Pattern when processing game statistics, and how we implemented it in production.
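For readers who have not seen the talk, here is a minimal, hypothetical C# sketch of the core idea (the types and names are illustrative, not code from the talk): each unit of work is paired with a compensating action, and on failure the compensations for the already-completed steps run in reverse order.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Illustrative only: a saga step pairs forward work with the action that undoes it.
public class SagaStep
{
    public Func<Task> Execute { get; set; }
    public Func<Task> Compensate { get; set; }
}

public static class Saga
{
    // Run steps in order; if any step fails, run compensations for the
    // completed steps in reverse order, then surface the failure.
    public static async Task RunAsync(IEnumerable<SagaStep> steps)
    {
        var completed = new Stack<SagaStep>();
        try
        {
            foreach (var step in steps)
            {
                await step.Execute();
                completed.Push(step);
            }
        }
        catch
        {
            while (completed.Count > 0)
            {
                await completed.Pop().Compensate();
            }
            throw;
        }
    }
}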

Scaling Stateful Services

Presented at StrangeLoop 2015 Video & Slides [September 25th 2015]

This talk was incredibly well received, and I was flattered to see write-ups of it featured in High Scalability and InfoQ.

Abstract

The Stateless Service design principle has become ubiquitous in the tech industry for creating horizontally scalable services. However, our applications do have state; we have just moved all of it to caches and databases. Today, as applications are becoming more data intensive and request latencies are expected to be incredibly low, we’d like the benefits of stateful services, like data locality and sticky consistency. In this talk I will address the benefits of stateful services, how to build them so that they scale, and discuss projects from Halo and Twitter of highly distributed and scalable services that implement these techniques successfully.

Ines Sombra & Caitie McCaffrey’s Evening Keynote  at QconSF

On the Order of Billions

Presented at Twitter Flight: Video & Slides [October 21st 2015]

Abstract

Every minute Twitter’s Observability stack processes 2+ billion metrics in order to provide visibility into Twitter’s distributed microservices architecture. This talk will focus on some of the challenges associated with building and running this large-scale distributed system. We will also cover lessons learned and how to build services that scale, which are applicable to services of any size.

So We Hear You Like Papers

Presented as the Evening Keynote at QconSF with Ines Sombra: Video, Slides, Resources, & Moment [November 16th 2015]

Abstract

Surprisingly enough, academic papers can be interesting and very relevant to the work we do as computer science practitioners. Papers come in many kinds and areas of focus, and sometimes finding the right one can be difficult. But when you do, it can radically change your perspective and introduce you to new ideas.

Distributed Systems has been an active area of research since the 1960s, and many of the problems we face today in our industry have already had solutions proposed, and have inspired new research. Join us for a guided tour of papers from past and present research that have reshaped the way we think about building large scale distributed systems.

Clients are Jerks: aka How Halo 4 DoSed the Services at Launch & How We Survived

At 3am PST on November 5th 2012 I sat fidgeting at my desk at 343 Industries watching graphs of metrics stream across my machine.  Halo 4 was officially live in New Zealand, and the number of concurrent users began to gradually increase as midnight gamers came online and began to play.  Two hours later, at 5am, Australia came online and we saw another noticeable spike in concurrent users.

With AAA video games, especially multiplayer games, week one is when you see the most concurrent users.  As with blockbuster movies, large marketing campaigns, trade shows, worldwide release dates, and press all converge to create excitement around launch.  Everyone wants to see the movie or play the game with their friends the first week it is out.  The energy around a game launch is intoxicating.  However, running the services powering that game is terrifying.  There is nothing like production data, and we were about to get a lot of it over the next few days.  To be precise, Halo 4 saw 4 million unique users in the first week, who racked up 31.4 million hours of gameplay.

At midnight on November 6th PST I stood in a parking lot outside of a Microsoft Store in Seattle, surrounded by 343i team members and fans who came out to celebrate the launch with us and get the game at midnight.  I checked in with the on-call team: Europe and the East Coast of the US had also come online smoothly.  In addition, the real-time Cheating & Banning system I had written in the month and a half before launch had already caught and banned 3 players who had modded their Xboxes in the first few hours; I was beyond thrilled.  Everything was going according to plan, so after a few celebratory beers I headed back into the office to take over the graveyard shift and continue monitoring the services.  The next 48 hours were critical and likely when we would see our peak traffic.

As the East Coast of the United States started playing Halo after work on launch day, we hit higher and higher numbers of concurrent users.  Suddenly one of our APIs related to Cheating & Banning was hitting an abnormally high failure rate and starting to affect other parts of the Statistics Service.  As the owner of the Halo 4 Statistics Service and the Cheating & Banning Service, I OK’d throwing the kill switch on the API and then began digging in.

The game was essentially DoSing us.  We were receiving 10x the expected number of requests to this particular API, due to a bug in the client which reported suspicious activity for almost all online players.  The increased number of requests caused us to blow through our IOPS limit in Azure Storage, which correctly throttled and rejected our exorbitant number of operations.  This caused the requests from the game to fail; the game would then retry each request three times, creating a retry storm and only exacerbating the attack.

Game over, right?  Wrong.  Halo 4 had no major outages during launch week, the time notorious for games to have outages.  The Halo 4 services survived because they were architected for maximum availability and graceful degradation.  The core APIs and components of the Halo Services necessary to play the game were explicitly called out, and extra measures were taken to protect them.  We had a game plan to survive launch, which involved sacrificing everything that was not one of those core components if necessary.  Our team took full ownership of our core services’ availability; we did not just anticipate failure, we expected it.  We backed up our backups for statistics data, requiring multiple separate storage services to fail before data loss would occur, built in kill switches for non-essential features, and had a healthy distrust of our clients.

The kill switch I mentioned earlier saved the services from the onslaught of requests made by the game.  We had built a dynamically configurable switch into our routing layer, which could be tuned per API.  By throwing the kill switch, we essentially re-routed traffic to a dummy handler which returned a 200 and either dropped the data on the floor or logged it to a storage account for later analysis.  This stopped the retry storm, stabilized the service, and alleviated the pressure on the storage accounts used for Cheating & Banning.  In addition, the Cheating & Banning service continued to function correctly because we had more reliable data coming in via game events on a different API.
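For illustration, here is a hedged C# sketch (using an ASP.NET Web API message handler) of what a per-API kill switch can look like; the IKillSwitchConfig type is hypothetical and this is not the actual Halo routing layer:

using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical: a config source that can be updated at runtime without a redeploy.
public interface IKillSwitchConfig
{
    bool IsKilled(string apiPath);
}

public class KillSwitchHandler : DelegatingHandler
{
    private readonly IKillSwitchConfig _config;

    public KillSwitchHandler(IKillSwitchConfig config)
    {
        _config = config;
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (_config.IsKilled(request.RequestUri.AbsolutePath))
        {
            // Lie to the client: acknowledge the request and drop the payload
            // (or archive it to blob storage here for later analysis).
            return new HttpResponseMessage(HttpStatusCode.OK);
        }

        return await base.SendAsync(request, cancellationToken);
    }
}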

The game clients were being jerks (a bug in the code caused an increase in requests), so I had no qualms about lying to them (sending back an HTTP 200 and then promptly dropping the data on the floor), especially since this API was not one of the critical components for playing Halo 4.  In fact, had we not built in the ability to lie to the clients, we most certainly would have had an outage at launch.

But the truth is the game devs I worked closely with over countless tireless hours leading up to launch weren’t jerks, and they weren’t incompetent.  In fact they were some of the best in the industry.  We all wanted a successful launch, so how did our own in-house client end up DoSing the services? The answer is priorities.  The client developers for Halo 4 had a much different set of priorities: gameplay, graphics, and peer-to-peer networking were at the forefront of their minds and resource allocations, not how many requests per second they were sending to the services.

Client priorities are often very different from those of the services they consume, even for in-house clients.  This is true for games, websites, mobile apps, and more.  In fact it is not only limited to pure clients; it is even true for microservices communicating with one another.  These priority differences manifest in a multitude of ways: sending too much data on a request, sending too many requests, asking for too much data or for an expensive query to be run, and so on.  The list goes on, because the developers consuming your service are often focused on a totally different problem and not on your failure modes and edge cases.  In fact, one of the major benefits of SOA and microservices is to abstract away the details of a service’s execution in order to reduce the complexity one developer has to think about at any given time.

Bad client behavior happens all over the place, not just in games. Astrid Atkinson just said in her Velocity Conf talk, “Google’s biggest DoS attacks always comes from ourselves.”  In addition, I’m currently working on fixing a service at Twitter which is completely trusting of internal clients, allowing them to make exorbitant requests.  These requests result in the service failing, a developer getting paged with no means of remediating the problem, and the inspiration for finally writing this post.  Misbehaving clients are common in all stacks, and they are not the bug.  The bug is the implicit assumption that because the clients are internal they will use the API in the way it was designed to be used.

Implicit assumptions are the killer of any Distributed System.

Truly robust, reliable services must plan for bad client behavior and explicitly enforce their assumptions.  Implicitly assuming that your clients will “do the right thing” makes your services vulnerable.  Instead, explicitly set limits and enforce them, either manually via alerting, monitoring, and operational runbooks, or automatically via backpressure and flow control.  The Halo 4 launch was successful because we did not implicitly trust our clients; instead we assumed they were jerks.
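As one hedged illustration of “explicitly set limits and enforce them” (the names and numbers below are made up, not taken from the Halo or Twitter services), a service can cap request size and per-client request rate rather than assuming clients will stay reasonable:

using System;
using System.Collections.Concurrent;

public class ClientLimits
{
    public long MaxBodyBytes { get; set; } = 64 * 1024;    // illustrative defaults
    public int MaxRequestsPerMinute { get; set; } = 600;
}

public class LimitEnforcer
{
    private class Window { public DateTime Start; public int Count; }

    private readonly ClientLimits _limits;
    private readonly ConcurrentDictionary<string, Window> _windows =
        new ConcurrentDictionary<string, Window>();

    public LimitEnforcer(ClientLimits limits)
    {
        _limits = limits;
    }

    // Returns false when the request should be rejected (e.g. with a 413 or 429).
    public bool Allow(string clientId, long bodyBytes)
    {
        if (bodyBytes > _limits.MaxBodyBytes)
            return false;

        var window = _windows.GetOrAdd(clientId, _ => new Window { Start = DateTime.UtcNow });
        lock (window)
        {
            if (DateTime.UtcNow - window.Start >= TimeSpan.FromMinutes(1))
            {
                // Start a new fixed window.
                window.Start = DateTime.UtcNow;
                window.Count = 0;
            }
            window.Count++;
            return window.Count <= _limits.MaxRequestsPerMinute;
        }
    }
}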

Much thanks to Ines Sombra for reviewing early drafts

You should follow me on Twitter here

Clarifying Orleans Messaging Guarantees

There has been some confusion around Orleans messaging guarantees that I wanted to take a second to clarify.  In past talks on Halo 4 and Orleans I mistakenly mentioned that Orleans supports At Least Once messaging guarantees.  However, this is not the default mode.  By default Orleans delivers messages At Most Once.

It’s also worth pointing out that the paper Orleans: Distributed Virtual Actors for Programmability and Scalability says, in section 3.10, “Orleans provides at-least-once message delivery, by resending messages that were not acknowledged after a configurable timeout,” which describes the non-default, configurable behavior.  This, along with some of my talks, has led to some of the confusion.

In Orleans, when messages are sent between grains, the default message passing is request/response.  If a message is acknowledged with a response, it is guaranteed to have been delivered.  Internally, Orleans does best-effort delivery.  In doing so it may retry certain internal operations; however, this does not change the application-level messaging guarantee, which remains At Most Once.  This is similar to TCP: TCP may retry internally, but the application code using the protocol will receive the message once or zero times.

Orleans can be configured to do automatic retries upon timeout, up to a maximum number of retries.  In order to get At Least Once messaging you would need to implement infinite retries.  Enabling retries is not the recommended configuration, since in some failure scenarios it can create a storm of failed retries in the system.  It is recommended that application-level logic handle retries when necessary.
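A hedged sketch of what application-level retries might look like around a grain call (assuming the call surfaces a TimeoutException on timeout and that the grain operation is idempotent, so a duplicate delivery is harmless; the helper below is my own, not an Orleans API):

using System;
using System.Threading.Tasks;

public static class GrainCallRetry
{
    // Retry a grain call a bounded number of times with a small backoff.
    public static async Task<T> CallWithRetriesAsync<T>(Func<Task<T>> grainCall, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await grainCall();
            }
            catch (TimeoutException) when (attempt < maxAttempts)
            {
                // Back off before retrying so we do not contribute to a retry storm.
                await Task.Delay(TimeSpan.FromMilliseconds(100 * attempt));
            }
        }
    }
}

// Usage (hypothetical idempotent grain method):
// var result = await GrainCallRetry.CallWithRetriesAsync(() => statsGrain.ProcessGame(gameData));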

In the Halo Services we ran Orleans in the default mode, At Most Once message delivery.  This guarantee was sufficient for some services, like the Halo Presence Service.  However, the Halo Statistics Service needed to process every message to guarantee that player data was correct. So in addition to using Orleans to process the data, we utilized Azure Service Bus to durably store statistics and enable retries, ensuring that all statistics data was processed.  The Orleans grains processing player stats were designed to receive messages at least once, leading us to design idempotent operations for updating player statistics.

I hope this helps clarify Orleans messaging guarantees.  This has also been documented on the Orleans Github Wiki.

You should follow me on Twitter here

Creating RESTful Services using Orleans

After the announcement of the Orleans preview, there was a lot of discussion on Twitter.  One comment in particular caught my eye.

I think this is a bit of a misunderstanding of how Orleans can and should be used in production services, so this blog post is an attempt to clarify and demonstrate how to build RESTful, loosely coupled services using Orleans.

Orleans Programming Model

Orleans is a runtime and programming model for building distributed systems, based on the actor model.  In the programming model there are a few key terms.

  • Grains – The Orleans term for an actor.  These are the building blocks of Orleans based services.  Every actor has a unique identity and encapsulates behavior and mutable state.  Grains are isolated from one another and can only communicate via messages.  As a developer this is the level you write your code at.
  • Silos – Every machine Orleans manages is a Silo.  A Silo contains grains, and houses the Orleans runtime which performs operations like grain instantiation and look-up.
  • Orleans Clients – clients are non-silo code which makes calls to Orleans Grains.  We’ll get back to where this should live in your architecture later.

In order to create Grains, developers write code in two libraries: GrainInterfaces.dll and Grains.dll.  The GrainInterfaces library defines a strongly-typed interface for each grain.  The method names and properties must all be asynchronous, and these define what types of messages can be passed in the system. All Grain Interfaces must inherit from Orleans.IGrain.

/// <summary>
/// Orleans grain communication interface IHello
/// </summary>
public interface IHello : Orleans.IGrain
{
    Task<string> SayHello();

    Task<string> SayGoodbye();
}

The implementation of the Grains should be defined in a separate Grains library. All Grain implementations should implement their corresponding Grain Interface and inherit from Orleans.GrainBase.

/// <summary>
/// Orleans grain implementation class HelloGrain.
/// </summary>
public class HelloGrain : Orleans.GrainBase, HelloWorldInterfaces.IHello
{
    Task<string> HelloWorldInterfaces.IHello.SayHello()
    {
        return Task.FromResult(" I say: Hello! " + DateTime.UtcNow.ToLongDateString());
    }

    Task<string> HelloWorldInterfaces.IHello.SayGoodbye()
    {
        return Task.FromResult("I say: Goodbye! " + DateTime.UtcNow.ToLongDateString());
    }
}

At compile time, code is generated in the GrainInterfaces dll to implement the code needed by the Silos to perform message passing, grain look-up, etc.  By default this code will be under GrainInterfaces/Properties/orleans.codegen.cs.  There are a lot of interesting things happening in this file; I recommend taking a look if you want to understand the guts of Orleans a bit more.  Below I’ve pulled out snippets of the generated code.

Every Grain Interface defined in the library will have a corresponding Factory class and GrainReference class generated.  The Factory class contains GetGrain methods.  These methods take in the unique grain identifier and create a GrainReference.  If you look below you will see that the generated HelloReference has corresponding SayHello and SayGoodbye methods with the same method signatures as the interface.

public class HelloFactory
{
    public static IHello GetGrain(long primaryKey)
    {
        return Cast(GrainFactoryBase.MakeGrainReferenceInternal(typeof(IHello), 1163075867, primaryKey));
    }

    public static IHello Cast(IAddressable grainRef)
    {

        return HelloReference.Cast(grainRef);
    }

    [System.SerializableAttribute()]
    [Orleans.GrainReferenceAttribute("HelloWorldInterfaces.IHello")]
    internal class HelloReference : Orleans.GrainReference, IHello, Orleans.IAddressable
    {
        public static IHello Cast(IAddressable grainRef)
        {

            return (IHello) GrainReference.CastInternal(typeof(IHello), (GrainReference gr) => { return new HelloReference(gr);}, grainRef, 1163075867);
        }

        protected internal HelloReference(GrainReference reference) :
                    base(reference)
        {
        }

        public System.Threading.Tasks.Task<string> SayHello()
        {
            return base.InvokeMethodAsync<System.String>(-1732333552, new object[] {}, TimeSpan.Zero );
        }

        public System.Threading.Tasks.Task<string> SayGoodbye()
        {
            return base.InvokeMethodAsync<System.String>(-2042227800, new object[] {}, TimeSpan.Zero );
        }
    }
}

In an Orleans Client you would send a message to the HelloGrain using the following code.

IHello grainRef = HelloFactory.GetGrain(0);
string msg = await grainRef.SayHello();

So at this point, if you are thinking this looks like RPC, you are right. Orleans Clients and Orleans Grains communicate with one another via Remote Procedure Calls that are defined in the GrainInterfaces. Messages are passed via TCP connections between Orleans Clients and Grains. Grain-to-Grain calls are also sent over a TCP connection if the grains are on different machines.  This is really performant and provides a nice programming model.  As a developer you just invoke a method; you don’t care where the code actually executes, which is one of the benefits of Location Transparency.

OK, stay with me.  Deep breaths.  Project Orleans is not trying to re-create WCF with hard-coded data contracts and tight coupling between services and clients. Personally I hate tight coupling; ask me about BLFs, the wire struct in the original Halo games, if you want to hear an entertaining story.  But I digress…

RESTful Service Architectures

Orleans is a really powerful tool for implementing the middle tier of a traditional 3-tiered architecture: the Front-End, which is an Orleans Client; the Silos running your Grains and performing application-level logic; and your Persistent Storage.

On the Front-End you can define a set of RESTful APIs (or whatever other protocol you want, for that matter), which then route incoming calls to Orleans Grains to handle application-specific logic, using the Factory methods generated in the GrainInterfaces dll.  In addition, the Front-End can serialize/deserialize messages into the loosely coupled wire-level format of your choosing (JSON, Protocol Buffers, Avro, etc.).

Diagram: RESTful Orleans service architecture

By structuring your services this way, you completely encapsulate the dependency on Orleans within the service itself, while presenting a RESTful API with a loosely coupled wire format.  This way clients can happily communicate with your service without fear of tight coupling or RPC.

The code below uses ASP.NET Web API to create a front-end HTTP controller that interacts with the HelloGrain.

public class HelloController : ApiController
{
    // GET api/Hello/{userId}
    public async Task<string> Get(long userId)
    {
        IHello grain = HelloFactory.GetGrain(userId);
        var response = await grain.SayHello();
        return response;
    }

    // DELETE api/Hello/{userId}
    public async Task Delete(long userId)
    {
        IHello grain = HelloFactory.GetGrain(userId);
        await grain.SayGoodbye();
    }
}

While this is a contrived example, you can see how you can map your REST resources to individual grains.

This is the architectural approach the Halo 4 Services took when deploying Orleans.  We built a custom, lightweight, super-fast front-end that supported a set of HTTP APIs.  HTTP requests were minimally processed by the front-end and then routed to the Orleans Grains for processing.  This allowed the game code and the services to evolve independently of one another.

The above example uses ASP.NET Web API; if you want something lighter weight, check out OWIN/Project Katana.

*HelloGrain Code Samples were taken from Project “Orleans” Samples available on Codeplex, and slightly modified.

You should follow me on Twitter here

Orleans Preview & Halo 4


On Wednesday at Build 2014 Microsoft announced the preview release of Orleans.  Orleans is a runtime and programming model for building distributed systems, based on the actor model.  It was created by the eXtreme computing group inside Microsoft Research, and was first deployed into production by 343 Industries (my team!) as a core component of the Halo Services built in Azure.

I am beyond excited that Orleans is now available for the rest of the development community to play with.  While I am no longer at 343 Industries and Microsoft, I still think it is one of the coolest pieces of tech I have used to date.  In addition, getting Orleans and Halo into production was truly a labor of love, and a collaborative effort between the Orleans team and the Halo Services team.

In the Summer of 2011, the services team at 343 Industries began partnering with the Orleans team in Microsoft Research to design and implement our new services.  We worked side by side with the eXtreme Computing Group, spending afternoons pair programming, providing feedback on the programming model, and specifying what other features we needed to ship the Halo 4 Services.  Working with the eXtreme Computing Group was an amazing experience; they are brilliant developers and were great to work with, always open to feedback and super helpful with bug fixes and new feature requests.

Orleans was the perfect solution for the user-centric nature of the Halo Services.  Because we required high throughput and low latency, we needed stateful services.  The Location Transparency of actors provided by Orleans, and the asynchronous “single-threaded” programming model, made developing scalable, reliable, and fault-tolerant services easy.  Developers working on features only had to concentrate on the feature code, not message passing, fault tolerance, concurrency issues, or distributed resource management.

By the Fall of 2011, a few months after our partnership began, Orleans was first deployed into production to replace the existing Halo Reach presence system, in order to power the realtime Halo Waypoint Atlas experience.  The new presence service, built on Orleans in Azure, had parity with the old service (presence updates every 30 seconds) plus the ability to push updates every second to provide realtime views of players in a match on a connected ATLAS second screen.

After proving out the architecture, Orleans, and Azure, the Halo team moved into full production mode, rewriting and improving upon the existing Halo Services including Statistics Processing, Challenges, Cheating & Banning, and Title Files.

On November 6th 2012, Halo 4 was released to the world, and the new Halo Services went from a couple hundred users to hundreds of thousands of users in the span of a few hours.  So if you want to see Orleans in action, go play Halo 4 or check out Halo Waypoint; both of those experiences are powered by Orleans and Azure.

Now here is the fun part: Orleans has been opened up for preview by the .NET team.  You can go and download the SDK.  In addition, a variety of samples and documentation are available on CodePlex (I know, it’s not GitHub, sad times, but the samples are great).  I spent Wednesday night playing around with the samples and getting them up and running.

I highly recommend checking Orleans out and providing feedback to the .NET team.  Stay tuned to the .NET blog for more info, and feel free to ask me any questions you may have.  In addition, I have a few blog posts in the works to help share some of the knowledge I gained while building the Halo 4 Services using Orleans!

References

More Halo, Azure, Orleans Goodness

You should follow me on Twitter here

Node Summit 2013 Retrospective

This year I attended Node Summit 2013 from December 3rd-4th in San Francisco.  I went to support Michael Shim, an HBO coworker who was speaking on a panel about Node.js in the Digital Media Universe, and to get a crash course in the culture, tech, and community surrounding this framework.  Being brand new to Node.js, I found Node Summit a great opportunity to get a feel for the community and how other companies were using it.

Below are some of the thoughts and overall themes that I took away from the conference.

Node Can Perform at Enterprise Scale

There were dozens of examples of companies using and deploying Node.js at enterprise scale.  Walmart had great success this past Black Friday.  Groupon, LinkedIn, eBay, & PayPal have all recently switched to running Node in their services’ stacks with great success.  Media companies like NPR, Condé Nast, and DirecTV are also starting to use Node successfully.

The big takeaway was that this framework is being deployed by several companies at large scale, and it is performing.  The fledgling tech does have some challenges to work through, especially around security and patching, which were addressed in a panel on day 2, but overall Node.js is a very viable option for building large-scale services.

Node & Functional Programming

On day one of the summit, many presentations and panels talked about the process of incorporating Node.js into companies’ tech stacks, which was not surprising.  However, I was surprised to hear that as part of the process many companies, like LinkedIn and PayPal, also switched from object-oriented programming models to functional programming models.

Thinking back on it, this shouldn’t be surprising, since the asynchronous and event-driven nature of Node.js applications fits extremely well with functional programming models.  Now if only message passing between Node processes were easier…

Fast Mean Time to Recovery 

An overall theme throughout several talks was how easy it is to bring up and tear down Node.js processes.  This is not only great for developer efficiency, but it is also incredibly important in production.  The ability to kill a misbehaving process and quickly replace it with a new instance means the Mean Time To Recovery (MTTR) is low.  Having a low MTTR means you can fail fast and often.

It was also cited that Mean Time to Failure (MTTF) is no longer as important a metric.  This is a bit of a paradigm shift when thinking about more traditional services.

No More Monoliths

Node.js is really useful when you structure your services into small modular systems.  Node services should be broken down into lots of little processes that communicate via messages passed through queues or streams.

Small processes are easier to reason about, make your application more testable, and keep the code more manageable.  The Node Package Manager (npm) makes breaking your code into modules and managing dependencies extremely easy.  npm was touted throughout the conference as Node’s killer feature.

The Node Community really loves Node 

This may sound like a no-brainer, but the community was incredibly passionate.  It’s no wonder, since the use of Node.js has grown tremendously over the past year and the number of modules in npm has also been growing exponentially.  It was awesome to be around such a group of motivated, passionate developers.

In addition, companies are finding it easy to hire developers, because developers really love Node.js.  Even if they don’t have prior experience, developers want to learn and work with Node.js.  I should know, I’m one of them :)

and not a lot else…

The downside of this passion came with a lot of bashing of other tech that felt more religious than factual.  Java was lambasted in every talk and panel.  To be fair, asynchronous programming in Java is quite cumbersome, whereas the asynchronous model is built into the Node.js framework.  But at its core Java is a language and Node.js is a framework, so apples-to-apples comparisons seemed odd without mention of Java frameworks, like those from Typesafe, that also try to solve this problem.

.NET tech was barely mentioned beyond snide remarks.  I found these remarks odd since I have worked rather extensively with the async/await programming model introduced in .NET 4.5, and in my experience it is easy to use and results in clean, concise code.

Ruby on Rails also took a lot of heat, especially in the Why Large Scale Mobile & E-Commerce Apps Use Node.js panel.  Groupon and LinkedIn both started with Ruby on Rails stacks.  Both companies switched to a Node stack, citing that the Ruby on Rails services could not scale and that the code base was unmanageable.  However, in a later talk Sean McCullogh, a Groupon engineer, did mention that Groupon’s original Ruby on Rails architecture was broken and that changing business needs led them to switch to Node.js.  I greatly appreciated this honesty instead of just blaming the Ruby stack for all their problems.

As a developer I think Node.js is fantastic.  The framework embraces a lot of the core principles that make developing and deploying large-scale web services easy.  And to be fair, most tech communities and tech-specific conferences tend to run into this problem.  However, there are tradeoffs in any technology and lessons to be learned from other solutions.  While this was a Node conference, I wish there had been a more honest discussion about these tradeoffs, and more openness to other technology.

Favorite Talks

Reflections on Three Years of Node.js in Production – Bryan Cantrill

This was one of the more technical talks, and Bryan was immensely entertaining, dropping quotes like, “You gotta love open source sometimes a magical pony comes and poops out a rainbow on your code.”

The majority of his talk focused on logging and debugging Node.js services.  JavaScript core dumps are really hard to trace and debug, but Joyent has done a lot of work to demystify this process by creating tools that work on SmartOS.  However, general Linux users hadn’t been able to take advantage of this functionality… until now!  Bryan did a live demo of taking a core dump from an Ubuntu Linux box and getting the same deep analysis using ephemeral compute within the Joyent Manta Storage Service.  More details are available on the Joyent Blog.  Pretty cool!

Node From the Battlefield – Eran Hammer 

Eran, a Senior Architect at Walmart, gave a highly entertaining talk on how Walmart moved a large portion of their services stack over to Node.js, and on the lead-up to their biggest test: Black Friday.

While preparing for Black Friday the Walmart team encountered a suspected memory leak.  To help diagnose the problem, the team increased analytics and monitoring, which signaled that the problem was actually in the core of Node.js.  TJ Fontaine at Joyent tracked down the memory leak, fixed it, and released a new version of Node, which the Walmart team picked up just in time for holiday shopping.

On Black Friday Eran live-tweeted Walmart DevOps under the hashtag #nodebf, and it was surprisingly boring.  With the memory leak fixed, their services performed well, consuming low amounts of memory and CPU despite high amounts of traffic.

At the end of his talk Eran read a “bedtime story” to TJ on stage called the Leek Seed.  It was quite comical and had the audience in stitches.

This talk demonstrated Node’s ability to scale and perform in large production environments.  In addition, it reinforced my opinion that rigorous logging, monitoring, and testing at load is the only way to discover some of the most nefarious issues in distributed systems.

Conclusion

I had an excellent time at Node Summit, and I learned a lot.  Personally I like more technical talks and code at conferences, and fewer panels, but there are other Node.js-related conferences tailored to that.  Overall it was a great crash course in the Node world and community, and I look forward to applying what I learned, getting more involved in the community, and developing services on Node in the future.

You should follow me on Twitter here

Enforcing Idempotency at the Data Layer

Idempotency

In Computer Science idempotent operations are defined as operations that produce the same result if they are executed once or multiple times.

Practically, in an application or service, this means that idempotent operations can be retried or replayed without the fear of processing the data multiple times or causing unwanted side effects.  As a web service developer, having idempotent operations allows us to have simpler logic for handling failures: if a request fails we can simply retry it by replaying it.  In services and messaging systems, having idempotent operations is the easiest way to handle “at least once messaging.”  Jimmy Bogard has written a great post on this topic on Los Techies: (Un)Reliability in Messaging: Idempotency and De-Duplication.

In services and messaging systems having idempotent operations is the easiest way to handle “at least once messaging”

Most operations are not mathematically idempotent, so developers must write application-level logic to enforce idempotency of requests.  However, if we can enforce idempotency at the data storage layer, then the need for special-case logic is minimized.

Using Azure Table to Enforce Idempotency

Azure Table Storage provides transaction support inside of a single table partition.  We can take advantage of this to enforce idempotency at the data storage layer of applications by using the Aggregate-Event data model described in a previous post.

In order for this to work the data needs to be structured in the following way.

  1. The Aggregate Entity and Event Entities must be stored in the same Partition.
  2. Updating the Aggregate Entity and Adding the Event Entity to storage must occur in the same Batch Operation.

By updating the Aggregate Entity and adding the Event Entity in the same TableBatchOperation, either both writes will succeed or both writes will fail, leaving your data in a consistent state whether you have received the event once or many times.

If the Batch Operation fails, you can determine whether it was because the data was already processed, and not caused by some other failure, by checking the Storage Exception’s HTTP status code.  If the status code equals 409 – Conflict, then one of the Entities marked for Add in the Batch Operation already exists in the table.

Simple Stats Example

To see this in practice, we’ll go through a simple example for storing game statistics for a game like Galaga, using Azure Table Storage.  Each player has a unique PlayerId and at the end of each game players will upload data to store their statistics for that game.

public class PlayerGameData
{
   public Guid GameId { get; set; }
   public Int32 GameDurationSeconds { get; set; }
   public bool Win { get; set; }
   public Int32 Points { get; set; }
   public Int32 Kills { get; set; }
   public Int32 Deaths { get; set; }
}

In the previous post on Immutable Data Stores I also share code for the TableEntities used in the Simple Stats Example, which will not be repeated here.

Below is an example of how to Process a Simple Stats Game and store it such that Idempotency is enforced at the Data Layer.

public static void ProcessGame(Int64 playerId, PlayerGameData gameData)
{
   // Create the batch operation.
   TableBatchOperation batchOperation = new TableBatchOperation();

   // Update the in-memory copy of the aggregate Player entity.
   PlayerEntity player = PlayerEntity.GetPlayerEntity(playerId);
   UpdatePlayerEntity(player, gameData);
   batchOperation.InsertOrReplace(player);

   // Create the PlayerGame event row.
   PlayerGameEntity playerGame = new PlayerGameEntity(playerId, gameData);
   batchOperation.Insert(playerGame);

   try
   {
      // Both writes commit or fail together (same partition).
      StorageManager.Instance.PlayersTable.ExecuteBatch(batchOperation);
   }
   catch (Microsoft.WindowsAzure.Storage.StorageException storageEx)
   {
       // Check if the error occurred because we already processed the data.
       if (storageEx.RequestInformation.HttpStatusCode == (int)HttpStatusCode.Conflict)
          return;

       throw;
   }
}

There are two options when a game is processed:

1. The system has not processed this Game: In this case the above code will create a new PlayerGame Entity and update the in-memory copy of the Player Entity.  The Batch Operation will succeed, and the PlayerGame, along with the updated Player Entity, will be stored in the table.

2. The system has processed this Game: In this case the above code will create a new PlayerGame Entity and update the in-memory copy of the Player Entity.  However, when the Batch Operation executes, it will fail since the Event Entity already exists in this partition.  Because of the per-partition transaction support provided by Azure Storage, the updates to the Player Entity will also not be stored.  The data in storage will be the same as if the Game had only been processed once.

By using Azure Table Storage to enforce idempotency of event processing, you no longer have to write application-level logic to handle at-least-once messaging.  In addition, by using this pattern you get all the benefits of Immutable Data Stores as well.

You should follow me on Twitter here

Origin Story: Becoming a Game Developer


Over the past few weeks I have been asked over a dozen times how I got into the Games Industry, so I thought I would write it down.

TLDR; My first console was an SNES.  I learned to program in high school. I attended Cornell University and got a B.S. in Computer Science.  My first job out of college was as a network tester on Gears of War 2 & 3.  I joined 343 Industries as a Web Services Developer in January of 2010, and recently shipped Halo 4 on November 6th 2012.

In the Beginning

My story starts out in the typical fashion: I fell in love with video games after my parents got me an SNES as a kid.  However, here is where my story diverges; my career in the games industry was not decided at 7.

In fact I had already chosen my career a few years earlier.  When I was 5, I announced to my mother that I did not need to learn math because I was going to be a writer when I grew up.  I had an active imagination, and loved exercising it by writing stories of my own.  My first major work was a story about ponies entitled “Hores.”  Luckily my parents would not let me give up on math, and helped me with my spelling.

It turned out that I actually did enjoy math; I was just ahead of my classmates in comprehension, which is why I found it boring in grade school.  In middle school I was placed into the Advanced Math program along with about 25 other students selected to take accelerated courses.  I enjoyed the problem sets and challenges, and more importantly I excelled at them.  This put me on Mrs. Petite’s short list of students to recruit.

The Way of the Code

Mrs. Petite taught Computer Science at my high school, and she notoriously recruited any advanced math or science student to take her class.  She was stubborn and didn’t take no for an answer, so sophomore year, instead of having an extra period of study hall like I originally intended, I was in her Intro to Programming class, writing a “Hello World” application in Visual Basic.

Mrs. Petite quickly became my favorite teacher, and I took AP-level Computer Science classes junior and senior year, learning C++ and Java, respectively.  We learned programming basics, object-oriented programming, and simple data structures with fun assignments like writing AI for a Tic-Tac-Toe competition, programming the game logic for Minesweeper, and creating a level of Frogger.

During High School I began to realize that I wasn’t just good at programming, but I truly enjoyed it.  Computer Science wasn’t just a science, it was a means of creation.  Like writing, programming gave me the power to start with a blank canvas and bring to life anything I could imagine.

“Programming gave me the power to start with a blank canvas and bring to life anything I could imagine.”

Throughout middle school and high school I played my fair share of video games.  Most notably I acquired a PlayStation, raided dozens of tombs with Lara Croft, and played Duke Nukem 3D, my first first-person shooter, but games were still not my main focus.  I ended up spending more of my time programming, playing lacrosse, singing in choir, participating in student council, and spending time with my friends.  Video games were great, but I still had not decided to pursue a career in the games industry.

I graduated from High School not only having learned to program in Visual Basic, C++, and Java, but with a passion for programming.  In the Fall of 2004 I decided to continue on my coding adventure by enrolling in the Engineering School at Cornell University focusing on Computer Science.

College

I entered Cornell University expecting to major in Computer Science, but to be sure I dabbled in other subjects (Philosophy, Evolutionary Biology, and Civil Engineering) before declaring my major.  To this day I still have a diverse set of interests and I enjoyed all of these subjects immensely, but none of them lived up to the joys of coding.

We Made It!
Computer Science Best Friends at Graduation

College was this beautiful, wonderful, stressful blur.  I ran on massive amounts of caffeine and memories of crazy weekends spent with friends.  We worked really hard, but played really hard too.  Even with all the pressure, stress, and deadlines I was having the time of my life.  The classes were fast paced, I was being challenged, and I was learning an immense amount from Data Structures to Functional Programming to Graphics to Security.

Sophomore year I declared myself for CS and also became a Teaching Assistant for CS 211 (Object-Oriented Data Structures and Programming).  In addition, another immensely important event happened in the fall of my sophomore year: I bought an Xbox 360 and Gears of War.  I loved the game, and spent many nights during winter break staying up till 2am chainsawing Locusts.  I also spent a significant amount of time playing Viva Piñata that break; like I said, diverse set of interests.  This new console, some fantastic games, and the Xbox Live-enabled social experiences reignited my passion for gaming.  Now I began to consider game development as a career.

Internships

After sophomore year I took a somewhat unconventional but completely awesome internship at the Stanford Linear Accelerator Center (SLAC).  I lived in a house with 20 brilliant physics majors, and learned about black holes, dark matter, and quantum computing while helping to manage the batch farm which provided all the computing power for the physicists working at the center.  It was an absolutely amazing experience.

After junior year I once again went west for the summer, this time to Redmond, Washington as a Microsoft intern working on Windows Live Experiences (WEX).  During that summer I got to exercise my coding chops and, most importantly, fully solidified the opinion that I wanted to be a developer.  I left the Pacific Northwest at the end of summer with two job offers in WEX, but by then I knew I really wanted to work on games.  So after some negotiation and another round of interviews I managed to secure a third offer in Microsoft Game Studios as a Software Engineer in Test working on the networking and co-op of Gears of War 2.  I was beyond thrilled.

I graduated from Cornell in 2008 with a Bachelor of Science in Computer Science from the Engineering School.  It was a bittersweet moment; I had loved my time at Cornell and most of my friends were staying on the East Coast, but I knew exciting things were waiting for me in Seattle.

The Real World (Seattle)

In July of 2008 I moved out to Seattle and joined the Microsoft Game Studios team working on Gears of War 2.  I was quickly thrown into the fire as I was assigned ownership of testing the co-op experience.  It was terrifying and exciting to be given so much responsibility right away.  I eagerly jumped into the project and joined the team in crunching immediately after starting.

The first few months in Seattle were a whirlwind as we pushed to get the game through to launch.  The hours were long, but I was passionate about the project and I was learning a lot.  It was an amazingly gratifying experience the day Gears of War 2 went gold.  When the game launched I had another immensely satisfying moment: my computer science best friend from college and I played through the game in co-op, and at the end we saw my name in the credits. Life Achievement Unlocked!

Midnight Launch Halo 4

I love social game experiences, both collaborative and competitive, so post-launch I focused a lot of my energy on improving my skills in the areas of networking and services.  As we moved into sustain on Gears of War 2, I began focusing on the matchmaking and networking experience.  I spent my free time diving through the Xbox XDK, learning about the networking stack, and playing around with Xbox Live Services.  As work began on Gears of War 3, I took ownership of testing the matchmaking code and became very involved in dedicated servers for multiplayer.

In the Fall of 2009 I was asked to temporarily help the fledgling 343 Industries studio ship one of the first Xbox Title Applications, Halo Waypoint.  I knew it would mean extra hours and a lot of work, but the opportunity to work on new technology and make connections in other parts of Microsoft Game Studios was too good to pass up.  I dove headfirst into the transport layer of the Waypoint console app, and helped get them through launch in November 2009.

Over the next few months I began to evaluate what I wanted to do next in my career.  Working on Gears of War 3 was a great opportunity, but I really wanted to be a developer.  The parts of my testing job that I found most satisfying were designing systems, coding internal tools, and researching new technology.  So when the opportunity to join 343 Industries as a developer appeared in January 2010, I jumped at it.  It was a perfect fit.  After reaching out to my contacts in 343 and participating in a full round of interviews, I was offered a position on the team as a web services developer to write code that would power the Halo Universe and enable social experiences; I excitedly accepted!

One of my first tasks at the studio was working on the Spartan Ops prototype.  I was elated that I got to utilize both my technical and creative skills to help create a brand new experience; my Spartan adventures were off to an amazing start!  The rest is history, and a few years later we shipped Halo 4.  After launch I once again had an intense moment of elation after playing through Halo 4 in co-op with my college bff and seeing my name in the credits.  It never gets old.

Final Thoughts

Some thoughts, all my own and anecdotal: to be successful as a game developer, first and foremost you have to be passionate about what you do, whether it is programming, art, design, writing, or something else.  You need to be passionate about games and your chosen field.  In addition, I believe my love of learning has been a huge asset in my career development and growth.  I am not afraid to dive into new technologies or get my hands dirty in a code base I do not understand.  I believe doing this helped me get into the industry, and continuing to do so makes me valuable.  Lastly, do not be afraid to ask for what you want; no one is going to just hand you your dream job.  Of course there is a bit of luck and timing involved in breaking into the industry, but working incredibly hard is the best way I know to help create those opportunities.

You should follow me on Twitter here

Flexible Security Policies

Last weekend I made a rather bold statement on Twitter.

This sparked off a conversation in 140 character installments during which I found it difficult to fully convey my point. This is my attempt at clarity.

In the services world, taking dependencies on third-party services is increasingly necessary, especially as more services move into the cloud. However, regardless of third-party failures, I firmly believe that YOU own your service's availability.

I’d like to examine a single point of failure that communication with most third-party services has: SSL certificates. Most services implement an all-or-nothing policy regarding SSL certificates. Either a certificate meets the criteria (correct host name, acceptable signature algorithm, valid date, etc.) or it does not. However, this black-and-white policy forces your service to have a single point of failure outside of your control.

Implementing a flexible policy that can be updated on the fly, without deploying new code, reduces the damage this single point of failure can cause. Instead of having calls fail until the certificate issue has been resolved by the third party, a flexible policy gives the DevOps team the ability to weigh the security risk of accepting the bad certificate against the business cost of having downtime.

Some examples where this may be appropriate:

  • An SSL Certificate has expired: If the certificate recently expired, and there is no evidence that the private key has been compromised the risk of accepting SSL traffic from this endpoint is low. The communication is still encrypted, and likely secure. Accepting the expired certificate for a few hours or days might be worthwhile to avoid service degradation.
  • An SSL Certificate has a weak signature algorithm: If the certificate has been recently renewed with a weaker signature algorithm than expected, the communication is still encrypted. Accepting the weaker certificate for a few hours to avoid down time may be acceptable, while a new certificate is rolled out.

If the security policy is flexible, I may decide to accept a recently expired SSL certificate for a configurable duration of time, allowing my services to stay up. Conversely, I may analyze the security risk and decide that I cannot tolerate it for any period of time. In that case I will choose a degraded service experience and continue to reject the certificate.

In either scenario the power is in my hands. I am owning the availability of my service. I am making a conscious decision about whether to be available or not. With the standard security policy there is no option; my service’s availability is not in my control.

An ideal solution would allow the security policy to be adjusted for a single certificate. Critical or error-level logs would still be created as long as the certificate did not meet the default security standards. In addition, the decision to accept the certificate should be revisited periodically by the DevOps and business teams, until a valid SSL certificate is provided by the third-party service.  There should not be a blanket policy; instead, each certificate failure should be evaluated on a case-by-case basis.
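As a hedged C# sketch of the idea (ICertificatePolicyStore is hypothetical, and this is illustrative rather than something to ship as-is), a .NET service could install a custom validation callback that accepts only explicitly allow-listed certificates when validation fails, while still logging the failure:

using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

// Hypothetical: a store of certificate thumbprints that operations has explicitly,
// temporarily accepted (e.g. a recently expired certificate), revisited periodically.
public interface ICertificatePolicyStore
{
    bool IsTemporarilyAccepted(string thumbprint);
}

public static class FlexibleCertificateValidation
{
    public static void Install(ICertificatePolicyStore policy)
    {
        ServicePointManager.ServerCertificateValidationCallback =
            (sender, certificate, chain, sslPolicyErrors) =>
            {
                if (sslPolicyErrors == SslPolicyErrors.None)
                    return true; // certificate is fully valid

                var cert = new X509Certificate2(certificate);
                bool accepted = policy.IsTemporarilyAccepted(cert.Thumbprint);

                // Always emit an error-level log for the failed validation here,
                // whether or not the certificate is being temporarily accepted.

                return accepted;
            };
    }
}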

I am by no means advocating that services should blindly ignore SSL certificate failures, or that third-party services should not be held responsible when failures occur. I am instead advocating for the ability to make the decision for myself and update our security policy on the fly if needed. The goal of a flexible policy is not to blindly tolerate security risks, but to provide the ability to make trade-offs in real time: service availability vs. security risk.

You should follow me on Twitter here