Creating Immutable Data Stores


Immutability is the property of data not changing once it has been created. Most storage systems provide basic C.R.U.D. (Create, Read, Update, Delete) functionality. To have truly Immutable data, however, we should use only the Create and Read operations those systems provide.

Nathan Marz makes an excellent case for why we need Immutable data in his Strata Conf 2013 talk, Human Fault-Tolerance. The basic premise is that people make mistakes. We are fallible beings: we write code with bugs, we accidentally delete data from production, and so on. He goes on to posit that the worst kinds of errors are those that cause Data Loss or Data Corruption, because it is impossible to recover from them. By having Immutable Data stores we minimize the impact bugs or mistakes can have on our systems.

Aggregate-Event Data Model

Building truly Immutable data stores can be inefficient. For instance, in a game statistics service, calculating the total number of games a player has ever played would require reading all of the stored data about that player's past games and computing the number on every request. For an active player who has played hundreds or thousands of games, this can be time-consuming and expensive. Caching can help, but keeping in-memory aggregates for every user can also be costly, depending on the size of your audience.

What we really want are the benefits of Immutable data stores with the efficiency of precomputed aggregates. We can achieve this using a hybrid model made up of Event Entities and Aggregate Entities.

Event Entities store information about the system's events. These Entities are Immutable and should only support Create and Read operations. The Event Entities are the source of truth for your application.

Aggregate Entities store the precomputed aggregates.  These Entities are Mutable, and will support all C.R.U.D. operations.  These can be used in your application to respond to requests and should be updated when new events are received.
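To make the Create-and-Read-only rule for Event Entities concrete, here is a minimal in-memory sketch. The `AppendOnlyStore` class and its member names are hypothetical and are not part of any storage library; a real system would enforce the same rule in front of its actual storage layer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for an event store.
// It exposes Create and Read only; there is no Update or Delete.
public sealed class AppendOnlyStore<TEvent>
{
    private readonly Dictionary<string, TEvent> _events = new Dictionary<string, TEvent>();

    // Create: refuses to overwrite an event that already exists.
    public void Create(string key, TEvent evt)
    {
        if (_events.ContainsKey(key))
        {
            throw new InvalidOperationException("Event '" + key + "' already exists and cannot be changed.");
        }
        _events.Add(key, evt);
    }

    // Read: returns the single event stored under the key.
    public TEvent Read(string key)
    {
        return _events[key];
    }

    // Read: returns every stored event, for example to rebuild an aggregate.
    public IEnumerable<TEvent> ReadAll()
    {
        return _events.Values;
    }
}
```

Because the type simply has no members that modify or remove an event, a bug in calling code cannot corrupt the source of truth through this class.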

[Figure: the Aggregate-Event model]

In this model the Aggregate Entity is really just a function over a set of the corresponding Event Entities.  The Aggregate Entity can be recomputed at any time from the corresponding Event Entities if the Aggregate Entity is deleted, a bug is found in the aggregate computation code, or a new aggregate is desired to support a new feature.
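The idea that an aggregate is "just a function over a set of events" can be sketched as a fold. The `AggregateFunctions.Recompute` helper below is a hypothetical name chosen for illustration; it only wraps LINQ's `Aggregate` method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AggregateFunctions
{
    // Hypothetical helper: rebuilds an aggregate from scratch by folding
    // an "apply" function over all of the events, starting from an empty value.
    public static TAggregate Recompute<TEvent, TAggregate>(
        IEnumerable<TEvent> events,
        TAggregate empty,
        Func<TAggregate, TEvent, TAggregate> apply)
    {
        return events.Aggregate(empty, apply);
    }
}
```

For instance, `AggregateFunctions.Recompute(new[] { 120L, 300L, 80L }, 0L, (total, points) => total + points)` evaluates to 500. Fixing a bug in an aggregate, or adding a new one, then amounts to changing the `apply` function and running it over the stored events again.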

Simple Stats Example

To see this in practice, we'll go through a simple example of storing game statistics for a game like Galaga, using Azure Table Storage. Each player has a unique PlayerId, and at the end of each game players will upload data to store their statistics for that game.

using System;

// Statistics a player uploads at the end of a single game.
public class PlayerGameData
{
    public Guid GameId { get; set; }
    public Int32 GameDurationSeconds { get; set; }
    public bool Win { get; set; }
    public Int32 Points { get; set; }
    public Int32 Kills { get; set; }
    public Int32 Deaths { get; set; }
}

The system needs to store the player statistics for each game. In addition, players will want to view their aggregate statistics across all of their games.

In this example the Events in the system are the game statistic uploads. The Aggregates are the statistics players will see across all of their games, like Total Points or Total Kills.

The Event Entities in the system will store the information for each game a player plays. The PartitionKey will be based on the PlayerId and the RowKey will be based on the GameId. These will be Created at the end of each game. These Entities are Immutable; they are the source of truth for the system.

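using System;
using Microsoft.WindowsAzure.Storage.Table; // assumes the classic Azure Storage client library

// Event Entity: one row per game, created once and never updated.
public class PlayerGameEntity : TableEntity
{
    private const string RowKeyString = "game_{0}";

    public PlayerGameEntity(Int64 playerId, Guid gameId)
    {
        // Fixed-width hex gives every player's PartitionKey the same length.
        PartitionKey = String.Format("{0:x16}", playerId);
        RowKey = String.Format(RowKeyString, gameId);
    }

    // Parameterless constructor, needed when the storage library materializes entities.
    public PlayerGameEntity() { }

    public Int64 Points { get; set; }
    public bool Win { get; set; }
    public Int32 Kills { get; set; }
    public Int32 Deaths { get; set; }
    public Int64 GameDuration { get; set; }
}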
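To see what these keys look like, the sketch below pulls the two format strings from the entity above into a standalone helper. The `StorageKeys` class and its method names are hypothetical and exist only for illustration.

```csharp
using System;

public static class StorageKeys
{
    // Formats the player id as 16 lowercase hex digits, zero-padded.
    public static string PartitionKeyFor(long playerId)
    {
        return String.Format("{0:x16}", playerId);
    }

    // Prefixes the game id so the row can be told apart from other entity types
    // stored in the same partition.
    public static string GameRowKeyFor(Guid gameId)
    {
        return String.Format("game_{0}", gameId);
    }
}
```

`StorageKeys.PartitionKeyFor(255)` returns `00000000000000ff`, and `StorageKeys.GameRowKeyFor(Guid.Empty)` returns `game_00000000-0000-0000-0000-000000000000`. Since every event for a player shares the same PartitionKey, all of that player's games can be read from a single partition.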

The Aggregate Entities will store the aggregate player statistics. The PartitionKey will be based on the PlayerId, and the RowKey will be a static string that identifies the entity type, since there will only be one of these entities per player. These entities are Mutable. They should be updated at the end of each game.

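using System;
using Microsoft.WindowsAzure.Storage.Table; // assumes the classic Azure Storage client library

// Aggregate Entity: one mutable row per player holding the running totals.
public class PlayerEntity : TableEntity
{
    private const string RowKeyString = "Players";

    public PlayerEntity(Int64 playerId)
    {
        PartitionKey = String.Format("{0:x16}", playerId);
        RowKey = RowKeyString;
    }

    // Parameterless constructor, needed when the storage library materializes entities.
    public PlayerEntity() { }

    public Int64 TotalPoints { get; set; }
    public Int32 TotalGames { get; set; }
    public Int32 TotalWins { get; set; }
    public Int32 TotalKills { get; set; }
    public Int32 TotalDeaths { get; set; }
    public Int64 TotalSecondsPlayed { get; set; }
}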
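Updating the aggregate at the end of a game could look roughly like the sketch below. `GameStats` and `PlayerTotals` are simplified, hypothetical copies of the entity properties above with the storage base class left out so the snippet stands alone, and `PlayerTotalsUpdater.ApplyGame` is an illustrative name. Writing the entities back to storage is deliberately omitted.

```csharp
using System;

// Simplified copy of the per-game event properties (no storage base class).
public sealed class GameStats
{
    public long Points { get; set; }
    public bool Win { get; set; }
    public int Kills { get; set; }
    public int Deaths { get; set; }
    public long GameDuration { get; set; }
}

// Simplified copy of the aggregate properties.
public sealed class PlayerTotals
{
    public long TotalPoints { get; set; }
    public int TotalGames { get; set; }
    public int TotalWins { get; set; }
    public int TotalKills { get; set; }
    public int TotalDeaths { get; set; }
    public long TotalSecondsPlayed { get; set; }
}

public static class PlayerTotalsUpdater
{
    // Folds one finished game into the running totals, mutating the aggregate in place.
    public static void ApplyGame(PlayerTotals totals, GameStats game)
    {
        if (totals == null) throw new ArgumentNullException(nameof(totals));
        if (game == null) throw new ArgumentNullException(nameof(game));

        totals.TotalPoints += game.Points;
        totals.TotalGames += 1;
        if (game.Win) totals.TotalWins += 1;
        totals.TotalKills += game.Kills;
        totals.TotalDeaths += game.Deaths;
        totals.TotalSecondsPlayed += game.GameDuration;
    }
}
```

Starting from an empty `PlayerTotals` and calling `ApplyGame` once per stored game, in any order, yields the same totals as the incremental updates did, which is what makes the aggregate safe to rebuild.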

This will create Storage Partitions that look like the example below, captured using Cerebrata Tools. Each PlayerEntity (the Aggregate Entity) will have a set of corresponding PlayerGameEntities (Event Entities). As the example shows, every PlayerEntity value can be recalculated from the PlayerGameEntities.

[Figure: storage example for the Aggregate-Event model]

This model has another benefit besides avoiding Data Loss and Corruption: the Data Layer is now a record of all events that have gone through the system. This is extremely useful for testing and debugging, both in development and in production.
