Last weekend I made a rather bold statement on Twitter.

This sparked a conversation in 140-character installments, and I found it difficult to fully convey my point. This post is my attempt at clarity.

In the services world, taking dependencies on third-party services is increasingly necessary, especially as more services move into the cloud. However, regardless of third-party failures, I firmly believe that YOU own your service's availability.

I’d like to examine a single point of failure that communication with most third-party services has: SSL certificates. Most services implement an all-or-nothing policy for SSL certificates. Either a certificate meets the criteria (correct host name, acceptable signature algorithm, valid date, etc.) or it does not. This black-and-white policy leaves your service with a single point of failure outside of your control.

A flexible policy, one that can be updated on the fly without deploying new code, reduces the damage this single point of failure can cause. Instead of having calls fail until the third party resolves the certificate issue, a flexible policy gives the DevOps team the ability to weigh the security risk of accepting the bad certificate against the business cost of downtime.
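One way to make the policy changeable without a deploy is to keep it in configuration that the service re-reads at runtime. The following is a minimal sketch in Python, assuming the policy lives in a JSON document with a hypothetical per-host `allow_expired_hours` field; the field name and the `PolicyStore` class are illustrative, not part of any library.

```python
import json
import threading


class PolicyStore:
    """Holds the current certificate policy and swaps it atomically on reload.

    The JSON shape is hypothetical, e.g.:
    {"api.example.com": {"allow_expired_hours": 24}}
    """

    def __init__(self, json_text: str = "{}") -> None:
        self._lock = threading.Lock()
        self._policy = self._parse(json_text)

    @staticmethod
    def _parse(json_text: str) -> dict:
        policy = json.loads(json_text)  # raises ValueError on malformed JSON
        if not isinstance(policy, dict):
            raise ValueError("policy must be a JSON object")
        return policy

    def reload(self, json_text: str) -> None:
        # Parse before swapping, so a bad update leaves the old policy in place.
        new_policy = self._parse(json_text)
        with self._lock:
            self._policy = new_policy

    def allowed_expired_hours(self, host: str) -> int:
        # Default of 0 means no tolerance unless an operator explicitly opts in.
        with self._lock:
            entry = self._policy.get(host, {})
        return int(entry.get("allow_expired_hours", 0))
```

A file watcher, an admin endpoint, or a config service could call `reload`; which trigger to use is left open here. Validating before swapping means a typo in the policy cannot silently widen or break it.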

Some examples where this may be appropriate:

  • An SSL certificate has expired: If the certificate expired recently and there is no evidence that the private key has been compromised, the risk of accepting SSL traffic from this endpoint is low. The communication is still encrypted and likely secure. Accepting the expired certificate for a few hours or days might be worthwhile to avoid service degradation.
  • An SSL certificate has a weak signature algorithm: If the certificate was recently renewed with a weaker signature algorithm than expected, the communication is still encrypted. Accepting the weaker certificate for a few hours to avoid downtime may be acceptable while a new certificate is rolled out.
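Handling these cases differently requires knowing why a certificate failed, rather than collapsing every problem into a single pass/fail result. A minimal sketch, assuming the certificate has already been parsed into a hypothetical `CertInfo` record; the record, the weak-algorithm set, and the failure labels are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Hypothetical baseline of algorithms treated as weak; adjust to your own policy.
WEAK_SIGNATURE_ALGORITHMS = {"md5WithRSAEncryption", "sha1WithRSAEncryption"}


@dataclass
class CertInfo:
    """Hypothetical summary of a peer certificate, parsed elsewhere."""
    hostname_matches: bool
    not_after: datetime          # end of the certificate's validity period
    signature_algorithm: str


def classify_failures(cert: CertInfo, now: datetime) -> List[str]:
    """Return every reason the certificate fails the default policy."""
    failures = []
    if not cert.hostname_matches:
        failures.append("hostname_mismatch")
    if now > cert.not_after:
        failures.append("expired")
    if cert.signature_algorithm in WEAK_SIGNATURE_ALGORITHMS:
        failures.append("weak_signature")
    return failures
```

Returning the full list rather than a boolean lets an operator tolerate a recently expired certificate while still rejecting one that is expired and also presents the wrong host name.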

If the security policy is flexible, I may decide to accept a recently expired SSL certificate for a configurable duration, allowing my services to stay up. Conversely, I may analyze the security risk and decide that I cannot tolerate it for any period of time. In that case I will accept a degraded service experience and continue to reject the certificate.
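The configurable duration can be expressed as a grace window measured from the certificate's expiry time. A minimal sketch; the function name is hypothetical, and `grace` would come from whatever runtime configuration the service uses.

```python
from datetime import datetime, timedelta


def within_expiry_policy(not_after: datetime, now: datetime,
                         grace: timedelta) -> bool:
    """Check the certificate's validity date against an operator-chosen grace window.

    A zero grace window reproduces the strict default: reject as soon as
    the certificate expires. Other checks (host name, algorithm) still apply.
    """
    if now <= not_after:
        return True  # not expired yet
    return now - not_after <= grace
```

Defaulting `grace` to zero keeps the strict behavior until someone deliberately widens it, which covers the case where the risk cannot be tolerated for any period of time.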

In either scenario, the power is in my hands. I am owning the availability of my service. I am making a conscious decision on whether to be available or not. With the standard security policy, there is no choice. My service’s availability is not in my control.

An ideal solution would allow the security policy to be adjusted for a single certificate. Critical- or error-level logs would still be emitted for as long as the certificate fails to meet the default security standards. In addition, the DevOps and Business teams should revisit the decision to accept the certificate periodically, until the third-party service provides a valid SSL certificate. There should not be a blanket policy; instead, each certificate failure should be evaluated on a case-by-case basis.
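Keying the exception on the certificate's fingerprint keeps it from turning into a blanket policy: a different certificate, even from the same host and with the same problem, is still rejected. A minimal sketch under that assumption; the `Override` record, its review-date field, and the choice to let an override lapse once its review date passes are hypothetical design decisions, not a standard mechanism.

```python
import hashlib
import logging
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

log = logging.getLogger("cert_policy")


@dataclass
class Override:
    """Hypothetical exception granted to one specific certificate."""
    allowed_failures: List[str]  # e.g. ["expired"]
    review_by: date              # when DevOps and Business revisit the decision
    reason: str


def fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of the DER-encoded certificate."""
    return hashlib.sha256(der_bytes).hexdigest()


def should_accept(der_bytes: bytes, failures: List[str],
                  overrides: Dict[str, Override], today: date) -> bool:
    if not failures:
        return True  # meets the default policy
    fp = fingerprint(der_bytes)
    override = overrides.get(fp)
    # Every failure must be explicitly covered by this certificate's override.
    if override is None or not set(failures) <= set(override.allowed_failures):
        log.error("Rejecting certificate %s: %s", fp, ", ".join(failures))
        return False
    # Design choice (assumption): an unreviewed override lapses, forcing a new decision.
    if today > override.review_by:
        log.error("Override for %s passed review date %s; rejecting",
                  fp, override.review_by)
        return False
    # Accepted but below the default standard, so log loudly on every use.
    log.critical("Accepting certificate %s despite %s (%s)",
                 fp, ", ".join(failures), override.reason)
    return True
```

Letting the override lapse is only one option; a service that prefers availability could instead keep accepting and escalate the log level once the review date passes.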

I am by no means advocating that services should blindly ignore SSL certificate failures, or that third-party services should not be held responsible when failures occur. I am instead advocating for the ability to make the decision for myself and to update my security policy on the fly if needed. The goal of a flexible policy is not to blindly tolerate security risks, but to provide the ability to make trade-offs in real time: service availability vs. security risk.
