Post-mortem: 24/10/2022

Posted by
Aaron Gregory

Mistakes happen; it's the learning that takes place afterwards which is absolutely vital. On 24 October 2022, we suffered a major outage which took our main website, client area/billing system & all of our HTTPS URLs offline.

Why did this happen? What was done about it? What went wrong & how will we do things differently in the future?

What Happened:
At approximately 10:41am UK time, our internal monitoring alerted our infrastructure administrator to a major failure in our production cluster. During diagnostics, it was discovered that OpenShift had rolled back a previously applied patch to the routers.

Digging deeper, we noted that the API was still online and functional, but the UI was not. We attempted to log in via the API, and when that failed, we logged into the nodes directly to confirm. There, we found that OpenShift had rolled out a new router; this caused the initial outage. We also noted that the cluster was CPU-constrained, and in an effort to recover it, all three machines were rebooted.
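The first-pass triage described above (API reachable but UI not, pointing at the router/ingress layer rather than the nodes) can be sketched as a simple decision. This is illustrative only; the function and diagnosis labels are hypothetical, not our actual tooling:

```python
def diagnose(api_up: bool, ui_up: bool) -> str:
    """Rough first-pass triage based on which layers respond.

    Hypothetical sketch: real diagnosis involved logging into
    the nodes, as described above.
    """
    if api_up and not ui_up:
        # The backend answers but the front door does not: suspect
        # the router/ingress layer (here, an unexpected router rollout).
        return "router/ingress"
    if not api_up and not ui_up:
        # Nothing answers: suspect the nodes themselves
        # (e.g. CPU-constrained or down).
        return "cluster/nodes"
    return "healthy"
```

In our case `diagnose(True, False)` would have pointed straight at the router layer, which matched what we found on the nodes.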

During this, it was noted that core-1 had an incorrect disk layout. This was corrected to prevent further instability and bring core-1 in line with the rest of the nodes.

OpenShift continued to refuse to boot. To mitigate this, we modified the core node system to give easier unified access, and took the opportunity to upgrade our storage infrastructure, improving both application lifecycle management and overall cluster health. In addition, we have deployed more nodes to better handle sustained load, and modified our routing so that individual outages no longer affect core services.

Why was this such a major outage and what went wrong?

A number of factors contributed to the severity of this outage, but most notably, most of our core services went offline due to a single isolated issue. Our main website & client area were knocked offline alongside a client-side outage, meaning clients could not reach our support team other than via social media.

We are a small team; in times of severe outages, we pull together in order to restore services as quickly as possible. Unfortunately, this was at the expense of client communication. Many clients were left in the dark as to what had happened and what we were doing about it; it took us a little while to get updates distributed via social media, and this isn't acceptable.

So, what'll happen in the future?

We have made the decision to migrate our billing system entirely away from our core infrastructure, similarly to how our Status page operates. In the unlikely event of an outage, clients need to retain the ability to contact our support desk at all times; migrating our billing system to an alternate location in the cloud will give us guaranteed access to our support & billing systems, even in the event of an outage.

We are now formulating a disaster response plan, which will be automatically triggered in the event of a major outage. This consists of a strict process to be followed, including contacting affected clients via email/SMS, posting regular updates via social media at set intervals, and opening further channels of contact.
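The response plan above could be tracked as a simple ordered runbook; the step names and ordering below are illustrative assumptions, not the finalised plan:

```python
# Hypothetical outline of the disaster response steps described above.
RUNBOOK = [
    "Contact affected clients via email/SMS",
    "Post an initial update via social media",
    "Repeat social media updates at set intervals",
    "Open further channels of contact",
]

def next_step(completed: int):
    """Return the next runbook step, or None once all steps are done."""
    return RUNBOOK[completed] if completed < len(RUNBOOK) else None
```

Encoding the process this way makes it auditable: during an incident, anyone on the team can see which step is due next rather than improvising.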

Work to strengthen our internal infrastructure continues, and we are working on ensuring that our HTTPS URLs remain fully redundant in the event of a core network outage.


If you liked reading this, take a look at what else we've written!