Report on the service-impacting incident that started during the evening of July 22, 2019
Plum Voice has been a provider of cloud IVR services since 2004. Throughout that time we have prioritized reliability for our infrastructure and software. In many years of the last decade we have had no appreciable downtime at all. Our full service uptime over the last decade, for a typical customer, has been better than 99.998%.
That uptime record has been achieved despite the fact that carriers and infrastructure components regularly experience problems, and software inevitably has bugs. We spread telecom traffic across five carriers and three data centers. We deploy redundant equipment at each level of our technology stack. Our core software has been thoroughly stress-tested in production use for over a decade. We also follow careful security policies that allow us to be certified for SOC 2, PCI, HIPAA, etc.
On the evening of July 22, 2019, an incident occurred that was not contained quickly. Plum’s primary data center was made completely unusable by a connectivity break caused by a mysterious packet loss issue. With no quick resolution in sight, our disaster recovery plan was invoked.
We faced challenges reconfiguring systems to implement the recovery plan quickly and optimally. Reasons include:
Meanwhile, the root cause of the data center connectivity break was traced to a vendor that had provided Plum with a decade of good service but, in this case, failed to meet its service level agreements and did not acknowledge the problem quickly. The root cause was ultimately confirmed to be in the vendor's infrastructure, and it was repaired around 2 PM ET on July 24.
STATUS AND PLAN:
Steps have been taken to avoid a repeat of the root cause, or anything similar. We have also recognized some errors and outdated assumptions in our disaster recovery plan. We are updating the disaster recovery plan, optimizing our infrastructure configuration to prepare better for failover scenarios, and intend to stage periodic disaster recovery drills.
Our team takes tremendous pride in our decade-plus record of reliability. We are fully committed to continuous improvement and to providing excellent service as we proceed.