Alarm Notification via Phone is experiencing an Outage
Incident Report for RACO Mfg and Eng Co
Postmortem

Report on the service-impacting incident that started during the evening of July 22, 2019

*BACKGROUND:*

Plum Voice has been a provider of cloud IVR services since 2004. Throughout that time we have prioritized the reliability of our infrastructure and software. In many of the years in the last decade we have had no appreciable downtime at all. Our full service uptime over the last decade, for a typical customer, has been better than 99.998%.
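
For a sense of scale, the uptime figure above can be converted into a downtime budget. The following is a minimal sketch, not part of any Plum Voice tooling: the function name is hypothetical, and it assumes a decade of 3652.5 days and treats 99.998% as a lower bound on uptime.

```python
def max_downtime_minutes(uptime_percent: float, days: float) -> float:
    """Return the downtime, in minutes, implied by an uptime percentage over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


# Assumption: a decade is taken as ten years of 365.25 days each.
decade_minutes = max_downtime_minutes(99.998, 3652.5)
print(f"{decade_minutes:.0f} minutes over ten years")
```

Under these assumptions, the figure works out to roughly 105 minutes of total downtime across the decade.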

That uptime record has been achieved despite the fact that carriers and infrastructure components regularly have problems, and software inevitably has bugs. We spread telecom traffic across five carriers and three data centers. We deploy redundant equipment at each level of our technology stack. Our core software has been thoroughly stress-tested in production use for over a decade. We also follow careful security policies that allow us to be certified for SOC 2, PCI, HIPAA, etc.

*INCIDENT FACTS:*

On the evening of July 22, 2019, an incident occurred that was not contained quickly. A connectivity break caused by unexplained packet loss made Plum’s primary data center completely unusable. With no quick resolution in sight, we invoked our disaster recovery plan.

We faced challenges reconfiguring systems to implement the recovery plan quickly and optimally. Reasons include:

  1. Relatively recent updates to the infrastructure, intended to create better redundancy, were not yet thoroughly documented in the disaster recovery plan, so manual failover was implemented more slowly than intended.
  2. Recent infrastructure updates, again intended to improve uptime, nonetheless contained certain oversights that were exposed in high-stress situations. This resulted in service degradation for some customers even once the failover was fully in place.

Meanwhile, the data center connectivity break at the root of the incident was traced to a vendor that had provided Plum with a decade of good service but in this case failed to meet its service level agreements and, moreover, did not acknowledge the problem quickly. The root cause was ultimately confirmed to be in the vendor’s infrastructure and was repaired around 2 PM ET on July 24.

*STATUS AND PLAN:*

Steps have been taken to avoid a repeat of the root cause or anything similar. We have also recognized some errors and outdated assumptions in our disaster recovery plan. We are updating the plan and optimizing our infrastructure configuration to prepare better for failover scenarios, and we intend to stage periodic disaster recovery drills.

Our team takes tremendous pride in our decade-plus record of reliability. We are fully committed to continuous improvement and to providing excellent service as we proceed.

Sincerely,
Plum Voice

Posted Jul 29, 2019 - 16:02 PDT

Resolved
Notification calls are now being initiated.

Call-in number issues have been resolved.
Posted Jul 25, 2019 - 12:00 PDT

Monitoring
RACO’s telephone partner is experiencing service impairment. We are receiving updates from them as they make progress in restoring full functionality. All other means of alarm notification are fully functioning. We will update the message on the login page when the situation warrants.
Posted Jul 24, 2019 - 09:04 PDT
This incident affected: Alarm Notification Services (AlarmAgent Voice System Health).