Zingle App Loading Issues
Incident Report for Zingle
Postmortem

Incident Report - Zingle App Loading Issues

Customer Impact

On April 7, 17, and 25, the Zingle platform experienced outages during which customers were unable to access the Zingle Inbox and conversations. Zingle engineers were immediately alerted to the outages and began working to resolve them.

Cause

Between April 7 and April 25, Zingle's AWS infrastructure experienced three separate outages, each caused by a Redis "master" service becoming unresponsive. The root cause of all three incidents was a chain of two failures. First, the queue workers assigned to process "webhook" queue jobs crashed, creating a large backlog of unprocessed webhook queue jobs, which are stored in Redis. Second, when the queue workers were restarted, a bug in a third-party library used in Zingle's queueing system sent a command to Redis that, combined with the large backlog of unprocessed queue jobs, caused the Redis server to become unresponsive.
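The report does not name the command involved, so the following is only a toy model of the general failure mode, under the assumption that the command's cost grew with the size of the backlog. Redis executes commands largely on a single thread, so one expensive command delays every command queued behind it. The names `Command`, `completion_times`, `scan_backlog`, and `fetch_inbox`, and the per-item cost, are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Command:
    name: str
    items_touched: int  # hypothetical: number of stored items the command walks


def completion_times(commands, cost_per_item_us=1.0):
    """Return (name, finish_time_us) for commands run strictly one after
    another, as on a single command-processing thread."""
    clock = 0.0
    finished = []
    for cmd in commands:
        # Each command occupies the thread for a time proportional to the
        # number of items it touches; nothing else runs in the meantime.
        clock += cmd.items_touched * cost_per_item_us
        finished.append((cmd.name, clock))
    return finished


# The same two commands against a small backlog and a large one.
small = completion_times([Command("scan_backlog", 100), Command("fetch_inbox", 1)])
large = completion_times([Command("scan_backlog", 5_000_000), Command("fetch_inbox", 1)])
```

In this model the cheap `fetch_inbox` command finishes almost immediately when the backlog is small, but must wait for the entire backlog scan when the backlog is large, which is how an otherwise harmless command can make the server appear unresponsive to every other client.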

Resolution Steps

Zingle has monitors in place that track queue job backlogs to prevent exactly this type of incident, but a configuration error in the specific monitor that observes the webhook queue prevented its alert from being sent.

The Zingle engineering team has taken the following corrective actions to prevent this scenario from recurring:

  1. Storage of information about the webhook queue jobs has been moved from Redis to a different queueing system that is more scalable and that no other part of the Zingle system depends on. This ensures that even if this system fails, the rest of the Zingle platform will continue to function normally.
  2. The misconfiguration of the queue backlog monitor has been corrected.
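As an illustration of the second corrective action, here is a minimal sketch of a queue backlog monitor, not Zingle's actual monitoring setup. It assumes the monitor is given a function that returns a queue's depth (or `None` when the queue cannot be found) and a function that sends an alert. The class name, the threshold of 10,000 jobs, the queue depth, and the mistyped queue name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class BacklogMonitor:
    queue_name: str
    max_backlog: int                             # hypothetical alert threshold
    get_backlog: Callable[[str], Optional[int]]  # returns None if queue is unknown
    send_alert: Callable[[str], None]

    def check(self) -> bool:
        """Run one check and return True if an alert was sent."""
        depth = self.get_backlog(self.queue_name)
        if depth is None:
            # A monitor that cannot read its metric alerts instead of
            # staying silent, so a misconfiguration surfaces quickly.
            self.send_alert(f"{self.queue_name}: backlog metric unavailable")
            return True
        if depth > self.max_backlog:
            self.send_alert(
                f"{self.queue_name}: backlog {depth} exceeds {self.max_backlog}"
            )
            return True
        return False


# Hypothetical data: one real queue with a large backlog.
alerts = []
depths = {"webhook": 250_000}

# Correctly configured monitor: alerts on the oversized backlog.
fired = BacklogMonitor("webhook", 10_000, depths.get, alerts.append).check()

# Monitor pointed at a mistyped queue name: alerts about the missing metric.
fired_misconfigured = BacklogMonitor("webhooks", 10_000, depths.get, alerts.append).check()
```

The design choice worth noting is the `None` branch: treating a missing metric as an alert condition means a monitor watching the wrong queue reports a problem rather than reporting nothing.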

We sincerely apologize for the inconvenience these outages caused, as we know Zingle is mission-critical to many teams and businesses. If you have any questions or concerns, please do not hesitate to reach out to your account manager or our Support team.

Posted Apr 29, 2022 - 08:20 PDT

Resolved
This incident has been resolved.
Posted Apr 26, 2022 - 06:14 PDT
Monitoring
We have identified the source of this issue and implemented the necessary steps to correct it.

This issue should now be resolved and the Zingle app recovered. Users may need to refresh the tab in which the app is loaded to see the recovery.

We will continue to monitor this issue. Thanks for your patience.
Posted Apr 25, 2022 - 15:04 PDT
Identified
We are aware of issues impacting the Zingle app and are investigating. At present, the app may fail to load or may display a 404 error. Users may also experience slow loading. We expect to have an update shortly.

You may follow updates made here - https://status.zingle.me/
Posted Apr 25, 2022 - 14:54 PDT
This incident affected: [Oregon Datacenter] Web Application (Web Inbox).