On April 7, 17, and 25, the Zingle platform experienced outages during which customers were unable to access the Zingle Inbox and conversations. Zingle engineers were immediately alerted to each outage and began working to resolve the issues.
Between April 7 and April 25, Zingle's AWS infrastructure experienced three separate outages, each caused by a Redis "master" service becoming unresponsive. The root cause of all three incidents was a chain of two failures. First, the queue workers assigned to process "webhook" queue jobs crashed, creating a large backlog of unprocessed webhook jobs, which are stored in Redis. Second, when the queue workers were restarted, a bug in a third-party library used in Zingle's queueing system caused a command to be sent to Redis that, combined with the large backlog, rendered the Redis server unresponsive.
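As a minimal sketch of this failure mode (all names here are illustrative, not Zingle's actual code): Redis executes commands on a single thread, so a command that touches the entire backlog at once, such as fetching every queued job in one call, blocks all other clients for the duration, while popping small bounded batches keeps each step cheap. The in-memory deque below stands in for a Redis list:

```python
from collections import deque

# Hypothetical stand-in for a Redis list holding the webhook queue backlog.
backlog = deque(f"webhook-job-{i}" for i in range(100_000))

def drain_all_at_once(queue):
    """Mimics a single O(N) command over the whole backlog.
    Because Redis is single-threaded, one command like this over a
    huge backlog stalls every other client until it completes."""
    jobs = list(queue)
    queue.clear()
    return jobs

def drain_in_batches(queue, batch_size=100):
    """Mimics popping small bounded batches instead: each individual
    operation stays cheap, so other clients remain responsive."""
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        yield batch

# Usage: process the backlog in small, interruptible steps.
processed = sum(len(batch) for batch in drain_in_batches(backlog))
```

This is only a model of the interaction described above; the actual command issued by the third-party library is not named in the report.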
Zingle has monitoring in place designed to watch queue job backlogs and prevent exactly this type of incident, but a configuration error in the specific monitor that observes the webhooks queue prevented the alert from being sent.
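A backlog monitor of this kind can be sketched as a simple threshold check (the function, queue names, and thresholds below are assumptions for illustration, not Zingle's actual monitoring configuration). Note how a missing threshold silently disables the alert, which is the kind of configuration error described above:

```python
def check_queue_backlog(queue_name, depth, threshold):
    """Return an alert string when a queue's backlog exceeds its
    configured threshold, or None when the queue looks healthy.
    All names and values here are illustrative."""
    if threshold is None:
        # A missing/misconfigured threshold means the check never fires --
        # the monitor exists but can never send an alert.
        return None
    if depth > threshold:
        return f"ALERT: {queue_name} backlog at {depth} (threshold {threshold})"
    return None

# A correctly configured monitor fires on a large backlog;
# a misconfigured one (no threshold set) stays silent.
ok = check_queue_backlog("webhooks", 50_000, 1_000)
broken = check_queue_backlog("webhooks", 50_000, None)
```
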
The Zingle engineering team has taken the following corrective actions to prevent this scenario from recurring:

- Corrected the configuration error in the monitor that observes the webhooks queue, so that an alert is sent whenever a large backlog of queue jobs accumulates.
- Addressed the bug in the third-party queueing library that caused the Redis-blocking command to be issued when workers were restarted against a large backlog.
We sincerely apologize for the inconvenience these outages caused, as we know Zingle is mission-critical to many teams and businesses. If you have any questions or concerns, please do not hesitate to reach out to your account manager or our Support team.