Slack outage caused by AWS gateway failure

An overloaded gateway at cloud provider AWS set off a series of events that downed the messaging service for five hours. The outage began about 9 a.m. EST, with customers immediately experiencing occasional errors. By 10 a.m., the service was unusable for all subscribers.

According to TechTarget, Slack released a root cause analysis report to the media this week, detailing how AWS problems set off a domino effect that left the service inaccessible. Slack relies entirely on AWS for its cloud hosting.

Slack declined to discuss the problems related to the AWS Transit Gateway. However, a source familiar with the matter confirmed that the gateway failed to scale up fast enough to handle the incoming traffic.

At the same time, Slack experienced network problems between its back-end servers, other service hosts and its database servers. The troubles resulted in the back-end servers handling too many high-latency requests.

Slack had a reserve of backup servers ready to go, but it discovered problems with the provisioning service it used to spin up and verify those servers. That service was not designed to bring Slack up on more than 1,000 servers in a short period of time.

As a result, Slack did not have enough servers to meet its capacity needs, and customers received error messages or could not load Slack at all.