Last night and today we had a significant outage, with intermittent service for a number of hours. This post is to explain what happened; hopefully it will put people's fears to rest and give a sense of what's happening behind the scenes.

Yesterday (Sunday) we cleared out some old data from our database, which is something we've done a few times before, but not in the last year. This time, halfway through the operation, the site stopped working. A few common queries were running extremely slowly, piling up, and blocking data access for all users. Last night we figured out what the issue was, blocked those types of queries, and got the site back up, but it wasn't working for anyone who had to make one of those slow queries.

Today, we tried to slowly let more and more of the blocked users back in, which seemed to be working, but then things went sideways in a weird way. It turned out the problem was being compounded by an unrelated issue stemming from an infrastructure transition we recently made. There was a bug in the web server we use (Apache) that we had never run into before, but which has apparently bitten a lot of people, so we had to resolve that as well in order to get things up and running again.

After fixing that bug and attempting to let everyone back in, the site immediately went down again; we re-blocked people, and the site took a while to come back up. At this point, we thought, "Well, that's not gonna work," and started looking for a way to code around the problem. It turned out to be pretty simple to just avoid that slow query for those users while still keeping the site working as expected for them, so that's what we did.

And now the site is up and running, humming along. We added a few more servers too, just to make sure it was extra speedy.

Now that our platforms are up and running as usual after yesterday's outage, I thought it would be worth sharing a little more detail on what happened and why, and most importantly, how we're learning from it. This outage was triggered by the system that manages our global … When an incident is reported, a skeleton post-mortem doc is created automatically by the incident manager app. The Slack channel not only allows the team to coordinate in real time, but also serves as a long-lived record of what happened when. This is helpful for filling out the post-mortem doc and also for searching back over prior incidents when something similar happens later.

As to why we're doing infrastructure upgrades and clearing out old data in the first place: we are preparing for a large migration we're doing soon. We are sharding our database, which, for those of you who aren't programmers, means splitting the data in our one big database out across many little databases. It's a big deal, and a milestone for us that we've reached a scale where this is even needed. It's a big project, but we're pretty close to done. And when we finish, we're excited to start working on some awesome improvements to the product, which our users have been patiently waiting for.
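For the programmers who are curious what "splitting the data out across many little databases" can look like in practice, here's a minimal, hypothetical sketch of hash-based shard routing. None of this is from our actual codebase; the connection strings, the helper name, and the choice to shard by user ID are assumptions made purely for illustration.

```python
import hashlib

# Hypothetical shard connection strings -- the real databases aren't named in this post.
SHARD_DSNS = [
    "postgresql://db-shard-0/app",
    "postgresql://db-shard-1/app",
    "postgresql://db-shard-2/app",
    "postgresql://db-shard-3/app",
]

def shard_for(user_id: str) -> str:
    """Map a user ID to the same shard on every request."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

# Every query for this user now goes to one small database
# instead of the single big one.
print(shard_for("user-12345"))
```

One wrinkle worth noting: with simple modulo routing like this, adding more shards later means moving data between databases, which is part of why a migration like ours is a big project rather than a weekend job.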