This is a tale about how I ended up working through Christmas, mostly as a result of bad luck.
I was working at a large tech company on a new product that had been available for less than a year. Since this was the first Christmas for this product, the company wasn’t sure what to expect in terms of demand or the cost of failure. I had been working on the product for almost 2 years. There were 3 other people on my team: the most senior member was out on parental leave, the next was out on vacation, and the last had been on the team for 2 months out of college. To be clear, I was on my own for the first Christmas period of a reportedly hot new product.

What made matters even worse was how the product generated load. People could buy the product at any time, but it only generated traffic when it was registered to a new user, and we could not predict when that would happen. We suspected it would happen for several million customers between 6 am and 3 pm on Christmas day in my timezone, which was projected to bring in three to ten times our existing load.
Week Before Christmas
The above circumstances should have been a sign that risk needed to be mitigated. Unfortunately for me, this didn’t go quite as well as it could have. Why:
- The range of projected traffic was two orders of magnitude wide: we could have gotten 10 requests per second or 1,000. Those were the best estimates we had.
- When running scaling tests, we could generate a specific number of requests but had no way of knowing what data should be in them. Should it be 10 requests from 10 different customers, or 10 from the same customer? It turned out that distinction made a massive difference to how my services scaled.
- The company created tier-1 SWAT teams to resolve issues across the product before paging the product-specific engineers. These people had generic sets of instructions for common problems. What those instructions didn’t include was how to manage scaling for the database we were using. As it turned out, the prescribed action was exactly the opposite of what should have been done.
- Finally, the company had no real idea of the impact of failure. We just assumed all failure was bad.
On a personal note, I had planned to go to a friend’s house just for Christmas dinner. I was away from family and thought I could spare 3 hours for dinner, since I had been assured nothing should go wrong. I even thought it would be an uneventful Christmas where I could relax, eat, and be happy. I was wrong.
Day of Christmas
7 am: Paged awake and told to get on a conference call immediately.
NOTE: Once in a conference call or chat, you need to be constantly available and are asked for updates every 5 min regardless of how boring, tense, or upsetting it is.
7:10 am: I join a chat room with my manager, my skip manager, the SWAT team engineer, the SWAT team engineer’s manager, and my director. That’s right: 2 engineers and 4 managers.
7:30 am: Figured out the problem. For those who are technical: we had a SQL database with a limited number of connections, and all of the connections were timing out due to long-running queries. That meant every request was taking a very long time and then failing. This is one of the reasons people don’t like using SQL databases in large-scale systems. The only way to recover is to stop traffic to the database until the connection pool is no longer thrashing, then gradually increase traffic until you start seeing connection timeouts again, then reduce traffic and keep it throttled to just below the point where timeouts start. Normally, you determine this safe rate of traffic before a massive event so you don’t have to find it on the fly.
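For the curious, the recovery procedure is mechanical enough to sketch. Here’s a minimal version in Python; `throttle_traffic` and `connection_timeouts_observed` are hypothetical hooks, and all the numbers are assumptions, because in reality this was dashboards and manual scaling actions rather than a tidy control loop:

```python
import time

# Everything here is a stand-in; numbers are assumed, not the real system's.
STEP_RPS = 50          # how much to raise the throttle each round
SETTLE_SECONDS = 600   # time for the connection pool to settle after a change

def throttle_traffic(rps_limit: int) -> None:
    """Hypothetical hook: set an admission-control limit in front of the service."""

def connection_timeouts_observed() -> bool:
    """Hypothetical hook: in reality, read off database and service metrics."""
    return False

def recover(max_rps: int) -> int:
    # 1. Stop traffic entirely so the connection pool stops thrashing.
    throttle_traffic(0)
    time.sleep(SETTLE_SECONDS)

    # 2. Raise the limit step by step until connection timeouts reappear.
    limit = 0
    while limit < max_rps:
        candidate = min(limit + STEP_RPS, max_rps)
        throttle_traffic(candidate)
        time.sleep(SETTLE_SECONDS)
        if connection_timeouts_observed():
            break
        limit = candidate

    # 3. Fall back to the last level that showed no timeouts and hold there.
    throttle_traffic(limit)
    return limit
```

The key part is going all the way to zero first: a thrashing connection pool never recovers while requests keep piling on.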
7:35 am: I go to the chat window to explain my findings.
Chat with Manager, Skip Manager, Director, SWAT Engineer, SWAT Manager
Me: Hey, I found the issue.
SWAT Eng: Yes, me too. I’m scaling up the service.
Me: NO DONT
Me: DO NOT SCALE UP
Me: It will make it worse!
SWAT Eng: I can’t cancel the scale request. But what’s wrong? It’s running out of connections. That means we should scale up.
Me: No, the database is running out of connections, not the hosts. Scaling up the hosts is effectively DDOSing the database. You need to scale down.
Skip Manager: Can you explain why?
Me: I can but later. We need to scale down.
Director: Just scale down.
7:40 am: Finally take my face off my dining room table, where it slammed after seeing them scale up. I next described what was happening to the service and how to recover. The SWAT engineer and manager left the room since I was now on point to handle this.
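For those wondering what the explanation was: it’s arithmetic. Each application host opens its own pool of connections to the single database, so total connection demand is hosts × pool size against a fixed cap on the database side. The numbers below are invented, but the shape of the problem is right:

```python
# Invented numbers for illustration; not the real system's figures.
pool_size_per_host = 20   # connections each application host opens to the DB
db_max_connections = 500  # hard cap on the database side

hosts_before = 20
hosts_after = 60          # what the runbook's "scale up" produced

print(hosts_before * pool_size_per_host, "<=", db_max_connections)  # 400 fits
print(hosts_after * pool_size_per_host, ">", db_max_connections)    # 1200 does not
```

The hosts were never the bottleneck; the database was. Adding hosts just multiplied the stampede against it, which is why scaling down was the right call.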
8 am – 1 pm: Repeatedly scaling down and up to find the safest traffic level. Each scaling action took 30 min, plus an additional 10 min of monitoring. Throughout, I needed to give detailed reports to the managers in the room.
1 pm: We reached a stable state where the services were running at maximum capacity but still couldn’t keep up with the load. Skip Manager correctly identified that improving database performance could increase throughput. He started taking a snapshot of the database to create a read replica. A week later, that snapshot still hadn’t completed.
3 pm: I hit the point of emotional breakdown after not eating, washing, or sleeping, and realizing I couldn’t go to Christmas dinner. I may have curled up in a ball and cried for a few minutes.
3:15 pm: The managers and I worked on identifying database optimizations and attempting them. We finally gave up after realizing we couldn’t make any changes to a live database due to resource contention.
4 pm: Inhaled some instant ramen. This was the first food that day.
6 pm to 7 pm: Casual chatting about our favorite old video games and movies while waiting for SQL EXPLAIN output and long-running query analysis.
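If you haven’t done this kind of analysis before, it looks roughly like the sketch below. I’m assuming PostgreSQL via psycopg2 here, and the connection string, table, and query are made up; treat it as an illustration of the technique, not a record of what we actually ran:

```python
# Sketch of slow-query analysis, assuming PostgreSQL via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=product host=db.example.internal")
with conn.cursor() as cur:
    # EXPLAIN ANALYZE runs the query and reports whether it used an index
    # or fell back to a full table scan (a common cause of long-running queries).
    cur.execute(
        "EXPLAIN ANALYZE SELECT * FROM registrations WHERE customer_id = %s",
        (42,),
    )
    for (line,) in cur.fetchall():
        print(line)

    # pg_stat_activity lists the queries running right now and for how long.
    cur.execute("""
        SELECT pid, now() - query_start AS runtime, query
        FROM pg_stat_activity
        WHERE state = 'active'
        ORDER BY runtime DESC
    """)
    for pid, runtime, query in cur.fetchall():
        print(pid, runtime, query[:80])
```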
7 pm – 8 pm: We admit that we cannot scale further and that no amount of SQL work can be done without a read replica. Additionally, I started making another snapshot, in hopes that having two taken at different times would minimize data loss.
8 pm – 8:30 pm: Excuse myself for 30 min to shower.
8:30 pm to 9 pm: We connect with other teams to determine the impact to customers. Turns out, there was none. No one noticed or cared. Good to know I worked a 16-hour day under high stress for no reason.
9 pm to 11 pm: We created lists of things to monitor, action items for after the Christmas traffic, data we needed for recovery, and how to communicate the outage to partner teams. We agreed to come back online at 7 am the next morning to see how the snapshots were going and to check on the issue.
Day After Christmas (AKA Boxing Day)
7 am: Get up and run to my computer so I’m not paged for being late for check-in.
7:05 am – 8 am: Confirming that nothing had changed, no one cared that we had crashed and burned, and the database replicas were still not done.
8 am – 9 am: Discussion with the managers about what to do next.
9 am: Repeating previous attempts to improve indexes on the database and searching for slow queries now that traffic had decreased. Also collecting data for the reports to be filed on the next business day.
12 pm: We throw in the towel and decide to re-baseline the entire system when we get back to work after the holidays.
Much like Ebenezer, I also learned a few things through this Christmas trauma:
- Being on call sucks. Before this I was okay with being woken up at 2 am once in a while or working a few hours on weekends. No more.
- Being a manager sucks. My managers shielded me from the verbal impaling going on in the manager level meetings but I heard about it.
- I learned where my breaking point was in terms of working on something stressful for an extended period of time. Then I kept working past it. Don’t do that.
- After this, I told people to f*ck right off if they couldn’t prove their problem was causing financial loss. No way am I going to waste my personal time on something that doesn’t matter again.
- There is a large community of sleep-deprived, beer-drinking engineers commiserating over this and similar experiences all the time. Be nice to them.
- When a big operational event is planned, have a backup so the primary can shower, eat, and cry in a corner if they need to.
- While it didn’t erase the trauma, I asked to trade my Christmas and half-day of work for another week of paid time off, and I enjoyed taking it.
Two years later…
Me: I quit.
Manager: Left the organization. Still at the company.
Skip Manager: Quit.
Director: Left the organization. Still at the company.
SWAT Engineer: Quit.
SWAT Manager: Left the organization. Still at the company.
The Service That Failed: Deleted. It was rewritten 1 year later.
No one involved in this incident stayed on the product longer than 2 years after the event. Not even the service lasted.