The time I brought down production

Background

This website is built with NextJS, hosted on Vercel, and uses Cloudflare for DNS (Domain Name System). The abbreviated fetch trace looks like this:

Initial website fetch request
Cloudflare understands and forwards the request to Vercel
Vercel understands and forwards the request to the NextJS server

Preface

SEO (Search Engine Optimization) is crucial for a website because it makes the site easier to find. SEO can make or break a website: with no SEO, the site is harder to find than a needle in a haystack, and with poor SEO, it reaches the wrong audience, which renders the site useless.

In short, SEO is essential for this website, and thus, I embarked on improving the website's SEO.

The Incident

11:46 pm:

While I was improving this website's SEO, I ran it through some SEO-checking websites. The generated reports showed that some internal links were not redirecting correctly. Mainly, internal links pointing to https://ngjx.org were being redirected to https://www.ngjx.org, causing a hostname mismatch.

Okay, no big deal, I just have to update the Cloudflare redirect rules to redirect all www.ngjx.org traffic to ngjx.org.

11:55 pm: The site goes down.

The Investigation

12:14 am: I realize the website is down and start investigating.

The site is unreachable and all requests are timing out. What is going on? The Vercel deployment is still online, and the preview builds of production are still accessible. However, something is amiss: the logs do not show any requests timing out. Perhaps it is an issue with the transport layer of the OSI model?

Checking the network log in my browser reveals the issue: an infinite loop! When visiting the website, users are redirected to ngjx.org, then to www.ngjx.org, over and over, before the request times out. Why is this happening?

In an epic blunder, I still had Vercel redirecting all ngjx.org traffic to www.ngjx.org. With the new Cloudflare rule sending www.ngjx.org back to ngjx.org, the two redirects fed each other, causing an infinite loop of redirection mayhem.
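The loop is easy to reproduce on paper. The sketch below is only an illustration, not the actual Cloudflare or Vercel configuration: it assumes each redirect can be modelled as a plain host-to-host entry, and the names `RedirectRules`, `RedirectTrace` and `followRedirects` are made up for this example. It follows redirects from a starting host until one repeats.

```typescript
// Hypothetical model: each key is a host that redirects to its value.
type RedirectRules = Record<string, string>;

interface RedirectTrace {
  chain: string[]; // hosts visited, in order
  loops: boolean;  // true once a host shows up a second time
}

function followRedirects(start: string, rules: RedirectRules): RedirectTrace {
  const chain: string[] = [start];
  let current = start;

  // Terminates: each pass either stops or adds a host not yet in the chain,
  // and the rule set is finite.
  while (Object.prototype.hasOwnProperty.call(rules, current)) {
    const next = rules[current];
    const alreadyVisited = chain.indexOf(next) !== -1;
    chain.push(next);
    if (alreadyVisited) {
      return { chain, loops: true };
    }
    current = next;
  }
  return { chain, loops: false };
}

// The two rules from the incident, merged into one map for illustration.
const incidentRules: RedirectRules = {
  "www.ngjx.org": "ngjx.org", // the new Cloudflare rule
  "ngjx.org": "www.ngjx.org", // the Vercel redirect that was still in place
};

const trace = followRedirects("ngjx.org", incidentRules);
console.log(trace.chain.join(" -> "), "| loops:", trace.loops);
// prints: ngjx.org -> www.ngjx.org -> ngjx.org | loops: true
```

Neither rule is wrong on its own. The loop only appears once both are in play, which is why it did not show up when looking at either dashboard in isolation.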

What a predicament!

12:24 am: Production is rolled back.

12:25 am: The site is back up.

Takeaways

This 30-minute outage was avoidable. I should have reconfirmed routing rules before pushing to production.
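One way to reconfirm routing rules is to write down the redirects of every layer in one place and check them together before changing any of them. The following is a rough sketch under stated assumptions, not something either platform provides: the `Layer` shape, `nextHop` and `findRedirectLoops` are hypothetical, the rules have to be copied in by hand from each dashboard, and only host-level redirects are modelled.

```typescript
// Hypothetical shape for one routing layer's host-level redirects.
interface Layer {
  name: string;                      // e.g. "Cloudflare" or "Vercel"
  redirects: Record<string, string>; // from host -> to host
}

interface Hop {
  from: string;
  to: string;
  layer: string;
}

// Find the first layer, in request order, that redirects this host.
function nextHop(host: string, layers: Layer[]): Hop | undefined {
  for (const layer of layers) {
    if (Object.prototype.hasOwnProperty.call(layer.redirects, host)) {
      return { from: host, to: layer.redirects[host], layer: layer.name };
    }
  }
  return undefined;
}

// Follow redirects from every host that has a rule and describe any loop found.
function findRedirectLoops(layers: Layer[]): string[] {
  const problems: string[] = [];
  for (const layer of layers) {
    for (const start of Object.keys(layer.redirects)) {
      const visited: string[] = [start];
      const hops: Hop[] = [];
      let hop = nextHop(start, layers);
      while (hop !== undefined) {
        hops.push(hop);
        if (visited.indexOf(hop.to) !== -1) {
          // A host repeated: record every hop and the layer that caused it.
          problems.push(hops.map((h) => `${h.from} -> ${h.to} (${h.layer})`).join(", "));
          break;
        }
        visited.push(hop.to);
        hop = nextHop(hop.to, layers);
      }
    }
  }
  return problems;
}

// Layers listed in the order a request passes through them.
const loops = findRedirectLoops([
  { name: "Cloudflare", redirects: { "www.ngjx.org": "ngjx.org" } },
  { name: "Vercel", redirects: { "ngjx.org": "www.ngjx.org" } },
]);
for (const loop of loops) {
  console.log("Redirect loop: " + loop);
}
```

With the rules from this incident, the check reports the loop once per starting host and names the layer behind each hop, which points straight at the conflicting Cloudflare and Vercel settings.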


Alex Ng - Blog
