Why Twitter Didn’t Go Down: From a Real Twitter SRE

Summary

This article explains how Twitter's caching system works and why the site has stayed up despite the mass departure of engineers from the company. The author describes how the system is automated and how it is designed to handle failures.

Q&As

Why didn't Twitter go down when they supposedly lost around 80% of their workforce?
Twitter stayed up because a lot of automation and monitoring was already in place.

How did the remaining Twitter employees keep the website running?
The website kept running largely because of tools that had been designed and implemented to keep the caches running automatically.

What is a cache and what are its purposes?
A cache is a server that stores frequently requested or expensive-to-compute responses so that they can be served in milliseconds instead of being recomputed each time.
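
To make the idea concrete, here is a minimal sketch of a cache that stores a response the first time it is computed and serves repeats from memory. It is only an illustration, not Twitter's implementation: the `TTLCache` class, its `get_or_compute` method, and the expiry (time-to-live) behavior are all assumptions introduced for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Toy in-memory cache (hypothetical): serves stored responses until they expire."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling compute() only on a miss."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss or expired entry: do the expensive work once and remember the result.
        self.misses += 1
        value = compute()
        self._store[key] = (now + self._ttl, value)
        return value
```

A caller would wrap the expensive operation, for example `cache.get_or_compute("timeline:alice", build_timeline)`, where `build_timeline` is a hypothetical slow function. Only the first call within the expiry window pays the full cost.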

What is Mesos and how does it work with Aurora?
Mesos is software that aggregates servers together so that Aurora can find servers for applications to run on.
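
The division of labor described above can be sketched as a toy model. This is not the Mesos or Aurora API: `Server`, `ServerPool`, and `place` are hypothetical names, and the "most free slots" placement rule is an assumption chosen only to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Server:
    name: str
    free_slots: int  # how many more application instances this server can host


@dataclass
class ServerPool:
    """Hypothetical stand-in for the aggregation role the summary assigns to Mesos."""
    servers: Dict[str, Server] = field(default_factory=dict)

    def register(self, server: Server) -> None:
        self.servers[server.name] = server

    def available(self) -> List[Server]:
        return [s for s in self.servers.values() if s.free_slots > 0]


def place(pool: ServerPool) -> Optional[str]:
    """Hypothetical stand-in for the role assigned to Aurora: find a server to run on.

    Picks the server with the most free slots (an assumed policy) and
    returns its name, or None when the pool has no capacity left.
    """
    candidates = pool.available()
    if not candidates:
        return None
    chosen = max(candidates, key=lambda s: s.free_slots)
    chosen.free_slots -= 1
    return chosen.name
```

The point of the split is that the placement logic never needs to know where servers come from; it only asks the pool what is available.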

What are some of the issues that the Twitter team faced and how were they fixed?
The team faced bugs where new cache servers wouldn't be added back, or where it took up to 10 minutes to add a server back. They fixed these by developing a culture in which engineers could go and fix such issues while keeping projects on track.

AI Comments

👍 This is a great article that explains how Twitter has been able to stay up and running despite losing a large number of engineers.

👎 This article is a bunch of technical mumbo jumbo that doesn't explain anything.

AI Discussion

Me: The article is about how Twitter didn't go down, even though they supposedly lost around 80% of their workforce.

Friend: Wow, that's really interesting! I would have thought that with such a high turnover rate, the site would have been in trouble.

Me: Yeah, I know. It's amazing that it's still running smoothly. I guess it just goes to show how well the team was prepared and how much they automated things.

Friend: Yeah, that makes sense. It would have been a lot harder to keep things running if they had to do everything manually.

Me: Exactly.

Action items

Learn more about Aurora and Mesos.
Implement automatic monitoring and repair tasks for servers.
Plan for capacity needs and extra headroom.
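
The last two action items can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of Twitter's tooling: the function names `servers_needed` and `repair`, the request-rate figures, and the idea of a spare-server list are all hypothetical.

```python
import math
from typing import List, Set, Tuple


def servers_needed(peak_rps: float, rps_per_server: float, headroom: float) -> int:
    """Servers required to serve peak load plus fractional headroom (0.25 = 25% extra)."""
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    return math.ceil(peak_rps * (1 + headroom) / rps_per_server)


def repair(active: List[str], unhealthy: Set[str],
           spares: List[str]) -> Tuple[List[str], List[str]]:
    """Drop unhealthy servers and backfill from spares.

    Returns (new active list, remaining spares). If spares run out,
    the active list simply ends up shorter than before.
    """
    remaining = list(spares)
    healthy = [s for s in active if s not in unhealthy]
    missing = len(active) - len(healthy)
    while missing > 0 and remaining:
        healthy.append(remaining.pop(0))
        missing -= 1
    return healthy, remaining
```

For example, with an assumed peak of 100,000 requests per second, 5,000 requests per second per server, and 25% headroom, `servers_needed(100000, 5000, 0.25)` returns 25. A monitoring job could then call `repair` on each health-check pass so that failed servers are replaced without manual intervention.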

Technical terms

Site Reliability Engineer (SRE): responsible for automation, reliability and operations for a website or application
Cache: a storage location for frequently accessed or expensive-to-compute data that can be retrieved quickly
Aurora: a tool that finds servers for applications to run on
Mesos: a tool that aggregates servers together
Rack: a structure that holds servers in a data center
Switch: a device that connects servers on racks
Data center: a facility used to house computer systems and associated components, such as telecommunications and storage systems