Design Decisions for Scaling Your High Traffic Feeds

Summary

This article discusses the design decisions that need to be made when building a feed system that can handle high traffic. It explains the different approaches, such as denormalizing or normalizing data, selective fanout based on producer or consumer, setting priorities, and choosing between Redis and Cassandra for storage. It also covers the history of the feed system at Fashiolista, which has gone through three major redesigns. The article draws on research by Yahoo, Twitter, Instagram, and other companies, and introduces the open-source Feedly Python module, which can be used to quickly build a feed system.

Q&As

What is Cloud Computing?
Cloud Computing is a type of computing that relies on shared computing resources rather than having local servers or personal devices to handle applications.

What are the different approaches for scaling feed systems?
The approaches for scaling feed systems are pull, where the feed is gathered during reads, and push, where all the feeds are precomputed during writes. Most real-life applications use a combination of these two approaches.
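Neither approach is shown as code in the article, so the sketch below is only an illustration of the trade-off; the in-memory dictionaries and function names are assumptions, not the Feedly API. Push pays at write time, pull pays at read time.

from collections import defaultdict

activities = defaultdict(list)  # producer -> activities they created (dicts with a "time" field)
followers = defaultdict(set)    # producer -> ids of their followers
feeds = defaultdict(list)       # consumer -> precomputed feed (push model)

def publish_push(producer, activity):
    # Push: fan the activity out to every follower's feed at write time.
    activities[producer].append(activity)
    for consumer in followers[producer]:
        feeds[consumer].append(activity)  # O(followers) writes, O(1) reads

def read_feed_pull(consumer, following):
    # Pull: gather and merge the followed users' activities at read time.
    merged = [a for producer in following for a in activities[producer]]
    return sorted(merged, key=lambda a: a["time"], reverse=True)

In a combined system, most producers would be handled by publish_push, while the few very popular ones would be left to read_feed_pull.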

What are the design decisions to consider when building a feed system?
The design decisions to consider when building a feed system include denormalizing vs normalizing, selective fanout based on producer or consumer, using priorities, and choosing between Redis and Cassandra.

What are the advantages and limitations of Redis and Cassandra?
The advantages of Redis are that it is easy to set up and maintain, and its memory usage is low. Its limitations are that all data needs to be stored in RAM and that there is no support for sharding built into Redis. The advantages of Cassandra are that it has plenty of storage space and is commercially supported by DataStax. Its limitations are that it is quite hard to use if you normalize your data, and that the Cassandra Python ecosystem is still changing rapidly.
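The article itself does not include storage code; as a minimal sketch of the Redis side, a per-user sorted set scored by timestamp is a common layout. The key format, trim size, and use of the redis-py client below are assumptions for illustration.

import time
import redis  # assumes the redis-py client is installed

r = redis.Redis()
FEED_MAX = 1000  # cap each feed so everything fits in RAM

def push_to_feeds(activity_id, follower_ids):
    # Write the activity id into each follower's sorted set, scored by time.
    now = time.time()
    pipe = r.pipeline()
    for uid in follower_ids:
        key = f"feed:{uid}"
        pipe.zadd(key, {activity_id: now})
        pipe.zremrangebyrank(key, 0, -(FEED_MAX + 1))  # drop oldest beyond the cap
    pipe.execute()

def read_feed(uid, count=25):
    # Newest activity ids first.
    return r.zrevrange(f"feed:{uid}", 0, count - 1)

Because sharding is not built into Redis, keys like feed:<uid> would be spread over instances at the application level, for example with the consistent hashing described under Technical terms below.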

What is the Graphity algorithm and how does it work?
The Graphity algorithm is a graph-database-backed feed algorithm developed by Rene Pickhardt. It achieves extremely high throughput with no data duplication, relying on the graph database to assemble feeds at read time via an n-way merge ("pull").
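Graphity runs against a graph database, so the snippet below is not Pickhardt's implementation; it is only a sketch of the underlying "n-way merge on read" idea, with invented per-producer streams sorted newest-first.

import heapq
from itertools import islice

# Hypothetical streams of (timestamp, activity id), newest first
streams = {
    "alice": [(105, "a3"), (101, "a1")],
    "bob":   [(104, "b2"), (100, "b1")],
    "carol": [(103, "c1")],
}

def read_feed(following, count=25):
    # n-way merge of the followed streams, preserving newest-first order.
    merged = heapq.merge(
        *(streams[p] for p in following),
        key=lambda item: item[0],
        reverse=True,  # each input stream is sorted descending by timestamp
    )
    return [activity for _, activity in islice(merged, count)]

print(read_feed(["alice", "bob", "carol"]))  # ['a3', 'b2', 'c1', 'a1', 'b1']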

AI Comments

👍 Awesome article! I love the detailed explanation of the different design decisions for scaling your high traffic feeds. Thanks for the pin-worthy post!

👎 Was PostgreSQL replication considered as an option? It would have been nice to see that option explored.

AI Discussion

Me: The article talks about the design decisions for scaling high-traffic feeds. It goes through the different approaches, such as denormalizing versus normalizing, selective fanout based on producer and consumer, the use of priorities, and using Redis versus Cassandra.

Friend: That's really interesting. What are the implications of these design decisions?

Me: Well, it's important to consider the size of your data set, the type of data storage you're using, and potential spikes in traffic when making these decisions. For example, if you have a large data set, you may want to denormalize your data, while if you're using Redis, you may want to keep your data normalized to conserve memory. If you're expecting high traffic, you may want to use a selective fanout approach or use priorities to reduce the impact of high-profile users. Additionally, you may want to consider moving to Cassandra if you're outgrowing Redis.
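As a rough sketch of that last point (the thresholds and names here are invented for illustration, not taken from the article): fanout can be made selective per producer, and very large fanouts can be pushed to a lower priority.

CELEBRITY_THRESHOLD = 100_000   # assumed cutoff: switch this producer to pull
LOW_PRIORITY_THRESHOLD = 5_000  # assumed cutoff: deprioritize big fanouts

def plan_fanout(follower_count):
    # Decide, per producer, how an activity should reach its consumers.
    if follower_count >= CELEBRITY_THRESHOLD:
        return ("pull", None)  # followers merge this stream at read time
    priority = "low" if follower_count >= LOW_PRIORITY_THRESHOLD else "high"
    return ("push", priority)  # enqueue a fanout task at that priority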

Technical terms

Consistent Hashing Algorithm
A consistent hashing algorithm is a type of hash algorithm that is used to assign data to different nodes in a distributed system. It is designed to minimize the amount of data that needs to be moved when nodes are added or removed from the system.
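As a minimal sketch (node names and replica count are assumptions), a ring like this is how feed keys might be spread over Redis instances at the application level:

import bisect
import hashlib

class ConsistentHashRing:
    # Each node appears at many points on the ring to smooth the distribution.
    def __init__(self, nodes, replicas=100):
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # First node clockwise from the key's position on the ring.
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["redis-1", "redis-2", "redis-3"])
print(ring.get_node("feed:42"))  # the same key always maps to the same node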
Cloud Computing
Cloud computing is a type of computing that relies on shared computing resources rather than having local servers or personal devices to handle applications. It is a model for enabling ubiquitous, on-demand access to a shared pool of configurable computing resources.
Fanout
Fanout is the process of pushing an activity to all of a user's followers. It is used in feed systems to ensure that every follower sees any activity the user performs.
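In practice the fanout loop is usually batched so a single write never blocks on millions of feed updates; the batch size and in-memory feed store below are assumptions for illustration.

feeds = {}  # uid -> list of activity ids, a stand-in for real feed storage

def deliver(activity_id, batch):
    for uid in batch:
        feeds.setdefault(uid, []).append(activity_id)

def fanout(activity_id, follower_ids, batch_size=1000):
    # Each batch would typically become its own queued task.
    for i in range(0, len(follower_ids), batch_size):
        deliver(activity_id, follower_ids[i:i + batch_size])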
Denormalize
Denormalization is the process of taking data from a normalized form and restructuring it so that redundant copies are stored where they are read. This is often done to improve the performance of a database by reducing the number of joins required to retrieve data.
Normalize
Normalization is the process of organizing data so that each piece of information is stored only once, with records referencing each other by id. This reduces data redundancy, at the cost of extra joins or lookups on reads.
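A short sketch contrasting the two for feed storage (the field names are invented): a normalized feed stores only activity ids and looks the rest up, which conserves Redis memory, while a denormalized feed stores a full copy per follower, which suits Cassandra's plentiful storage.

import json

# Normalized: the feed keeps only ids; each activity is stored once.
activities_by_id = {"a1": {"actor": "alice", "verb": "love", "object": "item:9"}}
feed_normalized = ["a1"]

def read_normalized(feed):
    return [activities_by_id[aid] for aid in feed]  # extra lookup per item

# Denormalized: the feed keeps a full serialized copy of each activity.
feed_denormalized = [json.dumps({"actor": "alice", "verb": "love", "object": "item:9"})]

def read_denormalized(feed):
    return [json.loads(blob) for blob in feed]  # one fetch, no lookups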
Priorities
Priorities are used to determine the order in which tasks are executed. Tasks with higher priority are executed first, while tasks with lower priority are executed later. This can be used to ensure that important tasks are completed before less important ones.
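A minimal in-process sketch of that ordering rule (a real feed system would put these priorities on a distributed task queue; the heap below is just an illustration):

import heapq
from itertools import count

task_queue = []  # min-heap of (priority, seq, task); lower number runs first
_seq = count()   # tie-breaker so equal priorities run in FIFO order

def enqueue(priority, task):
    heapq.heappush(task_queue, (priority, next(_seq), task))

def run_next():
    priority, _, task = heapq.heappop(task_queue)
    return task

enqueue(0, "fanout for an ordinary user")     # high priority: runs first
enqueue(9, "fanout for a high-profile user")  # low priority: huge follower list
print(run_next())  # -> 'fanout for an ordinary user'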

Similar articles

0.9999988 Design Decisions for Scaling Your High Traffic Feeds

0.8438682 Blue Sky: Can Twitter be owned by its users?

0.83474827 How a startup loses its spark

0.8323337 Intro to Kubernetes – Containers at Scale

0.83160615 A New Kind of Startup is Coming
