Resiliency beyond the Classic Circuit Breaker

In 2011, Netflix was in trouble. They had dozens of internal services; if one failed, the error could break the UI.

Jul 10, 2024

They handled a load of about 20,000 requests per second, which fanned out to more than 1,000,000 requests from the API into upstream services every 10 seconds.

What happens if a call starts failing? Traditional approaches like the Circuit Breaker are effective but may not be sufficient for all scenarios.

They had no choice but to increase resiliency, and they went beyond the classic Circuit Breaker.

The idea behind this is simple: add a new step to the Circuit Breaker, the Fallback, to handle failures gracefully.

The Fallback provides an alternative response when the primary operation fails, ensuring the system remains responsive.
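To make the idea concrete, here is a minimal sketch of a Circuit Breaker with a Fallback step. It is not Netflix's implementation; the class name `CircuitBreakerWithFallback`, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreakerWithFallback:
    """Hypothetical breaker: opens after repeated failures and serves a fallback."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the timeout has passed, let a trial call through (half-open).
        return self._clock() - self._opened_at < self.reset_timeout

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # skip the failing dependency entirely
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success: close the circuit and reset the failure counter.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, the caller still gets a response from `fallback`, and the failing dependency gets time to recover.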

Now, the idea might sound simple, but the implementation can be tricky. That’s why you need to understand these three strategies:

1. Fail Fast

Instead of retrying, the system halts the operation as soon as an error happens.

It provides immediate feedback to other systems and conserves system resources.

Advantages:

Resource Conservation: Prevents wasting resources on operations destined to fail.

Quick Feedback: Other systems or components receive instant error feedback, allowing for faster issue resolution.

Disadvantages:

User Experience: Directly impacts users as operations fail instantly without a fallback.

Error Handling Complexity: Requires robust error handling to manage abrupt failures effectively.

You use this strategy when no reliable fallback mechanism is available.

Example

A request arrives with missing data or is malformed (a bad request).
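A minimal sketch of failing fast on a bad request might look like this. The `place_order` function, the `BadRequestError` type, and the field names are hypothetical; the point is only that validation happens before any expensive work.

```python
class BadRequestError(ValueError):
    """Raised immediately when a request cannot possibly succeed."""


def place_order(request: dict) -> dict:
    # Validate up front and fail fast: no retries, no downstream calls.
    required = ("user_id", "item_id", "quantity")
    missing = [field for field in required if field not in request]
    if missing:
        raise BadRequestError(f"missing required fields: {', '.join(missing)}")
    if not isinstance(request["quantity"], int) or request["quantity"] <= 0:
        raise BadRequestError("quantity must be a positive integer")
    # Only valid requests reach the expensive part of the operation.
    return {
        "status": "accepted",
        "item_id": request["item_id"],
        "quantity": request["quantity"],
    }
```

The caller gets a specific error right away, which is the "quick feedback" advantage; handling that error well is still the caller's job.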

2. Fail Silent

In case of a failure, the API would return a null value or a default value instead of propagating the failure.

This way, the failure does not propagate, and the user might only notice if the data is critical to the operation.

Advantages:

User Experience: Minimizes user impact by avoiding visible failures for non-critical operations.

System Stability: Reduces error propagation, maintaining overall system stability.

Disadvantages:

Data Integrity: Users might not realize that data is missing or has been replaced by a default, leading to potential confusion.

Debugging Difficulty: Silent failures can make it harder to identify and debug issues.

This strategy comes into play for optional or non-essential data.

Example

When recommending similar products, if the Recommendation engine fails multiple times, it is okay not to show recommendations.
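Here is a minimal sketch of Fail Silent for the recommendations case. The `get_recommendations` function and the `fetch` callable are assumptions standing in for whatever client your recommendation engine exposes. Note that the failure is hidden from the user but still logged, which softens the debugging disadvantage above.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("recommendations")


def get_recommendations(product_id: str,
                        fetch: Callable[[str], List[str]]) -> List[str]:
    """Return recommendations, or an empty list if the engine fails."""
    try:
        return fetch(product_id)
    except Exception:
        # Fail silent for the user, but not for operators: log the failure.
        logger.warning("recommendation engine failed for %s", product_id,
                       exc_info=True)
        return []  # the page renders without a recommendations section
```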

3. Custom Fallback

Instead of calling a failed service or DB, the system uses local data to provide a fallback response. The user still receives a response even if it's not the most up-to-date or complete.

Advantages:

Continuity: Maintains service continuity by providing users with fallback data.

Reduced Dependency: Decreases dependency on external services during failures.

Disadvantages:

Data Staleness: The fallback data might be outdated, impacting user experience or decision-making.

Implementation Complexity: Requires additional logic to manage and validate local fallback data.

Use this strategy when local data is available, for instance, in a cookie or local cache.

Example

A weather application shows the current temperature. If the system cannot fetch the latest data from the weather service, it uses the last known temperature stored in local storage.
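The weather example could be sketched like this. Everything here is an assumption for illustration: the `current_temperature` function, the in-process `_last_known` cache standing in for local storage, and the `max_age_seconds` limit that guards against serving data that is too stale.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical local store: city -> (temperature, time it was stored).
_last_known: Dict[str, Tuple[float, float]] = {}


def current_temperature(city: str,
                        fetch: Callable[[str], float],
                        max_age_seconds: float = 3600.0
                        ) -> Optional[Tuple[float, bool]]:
    """Return (temperature, is_stale), or None if no usable value exists."""
    try:
        temperature = fetch(city)
    except Exception:
        cached = _last_known.get(city)
        if cached is None:
            return None  # nothing to fall back to
        value, stored_at = cached
        if time.time() - stored_at > max_age_seconds:
            return None  # too old to show
        return value, True  # last known value, flagged as stale
    # Fresh value: remember it for the next failure.
    _last_known[city] = (temperature, time.time())
    return temperature, False
```

Returning the `is_stale` flag lets the UI tell the user the value may be outdated, which addresses the staleness disadvantage directly.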

These strategies are not one-size-fits-all; you must find the one that fits your system, and sometimes you will need a combination.

Final thoughts

1. Prioritize resilience measures where they are needed most.

Example: Payment processing in an e-commerce site is critical, while retrieving user preferences might be less so.

2. The better you understand the dependencies, the better resilience you can build.

Example: If Service A depends on Service B, and Service B fails, Service A might need a fallback mechanism.

3. Data Criticality helps decide which data can use fail-silent or custom fallback strategies.

Example: If the recommendation service fails, fall back to the last known recommendations stored locally.
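One way to act on these three points is to write the decision down as a policy per dependency. This is only a sketch: the `POLICY` table, the dependency names, and the `call_dependency` helper are hypothetical, and your own mapping will depend on how critical each piece of data is.

```python
from enum import Enum
from typing import Any, Callable, Optional


class Strategy(Enum):
    FAIL_FAST = "fail_fast"
    FAIL_SILENT = "fail_silent"
    CUSTOM_FALLBACK = "custom_fallback"


# Hypothetical policy table: which strategy each dependency gets.
POLICY = {
    "payments": Strategy.FAIL_FAST,               # critical, no safe default
    "user_preferences": Strategy.FAIL_SILENT,     # optional data
    "recommendations": Strategy.CUSTOM_FALLBACK,  # last known data is fine
}


def call_dependency(name: str, primary: Callable[[], Any],
                    fallback: Optional[Callable[[], Any]] = None) -> Any:
    strategy = POLICY[name]
    try:
        return primary()
    except Exception:
        if strategy is Strategy.FAIL_FAST:
            raise  # propagate the error immediately
        if strategy is Strategy.CUSTOM_FALLBACK and fallback is not None:
            return fallback()  # serve local or cached data
        return None  # fail silent: hide the failure behind a default
```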

The more you know your system, the better you can prepare it to bounce back from failure.

Keep building Resiliency!


Articles I enjoyed this week

8 Strategies for Reducing Latency by Saurabh Dashora

How Halo Scaled to 11.6 Million Users Using the Saga Design Pattern by Neo Kim

Clean Code: 7 tips to write clean functions by Daniel Moka

System Design: How to Scale a Database by Ashish Pratap Singh

Graphs representation by Franco Fernando


Somix

Jul 11

When you say 'Fail Silent,' do you mean a 'Fail-safe' system?

Because the documentation mentions fail-safe and fail-fast, but I haven't come across fail-silent anywhere yet.


Thanks for mentioning my latest post, Raul. Your newsletter is wonderful; I'm recommending it to my subscribers. How could I have missed subscribing to your newsletter until now? I need to catch up with the previous articles :)
