Retries

The conventional wisdom for web clients is that they should respond to failures with retries, using exponential backoff and jitter. While this advice is generally helpful, it can be improved upon.

Exponential Backoff and Jitter

This AWS architecture piece captures the conventional wisdom.
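The core recommendation of that post can be sketched in a few lines. This is the "full jitter" variant it advocates: pick the delay uniformly from zero up to the exponentially growing backoff, which spreads retries out better than plain exponential backoff (function and parameter names here are my own).

```python
import random

def full_jitter_delay(attempt, base=0.1, cap=10.0):
    """Delay in seconds before retry number `attempt` (0-indexed).

    'Full jitter': sleep = random(0, min(cap, base * 2**attempt)).
    The cap keeps the delay bounded no matter how many attempts
    have failed.
    """
    backoff = min(cap, base * (2 ** attempt))
    return random.uniform(0, backoff)
```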

Will Circuit Breakers Solve My Problems?

Shows the limitations of circuit breakers.

The bottom line is that retries are often triggered by overload conditions, whether transient or permanent, and tend to make those conditions worse by increasing traffic. Many people replied saying that I was ignoring the obvious, effective solution to this problem: circuit breakers.

But this only works if the circuit breaker is accurate about whether the service is down. In a distributed system, with services that shard their data or where some requests but not all depend on another service, you can have a situation where some requests will reliably fail, while others will succeed.

Modern distributed systems are designed to partially fail, continuing to provide service to some clients even if they can’t please everybody. Circuit breakers are designed to turn partial failures into complete failures. One mechanism will likely defeat the other. Make sure you think that through before deploying circuit breakers.
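To make the mechanism concrete, here is a minimal count-based breaker (a sketch with invented names, not any particular library's API). Note that it keys on the downstream service as a whole: if only some requests reliably fail, say those hitting one bad shard, the breaker still trips for everyone, which is exactly the partial-to-complete failure conversion described above.

```python
import time

class CircuitBreaker:
    """Minimal count-based circuit breaker (illustrative sketch).

    After `threshold` consecutive failures it rejects all calls for
    `reset_after` seconds, then lets a single probe through. It has
    one bit of state for the whole downstream service, so it cannot
    distinguish "some requests fail" from "the service is down".
    """
    def __init__(self, threshold=5, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Half-open: permit one probe; a failure re-trips immediately.
            self.opened_at = None
            self.failures = self.threshold - 1
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```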

Fixing Retries with Token Buckets

Starts with a suggested tweak to circuit breakers: always let first-time requests through, but use the circuit breaker to gate retries. This approach reduces the problems with circuit breakers, but Marc argues token buckets are superior. The article doesn't say this, but I think that under this scheme a circuit breaker will never actively shed first-try load.
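A sketch of the token-bucket idea, with made-up names and numbers: retries spend tokens and successes slowly refill them, so retry traffic self-limits under sustained overload while first tries always pass.

```python
class RetryTokenBucket:
    """Illustrative token-bucket retry limiter.

    First tries are always allowed and are not gated here. A retry
    must spend `retry_cost` tokens; each success refunds `refill`
    tokens. When the service is healthy, successes keep the bucket
    topped up and retries flow; under sustained overload the bucket
    drains and retries stop, without blocking first tries.
    """
    def __init__(self, capacity=100.0, retry_cost=5.0, refill=1.0):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.refill = refill

    def can_retry(self):
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False

    def on_success(self):
        self.tokens = min(self.capacity, self.tokens + self.refill)
```

The asymmetry between `retry_cost` and `refill` matters: it takes several successes to earn back one retry, so the steady-state retry rate stays a small fraction of the success rate.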

What is Backoff For

At the risk of oversimplifying, there is a 2x2 matrix: is the overload short-term or long-term? And do you have a small number of sources of work (colloquially, "clients"), or many independent clients?

Backoff helps with short-term overload, as it postpones the work introduced by retries, letting you handle it in the future. Backoff also helps with long-term overload for a small number of clients, as it postpones future requests from those same clients.

But if you have a long-term overload, and many independent clients, backoff is useless. Each client has to independently hit the service to determine that it's down, and then backoff just takes current work (retries) and defers it into the future.

If you have too many first tries, you need fewer first tries. With a bounded number of clients, getting each of them to back off is an effective way to achieve that. With an unbounded number of clients, it is not: each client only hears the bad news after its first try, so no amount of backoff will reduce the aggregate first-try rate.
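A back-of-envelope model (mine, not from the post) makes this concrete: for open-loop clients in steady state, the total request rate is determined by the first-try rate, the failure probability, and the retry cap. The backoff delay does not appear anywhere in the formula.

```python
def steady_state_rate(first_try_rate, p_fail, max_retries):
    """Total request rate (req/sec) in steady state for open-loop clients.

    Each first try spawns a geometric chain of retries: a fraction
    p_fail of every attempt fails and is retried, up to max_retries
    times. Backoff shifts each retry later in time but, once the
    system reaches steady state, does not change this rate.
    """
    rate = 0.0
    attempt_rate = first_try_rate
    for _ in range(max_retries + 1):
        rate += attempt_rate
        attempt_rate *= p_fail  # fraction that fails and retries
    return rate

# 1000 first tries/sec during a total outage (p_fail = 1.0) with
# 2 retries each: 1000 + 1000 + 1000 = 3000 req/sec of sustained
# load, however long the retries back off for.
```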

If you’ve got an OK number of first tries, but some error rate is driving up second-try (retry) traffic, then you need to reduce the number of second tries. Backoff is an effective way to reduce the number of second tries right now, by deferring them into the future. If you expect to be able to handle them better in the future, that’s a win. But backoff does not reduce the total number of second tries under long-lasting overload. For that, you need something like the adaptive retry approach.

Backoff is only a good retry policy in systems with small numbers of sequential clients, where the introduced delay between retries delays future first tries. If this property is not true, and the next first try is going to come along at a time independent of retry backoff, then backing off retries does nothing to help long-term overload. It just defers work to a future time.

Good Retry Bad Retry: An Incident Story

Works through the failure modes of different retry policies in detail. An appendix contains the results of simulations using many different policies, with a GitHub repository containing the code.

This piece explicitly connects the many-clients/few-clients distinction from Brooker's post to the concept of open-loop and closed-loop benchmarks, and ends up recommending the token-bucket solution shared by Brooker.

A final topic mentioned is deadline propagation, in which each request carries a deadline after which it should be cancelled. The article concludes that deadline propagation is a complement to retry limits, but not a substitute. (To be honest, while I follow the logic that it's not a substitute, I don't see how the simulations show it's a complement, though it does seem like a good idea.)
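A minimal sketch of the deadline-propagation idea, with hypothetical helper names (real systems carry the deadline in request metadata, e.g. gRPC's deadline mechanism): the client picks an absolute deadline once, and every hop checks the remaining budget before doing, or retrying, any work.

```python
import time

def remaining_budget(deadline):
    """Seconds left before the propagated absolute deadline.

    A value <= 0 means the caller has already given up, so doing
    (or retrying) the work would be wasted effort.
    """
    return deadline - time.monotonic()

def call_with_deadline(op, deadline):
    """Invoke `op`, passing the same absolute deadline downstream.

    Hypothetical helper: `op` takes the deadline and is expected to
    perform the same check before each unit of work it does.
    """
    if remaining_budget(deadline) <= 0:
        raise TimeoutError("deadline exceeded; not attempting call")
    return op(deadline)
```

Propagating one absolute deadline, rather than a fresh per-hop timeout, is what stops a deep call tree from doing work the original caller stopped waiting for several hops ago.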