Why retries fail and backoff saves you
Retrying a failed step sounds obviously good. Done naively, it can turn a small hiccup into a full-blown outage.
When a step fails, retrying is the natural instinct. But a system that hammers a struggling service with immediate retries often makes things worse: it piles load onto exactly the thing that was already failing. Retries need manners.
The retry storm
Imagine a service slows down briefly and requests start timing out. Every client retries instantly, all at once, so the service now has to handle its normal traffic plus a wave of retries. The service that might have recovered in seconds now collapses under the retry storm. Naive retries can cause the outage they were meant to survive.
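To put rough numbers on the storm, here is a small back-of-the-envelope sketch. The function name and the figures (1,000 clients, a handful of retries) are made up for illustration, and it assumes the worst case where every attempt fails during the slowdown.

```python
def requests_during_slowdown(clients, immediate_retries):
    """Total requests sent if every attempt fails and clients retry at once.

    Hypothetical model: each client sends its original request plus
    `immediate_retries` back-to-back retries, with no waiting in between.
    """
    return clients * (1 + immediate_retries)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 clients, varying retry counts.
    for retries in (0, 1, 3, 5):
        total = requests_during_slowdown(1000, retries)
        print(f"{retries} immediate retries: {total} requests")
```

Under this simple model, three immediate retries turn 1,000 requests into 4,000, all aimed at a service that was already struggling with the first 1,000.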
Back off and jitter
The fix is exponential backoff: wait a little after the first failure, then roughly twice as long after each further one, plus a bit of randomness (jitter) so clients don't all retry in lockstep. Give the struggling service room to recover instead of crowding it. And cap the number of retries, so a permanent failure doesn't loop forever.
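The three ingredients above (growing waits, jitter, and a retry cap) fit in a few lines. What follows is a minimal sketch, not a production implementation: the name `retry_with_backoff`, its parameters, and the default values are all assumptions chosen for illustration. It uses one common jitter variant, picking a random wait between zero and the current exponential ceiling, and it retries on any exception, whereas a real system would usually retry only errors it knows to be transient.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is used up.

    All names and defaults here are illustrative assumptions.
    `sleep` and `rng` are injectable so the behaviour can be tested
    without real waiting or real randomness.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Cap on retries: after the last attempt, surface the failure.
            if attempt == max_attempts - 1:
                raise
            # Exponential ceiling: base_delay, then 2x, 4x, ...,
            # never more than max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Jitter: wait a random fraction of the ceiling so that
            # many clients do not all retry at the same instant.
            sleep(ceiling * rng())
```

A caller would wrap the flaky step in a function and pass it in, for example `retry_with_backoff(lambda: fetch_order(order_id))`, where `fetch_order` stands in for whatever call might fail. The `max_delay` ceiling keeps the wait from growing without bound, and `max_attempts` is what stops a permanent failure from looping forever.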
A retry without backoff isn't resilience. It's a denial-of-service attack you wrote against yourself.