Fault-Tolerant Patterns for Microservices

Sonu Kumar
Sep 17, 2020

The Risk of the Microservices Architecture:

The microservices architecture moves application logic into services and uses a network layer to communicate between them. Communicating over a network instead of through in-memory calls brings extra latency and complexity to the system, which now requires cooperation between multiple physical and logical components. The increased complexity of the distributed system leads to a higher chance of network failures.

With a microservices architecture, we need to keep in mind that provider services can be made temporarily unavailable by broken releases, configuration changes, and other changes, because they are controlled by someone else and components move independently of each other.

If Service A calls Service B, which in turn calls Service C, what happens when Service B is down? What is your fallback plan in such a scenario?

  • Can you return a pre-decided error message to the user?
  • Can you call another service to fetch the information?
  • Can you return values from cache instead?
  • Can you return a default value?

There are a number of things you can do to make sure that the entire chain of microservices does not fail with the failure of a single component.

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components.

For us, a component can be anything: a microservice, a database (DB), a load balancer (LB), you name it.

How to Make Your Services Resilient?

Identify Failure Scenarios:

One way to achieve this is to deliberately make your microservices fail and then try to recover from the failure. This process is commonly termed Chaos Testing. Think about scenarios like the ones below and find out how the system behaves:

  • Service A is not able to communicate with Service B.
  • The database is not accessible.
  • Your application is not able to connect to the file system.
  • A server is down or not responding.
  • Inject faults/delays into the services (a minimal sketch follows this list).
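
As an illustration, here is a minimal fault-injection sketch in Java. The FaultInjector class, the 30% failure rate, and the 500 ms delay are hypothetical values chosen for this example, not something prescribed by any chaos-testing tool. It wraps a downstream call and randomly adds latency or throws an error so you can observe how the caller copes:

```java
import java.util.Random;
import java.util.function.Supplier;

// Minimal chaos-testing sketch: wrap a downstream call and randomly
// inject latency or failures to observe how the caller behaves.
public class FaultInjector {
    private static final Random RANDOM = new Random();

    // failureRate and maxDelayMillis are illustrative knobs, not prescribed values.
    public static <T> T call(Supplier<T> downstream, double failureRate, long maxDelayMillis) {
        try {
            // Inject a random delay to simulate a slow network or node.
            Thread.sleep((long) (RANDOM.nextDouble() * maxDelayMillis));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Inject a random failure to simulate an unavailable dependency.
        if (RANDOM.nextDouble() < failureRate) {
            throw new RuntimeException("Injected fault: simulated downstream failure");
        }
        return downstream.get();
    }

    public static void main(String[] args) {
        // Example: roughly 30% of calls fail, with up to 500 ms of extra latency.
        String result = call(() -> "response from Service B", 0.3, 500);
        System.out.println(result);
    }
}
```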

Avoid Cascading Failures:

When you have service dependencies built into your architecture, you need to ensure that one failed service does not cause a ripple effect across the entire chain. By avoiding cascading failures, you save network resources, make optimal use of threads, and allow the failed service time to recover.

Avoid Single Points of Failure:

While designing your microservice, it’s always good to think about how it will behave if a particular component is down. That leads to a healthier discussion and helps in building a fault-tolerant service.

Make sure your services are not heavily dependent on any single component. If such a dependency is unavoidable, have a strategy to recover quickly from its failure.

Design your applications so that they have no single points of failure. You should be able to handle client requests at all times, so ensuring availability across the microservice ecosystem is critical.

Handle Failures Gracefully and Allow for Fast Degradation:

Design your microservices to be fault tolerant: when errors or exceptions occur, the service should handle them gracefully by returning an error message or a default value instead of failing the entire request.

Have Good Logs:

Once we accept that failures are bound to happen and that some of them may result, not just in a degraded end user experience, but in errors, we must make sure that we are able to understand what went wrong when an error occurs. That means that we need good logs that allow us to trace what happened in the system leading up to an error situation. “What happened” will often span several microservices, which is why you should consider introducing a central Log Microservice.

Correlation Tokens:

In order to find all log messages related to a particular action in the system, we can use correlation tokens. A correlation token is an identifier attached to a request from an end user when it comes into the system. The correlation token is passed along from microservice to microservice in any communication that stems from that end-user request. Any time one of the microservices sends a log message to the Log Microservice, the message should include the correlation token, and the Log Microservice should allow searching for log messages by correlation token. Typically, the API Gateway creates and assigns a correlation token to each incoming request, and the token is then passed along with every microservice-to-microservice communication.
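
A minimal sketch of how a correlation token could be generated and propagated, using Java’s standard HttpClient. The header name X-Correlation-Id and the internal service URL are illustrative conventions for this example, not fixed by any standard:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.UUID;

// Sketch: generate a correlation token at the edge and pass it along
// with every downstream call and every log line.
public class CorrelationExample {
    // Header name is an assumption; use whatever convention your gateway defines.
    static final String CORRELATION_HEADER = "X-Correlation-Id";

    public static void main(String[] args) throws Exception {
        // In a real system the API Gateway would create this for each incoming request.
        String correlationId = UUID.randomUUID().toString();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://service-b.internal/orders/42")) // hypothetical URL
                .header(CORRELATION_HEADER, correlationId)                  // propagate the token
                .GET()
                .build();

        // Include the token in every log message so the Log Microservice can index it.
        System.out.printf("[correlationId=%s] calling Service B%n", correlationId);
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.printf("[correlationId=%s] Service B returned %d%n",
                correlationId, response.statusCode());
    }
}
```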

Roll forward vs Roll back:

When errors happen in production, we are faced with the question of how to fix them. In many traditional systems, if errors start occurring shortly after a deployment, the default would be to roll back to the previous version of the system. In a microservice system, the default can be different. Microservices lend themselves to continuous delivery. With continuous delivery, microservices will be deployed very often and each deployment should be both fast and easy to perform. Furthermore, microservices are sufficiently small and simple so many bug fixes are also simple. This opens the possibility of rolling forward rather than rolling backward.

Why would we want to default to rolling forward instead of rolling backward? In some situations, rolling backward is complicated, particularly when database changes are involved. When a new version that changes the database is deployed, the microservice will start producing data that fits in the updated database. Once that data is in the database, it has to stay there, which may not be compatible with rolling back to an earlier version. In such a case, rolling forward might be easier.

Design Patterns to Ensure Service Resiliency:

  • Asynchronous communication
  • Timeouts
  • Retries
  • Circuit Breaker
  • Fallback
  • Rate limiters

Use Asynchronous communication (for example, message-based communication) across internal microservices: It’s highly advisable not to create long chains of synchronous HTTP calls across the internal microservices, because that design will eventually become the main cause of bad outages. Instead, apart from the front-end communication between the client applications and the first level of microservices or fine-grained API Gateways, it’s recommended to use only asynchronous (message-based) communication across the internal microservices once past the initial request/response cycle. Eventual consistency and event-driven architectures help to minimize ripple effects. These approaches enforce a higher level of microservice autonomy and therefore guard against the problems noted here.
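
The sketch below illustrates the shape of message-based communication using an in-process BlockingQueue as a stand-in for a real broker such as Kafka or RabbitMQ; the OrderPlaced event and the service names are hypothetical. The point is that the producer publishes an event and returns immediately instead of waiting on the consumer:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for a message broker, just to show the shape of
// message-based communication: the producer publishes an event and moves on;
// the consumer processes it on its own schedule.
public class AsyncMessagingSketch {
    // Hypothetical event emitted by an "order" microservice.
    record OrderPlaced(String orderId) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<OrderPlaced> topic = new LinkedBlockingQueue<>();

        // Consumer (e.g. a "shipping" microservice) subscribes to the topic.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    OrderPlaced event = topic.take();   // blocks until an event arrives
                    System.out.println("shipping: handling " + event.orderId());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer publishes and responds to its own client; it never blocks
        // on the consumer's work.
        topic.put(new OrderPlaced("order-42"));
        System.out.println("order service: event published, responding to the client");
        Thread.sleep(200); // give the demo consumer time to print before the JVM exits
    }
}
```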

Work around network timeouts: In general, clients should be designed not to block indefinitely and to always use timeouts when waiting for a response. Using timeouts ensures that resources are never tied up indefinitely.
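
A minimal sketch using Java’s standard HttpClient, assuming a hypothetical internal URL; the 2-second connect timeout and 3-second request timeout are illustrative values you would tune per dependency:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: bound both the connection time and the overall request time so a
// slow dependency can never tie up the caller indefinitely.
public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))   // fail fast if we cannot even connect
                .build();

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://service-b.internal/health")) // hypothetical URL
                .timeout(Duration.ofSeconds(3))          // upper bound on waiting for the response
                .GET()
                .build();

        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("status: " + response.statusCode());
        } catch (java.net.http.HttpTimeoutException e) {
            // The timeout fired: degrade gracefully instead of blocking forever.
            System.out.println("Service B is too slow, falling back");
        }
    }
}
```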

Use retries with exponential backoff. This technique helps the system recover from short, intermittent failures by retrying a call a limited number of times, in case the service was unavailable only briefly. This might occur due to intermittent network issues or when a microservice/container is moved to a different node in a cluster. However, if retries are not designed properly together with circuit breakers, they can aggravate the ripple effects, ultimately even causing a Denial of Service (DoS).
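
A hand-rolled sketch of retries with exponential backoff; the attempt count and delays are illustrative, and production code would usually also add random jitter and lean on a resilience library rather than a helper like this:

```java
import java.util.function.Supplier;

// Sketch: retry a call a bounded number of times, doubling the wait between
// attempts (exponential backoff) so a struggling service gets room to recover.
public class RetryWithBackoff {
    public static <T> T callWithRetry(Supplier<T> call, int maxAttempts, long initialDelayMillis)
            throws InterruptedException {
        long delay = initialDelayMillis;
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt == maxAttempts) break;       // out of attempts, give up
                System.out.printf("attempt %d failed, retrying in %d ms%n", attempt, delay);
                Thread.sleep(delay);
                delay *= 2;                               // exponential backoff
            }
        }
        throw lastFailure;                                // let the caller fall back
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical flaky call for demonstration: fails ~70% of the time.
        String result = callWithRetry(() -> {
            if (Math.random() < 0.7) throw new RuntimeException("Service B unavailable");
            return "response from Service B";
        }, 4, 100);
        System.out.println(result);
    }
}
```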

Use the Circuit Breaker pattern. In this approach, the client process tracks the number of failed requests. If the error rate exceeds a configured limit, a “circuit breaker” trips so that further attempts fail immediately. (If a large number of requests are failing, that suggests the service is unavailable and that sending requests is pointless.) After a timeout period, the client should try again and, if the new requests are successful, close the circuit breaker.
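
A simplified, hand-rolled circuit breaker sketch to make the state transitions concrete. The failure threshold and cool-down duration are illustrative, and in practice you would normally use a battle-tested library (for example Resilience4j on the JVM, or Polly on .NET as mentioned below) instead of rolling your own:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Minimal circuit breaker sketch: after too many consecutive failures the
// circuit "opens" and calls fail immediately; after a cool-down period one
// trial call is allowed through, and a success closes the circuit again.
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public SimpleCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized <T> T call(Supplier<T> downstream) {
        if (openedAt != null) {
            if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                // Circuit is open: fail fast without touching the unhealthy service.
                throw new IllegalStateException("circuit open, failing fast");
            }
            openedAt = null; // cool-down elapsed, allow a trial ("half-open") call
        }
        try {
            T result = downstream.get();
            consecutiveFailures = 0;       // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();  // trip the breaker
            }
            throw e;
        }
    }
}
```

A caller would wrap each downstream call in breaker.call(...) and run its fallback logic whenever the breaker throws.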

Provide Fallback: When the service request still fails after retrying, we need an alternative response to the request. A fallback provides an alternative result when a service request fails. When the circuit breaker trips and the circuit is open, fallback logic can be invoked instead. The fallback logic typically does little or no processing and returns a cached or default value. Fallback logic should have little chance of failing, because it runs as a result of a failure to begin with.

In this approach, the client process performs fallback logic when a request fails, such as returning cached data or a default value. This approach is suitable for queries but is more complex for updates or commands.
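
A minimal sketch of query-side fallback logic; the cache contents and the default message are hypothetical, and the primary call is simulated as always failing just to exercise the fallback path:

```java
import java.util.Map;
import java.util.function.Supplier;

// Sketch: when the primary call fails (or the circuit is open), answer the
// query from a local cache, and fall back to a safe default as a last resort.
public class FallbackExample {
    // Hypothetical stale-but-usable cache of previous successful responses.
    private static final Map<String, String> CACHE = Map.of("42", "cached order 42");

    static String getOrder(String id, Supplier<String> primaryCall) {
        try {
            return primaryCall.get();                        // normal path
        } catch (RuntimeException e) {
            String cached = CACHE.get(id);
            if (cached != null) {
                return cached;                               // fallback 1: cached data
            }
            return "order details temporarily unavailable";  // fallback 2: default value
        }
    }

    public static void main(String[] args) {
        String answer = getOrder("42", () -> {
            throw new RuntimeException("Service B is down");
        });
        System.out.println(answer);
    }
}
```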

Limit the number of queued requests: Clients should also impose an upper bound on the number of outstanding requests that a client microservice can send to a particular service. If the limit has been reached, it’s probably pointless to make additional requests, and those attempts should fail immediately. In terms of implementation, the Polly Bulkhead Isolation policy can be used to fulfill this requirement.
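
The article names Polly’s Bulkhead Isolation policy for .NET; below is just an illustrative Java equivalent of the same idea, using a Semaphore to cap in-flight calls. The concurrency limit would be tuned per dependency:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Bulkhead-style sketch: cap the number of in-flight calls to one dependency
// so a slow service cannot exhaust the caller's threads or queue unbounded work.
public class BulkheadExample {
    private final Semaphore permits;

    public BulkheadExample(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    public <T> T call(Supplier<T> downstream) {
        if (!permits.tryAcquire()) {
            // Limit reached: reject immediately instead of piling up requests.
            throw new IllegalStateException("too many outstanding requests, rejecting");
        }
        try {
            return downstream.get();
        } finally {
            permits.release();   // free the slot whether the call succeeded or failed
        }
    }
}
```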

Throttling or Rate limiting is configured as a threshold on the maximum number of requests that can be made during a specific time period. For example, a throttle can be defined as a maximum number of requests per second, a maximum number of requests per day, or both.

When the threshold is reached, more requests are arriving than can be processed, and the throttled messages must be queued or rejected. If a queuing solution is implemented, excess messages are put into a first-in first-out (FIFO) queue; when the service has capacity, it retrieves messages from this queue and processes them, so requests that exceed the rate are processed in order and are not lost. If a rejection solution is implemented and no spare capacity is available, excess messages are simply discarded.
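
A minimal fixed-window rate limiter sketch of the rejection variant; the limit of 2 requests per second is illustrative, and real gateways usually use token-bucket or sliding-window algorithms and often queue rather than reject:

```java
// Fixed-window rate limiter sketch: allow at most maxRequests per window and
// reject the rest.
public class FixedWindowRateLimiter {
    private final int maxRequests;
    private final long windowMillis;
    private int requestsInWindow = 0;
    private long windowStart = System.currentTimeMillis();

    public FixedWindowRateLimiter(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        if (now - windowStart >= windowMillis) {
            windowStart = now;       // new window: reset the counter
            requestsInWindow = 0;
        }
        requestsInWindow++;
        return requestsInWindow <= maxRequests;
    }

    public static void main(String[] args) {
        // Example: at most 2 requests per second; the third in the same window is throttled.
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(2, 1000);
        for (int i = 1; i <= 3; i++) {
            System.out.println("request " + i + (limiter.tryAcquire() ? " accepted" : " throttled"));
        }
    }
}
```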

Conclusion

In this article, I tried to illustrate fault-tolerant patterns for designing resilient microservices by following these best practices.

Implementing and running a reliable service is not easy. It takes a lot of effort on your side and also costs your company money.

Has this article been useful to you? Please share it widely; we also welcome feedback on content you would like us to cover.


Sonu Kumar

Software Consultant interested in Microservices / Serverless computing, Middleware / SOA, Event Driven Architecture & Machine Learning.