Examples of circuit breaker pattern with retry mechanism

Hello,

Are there any “formal examples” of how to implement the circuit breaker pattern along with the NServiceBus retry mechanisms? The problem I’m trying to solve: our NServiceBus endpoint receives a message and calls a web API. If the web API is down or slow, the configured timeout for the web API call combined with the immediate and delayed retries in NServiceBus causes all of the endpoint’s threads to get stuck processing messages that call the API, and the queue backs up or gets “clogged”. Based on my knowledge of distributed computing, what we need to do is fail fast, so the NServiceBus host doesn’t waste all of its threads (which pull from its queue) on a web API call that clearly isn’t going to succeed anytime soon. This means implementing the circuit breaker pattern!

Unfortunately I haven’t found any useful examples or advice on how to do this with NServiceBus. Are there any examples available that show how to do this?

Hey @lochness

What would you expect to happen with an incoming message once your circuit breaker is triggered?

NServiceBus has the concept of “unrecoverable exceptions” as a fail fast mechanism. If you define an unrecoverable exception, the related message will skip recoverability. However, this means that it immediately ends up in the error queue and you have to retry all those failed messages manually. See the documentation for unrecoverable exceptions here: https://docs.particular.net/nservicebus/recoverability/?version=core_7.2#unrecoverable-exceptions

It sounds like it would make sense to create a dedicated endpoint which is responsible for the HTTP requests to the unreliable API. You can then configure recoverability to fit better with this specific scenario’s needs, e.g. disable immediate retries and configure delayed retries to wait at least one minute between the attempts. This way, other messages aren’t affected by the unreliable third party while you can still let recoverability handle the situation. See the documentation about configuring delayed retries: https://docs.particular.net/nservicebus/recoverability/configure-delayed-retries
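To show why the retry settings matter here, below is a small back-of-the-envelope calculation in Python. It assumes the total number of processing attempts is (immediate retries + 1) × (delayed retries + 1), which is my understanding of how the two retry stages combine. The function names, the 30 second HTTP timeout, and the retry counts are illustrative values, not taken from your configuration.

```python
def total_attempts(immediate_retries, delayed_retries):
    """Processing attempts per message before it reaches the error queue.

    Assumption: every delayed retry round repeats the immediate retries.
    """
    return (immediate_retries + 1) * (delayed_retries + 1)


def worst_case_busy_seconds(http_timeout_seconds, immediate_retries, delayed_retries):
    """Thread time one message can consume if every attempt runs into the HTTP timeout."""
    return total_attempts(immediate_retries, delayed_retries) * http_timeout_seconds


# Hypothetical settings: 30 s HTTP timeout, 5 immediate and 3 delayed retries.
print(worst_case_busy_seconds(30, 5, 3))  # 720
# Same timeout with immediate retries disabled.
print(worst_case_busy_seconds(30, 0, 3))  # 120
```

With these example numbers, disabling immediate retries cuts the thread time a single failing message can consume from 12 minutes to 2 minutes, and the waiting between delayed retries does not block a processing thread.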

A more flexible but complex approach is a custom behavior in the pipeline which can keep an eye out for specific exceptions from the Web API. It can then act as a circuit breaker by stopping the pipeline invocation early without invoking the actual handler and therefore avoiding subsequent requests to the API. This approach requires careful thinking to ensure it only affects the relevant messages and a good plan on what to do with the affected message. Just failing the message will let it go through the recoverability steps very quickly and potentially end up in the error queue as well. By combining it with a custom recoverability policy, you can delay those messages rejected by the circuit breaker using a longer delay by default.
Here’s our documentation about custom behaviors: https://docs.particular.net/nservicebus/pipeline/manipulate-with-behaviors
and custom recoverability policies: https://docs.particular.net/nservicebus/recoverability/configure-delayed-retries#custom-retry-policy
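
To make the circuit breaker idea concrete, here is a minimal, library-neutral sketch in Python. It is not NServiceBus or Polly code: `CircuitBreaker`, `CircuitOpenError`, and the threshold and break duration are hypothetical names and values chosen for illustration. In a real endpoint, the `call` wrapper would correspond to the point where a custom behavior decides whether to invoke the handler at all.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                # Fail fast: do not touch the remote API at all.
                raise CircuitOpenError("circuit is open")
            # Break period elapsed: let a trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A successful call closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

This sketch is not thread-safe, and in the half-open state it lets every concurrent caller through rather than a single trial call. A production version would need locking, and it would need a decision on what happens to a message rejected with `CircuitOpenError`, which is where the custom recoverability policy comes in.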

Before diving deeper into custom behaviors and recoverability, I’d recommend checking whether a dedicated endpoint with adjusted recoverability settings already solves the problem.

First of all, such integrations should run in isolation on a separate endpoint. This way if this endpoint is stopped it will only affect this integration channel.

There is currently no easy way to dynamically limit concurrency or throughput.

I recommend creating a monitor that checks the state of this integration API and stops the endpoint when the API is in a faulty state. This way the queue simply buffers all messages, and your endpoint doesn’t waste valuable resources retrying messages that you already know will fail. There is no need to schedule retries or anything else. Once the monitor check succeeds again, you can start the endpoint and continue message processing.
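
As a rough illustration of the monitor idea, here is a library-neutral sketch in Python. `EndpointGate`, `probe`, `start_endpoint`, and `stop_endpoint` are hypothetical names: the probe stands in for whatever health check the integration API offers, and the two callbacks stand in for however you start and stop your endpoint.

```python
class EndpointGate:
    """Starts or stops an endpoint depending on the result of a health probe."""

    def __init__(self, probe, start_endpoint, stop_endpoint):
        self.probe = probe
        self.start_endpoint = start_endpoint
        self.stop_endpoint = stop_endpoint
        self.running = True  # assume the endpoint is running when monitoring begins

    def check(self):
        try:
            healthy = bool(self.probe())
        except Exception:
            # A probe that throws is treated the same as an unhealthy API.
            healthy = False

        if healthy and not self.running:
            self.start_endpoint()
            self.running = True
        elif not healthy and self.running:
            self.stop_endpoint()
            self.running = False
        return self.running
```

`check` would be called on a timer, for example every 30 seconds. Because it only acts on a change of state, repeated failed probes do not stop an endpoint that is already stopped.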

Also, if your integration channel allows for it, you could limit the maximum concurrency. From the documentation:

The default concurrency limit is max(Number of logical processors, 2).

An additional approach is to apply rate limiting, which prevents flooding your integration during peaks and gives you a stable integration channel.
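
One common way to implement rate limiting is a token bucket. The Python sketch below is a minimal, library-neutral version. `TokenBucket`, `try_acquire`, and the rate and capacity parameters are names I made up for illustration, and the sketch is not thread-safe.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, then a steady `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill according to the elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A handler would call `try_acquire()` before calling the API and, if it returns `False`, postpone the message instead of calling the API.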

Hey Tim,

I don’t think creating an endpoint that only makes an API call is a reasonable solution for any customer. Nowadays most companies are in the cloud, and each endpoint you create incurs overhead:

  1. Cloud resources (memory and cpu)
  2. Building a CI/CD pipeline
  3. Storing deployable artifacts
  4. Monitoring

I feel like using something like “Polly” along with NServiceBus’s unrecoverable exceptions concept on the “BrokenCircuitException” exception type should handle my needs.
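
To make the idea concrete, here is a toy model in Python of the decision this would produce. It is not NServiceBus or Polly code: `BrokenCircuitError`, `decide`, the action names, and the retry numbers are all hypothetical stand-ins.

```python
class BrokenCircuitError(Exception):
    """Stand-in for the exception a circuit breaker throws while it is open."""


# Exception types that should skip all retries.
UNRECOVERABLE = (BrokenCircuitError,)


def decide(exception, delayed_attempts, max_delayed_retries=3, delay_seconds=60):
    """Return (action, delay_in_seconds) for a failed message."""
    if isinstance(exception, UNRECOVERABLE):
        # Fail fast: no retries, the message goes straight to the error queue.
        return ("move_to_error", None)
    if delayed_attempts < max_delayed_retries:
        # Back off a little longer on every delayed attempt.
        return ("delayed_retry", delay_seconds * (delayed_attempts + 1))
    return ("move_to_error", None)
```

The trade-off is the one mentioned above: while the circuit is open, every message that arrives ends up in the error queue and has to be retried from there later.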

I wish NServiceBus had some examples or documentation on this, or provided a recommended solution that doesn’t incur the overhead of “moving the problem elsewhere”.

Thanks!

It is the essence of the microservice concept. Having a small unit of deployment that does one job that it absolutely owns and does it really well. Its API is its queue and the message contracts that it accepts.

How you deploy this is up to you. Multi process, single endpoint per process or single process with multiple endpoints.

The point is that you have fine-grained control over each individual endpoint with regard to:

  • Logging
  • Security
  • Throughput
  • Error handling
  • Scaling
  • Monitoring
  • High Availability
  • Deployment

Moving the problem elsewhere is not what we are doing.

In general, it is advisable not to host unrelated handlers in a single endpoint. Similarly, you would not keep all of your enterprise’s data in one database, since for most data there is no reason to store it centrally.

In my opinion, having different retry mechanisms per handler isn’t an advisable approach. It’s hard for operations to monitor and control and adds too much complexity for one unit of deployment.

By the way, if you want to have a sort of HOLD THE LINE approach that is possible by raising a critical error:

If you want to have your own circuit breaker logic then you can use Polly or clone our circuit breaker code:

Then you can add this in a custom pipeline extension as documented here:

or you could subscribe to error events:

If you are not into a HOLD THE LINE type of behavior but want to postpone processing for a specific message type, you could use a state machine to decide whether to defer messages immediately. Our rate control sample shows how this could be achieved: