RabbitMQ TimeToWaitBeforeTriggering - What are the consequences and limits?

I’m completing the process of porting our solution from NServiceBus 5 to 7.

We have an acceptance test machine running on a shared vSphere server, and once a week at midnight a backup job runs. While the backup is running, all our NServiceBus services shut down:

Message: Critical error, shutting down: ‘DataLogger MessagePump’ connection to the broker has failed.

Now I find it strange that a simple backup (more latency on the disk IO?) is enough to cut the network between our services and the RabbitMQ broker (both on the same host) for over 2 minutes, but it is possible.

With the previous versions, we would see lots of ‘AMQP shared queue closed’ messages and circuit breaker disarmed and then armed log entries, but it would never shut down the services; it would just reconnect at some point.

The production environment is more stable, but I would love to make sure our services do not shut down by themselves in the middle of the night.

Based on the documentation, I can set the TimeToWaitBeforeTriggering option to a higher value to prevent this.

My question is, are there any disadvantages or consequences to setting this to a higher value? What would be an acceptable limit? Would it be okay to set it to 15 or even 30 minutes?

I’m aware I can configure the services to automatically restart on shutdown, but we would really prefer that they not restart at all, for a few different reasons.

You could set the value higher than the default 2 minutes. This would mean your endpoint would keep trying to reconnect for longer before it either reconnects or raises a critical error.

Perhaps, prior to increasing TimeToWaitBeforeTriggering, you could bump the RequestedHeartbeat setting from the default 5 seconds to 60, to rule out the possibility that the current value is too aggressive and the client only thinks it has been disconnected.
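In code, that could look something like the following. This is only a minimal sketch of the endpoint setup, assuming NServiceBus 7 with the RabbitMQ transport; the endpoint name, host, and values are placeholders, and the exact option and method names should be checked against the transport version you’re running.

```csharp
var endpointConfiguration = new EndpointConfiguration("DataLogger");

var transport = endpointConfiguration.UseTransport<RabbitMQTransport>();
transport.UseConventionalRoutingTopology();

// Raise the AMQP heartbeat from the default 5 seconds to 60 seconds
// via the requestedHeartbeat connection string option (value in seconds).
transport.ConnectionString("host=localhost;requestedHeartbeat=60");

// Give the message pump more time to reconnect before the configured
// critical error action is invoked (the default is 2 minutes).
transport.TimeToWaitBeforeTriggeringCircuitBreaker(TimeSpan.FromMinutes(10));
```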

Thanks for this reply. We had RequestedHeartbeat set to a higher value in the Debug build to help with debugging and avoid everything crashing when using breakpoints, but left the default value in production.

Unfortunately, this doesn’t really answer my question about the risks of raising those values. Is it only that it would take longer before reconnecting, hence a few more seconds before messages start to process again, or could it lead to more serious problems? For example, the service staying alive but in a frozen state, with the broker disconnected and never reconnecting, or messages getting dropped?

I’m also not perfectly clear on the relation between the two settings. RequestedHeartbeat seems to be the time before the disconnected exception is triggered and reconnection attempts start, and TimeToWaitBeforeTriggering seems to be the time after reconnection attempts start before the service process is shut down. But honestly, wouldn’t it be better to just retry the connection as often and as fast as possible, and to retry indefinitely until it reconnects, even if the RabbitMQ service was down for multiple hours?

I would say that there are no risks to increasing them.

There are several settings we have that are related to connection recovery.

The RequestedHeartbeat setting is directly passed into the RabbitMQ client. It controls both how often the broker and client send AMQP heartbeat messages to each other and how long each will go without receiving a message before considering the connection dead. Setting this value too low can lead to false positives, causing the connection to be considered dead when it’s actually fine. Our default value of 5 seconds is probably too low, and we should consider increasing it to 30-60 seconds instead.

When a connection is detected as being closed, that’s when RetryDelay comes into play. For the message pump connection, we rely on the RabbitMQ client’s built-in connection recovery mechanism to reconnect, so RetryDelay is passed directly into the RabbitMQ client. For our publishing connections, we have a manual retry mechanism, but RetryDelay is also used there.

The final setting is TimeToWaitBeforeTriggering. This setting controls the message pump’s circuit breaker. It is not passed into the RabbitMQ client because the circuit breaker is solely an NServiceBus concept. When a connection is detected as being closed, the RabbitMQ client starts attempting to recover the connection, and we also arm the circuit breaker. If we haven’t reconnected within the time configured by TimeToWaitBeforeTriggering, we trigger the circuit breaker, which causes the configured critical error action to execute. By default, nothing happens, so the connection will keep attempting to reconnect indefinitely.
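To illustrate that flow, here is a deliberately simplified sketch of the arm/disarm/trigger idea; it is not the transport’s actual implementation, just the concept described above expressed as a timer-based countdown.

```csharp
using System;
using System.Threading;

// Simplified illustration of the message pump circuit breaker concept:
// arming starts a countdown, disarming cancels it, and if the countdown
// completes the critical error action is invoked.
class ConnectionCircuitBreaker : IDisposable
{
    readonly TimeSpan timeToWaitBeforeTriggering;
    readonly Action<string> criticalErrorAction;
    Timer timer;

    public ConnectionCircuitBreaker(TimeSpan timeToWaitBeforeTriggering, Action<string> criticalErrorAction)
    {
        this.timeToWaitBeforeTriggering = timeToWaitBeforeTriggering;
        this.criticalErrorAction = criticalErrorAction;
    }

    // Called when the connection is detected as closed: start the countdown.
    public void Arm()
    {
        timer = new Timer(
            _ => criticalErrorAction("Connection to the broker has failed."),
            null,
            timeToWaitBeforeTriggering,
            Timeout.InfiniteTimeSpan);
    }

    // Called when the connection recovers before the countdown elapses: cancel it.
    public void Disarm()
    {
        timer?.Dispose();
        timer = null;
    }

    public void Dispose() => Disarm();
}
```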

That’s why we offer APIs to configure the behavior. You can adjust it as you desire to match your system’s requirements.
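For example, here’s a sketch of a critical error action that tries to stop the endpoint cleanly and then terminates the process, which is along the lines of what the samples do; treat it as a starting point and adjust the logging and shutdown behavior to your own requirements.

```csharp
var endpointConfiguration = new EndpointConfiguration("DataLogger");

endpointConfiguration.DefineCriticalErrorAction(
    async context =>
    {
        // Try to stop the endpoint cleanly first.
        await context.Stop().ConfigureAwait(false);

        // Then terminate the process so the service host can restart it.
        var message = $"Critical error, shutting down: '{context.Error}'.";
        Environment.FailFast(message, context.Exception);
    });
```

If you’d rather keep the old behavior of never shutting down, the action can simply log the error and return instead.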

To get back to the scenario described in your original post, I agree that it’s strange that a backup would cause the problem you describe. I’d focus on learning more about why that is happening rather than on tweaking connection recovery settings.

Thanks for this explanation, it makes things much clearer.

In fact, while upgrading to the new version, I used the sample project you provide as a baseline, and it calls Environment.FailFast in the critical error function. That’s something we did not have in the NSB 5 version using the Host, so it also explains why the services never shut down in the past.

In any case, I’ll keep this critical error behavior and raise the heartbeat to 60 seconds and the circuit breaker trigger to 10 minutes. That should be good enough to survive the weekly backup.

As for my backup situation, it’s not critical since that whole infrastructure is not particularly stable anyway. On the other hand, we do see RabbitMQ disconnections happening in production too, usually when other processes push the CPU usage to 100% for a few minutes.

Hello again Brandon,

I just encountered a problem related to this, and I would like to know if it’s a problem in my endpoint configuration or if it’s a bug in the NServiceBus.Transport.RabbitMQ library.

It seems like the library created a second message pump named “EndpointName-EndpointName”, and I don’t know where it comes from. When I get a connection failure, it disconnects and arms both of them, but when the connection comes back, it only reconnects the main one, not the secondary one. So 10 minutes later, the critical error triggers.

2019-06-03 12:41:05.5062 NServiceBus.Transport.RabbitMQ.MessagePumpConnectionFailedCircuitBreaker The circuit breaker for 'EndpointName MessagePump' is now in the armed state 
2019-06-03 12:41:34.0023 NServiceBus.Transport.RabbitMQ.ChannelProvider Attempting to reconnect in 10 seconds. 
2019-06-03 12:41:35.0911 NServiceBus.Transport.RabbitMQ.MessagePumpConnectionFailedCircuitBreaker The circuit breaker for 'EndpointName-EndpointName MessagePump' is now in the armed state 
2019-06-03 12:41:33.8843 NServiceBus.Transport.RabbitMQ.MessagePumpConnectionFailedCircuitBreaker The circuit breaker for 'EndpointName MessagePump' is now disarmed 
2019-06-03 12:41:45.1572 NServiceBus.Transport.RabbitMQ.ChannelProvider Connection to the broker reestablished successfully. 
2019-06-03 12:51:35.0835 NServiceBus.Transport.RabbitMQ.MessagePumpConnectionFailedCircuitBreaker The circuit breaker for 'EndpointName-EndpointName MessagePump' will now be triggered 

Any idea why I have that “EndpointName-EndpointName” message pump? When I look at the exchange definitions on the RabbitMQ management page, the EndpointName one contains bindings for all my application messages, while EndpointName-EndpointName only contains a binding for “nsb.delay-delivery”. As of now, my application never sends a message specifying the DelayDeliveryWith option, but I did call MakeInstanceUniquelyAddressable() and EnableCallbacks(), in case that’s the cause.

Callbacks require an instance-specific queue, so having more than one queue and message pump is expected.

What value are you passing into your MakeInstanceUniquelyAddressable call? When you call that, a second queue and message pump are created, one that is instance-specific. If you are just passing your endpoint name into that call, then that is not correct. See Client-side callbacks • Callbacks • Particular Docs for more details.
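For example, something along these lines, where the discriminator identifies the specific instance rather than repeating the endpoint name (the value here is just an example):

```csharp
var endpointConfiguration = new EndpointConfiguration("EndpointName");

// The discriminator should identify this particular instance of the endpoint,
// e.g. a machine name or instance id, not the endpoint name itself.
endpointConfiguration.MakeInstanceUniquelyAddressable("Instance-A");

// Callbacks use the instance-specific queue created by the call above.
endpointConfiguration.EnableCallbacks();
```

With a discriminator like that, the instance-specific queue would be named EndpointName-Instance-A instead of EndpointName-EndpointName.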

I’m not sure what else you’ve got going on, but regardless of how many message pumps and connections the endpoint has, they should all be able to recover when disconnected.

So yes, that’s where the queue comes from, and yes, I currently use the endpoint name as the “uniqueId” because I only have one instance of each endpoint, so I didn’t bother to come up with a unique name for the instance.

I’m not sure why it didn’t reconnect earlier today, but I forced the disconnection as a test three times, and each time both circuit breakers disarmed correctly upon reconnection. So everything looks good; it might only happen in a very specific scenario.

Sorry for wasting your time. If I manage to reproduce this behavior consistently, I’ll write back.