Endpoint doesn’t reconnect to RabbitMQ when the connection is lost

I’m seeing that my Endpoint isn’t reconnecting to RabbitMQ if the connection is lost. Is there a way to configure NServiceBus to retry connectivity to RabbitMQ?

The RabbitMQ transport is designed to do exactly what you’re suggesting. It does attempt to reconnect to the broker when the connection is lost.
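For reference, a minimal endpoint setup looks something like this (a sketch assuming the classic `UseTransport`-style API; exact method names vary by transport version). There is nothing reconnection-specific to configure here, because the transport retries the connection on its own:

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

class Program
{
    static async Task Main()
    {
        var endpointConfiguration = new EndpointConfiguration("MyEndpoint");

        // RabbitMQ transport configuration (older transport versions shown; newer
        // versions pass a RabbitMQTransport instance to UseTransport instead).
        var transport = endpointConfiguration.UseTransport<RabbitMQTransport>();
        transport.ConnectionString("host=localhost");
        transport.UseConventionalRoutingTopology();

        // No extra settings are needed for reconnection: when the broker goes away,
        // the transport keeps retrying and resumes processing once it is back.
        var endpointInstance = await Endpoint.Start(endpointConfiguration);

        Console.WriteLine("Endpoint started. Press any key to stop.");
        Console.ReadKey();
        await endpointInstance.Stop();
    }
}
```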

What behavior are you seeing that makes you think that it isn’t? Please provide specific details!

Thanks for the response, Brandon!

On my local machine, I start RabbitMQ in a container and start my Endpoint in another container. The Endpoint connects to RabbitMQ just fine (I can verify that in RabbitMQ’s queue list) and receives messages as expected. Then I kill the RabbitMQ container and bring it back up. At that point, there are no queues listed in RabbitMQ, even though the Endpoint wasn’t modified at all. If I restart the Endpoint (using Start Without Debugging in Visual Studio), I can then see my Endpoint queue in RabbitMQ.

Please let me know if I can provide any more details.

Are you saying that after you stop and start the broker container, all of the exchanges, queues, and bindings are gone? If so, then that is your problem. Wiping out all of the established broker topology is not a valid scenario for testing connection recovery.

You are required to have all of the topology in place in the broker for the endpoint to function. You can create it manually, or you can enable installers and the endpoint will create it automatically. However, installers only run once, when the endpoint starts.
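To illustrate, here’s a sketch of enabling installers in the endpoint startup code (same skeleton as the earlier example); the topology is created once, during `Endpoint.Start`:

```csharp
var endpointConfiguration = new EndpointConfiguration("MyEndpoint");

var transport = endpointConfiguration.UseTransport<RabbitMQTransport>();
transport.ConnectionString("host=localhost");
transport.UseConventionalRoutingTopology();

// Creates the endpoint's queues, exchanges, and bindings in the broker,
// but only while Endpoint.Start runs, never when a lost connection is later recovered.
endpointConfiguration.EnableInstallers();

var endpointInstance = await Endpoint.Start(endpointConfiguration);
```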

If you want to test connection recovery by stopping the broker container, then you need to ensure that the container is maintaining state between runs, keeping the broker topology intact.

Hey Brandon, sorry for the delay; I wanted to test a couple more scenarios before responding.

Is there a way for endpoints to recreate the queue after startup if it doesn’t exist? Or, as you said, how do you make the container maintain state between runs? (I tried adding a volume, but the queues didn’t come back up after a restart; googling didn’t reveal any answers either.)

No, endpoints run installers when they start up, and there is no way to make installers run any other time.

The details of how to configure a docker container are kind of out of scope here, but there should definitely be a way to set up an external volume and then configure the RabbitMQ broker to store its data on that volume instead of inside the container.
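As a rough sketch (the image tag, container name, and volume name here are just examples): the official rabbitmq image stores its data under /var/lib/rabbitmq, and the data directory is derived from the node name, which defaults to the container’s hostname. So you generally need both a named volume and a fixed --hostname for queues to survive a container restart:

```sh
# Fixed hostname keeps the RabbitMQ node name (rabbit@my-rabbit) stable, so the
# broker reuses the same data directory on the mounted volume after a restart.
docker run -d \
  --name rabbitmq \
  --hostname my-rabbit \
  -p 5672:5672 -p 15672:15672 \
  -v rabbitmq-data:/var/lib/rabbitmq \
  rabbitmq:3-management
```

That, combined with the fact that the queues NServiceBus creates are durable, should let the broker topology survive stopping and starting the container.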

Giving some more background on Dan’s question - the prod use case is as follows:

We’ve got three instances of Rabbit running on K8s. If one of them dies and restarts, or we cycle in a fourth instance, that new instance does not connect to any of the running endpoints. I’m not sure if that reconnect behavior exists within the bounds of NSB. Is this an issue Particular has seen elsewhere? The scenario Dan described was an attempt to replicate this locally.

This statement confuses me a bit. Are you talking about RabbitMQ broker instances in a cluster dying and restarting? If so, then I don’t know what you mean by “that new instance does not connect to any of the running endpoints”. The broker doesn’t connect to NServiceBus endpoints. The endpoints are the clients, and they are the ones that initiate the connection.

It’s still not really clear what scenario you’re trying to describe or problematic behavior you’re seeing.

Yes, you are correct. I’m discussing the RabbitMQ instances in a cluster. The exact scenario we’ve seen is this…

Our memory was set too low on our RabbitMQ instances in prod, so our message queues weren’t accepting any more messages. To solve this, our DevOps team increased the memory and then replaced the old Rabbit instances one at a time with new ones: remove Instance #1 and replace it with Instance #4, remove Instance #2 and replace it with Instance #5, remove Instance #3 and replace it with Instance #6. So we started with Instances 1-3 (low memory) and ended with Instances 4-6 (higher memory).

After doing that, the handler applications were unable to reconnect to Instances 4-6, so messages were not being processed. Once we redeployed all of the handlers, the connections were re-established and normal functionality resumed. However, we shouldn’t have to restart all of the handlers simply because of Rabbit.

How is access to the broker cluster configured on the NSB endpoints? I assume you’ve got some sort of load-balanced virtual IP situation going on? If you’re swapping out actual broker instances at runtime, then it sounds like you’ve got some sort of IP affinity in the load balancer that’s preventing the NSB endpoints from connecting to the new broker instances.

Another example of when the Endpoints lose their connection to RabbitMQ is when the node (or container) for RabbitMQ is modified. We can see the queues in RabbitMQ, but for messages to be processed again (i.e., for the Endpoints to re-establish their connection to RabbitMQ), we need to restart each Endpoint.

I’ll have to speak to DevOps to figure out how that’s configured. Will get back to you.

To my knowledge, there is no session affinity.

From DevOps:

We have 3 RabbitMQ nodes in a 1-master, 2-mirror cluster behind a public IP. If the master goes down, then a mirror takes over as master with the same public IP.

At this point, I think it’s pretty safe to say that the problems you’re having are related to your RabbitMQ cluster and your networking setup. The RabbitMQ transport relies on the actual RabbitMQ C# client to handle connection recovery.

There’s nothing specific to NServiceBus about the problem you’re describing. I think you’ll likely need to talk to Pivotal to gain more insight.
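For reference, the recovery mechanism the transport leans on is the RabbitMQ .NET client’s automatic connection recovery. Conceptually it amounts to the following (a sketch against the raw RabbitMQ.Client 6.x-style API, not something you configure directly on an NServiceBus endpoint):

```csharp
using System;
using RabbitMQ.Client;

var factory = new ConnectionFactory
{
    // Placeholder host name: in your case, the load-balanced public IP of the cluster.
    HostName = "broker-hostname-or-vip",
    // Transparently re-establish the connection after a network failure or broker restart.
    AutomaticRecoveryEnabled = true,
    // Re-declare exchanges, queues, bindings, and consumers once the connection is back.
    TopologyRecoveryEnabled = true,
    NetworkRecoveryInterval = TimeSpan.FromSeconds(10),
};

using var connection = factory.CreateConnection();
```

That recovery only helps if the client can still reach a broker at the same address it originally connected to, which is why the load balancer / virtual IP setup in front of the cluster is the first thing to look at.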