Sporadic error starting host with RabbitMQ in cluster mode

A solution using NServiceBus7 and NServiceBus.RabbitMQ is working perfectly in a single host system.

We developed some logic to have redundancy servers to have high availability (all on the same local network). RabbitMQ is set up in cluster mode with queues replicated using HA policies. Every NServiceBus services connect to the localhost RabbitMQ node living on their own host.

Up to now tests are working fine from a logical and functional point of view, but there’s some strange sporadic behavior when starting multiple services at the same time, as if RabbitMQ is not always responding (even though its uptime remains and the management webpage say all nodes are green).

This happens on a await Endpoint.Start(configuration); command:

Error 1 (timeout which shouldn’t happen because the service is alive):

2020-01-23 13:36:41.8380 #5 Failed to start. System.TimeoutException: The operation has timed out.
at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean passive, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary2 arguments) at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary2 arguments)
at NServiceBus.Transport.RabbitMQ.ConventionalRoutingTopology.Initialize(IModel channel, IEnumerable1 receivingAddresses, IEnumerable1 sendingAddresses)
at NServiceBus.Transport.RabbitMQ.QueueCreator.CreateQueueIfNecessary(QueueBindings queueBindings, String identity)
at NServiceBus.InitializableEndpoint.d__1.MoveNext()
— End of stack trace from previous location where exception was thrown —
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at NServiceBus.Endpoint.d__1.MoveNext()
— End of stack trace from previous location where exception was thrown —
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Server.TaskManager.Host.d__8.MoveNext() in C:\TeamCity\buildAgent\work\5b1345fe6501ed74\Sources\JamLogic\Server\TaskManager\Host.cs:line 60

Error2:

2020-01-23 18:17:49.6237 Server.ProgramScheduler.Host Failed to start. System.Exception: Channel has been closed: AMQP close-reason, initiated by Application, code=200, text=“Connection close forced”, classId=0, methodId=0, cause=
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at NServiceBus.SerializeMessageConnector.d__1.MoveNext()

Both seems to be errors coming from RabbitMQ and not NServiceBus directly, but from my setup I would expect RabbitMQ to be available at this point so I wonder why it returns these errors. Those errors also happens on a single service out of ten randomly when they all start simultaneously. And of course, just retrying a start again and it works.

I’m just wondering if there’s a way to solidify the connection to make sure they always starts correctly, and if any of you have an idea why RabbitMQ throw these errors.

Did these errors happen with on a single Rabbit instance before moving to a cluster with HA? I wonder if there isn’t some occasional deadlocking going on with multiple starting at the same time and the HA having to keep everything in sync?

How did you configure your HA?

No, we ran for years with a single instance without seeing this kind of issues, but we did (and still) have “RequestedHeartbeat=60;” into the rabbitmq connectionstring because it had disconnections while backups were using the disk once per week (already discussed in another thread on this forum).

Yes, it definitively seems like like you said, occasional deadlocking going on with multiple starting at the same time.

HA is defined as such (pretty basic config):

Name: HA
Pattern: .*
Apply to: Exchanges and queues

ha-mode: exactly
ha-params: 2
ha-sync-mode: automatic
max-length-bytes: 50000000

We have 13 services per host, so multiplied by 3 hosts we are now at 39 services pretty much all trying to connect to the broker simultaneously. Still, I would expect RabbitMQ to be able to handle this kind of load.

I’ve encountered a RabbitMQ cluster error with autoheal failing: https://groups.google.com/forum/#!topic/rabbitmq-users/_T2LQyrOxD4

Based on the comment, I’ve updated to RabbitMQ 3.8.2 Erlang 22.2
Now all 3 nodes are active and running and green in the RabbitMQ management console.
All services on node1 works perfectly, but on node2 and node3 they refuse to start randomly with the errors:

020-02-07 22:06:44.3955 NServiceBus.ReceiveComponent Receiver Main failed to initialize. System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.Model._Private_QueuePurge(String queue, Boolean nowait)
   at NServiceBus.Transport.RabbitMQ.QueuePurger.Purge(String queue)
   at NServiceBus.Transport.RabbitMQ.MessagePump.Init(Func`2 onMessage, Func`2 onError, CriticalError criticalError, PushSettings settings)
   at NServiceBus.TransportReceiver.Init()
   at NServiceBus.ReceiveComponent.<Initialize>d__2.MoveNext()

or this:

2020-02-07 22:24:09.8529 #4 Failed to start. System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean passive, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)
   at NServiceBus.Transport.RabbitMQ.DelayInfrastructure.Build(IModel channel)
   at NServiceBus.Transport.RabbitMQ.QueueCreator.CreateQueueIfNecessary(QueueBindings queueBindings, String identity)
   at NServiceBus.InitializableEndpoint.<Initialize>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at NServiceBus.Endpoint.<Start>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at VerMac.JamLogic.Server.TaskManager.Host.<Start>d__8.MoveNext() in C:\TeamCity\buildAgent\work\5b1345fe6501ed74\Sources\JamLogic\Server\TaskManager\Host.cs:line 41

I don’t understand, the service is running on the same computer, why can’t it connect?

I also got a lot of those error in my log files:

2020-02-07 21:59:26.2380 RingBufferExtensions Failed to buffer metrics data for $ProcessingTime after $10 attempts.

And I sometimes have this error in the rabbitmq server logs:
2020-02-07 22:36:37.249 [error] <0.13385.0> Supervisor {<0.13385.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.13380.0>, <0.13386.0>, <0.13380.0>, <<"[::1]:50453 -> [::1]:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.13381.0>, <0.13387.0>) at <0.13388.0> exit with reason killed in context shutdown_error

I also have another service that started successfully with no errors, but does not receive any message and the queue keeps growing. I’m positive the names are set correctly.

It all seems so random, sometimes some start, sometimes they don’t. It seems that if I wait for around 1hour and then try again then sometimes it will start. I can’t understand anything of what’s going on or how to troubleshoot my issues. Please help!