Sporadic error starting host with RabbitMQ in cluster mode

Dunge1 · January 23, 2020, 7:24pm

A solution using NServiceBus7 and NServiceBus.RabbitMQ is working perfectly in a single host system.

We developed some logic to have redundancy servers to have high availability (all on the same local network). RabbitMQ is set up in cluster mode with queues replicated using HA policies. Every NServiceBus services connect to the localhost RabbitMQ node living on their own host.

Up to now tests are working fine from a logical and functional point of view, but there’s some strange sporadic behavior when starting multiple services at the same time, as if RabbitMQ is not always responding (even though its uptime remains and the management webpage say all nodes are green).

This happens on a await Endpoint.Start(configuration); command:

Error 1 (timeout which shouldn’t happen because the service is alive):

2020-01-23 13:36:41.8380 #5 Failed to start. System.TimeoutException: The operation has timed out.
at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean passive, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary2 arguments) at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary2 arguments)
at NServiceBus.Transport.RabbitMQ.ConventionalRoutingTopology.Initialize(IModel channel, IEnumerable1 receivingAddresses, IEnumerable1 sendingAddresses)
at NServiceBus.Transport.RabbitMQ.QueueCreator.CreateQueueIfNecessary(QueueBindings queueBindings, String identity)
at NServiceBus.InitializableEndpoint.d__1.MoveNext()
— End of stack trace from previous location where exception was thrown —
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at NServiceBus.Endpoint.d__1.MoveNext()
— End of stack trace from previous location where exception was thrown —
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Server.TaskManager.Host.d__8.MoveNext() in C:\TeamCity\buildAgent\work\5b1345fe6501ed74\Sources\JamLogic\Server\TaskManager\Host.cs:line 60

Error2:

2020-01-23 18:17:49.6237 Server.ProgramScheduler.Host Failed to start. System.Exception: Channel has been closed: AMQP close-reason, initiated by Application, code=200, text=“Connection close forced”, classId=0, methodId=0, cause=
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at NServiceBus.SerializeMessageConnector.d__1.MoveNext()

Both seems to be errors coming from RabbitMQ and not NServiceBus directly, but from my setup I would expect RabbitMQ to be available at this point so I wonder why it returns these errors. Those errors also happens on a single service out of ten randomly when they all start simultaneously. And of course, just retrying a start again and it works.

I’m just wondering if there’s a way to solidify the connection to make sure they always starts correctly, and if any of you have an idea why RabbitMQ throw these errors.

WilliamBZA · January 27, 2020, 8:11am

Did these errors happen with on a single Rabbit instance before moving to a cluster with HA? I wonder if there isn’t some occasional deadlocking going on with multiple starting at the same time and the HA having to keep everything in sync?

How did you configure your HA?

Dunge1 · January 27, 2020, 3:55pm

No, we ran for years with a single instance without seeing this kind of issues, but we did (and still) have “RequestedHeartbeat=60;” into the rabbitmq connectionstring because it had disconnections while backups were using the disk once per week (already discussed in another thread on this forum).

Yes, it definitively seems like like you said, occasional deadlocking going on with multiple starting at the same time.

HA is defined as such (pretty basic config):

Name: HA
Pattern: .*
Apply to: Exchanges and queues

ha-mode: exactly
ha-params: 2
ha-sync-mode: automatic
max-length-bytes: 50000000

We have 13 services per host, so multiplied by 3 hosts we are now at 39 services pretty much all trying to connect to the broker simultaneously. Still, I would expect RabbitMQ to be able to handle this kind of load.

Dunge1 · February 7, 2020, 10:10pm

I’ve encountered a RabbitMQ cluster error with autoheal failing: Redirecting to Google Groups

Based on the comment, I’ve updated to RabbitMQ 3.8.2 Erlang 22.2
Now all 3 nodes are active and running and green in the RabbitMQ management console.
All services on node1 works perfectly, but on node2 and node3 they refuse to start randomly with the errors:

020-02-07 22:06:44.3955 NServiceBus.ReceiveComponent Receiver Main failed to initialize. System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
   at RabbitMQ.Client.Framing.Impl.Model._Private_QueuePurge(String queue, Boolean nowait)
   at NServiceBus.Transport.RabbitMQ.QueuePurger.Purge(String queue)
   at NServiceBus.Transport.RabbitMQ.MessagePump.Init(Func`2 onMessage, Func`2 onError, CriticalError criticalError, PushSettings settings)
   at NServiceBus.TransportReceiver.Init()
   at NServiceBus.ReceiveComponent.<Initialize>d__2.MoveNext()

or this:

2020-02-07 22:24:09.8529 #4 Failed to start. System.TimeoutException: The operation has timed out.
   at RabbitMQ.Util.BlockingCell.GetValue(TimeSpan timeout)
   at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
   at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean passive, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)
   at RabbitMQ.Client.Impl.ModelBase.QueueDeclare(String queue, Boolean durable, Boolean exclusive, Boolean autoDelete, IDictionary`2 arguments)
   at NServiceBus.Transport.RabbitMQ.DelayInfrastructure.Build(IModel channel)
   at NServiceBus.Transport.RabbitMQ.QueueCreator.CreateQueueIfNecessary(QueueBindings queueBindings, String identity)
   at NServiceBus.InitializableEndpoint.<Initialize>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at NServiceBus.Endpoint.<Start>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at VerMac.JamLogic.Server.TaskManager.Host.<Start>d__8.MoveNext() in C:\TeamCity\buildAgent\work\5b1345fe6501ed74\Sources\JamLogic\Server\TaskManager\Host.cs:line 41

I don’t understand, the service is running on the same computer, why can’t it connect?

I also got a lot of those error in my log files:

2020-02-07 21:59:26.2380 RingBufferExtensions Failed to buffer metrics data for $ProcessingTime after $10 attempts.

And I sometimes have this error in the rabbitmq server logs:
2020-02-07 22:36:37.249 [error] <0.13385.0> Supervisor {<0.13385.0>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.13380.0>, <0.13386.0>, <0.13380.0>, <<"[::1]:50453 -> [::1]:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.13381.0>, <0.13387.0>) at <0.13388.0> exit with reason killed in context shutdown_error

I also have another service that started successfully with no errors, but does not receive any message and the queue keeps growing. I’m positive the names are set correctly.

It all seems so random, sometimes some start, sometimes they don’t. It seems that if I wait for around 1hour and then try again then sometimes it will start. I can’t understand anything of what’s going on or how to troubleshoot my issues. Please help!

SeymourRu · July 27, 2021, 10:38pm

@Dunge Did you manage to find out where problem came from?
We have same one here and it is a bit irritating

Dunge1 · July 27, 2021, 11:19pm

Not necessarily, but it does seems to be much stabler now. One thing for sure, RabbitMQ errors are so vague it’s a hell of a hard thing to diagnose.

A few tips:

Update both Erlang and RabbitMQ to latest, they fixed a lot of things since.
If your machines are in a VM environment, be sure to NEVER pause the OS. Do not put it on hold temporarily, do not use a backup solution that does a quick pause. I believe this was my main problem, as soon as the backup would start, the cluster would fail because the internal OS clock would stall for a few seconds. Realistically, it’s useless to backup a message queue anyway.
Do not set partitioning strategy to “autoheal”. Set it to “pause-minority” instead. Otherwise the disconnected node try to run on its own, and when the cluster recover it will often crash/restart the nodes that continued to work. You prefer the disconnected node to stop its work while disconnected and resume to the state of the other two when it comes back.
Apprently, “Quorum Queues” are much more stable than HA queue. Unfortunately, NServiceBus doesn’t support them.