RabbitMQ socket exception 0x80004005 after some period of working

Hello. I have windows server with ServiceControl and remote RabbitMQ as a transport. Can’t get ServiceControl to Replay error messages because after starting Replay process ServiceControl starts logging lots of errors:

2019-03-13 08:50:20.5195|23|Warn|ServiceControl.Recoverability.ReturnToSenderDequeuer+CaptureIfMessageSendingFails|Failed to send '9efd8c26-995e-8bce-53c7-46b67650b751' message to 'TelemetryEndpoint' for retry. Attempting to revert message status to unresolved so it can be tried again.
System.Exception: Channel has been closed: AMQP close-reason, initiated by Library, code=541, text="Unexpected Exception", classId=0, methodId=0, cause=System.Net.Sockets.SocketException (0x80004005): ```
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
   в RabbitMQ.Client.Impl.Frame.ReadFrom(NetworkBinaryReader reader)
   в RabbitMQ.Client.Impl.SocketFrameHandler.ReadFrame()
   в RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
   в RabbitMQ.Client.Framing.Impl.Connection.MainLoop()
   в System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   в System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   в ServiceControl.Recoverability.ReturnToSender.<HandleMessage>d__1.MoveNext()
--- Конец трассировка стека из предыдущего расположения, где возникло исключение ---
   в System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   в System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   в ServiceControl.Recoverability.ReturnToSenderDequeuer.<Handle>d__6.MoveNext()
--- Конец трассировка стека из предыдущего расположения, где возникло исключение ---
   в System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   в System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   в NServiceBus.Transport.RabbitMQ.MessagePump.<Process>d__28.MoveNext()

And some other similar, all leading to .Net Socket Exception.

But I see connections from ServicePulse on RabbitMQ admin panel, I see growth of message count in ServiceControl.staging queue, I can communicate with this RMQ broker from other software on this machine. Everything shows that Network layer works fine. But ServiceControll app can’t Replay messages due to SocketExceptions. Also after fresh reboot ServiceControll is working for some seconds. It starts filling the “staging” queue, but after 10-15 seconds logs again full of SocketExceptions and ServiceControl again can’t do anything.

ServiceControl version is 3.6.6

What’s happening and where else can I seek for problem?

Hi

Looking at the error code (541) it seems like there is some kind of a problem on the server/broker side. Can you try to find the corresponding entry in the RabbitMQ server logs?

Szymon

Here’s another message from ServicePulse logs

2019-03-13 11:05:04.0378|19|Warn|ServiceControl.Recoverability.RetryProcessor|Attempt 1 of 5 to stage a retry message ca6909a4-1aef-2ea9-2ce5-b8a2771331b1 failed
RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Library, code=541, text="Unexpected Exception", classId=0, methodId=0, cause=System.Net.Sockets.SocketException (0x80004005): Попытка установить соединение была безуспешной, т.к. от другого компьютера за требуемое время не получен нужный отклик, или было разорвано уже установленное соединение из-за неверного отклика уже подключенного компьютера
   в RabbitMQ.Client.Impl.Frame.ReadFrom(NetworkBinaryReader reader)
   в RabbitMQ.Client.Impl.SocketFrameHandler.ReadFrame()
   в RabbitMQ.Client.Framing.Impl.Connection.MainLoopIteration()
   в RabbitMQ.Client.Framing.Impl.Connection.MainLoop()
   в RabbitMQ.Client.Framing.Impl.Connection.EnsureIsOpen()
   в RabbitMQ.Client.Framing.Impl.Connection.CreateModel()
   в NServiceBus.Transport.RabbitMQ.ConfirmsAwareChannel..ctor(IConnection connection, IRoutingTopology routingTopology, Boolean usePublisherConfirms)
   в NServiceBus.Transport.RabbitMQ.ChannelProvider.GetPublishChannel()
   в NServiceBus.Transport.RabbitMQ.MessageDispatcher.Dispatch(TransportOperations outgoingMessages, TransportTransaction transaction, ContextBag context)
   в ServiceControl.Recoverability.RetryProcessor.StageMessage(FailedMessage message, String stagingId)
   в ServiceControl.Recoverability.RetryProcessor.<TryStageMessage>d__8.MoveNext()

and found some warning messages in RMQ logs

rabbitmq_1       | 2019-03-13 08:09:23.005 [warning] <0.114.0> lager_error_logger_h dropped 9 messages in the last second that exceeded the limit of 1000 messages/sec
rabbitmq_1       | 2019-03-13 08:09:27.005 [warning]  lager_file_backend dropped 7 messages in the last second that exceeded the limit of 50 messages/sec
rabbitmq_1       | 2019-03-13 08:09:45.794 [warning] <0.13481.0> closing AMQP connection <0.13481.0> (172.26.1.26:50071 -> 172.20.0.2:5672 - Publish, vhost: 'nsbus', user: 'nsbus_user'):
rabbitmq_1       | 2019-03-13 08:09:48.001 [warning]  lager_file_backend dropped 5 messages in the last second that exceeded the limit of 50 messages/sec
rabbitmq_1       | 2019-03-13 08:09:59.003 [warning]  lager_file_backend dropped 8 messages in the last second that exceeded the limit of 50 messages/sec
rabbitmq_1       | 2019-03-13 08:10:10.003 [warning]  lager_file_backend dropped 5 messages in the last second that exceeded the limit of 50 messages/sec
rabbitmq_1       | 2019-03-13 08:10:14.934 [warning] <0.14097.0> closing AMQP connection <0.14097.0> (172.26.1.26:50078 -> 172.20.0.2:5672 - Publish, vhost: 'nsbus', user: 'nsbus_user')

hope this can help

Hi,

The error messages (lager_file_backend dropped) indicate that there are too many messages to be logged and are being dropped to not spam the logs. Could you please increase the limit of messages per seconds being logged and have a look into Rabbit logs to see if there is anything? Like suggested in here: RabbitMQ lager_error_logger_h dropped messages - Stack Overflow

At the moment the first exception simply states that operation cannot be completed as the connection to Rabbit is already closed. And the logs from rabbit have 2 entries about closing the connection and a whole set of messages saying that it si dropping log entries, so at the moment I can’t get much from those logs.

I suggest to increase the log limits and then start SC to get the 10-15 seconds of ‘working SC’ and when it starts spamming logs with exception entries stop it and take a look into log files. Both for NServiceBus and Rabbit. Please let us know of your findings

Regards
Michal