NSB stops consuming

Hello

I noticed that NSB stops consuming messages in a certain case. Is it possible for NSB to throw an exception that would, in turn, restart the pod?

My Docker container logs the following error (the error comes from EventStore). The container stays up but is not consuming any messages at all, and I have to restart it manually:
[22,12:06:01.344,ERROR] Error processing EventStore.ClientAPI.Internal.EstablishTcpConnectionMessage

EXCEPTION(S) OCCURRED:

System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known

at System.Net.Dns.InternalGetHostByName(String hostName)

at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)

at EventStore.ClientAPI.Internal.EndPointExtensions.ResolveDnsToIPAddress(EndPoint endpoint)

Hi @ekjuanrejon,

It will be tough to diagnose this without looking at log files and I imagine it’s not a good idea to post those on a public forum. Could you open a non-critical support case at Support options • Particular Software and include the endpoint’s log files in it? Also include the version of NServiceBus and the transport (and version of it) you are using. When we’re done there, we can report back here with the result to help others.

– Kyle

Hi

The exception seems to come from the EventStore client API. Where do you use the client API, and where is the exception raised?

NServiceBus itself manages circuit breakers for the infrastructure it controls. For example, both the persistence and the transport are crucial infrastructure that NServiceBus manages. If it loses the connection to the persistence or transport over a period of time, it triggers the critical error action.

The default critical error action in v7 is to log the exception and keep retrying indefinitely.

It is important to configure the critical error action according to your needs.
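For the pod-restart scenario asked about above, one common approach is to terminate the process from the critical error action so the orchestrator restarts the container. A minimal sketch, assuming NServiceBus v7 and a hypothetical endpoint name ("Sample"); adapt the shutdown behavior to your hosting model:

```csharp
var endpointConfiguration = new EndpointConfiguration("Sample");

endpointConfiguration.DefineCriticalErrorAction(
    async context =>
    {
        try
        {
            // Attempt a graceful stop of the endpoint first.
            await context.Stop().ConfigureAwait(false);
        }
        finally
        {
            // Fail fast so the process exits; in Kubernetes the kubelet
            // will then restart the pod.
            Environment.FailFast(
                $"Critical error, shutting down: {context.Error}",
                context.Exception);
        }
    });
```

With this in place, a triggered critical error takes the whole process down instead of retrying indefinitely, which trades in-process recovery for a clean container restart.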

It sounds, though, as if the exception is raised by the EventStore client that you use somewhere in your business code (we do not officially support EventStore as a persistence or transport). If the EventStore client access happens as part of handling messages, it is up to you to decide whether a specific exception should be treated specially. If nothing is configured, the connection loss above is treated as a regular, non-permanent error that might eventually resolve itself through retries; if it doesn't, the messages are moved to the error queue.

If you don’t want this behavior, it might make sense to configure a custom recoverability policy that triggers a critical error after having seen this exception over a certain period of time.
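A custom recoverability policy could look roughly like the sketch below. It assumes the EventStore client surfaces the failure as a `SocketException` (substitute the exception type your client actually throws) and uses an arbitrary threshold of two immediate retries; everything else falls through to the default policy:

```csharp
var recoverability = endpointConfiguration.Recoverability();

recoverability.CustomPolicy(
    (config, context) =>
    {
        // Hypothetical: treat DNS/connection failures from the EventStore
        // client specially instead of retrying like a transient error.
        if (context.Exception is System.Net.Sockets.SocketException &&
            context.ImmediateProcessingFailures > 2)
        {
            // Stop retrying and move the message to the error queue.
            // Alternatively, an injected CriticalError instance could be
            // used here to raise a critical error instead.
            return RecoverabilityAction.MoveToError(config.Failed.ErrorQueue);
        }

        // Defer to the built-in policy for all other exceptions.
        return DefaultRecoverabilityPolicy.Invoke(config, context);
    });
```

The policy is invoked once per failed message, so it is also the natural place to inspect the exception and decide between immediate retries, delayed retries, or the error queue.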

And, as @Kyle_Baley mentioned, if you can give us more context we might be able to help you resolve this issue quicker.

Regards
Daniel