DueDelayedMessageProcessor prevents/delays shutdown

When the Transport database cannot be accessed continuously we keep getting the following log messages

FATAL NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor [(null)] Exception thrown while moving matured delayed messages

for several minutes. This keeps the process alive but non-functioning. I would expect that the CriticalErrorAction should be triggered after a shorter amount of time. Our CriticalErrorAction would kill the process, so that our process monitoring picks up the dead process and alerts IT staff. And it appears that sometimes it stops processing anything and stays alive as a zombie, and in ServicePulse it’s shown as missing heartbeats (as expected).

Which configuration options influence this behavior?

We’re on NSB 7.4.3, ServiceBus.SqlServer 6.1.2.

Let me know, what additional information I should provide.

Thanks, Philipp

Do the log files show any other type of error/warning during this log entry?

The endpoint will only trigger a critical error is the transport is not available for several minutes.

Could it be that your handlers get “stuck” due to this?

Consider using https://github.com/ramonsmits/NServiceBus.SimpleMonitoring to see if your pipeline has stuck handlers.

For further reference, this is scenario 1

I get the following behavior when the transport database server is not available when starting the process. So, a stuck handler should not be the issue.

I remember that it was a lot quicker in this scenario to give up and kill the process in the past (happened all the time after restarting the developer machine and forgetting to start transport database). We migrated from NSB v5 to v6 recently. Did some of that behavior change from v5 to v6?

Is there a way to control how long it takes for it to give up?

We get some more messages. NServiceBus is configured to only emit log messages of WARN or higher (using log4net).
After those messages: the last one, as mentioned above, is repeating in an increasing pattern every second.

5x followed by a break of 20 seconds
10x followed by the break
20x, 40x …

This goes on for more than 20 minutes now, with no sign of the process to give up.
After starting the transport database these log messages stop.

BUT: the endpoint does not process any of the messages waiting in its queue table
So, this leads to:

Process is never stopping if the database connection comes eventually, but it will not process messages which renders it a kind of zombie that we cannot detect. except by missing heartbeats, but this information cannot be used (at least currently) to send alarm messages (SMS, phone calls) to the IT staff.

2020-12-17 12:41:33,409 WARN NServiceBus.Transport.SqlServer.ExpiredMessagesPurger [(null)] Purging expired messages from table [nsfusion_transport].[transport].[xbrl_Tagging] failed after purging 0 messages.

2020-12-17 12:41:33,409 FATAL NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor [(null)] Exception thrown while moving matured delayed messages

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)

 ---> System.ComponentModel.Win32Exception (53): The network path was not found.

   at System.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, Boolean applyTransientFaultHandling, String accessToken)

   at System.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)

   at System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)

   at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)

   at System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)

   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)

   at System.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()

--- End of stack trace from previous location where exception was thrown ---

   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.<>c__DisplayClass2_0.<<Default>b__0>d.MoveNext() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 37

--- End of stack trace from previous location where exception was thrown ---

   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.OpenNewConnection() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 21

   at NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor.MoveMaturedDelayedMessages() in /_/src/NServiceBus.Transport.SqlServer/DelayedDelivery/DueDelayedMessageProcessor.cs:line 47

ClientConnectionId:00000000-0000-0000-0000-000000000000

Error Number:53,State:0,Class:20

2020-12-17 12:41:33,416 FATAL NServiceBus.ReceiveComponent [(null)] Receiver Main failed to initialize.

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)

 ---> System.ComponentModel.Win32Exception (53): The network path was not found.

   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)

   at System.Data.ProviderBase.DbConnectionPool.WaitForPendingOpen()

--- End of stack trace from previous location where exception was thrown ---

   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.<>c__DisplayClass2_0.<<Default>b__0>d.MoveNext() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 37

--- End of stack trace from previous location where exception was thrown ---

   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.OpenNewConnection() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 21

   at NServiceBus.Transport.SqlServer.ExpiredMessagesPurger.Purge(TableBasedQueue queue, CancellationToken cancellationToken) in /_/src/NServiceBus.Transport.SqlServer/Receiving/ExpiredMessagesPurger.cs:line 28

   at NServiceBus.Transport.SqlServer.MessagePump.PurgeExpiredMessages() in /_/src/NServiceBus.Transport.SqlServer/Receiving/MessagePump.cs:line 189

   at NServiceBus.Transport.SqlServer.MessagePump.Init(Func`2 onMessage, Func`2 onError, CriticalError criticalError, PushSettings settings) in /_/src/NServiceBus.Transport.SqlServer/Receiving/MessagePump.cs:line 54

   at NServiceBus.ReceiveComponent.PrepareToStart(IBuilder builder, RecoverabilityComponent recoverabilityComponent, MessageOperations messageOperations, PipelineComponent pipelineComponent, IPipelineCache pipelineCache) in /_/src/NServiceBus.Core/Receiving/ReceiveComponent.cs:line 166

ClientConnectionId:00000000-0000-0000-0000-000000000000

Error Number:53,State:0,Class:20

2020-12-17 12:41:34,447 FATAL NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor [(null)] Exception thrown while moving matured delayed messages

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)

 ---> System.ComponentModel.Win32Exception (53): The network path was not found.

   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)

   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)

   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)

   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)

   at System.Data.ProviderBase.DbConnectionClosed.TryOpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)

   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)

   at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)

--- End of stack trace from previous location where exception was thrown ---

   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.<>c__DisplayClass2_0.<<Default>b__0>d.MoveNext() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 37

--- End of stack trace from previous location where exception was thrown ---

   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.OpenNewConnection() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 21

   at NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor.MoveMaturedDelayedMessages() in /_/src/NServiceBus.Transport.SqlServer/DelayedDelivery/DueDelayedMessageProcessor.cs:line 47

ClientConnectionId:00000000-0000-0000-0000-000000000000

Error Number:53,State:0,Class:20

I believe the FATAL NServiceBus.ReceiveComponent [(null)] Receiver Main failed to initialize. is the critical part in the log.

For further reference this is scenario 2.

Now I tested the same but with the scenario where the transport database is available, the endpoint process is started, processes messages, and then I’m killing the transport database server.
The endpoint process managed to empty the queue before the transport database got unavailable.

We get the following log messages, where most of them repeat, until the process shuts down after around 1 minute and triggers the CriticalErrorAction, as expected

2020-12-17 13:12:26,798 WARN NServiceBus.Heartbeat.HeartbeatSender [(null)] Unable to send heartbeat to ServiceControl.
System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)
 ---> System.ComponentModel.Win32Exception (64): The specified network name is no longer available.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at NServiceBus.Transport.SqlServer.MessageDispatcher.DispatchUsingNewConnectionAndTransaction(IEnumerable`1 operations) in /_/src/NServiceBus.Transport.SqlServer/Sending/MessageDispatcher.cs:line 118
   at NServiceBus.Transport.SqlServer.MessageDispatcher.DispatchDefault(SortingResult sortedOperations, TransportTransaction transportTransaction) in /_/src/NServiceBus.Transport.SqlServer/Sending/MessageDispatcher.cs:line 105
   at NServiceBus.Transport.SqlServer.MessageDispatcher.Dispatch(TransportOperations operations, TransportTransaction transportTransaction, ContextBag context) in /_/src/NServiceBus.Transport.SqlServer/Sending/MessageDispatcher.cs:line 39
   at NServiceBus.Heartbeat.HeartbeatSender.SendHeartbeatMessage()
ClientConnectionId:599fd8ee-6a96-4ff1-a650-bab1af10159d
Error Number:64,State:0,Class:20
2020-12-17 13:12:26,798 FATAL NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor [(null)] Exception thrown while moving matured delayed messages
System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)
 ---> System.ComponentModel.Win32Exception (64): The specified network name is no longer available.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor.MoveMaturedDelayedMessages() in /_/src/NServiceBus.Transport.SqlServer/DelayedDelivery/DueDelayedMessageProcessor.cs:line 49
ClientConnectionId:86c605e9-b9b6-42dd-9903-7b72feb8318f
Error Number:64,State:0,Class:20
2020-12-17 13:12:26,799 ERROR NServiceBus.Metrics.ServiceControl.ServiceControlReporting.NServiceBusMetadataReport [(null)] Error while sending metric data to NServiceBus.Routing.UnicastAddressTag.
System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)
 ---> System.ComponentModel.Win32Exception (64): The specified network name is no longer available.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at NServiceBus.Transport.SqlServer.MessageDispatcher.DispatchUsingNewConnectionAndTransaction(IEnumerable`1 operations) in /_/src/NServiceBus.Transport.SqlServer/Sending/MessageDispatcher.cs:line 118
   at NServiceBus.Transport.SqlServer.MessageDispatcher.DispatchDefault(SortingResult sortedOperations, TransportTransaction transportTransaction) in /_/src/NServiceBus.Transport.SqlServer/Sending/MessageDispatcher.cs:line 105
   at NServiceBus.Transport.SqlServer.MessageDispatcher.Dispatch(TransportOperations operations, TransportTransaction transportTransaction, ContextBag context) in /_/src/NServiceBus.Transport.SqlServer/Sending/MessageDispatcher.cs:line 39
   at NServiceBus.Metrics.ServiceControl.ServiceControlReporting.NServiceBusMetadataReport.RunReportAsync()
ClientConnectionId:2e8a9c1b-c99e-47e1-a2ff-114dd18178b5
Error Number:64,State:0,Class:20
2020-12-17 13:12:26,799 WARN NServiceBus.Transport.SqlServer.QueuePeeker [(null)] Sql peek operation failed
System.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The specified network name is no longer available.)
 ---> System.ComponentModel.Win32Exception (64): The specified network name is no longer available.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at NServiceBus.Transport.SqlServer.QueuePeeker.Peek(TableBasedQueue inputQueue, RepeatedFailuresOverTimeCircuitBreaker circuitBreaker, CancellationToken cancellationToken) in /_/src/NServiceBus.Transport.SqlServer/Receiving/QueuePeeker.cs:line 77
ClientConnectionId:a0dc8ab1-3496-49a5-a682-c2cf6c157dbc
Error Number:64,State:0,Class:20
2020-12-17 13:12:26,881 WARN NServiceBus.Transport.SqlServer.RepeatedFailuresOverTimeCircuitBreaker [(null)] The circuit breaker for SqlPeek is now in the armed state
2020-12-17 13:12:43,910 WARN NServiceBus.Transport.SqlServer.QueuePeeker [(null)] Sql peek operation failed
System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at System.Data.Common.ADP.ExceptionWithStackTrace(Exception e)
--- End of stack trace from previous location where exception was thrown ---
   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.<>c__DisplayClass2_0.<<Default>b__0>d.MoveNext() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 46
--- End of stack trace from previous location where exception was thrown ---
   at NServiceBus.Transport.SqlServer.SqlConnectionFactory.OpenNewConnection() in /_/src/NServiceBus.Transport.SqlServer/Configuration/SqlConnectionFactory.cs:line 26
   at NServiceBus.Transport.SqlServer.QueuePeeker.Peek(TableBasedQueue inputQueue, RepeatedFailuresOverTimeCircuitBreaker circuitBreaker, CancellationToken cancellationToken) in /_/src/NServiceBus.Transport.SqlServer/Receiving/QueuePeeker.cs:line 77

so, if we combine these 2 scenarios from above in the real world, this is happening:

  1. Endpoint is running smoothly
  2. Transport DB gets down for more than 1 minute
  3. Endpoint crashes (Scenario 2)
  4. Service restart mechanism will restart the service
  5. DB is still down and the endpoint keeps trying to connect (Scenario 1)
  6. DB gets back up
  7. We have a living endpoint process that is never processing any messages

And this is a huge problem for us. I believe we’re not the only ones who might be affected. We didn’t change any configuration between migrating from NSB v5 to v6, only adapted according to the API changes.

@philippdolder Thanks for this detailed description. Based on your responses I should the following steps to reproduce:

  1. Create a backlog of messages
  2. Stop transport DB
  3. Wait until endpoint instance is restarted
  4. Start transport DB
  5. Endpoint not processing any messages from the backlog

I’ll try these steps and report my results.

@philippdolder I’ve tried to reproduce this.

At step 4. Start throws an exception and stops my process. This is as expected as transport recovery only works when the endpoint was Started.

Based on your response you seem to be suggesting that the endpoint got started. Could it be that this is an issue in your hosting code?

@ramonsmits

let me extract the core parts of our transport configuration. Could be that it’s our hosting code, but we didn’t change it apart from the necessary migration steps as outlined in the guides when migrating from SqlServer transport 4.3.1 to 6.1.2, and Sql persistence from 4.6.1 to 6.0.4.

And I’m pretty sure this behavior started to pop up just recently.

I’ll try to reproduce the scenario with the old version of our system before the upgrades and then get all the exact changes in the configuration/hosting code.
Otherwise I’ll try to make an isolated reproduction as a github repo.

At step 4. Start throws an exception and stops my process. This is as expected as transport recovery only works when the endpoint was Started.

so, when I understand correctly. In your case the endpoint shuts down almost immediately when starting up while the transport database is not available?

This seems to be the difference between your observation and mine

Hi @ramonsmits

I created a github repo for a reproduction of the issue I’m encountering with a minimalistic configuration that shows the issue we’re having.

The main branch contains the equivalent configuration before upgrading sql transport. Sql transport version is 5.0.3 and behavior is as expected and I understand that’s what you also saw in your repro. The process will die after trying to connect to the transport database for a couple of times.

The update-transport-to6.1.2 contains the version after updating transport and removing the now obsolete configuration for native delayed delivery. That’s the behavior we’re experiencing (the NSB versions used here are exactly what we use).

In the update-transport-to6.2 branch I updated to the latest available package version 6.2 to make sure it’s not because we use an outdated version. BUT: with this version the behavior gets even worse. Now there’s not even any exceptions thrown when the transport database cannot be reached.

Hope this helps to understand the issue I’m experiencing.

From my perspective the transport behavior for this scenario got worse with any updates I did.

Thank you for your feedback
Philipp

I tested this with the latest minor/patch releases:

  • NServiceBus 7.4.4
  • NServiceBus.Transport.SqlServer 6.2.0

This is the solution I’ve used:

I’ll look at your repo.

It seems that NServiceBus.SqlServer v5.0.3 does not fail at start. On your main branch what I see happening is that the endpoint starts and eventually a critical error gets raised.

Upgrading the transport to 6.2.0 results in fail on “Start”. However, your repro is using the BackgroundService. An exception gets thrown but the this hosting model is hiding the following exception:

System.Exception: Pre start-up check failed: Could not open connection to the SQL instance. Check the original error message for details. Original error message: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not
accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
   at NServiceBus.ReceiveComponent.ReceivePreStartupChecks() in /_/src/NServiceBus.Core/Receiving/ReceiveComponent.cs:line 184
   at NServiceBus.StartableEndpoint.Start() in /_/src/NServiceBus.Core/StartableEndpoint.cs:line 40
   at NServiceBus.HostingComponent.Start(IStartableEndpoint startableEndpoint) in /_/src/NServiceBus.Core/Hosting/HostingComponent.cs:line 76
   at NServiceBus.Endpoint.Start(EndpointConfiguration configuration) in /_/src/NServiceBus.Core/Endpoint.cs:line 31
   at Service.MyService.ExecuteAsync(CancellationToken token) in C:\temp\nsb-sql-transport-repro\Service\MyService.cs:line 80

However, for some reason the host isn’t exiting. If I replace the endpoint initialization with a simple throw new Exception("Boom!"); the host does exit.

I’m not sure why that is but I’m not familiar enough with BackgroundService.

I did test with https://docs.particular.net/nservicebus/hosting/extensions-hosting and that does fail at startup.

that’s the (only) way NServiceBus fails on startup that I knew of until now. At least this explains why after upgrading the transport it doesn’t fail the same way anymore.

I’ll look into the way we’re hosting the process and come back once I figured out what to change.
Thanks for your effort, this helped pinpoint the issue to the hosting model

The exception gets throws but the process then hangs. Why does is hang and not exit with a failure? It could be that we have something to fix here.

I did some more investigation and found a couple of sources related to the issue.

It appears that the BackgroundService actually goes down, but the application hosting it stays alive. Therefore it appears to live on as a zombie process.

I added 2 different solutions that both terminate the process if Endpoint.Start throws an exception:
Branch solution-failfast uses the same FailFast mechanism as we already use for the other failure scenarios. This exits the process with a non-zero exit code.

Branch solution-application-lifetime uses the IHostApplicationLifetime reference to stop the application in case of startup failure, but this exits the process with 0 which I don’t like at all because this suggests it’s a graceful shutdown. This solution is following the stack overflow link below.

I believe with this knowledge there’s nothing to be fixed in NServiceBus. BUT: it should be pointed out in the docs and migration guides, that the behavior has changed between the different transport versions.

I’ll also have closer look into the latest .NET Core hosting model along with the NServiceBus integration to see if this suits our needs to run background services.

Sources for reference:




EDIT: It seems our responses crossed each other and yours showed up after me submitting mine.

After experimenting a bit it seems the exception behavior is related to the hosting model:

With the following ExecuteAsync implementation the exception will not result in the host to shutdown and write an exception. Commenting the Task.Yield will result in the host to exit.

protected override async Task ExecuteAsync(CancellationToken token)
{
    await Task.Yield();  // or await Task.Delay(1);
    throw new Exception("Boom");
}

This surprised me a bit and did some googling finding https://stackoverflow.com/questions/55483080/net-core-ihostedservice-background-task-exception-will-not-terminate-applicatio which gave a solution. After taking a dependency on IHostApplicationLifetime and adding a try..catch where the catch will invoke IHostApplicationLifetime.StopApplication(); host now stops. However, the exception is NOT logged! It needs to be manually logged it seems

So, this seems to be the complete minimal implementation to log critical startup errors and stop the host.

readonly IHostApplicationLifetime applicationLifetime;
readonly ILogger logger;

public MyService(ILogger<MyService> logger, IHostApplicationLifetime applicationLifetime)
{
    this.applicationLifetime = applicationLifetime;
    this.logger = logger;
}

protected override async Task ExecuteAsync(CancellationToken ct)
{
    try
    {
        await Task.Delay(1, ct);
        throw new Exception("Boom");
    }
    catch (OperationCanceledException)
    {
    }
    catch (Exception e)
    {
        logger.LogCritical(e, "ExecuteAsync");
        applicationLifetime.StopApplication();
    }
}

I’ve applied these changes to your repro and created a PR and that now works:

I do recommend you to use NServiceBus.Extensions.Hosting when using the hosting model.