Handling SQL Server AG failover with NServiceBus.Router

Thorarin · October 19, 2021, 10:02am

Since this thing is non-trivial for me to reproduce, I figured I’d try my luck here.

I’m running a simple .NET (Core) application that is routing messages from a SQL Server transport to Azure Service Bus. Recently there was a hiccup in our SQL Server availability group which isn’t handled very well by our router application. I think one of the servers was being restarted.

I see a couple of these messages logged by NServiceBus.Transport.SqlServer.QueuePeeker:

Unable to access availability database ‘MyDatabase’ because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role.

Then this by NServiceBus.Transport.SqlServer.DueDelayedMessageProcessor:

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)

Then NServiceBus.Raw.RunningRawEndpointInstance:

Receiver stopped.

Finally from NServiceBus.Raw.StoppableRawEndpoint:

Initiating shutdown.

Shutdown complete.

So the endpoint shut down. I would like to be able to handle this in my application somehow. The router endpoint is started in a BackgroundService, and I would like to detect that this happened and either shut down my entire application or restart the router after some delay.

Does anybody have some pointers on how I could detect this so I can stop my router and start a new one?

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        NServiceBus.Logging.LogManager.UseFactory(new NServiceBus.Extensions.Logging.ExtensionsLoggerFactory(loggerFactory));

        var routerConfig = new RouterConfiguration(routingConfig.QueueName)
        {
            PoisonQueueName = routingConfig.PoisonQueueName,
            CircuitBreakerThreshold = routingConfig.CircuitBreakerThreshold,
            ImmediateRetries = routingConfig.ImmediateRetries,
            DelayedRetries = routingConfig.DelayedRetries
        };

        AddSqlInterface(routerConfig);
        AddAzureServiceBusInterface(routerConfig);

        var staticRouting = routerConfig.UseStaticRoutingProtocol();
        staticRouting.AddForwardRoute(InterfaceNameSql, InterfaceNameAzureServiceBus);
        staticRouting.AddForwardRoute(InterfaceNameAzureServiceBus, InterfaceNameSql);

        if (routingConfig.AutoCreateQueues)
        {
            routerConfig.AutoCreateQueues();
        }

        routerConfig.AddMessageLogging(loggerFactory);
        
        var router = Router.Create(routerConfig);
        await router.Start().ConfigureAwait(false);
        await Cancelled(stoppingToken);
        await router.Stop().ConfigureAwait(false);
    }

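    private static Task Cancelled(CancellationToken cancellationToken)
    {
        // Complete without throwing when the token is cancelled. Awaiting
        // SemaphoreSlim.WaitAsync(token) throws OperationCanceledException on
        // cancellation, which would skip the router.Stop() call in ExecuteAsync.
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        cancellationToken.Register(() => tcs.TrySetResult(true));
        return tcs.Task;
    }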

    private void AddSqlInterface(RouterConfiguration router)
    {
        router.AddInterface<SqlServerTransport>(InterfaceNameSql, t =>
        {
            t.DefaultSchema(routingConfig.SqlEndpoint.Schema);
            t.ConnectionString(configuration.GetConnectionString(routingConfig.SqlEndpoint.ConnectionStringName));
            t.WithPeekDelay(TimeSpan.FromSeconds(routingConfig.SqlEndpoint.PeekDelaySeconds));
            t.Transactions(TransportTransactionMode.SendsAtomicWithReceive);
        });
    }
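
The "restart the router after some delay" option could be sketched roughly as below. This is only an illustration using the base class library: `RouterSupervisor`, `RunWithRestartAsync`, and the `runOnce` delegate are hypothetical names, and the sketch assumes `runOnce` starts the router, returns (or throws) once the router is no longer receiving, and stops it before returning. How to detect that the receiver has stopped is exactly the open question here, so that part is left to the delegate.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class RouterSupervisor
{
    // Runs `runOnce` in a loop until `stoppingToken` is cancelled, waiting
    // `restartDelay` between attempts. Returns the number of restarts performed.
    // `runOnce` is a hypothetical delegate (an assumption, not a Router API).
    public static async Task<int> RunWithRestartAsync(
        Func<CancellationToken, Task> runOnce,
        TimeSpan restartDelay,
        CancellationToken stoppingToken)
    {
        var restarts = 0;
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await runOnce(stoppingToken).ConfigureAwait(false);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                break; // normal application shutdown
            }
            catch (Exception ex)
            {
                // Any other failure is logged and followed by a restart.
                Console.Error.WriteLine($"Router run failed: {ex.Message}");
            }

            if (stoppingToken.IsCancellationRequested)
            {
                break;
            }

            try
            {
                // Give the database or broker some time to recover.
                await Task.Delay(restartDelay, stoppingToken).ConfigureAwait(false);
            }
            catch (OperationCanceledException)
            {
                break; // shutdown requested while waiting
            }

            restarts++;
        }

        return restarts;
    }
}
```

In ExecuteAsync, the delegate would wrap the existing Router.Create / router.Start() / router.Stop() sequence, so each iteration works with a freshly created router.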

ramonsmits · October 19, 2021, 11:44am

Critical errors are usually unrecoverable. Unless a critical error callback is registered, the endpoint relies on the host to restart the process, in the hope that the issue has been resolved by then. Maybe something similar is possible with the Router, which is maintained by @SzymonPobiega
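
For reference, on a regular NServiceBus endpoint (not the Router) the critical error callback is registered through DefineCriticalErrorAction. The fragment below is a minimal sketch assuming the NServiceBus 7 callback signature; newer major versions may differ, the class name `CriticalErrorSetup` is made up for illustration, and nothing here implies the Router exposes the same hook.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

public static class CriticalErrorSetup
{
    // Registers a callback that runs when NServiceBus raises a critical error.
    public static void Configure(EndpointConfiguration endpointConfiguration)
    {
        endpointConfiguration.DefineCriticalErrorAction(OnCriticalError);
    }

    private static async Task OnCriticalError(ICriticalErrorContext context)
    {
        try
        {
            // Try to stop the endpoint gracefully first.
            await context.Stop().ConfigureAwait(false);
        }
        finally
        {
            // Terminate the process so the host (service manager, container
            // orchestrator) can start a fresh instance.
            Environment.FailFast($"Critical error: {context.Error}", context.Exception);
        }
    }
}
```

Killing the process is a deliberate choice in this sketch: it hands the restart decision to the host instead of leaving a stopped endpoint inside a process that still looks alive.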

I would highly recommend hosting the router in its own process so that it can be restarted in isolation when there are connectivity issues, for example, as here, a failover that invalidates existing connections.

SzymonPobiega · October 19, 2021, 1:28pm

Hi

Thanks for the detailed report. Indeed, the root cause is critical error handling, just as @ramonsmits pointed out. NServiceBus used to have a default critical error handler that stopped the endpoint but did not kill the process.

That behavior was changed in NServiceBus long ago because the transports are now able to re-establish connections with the broker even after a critical error has been raised. Unfortunately, that change was never applied to the Router.

I have just released Router 3.9.2, which changes this behavior. The Router now logs the fact that a critical error has been raised but does not stop the receiver, so the receiver can recover from the problem.

By the way, I noticed nuget.org is having issues, so it might take a while until the package is indexed and available.

Thorarin · October 19, 2021, 2:31pm

@ramonsmits It is running in a dedicated process, so restarting would be fine. However, since the application was not shutting down, that was difficult to do. By the time we noticed, 120k messages were waiting.

@SzymonPobiega Thanks for the incredibly quick fix. I’ll try to update the packages ASAP and run it in our various pre-production environments for a while.