How come sometimes my messages on the NServiceBus queue get an infinite DeliveryCount?

I already posted this question on Stack Overflow, but was advised to post it here too. StackOverflow url: c# - How come sometimes my messages on the NServicebus queue get an infinite DeliveryCount? - Stack Overflow

For some of my NServiceBus integrations, I see that my messages “get stuck”. My queue receives messages from one of my sagas, and normally everything works fine: the message comes in, it’s processed, and then it’s removed from the queue. Sometimes, however, messages stop being removed from the queue. The message is still processed (in this case the data is written to a SQL database), but it stays on the queue and the DeliveryCount goes up by 1 on every attempt. When I notice this, I disable and re-enable the WebJob that handles my sagas (sometimes the DeliveryCount for individual messages reaches values of over 3000, even though the default MaxDeliveryCount should be 6 according to the documentation). That fixes the problem for a while, until it shows up again at some point.

Some of the WebJobs run on .NET Framework 4.6.1; these all run fine. The ones that sometimes “stop working” are the ones built on .NET Core 2.1. I’m not implying that this is a problem with the framework, but my guess is that the error might have to do with how the endpoint is set up (because the endpoint configuration is a bit different between these versions).

I have already tried to replicate the error by sending 15,000+ messages to the queue, and by disabling the WebJobs and only activating them when the queue is already full. Nothing works; the problem shows up at random. Most of the time everything goes fine, up until the point where it just won’t anymore.

private async Task<EndpointConfiguration> BuildDefaultConfiguration()
{
    var environment = this.configuration["Environment"];
    var endpointConfiguration = new EndpointConfiguration(this.endpointName);

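    // ServiceControl integration: heartbeats, plus the error and audit queues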
    endpointConfiguration.SendHeartbeatTo($"particular.servicecontrol.{environment}");
    endpointConfiguration.SendFailedMessagesTo("error");
    endpointConfiguration.AuditProcessedMessagesTo("audit");

    var host = Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID") ?? Environment.MachineName;
    endpointConfiguration
        .UniquelyIdentifyRunningInstance()
        .UsingNames(environment, host)
        .UsingCustomDisplayName(environment);

    var metrics = endpointConfiguration.EnableMetrics();
    metrics.SendMetricDataToServiceControl($"particular.monitoring.{environment}", TimeSpan.FromSeconds(2));

    endpointConfiguration.UseContainer<NinjectBuilder>(customizations =>
    {
        customizations.ExistingKernel(this.kernel);
    });

    endpointConfiguration.ApplyCustomConventions();
    endpointConfiguration.EnableInstallers();
    endpointConfiguration.UseSerialization<NewtonsoftSerializer>();

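    // Azure Service Bus transport over WebSockets, prefetching one message at a time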
    var connectionString = this.configuration["ConnectionStrings:ServiceBus"];
    var transportExtensions = endpointConfiguration.UseTransport<AzureServiceBusTransport>();
    transportExtensions.ConnectionString(connectionString);
    transportExtensions.UseWebSockets();
    transportExtensions.PrefetchCount(1);

    // license
    var cloudStorageAccount = CloudStorageAccount.Parse(this.configuration["ConnectionStrings:Storage"]);
    var cloudBlobClient = cloudStorageAccount.CreateCloudBlobClient();
    var cloudBlobContainer = cloudBlobClient.GetContainerReference("configurations");
    await cloudBlobContainer.CreateIfNotExistsAsync().ConfigureAwait(false);
    var blockBlobReference = cloudBlobContainer.GetBlockBlobReference("license.xml");
    endpointConfiguration.License(await blockBlobReference.DownloadTextAsync().ConfigureAwait(false));

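    // on a critical error: stop the endpoint, then terminate the process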
    endpointConfiguration.DefineCriticalErrorAction(async context =>
    {
        try
        {
            await context.Stop().ConfigureAwait(false);
        }
        finally
        {
            Environment.FailFast($"Critical error shutting down:'{context.Error}'.", context.Exception);
        }
    });

    return endpointConfiguration;
}

I have included the function that sets up the EndpointConfiguration. I expect I’m missing something here that is causing the error, but I have no idea what it is. To clarify: I do not actually get an error. I just notice that messages are being processed but not removed from the queue.

Edit:

I have added a screenshot of the queue, taken through QueueExplorer, to visualize the problem. There were around 300 messages on the queue before I took the screenshot; those were all being blocked by the messages you can see in the picture. A simple restart of the WebJob that contains the handlers for these messages fixes the problem, until it goes wrong again.

[Screenshot: DeliveryCountBug]

Could you please point to the documentation you’re referring to?

From the code, it appears you’re using the new ASB transport, not the legacy one. The new transport sets the queue’s MaxDeliveryCount to the maximum (int.MaxValue), and recoverability is supposed to take care of a failed message and move it to the error queue. From the configuration code it looks like the default recoverability is used (five immediate retries and three delayed retries), so the message should go to the error queue after 24 attempts, as per the documentation.
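
For reference, written out explicitly those defaults amount to roughly the following (just an illustration of what your endpoint is already doing; the retry numbers below are the documented defaults):

var recoverability = endpointConfiguration.Recoverability();
recoverability.Immediate(immediate => immediate.NumberOfRetries(5)); // 1 initial attempt + 5 immediate retries
recoverability.Delayed(delayed => delayed.NumberOfRetries(3));       // 3 delayed rounds, each repeating the immediate cycle
// total attempts: (1 + 5) * (1 + 3) = 24 before the message is moved to the error queue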

A few questions:

  1. Do you see MaxDeliveryCount on the queue set to something different than int.MaxValue? (A quick way to check this is sketched after the list.)
  2. When the messages are “infinitely” retried, do you see a log statement about moving messages to the error queue?
  3. Are you using the latest version of the transport? There was an issue with the SDK in the past that was fixed (Incoming messages are retried infinitely and log a large amount of exceptions · Issue #51 · Particular/NServiceBus.Transport.AzureServiceBus · GitHub).
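
For question 1, a minimal sketch of how you could check the queue’s MaxDeliveryCount, assuming the Microsoft.Azure.ServiceBus management client (the queue name below is a placeholder):

using Microsoft.Azure.ServiceBus.Management;

var managementClient = new ManagementClient(connectionString); // the same connection string you pass to the transport
var queueDescription = await managementClient.GetQueueAsync("your-endpoint-queue"); // placeholder queue name
Console.WriteLine(queueDescription.MaxDeliveryCount); // the new transport creates queues with int.MaxValue here
await managementClient.CloseAsync();

The same value is also visible on the queue in the Azure portal.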

This is far from normal behaviour, which is why I suggested raising a support case: it lets you share details you wouldn’t be comfortable posting on a public forum. Once the issue is identified, you can post the findings here to share with others.

Thanks for the reply, Sean.

About the MaxDeliveryCount being set to 6 by default: I couldn’t find where I read that, so I must have misread it somewhere.

For a bit there I thought the bug you mentioned was what was causing our problems, but our transport package is already up to date, so that couldn’t be it. I have, however, added code to the endpoint configuration that sets the number of immediate and delayed retries, and I have been running tests with it since yesterday. Nothing has broken yet, so for now I’m cautiously optimistic that this might have fixed the issue.

I will report back if the problems show up again, but for now I think adding the following code has fixed it:

var recoverabilitySettings = endpointConfiguration.Recoverability();
recoverabilitySettings.Immediate(immediate => immediate.NumberOfRetries(2));
recoverabilitySettings.Delayed(delayed =>
{
    delayed.TimeIncrease(TimeSpan.FromMinutes(1));
    delayed.NumberOfRetries(1);
});
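
If I follow the recoverability math from your reply, with these settings a message should be attempted (1 + 2) * (1 + 1) = 6 times in total before being moved to the error queue, instead of retrying indefinitely.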

Sounds good. If you still run into this issue, please open a support case to dig deeper and provide some more specifics. You can reference this thread to provide the context. Thank you.