Scheduled messages (e.g. Saga timeouts) cause over-scaling on the Azure Functions Consumption plan

Maybe this is just an “FYI”, as it is not caused by NServiceBus, but when hosting an NServiceBus endpoint with Azure Functions on a Consumption plan (perhaps Premium as well), scheduled messages on the queue will cause the scale controller to create more and more instances of the app, even when it’s under no load.

This will happen if you have a lot of Saga timeouts or other delayed messages, as all of these translate to scheduled messages on the Azure Service Bus queue. Apparently, the scale controller can’t distinguish between active and scheduled/deferred messages, so it sees a message and its time of arrival (which might be some time ago) but not that it should be processed later, and it starts spinning up new instances.
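For context, this is roughly what a saga timeout looks like in NServiceBus (a minimal sketch with illustrative type names; on the Azure Service Bus transport, `RequestTimeout` ends up as a message with `ScheduledEnqueueTimeUtc` set, i.e. a scheduled message sitting on the queue):

```csharp
// Illustrative sketch: a saga requesting a timeout far in the future.
// The timeout becomes a scheduled Service Bus message until it is due.
public class OrderSaga : Saga<OrderSagaData>,
    IAmStartedByMessages<StartOrder>,
    IHandleTimeouts<OrderExpired>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderSagaData> mapper)
    {
        mapper.ConfigureMapping<StartOrder>(message => message.OrderId)
              .ToSaga(sagaData => sagaData.OrderId);
    }

    public async Task Handle(StartOrder message, IMessageHandlerContext context)
    {
        Data.OrderId = message.OrderId;

        // This is what translates to a scheduled message on the queue.
        await RequestTimeout<OrderExpired>(context, TimeSpan.FromDays(30));
    }

    public Task Timeout(OrderExpired state, IMessageHandlerContext context)
    {
        MarkAsComplete();
        return Task.CompletedTask;
    }
}

public class OrderSagaData : ContainSagaData
{
    public string OrderId { get; set; }
}

public class StartOrder { public string OrderId { get; set; } }
public class OrderExpired { }
```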

There is an open issue: https://github.com/Azure/Azure-Functions/issues/715 but it’s been open since 2018…

If you would like to reproduce this without any involvement of NServiceBus, you can add these three functions to a function app on a Consumption plan and watch the server count in, for example, Application Insights:

[FunctionName("MyTrigger")]
public static void QueueTrigger(
    [ServiceBusTrigger("debug", Connection = "AzureServiceBus")] 
    string myQueueItem, 
    ILogger log)
{
    log.LogInformation($"Message: {myQueueItem}");
}

[FunctionName("SendImmediate")]
public static async Task<IActionResult> SendImmediate(
    [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] 
    HttpRequest req,
    [ServiceBus("debug", EntityType.Queue, Connection = "AzureServiceBus")] 
    IAsyncCollector<Message> messages,
    ILogger log)
{
    log.LogInformation("Sending message immediately");

    var bytes = Encoding.UTF8.GetBytes("Immediate message");
    var message = new Message(bytes);
    await messages.AddAsync(message);

    return new OkResult();
}

[FunctionName("SendScheduled")]
public static async Task<IActionResult> SendScheduled(
    [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] 
    HttpRequest req,
    [ServiceBus("debug", EntityType.Queue, Connection = "AzureServiceBus")] 
    IAsyncCollector<Message> messages,
    ILogger log)
{
    var seconds = 60;
    
    log.LogInformation($"Sending message scheduled in {seconds} seconds");

    var bytes = Encoding.UTF8.GetBytes("Scheduled message");
    var message = new Message(bytes)
    {
        ScheduledEnqueueTimeUtc = DateTime.UtcNow.AddSeconds(seconds)
    };
    await messages.AddAsync(message);

    return new OkResult();
} 

Just post to the endpoints, especially SendScheduled… After about a minute or so, new instances of the app will be created if your function app is hosted on a Consumption plan.
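If you want to drive the repro from code instead of posting by hand, a simple loop like this works (the URL is a placeholder for your own function app; the 2-second interval is just one rate that reproduced the behavior for me):

```csharp
// Posts to the SendScheduled endpoint at a steady rate so that a backlog
// of scheduled messages builds up on the queue. URL is a placeholder.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var url = "https://<your-function-app>.azurewebsites.net/api/SendScheduled";
        using var client = new HttpClient();

        for (var i = 0; i < 100; i++)
        {
            var response = await client.PostAsync(url, content: null);
            Console.WriteLine($"{i}: {response.StatusCode}");
            await Task.Delay(TimeSpan.FromSeconds(2));
        }
    }
}
```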

For us, this simulated the behavior with our Sagas and their Saga Timeouts which caused the host to start scaling without actually needing it…

So, Azure Service Bus does indeed have native support for scheduled messages, but it can have the unintended side effect of over-scaling if you have a Function App on a Consumption plan… Perhaps something for the documentation? Unless I’m completely wrong about this :slight_smile:

//J

Good feedback about documenting it, @jens.
I’ve raised an issue to capture this.

Hi Jens,

could you share a bit more about the effects you’ve been observing? It would be a great help for us in figuring out the impact of this problem.

  • How many instances do you see being spun up by the controller?
  • Is there a cap at which no more instances are created?
  • Does this depend on the number of delayed messages in the queue?
  • Does the number of instances go down at some point?

Cheers,
Tomek

As we’ve moved away from auto-scaling in our environments for the moment, I’ve used the sample code above to do some tests, and perhaps that can answer some of your questions.

My first scenario was the following:

On an empty queue, schedule one message for 30 days in the future and see what happens.

After only three minutes (with absolutely no other actions on the function app), we were running on 5 instances.

How many instances do you see being spun up by the controller?

In our real application, we had 12 instances running on a newly deployed (restarted) application with only one active user (our test environment), when we noticed it. We stopped the app manually and waited for the instances to drop off and restarted. Same thing happened again after a couple of minutes with some minor usage of the app.

In the test I did earlier today, with only one message scheduled for a long time in the future, it stopped at 5 initially. It had idled for 40 minutes (still 5 instances), so I then scheduled a couple more messages for about a minute in the future, and the instances grew to 7, but then it kind of stopped there. I wasn’t able to force it higher, but this was with a very low message flow, as I had to push messages manually.

Is there a cap at which no more instances are created?

Even though I’ve not been able to push it up really high yet, I have no doubt that it will continue to grow on a system with higher load than we’ve tested. The cap on the consumption plan is 200 instances.

Does this depend on the number of delayed messages in the queue?

Hard to say, but another test I’ve just made was to post a scheduled message for 60 seconds in the future every 2 seconds. The instance count grew to 9 in a short time but stopped there. There were around 29 scheduled messages on the queue on average. I doubled the rate of the producer, so there were around 56 scheduled messages on the queue on average, but the instance count stayed at 9.

I then doubled the scheduled time to 120 seconds, still posting every 2 seconds, and the scheduled message count was around 56; the instance count also stayed at 9.

The last thing I did was to set the scheduled time to 10 minutes, and then the instances grew to 10 after the first couple of messages arrived, but that’s perhaps just a coincidence.

Does the number of instances go down at some point?

Only when there were no scheduled messages at all did the instances start to drop off, after about 10 minutes.

These tests are by no means very scientific, just a way to somewhat reproduce the behavior we see in our real application, but I hope they can give some insight at least.

//J

I doubt Particular can do much about this. This is how the Azure Functions scale controller works. Among various metrics, it looks at the queue depth over time and the age of the oldest message to decide whether to scale out. If delayed/scheduled messages are in the queue, the scale controller doesn’t seem to know how to differentiate them from normal messages that are available for processing; it mistakes them for messages that can’t get processed and scales out.

Unfortunately, the documentation on the scale controller is minimal. I’ve raised an issue with the Functions team so there’s a public tracking issue others can refer to and chime in on/upvote.


Thank you for sharing this detailed description, Jens.