Scheduled messages (e.g. Saga Timeouts) cause over-scaling on Azure Functions Consumption Plan

Maybe just an “FYI”, as this is not caused by NServiceBus: when hosting an NServiceBus endpoint with Azure Functions on a Consumption plan (perhaps Premium as well), scheduled messages on the queue will cause the scale controller to create more and more instances of the app, even if it’s under no load.

This will happen if you have a lot of Saga timeouts or other delayed messages, as all of these translate to scheduled messages on the Azure Service Bus queue. Apparently, the scale controller can’t distinguish between active and scheduled/deferred messages, so it sees a message and its time of arrival (which might be some time ago) but not that it should be processed later, and it will start spinning up new instances.
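For context, this is roughly how such delayed messages come about in a saga. A minimal fragment (the `PaymentTimedOut` message type and the surrounding saga are hypothetical examples, not from the real application):

```csharp
// Inside a saga handler. With the Azure Service Bus transport, this
// timeout request ends up as a message on the endpoint's input queue
// with ScheduledEnqueueTimeUtc set 24 hours in the future - exactly
// the kind of message the scale controller appears to misread.
await RequestTimeout<PaymentTimedOut>(context, TimeSpan.FromHours(24));
```

With many sagas each requesting timeouts, the input queue accumulates scheduled messages even when nothing is actually ready for processing.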

There is an open issue: https://github.com/Azure/Azure-Functions/issues/715 but it’s been open since 2018…

If you would like to reproduce this without any involvement of NServiceBus, you can add these three functions to a function app on a Consumption plan and watch the server count in, for example, Application Insights:

using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.ServiceBus;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Azure.WebJobs.ServiceBus;
using Microsoft.Extensions.Logging;

[FunctionName("MyTrigger")]
public static void QueueTrigger(
    [ServiceBusTrigger("debug", Connection = "AzureServiceBus")]
    string myQueueItem,
    ILogger log)
{
    log.LogInformation($"Message: {myQueueItem}");
}

[FunctionName("SendImmediate")]
public static async Task<IActionResult> SendImmediate(
    [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] 
    HttpRequest req,
    [ServiceBus("debug", EntityType.Queue, Connection = "AzureServiceBus")] 
    IAsyncCollector<Message> messages,
    ILogger log)
{
    log.LogInformation("Sending message immediately");

    var bytes = Encoding.UTF8.GetBytes("Immediate message");
    var message = new Message(bytes);
    await messages.AddAsync(message);

    return new OkResult();
}

[FunctionName("SendScheduled")]
public static async Task<IActionResult> SendScheduled(
    [HttpTrigger(AuthorizationLevel.Anonymous, "post", Route = null)] 
    HttpRequest req,
    [ServiceBus("debug", EntityType.Queue, Connection = "AzureServiceBus")] 
    IAsyncCollector<Message> messages,
    ILogger log)
{
    var seconds = 60;
    
    log.LogInformation($"Sending message scheduled in {seconds} seconds");

    var bytes = Encoding.UTF8.GetBytes("Scheduled message");
    var message = new Message(bytes)
    {
        ScheduledEnqueueTimeUtc = DateTime.UtcNow.AddSeconds(seconds)
    };
    await messages.AddAsync(message);

    return new OkResult();
} 

Just post to the endpoints, especially the SendScheduled one… After about a minute or so, new instances of the app will be created if your function app is hosted on a Consumption plan.
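Posting can be done with plain curl; the host name below is a placeholder for your own function app’s URL:

```shell
# Placeholder host name - substitute your function app's URL.
APP="https://<your-app>.azurewebsites.net"

# An immediate message for comparison:
curl -X POST "$APP/api/SendImmediate"

# A scheduled message - this is what triggers the over-scaling:
curl -X POST "$APP/api/SendScheduled"
```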

For us, this simulated the behavior with our Sagas and their Saga Timeouts which caused the host to start scaling without actually needing it…

So, Azure Service Bus does indeed have native support for scheduled messages, but it might have an unintended side effect of over-scaling if you have a Function App on a Consumption plan… Perhaps something for the documentation? Unless I’m completely wrong about this :slight_smile:

//J

Good feedback about documenting it, @jens.
I’ve raised an issue to capture this.

Hi Jens,

could you share a bit more about the effects that you’ve been observing? That would be a great help for us in figuring out the impact of this problem.

  • How many instances do you see being spun up by the controller?
  • Is there a cap at which no more instances are created?
  • Does this depend on the number of delayed messages in the queue?
  • Does the number of instances go down at some point?

Cheers,
Tomek

As we’ve changed away from auto scaling in our environments at the moment, I’ve used the sample code above to do some tests and perhaps that can answer some of your questions.

My first scenario was the following:

On an empty queue, schedule one message for 30 days in the future and see what happens.

After only three minutes (with absolutely no other actions on the function app), we were running on 5 instances.

How many instances do you see being spun up by the controller?

In our real application, we had 12 instances running on a newly deployed (restarted) application with only one active user (our test environment), when we noticed it. We stopped the app manually and waited for the instances to drop off and restarted. Same thing happened again after a couple of minutes with some minor usage of the app.

In the test I did earlier today, with only one message scheduled for a long time in the future, it stopped at 5 initially. It had idled for 40 minutes (still 5 instances), so then I scheduled a couple more messages for about a minute in the future, and the instances grew to 7, but then it kind of stopped there. I wasn’t able to force it higher, but this was with a very low message flow as I had to push messages manually.

Is there a cap at which no more instances are created?

Even though I’ve not been able to push it up really high yet, I have no doubt that it will continue to grow on a system with higher load than we’ve tested. The cap on the consumption plan is 200 instances.

Does this depend on the number of delayed messages in the queue?

Hard to say, but another test I just ran was to publish a message scheduled 60 seconds into the future every 2 seconds. The instance count grew to 9 in a short time, but stopped there. There were around 29 scheduled messages on the queue on average (close to the expected steady state of 60 s ÷ 2 s = 30). I doubled the rate of the producer so there were around 56 scheduled messages on the queue on average, but the instance count stayed at 9.

I then doubled the scheduled time to 120 seconds, still sending every 2 seconds; the scheduled message count was around 56 and the instance count also stayed at 9.

The last thing I did was to set the scheduled time to 10 minutes, and then the instances grew to 10 after the first couple of messages arrived, but perhaps that’s just a coincidence.

Does the number of instances go down at some point?

Only when there were no scheduled messages at all did the instances start to drop off, after about 10 minutes.

These tests are by no means very scientific, just a way to somewhat reproduce the behavior we see in our real application, but I hope they can give some insight at least.

//J

I doubt Particular can do much about this. This is how the Azure Functions scale controller works. Among various metrics, it looks at the queue depth over time and the oldest message’s age to decide whether to scale out or not. If delayed/scheduled messages are in the queue, the scale controller doesn’t seem to differentiate those from normal messages available for processing; it mistakes them for messages that can’t get processed and scales out.

Unfortunately, the documentation on the scale controller is too minimalistic. I’ve raised an issue with the Functions team so there’s a public tracking issue others can refer to and chime in on/upvote.


Thank you for sharing this detailed description Jens.

Hi @jens

I talked about it a bit with @andreasohlund and we think there might be a workaround. We have not yet verified it, but it might be possible to create an additional queue and/or topic for delayed messages and have a behavior in NServiceBus that re-routes delayed messages destined for the local endpoint to that queue/topic. Then we could set up forwarding from that queue/topic to the local queue.

If that works, then any timeout would wait for delivery on that additional queue and then be automatically forwarded to the input queue. The additional queue would not have any active receivers, so the scale controller will not be interested in it, and the input queue’s scale controller won’t see the delayed messages.
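For illustration, provisioning such an auxiliary queue with auto-forwarding could look roughly like this with the `Microsoft.Azure.ServiceBus` management API (queue names and the connection-string setting are placeholders; and as it turns out later in this thread, forwarding moves scheduled messages on immediately, so treat this purely as a sketch of the idea):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.ServiceBus.Management;

class ProvisionDelayQueue
{
    static async Task Main()
    {
        // Placeholder: connection string taken from an environment variable.
        var client = new ManagementClient(
            Environment.GetEnvironmentVariable("AzureServiceBus"));

        // Auxiliary queue that would hold the delayed messages; anything
        // arriving on it is auto-forwarded to the endpoint's input queue.
        await client.CreateQueueAsync(new QueueDescription("endpoint-delay")
        {
            ForwardTo = "endpoint" // the real input queue name
        });

        await client.CloseAsync();
    }
}
```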

Szymon

Hi @SzymonPobiega. We actually talked about a similar approach, and it sounds like that could be a potential workaround. We are more than happy to test something if you are willing to add this behavior in NServiceBus.

//J

FYI: Just got an answer from Microsoft regarding this:

As the Product Group confirmed, this would be a new feature request and it may take some time to proceed/ build a new feature.

For now, we would like to ask if you consider to use a Dedicated App Service Plan, instead of a Dynamic plan (Consumption plan)?

So a workaround would be nice, otherwise it’s not feasible to run this on a dynamic plan, which is unfortunate.

//J

@jens to add to that response - Microsoft won’t prioritize the work until there’s a support case opened by a customer. The issue will be triaged, but w/o a specific customer “incident” it won’t be treated seriously.

Just a thought - to work around it, the code scheduling messages could use the option of specifying the destination, which would always be the auxiliary queue auto-forwarding to the Function trigger queue. Customers would need to provision this auxiliary queue anyway, since NServiceBus doesn’t create the infrastructure. This way you can scale back your bill w/o waiting on the Functions team to fix the bug or forcing NServiceBus to implement a workaround that doesn’t really need to be in the NServiceBus code.

This is really a Microsoft issue that they should have fixed a long time ago.

@SeanFeldman Thanks for the great input. That would only be possible if we actually do a send or publish with a delay explicitly set, like this, right?

var sendOptions = new SendOptions();
sendOptions.DelayDeliveryWith(TimeSpan.FromSeconds(60));
sendOptions.SetDestination("delay-queue"); // with forwarding to the endpoint queue

await context.Send(new Something(), sendOptions);

Our problems were almost exclusively caused by the saga timeouts, and I’m not sure I can set a destination on a timeout message; I would prefer not to implement our own saga timeout handling if possible (we have quite a lot of sagas).

I talked to @andreasohlund yesterday and he said they might have a custom behavior implemented for us to try. I’ll wait for that first.

Hej @jens

@andreasohlund and I have spiked a workaround to this issue based on the Azure Functions sample. The code is available here. As I mentioned previously, it uses a behavior so that your saga code is not aware of the workaround.

Hi @SzymonPobiega and @andreasohlund. I’ve now tested the workaround, but unfortunately it doesn’t seem to make a difference. The forward-to mechanism seems to forward the scheduled message immediately to its forward-to target, which results in the scheduled message living in the endpoint queue and not in the delay queue. Same as before, in other words…

I tried it in Service Bus Explorer as well, and sending a delayed message to a “delay” queue with forward-to also moved the message on immediately, as a scheduled message on the destination queue.

Unless I’m missing something?

@jens,

The Functions team is actively looking into this issue. If you want to help them with resolving it and not just find a workaround, you could chime in on the issue.


@jens my bad. We looked at the queue dashboard to make sure the message indeed passes through the delay queue, but we did not make sure that it stays there while being delayed :frowning:

I can only suggest (it’s up to others to act): back up @jens on the issue I’ve raised and let the Functions team fix it the right way.
