Rate Limiting and Immediate Retries

I’m using a rate limiting approach similar to the one in the Message Throughput Throttling sample (Message Throughput Throttling • NServiceBus Samples • Particular Docs), with the MSMQ transport. What I have found, though, is that if a message fails and triggers an immediate retry, calling the DelayMessage method loses the immediate retry count, so it is reset to zero.

In one specific case I observed, another bug in the application caused the message handler to always throw exceptions, and because the delay kept clearing the retry count I ended up with an infinite cycle of immediate retries. The message was never redirected to the error queue, and I had to manually remove the timeout related to the rate limiting from my database to stop the retry cycle.

For delayed retries, I was able to copy the delayed retry headers in my DelayMessage method so that the delayed retry count is preserved. For the time being, I have therefore disabled immediate retries and increased my number of delayed retries to compensate.
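
For reference, this is a simplified sketch of the kind of header copying I mean, assuming the DelayMessage helper re-dispatches the message via SendOptions as in the throttling sample (the helper name and parameters are illustrative, not my exact code):

using System.Collections.Generic;
using NServiceBus;

static class DelayedRetryHeaders
{
    // Copy the delayed retry bookkeeping from the incoming message onto the
    // outgoing (cloned) message so the delayed retry counter keeps incrementing.
    public static void Copy(IReadOnlyDictionary<string, string> incomingHeaders, SendOptions sendOptions)
    {
        if (incomingHeaders.TryGetValue(Headers.DelayedRetries, out var delayedRetries))
        {
            sendOptions.SetHeader(Headers.DelayedRetries, delayedRetries);
        }

        if (incomingHeaders.TryGetValue(Headers.DelayedRetriesTimestamp, out var timestamp))
        {
            sendOptions.SetHeader(Headers.DelayedRetriesTimestamp, timestamp);
        }
    }
}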

Is there any way to detect whether the message being processed is an immediate retry or to retain the immediate retry state when delaying processing? Alternatively, is there a better way to handle this situation?

> Is there any way to detect whether the message being processed is an immediate retry or to retain the immediate retry state when delaying processing? Alternatively, is there a better way to handle this situation?

The DelayMessage implementation used in the throttling sample essentially creates a new message (a clone) with a new ID, and the immediate retry functionality relies heavily on the ID staying the same across retries. So every time a message gets delayed by the throttling mechanism instead of the recoverability mechanism, the immediate retry counter is effectively reset (more precisely, a new counter is started for the clone). Note that setting the message ID on the clone won’t help, because for this check the transport uses the internal, MSMQ-specific message ID, not the ID generated by NServiceBus.

The throttling sample is deliberately simple and has no notion of a “maximum number of throttled retries”; you’d have to extend the throttling logic a bit further (e.g. with custom headers) to keep track of the number of retries, see the sketch below. Otherwise it just keeps retrying endlessly (as long as the configured number of immediate retries isn’t reached between delays), as you have noticed.
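
A minimal sketch of what that could look like, again assuming a SendOptions-based re-dispatch as in the sample; the header name and the maximum are made-up placeholders:

using System.Collections.Generic;
using NServiceBus;

static class ThrottledRetryTracking
{
    // Made-up header name and limit, purely for illustration.
    const string ThrottledRetriesHeader = "Sample.ThrottledRetries";
    const int MaxThrottledRetries = 5;

    // Returns false once the limit is reached so the caller can stop delaying
    // and let recoverability move the message to the error queue instead.
    public static bool TryRegisterThrottledRetry(IReadOnlyDictionary<string, string> incomingHeaders, SendOptions sendOptions)
    {
        var attempts = 0;
        if (incomingHeaders.TryGetValue(ThrottledRetriesHeader, out var value))
        {
            int.TryParse(value, out attempts);
        }

        if (attempts >= MaxThrottledRetries)
        {
            return false;
        }

        sendOptions.SetHeader(ThrottledRetriesHeader, (attempts + 1).ToString());
        return true;
    }
}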

An alternative implementation could make better use of the delayed delivery mechanism by providing a custom recoverability policy, e.g. something like this (I haven’t tested this approach any further, to be honest):

static RecoverabilityAction ThrottledDelay(RecoverabilityConfig recoverabilityConfig, ErrorContext errorContext)
{
    // Placeholders standing in for the original "x" and "y"; these would come
    // from the endpoint's own throttling configuration.
    const int maxThrottledDelayedRetries = 5;
    var throttleDelay = TimeSpan.FromSeconds(30);

    // RateLimitExceededException is a custom exception thrown by the handler.
    if (errorContext.Exception is RateLimitExceededException)
    {
        // Keep delaying while the rate limit is the cause of the failure...
        if (errorContext.DelayedDeliveriesPerformed < maxThrottledDelayedRetries)
        {
            return RecoverabilityAction.DelayedRetry(throttleDelay);
        }

        // ...but give up and move to the error queue after the configured maximum.
        return RecoverabilityAction.MoveToError(recoverabilityConfig.Failed.ErrorQueue);
    }

    // Everything else is handled by the default recoverability policy.
    return DefaultRecoverabilityPolicy.Invoke(recoverabilityConfig, errorContext);
}

which can be registered in the endpoint configuration:

endpointConfiguration.Recoverability().CustomPolicy(ThrottledDelay);

With that approach, the behavior from the sample can be removed and the handler just has to throw that specific exception.
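
As a rough sketch of those remaining pieces (the exception, message type, and handler are illustrative names, and the rate limit check is only a stand-in for the endpoint's own throughput bookkeeping):

using System;
using System.Threading.Tasks;
using NServiceBus;

public class RateLimitExceededException : Exception
{
    public RateLimitExceededException(string message) : base(message) { }
}

public class MyMessage : IMessage { }

public class MyMessageHandler : IHandleMessages<MyMessage>
{
    // Stand-in for whatever throughput tracking the endpoint keeps.
    static bool LimitExceeded() => false;

    public Task Handle(MyMessage message, IMessageHandlerContext context)
    {
        if (LimitExceeded())
        {
            // The custom recoverability policy turns this into a delayed retry,
            // or moves the message to the error queue after the maximum.
            throw new RateLimitExceededException("Throughput limit reached.");
        }

        // normal processing goes here
        return Task.CompletedTask;
    }
}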