Implementing Custom Retry Policy breaks DefaultRecoverabilityPolicy behavior

retries

(Michael Mc Carthy) #1

I’m using NSB 6.1.4 and have implemented a custom retry policy based on a particular message type. Here is the code:

class ConfigureRecoverabilitySettings : INeedInitialization
{
    public void Customize(EndpointConfiguration configuration)
    {
        var recoverability = configuration.Recoverability();
        recoverability.CustomPolicy(CreateSubjectCustomRetryPolicy);
        recoverability.Delayed(delayed =>
        {
            delayed.TimeIncrease(new TimeSpan(0, 0, 20));
        });
        recoverability.Immediate(immediate =>
        {
            immediate.NumberOfRetries(2);
        });
    }

    private RecoverabilityAction CreateSubjectCustomRetryPolicy(RecoverabilityConfig config, ErrorContext context)
    {
        // Check the TryGetValue result so messageName can't be null below
        if (context.Message.Headers.TryGetValue(Headers.EnclosedMessageTypes, out var messageName)
            && messageName.Contains("TestCommand"))
        {
            if (context.DelayedDeliveriesPerformed == 3)
            {
                return RecoverabilityAction.MoveToError(config.Failed.ErrorQueue);
            }

            return RecoverabilityAction.DelayedRetry(TimeSpan.FromMinutes(10));
        }

        return DefaultRecoverabilityPolicy.Invoke(config, context);
    }
}

I followed the instructions for a “mix” of the default recoverability policy and a custom recoverability policy here: https://docs.particular.net/nservicebus/recoverability/custom-recoverability-policy#implement-a-custom-policy-partial-customization

I am currently running with MSMQ as my transport and InMemoryPersistence.

When dispatching a message that is NOT named like “TestCommand”, the behavior I’m seeing is that the two immediate retries happen back to back, as I would expect, then the message drops into the first delayed retry… and this is where things get weird.

After the first delayed retry, two immediate retries are invoked again. The two immediate retries happen for each subsequent delayed retry (I am using the default 3 delayed retries) until the message ends up being moved to the error queue.

I’m not sure whether it’s a problem with the way I’m combining the custom retry policy with the default recoverability policy? Or perhaps it’s a side effect of using InMemoryPersistence?

I know in NSB 5 there was a “retry” queue for each endpoint… but I’m not too sure what NSB 6 does with messages during retries.

Thanks!
Mike


(Tim Bussmann) #2

Hey @mgmccarthy
The behavior you’re describing is indeed the expected behavior. The immediate retry counter resets when a delayed retry is performed, so that after the delay, the message will go through immediate retries again.
We’ve documented the total number of retries to expect here: https://docs.particular.net/nservicebus/recoverability/?version=core_7#total-number-of-possible-retries
which seems to align with your observation?
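To make the arithmetic concrete, here is a quick sketch of the minimum total processing attempts under the default policy. The formula follows from the docs page above; the variable names are mine:

```csharp
// Minimum total processing attempts for the default recoverability policy
// (illustrative names; formula per the docs table linked above).
int immediateRetries = 2;   // immediate.NumberOfRetries(2)
int delayedRetries = 3;     // default number of delayed retries

// Each delayed retry resets the immediate retry counter, so every
// delivery (the first one plus each delayed retry) gets
// 1 attempt + immediateRetries additional immediate attempts.
int totalAttempts = (immediateRetries + 1) * (delayedRetries + 1);
// With the configuration above: (2 + 1) * (3 + 1) = 12 attempts
```

That matches the pattern you observed: two immediate retries after every delayed retry until the message lands in the error queue.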

I know in NSB 5 there was a “retry” queue for each endpoint… but I’m not too sure what NSB 6 does with messages during retries.

The delayed delivery mechanism depends on the selected transport. With MSMQ, messages to be retried at a later point in time are sent to the .timeouts queue, where they are picked up by the TimeoutManager. The TimeoutManager stores the delayed message in a database and dispatches it once it’s due. Other transports may use different approaches if they support native delayed delivery.


(Michael Mc Carthy) #3

Tim,

Thanks so much for the clarification and the link. The behavior I’m seeing makes sense now. The whole reason I posted is because I misunderstood immediate/delayed retries in NSB 6. So now I know.

Question: does the default policy for immediate and delayed retries depart from how FLRs and SLRs were handled in NSB 5?

In NSB 5, I think the FLRs were attempted immediately up front, but once the message was moved to SLRs, the FLRs were no longer invoked.

Right?

Thanks,
Mike


(Tim Bussmann) #4

Glad to hear that your observations make more sense now :slightly_smiling_face:

Version 5 also reset the FLR counter. There is a difference between version 5 and version 6, though: the configured number of immediate retries in version 5 was a bit counter-intuitive, as the actual number of retries was the configured FLR number minus 1. E.g. if you configured FLR with 1, it would not actually retry the message but only try to process it once before handing it off to SLR. Starting with version 6, this behavior has been changed so that the configuration causes 1 additional processing attempt (as probably intended) before handing the message off to SLR.
There is a table demonstrating the expected (minimum) number of retries per configuration in our docs, e.g. here for version 5: https://docs.particular.net/nservicebus/recoverability/?version=core_5#total-number-of-possible-retries. You can switch the version at the top of the page and see that the table contains different results starting from version 6. I hope this explanation made sense and maybe it can explain your v5 observations?
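If it helps, here’s a rough sketch of the difference in per-delivery attempt counts between the two versions. The arithmetic is mine, derived from the version 5 and version 6 docs tables linked above:

```csharp
// Illustrative comparison of immediate-retry attempt counts per delivery,
// based on the v5 and v6 docs tables linked above (names are mine).
int configured = 1;

// Version 5: a configured FLR value of 1 means 1 total processing
// attempt, i.e. no actual retry before handing off to SLR.
int v5AttemptsPerDelivery = configured;        // 1 attempt, 0 retries

// Version 6: immediate.NumberOfRetries(1) means 1 retry in addition
// to the first attempt before handing off to delayed retries.
int v6AttemptsPerDelivery = configured + 1;    // 2 attempts, 1 retry
```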


(Michael Mc Carthy) #5

Tim, thanks so much for the help. I’ve been using NSB since version 4 and, apparently, up until now, haven’t really understood how FLR/SLR/immediate/delayed retries work. But now I know. On a public discussion board. With the people that write the software :wink:

The funny part is (if there is a funny part to all this) that I did know about the TransportConfig MaxRetries setting, where the actual number of immediate retries is one less than the value provided in NSB 5.

Either way, thanks so much for a great education about NSB 5/6 retries!

Mike


(Tim Bussmann) #6

You’re always welcome @mgmccarthy! Also, I’m glad to hear that you’ve been using NServiceBus for quite a while already without having to bother too much about the retry internals.

Really understanding how retries work in detail is non-trivial, as you have noticed too. We’re trying to improve this constantly by providing better docs and better APIs (e.g. MaxRetries has actually been renamed to NumberOfRetries, as we can’t guarantee not to retry messages more often due to technical transaction limitations). If you spot documentation gaps, you can also give us feedback directly on our doco page :slight_smile:
Really understanding how retries work in detail is non-trivial as you have noticed too. We’re trying to improve this constantly by providing better docs and better APIs (e.g. MaxRetries has actually been renamed to NumberOfRetries as we can’t guarantee to not retry messages more often due to technical transaction limitations). If you spot some documentation gaps you can also provide us feedback directly on our doco page :slight_smile: