I want to build retry capability into my application, like ServicePulse does

lsmaple · October 12, 2017, 12:38am

My application uses MSMQ with NServiceBus 6, and I am looking at adding features that let end users see when a message related to their request has failed, and ask the application to have that message retried at the service where it failed.
My research has found the MessageSentToErrorQueue event that gives me most of the details of the failed message (error queue name, message id, message body, message headers, exception), which I can send up to the services serving the application UI to let them know there has been a failure.
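For readers following along, a minimal sketch of subscribing to that event is shown here. It assumes the NServiceBus 6 notifications API as I understand it (`endpointConfiguration.Notifications.Errors`); the `ErrorNotifications` class name is made up for illustration, and the member names should be checked against the installed version.

```csharp
using System;
using NServiceBus;
using NServiceBus.Faults;
using NServiceBus.Logging;

// Hypothetical helper class; call Subscribe while configuring the endpoint.
public static class ErrorNotifications
{
    static readonly ILog log = LogManager.GetLogger(typeof(ErrorNotifications));

    public static void Subscribe(EndpointConfiguration endpointConfiguration)
    {
        // Assumed NServiceBus 6 API: error notifications hang off the endpoint configuration.
        var errors = endpointConfiguration.Notifications.Errors;
        errors.MessageSentToErrorQueue += OnMessageSentToErrorQueue;
    }

    static void OnMessageSentToErrorQueue(object sender, FailedMessage failed)
    {
        // The failed message carries the id, headers, body, exception and error queue name.
        log.Error($"Message {failed.MessageId} was moved to '{failed.ErrorQueue}'", failed.Exception);
    }
}
```

The handler only logs here; forwarding the details to the UI services would need a message session, which is not available at this point in the configuration.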
The retry logic is where it gets less clear. I could not find a ‘retry’ API in NServiceBus for retrying a message on demand (it would be a nice feature). I did find some code and script examples for ‘manually’ retrying messages from the error queue (MSMQ Transport Scripting • MSMQ Transport • Particular Docs). The example code uses the MSMQ message id to find the failed message and resend it. The information in the MessageSentToErrorQueue event does not include the MSMQ message id. This makes sense, since MSMQ is not used by every NServiceBus implementation. It looks like MSMQ can search for a message by correlation id, which the message headers do include, and it seems to match the MSMQ correlation id. I’m not sure how reliable and unique the correlation id is, but I thought I’d give that a try.
The next snag/concern is that with the ServiceControl service running, messages sent to the error queue appear to wind up in the error.log queue. I tried to observe how messages moved around when I used ServicePulse to retry and archive messages, but that did not seem to affect the error.log queue, so ServiceControl looks like it is using its own datastore for those details. I also saw that ServiceControl can fire off its own failed message events over NSB, which gives me another way of getting the failed message events to the application UI, but ServiceControl does not have an API that I could see for managing message retries.
If someone has gone down this path, and has some insights, or if someone at Particular knows of a path or maybe one that is coming, it would be appreciated.

DavidBoike · October 12, 2017, 7:06pm

Hey Scott,

We ended up talking on the phone, but I wanted to share some of what we talked about in case it could be of any help to others.

At the ServiceControl level, recoverability is an infrastructure concern, and access to the message is fairly opaque. You can see the message body but the code subscribing to ServiceControl events won’t have access to all the message metadata or be able to utilize the message as a strongly-typed object.

There’s an assumption at this infrastructure level that these messages should complete successfully, and if they don’t, then a developer really needs to look at the underlying cause and do something about it. It’s not the place for message validation, for example - that should have happened before the message was even sent.

It would be difficult (if not impossible) for an end user to really know that a failed message has anything to do with them, or what to do about it. This leads to a situation where end-users are just repeatedly banging on F5 to get the notification to go away.

A better situation is to let developers or operations handle those infrastructure concerns and introduce business-level handling of these types of exceptions. You described to me that a frequent use case was when a user on an external system held a lock over a resource, maybe for 10 seconds, maybe for 10 hours, so normal immediate/delayed retries really don’t work.

In those cases, the exception should really be dealt with at a handler level, and there are two ways to do this.

The first is right in the message handler: catch that specific exception and then send/publish a different message to signal that a business abnormality has occurred which requires action. Then another message handler could write that information to a table or broadcast a SignalR notification, or whatever is deemed appropriate for that business case.
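A minimal sketch of that first option might look like the following. The message, event and exception types (`UpdateResource`, `ResourceUpdateBlocked`, `ResourceLockedException`) and the `UpdateExternalSystem` placeholder are all hypothetical names invented for illustration; only `IHandleMessages<T>` and `context.Publish` come from NServiceBus.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message, event and exception types, for illustration only.
public class UpdateResource : ICommand
{
    public Guid RequestId { get; set; }
}

public class ResourceUpdateBlocked : IEvent
{
    public Guid RequestId { get; set; }
    public string Reason { get; set; }
}

public class ResourceLockedException : Exception
{
    public ResourceLockedException(string message) : base(message) { }
}

public class UpdateResourceHandler : IHandleMessages<UpdateResource>
{
    public async Task Handle(UpdateResource message, IMessageHandlerContext context)
    {
        try
        {
            await UpdateExternalSystem(message).ConfigureAwait(false);
        }
        catch (ResourceLockedException ex)
        {
            // Business abnormality: tell interested services instead of failing the message.
            // Because the exception is swallowed, this message is consumed and never retried.
            await context.Publish(new ResourceUpdateBlocked
            {
                RequestId = message.RequestId,
                Reason = ex.Message
            }).ConfigureAwait(false);
        }
    }

    // Placeholder for the call into the external system that may hold the lock.
    static Task UpdateExternalSystem(UpdateResource message)
    {
        return Task.FromResult(0);
    }
}
```

Any other exception type still propagates out of the handler, so it goes through the normal immediate and delayed retries.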

However, that happens immediately on the very first processing attempt.

You can also use a custom recoverability policy to fine-tune the way exceptions are handled. This component can be deployed globally, but have knowledge of message types and act differently for different message types. So in this case, if this type of message (blocked by the locked external system) had already blown through all the immediate AND delayed retries, it could take the action of sending a different message or taking a different action, and consume the message. All other failed messages would be sent to the error queue as normal.
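As a rough sketch of a policy that is deployed globally but treats one message type differently: the version below lets the default policy run all retries and, only for one message type, takes a different action once the default would give up. Here the "different action" is routing to a dedicated queue rather than sending a message, since the policy itself has no message context. The type name and queue name are hypothetical, and the dedicated queue is assumed to already exist.

```csharp
using System;
using NServiceBus;
using NServiceBus.Transport;

public static class TypeAwarePolicy
{
    // Hypothetical names, for illustration only.
    const string BlockedMessageType = "MyCompany.Messages.UpdateResource";
    const string BusinessFailureQueue = "business-failures";

    public static RecoverabilityAction Invoke(RecoverabilityConfig config, ErrorContext context)
    {
        // Let the default policy decide first, so immediate and delayed retries still happen.
        var action = DefaultRecoverabilityPolicy.Invoke(config, context);

        // Retries are still pending: do not interfere.
        if (!(action is MoveToError))
        {
            return action;
        }

        // Retries are exhausted. Inspect the enclosed message type header.
        string enclosedTypes;
        if (context.Message.Headers.TryGetValue(Headers.EnclosedMessageTypes, out enclosedTypes)
            && enclosedTypes.Contains(BlockedMessageType))
        {
            // Route this message type to a dedicated queue for business-level handling.
            return RecoverabilityAction.MoveToError(BusinessFailureQueue);
        }

        // All other failed messages go to the configured error queue as normal.
        return action;
    }
}
```

It would be registered with `endpointConfiguration.Recoverability().CustomPolicy(TypeAwarePolicy.Invoke);`.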

This gives you flexibility in handling different types of exceptions, with immediate access to all the information about the message and exception, without blurring the lines between handling “business exceptions” and “technical exceptions”.

lsmaple · November 28, 2017, 12:35am

David,
Thanks again for the conversation. I am trying to implement a custom recoverability policy where, if the action proposed by the default recoverability policy is MoveToError, I publish a failed message event so other services know the requested action has failed. I am not sure how to publish a failure message from within the custom policy method. The policy is created and handed to the NSB configuration in the IConfigureThisEndpoint configuration class I’m using, where the endpoint hasn’t been created yet, so it doesn’t have access to an NSB message context for publishing. Can you show me how this pattern might look?
In the Customize method, I’ve added:

    var recoverability = endpointConfiguration.Recoverability();
    recoverability.CustomPolicy(MyCustomRetryPolicy);

And in the IConfigureThisEndpoint class, I’ve added a new policy method:

    RecoverabilityAction MyCustomRetryPolicy(RecoverabilityConfig config, ErrorContext context)
    {
        var action = DefaultRecoverabilityPolicy.Invoke(config, context);

        if (action is MoveToError)
        {
            // Would like to publish an event here
            var log = LogManager.GetLogger(typeof(EndpointConfig));
            log.Error("moved to error queue");
        }

        return action;
    }

lsmaple · November 28, 2017, 9:54pm

I figured out a way around this. I made the dependency container object available to my custom retry policy method. The policy tries to resolve an IMessageSession from the container, and if it finds one, it publishes a failed request message. Then, in the class that implements the NSB IWantToRunWhenEndpointStartsAndStops interface, I register the IMessageSession parameter given to the Start method in the container. It is a roundabout approach, but it works.
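For anyone landing here later, the workaround described above could look roughly like this. The post does not say which container is in use, so this sketch swaps in a simple static holder (`SessionHolder`) for the container registration; that class, the `RequestFailed` event and the other class names are hypothetical. The policy delegate is synchronous, so the sketch blocks on the publish.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Transport;

// Hypothetical event published when a message is about to be moved to the error queue.
public class RequestFailed : IEvent
{
    public string MessageId { get; set; }
    public string Reason { get; set; }
}

// Hypothetical stand-in for the container registration described in the post.
public static class SessionHolder
{
    public static volatile IMessageSession Session;
}

// Captures the session once the endpoint has started (NServiceBus.Host).
public class CaptureSession : IWantToRunWhenEndpointStartsAndStops
{
    public Task Start(IMessageSession session)
    {
        SessionHolder.Session = session;
        return Task.FromResult(0);
    }

    public Task Stop(IMessageSession session)
    {
        SessionHolder.Session = null;
        return Task.FromResult(0);
    }
}

public static class FailureNotifyingPolicy
{
    public static RecoverabilityAction Invoke(RecoverabilityConfig config, ErrorContext context)
    {
        var action = DefaultRecoverabilityPolicy.Invoke(config, context);

        // The session is null until the endpoint has started.
        var session = SessionHolder.Session;
        if (action is MoveToError && session != null)
        {
            // Blocking call: the policy signature is synchronous.
            session.Publish(new RequestFailed
            {
                MessageId = context.Message.MessageId,
                Reason = context.Exception.Message
            }).GetAwaiter().GetResult();
        }

        return action;
    }
}
```

One thing to keep in mind with this design is that the publish happens outside the failed message's own processing, so a subscriber could in principle see the event even if moving the message to the error queue subsequently fails.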