Retry Policies with Transaction Timeouts

I recently ran into a problem where an application bug caused a handler to exceed the 10-minute transaction timeout permitted with MSMQ and DTC. What ended up happening was as follows:
T +0 - Start processing the message
T +10 min - The DTC transaction expires, so NServiceBus starts processing the message again, leaving two instances of the handler running simultaneously.
T +15 min - The first instance eventually performs an action where DTC throws an exception because the transaction has been aborted (e.g. data access, sending a message, committing the transaction). It’s not clear to me whether the NServiceBus immediate retry logic caused the message to be handled simultaneously yet again or not.
T +20 min - The DTC transaction expires for the second instance handling the message, so we again end up with at least two instances of the handler running simultaneously.

This continued for a couple of hours until all of the timeouts had expired. Processing this message multiple times simultaneously blocked other messages from being processed for an extended period of time.

Obviously, one solution to this problem is to avoid application bugs that cause long-running transactions, but I’m looking for a way to mitigate this sort of problem if we ever have bugs like this again.

My first thought was to use the RecoverabilityAction extension point to provide some custom handling for this case, but because the DTC transaction is aborted well before the exception is actually raised in the application, it doesn’t actually help.
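For reference, the kind of custom policy I was experimenting with looked roughly like this (a minimal sketch; detecting the abort via TransactionAbortedException is my assumption, and the delay value is a placeholder):

```csharp
using System;
using System.Transactions;
using NServiceBus;

static class CustomRecoverability
{
    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        var recoverability = endpointConfiguration.Recoverability();

        recoverability.CustomPolicy((config, errorContext) =>
        {
            // If the failure was caused by the DTC transaction being aborted,
            // skip immediate retries and push the message out as a delayed retry
            // so other messages in the queue can be processed first.
            if (errorContext.Exception is TransactionAbortedException)
            {
                return RecoverabilityAction.DelayedRetry(TimeSpan.FromMinutes(10));
            }

            // Everything else falls back to the built-in policy.
            return DefaultRecoverabilityPolicy.Invoke(config, errorContext);
        });
    }
}
```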

Is there anything I can do at the NServiceBus level to try to mitigate this sort of problem? Ideally, I’d like to be able to do the following:

  • Ensure that we don’t have the same message being processed multiple times simultaneously when a transaction times out
  • Avoid immediate retries when a transaction times out so that other messages can be processed first rather than getting blocked behind the message that timed out.
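For completeness, the closest I can see at the configuration level for the second point is turning off immediate retries entirely in the standard recoverability settings (a rough sketch; the retry counts and delays below are placeholders), although that applies to every failure rather than just transaction timeouts:

```csharp
using System;
using NServiceBus;

static class RetryConfig
{
    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        var recoverability = endpointConfiguration.Recoverability();

        // No immediate retries: a failed message goes straight to a delayed retry,
        // so other messages in the queue get a chance to be processed first.
        recoverability.Immediate(immediate => immediate.NumberOfRetries(0));

        recoverability.Delayed(delayed =>
        {
            delayed.NumberOfRetries(3);                      // placeholder
            delayed.TimeIncrease(TimeSpan.FromMinutes(10));  // placeholder
        });
    }
}
```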

Hi Dennis

Our usual recommendation here would be to split the work into multiple endpoints based on the business importance of the messages. If a certain message type holds up another message type, that is an indication that those types should not be handled by the same endpoint.

Regards
Daniel

In this case, all of the affected messages were of the same type; the message that was taking a long time was holding up other instances of the same type.

Hi Dennis

Having Metrics in place to give you an indication of throughput, critical time, and more would probably help you detect problems like that.
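As a rough sketch (assuming the NServiceBus.Metrics.ServiceControl package; the monitoring queue name and reporting interval are placeholders), enabling the feature looks like this:

```csharp
using System;
using NServiceBus;

static class MetricsConfig
{
    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        // Requires the NServiceBus.Metrics.ServiceControl package.
        var metrics = endpointConfiguration.EnableMetrics();

        // Report throughput, critical time, etc. to a ServiceControl Monitoring
        // instance; the queue name and interval below are placeholders.
        metrics.SendMetricDataToServiceControl(
            "particular.monitoring",
            TimeSpan.FromSeconds(10));
    }
}
```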

I’m not sure I’d go for a code solution, because I think a code solution inside the pipeline or the handler can easily become very complex and brittle to manage. But theoretically you could create a cancellation token source with an SLA for that handler and then observe the token managed by that source within the handler code. If the cancellation token is triggered, you could delay the message you are currently handling using delayed delivery (see the sketch after the list below).

To complete processing of the current message without invoking additional handlers and reprocess it later, send a copy of the current message via IMessageHandlerContext.SendLocal(...). Note the following restrictions:

  • Reusing the incoming message instance is possible; however, this does not copy the headers of the incoming message. Headers need to be set manually on the outgoing message via the outgoing headers API.
  • A delay can be added using the send options. For more options see the delayed delivery section.
  • The sent message will be added at the back of the queue.
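Putting those pieces together, the kind of handler I have in mind would look roughly like this (just a sketch; MyMessage, DoWork, the SLA, and the delay values are placeholders):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;

public class MyMessageHandler : IHandleMessages<MyMessage>
{
    // Placeholder SLA, deliberately shorter than the DTC transaction timeout.
    static readonly TimeSpan HandlerSla = TimeSpan.FromMinutes(8);

    public async Task Handle(MyMessage message, IMessageHandlerContext context)
    {
        using (var cancellation = new CancellationTokenSource(HandlerSla))
        {
            try
            {
                // Placeholder for the real work; it has to observe the token.
                await DoWork(message, cancellation.Token);
            }
            catch (OperationCanceledException) when (cancellation.IsCancellationRequested)
            {
                // The SLA was exceeded: stop here and reprocess the message later
                // instead of letting the DTC transaction expire mid-flight.
                context.DoNotContinueDispatchingCurrentMessageToHandlers();

                var options = new SendOptions();
                options.DelayDeliveryWith(TimeSpan.FromMinutes(10)); // placeholder delay

                // Headers are not copied automatically; carry over the custom ones
                // via the outgoing headers API (skipping the NServiceBus.* headers).
                foreach (var header in context.MessageHeaders)
                {
                    if (!header.Key.StartsWith("NServiceBus.", StringComparison.Ordinal))
                    {
                        options.SetHeader(header.Key, header.Value);
                    }
                }

                await context.SendLocal(message, options);
            }
        }
    }

    static Task DoWork(MyMessage message, CancellationToken cancellationToken)
    {
        // Placeholder for the actual handler logic.
        return Task.CompletedTask;
    }
}

// Placeholder message type.
public class MyMessage : ICommand
{
}
```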

But again, I’m not sure if it is worth going down that path. I’d rather invest the time and energy into good monitoring of metrics. But that is my opinion, and you might have other reasons why a code solution is your preferred way of doing it.

Regards
Daniel

Thanks for the reply. We can already detect when this problem happens, but my goal was to try to automatically mitigate it without a human needing to become involved. From my investigation and your comments, it doesn’t look like that’s practical, which is what I suspected. I just wanted to make sure that I hadn’t missed an obvious way of dealing with this.