I recently ran into a problem where an application bug caused a handler to exceed the 10-minute transaction timeout permitted with MSMQ and DTC. What ended up happening was as follows:
T +0 - Start processing message
T +10 min - The DTC transaction expired, so NServiceBus started processing the message again. We now had two instances of the handler running simultaneously.
T +15 min - The first instance eventually performed an action (e.g. data access, sending a message, committing the transaction) and DTC threw an exception because the transaction had already been aborted. It’s not clear to me whether the NServiceBus immediate retry logic then caused the message to be handled simultaneously yet again.
T +20 min - The DTC transaction expired for the second instance of the handler, so we again ended up with at least two instances of the handler running simultaneously.
This continued for a couple of hours until all of the timeouts had expired. Processing this one message multiple times simultaneously blocked other messages from being processed for an extended period of time.
Obviously, one solution to this problem is to avoid application bugs that cause long-running transactions, but I’m looking for a way to mitigate this sort of problem if we ever have a similar bug again.
My first thought was to use the RecoverabilityAction extension point to provide some custom handling for this case, but because the DTC transaction is aborted well before we actually get the exception raised in the application, it doesn’t actually help.
Is there anything I can do at the NServiceBus level to try to mitigate this sort of problem? Ideally, I’d like to be able to do the following:
- Ensure that we don’t have the same message being processed multiple times simultaneously when a transaction times out.
- Avoid immediate retries when a transaction times out so that other messages can be processed first rather than getting blocked behind the message that timed out.
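
To make the two goals above concrete, here is a minimal, framework-agnostic sketch of the behaviour I have in mind. It is written in Python purely for illustration and is not NServiceBus API: `InFlightGuard`, `dispatch`, and the `defer` callback are hypothetical names. It also only covers a single process, so it would not help if the redelivered message were picked up by another endpoint instance.

```python
import threading


class InFlightGuard:
    """Tracks the IDs of messages that currently have a handler running."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = set()

    def try_enter(self, message_id):
        """Return True if no other handler is running for this message."""
        with self._lock:
            if message_id in self._in_flight:
                return False
            self._in_flight.add(message_id)
            return True

    def leave(self, message_id):
        """Mark the message as no longer being handled."""
        with self._lock:
            self._in_flight.discard(message_id)


def dispatch(guard, message_id, handler, defer):
    """Run `handler` unless the same message is already being handled.

    A redelivery that arrives while an earlier attempt is still running is
    passed to `defer` (hypothetical: e.g. schedule a delayed retry) instead
    of starting a second concurrent handler.
    """
    if not guard.try_enter(message_id):
        defer(message_id)
        return "deferred"
    try:
        handler(message_id)
        return "handled"
    finally:
        # Always release the ID, even if the handler raised.
        guard.leave(message_id)
```

With something like this in place, the redelivery at T +10 min would be pushed back rather than run alongside the first attempt, so other messages could be processed in the meantime.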