Duplicate messages for delayed retries when scaled out using SQL transport with native delayed delivery

dliljhammar · April 23, 2020, 7:57am

Hi,
I have an issue where messages are duplicated across scaled out nodes when delayed retries execute.

Initially the system used the TimeoutManager, so I assumed that was the cause of the message duplication. But changing to SQL transport native delayed delivery did not correct the behavior.

The experienced behavior:
As soon as a message on one node is scheduled for a delayed retry, that retry is consumed by multiple nodes when it executes (same message ID). It seems that native delayed delivery is hitting the same race condition as the TimeoutManager. This is likely due to something in our setup, but I have not been able to figure out what.

I have read through most of the documentation, and as far as I can see our setup should be able to guarantee ‘exactly-once’ delivery without the TransactionScope transaction level and DTC.

System setup:

SQL Transport using native delayed delivery.
Default deployment scenario for SQL transport using SendAtomicWithReceive transaction level.
Centralized database with SQL transport and SQL persistence.
Single shared input queues for multiple competing consumers.

NServiceBus version="6.4.3" targetFramework="net452"
NServiceBus.SqlServer version="3.1.3" targetFramework="net452"
NServiceBus.Persistence.Sql version="3.0.3" targetFramework="net452"

I would very much appreciate some input on the topic, and would like to hear from anyone who has a similar setup, whether it is working as expected or showing similar behavior.

Best regards
Daniel

ramonsmits · May 15, 2020, 1:26pm

No duplicates were generated. The problem was with the behavior of immediate retries in a scaled out environment.

The number of instances acts as an immediate retry multiplier, and this behavior is now documented:

The number of instances acts as a multiplier for the maximum number of attempts.

Minimum Attempts = (ImmediateRetries:NumberOfRetries + 1) * (DelayedRetries:NumberOfRetries + 1)
Maximum Attempts = MinimumAttempts * NumberOfInstances

Example:

When taking the default values for immediate retries (5) and delayed retries (3) with 5 instances, the total number of attempts will be a minimum of (5+1)*(3+1)=24 and a maximum of 24*5=120.
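
For anyone who wants to plug in their own numbers, the two formulas above can be sketched as a small calculation. This is only an illustration of the arithmetic, not part of NServiceBus; the function name `attempt_bounds` and its parameter names are made up for this example.

```python
def attempt_bounds(immediate_retries, delayed_retries, instances):
    """Return (minimum, maximum) processing attempts for one message.

    Hypothetical helper: it only evaluates the documented formulas,
    it does not call into NServiceBus.
    """
    # Each delayed retry round runs one initial attempt plus the immediate retries,
    # and there is one initial round plus the delayed retry rounds.
    minimum = (immediate_retries + 1) * (delayed_retries + 1)
    # Each scaled-out instance can contribute its own full set of attempts.
    maximum = minimum * instances
    return minimum, maximum


# Defaults from the example: 5 immediate retries, 3 delayed retries, 5 instances.
print(attempt_bounds(5, 3, 5))  # (24, 120)
```

With a single instance the minimum and maximum are the same (24 with the default settings), which is why the extra attempts only show up after scaling out.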

source: Recoverability • NServiceBus • Particular Docs