I’m using NServiceBus version 7.7.3 with NServiceBus.AmazonSQS version 5.6.1 on .NET 6.
We hit a condition in our production environment where a handler that took a “long time” to process a message ultimately threw a fatal error. Processing typically took 5-15 seconds, but a few messages took over 30 seconds (the current visibility timeout for our SQS setup). The message never made it to the error queue.
In my dev environment I saw 5-second processing times with the fatal error. (The processing time is not the cause of the fatal error.) I saw the expected retry sequence, and the message ultimately landed on the error queue. When I injected a random wait of over 30 seconds, I did see duplicate messages on SQS, but after I removed the wait the system recovered as expected (with multiple messages hitting the error queue).
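For anyone unfamiliar with why the 30-second wait produces duplicates, here is a toy model of my understanding of the SQS visibility timeout: a message that is not deleted before the timeout expires becomes visible again and is delivered a second time. This is only a sketch in plain Python, not the real SQS API and not how NServiceBus is implemented; `ToyQueue`, `ToyMessage`, and `deliveries_during_processing` are made-up names, and the one-second polling interval is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ToyMessage:
    body: str
    receive_count: int = 0        # loosely mirrors SQS's receive count
    invisible_until: float = 0.0  # time at which the message is visible again
    deleted: bool = False


class ToyQueue:
    """Hypothetical in-memory stand-in for an SQS queue (not the real API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = []

    def send(self, body):
        self.messages.append(ToyMessage(body))

    def receive(self, now):
        # Deliver the first message that is neither deleted nor hidden.
        for message in self.messages:
            if not message.deleted and now >= message.invisible_until:
                message.receive_count += 1
                message.invisible_until = now + self.visibility_timeout
                return message
        return None

    def delete(self, message):
        message.deleted = True


def deliveries_during_processing(processing_seconds, visibility_timeout=30.0):
    """Count deliveries of one message while a slow handler is still working on it."""
    queue = ToyQueue(visibility_timeout)
    queue.send("example message")
    first = queue.receive(now=0.0)   # the slow handler picks the message up

    t = 1.0
    while t < processing_seconds:    # the receive loop keeps polling meanwhile
        queue.receive(now=t)
        t += 1.0

    queue.delete(first)              # the handler finishes and deletes the message
    return first.receive_count


print(deliveries_during_processing(5))   # handler finishes inside the timeout
print(deliveries_during_processing(45))  # handler outlives the 30 s timeout
```

In this model a 5-second handler sees one delivery, while a 45-second handler sees the message delivered again at the 30-second mark. That matches the duplication I saw in dev, but it does not explain why the production message never reached the error queue.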
The production event never got to the error queue. There was one SQS message with a Receive Count of over 1000. We ultimately had to intervene and delete the message.
My question is: under what circumstances might a message not be deleted from the queue, not make it to the error queue, and keep accumulating Receive events?