SQS message with high Receive Count

dastultz · September 11, 2023, 1:47pm

I’m using NServiceBus version 7.7.3 with NServiceBus.AmazonSQS version 5.6.1 on .NET 6.

We experienced a condition in the production environment of our app where a handler that took a “long time” to process ultimately threw a fatal error. The processing time was typically 5-15 seconds, but there were a few that took over 30 seconds (the current visibility timeout for our SQS setup). The message never made it to the error queue.

In my dev environment I was seeing 5 second processing times with the fatal error. (The processing time is not the cause of the fatal error condition.) I witnessed the expected retry sequence and the message ultimately landed on the error queue. When I injected a random wait of over 30 seconds, I did see duplication of messages on SQS, but after removing the wait, the system recovered as expected (with multiple messages hitting the error queue).
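To make the duplication above concrete, here is a minimal sketch of the visibility-timeout mechanics. It is a toy in-memory model written in Python, not the AWS SDK and not NServiceBus code: `ToyQueue`, `Message`, and `receives_before_delete` are hypothetical names invented for illustration, and it assumes a second consumer polls once per second while the first handler is still running.

```python
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0
    invisible_until: float = 0.0  # simulated clock time, in seconds


class ToyQueue:
    """Hypothetical in-memory stand-in for an SQS queue (not the AWS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = []

    def send(self, body):
        self.messages.append(Message(body))

    def receive(self, now):
        # Hand out the first visible message, bump its receive count,
        # and hide it again for one visibility timeout.
        for message in self.messages:
            if message.invisible_until <= now:
                message.receive_count += 1
                message.invisible_until = now + self.visibility_timeout
                return message
        return None

    def delete(self, message):
        self.messages.remove(message)


def receives_before_delete(processing_seconds, visibility_timeout=30.0):
    """Count deliveries of one message whose handler deletes it only when done."""
    queue = ToyQueue(visibility_timeout)
    queue.send("order-1")
    now = 0.0
    first = queue.receive(now)          # first delivery at t = 0
    done_at = now + processing_seconds  # the handler finishes and deletes here
    # Assumed: another consumer polls once per second in the meantime.
    while now + 1.0 < done_at:
        now += 1.0
        queue.receive(now)  # returns None while the message is still hidden
    queue.delete(first)
    return first.receive_count


print(receives_before_delete(5))   # handler well inside the 30 s timeout
print(receives_before_delete(45))  # handler outlives the timeout once
print(receives_before_delete(65))  # handler outlives the timeout twice
```

In this model a 5-second handler sees a single delivery, while handlers that run past the 30-second timeout are delivered again each time the timeout lapses. That matches the duplicates seen with the injected wait, but it does not by itself explain a message that never reaches the error queue.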

The production event never got to the error queue. There was one SQS message with a Receive Count of over 1000. We ultimately had to intervene and delete the message.

My question is, what are the circumstances in which a message might not be deleted from the queue, not make it to the error queue, and continually collect Receive events?

Thanks.

mauroservienti · September 12, 2023, 2:31pm

Hi @dastultz,

I have a hard time wrapping my head around why something like this could happen.

Two things come to my mind:

1. A missing await statement (or some other async programming issue) in the handler code which, in some situations, causes the handler to fail and throw when the recoverability pipeline has already started the dispatch to the error queue, so that the dispatch never finishes.
2. Less likely, some weird clock drift situation in which the SQS time causes the delivery to the error queue to fail.
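The first point can be illustrated in general terms. The following is only a sketch of the "missing await" failure mode, written with Python's `asyncio` rather than C#, and it is not NServiceBus code: `run_pipeline`, `do_work`, and both handlers are hypothetical names. It shows the calling pipeline observing a successful handler while the real failure happens later, out of its sight.

```python
import asyncio

background = []  # keeps un-awaited tasks so the demo can inspect them later


async def do_work():
    await asyncio.sleep(0.01)
    raise RuntimeError("fatal error in handler work")


async def handler_missing_await():
    # Bug: the work is started but not awaited, so the handler returns
    # before the failure occurs.
    background.append(asyncio.create_task(do_work()))


async def handler_awaited():
    # Correct: the failure propagates to whoever invoked the handler.
    await do_work()


async def run_pipeline(handler):
    """Hypothetical stand-in for a message pipeline; reports what it observed."""
    try:
        await handler()
        return "completed"
    except RuntimeError:
        return "failed"


async def main():
    seen_without_await = await run_pipeline(handler_missing_await)
    # The exception only surfaces here, after the pipeline has moved on.
    late_errors = await asyncio.gather(*background, return_exceptions=True)
    seen_with_await = await run_pipeline(handler_awaited)
    return seen_without_await, late_errors, seen_with_await


results = asyncio.run(main())
print(results[0], type(results[1][0]).__name__, results[2])
```

With the missing await, the pipeline in this sketch records the message as completed and the exception arrives afterwards. With the await in place, the pipeline sees the failure and can act on it. Whether anything like this applies to the handler in question is something only the actual handler code can show.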

By the way, both the Core and SQS versions you use are unsupported. Would you be able to upgrade to the latest supported minor version and see if the issue still occurs?

I looked at the SQS transport code and could not find any apparent reason. Would you be able to share your message-handling and endpoint configuration code? (If there are privacy concerns, please send an email to support@particular.net and ask for the support case to be assigned to me.)

dastultz · September 13, 2023, 1:24pm

Thank you. We’ll work towards the upgrade. This will take a bit of time, so we’ll follow up later.

/D