RabbitMQ Transport, unacknowledged messages

I am using:
NSB 6.3.4
NSB.RabbitMQ 4.3.2
NSB.Persistence.SQL 2.1.3
RabbitMQ 3.6.9
Erlang 19.3

Outbox feature is enabled
NSB.RabbitMQ is using the default transaction mode (I have not set it explicitly)
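
For reference, here is roughly how the endpoint is configured. This is a simplified sketch: the endpoint name and connection string are placeholders, and the persistence setup is omitted.

```csharp
using NServiceBus;

class EndpointConfig
{
    public static EndpointConfiguration Build()
    {
        // Placeholder endpoint name and connection string.
        var endpointConfiguration = new EndpointConfiguration("MyEndpoint");

        var transport = endpointConfiguration.UseTransport<RabbitMQTransport>();
        transport.ConnectionString("host=localhost");

        // SQL persistence configuration omitted for brevity.
        endpointConfiguration.UsePersistence<SqlPersistence>();

        endpointConfiguration.EnableOutbox();

        // I have NOT called this, so the transport is using its default mode
        // (ReceiveOnly, as far as I understand):
        // transport.Transactions(TransportTransactionMode.ReceiveOnly);

        return endpointConfiguration;
    }
}
```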

I found an unfinished saga that had started 2 days ago. There were no messages in the error queues and no errors in the logs. When I dug deeper, I discovered 3 messages sitting in the RabbitMQ queue in an unacknowledged state. Until those messages are processed, the saga cannot finish.

The server and service never crashed or restarted around the time this happened.

I am trying to determine why these messages ended up unacknowledged and how I could detect this in the future. If I read the NSB.RabbitMQ documentation correctly, each message should be acked or nacked, not left in this state.

If you’re seeing messages stuck in a delivered-but-unacknowledged state, the only cause I can think of is that something in your handler logic for those messages prevented the handler method from ever finishing.

That would mean that the transport never had a chance to send the ack message, so the broker was still waiting for it. The broker will wait for an ack indefinitely as long as the channel isn’t closed.
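
To make that concrete, here is a stripped-down sketch of the general consume/ack pattern using the raw RabbitMQ .NET client. This is not the actual NServiceBus.RabbitMQ code, and the queue name and handler are placeholders; the point is simply that the ack is only sent after the handler returns.

```csharp
using System;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

class AckSketch
{
    static void Main()
    {
        var factory = new ConnectionFactory { HostName = "localhost" };
        using (var connection = factory.CreateConnection())
        using (var channel = connection.CreateModel())
        {
            var consumer = new EventingBasicConsumer(channel);
            consumer.Received += (sender, args) =>
            {
                // Stand-in for the pipeline invoking your message handler.
                // If this call never returns, execution never reaches BasicAck,
                // and the broker keeps the delivery marked as unacknowledged
                // for as long as the channel stays open.
                HandleMessage(args);

                channel.BasicAck(args.DeliveryTag, false);
            };

            // autoAck: false – the broker waits for an explicit ack/nack.
            channel.BasicConsume("my-endpoint-queue", false, consumer);

            Console.ReadLine();
        }
    }

    static void HandleMessage(BasicDeliverEventArgs args)
    {
        // Handler logic goes here.
    }
}
```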

Are the messages still in this state, or have you already restarted the endpoint? If you see this happen again, I suggest taking a memory dump, which would let you analyze the state of the process. Opening the dump in Visual Studio and then using the Parallel Stacks debugging window can be very useful to visualize this.

The very first thing the message handler does is log to the logfile… which didn’t happen.
I restarted the endpoint, and of course the messages were requeued.

A dump file is a good suggestion; I hadn’t thought of that. Are you thinking that the thread/Task processing the message is still running/waiting, but held up somehow?

Yes, that’s exactly what I was thinking. I’ve seen this happen before, and in that case it turned out to be a web API call made by the handler that never returned and had no timeout. Looking at a memory dump, I could see multiple stack traces sitting in that method.
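
As an illustration, here is a hypothetical handler showing the kind of defensive change that fixed it for us. This is not the code from that project: the message type, URL, and timeout values are made up. Giving the HTTP call an explicit timeout and a cancellation token puts an upper bound on how long the handler (and therefore the ack) can be held up.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;

public class CallDownstreamApiHandler : IHandleMessages<CallDownstreamApi>
{
    // Reuse a single HttpClient and give it an explicit timeout so a hung
    // downstream service can't block the handler indefinitely.
    static readonly HttpClient httpClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    public async Task Handle(CallDownstreamApi message, IMessageHandlerContext context)
    {
        // A CancellationTokenSource adds a second, explicit upper bound.
        using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)))
        {
            var response = await httpClient.GetAsync("https://example.org/api/resource", cts.Token);
            response.EnsureSuccessStatusCode();
        }
    }
}

public class CallDownstreamApi : ICommand
{
}
```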

If the message had finished processing, the transport would have either sent the ack or failed to send it, and in the failure case you’d see a log entry indicating that.

If you see this behavior again, collect a few memory dumps so that you can compare them to see if anything is changing over time. That should give you some insight into where the message processing was stuck.

In this case, the very first thing the handler does is log a start message, then it calls a web API. However, the start message was not logged until after I restarted the service, so it never even got to the web API call.

I’ve been keeping an eye on it, and it hasn’t happened again, despite stressing it with over 6K messages that take quite some time to work through (~30s per web API call).

If it does happen again, I will collect a memory dump and see what I can figure out.