RabbitMQ messages left Unacked

Dan_Bishop · September 13, 2019, 8:58pm

NServiceBus: 7.1.10
NServiceBus.RabbitMQ: 5.1.2
RabbitMQ.Client: 5.1.1
RabbitMQ Server: 3.7.17
Erlang: 22.0.7
RabbitMQ Server OS: Ubuntu 18.04

We have RabbitMQ set up in a three-node cluster, with no queue mirroring enabled. We have two NServiceBus endpoints deployed on two Windows 2019 servers (one endpoint per server) that are processing messages from the queue. With this configuration, we are seeing many messages remain in the Unacked state in the queue. We have a test that drops about 80 large messages onto the queue, where each message does a fairly expensive and time-consuming Bulk Insert to RavenDB 4.2. These messages range in size from 250 KB to 2 MB, and they take anywhere from 5 to 60 seconds to process.

Usually about 60 messages are processed successfully, and the last 20 never seem to get processed. If we check the queue, it shows a bunch of messages in the Unacked state. At that point there is no load on either NServiceBus handler, so we know they aren’t still doing any work. We also don’t see any evidence of failures, or anything to that effect, in the NServiceBus logs. While in this state, if we try to stop the NServiceBus host’s Windows service, it hangs on shutdown and the service must be force killed from Task Manager. When the handler starts back up, it successfully processes the messages from the queue.

What is strange is that if we run the test again with only a single NServiceBus endpoint running, everything works properly and no messages are left in the Unacked state.

Another thing to note is that the RabbitMQ server is running on the same hardware as the RavenDB server that is under heavy load (due to the large volume of Bulk Inserts), so it is possible that RabbitMQ is low on CPU/memory resources while these tests are running.

We’d really like to get to the bottom of this and figure out how to prevent this from happening. If I can provide any further information, or if you’d like to get on a call for me to demonstrate this problem, please let me know.

Our RabbitMQ setup from the NServiceBus handler is:

var transport = endpointConfiguration.UseTransport<RabbitMQTransport>()
    .ConnectionString(rabbitMqConnectionString)
    .UseConventionalRoutingTopology()
    .Transactions(TransportTransactionMode.ReceiveOnly);

bording · September 16, 2019, 8:56pm

If you’re seeing unacked messages, then that would mean that the endpoint hasn’t finished processing them, so the transport hasn’t sent an ack message back to the broker yet. There’s not anything in the information you’ve provided so far that would make me think that isn’t the case.
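For context, messages that the broker has delivered to a consumer but that have not yet been acked also count as Unacked, and that includes messages the client has prefetched and not started handling. A minimal sketch, assuming the `PrefetchCount` and `LimitMessageProcessingConcurrencyTo` settings behave as documented for NServiceBus.RabbitMQ 5.x and NServiceBus 7, of how one might shrink the number of in-flight messages per endpoint while testing. The class name, method name, and the values are hypothetical and illustrative only, not recommendations:

```csharp
using NServiceBus;

// Hypothetical helper for illustration; not part of the original endpoint code.
public static class TransportTuningSketch
{
    public static EndpointConfiguration Configure(string endpointName, string rabbitMqConnectionString)
    {
        var endpointConfiguration = new EndpointConfiguration(endpointName);

        // Cap how many messages this endpoint handles at the same time.
        // The value 2 is an arbitrary example.
        endpointConfiguration.LimitMessageProcessingConcurrencyTo(2);

        // Same transport setup as in the original post.
        var transport = endpointConfiguration.UseTransport<RabbitMQTransport>()
            .ConnectionString(rabbitMqConnectionString)
            .UseConventionalRoutingTopology()
            .Transactions(TransportTransactionMode.ReceiveOnly);

        // Delivered-but-unacked messages show up as Unacked on the broker,
        // so a small prefetch keeps that count close to the concurrency limit.
        transport.PrefetchCount(2);

        return endpointConfiguration;
    }
}
```

With settings like these, the Unacked count per endpoint should stay small while handlers are healthy, which can make it easier to tell slow handlers apart from handlers that are genuinely stuck.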

If each message has a lot of work to do, then it’s quite possible you’re overloading your RavenDB server, especially considering the sizes of the messages you’re talking about.

You definitely shouldn’t be running RavenDB and RabbitMQ on the same hardware because if the resources are constrained enough, the RabbitMQ broker will basically shut down when it goes into its various alarm states.

At this point, I’d recommend splitting RavenDB and RabbitMQ off onto different machines and then seeing how that changes things. What machines are the endpoints running on? Are they also shared? I would recommend splitting those off as well if they are on the same machines as RavenDB.

If you want to know what the endpoints are busy doing, I recommend taking some memory dumps, and the stacks of the various threads will give you some insight into that.