High storage queue count on MSMQ Distributor

mgmccarthy · November 4, 2021, 1:32pm

What does it mean when the distributor’s storage queue has a lot of messages in it?

791 messages seems like a lot. After reading these docs, I think the answer to my question is that the workers have a lot of available bandwidth but the distributor is not able to distribute messages fast enough to the available workers. In other words, the distributor is the bottleneck.

Another guess is that the way to fix the distributor being the bottleneck is to scale the distributor horizontally so that it distributes the load to the available workers more quickly.

Thanks!

ramonsmits · November 5, 2021, 5:17pm

The “storage” queue basically shows how many worker slots are unoccupied.

Does your endpoint queue have any messages in it? It’s only a bottleneck if that queue has many messages queued and the storage queue shows many messages too.

Also, because many messages are flowing through these queues while locks are held on them, the numbers shown might hide the truth.

Just look at the main queue on the distributor and check the activity for each worker. Only if the main queue builds up a significant backlog of messages does it indicate a bottleneck.
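As a rough sketch of that reasoning (Python is used only for illustration; the function name, the labels, and the backlog threshold are made up, and the queue counts would come from whatever monitoring you already have):

```python
def diagnose(input_queue_count, storage_queue_count, backlog_threshold=100):
    """Classify one snapshot of the distributor's queues.

    input_queue_count:   messages waiting in the distributor's main queue
    storage_queue_count: messages in the storage queue (free worker slots)
    backlog_threshold:   hypothetical cut-off for a "significant" backlog
    """
    has_backlog = input_queue_count >= backlog_threshold
    has_free_slots = storage_queue_count > 0

    if has_backlog and has_free_slots:
        # Work is waiting although workers are idle: the distributor
        # is not handing messages out fast enough.
        return "distributor-bottleneck"
    if has_backlog:
        # Assumption: a backlog with no free slots means the workers
        # themselves are fully occupied.
        return "workers-saturated"
    # No backlog: a large storage queue is just idle capacity.
    return "no-bottleneck"


print(diagnose(input_queue_count=0, storage_queue_count=791))
```

With an empty main queue, 791 messages in the storage queue would be classified as idle capacity rather than a bottleneck. Treat a single snapshot with care, since the counts can be misleading while messages are in flight.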

Also, an NServiceBus v6 endpoint can act as a distributor worker (this is no longer possible in v7) and can also have its “capacity” configured, as shown in the following sample. The capacity can be larger than the configured concurrency limit and can be used to “pre-fetch” messages to the worker machine, reducing the latency involved in forwarding messages from the distributor machine to a worker machine.

https://docs.particular.net/samples/scaleout/distributor-upgrade/#sample-specific-changes-enlisting-with-the-distributor

Very often the bottleneck is caused by network and/or storage IO limitations. Please monitor your network, storage, memory, and CPU metrics to understand where the bottleneck is.

However, there will indeed be a limit with the distributor, as everything needs to flow through this node. For this reason, in V6+ we dropped the distributor in favor of sender-side distribution. Using sender-side distribution removes the “distributor” hop, which reduces latency and halves your network IO as there is no forwarding. Also, the network IO limit of the distributor machine is no longer the limit; instead, the limit is that of each individual worker or of the network itself.
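To put rough numbers on the halving of network IO, here is a back-of-the-envelope sketch in Python. The function names are made up for illustration, and it assumes an even spread across workers and ignores the control messages that workers exchange with the distributor:

```python
def total_transfers(messages, via_distributor):
    """Network transfers needed to deliver `messages` to workers."""
    # Via the distributor: sender -> distributor, then distributor -> worker.
    # Sender-side distribution: sender -> worker directly.
    hops = 2 if via_distributor else 1
    return messages * hops


def busiest_node_transfers(messages, workers, via_distributor):
    """Transfers handled by the most heavily loaded receiving node."""
    if via_distributor:
        # The distributor receives every message and forwards every message.
        return messages * 2
    # Assumption: messages are spread evenly, so each worker receives
    # roughly messages / workers (rounded up).
    return -(-messages // workers)


print(total_transfers(1000, via_distributor=True))
print(total_transfers(1000, via_distributor=False))
print(busiest_node_transfers(1000, workers=4, via_distributor=True))
print(busiest_node_transfers(1000, workers=4, via_distributor=False))
```

Under these assumptions the total number of transfers is halved, and the busiest machine handles a fraction of the traffic instead of all of it twice.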

Last but not least, MSMQ performance with Core v6 is significantly faster due to performance improvements in the async pipeline. Depending on your workload, fully removing the distributor and upgrading to Core v6, without even relying on sender-side distribution, might make your system more performant.

mgmccarthy · November 5, 2021, 6:38pm

Thank you @ramonsmits! Thanks for laying out, in detail, the mechanics behind all this. Next time I’m monitoring in production, I’ll keep an eye on the distributor’s input queue as well as the storage queue. We’ll be moving to RabbitMQ soon, so we probably won’t pay the price to refactor to sender-side distribution, but if that gets held up, I can certainly look into it.