Competing Consumer - Strategies for dividing handlers across endpoints

Scenario

Let’s say I start with one domain entity.

Initially I have one endpoint with a few handlers, consuming from a single “queue”. I soon realise that this is causing me two problems:

  1. The endpoint simply can’t get through the messages quickly enough. Not only are any one user’s messages taking a long time to be handled, but messages resulting from the actions of different users are queued behind each other. So Andy’s messages are waiting behind Susan’s messages, which are waiting in the same queue behind Alex’s messages, and so on.

    This is a particular problem when a device that has been offline for some time reconnects and all its messages arrive at once.

  2. Some messages take a particularly long time to process (relatively speaking). This is causing potentially “fast” messages to be delayed behind a few “slow” ones.

Looking for a solution

I look at the documentation on scaling out and decide to implement the competing consumers pattern.

I move “slower” handlers to a new endpoint, and I deploy two more instances of the “fast” endpoint. (EDIT: Come to think of it, the built-in multi-threading support in NServiceBus would have achieved almost the same thing).

New problems

I now see handlers competing for the same domain entities by ID, causing concurrency failures that result in automatic retries.

In particular the “slow” messages are almost never processed, because for every entity they load, process, and attempt to save, another endpoint has already updated that entity for a different, “faster” message, and so the slow message fails to save and the whole pipeline has to start over. Immediate retries and Delayed retries are superb built-in functionality, but in this case the failed attempts account for a huge amount of waste.
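To make the waste concrete, here is a minimal in-process sketch (plain Python, not NServiceBus code; `Entity` and `try_save` are made-up names) of how an optimistic version check starves the slow message: by the time the slow handler tries to save, a fast handler on another instance has already bumped the version, so the save fails and the message goes back for a retry.

```python
class Entity:
    """A domain entity persisted with an optimistic version column."""
    def __init__(self):
        self.value = 0
        self.version = 0

def try_save(entity, read_version, new_value):
    """Optimistic save: succeeds only if nobody committed since we read."""
    if entity.version != read_version:
        return False          # concurrency failure -> message is retried
    entity.value = new_value
    entity.version += 1
    return True

entity = Entity()

# The slow handler reads the entity and starts its long-running work...
slow_read = entity.version

# ...meanwhile a fast handler on another endpoint instance reads and commits.
fast_read = entity.version
assert try_save(entity, fast_read, entity.value + 1)

# The slow handler finally tries to save: the version check fails, so all
# of its work is thrown away and the message is retried from scratch.
assert not try_save(entity, slow_read, 99)
```

Every failed attempt repeats the load and the slow processing, which is exactly the waste described above.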

Finally, the actual question for readers of this forum

Thank you for reading this far. :slight_smile:

Usually in these situations the advice from the community is “Your boundaries are wrong.” Does this mean my entity is wrong, and should be split into more entities? What if the actions being performed all require the same value to satisfy invariants? In that case I can’t split the entity up.

I know that for NServiceBus, topic-based routing is considered undesirable, so I can’t have an endpoint per user to ensure that at least any one user’s messages aren’t waiting behind their colleagues’ messages.

I’m not sure what the options are. Can anyone help?

If you have slow handlers and fast handlers that are all targeting the same data, you might want to consider switching to pessimistic concurrency for your data operations. At least this way, the slow handlers will eventually acquire the lock and process, potentially blocking a batch of fast message handlers for the duration.
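As a rough sketch of that suggestion (plain Python, with an in-process `threading.Lock` standing in for a database row lock such as `SELECT ... FOR UPDATE`; not NServiceBus code), handlers serialize on the entity’s lock, so every update — slow or fast — eventually commits instead of failing a version check and retrying:

```python
import threading

class Entity:
    """A domain entity guarded by a pessimistic lock instead of a version."""
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()  # stand-in for a row lock

def handle(entity, work):
    # Block until the row lock is free, then do the work and commit.
    # No version check: once we hold the lock, the save cannot conflict.
    with entity.lock:
        entity.value = work(entity.value)

entity = Entity()
handlers = [threading.Thread(target=handle, args=(entity, lambda v: v + 1))
            for _ in range(10)]
for t in handlers:
    t.start()
for t in handlers:
    t.join()

# All ten updates committed; none were lost to concurrency failures.
assert entity.value == 10
```

The trade-off is visible in the sketch: while one handler holds the lock, the others are blocked rather than failing fast, which is what makes the slow handler’s progress guaranteed.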

You might get into a situation where the fast endpoint is not able to process any messages, because all of the messages at the head of its input queue require a lock that the endpoint cannot acquire while the long-running handler is in flight. If your locking is async, then you may be able to increase the concurrency on the fast message handler endpoint to compensate. This allows the endpoint to try and process more messages in parallel, which decreases the likelihood that all of Andy’s messages wait for Susan’s messages. The success of that strategy will depend on the mix of messages expected in the fast endpoint queue and the queue length, so you’d need to test it with a production-like data load to see how it behaves.
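The effect of raising the concurrency limit can be sketched in plain Python, with a thread pool standing in for the endpoint’s message pump (illustrative names only, not NServiceBus APIs): a “slow” message at the head of the queue no longer holds up the fast messages behind it, because spare workers pick them up.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

slow_gate = threading.Event()   # simulates a long-running "slow" handler
completed = []

def handle(msg):
    if msg == "slow":
        slow_gate.wait()        # the slow handler blocks until released
    completed.append(msg)

# Concurrency limit of 4: up to four messages in flight at once.
# With max_workers=1, the fast messages would sit behind the slow one.
with ThreadPoolExecutor(max_workers=4) as pool:
    pool.submit(handle, "slow")                        # head of the queue
    fast = [pool.submit(handle, m) for m in ("fast-1", "fast-2")]
    wait(fast)              # the fast messages complete despite the slow one
    slow_gate.set()         # now let the slow handler finish

assert completed[-1] == "slow"
```

As noted above, this only helps when enough of the messages near the head of the queue are actually independent; if they all need the same lock, extra workers just block alongside the first one.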

Ultimately, contention over that resource will limit the scalability of the system. You can work around it a little bit, and you may be able to work within those workarounds. In the long-run, you should examine your boundaries and see if there’s some way you can adjust them to reduce contention. It may not be possible with the constraints that you have. It’s hard to know without a concrete example.