NServiceBus Router and failed/poison messages

I’m looking at the atomic update-and-publish example and wondering how failed/poison messages should be handled.

The related docs page says that failed messages are moved to the ‘poison’ message queue - is this the equivalent of the ‘error’ queue configured on a standard endpoint? Or is it somehow different? And should ServiceControl be set up to monitor this queue?

When the router is configured to create its own queues and both interfaces use SQL databases, it creates a poison queue in both interfaces’ databases (i.e. the web/business database and the transport database) - can/should ServiceControl be set up to monitor both of these queues?

Hi @bdlane,

The poison queue used by NServiceBus.Router is a separate queue from the error queue that’s configured in your endpoints (and monitored by ServiceControl). Messages end up in the NServiceBus.Router poison queue due to:

  • serialization errors
  • failures to route the message to the correct endpoint.

This queue is not actively monitored by ServiceControl. You could try configuring NServiceBus.Router’s poison queue to point to the same error queue that’s configured in your endpoints (by default it’s set to ‘poison’, which creates a dedicated queue). The messages should contain all the headers required for ServiceControl to process and retry them as well.
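Something along these lines might do it (a minimal sketch - it assumes the version of the Router you’re using exposes a PoisonQueueName setting on RouterConfiguration, so please verify against the Router documentation):

```csharp
var routerConfig = new RouterConfiguration("Router");

// ... AddInterface calls for both transports go here ...

// Assumed setting: send the Router's failed messages to the endpoints'
// existing "error" queue instead of the dedicated "poison" queue, so the
// ServiceControl instance already watching "error" picks them up as well.
routerConfig.PoisonQueueName = "error";
```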

Kind regards,
Laila

Thanks @laila.bougria for clearing that up - makes sense.

Is the term ‘poison’ vs ‘error’ an important distinction here then? Just wondering why it differs from the main NServiceBus terminology.

I tried pointing a ServiceControl instance at the poison queue and it picked up the messages fine, so that seems to work - at least for the failures I tried.

In the sample, the router creates a poison queue on both interfaces. On the ‘main’ transport side, it would be easy to just configure the router to use the standard error queue and have the ServiceControl instance for the main transport pick up and handle the failed messages. But the poison queue on the web side isn’t reachable by the main ServiceControl instance - because this is a ‘non-standard’ physical routing configuration, right?

So the options to monitor the poison queue on the web side would be:

Are there any other options I’ve missed?

The terminology likely differs because the Router is a community package and the documentation for it is managed by the owner of the package - and not by Particular.

Particular will (unless a mistake is made) always use Error as the name of the queue, since “Poison” might be a technology-specific term that doesn’t apply across all transports.

As to your question about ServiceControl, you’re correct that it’s because it’s non-standard. I expect the intention of this sample is not to show you how you should use the router to deploy your system, but merely to show how you could.

My preference would be for the web side to use the same error queue as ServiceControl. If that is absolutely not possible, then a Transport Adapter that forwards the messages to the error queue monitored by ServiceControl is workable - although it adds deployment complexity.
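The rough shape of that adapter would be something like this (a sketch only - the exact type and member names should be verified against the ServiceControl.TransportAdapter docs, and the connection string variables are placeholders):

```csharp
// Sketch only - verify names against the ServiceControl.TransportAdapter package.
var adapterConfig = new TransportAdapterConfig<SqlServerTransport, SqlServerTransport>("WebSide.Adapter");

// Side where the Router's web-side poison/error queue lives
adapterConfig.CustomizeEndpointTransport(t => t.ConnectionString(webDatabaseConnectionString));

// Side that the main ServiceControl instance is connected to
adapterConfig.CustomizeServiceControlTransport(t => t.ConnectionString(serviceControlConnectionString));

var adapter = TransportAdapter.Create(adapterConfig);
await adapter.Start();
```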

What is your use case for needing the router? Perhaps there’s another way.

@bdlane

The name poison was my mistake when designing the Router. I should have named it error, as in regular NServiceBus endpoints. I thought the name better reflected the nature of the expected failures. While in a regular endpoint you expect a message to end up in the error queue because of errors during processing (possibly caused by a bug in the processing logic), with the Router a message most likely ends up in the poison queue because the message itself is malformed and can’t be forwarded to the destination transport (e.g. it is too large), and there is no way to just fix the routing code and make it succeed.

That said, for ServiceControl the cause of the failure (a processing bug vs a malformed message) does not really matter, and SC can deal equally well with both cases. When releasing the next version of the Router I’ll consider using the regular error name for the failed message queue API.

Thanks for the context around the name ‘error’ vs ‘poison’ - makes sense. And @SzymonPobiega’s confirmed that too in his follow-up post. I just wanted to make sure I wasn’t missing anything given the different names.

I’ve been looking at the Router and the ‘atomic update and publish’ sample because we have an app that currently runs on .NET Framework and uses MSDTC to ensure atomic update and publish, but we want to upgrade it to .NET Core 3.1, which doesn’t support MSDTC.

I actually had an email conversation with @andreasohlund about this and, having considered all the options, I settled on the Router option for now to see how that goes. The main alternative - an eventual consistency model where the web app sends a message to a backend service that takes care of the update and publish - seems like too big a change for us, and we don’t need the scaling and independence that kind of model would give.

I too would prefer to use a single ServiceControl instance and error queue! It irks me that we’ve essentially now got two physical transport instances - the web database and the main NSB database. But it seems that’s necessary wherever the Router is used, since it implies two or more physical transports.
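For reference, this is roughly how the two sides show up in the Router configuration I’ve been experimenting with (a sketch based on the sample - the interface names and connection string variables are placeholders, and the exact API may vary between Router versions):

```csharp
var routerConfig = new RouterConfiguration("Router");

// Interface 1: the web/business database the web app updates atomically
routerConfig.AddInterface<SqlServerTransport>("WebApp", t =>
{
    t.ConnectionString(webDatabaseConnectionString);
});

// Interface 2: the main NServiceBus transport database
routerConfig.AddInterface<SqlServerTransport>("Backend", t =>
{
    t.ConnectionString(mainTransportConnectionString);
});

// Forward messages arriving on the web side to the backend side
routerConfig.UseStaticRoutingProtocol().AddForwardRoute("WebApp", "Backend");

routerConfig.AutoCreateQueues();

var router = Router.Create(routerConfig);
await router.Start();
```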

Ah, yes, I can see that now :slight_smile: It kind of dawned on me as I was playing around with the sample - the Router isn’t ‘processing’ the message like a normal handler, but moving it to another queue, so you wouldn’t get normal processing failures.

Super!

And on that note - is there an equivalent of the error notifications API in the router? I know we can hook into ServiceControl’s error notification messages, but we use the error notification API in regular endpoints to send error telemetry to Application Insights. Having looked through the router code, I couldn’t see anything similar?
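For context, this is roughly how we wire that up in a regular endpoint today (a sketch against the NServiceBus 7 recoverability API - the exact callback signature varies a little between versions, and telemetryClient here is an already-configured Application Insights TelemetryClient):

```csharp
var recoverability = endpointConfiguration.Recoverability();
recoverability.Failed(failed =>
{
    failed.OnMessageSentToErrorQueue(message =>
    {
        // 'message' carries the failed message's id, headers, body and exception
        telemetryClient.TrackException(message.Exception, new Dictionary<string, string>
        {
            ["MessageId"] = message.MessageId
        });
        return Task.CompletedTask;
    });
});
```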

And thanks very much for providing this component - it is super useful!

No, unfortunately there is no notification API for the Router. I added an issue to track this idea here.