Lost Published messages

Hi,

I’ve been using NServiceBus 7.8.4 in a production environment for about 2 years, and every now and then we notice “lost” published messages.

Our logic goes like this:

public async Task Handle(UpdateDetected @event, IMessageHandlerContext context)
{
    // check if this event ever started a SAGA
    var eventAlreadyProcessed = await IsEventInTableStorage(@event.Key).ConfigureAwait(false);

    if (eventAlreadyProcessed)
    {
        return;
    }

    switch (SagaToStart)
    {
        // "SpecificSagaEvent" is a placeholder for the concrete event type that starts the saga
        case string s when s.Equals(nameof(SpecificSagaEvent), StringComparison.OrdinalIgnoreCase):

            await context.Publish<SpecificSagaEvent>(msg =>
            {
                msg.Key = @event.Key;
                msg.X = @event.X;
                msg.Y = @event.Y;
            }).ConfigureAwait(false);
            break;

        // case ...: other cases...
        // case ...: other cases...
        default:
            throw new Exception("unknown SAGA/wrong configuration");
    }

    // mark that this event started a SAGA
    await AddEventToTableStorage(@event.Key).ConfigureAwait(false);
}

This works great most of the time, but approximately 1 in 2000 messages gets lost this way.

I see in my logging that context.Publish is being called with the correct configuration and, after that, AddEventToTableStorage, but there is no sign of a started SAGA or an error in ServicePulse, ServiceInsight, …

The problem does not occur with any specific type of SAGA; the publish seems to return correctly, but the message is never handed over to Azure Service Bus.

Is there anything that I can check or enable to troubleshoot this further?

Hi @kepar

Thanks for bringing this to our attention. Generally, things should not get lost. We take message loss scenarios very seriously because it is important to us that people can trust their systems running on top of NServiceBus. So let’s find out together what could cause this occasional message loss.

Let me first give you a bit of context. When you call context.Publish, the message does not actually go out immediately. It is batched in memory and handed over to the transport at a later stage. Depending on the transaction mode of the transport, this operation can either be separate from the incoming message transaction (ReceiveOnly) or be “attached” to the incoming message transaction (SendsAtomicWithReceive). What transport transaction mode are you using?
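
If you want to be explicit about it, the mode can be configured on the transport. A minimal sketch, assuming the NServiceBus 7 configuration API you are on, could look like this:

var transport = endpointConfiguration.UseTransport<AzureServiceBusTransport>();

// pick the transaction mode explicitly instead of relying on the default
transport.Transactions(TransportTransactionMode.SendsAtomicWithReceive);
// or: transport.Transactions(TransportTransactionMode.ReceiveOnly);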

With Azure Service Bus, the SendsAtomicWithReceive mode uses a send-via approach: the message is handed over to Azure Service Bus and then reliably transferred to its destination by the broker. This operation can fail in very rare cases (for example, when the destination is no longer reachable). When that happens, the message lands in the transfer dead-letter queue. Can you check if there is anything in the transfer dead-letter queue?
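
If it helps, the transfer dead-letter count can also be read programmatically. A rough sketch using the Azure.Messaging.ServiceBus administration client (the namespace and queue names are placeholders):

using Azure.Identity;
using Azure.Messaging.ServiceBus.Administration;

var adminClient = new ServiceBusAdministrationClient(
    "my-namespace.servicebus.windows.net", new DefaultAzureCredential());

// runtime properties include the transfer dead-letter message count
QueueRuntimeProperties runtime =
    await adminClient.GetQueueRuntimePropertiesAsync("my-endpoint-queue");

Console.WriteLine($"Transfer DLQ messages: {runtime.TransferDeadLetterMessageCount}");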

Furthermore, publish operations require a subscriber. When there is no subscriber, a publish operation is simply discarded by the topic. By default, NServiceBus auto-subscribes based on the handler types it finds during assembly scanning and then adds the necessary subscriptions. Is it possible that, for some reason, there were no subscribers available at the time? How do you manage the topology on Azure Service Bus? Do you let NServiceBus do “its thing”, or do you run the NServiceBus endpoints without manage rights and set up the topology during deployment manually, with the tooling, or by some other mechanism?
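
One way to double-check the subscriptions that are actually in place is to list the rules on the endpoint’s subscription. A sketch (the topic and subscription names are placeholders for whatever your topology uses):

using Azure.Identity;
using Azure.Messaging.ServiceBus.Administration;

var adminClient = new ServiceBusAdministrationClient(
    "my-namespace.servicebus.windows.net", new DefaultAzureCredential());

// list every rule (and its filter) on the endpoint's subscription
await foreach (var rule in adminClient.GetRulesAsync("bundle-1", "MyEndpointName"))
{
    Console.WriteLine($"{rule.Name}: {rule.Filter}");
}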

What persister and version are you using to store the sagas? Can you also elaborate on why you are using a generic update message that you then have to map to specific events? And why are those events and not commands? Is anyone else interested in them?

Do you have any logs and diagnostic files from around that time that we could look into? You can open a non-critical support case and attach the necessary files there to keep that data private.

FYI, it is possible to enable Azure Service Bus event logging / tracing (see Azure Service Bus end-to-end tracing and diagnostics - Azure Service Bus | Microsoft Learn and azure-sdk-for-net/sdk/core/Azure.Core/samples/Diagnostics.md at main · Azure/azure-sdk-for-net · GitHub) to capture all Azure Service Bus operations. With that, we could also verify whether the logs correspond to an actual operation being executed.
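
For the SDK side, a minimal way to surface the Azure Service Bus event source output is something along these lines:

using System.Diagnostics.Tracing;
using Azure.Core.Diagnostics;

// forwards all Azure SDK event source output (including Azure Service Bus) to the console
using var listener = AzureEventSourceListener.CreateConsoleLogger(EventLevel.Informational);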

Last but not least, we have also seen missing await statements being the culprit in “message loss scenarios”. Have you checked the code for that?
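
To illustrate what I mean (SomethingHappened is just a placeholder event):

// missing await: the publish runs fire-and-forget, exceptions are swallowed and the
// handler can complete before the operation is ever registered for dispatch
context.Publish(new SomethingHappened());

// with the await in place the operation is part of the handler's processing
await context.Publish(new SomethingHappened()).ConfigureAwait(false);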

Regards,
Daniel

Hi @danielmarbach,

First of all thanks for the extensive reply.

As far as I can see, we do not specify the transaction mode for our Azure Service Bus transport.

The transport is configured like this:

var transport = endpointConfiguration.UseTransport<AzureServiceBusTransport>()
                .ConnectionString(settings.ServiceBusNamespace)
                .CustomTokenCredential(new DefaultAzureCredential());

I’m not sure what the default transaction mode would be in NServiceBus.Transport.AzureServiceBus version 2.0.2, but I am seeing messages in the “In transfer Message Count” metric, which makes me think that it is using SendsAtomicWithReceive mode.

Unfortunately I can’t find any messages in the transfer dead letter queue.

Regarding the subscribers: I was advised not to let NServiceBus do its thing in production, so we have disabled this:

#if DEBUG
    endpointConfiguration.EnableInstallers();
#endif

Thanks for pointing that out; it makes me worried that future releases will be missing some subscriptions (something to add to my todo list).
For now, I do see the correct rules in my subscription to capture my event.
As we’ve been in production with this tool for more than 2 years, that would otherwise have been a big problem already.

As for persistence, we use NServiceBus.Storage.MongoDB version 2.2.0, configured like this:

var persistence = endpointConfiguration.UsePersistence<MongoPersistence>();
persistence.UseTransactions(false);
persistence.MongoClient(new MongoClient(settings.NServiceBusStorageConnectionString));
persistence.DatabaseName(settings.NServiceBusStorageDatabaseName);

We are integrating with an external application which does not have any concept of events, so we are left with periodically querying the database for changes.
When we detect changes, we map the data to events to start the correct SAGA to process that specific change.
No one else is interested in these events, so I guess they should be commands, but this is how our solution was set up by our long-gone consultant ;-).

We do log to Application Insights, but this only shows us that the code before and after the context.Publish was called; no errors can be found around those times.

I still have to look into enabling Azure Service Bus event logging / tracing, thanks for the interesting links!

I double-checked the code for missing await statements but could not find any. I still hope it’s just a configuration issue…

Hi @kepar

It seems our system didn’t pick up that you had replied and didn’t re-open the internal tracking issue. Since I was on vacation and something didn’t work on our end, nobody replied in the meantime. Sorry about that!

I’m back from vacation and will give your answer a closer look tomorrow. In the meantime, have you had any other insights?

Regards,
Daniel

To follow up on some of the details:

That indicates that nothing went wrong during the transfer but still isn’t an indication that something was actually handed over to the broker.

By the nature of pub/sub you can have zero to n subscribers. If there are no subscribers, then the publish is a no-op. Nothing to be scared of, but something to be aware of :wink:

Good to hear that the infrastructure is in place.

I do understand you mentioned the publish is called, which is an indication that quite likely something is not being handed over to the broker. But just to double-check:

You are doing some kind of deduplication logic. Have you checked whether this logic is incorrectly deduplicating in certain scenarios? Furthermore, have you verified that your table storage access doesn’t run into some kind of “transactionality” or concurrency issue that causes the insert to fail, the message to be retried, and then to be seen as a duplicate and discarded?
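
To make that concrete, here is one hypothetical sequence (using the handler shape from your first post) in which this would look exactly like a lost publish:

// attempt #1
//   IsEventInTableStorage(key)  -> false
//   context.Publish(...)        -> only batched in memory, nothing on the broker yet
//   AddEventToTableStorage(key) -> the marker ends up persisted (even if the call times out)
//   the attempt fails afterwards -> the batched publish is discarded with it
// attempt #2 (retry)
//   IsEventInTableStorage(key)  -> true, the handler returns early
//   => the event is never handed over to Azure Service Bus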

Regards,
Daniel