Saga timeouts race condition

Thorarin · March 27, 2023, 1:26pm

I’ve run into an unexpected situation when implementing a new saga. I’m sending a timeout without any delay, so make sure some data is stored before continuing. It’s a little hacky, I agree, but I expected it to work fine.

Instead, what is happening is that the timeout message actually is processed before the saga is even persisted at all, and the timeout is ignored:
No saga found for timeout message 272462ac-5dad-43ef-8b00-afd200bc902e, ignoring since the saga has been marked as complete before the timeout fired

We’re using Azure Service Bus transport and NHibernate persistence.

Now, I haven’t found any examples where this is done, but there are official examples of relatively short timeouts like 10 seconds, and I’ve regularly observed clock drift between our servers and Azure that is larger than 10 seconds. This clock drift potentially means that even if I add a 10 second delay, message delivery could still be instant, leading it to be ignored.

This happens also with outbox enabled. I expected the timeout message to not actually get picked up from the outbox until the saga had been persisted. Is this expected behavior?

Thorarin · March 27, 2023, 2:49pm

It seems something else is going on, but I’m unsure what. Introducing a 10 second delay did not fix my problem. I’ve double checked if I call MarkAsComplete() anywhere unintentionally, but no. I’m was really confused on why the relevant saga is not found for the incoming timeout message.

It seems that the ID my saga gets stored under in my database, is not at all the same as the NServiceBus.SagaId in the timeout message. Oddly enough when I step through my code slowly, they sometimes are the same and everything worked as expected. The mind boggles

Thorarin · March 27, 2023, 3:04pm

After realizing the above, I traced it to the configuration of the persistence quite quickly, or more accurately the fluent NHibernate mapping:

public class MySagaDataMap : ClassMap<MySagaData>
{
    public MySagaDataMap()
    {
        Not.LazyLoad();

        Id(t => t.Id).GeneratedBy.Assigned();  // GeneratedBy.Assigned() was missing
        // ...
    }
}

We had the exact same code in several earlier sagas, but this was never as issue, because the message that started the saga never requested a timeout. Only subsequent handlers were requesting timeouts, at which time Data.Id would be the value that was loaded from persistence. Therefore, we never noticed that the saga ID effectively changed after handling the first message.