Troubleshooting Saga Persistence

Hi there,

We have a few Sagas that SendLocal a Command to a Handler to perform a small amount of initialization work. The Handler then Replies to the Saga.
The Reply intermittently fails with SagaNotFound (which we have configured as a fatal error). The only handlers for these Commands and Messages are the one Saga and the one Handler.

The Reply message is not configured in the Saga’s mapping, nor does it contain a value for the CorrelationProperty. I am under the impression that correlating a Reply to a Saga is handled internally; can you confirm? If this is incorrect, it is most likely the problem, but then I would be confused as to why it works at other times.

This has been observed with both SqlPersistence and LearningPersistence. I believe I have narrowed it down to an issue with hosting multiple endpoints in one host: if I start only one endpoint, I am unable to reproduce it. Other than that, I am at a loss as to how to proceed in troubleshooting this. I can say with 99% certainty that this isn’t an out-of-order message problem, i.e. the Saga completing early. It’s as if the Saga never gets persisted, even though we know it was invoked because of the Command it sent (and the subsequent Reply that gets lost).

We are using Outbox with SqlPersistence if that helps at all.

Any insight would be appreciated.

TIA,
JM

You are right, request/response between a saga and a handler should be correlated automatically.

The usual way is to correlate on some kind of ID and let the user control how to find the correct saga instance using that ID. NServiceBus provides native support for these types of interactions. If a Reply is done in response to a message coming from a saga, NServiceBus will detect it and automatically set the correct headers so that it can correlate the reply back to the saga instance that issued the request.

With the exception that this is only 1 level deep as mentioned:

The exception to this rule is the request/response message exchange between two sagas. In such case the automatic correlation won’t work and the reply message needs to be explicitly mapped using ConfigureHowToFindSaga.

So if that is correct the correlation should happen.
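To make the quoted behaviour concrete, here is a minimal toy model in Python, not NServiceBus code. It assumes the correlation works roughly as the docs describe: the outgoing request carries the ID of the saga that sent it, and the reply copies that ID into a header used for the lookup. The header names reflect my understanding of what NServiceBus uses and should be treated as illustrative; `SagaStore`, `send_from_saga` and `reply_to` are hypothetical names invented for this sketch.

```python
import uuid

# Header names as I understand NServiceBus uses them; treat as illustrative.
ORIGINATING_SAGA_ID = "NServiceBus.OriginatingSagaId"
SAGA_ID = "NServiceBus.SagaId"


class SagaNotFound(Exception):
    """Raised when a message cannot be matched to a stored saga instance."""


class SagaStore:
    """Hypothetical in-memory stand-in for saga persistence."""

    def __init__(self):
        self.by_id = {}

    def start(self, data):
        # Persist a new saga instance and return its generated ID.
        saga_id = str(uuid.uuid4())
        self.by_id[saga_id] = data
        return saga_id

    def find(self, headers):
        # Look the saga up by header only; the message body is never inspected.
        saga_id = headers.get(SAGA_ID)
        if saga_id is None or saga_id not in self.by_id:
            raise SagaNotFound(saga_id)
        return self.by_id[saga_id]


def send_from_saga(saga_id, body):
    # A message sent from inside a saga is stamped with that saga's ID.
    return {"headers": {ORIGINATING_SAGA_ID: saga_id}, "body": body}


def reply_to(request, body):
    # The reply copies the originating saga ID into the lookup header.
    headers = {}
    origin = request["headers"].get(ORIGINATING_SAGA_ID)
    if origin is not None:
        headers[SAGA_ID] = origin
    return {"headers": headers, "body": body}


store = SagaStore()
saga_id = store.start({"state": "initializing"})
request = send_from_saga(saga_id, {"command": "Initialize"})
reply = reply_to(request, {})  # empty body: no correlation property needed
found = store.find(reply["headers"])  # resolves via the header alone
```

In this model a reply with an empty body still finds its saga, so the lookup can only fail if the header is missing or the instance was never stored. That would be consistent with the suspicion that the Saga is not being persisted, though the real pipeline has more moving parts than this sketch.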

Source: Sagas • NServiceBus • Particular Docs

That seems strange. I assume here you host multiple different endpoints in a single host process?

– Ramon

Yes, we have a suite of services that have 2-3 endpoints started per host process. We’ve seen this happen randomly (but fairly consistently) in most of these services. The other strange thing is that we can’t seem to reproduce it with a debugger attached. If we run the services locally, with no debugger and with LearningPersistence, it happens on every subsequent Saga invocation, i.e. it only works the first time.

The Sagas are SqlSagas.
The Handler only handles the 1 command message type.

Another assumption I have is that the assembly scanner is used to exclude handlers. Should we be using it to exclude message types as well? Could this be an issue of endpoints having their “wires crossed” on the Reply?

Our general configuration is RabbitMQ, SqlPersistence (MsSqlServer), Ninject, Outbox, JsonSerialization with a TopShelf Host.

The other thing I should add here is that the Reply message only has default values, meaning it has properties (including the correlation property the saga uses) but they are not assigned any values.

An update on this -

I believe this has nothing to do with the messaging or saga semantics and more to do with the ObjectBuilder. Brought up here previously.

Each endpoint instance (2 in the same host) has a Child Kernel that still has a reference to the root kernel, but I don’t see how any modifications to the Child Kernel could affect the Parent Kernel.

The Saga just won’t initiate the insert; nothing gets sent over the wire, even though it has an open connection to do so.

Again, this is only an issue for all subsequent requests. The first one works as expected.

Removing the NinjectBuilder and moving to a ServiceLocator pattern within the handlers resolves the issue. This is obviously not ideal and changing containers is not an option.

I haven’t gotten as far as the why and would really appreciate some basic things to try to get this working. We’re finalizing a licensing agreement with Particular as this is our initial implementation, and this has already created some doubts in our organization.

I hope this new info can lead to a little more activity on this post.

When hosting multiple endpoints in a single process, I would recommend excluding assemblies rather than types, and packaging assemblies per endpoint.

Pub/sub subscriptions are based on the handlers and sagas found by scanning that handle events. Yes, you can ‘wire’ things incorrectly if the exclusions are configured incorrectly, and that can be pretty annoying to fix, as you will need to remove these bindings in RabbitMQ.

Replies use the sender information from the original message. These do not use the event/command routes.
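As a rough illustration of that distinction, here is a small Python sketch, again a toy model rather than NServiceBus code. It assumes the request carries a reply-to address header (I believe NServiceBus calls it `NServiceBus.ReplyToAddress`, but treat the name as illustrative); the routing table, endpoint names and function names are hypothetical.

```python
# Header name as I understand NServiceBus uses it; treat as illustrative.
REPLY_TO = "NServiceBus.ReplyToAddress"


def destination_for_command(message_type, routes):
    """Commands are resolved through the configured routing table."""
    return routes[message_type]


def destination_for_reply(request_headers):
    """Replies go back to the address the incoming request declared."""
    return request_headers[REPLY_TO]


# Two hypothetical endpoints hosted in the same process.
routes = {"InitializeOrder": "Shipping"}

# A request sent by the Billing endpoint declares Billing as its reply address.
request_headers = {REPLY_TO: "Billing"}

command_target = destination_for_command("InitializeOrder", routes)
reply_target = destination_for_reply(request_headers)
```

Under these assumptions the reply target depends only on what the request declared, so scanner exclusions or event bindings would not change where a Reply is delivered.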

As stated earlier, if the reply is not multiple levels deep then the reply will be able to correlate to the right saga instance. It does not need a correlation property in the Reply message. However, if you do have that property I don’t know what the behavior is. I have to verify that.

– Ramon