We built resiliency using an NServiceBus saga to handle the case where a service goes down. We use a SQL database to store the failed messages, each with an additional unique identifier (CorrelationId), and we use a saga timeout to retry each failed message, checking whether the service is back up and the message has been submitted. Everything works as expected in non-prod.
In our production environment, we push messages from 16 boxes to NServiceBus, and the saga application code that handles those messages is deployed on 4 boxes. During a critical failure, we saw that the failed messages were duplicated (same message id with different CorrelationId).
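To make the symptom concrete, here is a minimal sketch of what the duplicated rows look like and how they can be detected. It is not our actual NServiceBus code: it uses Python with an in-memory SQLite table as a stand-in for the SQL database, and the table name, column names, and helper functions are hypothetical. It assumes a new CorrelationId is generated each time a failed message is stored, so handling the same message twice produces two rows.

```python
import sqlite3
import uuid


def create_store():
    # In-memory stand-in for the SQL table of failed messages (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE failed_messages ("
        " correlation_id TEXT PRIMARY KEY,"
        " message_id TEXT NOT NULL)"
    )
    return conn


def store_failed_message(conn, message_id):
    # Assumption: every store call generates a fresh CorrelationId,
    # so nothing prevents the same message id from being stored twice.
    correlation_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO failed_messages (correlation_id, message_id) VALUES (?, ?)",
        (correlation_id, message_id),
    )
    return correlation_id


def find_duplicates(conn):
    # Message ids that appear under more than one CorrelationId.
    return conn.execute(
        "SELECT message_id, COUNT(DISTINCT correlation_id)"
        " FROM failed_messages"
        " GROUP BY message_id"
        " HAVING COUNT(DISTINCT correlation_id) > 1"
        " ORDER BY message_id"
    ).fetchall()


conn = create_store()
store_failed_message(conn, "msg-1")
store_failed_message(conn, "msg-1")  # same message stored a second time
store_failed_message(conn, "msg-2")
duplicates = find_duplicates(conn)
print(duplicates)
```

In this sketch only `msg-1` is reported, with two distinct CorrelationIds, which matches what we see in production.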
We are not sure how those duplicates were created.
Also, is there a way to replicate this issue in a non-prod environment?
How do we solve this?