I am working on a distributed system using NServiceBus and am looking for guidance on managing long-running processes effectively. Specifically, I am dealing with a scenario where tasks take several minutes to complete and could span multiple retries if something fails.
Timeout Management: What’s the best way to handle timeouts without causing unnecessary delays or overwhelming the system?
Saga Design: Are there any specific patterns or anti-patterns to follow when implementing sagas for such processes?
Error Handling and Retries: How do you ensure retries don’t cause duplicate processing or degrade system performance?
Scaling: What’s the recommended way to scale these processes while maintaining reliability?
If anyone has experience with similar scenarios or can share resources, it would be greatly appreciated. I am also interested in how to measure and monitor these processes for optimization.
Both are examples of how to avoid dealing with transactions. If you are using Azure Service Bus as the NServiceBus transport, you could leverage the feature that automatically renews the message lock. I would not suggest relying on it; it’s primarily designed for borderline scenarios, not for processes that are well known to take much longer than the default timeout.
The simple answer would be to not rely on timeouts at all. To give a better answer, we need to look more closely at the design. Let’s imagine you have an incoming message that kicks off the process of rendering a PDF, which takes 15 minutes. That’s longer than any possible transaction, and even if you could extend the lock for that long, it makes little sense to hold a resource for such a long time.
Ideally, you want to design two separate components: a coordination saga and a worker.
The saga knows about the process and delegates the execution to the worker. The saga handles the message requesting the PDF: it receives the request, records in its saga data that the process has started, and sends a message to the worker. At the same time, the saga sets a deadline for the process to complete by requesting a timeout for itself (the RequestTimeout saga API).
The worker is configured to use no transactions and consumes the message from the saga. It starts rendering the PDF, and let’s say that, for the sake of the example, it sends a message to the saga for every rendered page.
Each time the saga receives a PageRenderedMessage, it can keep track of the amount of completed work in its saga data. When the timeout expires, the saga can decide if it’s worth waiting more time or raising a business error. For example, if the last PageRenderedMessage was received within a reasonable amount of time, the saga could decide to wait longer; if it was long ago, the saga might choose to consider the worker not responding and escalate the issue further.
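To make the decision logic above concrete, here is a minimal, library-agnostic sketch in Python. It is not NServiceBus code: the class names, the `outbox` list standing in for sent messages, and the `deadline` and `max_silence` parameters are all assumptions made for illustration. It only shows how a saga could use the time of the last progress report, rather than the timeout alone, to choose between waiting and escalating.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RenderSagaData:
    """State the saga would persist between messages (hypothetical names)."""
    started_at: Optional[datetime] = None
    last_progress_at: Optional[datetime] = None
    pages_rendered: int = 0
    escalated: bool = False


class RenderPdfSaga:
    def __init__(self, deadline: timedelta, max_silence: timedelta):
        self.data = RenderSagaData()
        self.deadline = deadline        # overall deadline set when the work starts
        self.max_silence = max_silence  # longest acceptable gap between progress reports
        self.outbox = []                # stand-in for messages the saga would send

    def handle_render_requested(self, now: datetime) -> None:
        # Record that the process started, delegate to the worker, set the deadline.
        self.data.started_at = now
        self.data.last_progress_at = now
        self.outbox.append(("send_to_worker", "RenderPdf"))
        self.outbox.append(("request_timeout", now + self.deadline))

    def handle_page_rendered(self, now: datetime) -> None:
        # Each progress report updates the amount of completed work.
        self.data.pages_rendered += 1
        self.data.last_progress_at = now

    def handle_timeout(self, now: datetime) -> str:
        # Recent progress: wait longer. Long silence: treat the worker as unresponsive.
        silence = now - self.data.last_progress_at
        if silence <= self.max_silence:
            self.outbox.append(("request_timeout", now + self.max_silence))
            return "wait"
        self.data.escalated = True
        self.outbox.append(("publish", "RenderPdfStalled"))
        return "escalate"
```

In a real endpoint the saga data would be persisted by the framework and the timeout would arrive as a message; the sketch passes `now` explicitly so the decision is easy to follow and test.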
To summarize: if workers can report progress back to the coordinating saga, have them do so. The saga can then make better-informed decisions instead of blindly relying on the timeout to decide what to do.
By designing the process without using transactions, you explicitly avoid having duplicate messages requesting to kick off the long-running background process. And if the saga fails to handle the message and the NServiceBus retry logic kicks in, you can be sure the resource-consuming process has not started yet.
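As an illustration of keeping a repeated request from kicking off the work twice, here is a small Python sketch. It assumes the saga data carries a `started` flag, as in the "stores in its saga data that the process started" step described earlier; `StartGuardSaga` and the `sent` list are hypothetical stand-ins, not NServiceBus types.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StartGuardSaga:
    started: bool = False  # persisted saga state (hypothetical)
    sent: List[Tuple[str, str]] = field(default_factory=list)  # stand-in for sent messages

    def handle_render_requested(self, document_id: str) -> bool:
        """Return True only when this request actually starts the work."""
        if self.started:
            # A duplicate or redelivered request: the worker was already asked once.
            return False
        self.started = True
        self.sent.append(("RenderPdf", document_id))
        return True
```

Calling `handle_render_requested` twice for the same saga instance leaves a single `RenderPdf` entry in `sent`, so the expensive work is requested only once.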
With the design presented above, scaling means something different depending on whether we’re talking about the saga or the workers. Scaling the saga is no problem: deploy more endpoint instances. In the end, the saga does very little.
For workers, on the other hand, it depends on the type of tasks they can perform. If they are entirely interchangeable and dynamic resource allocation is not a concern, they can use the competing consumer approach. If they are not, you need to implement some logic in the coordinating saga to keep track of which worker is busy and explicitly address an available worker to perform the task. The competing consumer approach should be preferred whenever possible.
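For the case where workers are not interchangeable, the bookkeeping the coordinator needs can be sketched roughly as follows. This is a plain Python illustration under assumed names (`WorkerRegistry`, `assign`, `release`); in practice this state would live in the saga data and the returned worker id would be used to address a message to that specific worker.

```python
from typing import Dict, Iterable, Optional


class WorkerRegistry:
    """Tracks which worker is busy with which job (hypothetical helper)."""

    def __init__(self, worker_ids: Iterable[str]):
        # None means the worker is idle; otherwise the value is the job it is running.
        self.jobs: Dict[str, Optional[str]] = {w: None for w in worker_ids}

    def assign(self, job_id: str) -> Optional[str]:
        """Pick the first idle worker for the job, or return None if all are busy."""
        for worker_id, current in self.jobs.items():
            if current is None:
                self.jobs[worker_id] = job_id
                return worker_id
        return None

    def release(self, worker_id: str) -> None:
        """Mark a worker as idle again once it reports completion or is given up on."""
        self.jobs[worker_id] = None
```

When `assign` returns `None`, the coordinator has to queue or defer the job itself, which is part of why the competing consumer approach is simpler whenever it applies.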
For measuring and monitoring, you could add business-level OpenTelemetry metrics and traces to the coordinating saga and the workers to keep track of the overall execution time of the long-running background tasks.