Implementing Saga Timeouts

Steve123 · November 1, 2018, 7:45pm

In reference to the RetailDemo in the Particular documentation and sagas, how would a timeout be used if the shipping logic took too long to complete? The documentation says this about sagas and timeouts:

“In the next lesson ( Coming Soon ) we’ll see how using timeouts enables us to add the dimension of time to our business policies, allowing us to send messages into the future to wake up our saga and take action, even if nothing else is happening.”

There is some documentation about saga timeouts here:

but I do not fully understand the code snippet on that page, or how I would apply it to the RetailDemo if shipping were to take too long. Also, what if an exception is thrown during shipping instead of a timeout, and all recovery fails? I would like to publish a ShippingCanceled message if recovery fails. Failures can be caused by never recovering as well as by timeouts on the first attempt to ship.

SzymonPobiega · November 8, 2018, 11:06am

Hi Steve

In that lesson the ShipOrder message is handled by a ShipOrderHandler that does nothing. We can imagine that this handler actually invokes a shipping company web service, providing the package data and the pickup time and location. If everything goes fine, the business process is over.

Now what happens if the shipping company web service is down? If ShipOrder is handled by an ordinary handler and that handler does not do any error handling, the built-in NServiceBus recoverability feature makes sure that the ShipOrder command processing is retried. By default there are several immediate attempts to process it (all fail because the web service is down), and then the message processing is delayed by some time. When that time passes, NServiceBus attempts to process the message again. If the delayed attempts fail as well, the message ends up in the error queue, where it is picked up by ServiceControl. It can be returned to the queue via ServicePulse.
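For reference, the retry behaviour described above is configurable per endpoint. A minimal sketch, assuming NServiceBus 7 and the Shipping endpoint name from the RetailDemo; the helper class name and the retry counts are illustrative, and transport and persistence setup are left out, so this is not a complete endpoint configuration:

```csharp
using System;
using NServiceBus;

// Hypothetical helper class, used here only to hold the configuration code.
static class ShippingEndpointConfig
{
    public static EndpointConfiguration Create()
    {
        var endpointConfiguration = new EndpointConfiguration("Shipping");

        var recoverability = endpointConfiguration.Recoverability();

        // Immediate retries: the handler is invoked again straight away.
        recoverability.Immediate(immediate => immediate.NumberOfRetries(5));

        // Delayed retries: wait before trying again, waiting longer each round.
        recoverability.Delayed(delayed =>
        {
            delayed.NumberOfRetries(3);
            delayed.TimeIncrease(TimeSpan.FromSeconds(10));
        });

        // Once all retries are used up, the message is moved to the error queue.
        return endpointConfiguration;
    }
}
```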

The failure handling mechanism described above is usually appropriate for technical failures, e.g. when there is a deadlock or the whole database goes down. In such cases the built-in retry mechanism is usually able to resolve the problem. In the case of an external provider (such as a shipping company) there are likely business rules around downtime, e.g. if UPS does not respond within 30 minutes, use FedEx (or vice versa). In that case you don’t want to manually retry the message from ServiceControl. You need to handle the ShipOrder command in a dedicated saga. Here’s what that saga can look like.

The saga is started by the ShipOrder command and immediately sends the ArrangeParcelPickup message to the UPS integration endpoint. Given that UPS unavailability is a problem we want to handle at the business (not infrastructure) level, the handler in that endpoint does something like this:

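try
{
   // Call the shipping company's web service (CallUPS stands in for that call).
   var result = CallUPS();
   return context.Reply(new ParcelPickupConfirmed(result));
}
catch (Exception)
{
   // Report the outage back to the saga as a message instead of letting NServiceBus retry.
   return context.Reply(new ShippingServiceUnavailable());
}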

The shipping saga reacts to the ParcelPickupConfirmed message (the happy path) by completing the process. It handles ShippingServiceUnavailable by either scheduling a timeout (if the 30 minute window has not elapsed yet) or switching to FedEx:

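public Task Handle(ShippingServiceUnavailable msg, IMessageHandlingContext context)
{
   Data.UPSAttempts++;
   if (Data.UPSAttempts > 5)
   {
      // UPS has been unavailable for the whole window: hand the pickup over to FedEx.
      return context.Send("FedEx", new ArrangeParcelPickup());
   }
   // Otherwise wake the saga up again in 5 minutes to retry UPS.
   return RequestTimeout<RetryArrangeParcelPickup>(context, TimeSpan.FromMinutes(5));
}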

The result is that the saga attempts to contact UPS every 5 minutes for 30 minutes and, if that fails, it switches to a different shipping company. The saga can be easily generalised so that it uses a list of shipping companies arranged in order of preference.
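One way to picture that generalisation is to pull the decision out of the handler into a small policy. This is only a sketch: the CarrierPolicy class, the ShippingAction enum and the carrier list are hypothetical names, and publishing ShippingCanceled once the list is exhausted is an assumption based on the original question, not something the tutorial defines.

```csharp
using System;
using System.Collections.Generic;

public enum ShippingAction { RetrySameCarrier, SwitchCarrier, CancelShipping }

public static class CarrierPolicy
{
    // Hypothetical preference list; the thread only mentions UPS and FedEx.
    public static readonly IReadOnlyList<string> Carriers = new[] { "UPS", "FedEx" };

    // Same threshold as the handler above: move on after more than 5 failed attempts.
    public const int MaxAttemptsPerCarrier = 5;

    // Decides what the saga should do after a carrier reports it is unavailable.
    // carrierIndex: position of the current carrier in Carriers.
    // failedAttempts: failures recorded for that carrier, including the latest one.
    public static ShippingAction Decide(int carrierIndex, int failedAttempts)
    {
        if (failedAttempts <= MaxAttemptsPerCarrier)
        {
            return ShippingAction.RetrySameCarrier;   // request another timeout
        }

        return carrierIndex + 1 < Carriers.Count
            ? ShippingAction.SwitchCarrier            // send ArrangeParcelPickup to the next carrier
            : ShippingAction.CancelShipping;          // no carriers left: publish ShippingCanceled
    }
}
```

The saga would then keep the current carrier index and attempt count in its data, reset the count when it switches carriers, and act on whatever Decide returns.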

The key points here are:

The unavailability of a third party web service should be dealt with at the business level (code in the handler) rather than at the infrastructure layer (built-in retries).
The saga should not directly call any web services. It should use messaging to orchestrate integration endpoints that themselves call the third party web service (single responsibility FTW!).
The timeouts in the saga can be used to define a sophisticated retry policy aligned with the business needs (in this example, retry every 5 minutes).
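Putting these points together, the whole saga could look roughly like the sketch below, including the timeout handler that actually performs the retry. It assumes NServiceBus 7. The OrderId properties, the empty message shapes, the UsingFedEx flag, the "UPS" and "FedEx" endpoint names as routing destinations, and the rule that FedEx gets a single attempt before ShippingCanceled is published are all assumptions made for illustration.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

// Message shapes are assumptions; payloads are left out for brevity.
public class ShipOrder : ICommand { public string OrderId { get; set; } }
public class ArrangeParcelPickup : ICommand { public string OrderId { get; set; } }
public class ParcelPickupConfirmed : IMessage { }
public class ShippingServiceUnavailable : IMessage { }
public class ShippingCanceled : IEvent { public string OrderId { get; set; } }
public class RetryArrangeParcelPickup { }   // timeout state, carries no data

public class ShippingSagaData : ContainSagaData
{
    public string OrderId { get; set; }
    public int UPSAttempts { get; set; }
    public bool UsingFedEx { get; set; }
}

public class ShippingSaga : Saga<ShippingSagaData>,
    IAmStartedByMessages<ShipOrder>,
    IHandleMessages<ParcelPickupConfirmed>,
    IHandleMessages<ShippingServiceUnavailable>,
    IHandleTimeouts<RetryArrangeParcelPickup>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<ShippingSagaData> mapper)
    {
        // Replies and timeouts are correlated automatically; only the starting message is mapped.
        mapper.ConfigureMapping<ShipOrder>(message => message.OrderId).ToSaga(saga => saga.OrderId);
    }

    public Task Handle(ShipOrder message, IMessageHandlingContext context)
    {
        // Data.OrderId is populated by NServiceBus from the mapping above.
        return context.Send("UPS", new ArrangeParcelPickup { OrderId = message.OrderId });
    }

    public Task Handle(ParcelPickupConfirmed message, IMessageHandlingContext context)
    {
        MarkAsComplete();   // happy path: the process is over
        return Task.CompletedTask;
    }

    public Task Handle(ShippingServiceUnavailable message, IMessageHandlingContext context)
    {
        if (Data.UsingFedEx)
        {
            // Assumption: FedEx gets one attempt; if it fails too, give up and tell the world.
            MarkAsComplete();
            return context.Publish(new ShippingCanceled { OrderId = Data.OrderId });
        }

        Data.UPSAttempts++;
        if (Data.UPSAttempts > 5)
        {
            Data.UsingFedEx = true;
            return context.Send("FedEx", new ArrangeParcelPickup { OrderId = Data.OrderId });
        }
        return RequestTimeout<RetryArrangeParcelPickup>(context, TimeSpan.FromMinutes(5));
    }

    public Task Timeout(RetryArrangeParcelPickup state, IMessageHandlingContext context)
    {
        // The timeout fired: ask the UPS integration endpoint to try again.
        return context.Send("UPS", new ArrangeParcelPickup { OrderId = Data.OrderId });
    }
}
```

Because the integration handler turns exceptions into ShippingServiceUnavailable replies, an exception from the carrier and a carrier that never recovers both end up on the same path through this saga.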

Hope it makes sense,
Szymon