Retry entire saga when exception occurs

ponkarthik87 · February 16, 2022, 8:40pm

Hi,

I have the following scenario.

Components

• Token database
• Pricing Web service

To get the pricing data from the Pricing Web service, a token from the database is needed for authentication.

The design is implemented as follows:

1. A saga is started to correlate the request and the response.
2. The GetToken command is sent.
3. The GetToken command handler queries the token from the database and replies with it in a Token message.
4. In the Token message handler, the token is retrieved from the message and a GetPricing command is sent.
5. The GetPricing command handler makes the HTTP call to the pricing web service and gets an expired-token exception.

In this case, if we let this handler fail, retries will not help because the token has already expired.

If I wrap this exception in another message and reply to the saga:

   a. Is there a way for me to restart the saga?
   b. If yes, how can I do that?
   c. How do I configure how many times to retry?
   d. How do I give up on retrying and mark the message as poison?
   e. Is there any way to recover this poisoned message?

Any help on this is highly appreciated.

Please let me know if you need more details.

Thanks, Pon

Dennis · February 21, 2022, 1:01pm

Hi,

My first thought was: why get the token via a message? Why not send a GetPricing command and have the GetPricingCommandHandler retrieve the token itself and make the call?

My second thought, when reading about your retry issue, was that if you do it the way I just described, a new token will be retrieved on every single retry, so there’s no way you can end up with an expired token (or at least not because of the retry).

The reason I would do it like that is that the saga should orchestrate the business process, not technical (non-functional) requirements such as needing a token to be able to call a service.
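A minimal Python sketch of the approach described above, for illustration only: `GetPricingHandler`, `fetch_token`, `call_pricing_service` and `TokenExpiredError` are hypothetical names, and `handle_with_retries` is a plain loop standing in for the messaging framework’s immediate retries. Because the token is fetched inside the handler, each retry starts with a freshly retrieved token.

```python
from dataclasses import dataclass
from typing import Callable


class TokenExpiredError(Exception):
    """Hypothetical error raised when the pricing service rejects the token."""


@dataclass
class GetPricing:
    product_id: str


class GetPricingHandler:
    """Hypothetical handler that fetches its own token on every attempt."""

    def __init__(self, fetch_token: Callable[[], str],
                 call_pricing_service: Callable[[str, str], float]):
        self.fetch_token = fetch_token
        self.call_pricing_service = call_pricing_service

    def handle(self, message: GetPricing) -> float:
        # Fetching the token here means every retry starts with a fresh token.
        token = self.fetch_token()
        return self.call_pricing_service(token, message.product_id)


def handle_with_retries(handler: GetPricingHandler, message: GetPricing,
                        max_attempts: int = 3) -> float:
    """Stand-in for the messaging framework's immediate retries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return handler.handle(message)
        except TokenExpiredError:
            if attempt == max_attempts:
                raise  # the framework would move the message to the error queue
    raise ValueError("max_attempts must be at least 1")


# Usage: the token store hands out an expired token first, then a valid one.
tokens = iter(["expired-token", "fresh-token"])


def fake_pricing_service(token: str, product_id: str) -> float:
    if token == "expired-token":
        raise TokenExpiredError(product_id)
    return 9.99


handler = GetPricingHandler(lambda: next(tokens), fake_pricing_service)
price = handle_with_retries(handler, GetPricing("sku-1"))
print(price)  # 9.99
```

The first attempt fails with the expired token, and the retry succeeds without any saga involvement because it fetched a new token.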

Now, if there’s a very clear reason why you need to get the token this way (and if there is, I would really take a step back and reconsider it once more, maybe even twice), then there’s no need to start the saga again.

Have the GetPricingCommandHandler reply with one of two messages: HereIsThePriceResponse or CrapTheTokenExpiredResponse. There’s one specific reason your GetPricingCommandHandler blows up, and that’s the token. Catch that exception, swallow it, and send the second response, so that the saga knows it needs to request the token again.
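A rough Python sketch of that reply-based flow, assuming hypothetical classes named after the two responses above plus a `GetToken` message; a real saga would persist its state and send messages through the framework rather than appending them to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union


class TokenExpiredError(Exception):
    """Hypothetical error raised by the pricing client for an expired token."""


@dataclass
class HereIsThePriceResponse:
    price: float


@dataclass
class CrapTheTokenExpiredResponse:
    pass


@dataclass
class GetToken:
    pass


Reply = Union[HereIsThePriceResponse, CrapTheTokenExpiredResponse]


def get_pricing_handler(token: str,
                        call_pricing_service: Callable[[str], float]) -> Reply:
    """Handler side: turn only the token-expired failure into a reply."""
    try:
        return HereIsThePriceResponse(call_pricing_service(token))
    except TokenExpiredError:
        # Swallow this specific exception; any other error still fails the handler.
        return CrapTheTokenExpiredResponse()


@dataclass
class PricingSaga:
    """Saga side: react to whichever reply arrives."""
    price: Optional[float] = None
    completed: bool = False
    outbox: List[object] = field(default_factory=list)  # messages to send

    def handle(self, reply: Reply) -> None:
        if isinstance(reply, HereIsThePriceResponse):
            self.price = reply.price
            self.completed = True
        elif isinstance(reply, CrapTheTokenExpiredResponse):
            # Not a failed process: just ask the token endpoint again.
            self.outbox.append(GetToken())


# Usage
def fake_pricing_service(token: str) -> float:
    if token != "valid-token":
        raise TokenExpiredError(token)
    return 42.0


saga = PricingSaga()
saga.handle(get_pricing_handler("stale-token", fake_pricing_service))
saga.handle(get_pricing_handler("valid-token", fake_pricing_service))
```

Only the token-expired case becomes a reply; every other exception still goes through normal retries.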

Let me know if that helps.

MarcS · February 21, 2022, 1:59pm

I have implemented something similar, but I didn’t use a saga for it.
My message handler has a static property that holds the token needed to call the service.
The message handler first checks whether the token is null or expired. If so, it gets a new token and stores it in the static property.
After that (or if the token was available and still valid), it calls the service.
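A minimal Python sketch of that caching idea, assuming a hypothetical `Token` type with an expiry timestamp and injected `get_new_token` and `call_service` callables; a class attribute plays the role of the static property, and a fake clock is used so expiry can be simulated.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # seconds since the epoch


class PricingHandler:
    # Class attribute: shared by all instances in the process, the Python
    # counterpart of the static property described above. Not thread-safe;
    # concurrent handlers may occasionally fetch a token twice.
    _token: Optional[Token] = None

    def __init__(self, get_new_token: Callable[[], Token],
                 call_service: Callable[[str], float],
                 clock: Callable[[], float] = time.time):
        self.get_new_token = get_new_token
        self.call_service = call_service
        self.clock = clock

    def handle(self) -> float:
        cls = type(self)
        token = cls._token
        if token is None or token.expires_at <= self.clock():
            # Missing or expired: fetch a new one and cache it for later messages.
            token = self.get_new_token()
            cls._token = token
        return self.call_service(token.value)


# Usage with a fake clock so expiry can be simulated.
now = [1000.0]
issued: list = []
seen: list = []


def new_token() -> Token:
    issued.append(now[0])
    return Token(f"t{len(issued)}", expires_at=now[0] + 60)


def service(token_value: str) -> float:
    seen.append(token_value)
    return 1.0


handler = PricingHandler(new_token, service, clock=lambda: now[0])
handler.handle()
handler.handle()   # reuses the cached token
now[0] += 120      # the cached token is now expired
handler.handle()   # fetches a new token first
print(seen)  # ['t1', 't1', 't2']
```

Because the expiry check runs before every call, a retry of the same message re-checks the cached token and picks up a new one once the old one is past its expiry time.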

ponkarthik87 · February 22, 2022, 6:16pm

Thanks for your response, @Dennis. The thought process behind this design is that the token has its own database.

GetTokenCommandHandler resides in Token Endpoint
GetPricingCommandHandler resides in Pricing Endpoint

Wouldn’t directly accessing the token DB (maybe with the repository pattern) from the GetPricingCommandHandler create tight coupling? Today the tokens are retrieved from the DB; tomorrow they may be retrieved from a third-party service. With the current implementation, that change would not affect the pricing endpoint, because the GetToken handler resides in its own endpoint.

“Swallow the exception and send that response, so that the saga knows it needs to request the token again.” I like this idea, but if we do that, we are effectively doing manual retries, which means writing special resilience logic around it: how many times the system should retry, and how manual retries will work, given that the failed message handler will start with the same persisted data. For example, if I write the logic so that the saga fails after three failures, how can I reset the counter when I retry manually, so that the process continues?
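To make the cost of that hand-written logic concrete, here is a hedged Python sketch, assuming the counter lives in the saga’s persisted data and a hypothetical operator-triggered message (modeled as `on_manual_retry`) resets it; the limit of three and the message names are illustrative, not part of any framework.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PricingSagaData:
    token_attempts: int = 0
    failed: bool = False


@dataclass
class PricingSaga:
    max_token_attempts: int = 3  # hypothetical limit
    data: PricingSagaData = field(default_factory=PricingSagaData)
    sent: List[str] = field(default_factory=list)  # names of outgoing messages

    def on_token_expired(self) -> None:
        self.data.token_attempts += 1
        if self.data.token_attempts >= self.max_token_attempts:
            # Give up: record the failure and tell someone about it.
            self.data.failed = True
            self.sent.append("PricingFailed")
        else:
            self.sent.append("GetToken")

    def on_manual_retry(self) -> None:
        # Hypothetical operator message: reset the persisted counter and start over.
        self.data.token_attempts = 0
        self.data.failed = False
        self.sent.append("GetToken")


saga = PricingSaga()
for _ in range(3):
    saga.on_token_expired()
saga.on_manual_retry()
```

Every piece here (the counter, the give-up message, the reset message) is custom code that the framework’s built-in retries would otherwise handle.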

ponkarthik87 · February 22, 2022, 6:20pm

Thanks @MarcS for the reply. Can you please explain the benefit of holding the token in a static property?

“The message handler first checks if the token is null or expired. If that is the case it first gets a new token and stores it in the static property.” By doing this, I think we end up in the same loop I explained in my previous reply. Am I missing something?

Dennis · February 24, 2022, 2:49pm

The coupling is already there, putting messages in between doesn’t resolve the coupling. There’s no way you can receive the price without a proper token, not even if pigeons start delivering the tokens.

I have zero context about your scenario, but I hear that argument a lot and hardly anyone ever ends up having to act on it. Even then, abstract away the storage and retrieval of the tokens, and replace the implementation behind the abstraction when needed. The repository pattern could indeed be an option.
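A small Python sketch of that abstraction, assuming a hypothetical `TokenProvider` interface (here a `typing.Protocol`) with a database-backed and a third-party implementation; a dict and a lambda stand in for the real database and service, and the handler only builds the request instead of making the HTTP call.

```python
from typing import Callable, Dict, Protocol


class TokenProvider(Protocol):
    """The only thing the pricing handler depends on."""

    def get_token(self) -> str: ...


class DatabaseTokenProvider:
    """Hypothetical: reads the token from the token database (a dict stands in)."""

    def __init__(self, table: Dict[str, str]):
        self.table = table

    def get_token(self) -> str:
        return self.table["pricing"]


class ThirdPartyTokenProvider:
    """Hypothetical: would call an external token service (faked by a callable)."""

    def __init__(self, fetch: Callable[[], str]):
        self.fetch = fetch

    def get_token(self) -> str:
        return self.fetch()


class GetPricingHandler:
    def __init__(self, tokens: TokenProvider):
        self.tokens = tokens

    def handle(self, product_id: str) -> Dict[str, str]:
        # Builds the request the handler would send; the HTTP call itself is omitted.
        return {
            "url": f"/prices/{product_id}",
            "Authorization": f"Bearer {self.tokens.get_token()}",
        }


from_db = GetPricingHandler(DatabaseTokenProvider({"pricing": "db-token"}))
from_api = GetPricingHandler(ThirdPartyTokenProvider(lambda: "api-token"))
```

Moving from the database to a third-party service means swapping the provider passed to the handler; the handler code itself stays the same.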

TL;DR

Not really. But I’d probably never implement it this way in the first place. I’d rather retrieve the token directly inside the same endpoint.

The long answer:

Not really. Same answer as above, but here’s more explanation of possible scenarios. There are multiple possible ‘retries’, for lack of a better word.

• A customer needs to pay for an order but doesn’t have enough money.
• A customer needs to pay for an order, but the primary credit card fails and the customer has a secondary credit card.
• A customer needs to pay for an order, but the third-party service can’t be contacted.

The first is clearly a business process. The message handler processing the payment receives some sort of notification from the bank and needs to report this to whoever controls the order, in our case via a message. The handler responsible for the order could cancel the shipment, or email the customer about this and give them 10 days to enter a different payment method. It all depends on how the business wants to handle this.

The third is clearly a technical issue. Right?
But imagine that there’s a message handler waiting for a response. Then it’s not just a matter of “Try immediately 3 times, then 3 more times with delayed retries, and if all fails, move it to the error queue”. The business might want a response from the payment message handler within 10 minutes. A saga could be set up so that, if the payment message handler doesn’t respond within 10 minutes, we try something else, like giving up and emailing the customer.
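A hedged Python sketch of such a business timeout, assuming hypothetical `OrderSaga` state and message names; a real saga would ask its messaging framework to deliver a timeout message at the due time rather than having `on_timeout` called directly.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

PAYMENT_TIMEOUT = timedelta(minutes=10)  # the business rule from the example


@dataclass
class OrderSaga:
    payment_requested_at: datetime
    paid: bool = False
    gave_up: bool = False
    sent: List[str] = field(default_factory=lambda: ["ProcessPayment"])

    @property
    def timeout_at(self) -> datetime:
        # A real saga would ask the framework to deliver a timeout message
        # at this moment; this sketch only computes when it is due.
        return self.payment_requested_at + PAYMENT_TIMEOUT

    def on_payment_succeeded(self) -> None:
        # What to do with a payment that arrives after giving up is a business
        # decision; this sketch simply ignores it.
        if not self.gave_up:
            self.paid = True

    def on_timeout(self, now: datetime) -> None:
        if self.paid or now < self.timeout_at:
            return  # payment arrived in time, or the timeout fired early
        self.gave_up = True
        self.sent.append("EmailCustomerAboutPayment")


start = datetime(2022, 2, 24, 14, 0)

slow = OrderSaga(payment_requested_at=start)
slow.on_timeout(start + PAYMENT_TIMEOUT)   # no reply within 10 minutes

fast = OrderSaga(payment_requested_at=start)
fast.on_payment_succeeded()
fast.on_timeout(start + PAYMENT_TIMEOUT)   # already paid: nothing to do
```

The saga makes a business decision when the deadline passes, independently of how many technical retries the payment handler performs in the meantime.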

The second might not seem like a clear-cut business process, but it’s closer to your example. It might fail because there’s not enough money, but it might also fail because the third-party service for credit card 1 is unavailable. Do we differentiate between a technical failure and a lack of money here? The saga waiting for an answer might, again, make a decision of its own after 10 minutes. I would not have it wait for the first credit card to fail and then initiate the second credit card. That’s the responsibility of the payment endpoint, not of the saga inside the order endpoint, or whatever they’re called.

Even though the payment might fail for technical reasons (the service is unavailable) or because there’s not enough money, trying the second credit card is simply part of (potentially) a saga, with one or more messages being sent.

I would say the same for your scenario. When the token has expired, this should not go into retries, because it’s not a transient failure. It will never fix itself unless we retrieve another token, and we can only retrieve another token by letting the sender of the token (in your case, the saga) know that we need a new one.
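A Python sketch of that distinction as a retry policy, with hypothetical `TransientError` and `TokenExpiredError` types and string outcomes standing in for the framework’s recoverability behavior: transient failures are retried, while an expired token is reported back immediately without any retry.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical: a timeout or 503 that may succeed on the next attempt."""


class TokenExpiredError(Exception):
    """Hypothetical: retrying with the same token will fail every time."""


def run_with_retries(action: Callable[[], T],
                     max_attempts: int = 3) -> Tuple[str, Optional[T]]:
    """Retry only the failures that can fix themselves."""
    for attempt in range(1, max_attempts + 1):
        try:
            return ("ok", action())
        except TransientError:
            if attempt == max_attempts:
                return ("error-queue", None)  # retries exhausted
        except TokenExpiredError:
            # No retry: report back so whoever owns the token can replace it.
            return ("needs-new-token", None)
    raise ValueError("max_attempts must be at least 1")


def flaky(failures: int) -> Callable[[], float]:
    """Builds an action that fails transiently `failures` times, then succeeds."""
    calls = {"n": 0}

    def action() -> float:
        calls["n"] += 1
        if calls["n"] <= failures:
            raise TransientError()
        return 9.99
    return action
```

For example, `run_with_retries(flaky(2))` succeeds on the third attempt, while an action that raises `TokenExpiredError` is called exactly once.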

But again, I’d implement retrieving the token and retrieving the data inside the same endpoint.

Think of it like logging. Would you send log entries via messages and queues because otherwise the endpoint would be tightly coupled to the logging engine? The coupling is already there; don’t fight it.

ponkarthik87 · February 28, 2022, 11:17pm

Thank you @Dennis for your input. It really helps. We are thinking of redesigning this.