I’m facing issues with delayed messages while using Amazon SQS as the transport layer.
We have thousands of delayed messages (in the FIFO queue) and this exception is being logged thousands of times per minute:
Amazon.SQS.AmazonSQSException: Request is throttled
---> Amazon.Runtime.Internal.HttpErrorResponseException: Exception of type 'Amazon.Runtime.Internal.HttpErrorResponseException' was thrown.
at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
--- End of inner exception stack trace ---
at Amazon.Runtime.Internal.HttpErrorResponseExceptionHandler.HandleException(IExecutionContext executionContext, HttpErrorResponseException exception)
at Amazon.Runtime.Internal.ErrorHandler.ProcessException(IExecutionContext executionContext, Exception exception)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
at NServiceBus.Transports.SQS.MessagePump.ConsumeDelayedMessages(ReceiveMessageRequest request, CancellationToken token)
Hi @ramonsmits.
This is an interesting idea, but how can I apply the RateLimiter only to the SQS FIFO queue (delayed messages)?
Our scenario is simple: we use SQS queues and handle one type of event that starts a saga, and the saga requests a timeout. I don’t want to limit the processing of these events or the starting of the saga, but we do need to rate-limit the SEND action of the timeout. Do you have any suggestions for this scenario?
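For context, what I had in mind is something along these lines: a custom outgoing pipeline behavior that throttles only the timeout sends. This is just a crude sketch using a SemaphoreSlim rather than the RateLimiter you mentioned, and it assumes the NServiceBus.IsSagaTimeoutMessage header is already visible at the IOutgoingSendContext stage; the class name and the limit are illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Pipeline;

class ThrottleSagaTimeoutSends : Behavior<IOutgoingSendContext>
{
    // Crude throttle: at most 10 timeout sends in flight at a time (illustrative value).
    static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10);

    public override async Task Invoke(IOutgoingSendContext context, Func<Task> next)
    {
        // Only saga timeout requests carry this header; regular event handling is untouched.
        if (!context.Headers.ContainsKey(Headers.IsSagaTimeoutMessage))
        {
            await next();
            return;
        }

        await Throttle.WaitAsync();
        try
        {
            await next();
        }
        finally
        {
            Throttle.Release();
        }
    }
}

// Registration:
// endpointConfiguration.Pipeline.Register(new ThrottleSagaTimeoutSends(), "Throttles saga timeout sends");
```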
This is Mauro; I worked on the original implementation of the native delayed delivery feature in the SQS transport.
Starting from the end: no, the timeout manager configuration has no impact on the way native timeouts are handled by the transport.
To better understand your scenario, could you give us some insight into the types of timeouts you have? Are they short-lived or long-living? And were the “thousands of delayed messages” created by the system as expected, or are they the result of the throttling or some other strange behavior you’re observing?
Also, is the exception causing the endpoint to stop processing messages? Or is it causing timeouts to be delayed longer than expected? Or is it just appearing in the logs?
I have a hunch that the problem might be due to the way the transport reschedules timeouts if there are many pending in the FIFO queue.
I have both long-living (about 7 days) and short-lived (less than 15 minutes) timeouts.
For every saga we start, we request a long-living timeout; the handler of this timeout marks the saga as completed. So we do expect a significant volume of timeouts being requested.
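For concreteness, the pattern looks roughly like this (a minimal sketch with illustrative message and saga names; the real handlers do more work):

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

public class OrderSagaData : ContainSagaData
{
    public string OrderId { get; set; }
}

public class OrderSaga :
    Saga<OrderSagaData>,
    IAmStartedByMessages<OrderPlaced>,
    IHandleTimeouts<OrderExpired>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderSagaData> mapper)
    {
        mapper.ConfigureMapping<OrderPlaced>(message => message.OrderId)
              .ToSaga(sagaData => sagaData.OrderId);
    }

    public Task Handle(OrderPlaced message, IMessageHandlerContext context)
    {
        Data.OrderId = message.OrderId;
        // The long-living timeout (~7 days); its handler completes the saga.
        return RequestTimeout<OrderExpired>(context, TimeSpan.FromDays(7));
    }

    public Task Timeout(OrderExpired state, IMessageHandlerContext context)
    {
        MarkAsComplete();
        return Task.CompletedTask;
    }
}

public class OrderPlaced : IEvent
{
    public string OrderId { get; set; }
}

public class OrderExpired
{
}
```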
The standard queue still processes messages, but at a slower rate than expected. The FIFO queue is also processing messages, but the timeouts are delayed longer than expected.
Please correct me if I’m wrong: the process of delaying messages with the SQS transport requires the FIFO queue metric ApproximateAgeOfOldestMessage to stay around 1800 seconds (30 minutes), and at the moment I’m seeing 10000 seconds (about 167 minutes), which means we are not processing the FIFO queue fast enough.
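In case it helps, this is roughly how I’m watching that metric, as a sketch only (AWSSDK.CloudWatch; the delays FIFO queue name passed in is an assumption on my side):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

class DelayQueueAgeCheck
{
    // Returns the maximum ApproximateAgeOfOldestMessage (in seconds) over the last hour.
    public static async Task<double> MaxAgeSecondsLastHour(string queueName)
    {
        using var cloudWatch = new AmazonCloudWatchClient();

        var response = await cloudWatch.GetMetricStatisticsAsync(new GetMetricStatisticsRequest
        {
            Namespace = "AWS/SQS",
            MetricName = "ApproximateAgeOfOldestMessage",
            Dimensions = new List<Dimension>
            {
                new Dimension { Name = "QueueName", Value = queueName } // e.g. the -delays.fifo queue
            },
            StartTimeUtc = DateTime.UtcNow.AddHours(-1),
            EndTimeUtc = DateTime.UtcNow,
            Period = 300, // 5-minute datapoints
            Statistics = new List<string> { "Maximum" }
        });

        var max = 0d;
        foreach (var datapoint in response.Datapoints)
        {
            max = Math.Max(max, datapoint.Maximum);
        }

        return max; // values well above ~1800 s mean the FIFO queue is falling behind
    }
}
```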
I can play with the concurrency for both standard and delayed messages if you have any ideas. I’m also requesting an increase to the AWS quota to reduce the number of throttled requests.
I have a hunch that the problem is related to the way native delayed deliveries work under the hood. SQS allows for a native delay of at most 15 minutes (900 seconds), so if you need to delay a message by 20 minutes, what we do is:
add a custom header that says delay by 20 minutes
delay the message once by 15 minutes
when the message reappears in the delay queue, we check for the aforementioned custom header and, if present, issue one more delay of 5 minutes
If the delay is 60 hours, we go in cycles of 15 minutes until the 60 hours have gone by (the sketch below makes the cycling concrete).
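A simplified illustration of that cycling (not the transport’s actual code): the total requested delay is broken into hops of at most 900 seconds, and each hop is one trip through the delays FIFO queue.

```csharp
using System;
using System.Collections.Generic;

static class DelayCycleSketch
{
    const int MaxNativeDelaySeconds = 900; // SQS native delay limit: 15 minutes

    // Yields the sequence of native delays used to cover the total requested delay.
    public static IEnumerable<int> Cycles(TimeSpan totalDelay)
    {
        var remaining = (long)totalDelay.TotalSeconds;
        while (remaining > 0)
        {
            var hop = (int)Math.Min(remaining, MaxNativeDelaySeconds);
            remaining -= hop;
            yield return hop; // each hop is one pass through the -delays.fifo queue
        }
    }
}

// Examples from the discussion:
//   Cycles(TimeSpan.FromMinutes(20)) -> 900, 300          (15 minutes, then 5 minutes)
//   Cycles(TimeSpan.FromHours(60))   -> 240 hops of 900 s (60 hours in 15-minute cycles)
```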
In a case where there are many timeouts in the delay queue that need to be rescheduled (or are expiring), I can see how SQS can start throttling the endpoint.
Does the above sound like the scenario you’re facing?
And your description of the scenario is quite right as well. I have only one thing to add: while handling the normal queue and requesting a timeout for the first time, we also hit the quota threshold on the SEND action to the FIFO queue.
I’m just thinking out loud here, but to overcome this quota limitation, we could send the timeout to another standard SQS queue, or even the same queue, and delay it for up to 15 minutes. I understand that the FIFO queue is also used to keep the order of the messages, but if we can guarantee that the message will be consumed within 15 minutes of the send time (by scaling horizontally), it should be fine, shouldn’t it?
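Roughly what I’m imagining, as a sketch only (this is not how the transport works today; the queue URL and names are placeholders), is using SQS’s native per-message delay, which standard queues support up to 900 seconds:

```csharp
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

static class StandardQueueDelaySketch
{
    public static Task SendDelayed(IAmazonSQS sqs, string standardQueueUrl, string body)
    {
        return sqs.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl = standardQueueUrl,
            MessageBody = body,
            // Per-message DelaySeconds (up to 900) is only allowed on standard queues,
            // not on FIFO queues, which is why this would need a separate standard queue.
            DelaySeconds = 900
        });
    }
}
```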
The FIFO queue is needed to reduce to a minimum the risk of generating duplicate messages when timeouts are rescheduled. As of now, there is no way to route timeouts to a queue other than the default endpoint-name-delays.fifo.
We have released 5.0.1 and 4.4.1, which now batch the FIFO queue operations. This increases throughput (we have seen more than 6x improvement), reduces the cost of operations, and reduces the chances of getting throttled.
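For anyone curious why batching helps so much: SendMessageBatch accepts up to 10 entries per request, so the same message volume consumes roughly a tenth of the request quota. A hedged sketch of the shape of such a call with the AWS SDK (not the transport’s actual internals; the group and deduplication IDs are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

static class FifoBatchSendSketch
{
    public static async Task SendBatch(IAmazonSQS sqs, string fifoQueueUrl, IReadOnlyList<string> bodies)
    {
        // SQS allows at most 10 entries per SendMessageBatch request.
        foreach (var chunk in bodies.Select((body, index) => (body, index)).GroupBy(x => x.index / 10))
        {
            var entries = chunk.Select(x => new SendMessageBatchRequestEntry
            {
                Id = x.index.ToString(),                            // must be unique within the batch
                MessageBody = x.body,
                MessageGroupId = "delays",                          // required for FIFO queues
                MessageDeduplicationId = Guid.NewGuid().ToString()  // or rely on content-based deduplication
            }).ToList();

            var response = await sqs.SendMessageBatchAsync(new SendMessageBatchRequest
            {
                QueueUrl = fifoQueueUrl,
                Entries = entries
            });

            if (response.Failed.Count > 0)
            {
                // Retry or log the failed entries as appropriate.
            }
        }
    }
}
```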
Hi @ramonsmits, I modified my application to use fewer delayed messages and was able to reduce the number of requests by 30%. The new implementation didn’t solve the problem completely, so I’m working on a different method of scheduling messages, using a Redis sorted set to store them.
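The approach I’m exploring looks roughly like this (a sketch assuming StackExchange.Redis; the key and names are illustrative): each message goes into a sorted set scored by its due time, and a polling loop dispatches everything whose score has passed.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

static class RedisSchedulerSketch
{
    const string ScheduleKey = "scheduled-messages";

    // Store the serialized message with its due time (Unix timestamp) as the score.
    public static Task Schedule(IDatabase db, string serializedMessage, DateTimeOffset dueAt)
    {
        return db.SortedSetAddAsync(ScheduleKey, serializedMessage, dueAt.ToUnixTimeSeconds());
    }

    // Periodically called: dispatch and remove everything that is already due.
    public static async Task DispatchDue(IDatabase db, Func<string, Task> dispatch)
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        var due = await db.SortedSetRangeByScoreAsync(ScheduleKey, double.NegativeInfinity, now);

        foreach (var entry in due)
        {
            await dispatch(entry); // e.g. send to the regular SQS input queue
            await db.SortedSetRemoveAsync(ScheduleKey, entry);
        }
    }
}
```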