I’m facing issues with delayed messages while using Amazon SQS as the transport layer.
We have thousands of delayed messages (in the FIFO queue) and this exception is being logged thousands of times per minute:
Amazon.SQS.AmazonSQSException: Request is throttled
---> Amazon.Runtime.Internal.HttpErrorResponseException: Exception of type 'Amazon.Runtime.Internal.HttpErrorResponseException' was thrown.
at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken)
at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
--- End of inner exception stack trace ---
at Amazon.Runtime.Internal.HttpErrorResponseExceptionHandler.HandleException(IExecutionContext executionContext, HttpErrorResponseException exception)
at Amazon.Runtime.Internal.ErrorHandler.ProcessException(IExecutionContext executionContext, Exception exception)
at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext)
at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext)
at NServiceBus.Transports.SQS.MessagePump.ConsumeDelayedMessages(ReceiveMessageRequest request, CancellationToken token)
Hi @ramonsmits.
This is an interesting idea, but how can I apply the RateLimiter only to the SQS FIFO queue (delayed messages)?
Our scenario is simple: we use SQS queues and handle one type of event that starts a saga, and the saga requests a timeout. I don’t want to limit the processing of these events or the starting of the saga, but we do need to rate-limit the SEND action of the timeout. Do you have any suggestions for this scenario?
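For context, what I had in mind is something along these lines: a custom outgoing pipeline behavior that throttles only the timeout sends. This is just a crude sketch using a SemaphoreSlim rather than the RateLimiter you mentioned, and it assumes the NServiceBus.IsSagaTimeoutMessage header is already visible at the IOutgoingSendContext stage; the class name and the limit are illustrative.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Pipeline;

class ThrottleSagaTimeoutSends : Behavior<IOutgoingSendContext>
{
    // Crude throttle: at most 10 timeout sends in flight at a time (illustrative value).
    static readonly SemaphoreSlim Throttle = new SemaphoreSlim(10);

    public override async Task Invoke(IOutgoingSendContext context, Func<Task> next)
    {
        // Only saga timeout requests carry this header; regular event handling is untouched.
        if (!context.Headers.ContainsKey(Headers.IsSagaTimeoutMessage))
        {
            await next();
            return;
        }

        await Throttle.WaitAsync();
        try
        {
            await next();
        }
        finally
        {
            Throttle.Release();
        }
    }
}

// Registration:
// endpointConfiguration.Pipeline.Register(new ThrottleSagaTimeoutSends(), "Throttles saga timeout sends");
```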
This is Mauro; I worked on the original implementation of the native delayed delivery feature in the SQS transport.
Starting from the end: no, the timeout manager configuration has no impact on the way native timeouts are handled by the transport.
To better understand your scenario, could you give us some insight into the types of timeouts you have? Are they short-lived or long-living? And were the “thousands of delayed messages” created by the system as expected, or are they the result of the throttling or some other strange behavior you’re observing?
Also, is the exception causing the endpoint to stop processing messages? Or is it causing timeouts to be delayed longer than expected? Or is it just appearing in the logs?
I have a hunch that the problem might be due to the way the transport reschedules timeouts if there are many pending in the FIFO queue.
I have both long-living (about 7 days) and short-lived (less than 15 minutes) timeouts.
For every saga we start, we request a long-living timeout; the handler of this timeout marks the saga as completed. So we do expect a significant volume of timeouts being requested.
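For concreteness, the pattern looks roughly like this (a minimal sketch with illustrative message and saga names; the real handlers do more work):

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

public class OrderSagaData : ContainSagaData
{
    public string OrderId { get; set; }
}

public class OrderSaga :
    Saga<OrderSagaData>,
    IAmStartedByMessages<OrderPlaced>,
    IHandleTimeouts<OrderExpired>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<OrderSagaData> mapper)
    {
        mapper.ConfigureMapping<OrderPlaced>(message => message.OrderId)
              .ToSaga(sagaData => sagaData.OrderId);
    }

    public Task Handle(OrderPlaced message, IMessageHandlerContext context)
    {
        Data.OrderId = message.OrderId;
        // The long-living timeout (~7 days); its handler completes the saga.
        return RequestTimeout<OrderExpired>(context, TimeSpan.FromDays(7));
    }

    public Task Timeout(OrderExpired state, IMessageHandlerContext context)
    {
        MarkAsComplete();
        return Task.CompletedTask;
    }
}

public class OrderPlaced : IEvent
{
    public string OrderId { get; set; }
}

public class OrderExpired
{
}
```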
The standard queue still processes messages, but at a slower rate than expected. The FIFO queue is also processing messages, but the timeouts are delayed longer than expected.
Please correct me if I’m wrong: the process of delaying messages with the SQS transport requires the FIFO queue metric ApproximateAgeOfOldestMessage to stay around 1800 seconds (30 minutes), and at the moment I’m seeing 10000 seconds (about 167 minutes), which means we are not processing the FIFO queue fast enough.
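In case it helps, this is roughly how I’m watching that metric, as a sketch only (AWSSDK.CloudWatch; the delays FIFO queue name passed in is an assumption on my side):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

class DelayQueueAgeCheck
{
    // Returns the maximum ApproximateAgeOfOldestMessage (in seconds) over the last hour.
    public static async Task<double> MaxAgeSecondsLastHour(string queueName)
    {
        using var cloudWatch = new AmazonCloudWatchClient();

        var response = await cloudWatch.GetMetricStatisticsAsync(new GetMetricStatisticsRequest
        {
            Namespace = "AWS/SQS",
            MetricName = "ApproximateAgeOfOldestMessage",
            Dimensions = new List<Dimension>
            {
                new Dimension { Name = "QueueName", Value = queueName } // e.g. the -delays.fifo queue
            },
            StartTimeUtc = DateTime.UtcNow.AddHours(-1),
            EndTimeUtc = DateTime.UtcNow,
            Period = 300, // 5-minute datapoints
            Statistics = new List<string> { "Maximum" }
        });

        var max = 0d;
        foreach (var datapoint in response.Datapoints)
        {
            max = Math.Max(max, datapoint.Maximum);
        }

        return max; // values well above ~1800 s mean the FIFO queue is falling behind
    }
}
```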
I can play with the concurrency for both standard and delayed messages if you have any ideas. I’m also requesting an increase to the AWS quota to reduce the number of throttled requests.
I have a hunch that the problem is related to the way native delayed deliveries work under the hood. SQS allows for a native delay of at most 15 minutes (900 seconds), so if you need to delay a message by 20 minutes, what we do is:
add a custom header that says delay by 20 minutes
delay the message once by 15 minutes
when the message reappears in the delay queue, we check for the aforementioned custom header and, if present, issue one more delay of 5 minutes
If the delay is 60 hours, we go in cycles of 15 minutes until the 60 hours have gone by (the sketch below makes the cycling concrete).
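A simplified illustration of that cycling (not the transport’s actual code): the total requested delay is broken into hops of at most 900 seconds, and each hop is one trip through the delays FIFO queue.

```csharp
using System;
using System.Collections.Generic;

static class DelayCycleSketch
{
    const int MaxNativeDelaySeconds = 900; // SQS native delay limit: 15 minutes

    // Yields the sequence of native delays used to cover the total requested delay.
    public static IEnumerable<int> Cycles(TimeSpan totalDelay)
    {
        var remaining = (long)totalDelay.TotalSeconds;
        while (remaining > 0)
        {
            var hop = (int)Math.Min(remaining, MaxNativeDelaySeconds);
            remaining -= hop;
            yield return hop; // each hop is one pass through the -delays.fifo queue
        }
    }
}

// Examples from the discussion:
//   Cycles(TimeSpan.FromMinutes(20)) -> 900, 300          (15 minutes, then 5 minutes)
//   Cycles(TimeSpan.FromHours(60))   -> 240 hops of 900 s (60 hours in 15-minute cycles)
```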
In a case where there are many timeouts in the delay queue that need to be rescheduled (or are expiring), I can see how SQS can start throttling the endpoint.
Does the above sound like the scenario you’re facing?
And your description of the scenario is quite right as well. I have only one thing to add: while handling the normal queue and requesting a timeout for the first time, we also hit the quota threshold on the SEND action to the FIFO queue.
I’m just thinking out loud here, but to overcome this quota limitation, we could send the timeout to another standard SQS queue, or even the same queue, and delay it for up to 15 minutes. I understand that the FIFO queue is also used to keep the order of the messages, but if we can guarantee that the message will be consumed within 15 minutes of the send time (by scaling horizontally), it should be fine, shouldn’t it?
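Roughly what I’m imagining, as a sketch only (this is not how the transport works today; the queue URL and names are placeholders), is using SQS’s native per-message delay, which standard queues support up to 900 seconds:

```csharp
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

static class StandardQueueDelaySketch
{
    public static Task SendDelayed(IAmazonSQS sqs, string standardQueueUrl, string body)
    {
        return sqs.SendMessageAsync(new SendMessageRequest
        {
            QueueUrl = standardQueueUrl,
            MessageBody = body,
            // Per-message DelaySeconds (up to 900) is only allowed on standard queues,
            // not on FIFO queues, which is why this would need a separate standard queue.
            DelaySeconds = 900
        });
    }
}
```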
The FIFO queue is needed to reduce to a minimum the risk of generating duplicate messages when timeouts are rescheduled. As of now, there is no way to route timeouts to a queue other than the default endpoint-name-delays.fifo.
We have released 5.0.1 and 4.4.1, which now batch the FIFO queue operations. This increases throughput (we have seen more than 6x improvement), reduces the cost of operations, and reduces the chances of getting throttled.
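For anyone curious why batching helps so much: SendMessageBatch accepts up to 10 entries per request, so the same message volume consumes roughly a tenth of the request quota. A hedged sketch of the shape of such a call with the AWS SDK (not the transport’s actual internals; the group and deduplication IDs are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

static class FifoBatchSendSketch
{
    public static async Task SendBatch(IAmazonSQS sqs, string fifoQueueUrl, IReadOnlyList<string> bodies)
    {
        // SQS allows at most 10 entries per SendMessageBatch request.
        foreach (var chunk in bodies.Select((body, index) => (body, index)).GroupBy(x => x.index / 10))
        {
            var entries = chunk.Select(x => new SendMessageBatchRequestEntry
            {
                Id = x.index.ToString(),                            // must be unique within the batch
                MessageBody = x.body,
                MessageGroupId = "delays",                          // required for FIFO queues
                MessageDeduplicationId = Guid.NewGuid().ToString()  // or rely on content-based deduplication
            }).ToList();

            var response = await sqs.SendMessageBatchAsync(new SendMessageBatchRequest
            {
                QueueUrl = fifoQueueUrl,
                Entries = entries
            });

            if (response.Failed.Count > 0)
            {
                // Retry or log the failed entries as appropriate.
            }
        }
    }
}
```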
Hi @ramonsmits, I modified my application to use fewer delayed messages and was able to reduce the number of requests by 30%. The new implementation didn’t solve the problem completely, so I’m working on a different method of scheduling messages, using a Redis sorted set to store them.
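The approach I’m exploring looks roughly like this (a sketch assuming StackExchange.Redis; the key and names are illustrative): each message goes into a sorted set scored by its due time, and a polling loop dispatches everything whose score has passed.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

static class RedisSchedulerSketch
{
    const string ScheduleKey = "scheduled-messages";

    // Store the serialized message with its due time (Unix timestamp) as the score.
    public static Task Schedule(IDatabase db, string serializedMessage, DateTimeOffset dueAt)
    {
        return db.SortedSetAddAsync(ScheduleKey, serializedMessage, dueAt.ToUnixTimeSeconds());
    }

    // Periodically called: dispatch and remove everything that is already due.
    public static async Task DispatchDue(IDatabase db, Func<string, Task> dispatch)
    {
        var now = DateTimeOffset.UtcNow.ToUnixTimeSeconds();
        var due = await db.SortedSetRangeByScoreAsync(ScheduleKey, double.NegativeInfinity, now);

        foreach (var entry in due)
        {
            await dispatch(entry); // e.g. send to the regular SQS input queue
            await db.SortedSetRemoveAsync(ScheduleKey, entry);
        }
    }
}
```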