Saga Deadlock Problem

We recently kicked off a saga that processes a file in which each line is a financial transaction. The initial saga sends a message for each line in the file, and each of these messages starts a new saga of a different type that processes that individual record.
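
Roughly, the record-level side looks like the sketch below. This is a minimal sketch with hypothetical type and message names (ProcessRecord, ProcessRecordSaga, and so on), not our actual code; the correlation property SapDocumentId corresponds to the Correlation_SapDocumentId column discussed later in this thread, and the file-level saga simply does a `context.Send(new ProcessRecord { ... })` per line it reads.

```csharp
// Hedged sketch of the record-level saga described above. Type, message, and
// property names are illustrative assumptions, not the real ones.
using System.Threading.Tasks;
using NServiceBus;

public class ProcessRecord : ICommand
{
    public string SapDocumentId { get; set; }
    public string Payload { get; set; }
}

public class ProcessRecordSagaData : ContainSagaData
{
    // Correlation property; with SQL persistence this becomes the
    // Correlation_SapDocumentId column in the saga table.
    public string SapDocumentId { get; set; }
}

public class ProcessRecordSaga : Saga<ProcessRecordSagaData>,
    IAmStartedByMessages<ProcessRecord>
{
    protected override void ConfigureHowToFindSaga(SagaPropertyMapper<ProcessRecordSagaData> mapper)
    {
        // One saga instance per SapDocumentId, i.e. per line in the file.
        mapper.ConfigureMapping<ProcessRecord>(message => message.SapDocumentId)
              .ToSaga(sagaData => sagaData.SapDocumentId);
    }

    public Task Handle(ProcessRecord message, IMessageHandlerContext context)
    {
        // Data.SapDocumentId is populated from the mapping when the saga is started.
        // Process the financial transaction represented by this single record here.
        return Task.CompletedTask;
    }
}
```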

Early in the process we got about 10 failures in the record-level saga due to database deadlocks while trying to find the saga. Our database is SQL Server and our message transport is Azure Service Bus. The stack trace is below. We have separate databases for saga persistence and our business (transactional) data, and we do not use distributed transactions.
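
For reference, the endpoint is wired up roughly like this. This is a simplified sketch with placeholder names and connection strings; the exact transport configuration API depends on the NServiceBus version in use.

```csharp
// Rough sketch of the endpoint setup described above (placeholder names and
// connection strings; the transport API shape varies between NServiceBus versions).
using Microsoft.Data.SqlClient;
using NServiceBus;

static class EndpointSetup
{
    public static EndpointConfiguration Configure(string asbConnectionString, string sagaDbConnectionString)
    {
        var endpointConfiguration = new EndpointConfiguration("RecordProcessor");

        // Azure Service Bus transport; no distributed transactions are used.
        var transport = endpointConfiguration.UseTransport<AzureServiceBusTransport>();
        transport.ConnectionString(asbConnectionString);

        // SQL persistence for sagas, in a database separate from the business data.
        var persistence = endpointConfiguration.UsePersistence<SqlPersistence>();
        persistence.SqlDialect<SqlDialect.MsSqlServer>();
        persistence.ConnectionBuilder(() => new SqlConnection(sagaDbConnectionString));

        return endpointConfiguration;
    }
}
```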

The good news is that the logs indicate the messages were retried and everything turned out okay, but we would like to understand why the deadlocks happened.

```
Timestamp=2024-07-08 16:30:21|Level=Information|Message=Immediate Retry is going to retry message '6ec9b341-3b64-47ad-9f25-b1a70172c24e' because of an exception:|TraceId=|Exception=Microsoft.Data.SqlClient.SqlException (0x80131904): Transaction (Process ID 477) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
   at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
   at Microsoft.Data.SqlClient.SqlDataReader.ReadAsyncExecute(Task task, Object state)
   at Microsoft.Data.SqlClient.SqlDataReader.InvokeAsyncCall[T](AAsyncCallContext`1 context)
--- End of stack trace from previous location ---
   at SagaPersister.GetSagaData[TSagaData](SynchronizedStorageSession session, String commandText, RuntimeSagaInfo sagaInfo, ParameterAppender appendParameters) in /_/src/SqlPersistence/Saga/SagaPersister_Get.cs:line 104
   at SagaPersister.Get[TSagaData](String propertyName, Object propertyValue, SynchronizedStorageSession session, ContextBag context) in /_/src/SqlPersistence/Saga/SagaPersister_Get.cs:line 32
   at NServiceBus.PropertySagaFinder`1.Find(IBuilder builder, SagaFinderDefinition finderDefinition, SynchronizedStorageSession storageSession, ContextBag context, Object message, IReadOnlyDictionary`2 messageHeaders) in /_/src/NServiceBus.Core/Sagas/PropertySagaFinder.cs:line 43
   at NServiceBus.SagaPersistenceBehavior.Invoke(IInvokeHandlerContext context, Func`2 next) in /_/src/NServiceBus.Core/Sagas/SagaPersistenceBehavior.cs:line 164
   at Ipc.Util.Capacitor.Behaviors.IncomingLoggingBehavior.Invoke(IInvokeHandlerContext context, Func`1 next)
   at NServiceBus.LoadHandlersConnector.Invoke(IIncomingLogicalMessageContext context, Func`2 stage) in /_/src/NServiceBus.Core/Pipeline/Incoming/LoadHandlersConnector.cs:line 42
   at NServiceBus.ScheduledTaskHandlingBehavior.Invoke(IIncomingLogicalMessageContext context, Func`2 next) in /_/src/NServiceBus.Core/Scheduling/ScheduledTaskHandlingBehavior.cs:line 23
   at NServiceBus.InvokeSagaNotFoundBehavior.Invoke(IIncomingLogicalMessageContext context, Func`2 next) in /_/src/NServiceBus.Core/Sagas/InvokeSagaNotFoundBehavior.cs:line 16
   at NServiceBus.DeserializeMessageConnector.Invoke(IIncomingPhysicalMessageContext context, Func`2 stage) in /_/src/NServiceBus.Core/Pipeline/Incoming/DeserializeMessageConnector.cs:line 34
   at ReceivePerformanceDiagnosticsBehavior.Invoke(IIncomingPhysicalMessageContext context, Func`2 next) in D:\a\NServiceBus.Metrics\NServiceBus.Metrics\src\NServiceBus.Metrics\ProbeBuilders\ReceivePerformanceDiagnosticsBehavior.cs:line 27
   at Ipc.Util.Capacitor.Behaviors.TraceIncomingMessagesBehavior.Invoke(IIncomingPhysicalMessageContext context, Func`1 next)
   at NServiceBus.InvokeAuditPipelineBehavior.Invoke(IIncomingPhysicalMessageContext context, Func`2 next) in /_/src/NServiceBus.Core/Audit/InvokeAuditPipelineBehavior.cs:line 27
   at NServiceBus.ProcessingStatisticsBehavior.Invoke(IIncomingPhysicalMessageContext context, Func`2 next) in /_/src/NServiceBus.Core/Performance/Statistics/ProcessingStatisticsBehavior.cs:line 32
   at NServiceBus.TransportReceiveToPhysicalMessageConnector.Invoke(ITransportReceiveContext context, Func`2 next) in /_/src/NServiceBus.Core/Pipeline/Incoming/TransportReceiveToPhysicalMessageConnector.cs:line 61
   at NServiceBus.RetryAcknowledgementBehavior.Invoke(ITransportReceiveContext context, Func`2 next) in /_/src/NServiceBus.Core/ServicePlatform/Retries/RetryAcknowledgementBehavior.cs:line 46
   at NServiceBus.MainPipelineExecutor.Invoke(MessageContext messageContext) in /_/src/NServiceBus.Core/Pipeline/MainPipelineExecutor.cs:line 50
   at NServiceBus.TransportReceiver.InvokePipeline(MessageContext c) in /_/src/NServiceBus.Core/Transports/TransportReceiver.cs:line 66
   at NServiceBus.Transport.AzureServiceBus.MessagePump.ReceiveMessage(CancellationToken messageReceivingCancellationToken) in /_/src/Transport/Receiving/MessagePump.cs:line 298
ClientConnectionId:57bdaa88-245e-4059-b2ce-d6bae25495ba
Error Number:1205,State:51,Class:13
```

Hi @MarkBricker

Could you share a bit more information on your scenario:

  1. Are any other (NServiceBus-related) SQL statements executed on that table?
  2. Have you made any changes to the table and/or index structure?
  3. Is this a reproducible behavior or a one-off?
  4. Do I understand correctly that each of the messages creates a row in the saga table with a unique correlation property value?

Cheers,
Tomek

  1. There are no other SQL statements executed on the table by us. NServiceBus may do so behind the scenes, but we do not directly access any of the saga tables.
  2. We have not changed anything related to the saga tables or indexes.
  3. It seems to be a one-off behavior. We have run this process a few times before (it is not yet in production, we are still testing it) and did not see this behavior.
  4. You are correct - each line in the original file is turned into an NServiceBus message that starts a saga with its own correlation value.

We did capture some information about the deadlocked processes:

Hi @MarkBricker,

Thank you for sharing more information on the scenario. Before I try to answer your question, I want to note that the situation you have seen does not affect the consistency of your system. The fact that the messages were retried is exactly what one would expect from NServiceBus.
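
In case it is useful, the number of immediate (and delayed) retries that kick in for transient errors such as deadlock victims can be tuned through the recoverability API. The retry counts below are just example values, not a recommendation:

```csharp
// Example only: tuning immediate and delayed retries for transient failures
// such as deadlock victims; the values shown are arbitrary.
using System;
using NServiceBus;

static class RecoverabilityConfig
{
    public static void Apply(EndpointConfiguration endpointConfiguration)
    {
        var recoverability = endpointConfiguration.Recoverability();

        recoverability.Immediate(immediate =>
        {
            immediate.NumberOfRetries(3);
        });

        recoverability.Delayed(delayed =>
        {
            delayed.NumberOfRetries(2);
            delayed.TimeIncrease(TimeSpan.FromSeconds(10));
        });
    }
}
```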

When it comes to explaining what happened, I'm afraid I can only make guesses. One scenario that could lead to the effects you have observed is the following (a rough code sketch of the statements involved follows the list):

  1. Because the Correlation_SapDocumentId column is, by design, not part of the clustered index, each of the update statements needs to acquire two locks: first a lock on the Index_Correlation_ nonclustered index, and then a lock on the row itself.
  2. The winning transaction first takes a U lock on the index, then an X lock on the row, and finally releases the U lock on the index (while keeping the X lock).
  3. The losing transaction takes a U lock on the index and blocks, waiting for the X lock on the row.
  4. The winning transaction then tries to update the saga content (this happens after your handler executes and updates the saga data, when NServiceBus serializes it back to the table row), but it cannot get the index lock because the losing transaction holds it.
  5. We have a deadlock.

Important: all of this assumes that the locks involved are escalated from row locks to page or table locks. That should happen only rarely (e.g. with a very small table of just a few rows, or when the SQL Server instance is reaching its available resource cap).
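
To make the sequence above a bit more concrete, here is a rough approximation of the read-then-update pattern executed per message: the read locates the saga through the correlation index, and the write stores the updated saga data. The table name, column names, and lock hints are illustrative assumptions, not the exact SQL the persistence generates.

```csharp
// Illustrative approximation only - not the SQL that the persistence actually
// generates. It shows the read-then-update pattern whose lock ordering
// (correlation index first, then the row) is described in the steps above.
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class SagaAccessSketch
{
    public static async Task GetThenUpdateAsync(SqlConnection connection, SqlTransaction tx, string sapDocumentId, string updatedData)
    {
        // Steps 1-3: the read seeks the nonclustered Correlation_SapDocumentId index
        // (U lock) and then the clustered-index row (X lock once the update starts).
        const string getSql = @"
            SELECT Id, Data
            FROM dbo.RecordProcessingSaga WITH (UPDLOCK)
            WHERE Correlation_SapDocumentId = @correlationValue;";

        Guid sagaId;
        using (var get = new SqlCommand(getSql, connection, tx))
        {
            get.Parameters.AddWithValue("@correlationValue", sapDocumentId);
            using var reader = await get.ExecuteReaderAsync();
            if (!await reader.ReadAsync())
            {
                return; // no saga found; a new row would be inserted instead
            }
            sagaId = reader.GetGuid(0);
        }

        // Step 4: after the handler has run, the serialized saga data is written back,
        // which needs the row lock and (potentially) the index lock again.
        const string updateSql = @"
            UPDATE dbo.RecordProcessingSaga
            SET Data = @data
            WHERE Id = @id;";

        using var update = new SqlCommand(updateSql, connection, tx);
        update.Parameters.AddWithValue("@data", updatedData);
        update.Parameters.AddWithValue("@id", sagaId);
        await update.ExecuteNonQueryAsync();
    }
}
```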

I hope this helps, as a best-effort explanation.

Regards,
Tomek