We are using NServiceBus Sagas with SQL Persistence on SQL Server and we are experiencing deadlocks between SELECT (Saga operations) and DELETE.
We are considering enabling READ_COMMITTED_SNAPSHOT (RCSI).
Given that only SQL Transport uses READPAST, and Saga Persistence does not, is RCSI officially supported and safe for Saga Persistence?
READPAST makes sense for SQLT as there is no point to read/process the same row/message concurrently. As sagas use a correlation ID that is unique the reader always want that row. Even if it is locked.
As sagas are updated pretty much any SELECT will also do either an UPDATE or DELETE.
Its not entirely clear how your deadlocks occur. If that is because of a lot of concurrency on the saga saga instance or because the database is very busy and deadlocks occur due to lock escalation.
With is it “safe” I assume you’re asking can enabling READ_COMMITTED_SNAPSHOT result in consistency issue. If that is your question that is dependent on your choice of transport and if you’re using outbox. When the persistence/database transaction commits and succeeds NServiceBus will dispatch the messages to the transport and if anything fails here the incoming message is retried unless you are using SQL Persistence with SQL Transport and use connection sharing OR you are using distributed transactions between transport and persistence which these days is not so common.
If you are having a lot of deadlocks due to concurrency then READ_COMMITTED_SNAPSHOT will remove deadlocks but you’re just trading them for error 3960 “Snapshot isolation transaction aborted due to update conflict.”
Its important to understand why the deadlocks are happening
-
What transport are you using with SQL Persistence?
-
Can you share the deadlock graphs?
-
What’s the frequency? - Are we talking 5 deadlocks per day or 5000? Is this a recent change or has it always been this bad? When did it start?
-
How many saga instances/rows exist?
-
Are you doing IO in your saga(s) as accessing apis’s or databases?
-
Do you have outbox enabled?
Thanks a lot for the detailed response and for taking the time to break this down
After discussing this internally with my team, we realized that our license agreement includes full support. Given that we will need to share deadlock graphs and other potentially sensitive information (concurrency details, schema, workload characteristics), we think it is more appropriate to continue this discussion via an official support ticket.
We will open a support ticket with the same headline as this topic. In the ticket, we will:
-
Answer all of the questions you raised (transport, outbox usage, concurrency, saga count, frequency)
-
Provide additional context about our setup
-
Attach the deadlock graphs and related diagnostics
Thanks again for the guidance, we will follow up shortly through support.