Using SQL Persistence and Outbox for DR scenarios

DorianGreen · February 26, 2021, 9:34pm

I need to account for DR while architecting my cloud solution on Azure.

It is easy to geo-replicate SQL databases and storage accounts across regions, but there is no native solution for Azure Service Bus (or any of the brokers really).

Since SQL Persistence + Outbox can guarantee Exactly-Once with deduplication, I thought it could be possible to use the Outbox records in the DR site to continue distributed workflows that got stuck due to abandoning the old site.

I am aware that SQL Geo-Replication will incur data loss in case of a failover, but that is acceptable in my domain.
What we would like, though, is that business flows either fully succeed or fully fail, without leaving the system in a corrupt state.

In the DR site I would read all the Outbox records for the last X minutes, deserialize the outgoing operations, and re-send all the messages via the raw transport.

My only problem with this approach is that the outgoing operations get removed from the Outbox record when it is marked as dispatched, and the SqlDialect is marked as internal, so I can't update the statement.
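
To make the idea concrete, here is a rough sketch of the read-and-resend step in Python. It is illustrative only: the table name OutboxData, the columns Dispatched, CreatedAt and Operations, the JSON shape of each operation, and the send callback are all hypothetical stand-ins, not the actual SQL Persistence schema or any NServiceBus API. It only picks up records that are not yet marked as dispatched, since those are the ones that still carry their outgoing operations.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone


def pending_operations(conn, window_minutes, now=None):
    """Yield (incoming_message_id, operation) pairs for recent, undispatched records.

    Assumes a hypothetical OutboxData table whose Operations column holds a
    JSON array of outgoing operations and whose CreatedAt column holds an
    ISO-8601 UTC timestamp.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(minutes=window_minutes)).isoformat()
    rows = conn.execute(
        "SELECT MessageId, Operations FROM OutboxData "
        "WHERE Dispatched = 0 AND CreatedAt >= ? "
        "ORDER BY CreatedAt",
        (cutoff,),
    ).fetchall()
    for message_id, operations_json in rows:
        # Deserialize the stored outgoing operations for this record.
        for operation in json.loads(operations_json):
            yield message_id, operation


def redispatch(conn, window_minutes, send, now=None):
    """Hand every pending outgoing operation to `send` and return the count.

    `send` is a caller-supplied function (hypothetical) that pushes one
    message onto the raw transport in the DR site.
    """
    count = 0
    for _, operation in pending_operations(conn, window_minutes, now):
        send(operation["MessageId"], operation["Headers"], operation["Body"])
        count += 1
    return count
```

Because the original message IDs are reused when re-sending, receivers that also run with the Outbox enabled should be able to discard any message they had already processed before the failover.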

Is this approach wrong?
How does the rest of the community guarantee message delivery in DR scenarios?

Thanks

DavidBoike · March 4, 2021, 5:21pm

We always recommend you do disaster recovery using infrastructure capabilities, at a lower level than the “application” level, i.e. at the virtual machine level rather than trying to control which versions of endpoints are or are not currently processing messages.

Azure Service Bus contains disaster recovery features at the Premium level. I don’t know what you’re using for hosting but I assume in Azure there are options there too.

I would not recommend trying to (ab)use Outbox records as some sort of log shipping. That’s not what they’re designed for.

DorianGreen · March 4, 2021, 8:20pm

Azure Service Bus only provides geo-redundancy for metadata (queues and topics), not messages.

All infra solutions I found (Azure native or via ServiceControl auditing) rely on consuming messages from the queues and forwarding them to the failover zone manually.

These solutions fail the moment I lose communication with the transport.

With the Outbox solution, the outgoing messages are replicated together with the database and will never be out of sync with the business data.

Since the purpose of Outbox is to ensure Exactly Once processing, and atomic business and messaging transactions, it is not a stretch to extend the guarantee to delivery of the outgoing messages.

If messages in the transport are lost - and the transport doesn’t provide native geo-replication - having the ability to re-dispatch the outgoing messages can provide a transport-agnostic solution.

BBrandtTX · February 7, 2024, 5:10pm

How did it turn out @DorianGreen? We are currently at a similar decision point with DR of Azure Service Bus.