Can we keep outbox entries for a while?

bosko.stupar · March 5, 2024, 11:18am

Hi Particular,
I’d like to be able to retain outbox entries for a while, to give the transport layer time to deliver the messages safely.
The reason for this request is that the transport (ASB in this case) could be compromised, making data loss possible. It doesn’t help that the messages were written to the outbox, because they were deleted upon sending, and for a short while they exist on the broker only.

Is it possible to easily delay the deletion of the messages in the outbox table?

MikeMinutillo · March 6, 2024, 4:01am

Which persistence are you using?

This is not possible with the SQL Persistence as it clears out the outgoing messages after they have been dispatched to the broker. The outbox record itself is kept for some time, but only contains the details of the incoming message that has been processed. If you did want to keep the details of outgoing messages you could write a custom trigger to copy the data out, but you would have to test the performance implications of this and clean up the copied data yourself.
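
For illustration only, here is a minimal sketch of the trigger idea, run against an in-memory SQLite database from Python rather than SQL Server. The simplified `Outbox` table, the `OutboxArchive` table, and the `ArchiveOperations` trigger are all assumptions made for this example, not part of SQL Persistence; a real trigger would have to be written in the dialect of your database and tested under load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the outbox table (assumed schema).
CREATE TABLE Outbox (
    MessageId    TEXT PRIMARY KEY,
    Dispatched   INTEGER NOT NULL DEFAULT 0,
    DispatchedAt TEXT,
    Operations   TEXT NOT NULL
);

-- Hypothetical side table that receives a copy of the outgoing operations.
CREATE TABLE OutboxArchive (
    MessageId  TEXT NOT NULL,
    ArchivedAt TEXT NOT NULL,
    Operations TEXT NOT NULL
);

-- Hypothetical trigger: when Operations is cleared, copy the old value out.
CREATE TRIGGER ArchiveOperations
AFTER UPDATE OF Operations ON Outbox
WHEN OLD.Operations <> '[]' AND NEW.Operations = '[]'
BEGIN
    INSERT INTO OutboxArchive (MessageId, ArchivedAt, Operations)
    VALUES (OLD.MessageId, datetime('now'), OLD.Operations);
END;
""")

# Store a record with one outgoing operation.
conn.execute(
    "INSERT INTO Outbox (MessageId, Operations) VALUES (?, ?)",
    ("msg-1", '[{"type": "OrderPlaced"}]'),
)

# Mark it as dispatched and clear the operations, as the persistence does.
conn.execute(
    "UPDATE Outbox SET Dispatched = 1, DispatchedAt = datetime('now'), "
    "Operations = '[]' WHERE MessageId = ?",
    ("msg-1",),
)

archived = conn.execute(
    "SELECT MessageId, Operations FROM OutboxArchive"
).fetchall()
current = conn.execute(
    "SELECT Operations FROM Outbox WHERE MessageId = ?", ("msg-1",)
).fetchone()[0]
```

After the update, `archived` holds the copied operations while the outbox row itself contains an empty list. Cleaning up `OutboxArchive` remains your own responsibility.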

Is there a reason that the transport should be considered more susceptible to compromise and data loss than the database containing outbox records?

TimBussmann · March 6, 2024, 9:06am

Would (optionally) keeping the outgoing messages in the record cause any downsides to the Outbox behavior other than the increased storage needs?

bosko.stupar · March 6, 2024, 10:19am

Because we had such a case, where some messages were lost.
A short(ish) delay (say a few hours?) in the deletion of the outgoing messages would enable us to simply replay the messages once the transport is reestablished. All the message handlers should be idempotent and handle possible duplicates.

MikeMinutillo · March 7, 2024, 12:21am

It has the potential to cause more duplicates on the transport. The SQL Persistence returns all of the operations for the outbox records to core for dispatch when it receives a message. If the dispatched messages are not removed from the outbox record, and the incoming message is received multiple times, those same outgoing messages will be dispatched multiple times.

TimBussmann · March 7, 2024, 8:15am

Isn’t the dispatch state tracked via a separate Dispatched bool flag which is checked before dispatching messages? As long as the bool flag is properly set, keeping the data should not impact the behavior of the outbox.
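
The two positions above can be compared with a toy model. This is not NServiceBus code: `OutboxRecord` and `deliver_twice` are hypothetical names, and the model only shows how clearing operations and checking a dispatched flag each affect how many messages reach the transport when the same incoming message is delivered twice.

```python
from dataclasses import dataclass, field


@dataclass
class OutboxRecord:
    """Hypothetical, simplified outbox record."""
    message_id: str
    operations: list = field(default_factory=list)
    dispatched: bool = False


def deliver_twice(clear_on_dispatch: bool, check_flag: bool) -> int:
    """Deliver the same incoming message twice; return how many sends occur."""
    record = OutboxRecord("msg-1", ["OrderPlaced"])
    sent = []  # stands in for the transport
    for _ in range(2):
        if check_flag and record.dispatched:
            continue  # already dispatched: skip re-dispatch
        sent.extend(record.operations)
        record.dispatched = True
        if clear_on_dispatch:
            record.operations = []  # models Operations = '[]'
    return len(sent)


cleared = deliver_twice(clear_on_dispatch=True, check_flag=False)
kept_unchecked = deliver_twice(clear_on_dispatch=False, check_flag=False)
kept_checked = deliver_twice(clear_on_dispatch=False, check_flag=True)
```

In this model, keeping the operations only produces a duplicate send when nothing consults the dispatched flag before dispatching. Which of these cases matches the real persistence is exactly the point under discussion.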

mauroservienti · March 7, 2024, 8:45am

We could introduce an option to allow users to decide if transport operations should be cleared or not when setting the dispatched flag:

github.com/Particular/NServiceBus.Persistence.Sql/blob/7f6ff1081983bee52c0b4513d27cd612ad086a55/src/SqlPersistence/Outbox/SqlDialect_MsSqlServer.cs#L12-L21

    internal override string GetOutboxSetAsDispatchedCommand(string tableName)
    {
        return $@"
    update {tableName}
    set
        Dispatched = 1,
        DispatchedAt = @DispatchedAt,
        Operations = '[]'
    where MessageId = @MessageId";
    }

Setting that option so that transport operations are not cleared would allow the recovery of messages lost due to infrastructure failures, with the downside of more storage and possibly some performance impact on the outbox storage.

Those records will be cleared out anyway at regular intervals when the outbox cleanup task kicks in. The cleanup process can be disabled or configured to keep records for an arbitrary amount of time.
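
As a rough sketch of what such an option could look like, the following Python model builds the update statement with or without the `Operations = '[]'` assignment and then applies a time-based cleanup. The `clear_operations` parameter, the `cleanup_sql` helper, and the simplified SQLite table are assumptions for illustration; no such option exists in SQL Persistence, and the real cleanup task may work differently.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def set_as_dispatched_sql(table_name: str, clear_operations: bool) -> str:
    """Build the 'set as dispatched' statement (table name must be trusted)."""
    assignments = ["Dispatched = 1", "DispatchedAt = :dispatched_at"]
    if clear_operations:
        assignments.append("Operations = '[]'")
    return (
        f"update {table_name} set {', '.join(assignments)} "
        "where MessageId = :message_id"
    )


def cleanup_sql(table_name: str) -> str:
    """Build a cleanup statement that removes old dispatched records."""
    return (
        f"delete from {table_name} "
        "where Dispatched = 1 and DispatchedAt < :cutoff"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Outbox (MessageId text primary key, "
    "Dispatched integer not null default 0, DispatchedAt text, "
    "Operations text not null)"
)

# Fixed clock so the example is deterministic.
now = datetime(2024, 3, 7, 12, 0, tzinfo=timezone.utc)

# Two records dispatched 3 hours and 1 hour ago, operations retained.
for message_id, hours_ago in (("old", 3), ("recent", 1)):
    conn.execute(
        "insert into Outbox (MessageId, Operations) values (?, ?)",
        (message_id, '[{"type": "OrderPlaced"}]'),
    )
    dispatched_at = (now - timedelta(hours=hours_ago)).isoformat(timespec="seconds")
    conn.execute(
        set_as_dispatched_sql("Outbox", clear_operations=False),
        {"dispatched_at": dispatched_at, "message_id": message_id},
    )

# Cleanup with a two-hour retention window.
retention = timedelta(hours=2)
cutoff = (now - retention).isoformat(timespec="seconds")
conn.execute(cleanup_sql("Outbox"), {"cutoff": cutoff})

remaining = conn.execute("select MessageId, Operations from Outbox").fetchall()
```

With a two-hour retention window, only the record dispatched one hour ago survives the cleanup, and its operations are still available for a manual replay.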

bosko.stupar · March 7, 2024, 1:32pm

That would be awesome!

danielmarbach · March 7, 2024, 2:06pm

@mauroservienti Why would we introduce such a flag? The outbox table’s purpose is to store outgoing messages until they are dispatched and to keep a deduplication record. The outbox table is not designed to be a disaster recovery mechanism.

To me it sounds like we have gone down the path of starting from a workaround and promoting it to a feature, as well as implicitly saying “The business database is more crucial than the message broker, so it is OK to neglect disaster recovery concerns on the broker because ultimately we will have the database when the outbox is turned on”.

From an operational perspective you want to apply similar availability and disaster recovery concerns to the message broker side of things. As an example (quite likely not a complete list):

- Enable a delete lock on the ASB namespace to make sure it doesn’t get accidentally deleted
- For throughput and latency concerns, use Premium over Standard
- For disaster recovery and outage prevention, use the Premium features such as Geo-DR and availability zones

bosko.stupar · March 7, 2024, 4:10pm

@danielmarbach we will for sure invest in outage prevention. This is, however, something else.
We do want to consider our messaging infrastructure at least as important as the DB infrastructure. It is, however, not reasonable to think of the message broker as a DB, although it is responsible for the data at least for a short while. Because of the transient nature of its data storage, we can’t seriously think of backing it up and restoring it in the same way.

I think that by enabling a short delay in the deletion of the items in the outbox, we reduce the risk of data loss by a lot without sacrificing much. It should be configurable and only applicable to the topics/queues that make use of an outbox (i.e. the critical ones).

You can think of it as a workaround as such. It doesn’t bother me. It is cheap. I like it.

P.S. One of our custom outbox implementations has this feature already, and it worked like a charm in this situation. We are currently eliminating all of the custom implementations in favour of NServiceBus, though, and I’d love to have the same safety net.

TimBussmann · March 8, 2024, 9:19pm

While I agree that such an Outbox feature doesn’t provide an acceptable backup or disaster recovery mechanism (e.g. it can’t help with restoring delayed messages), it might still provide some value in various cases at a very cheap price (storage). I totally understand that this shouldn’t be seen as a replacement for doing proper backups and disaster recovery planning; damage mitigation would be the term to consider.

I assume that every project will, at some point, experience some major problems, e.g. software bugs are inevitable. These can be bugs in user code*, framework code, external services, etc. Even AWS and Azure managed to lose customer data multiple times. Overall, it seems to be a fairly simple way for NServiceBus to provide some additional damage mitigation to its users, which they will be very thankful for if they ever find themselves in such a situation.

*I find the idea of being able to replay a message in case of a message-loss-level bug in user code a really intriguing aspect (but it would also increase the scope of a proper feature, as it might require purging the inbox entry on the erroneous consumer endpoint). Auditing might be another angle for this type of mitigation, but it comes with other disadvantages and limitations.

MikeMinutillo · March 11, 2024, 5:28am

To answer the original question:

No, it is not. In the implementation of the outbox, outgoing message records are cleared when the messages have been successfully dispatched to the transport.

One of the primary reasons the outbox stores outgoing messages at all is to ensure that they make it to the transport. Once they are there, the information is not needed and is discarded.

The goal of the transport is to ensure that messages are delivered from the sender to a receiver. Ensuring that the transport does not lose messages that have been properly delivered to the transport is a transport concern, not an NServiceBus one. Once the message arrives at a destination queue NServiceBus will ensure that it is either successfully processed by the endpoint or moved to an error queue before being removed from the input queue.

If the transport loses messages in between NServiceBus endpoints, for any reason, that should be resolved at the transport level.

danielmarbach · March 11, 2024, 7:54am

Honestly, things like that have never turned out to be “cheap” in my experience. Once you start mixing multiple concerns into a single approach, it becomes very difficult to manage the throughput and performance characteristics of the approach due to conflicting requirements.

As an example, there have been ideas around making the outbox table more like a ring buffer implementation. As soon as such a “KeepThingsAroundForBackupPurposes” configuration flag comes into play, such a ring buffer approach is almost immediately out the window.

Furthermore, when you think through the potential tuning the outbox table might receive to cope with the load, and then bring in the backup flag, how would you reasonably tune those settings in relation to each other? Most of the time you can’t. For me that is an indication that some sort of trigger, replication, etc. for that table is the far better approach, because then you can properly make tradeoffs for that backup solution in terms of synchronous vs asynchronous writes, acceptable lag, storage duration, etc. It is also possible to manage those tables in your SQL server with special considerations for that backup/disaster case.

Yes, at first sight I agree it is “compellingly cheap”. On second thought, though, it isn’t for me.

These are my thoughts, though.

Regards,
Daniel

raulschnelzer · March 11, 2024, 8:18pm

After reading the conversation, I have to agree with Daniel and MikeMinutillo that it’s better to keep things separate.

Trying to hold onto dispatched messages in the outbox by marking them could cause more problems than it solves.

For example, cleaning up old messages could slow down the system or even cause deadlocks. It also requires more maintenance for the cleanup job.

Changing from deleting messages upon dispatching them to updating them might be slower and will generally create more transaction log volume.

Plus, while the data in the outbox might be considered transient, if you keep messages around longer you might need to deal with audit requirements or comply with your data protection rules.

Next, you will be tempted to query the outbox tables or even monitor them, although this is not the intended use of the outbox pattern and might introduce additional problems.

You would also need a proper way to replay/republish specific messages.

Finally, I do think that trying to make message delivery more reliable is a good goal. However, I suggest applying the right resilience patterns in the right place, and making sure the messages get delivered is the concern of the broker.

bosko.stupar · March 12, 2024, 9:33am

I appreciate the time and effort you’ve put into explaining why this might be a bad idea.

I find it regrettable that speculative drawbacks take precedence over the practicality of a proven approach (albeit with a custom implementation in a single use-case).

We shall take measures to make sure the transport doesn’t vanish in the future for sure. If there is no other way to secure the data on the broker in the future, I’d be happy to revisit this thread and discuss it with you again.

For the time being though, I thank you for your time.

elinorceleste · March 13, 2024, 5:36am

Is it possible to do?

TimBussmann · March 15, 2024, 7:25pm

I understand the initial reaction to push back on the suggestion, but I find at least the technical arguments unconvincing.

The ability to re-dispatch outbox messages for the lifetime of the outbox record still sounds like a valid feature request. Do we have to assume that the official response from Particular is that such a feature request is not going to be considered?

MikeMinutillo wrote: “If the transport loses messages in between NServiceBus endpoints, for any reason, that should be resolved at the transport level.”

It’s undeniable that sometimes things go wrong (sometimes due to user fault, sometimes not; they go wrong nevertheless, and that’s just unavoidable). While the discussion here is very solution focused, this might still be a type of problem whose mitigation you might want to consider as a value benefit for Particular Platform users.

BBrandtTX · March 25, 2024, 8:15pm

@bosko.stupar If you are using SQL Persistence with a DB that supports it, have you considered enabling Change Data Capture on the Outbox table and hooking up something like Debezium to keep your audit stream of messages?

bosko.stupar · March 26, 2024, 7:50pm

Hi @BBrandtTX
Yes, we are indeed using all of the mentioned technologies, but for different things. The events created from the persistence are, without exception, data events in our terminology, and we send those via Kafka.

For the domain events however, we prefer using broker based messaging because of different features and guarantees it offers. The very nature of the domain events almost always makes them temporal so if a message is lost, the event doesn’t exist any more unless we keep the data history and calculate events again from diffs.

This is why the transactional outbox is such a lovely tool. It makes it possible to bind message sending to the source transaction and keep the data consistent with the sent messages. The only thing it doesn’t cover is the very worst case (which happened to us): accidentally losing the critical infrastructure.