Looking for messaging transports alternative to MSMQ

Ymc · October 3, 2017, 9:14pm

Could you please give us advice on which alternative transports could meet our needs (see the list of requirements below)? Currently we use MSMQ, which causes us problems from time to time, such as consuming too much memory when we receive tons of messages. In general, MSMQ does not seem to us to be a reliable transport: occasionally it just stops working or gets slow until we restart the service. At the very least we want to try something else and compare results. Here are our needs:

We have dozens of different Nsb services deployed to 10 virtual and physical servers, handling a huge volume of messages: around a thousand messages every second for all services combined, and possibly hundreds of messages per second for a single distributed service (master -> 5 workers).
Our services work under MSDTC, committing or rolling back distributed transactions along with message deliveries. That means that if a transaction is rolled back, it’s very important that all messages sent in the handler are rolled back as well.
We’d prefer Bus over Broker to avoid a single point of failure, though we may consider a broker if there is a backup plan that we can quickly turn on in case of failure.
Incoming and outgoing messages should be stored in reliable permanent storage, like disk or a database, if the receiving Nsb service is not available. It’s critical to us that no message is ever lost. That’s exactly how MSMQ works: we can always find an undelivered message in the error queue, the transactional dead-letter queue, or the outgoing queues.

Thank you!

andreasohlund · October 4, 2017, 6:26am

If MSDTC is a hard requirement, your only alternative to MSMQ would be our SqlServer transport. Are you currently using SqlServer for other things?

Note: SqlServer is a broker-style transport, see Transport types • NServiceBus • Particular Docs

Cheers,

Andreas

danielmarbach · October 4, 2017, 6:50am

Hi, Yuri,

Switching transport might be something that should not be underestimated from an operational and migration standpoint especially if you need to keep things running while migrating.

Have you considered lowering the transactionality level? For example, do you need MSDTC everywhere? Could you slightly redesign your handlers and business logic to be able to deal with potential duplicates? In NServiceBus V6 and higher we batch, by default, all the messages sent in a handler and only send them out when the message handler and the pipeline have completed successfully. This alone already decreases the chance of things going wrong.

Based on my experience with MSMQ I can say MSMQ is a super robust transport infrastructure. It can get “wacky” with MSDTC, and the two-phase commit reduces the throughput quite “drastically.”

NServiceBus uses a safe-by-default approach. That’s why, when the transport and persister support it, it tries to leverage the highest transactionality level possible. With SQL Transport and Persistence, this means it would use the transactionality level which allows escalating to distributed transactions. This is ideally suited for existing applications that are migrated to distributed systems, or for customers who are building smaller systems and don’t want to adopt distributed systems thinking just yet due to tight timelines or only modest scaling needs at the moment.

Similarly, the outbox is a feature that achieves distributed-transaction-like behavior without requiring a distributed transaction coordinator. That makes the outbox a bit more convenient to operate and less complex to administer (gone are the days when the DTC has to be clustered, or an administrator needs to log into a local box to manually resolve a distributed transaction in an in-doubt state). While the outbox is more convenient than the DTC, the two have something in common: they are technical solutions that try to address idempotency and consistency while imposing scaling constraints on your distributed systems architecture. Coordination, whether through the DTC or the outbox, involves IO. IO is not cheap and therefore limits the throughput of your system.

It might sound like I’m totally against DTC and Outbox. That is not the case: they are perfectly valid solutions depending on your needs. That is what I’m trying to outline, while also, let’s say, challenging conventional thinking.

Having said that, with modern architectural approaches like Microservices, Containers and more, where non-functional requirements push you towards building a robust distributed system that can be scaled dynamically and potentially even be lifted-and-shifted to the cloud or to PaaS running on premises, a different mindset is required. Since you are building a new Microservice platform, you have the steering wheel in your hands to make sound architectural decisions (if your non-functional requirements demand so) and drive your Microservices towards a design which does not rely on technical solutions to idempotency and consistency. In my personal belief, we have to free ourselves from the notion of technical transactions and the utopia of technical consistency. Business processes have always had clever ways of embracing the fact that consistency can be achieved by issuing compensations. So I would encourage you to think about how you can evolve your Microservices so that they can cope with the fact that business commands might be retried and happen multiple times. By doing so, you achieve a much more scalable solution in the long run, because it requires you to think about business transactions and race conditions up-front and explicitly design for those instead of relying on some “black magic” solution that handles it behind the scenes. And by the way, even with DTC or the outbox, nothing prevents a user from submitting the “same business information” twice. Your system needs to deal with those kinds of scenarios as well.

We have an excellent article in the Azure documentation that explains that very well

Although the article is filed under Azure the same applies to on-premises solutions.

Furthermore, Pat Helland wrote an excellent paper about the life beyond distributed transactions.

http://adrianmarriott.net/logosroot/papers/LifeBeyondTxns.pdf

Another thing: if in the future you decide to switch the transport to, let’s say, RabbitMQ, that transport only supports the ReceiveOnly transaction mode. In that case not even outgoing messages are transactionally safeguarded in some scenarios, so your code would need to deal with such situations as well.
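To make the idea of coping with retried commands a little more concrete, here is a minimal sketch of a state-based idempotency check. It is plain C# with no NServiceBus dependency, and every name in it (`ApprovalStore`, `ApprovalState`, `Approve`) is hypothetical and invented purely for illustration. The in-memory dictionary stands in for the business database, so this is only an outline of the approach under those assumptions, not a recommended implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical business states, invented for this sketch.
public enum ApprovalState { Pending, Approved }

public sealed class ApprovalStore
{
    // Stand-in for the business database (assumption: one row per business id).
    private readonly Dictionary<Guid, ApprovalState> states =
        new Dictionary<Guid, ApprovalState>();

    public void Create(Guid id)
    {
        states[id] = ApprovalState.Pending;
    }

    // Applies the command only if the entity is still in the expected state.
    // Returns true when state changed, false when the command was a duplicate.
    public bool Approve(Guid id)
    {
        if (states[id] != ApprovalState.Pending)
        {
            return false; // already applied: a retried command becomes a no-op
        }

        states[id] = ApprovalState.Approved;
        return true;
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new ApprovalStore();
        var id = Guid.NewGuid();
        store.Create(id);

        Console.WriteLine(store.Approve(id)); // first delivery: True
        Console.WriteLine(store.Approve(id)); // duplicate delivery: False
    }
}
```

The point of checking the business state, rather than remembering message ids, is that the same guard also covers a user submitting the same business information twice under a different message id.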

I hope that answers your questions, or at least gives you some hints about the direction you want to take in your architecture.

Regards
Daniel

andreasohlund · October 4, 2017, 7:02am

Just adding to what Daniel said, I gave a talk comparing transport options
a while back that might help as well:

Cheers,

Andreas

ramonsmits · October 4, 2017, 9:25am

I noticed two things in your post:

“In general, MSMQ does not seem to us to be a reliable transport: occasionally it just stops working or gets slow until we restart the service.”
Use of the distributor

MSMQ slowdown:

The first is not something that is caused by MSMQ. If you have a slowdown issue then that is very likely to be caused by something else. It is very unlikely that switching to a different transport will resolve this.

Divide and conquer deployment:

Are you deploying your endpoints to individual machines, so that there is a single endpoint per machine? Were you unable to scale up that (virtual) machine, and did you increase the maximum concurrency of that endpoint to make use of all the available resources? If not, try this deployment model first: spread your endpoints across multiple machines, and only introduce the distributor for an endpoint once you’ve reached the limits of a single machine that has only that endpoint installed.

Distributor and worker configuration for throughput:

Have you increased the maximum concurrency level on the distributor nodes?

The default concurrency of the distributor process is set to 1. That means the messages are processed sequentially. Make sure that the MaximumConcurrencyLevel has been increased in the configuration of the endpoint that runs the distributor process. A good rule of thumb is to set this value to 2-4 times the number of cores of a given machine. While fine-tuning, inspect disk, CPU and network resources until one of these reaches its maximum capacity.

Source: https://docs.particular.net/transports/msmq/distributor/#performance

Increase capacity to prefetch:

In V6 the workers can be configured to have a capacity that is bigger than the configured maximum concurrency.

// Requires: using System.Configuration; (for ConfigurationManager)
var appSettings = ConfigurationManager.AppSettings;

var maxConcurrency = System.Environment.ProcessorCount * 4; // 8 cores = max concurrency 32
var prefetchSize = maxConcurrency * 4; // 128 if number of cores is 8

// Process up to maxConcurrency messages in parallel on this worker
endpointConfiguration.LimitMessageProcessingConcurrencyTo(maxConcurrency);
// Register with the distributor, asking it to keep up to prefetchSize messages queued at this worker
endpointConfiguration.EnlistWithLegacyMSMQDistributor(
    masterNodeAddress: appSettings["DistributorAddress"],
    masterNodeControlAddress: appSettings["DistributorControlAddress"],
    capacity: prefetchSize);

The configuration above sets the maximum concurrency of the endpoint and a worker capacity (prefetch size) that is larger than it. This can drastically improve the throughput of your endpoint but is only applicable if you don’t have single item congestion.

Increasing the prefetch size means that the worker does not have to wait for the distributor node to first receive a message that the worker is done and then forward a new work item. It lowers the processing latency when there are a lot of messages in the queue, because a new work item is always immediately available. The trade-off is that when a worker node fails, more messages are potentially stuck on that node and need to be recovered.

Sender side distribution

The capacity argument can only be configured in the V6 worker API, but in V6 we also have an alternative called sender-side distribution.

Endpoints using the MSMQ transport are unable to use the competing consumers pattern to scale out by adding additional worker instances. Sender-side distribution is a method of scaling out an endpoint using the MSMQ transport, without relying on a centralized distributor assigning messages to available workers.

Source: Scaling Out With Sender-side Distribution • MSMQ Transport • Particular Docs
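As a rough sketch of what the sending side might look like in the NServiceBus 6 routing API: the sender maps a message type to a logical endpoint name, and the physical instances of that endpoint are then listed separately (the linked documentation describes an instance mapping file for this). The message type `ProcessRegistration` and the endpoint names `Registration.Sender` and `Registration.Worker` are hypothetical, and the calls shown are written from my recollection of the V6 API, so please verify them against the documentation for your version.

```csharp
using NServiceBus;

// Hypothetical command type, invented for this sketch.
public class ProcessRegistration : ICommand
{
}

public static class SenderSetup
{
    public static EndpointConfiguration Create()
    {
        // "Registration.Sender" is a made-up endpoint name.
        var endpointConfiguration = new EndpointConfiguration("Registration.Sender");

        var transport = endpointConfiguration.UseTransport<MsmqTransport>();
        var routing = transport.Routing();

        // Logical routing: commands of this type go to the scaled-out endpoint.
        // Which machines host its instances is configured outside this code.
        routing.RouteToEndpoint(typeof(ProcessRegistration), "Registration.Worker");

        return endpointConfiguration;
    }
}
```

Persistence, the error queue and the rest of the endpoint setup are left out here to keep the fragment short.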

I hope this provides some info on how to improve the performance of your system without switching to a different transport at all as I think it is unlikely that MSMQ is your bottleneck.

Regards,
Ramon

Ymc · October 4, 2017, 8:14pm

Hi, Andreas

MSDTC is not a strict requirement. The only thing we use MSDTC for is to bind MSMQ message submission and delivery to the business data commit. If some error occurs during Nsb handler execution and the transaction is rolled back, all messages that were submitted in the handler should be cancelled as well. We’re on NServiceBus 5 right now, but will be moving our code to Nsb 6 or 7 soon. With Nsb 5 we believe you need to use distributed transactions for that: busconfiguration.EnableDistributedTransactions(). Please correct us if we’re wrong. I do not know if this information matters, but we used to use Bus.Publish before, and now we use Bus.Send operations only. Other than that, we want our messages to end up in some queue and not be lost in case of infrastructure, application or any other system issues.

If you tell me that in Nsb 6-7, or even in our current Nsb 5, we can turn the distributed transaction mechanism off and still keep these two features (guaranteed Send<>() rollback and guaranteed message storage), I would be happy to try turning it off.

Yes, we do use SqlServer for other things and have good experience working with it. I have a few concerns about SQL Server as a transport: 1) single point of failure; 2) given that our system handles thousands of messages a second, it would probably have to be a dedicated, powerful SQL Server instance with a pretty expensive SQL Server license; 3) the one-second polling interval might affect our system performance, though we can measure the latter.

What I think we need is called “Sends atomic with Receive” in Nsb 6 (Transport Transactions • NServiceBus • Particular Docs), which is supported by SQL Server and Azure Service Bus as well. However, I’m not sure whether MSDTC would run automatically if I configure this transaction level. It looks like it would not.

Ymc · October 4, 2017, 8:36pm

Hi Daniel,

Thank you for such a detailed answer and your recommendations on redesign. Though I agree with most of your points, the system we’re dealing with has evolved over 7 years and is supported by a wide group of developers, so it’s not easy to considerably redesign its core architecture.

We can handle duplicate messages, that’s not a big deal, but we need integrity in our multi-step registration handling process. The whole process is split into multiple small tasks, each of which is an Nsb service:
task1 → task2 → task3 → task4 → … They communicate through Nsb messages. If task2 fails and its db changes are rolled back, we do not want task3 and all subsequent tasks to get fired. This way the registration is left in the task1 state, and that’s fine. I know all the tasks can be redesigned to handle receiving messages from previously failed tasks, but it does not seem to be an easy or risk-free job.

Ymc · October 4, 2017, 8:53pm

Hi @ramonsmits,

Thank you for your hints and recommendations on Nsb service performance. We actually do not have a problem with overall performance, but rather with the stability of that performance. We have played with all the settings for years and achieved pretty good performance when things run normally. I was only talking about occasional performance degradation, which we attribute to MSMQ, but it could be caused by MSDTC as well. One thing we do blame MSMQ for is heavy disk I/O and memory consumption on the Distributor nodes in peak hours, when we receive thousands of messages a second and they start piling up in some queue, like the error queue. Once hundreds of thousands of messages have accumulated, we run out of memory because MSMQ starts to eat dozens of gigabytes of it. However, that’s not the only problem. There are other kinds of occasional slowness which we cannot explain well. That’s why our first idea was: let’s try another transport. Now, having read your responses, I’m more inclined to say: let’s try to get rid of distributed transactions, if that’s not too difficult in our case.

andreasohlund · October 5, 2017, 7:16am

This works as you describe in v6 and above due to the batched dispatch behavior introduced

And with Nsb5 we believe you need to use distributed transaction for that: busconfiguration.EnableDistributedTransactions()

Correct

Ymc · October 5, 2017, 3:25pm

Hi, Andreas

This works as you describe in v6 and above due to the batched dispatch behavior introduced

Let’s say we decide to keep MSMQ, but get rid of distributed transactions as a possible cause of many of our problems with service stability and performance. Could you please clarify the Nsb 6 transport transaction levels for me in a little more detail:

Do I understand it right that all we need to do in Nsb 6 is set the transaction level to “Sends atomic with Receive” and disable distributed transactions, keeping MSMQ as the transport? Would this guarantee that if a Service1 handler throws any kind of unhandled exception, none of the downstream messages that Service1 sent to other services before the exception get fired?
Could you please tell me what the difference is, in a practical sense, between the “Distributed transaction” level and the “Sends atomic with Receive” transactional level in Nsb 6? Could you give me an example of a case that “Distributed transaction” covers and the lower “Sends atomic with Receive” level does not?

Thank you!

SimonCropp · October 6, 2017, 5:08am

Could you please tell me what the difference is between “Distributed transaction” level and “Sends atomic with Receive” transactional level

The transaction levels are fairly well covered here: Transport Transactions • NServiceBus • Particular Docs

andreasohlund · October 6, 2017, 11:39am

Yes, that’s correct: batched dispatch makes sure nothing gets sent out until all handlers have completed successfully.

Ymc · October 6, 2017, 5:04pm

Distributed transaction:

In this mode handlers will execute inside a TransactionScope created by the transport. This means that all the data updates and queue operations are all committed or all rolled back.

Transport transaction - Sends atomic with Receive:

This mode has the same consistency guarantees as the Receive Only mode, but additionally it prevents occurrence of ghost messages since all outgoing operations are atomic with the ongoing receive operation

Alright, but what in MSMQ, besides Receive and Send consistency, could the “Sends atomic with Receive” level miss that “Distributed transaction” does not? I thought all MSMQ operations are about Send and Receive only, so I do not see what else we can lose. That’s why a practical example would be very helpful.

My guess is that the difference is that “Distributed transaction” covers cases where you use Sagas, and perhaps cases where your handler touches something besides a single database, such as another database or file writes, which you might also want to commit or roll back under a single distributed transaction. However, if we’re talking about a “no saga” case with a single database connection inside a handler, these two transport transaction levels seem to be equal. Correct?

Thank you

andreasohlund · October 7, 2017, 8:18am

I just realized I misled you regarding

The only thing we use MSDTC for is to bind MSMQ message submission and delivery to the business data commit. If some error occurs during Nsb handler execution and the transaction is rolled back, all messages that were submitted in the handler should be cancelled as well.

With MSMQ you would get the same behavior using the transport transaction mode

Since all outgoing operations would be enlisted in the receive transaction no messages would be emitted unless all handlers complete successfully.

On transports that lack support for cross queue transactions like RabbitMQ and Azure Storage Queues this would not be true though.

In short: On MSMQ you won’t need DTC to achieve what you want.
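For reference, a minimal sketch of how the “Sends atomic with Receive” level discussed in this thread can be selected in the NServiceBus 6 configuration API, assuming an endpoint on the MSMQ transport. The endpoint name `Registration.Task2` and the `EndpointSetup` class are made up for illustration, and persistence, the error queue and the rest of the endpoint setup are omitted.

```csharp
using NServiceBus;

public static class EndpointSetup
{
    public static EndpointConfiguration Create()
    {
        // "Registration.Task2" is a hypothetical endpoint name used only in this sketch.
        var endpointConfiguration = new EndpointConfiguration("Registration.Task2");

        var transport = endpointConfiguration.UseTransport<MsmqTransport>();

        // Outgoing sends are enlisted in the same MSMQ transaction as the incoming
        // receive, so they are only emitted if all handlers complete successfully.
        // This mode does not use a distributed transaction.
        transport.Transactions(TransportTransactionMode.SendsAtomicWithReceive);

        return endpointConfiguration;
    }
}
```

Keep in mind that in this mode the database work done inside the handler is not part of the queue transaction. If the database commit succeeds and the queue transaction then fails, the incoming message is retried, so handlers still need to tolerate seeing the same message again.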

As already mentioned, if you haven’t watched it yet, I really think NSBCon 2015: All about Transports • Particular Software is worth a watch.

Sorry for the confusion!

Ymc · October 9, 2017, 3:51pm

Hi Andreas,

This makes sense to us. Let us try to get rid of MSDTC first and see how it goes, and then we will see whether we need to move to a new transport as well. It might take several months, but I’ll eventually let you know whether it works for us or not.

Thanks,
Yuri