Multiple ServiceControl instances for a high-availability scenario?

The documentation about this is all about coping with a high load of audit processing. It also describes a “primary” instance, with secondary instances processing only the audit queue.

In my situation, we use ServiceControl mainly for the Heartbeats and Monitoring plugins. We also use ServicePulse to view failed messages. We disabled audits and are not really interested in them.

We are trying to modify our architecture to achieve high availability. We have already set up the broker (RabbitMQ) as a cluster for high availability, and we are in the process of modifying our services to support multiple instances of each service.

But currently, ServiceControl/ServiceMonitoring is only installed on one machine. I would like to have multiple instances so that if that machine shuts down, everything keeps working on the second one (mainly so that the heartbeat and monitoring queues don't keep growing).

The configuration requires specific queue names (SendHeartbeatTo/SendMetricDataToServiceControl), and the documentation linked above mentions that “Having multiple primary instances is discouraged.” and that secondary instances should be given different names.
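For reference, this is roughly how an endpoint points its heartbeats and metrics at those queues (a sketch assuming the NServiceBus.Heartbeat and NServiceBus.Metrics packages; the endpoint and queue names here are examples, not values from this thread):

```csharp
// Sketch: endpoint configuration for heartbeats + metrics (queue names are examples).
var endpointConfiguration = new EndpointConfiguration("Sales");

// Heartbeats go to a single ServiceControl input queue.
endpointConfiguration.SendHeartbeatTo(
    serviceControlQueue: "Particular.ServiceControl",
    frequency: TimeSpan.FromSeconds(10),
    timeToLive: TimeSpan.FromSeconds(40));

// Metric data goes to a single ServiceControl Monitoring input queue.
var metrics = endpointConfiguration.EnableMetrics();
metrics.SendMetricDataToServiceControl(
    serviceControlMetricsAddress: "Particular.Monitoring",
    interval: TimeSpan.FromSeconds(10));
```

Because each setting takes a single queue name, an endpoint cannot report to two ServiceControl instances at once, which is the limitation being asked about here.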

So I guess this scenario is not supported yet?

Multi-instance error ingestion is currently not supported.

We do support scaling out to cope with ingestion performance issues, for when a single instance cannot handle the (average) load.

Multi-instance deployments consist of at least two ServiceControl instances. In this scenario, there is a single designated instance, a primary instance, responsible for processing error messages and optionally audit messages. All other existing ServiceControl instances are secondary instances responsible only for processing audit messages.

We recommend hosting ServiceControl in isolation, on redundant, highly available storage.

Although the virtual machine itself might not be redundant, your virtualization layer should make sure that when the hardware fails, the virtual machine is restored on a different host using the same virtual disks.

This fail-over to a different hardware host could result in a temporary outage, but it should not result in the queues growing to insane sizes. Queues exist for exactly this reason: to act as a buffer in case of (hardware) issues.

If you run in the cloud, most vendors will automatically restart a virtual machine guest if the virtual host crashes. If you host on-premises, your virtualization cluster environment of choice should be able to deal with this.


So the direct answer is no at the software level.

Of course, I'm aware that this can be resolved at the virtualization level, but we are looking at extreme disaster scenarios where the whole virtual host or region would go down.

The good thing is that those services are not time-critical, and the rest of the services can continue functioning even when they are down. I also noticed that the queues stop growing after some time, probably because the messages are sent with a time-to-live, so there's no accumulation issue.

If that happened, your queues would be down too. It would require your queueing infrastructure, including your databases, to be multi-region / multi-master as well. Guaranteeing both consistency and high availability is impossible without a severe performance/latency impact.

TTBR (TimeToBeReceived) behavior differs per transport. Most transports will not purge messages with expired TTBR values from queues if the messages are not ingested:
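For context, TTBR is set per message type in NServiceBus. A sketch (the class name and one-minute window are illustrative, not from this thread):

```csharp
// Sketch: the transport is allowed to discard this message if it is not
// consumed within one minute of being sent. Whether an expired message
// sitting unconsumed in a queue is actually removed depends on the transport.
[TimeToBeReceived("00:01:00")]
public class NodeHeartbeat : IMessage
{
    public string EndpointName { get; set; }
    public DateTime ExecutedAt { get; set; }
}
```

This is why the queues in the scenario above may appear to stop growing: the plugins send their messages with a TTL, but whether expired messages are cleaned up while ServiceControl is down varies by transport.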