ServicePulse - Multiple teams monitor common bus

We have a lot of microservices owned by different teams, but they communicate with each other using NServiceBus on a common Azure Service Bus. The monitoring of the services themselves is up to the teams, but ServicePulse is now way too cluttered to be of any real use.

Is it possible for the teams to select which endpoints they are interested in and filter to only the things that are relevant for those endpoints? Preferably something that is persisted across sessions for the team members. This would make our systems a lot easier to monitor and reduce the risk of messages failing without being picked up by the team.

Thanks

Hi Morten

I’m not aware of any such functionality being built into ServicePulse at this stage. One way to achieve what you are asking for would be to have dedicated audit and error queues “per team”, similar to what we describe in our multi-instance guide under multi-region deployment.

With that you would achieve true isolation of both audit and error ingestion, and the teams could then fully own that part of the system. Have you thought about that?

Regards
Daniel

Thanks, Daniel.
This hasn’t been tried. I think both the teams and I were unaware of this documentation.

I will suggest this to the teams.

Thanks again,
Morten

Hi Daniel.

The documentation you are referring to here describes how to have dedicated ServiceControl (or ServicePulse) instances for audit queues only. It explicitly excludes error queues and, to my understanding, any error handling of messages except on the primary ServiceControl instance: “Splitting into multiple ServiceControl instances is supported only for auditing”; “error data sharding is not supported” (multi-region); “Error queue sharding is not supported and all endpoints need to route error messages to a centralized error queue handled by the primary instance.”

So my interpretation here is that you cannot (quote) have dedicated audit and error queues “per team”, or am I reading this wrong? The linked document is also annotated as “not recommended for new deployments”, but the newer document does not suggest any changes to this either.

The ability to distribute and isolate error handling per team, as Morten describes, is the primary concern here (not monitoring/diagnostics).

TIA
André Sørhus

Hi Andre

Thanks for clarifying. After rereading the original message in combination with your addition, I see it now. I completely misunderstood the original text; sorry for that.

You are right that, up to this point, only the primary instance can handle errors. It would be possible, though, to have individual error queues for parts of the system, a primary instance per error queue, and then configure their secondaries to be the audit instances.

To achieve that, every endpoint would need to define the error queue and audit queue it sends to, based on team responsibilities. You could then still audit all messages to a centralized audit queue, or have audit queues per team as well. A sketch of the endpoint side follows below.
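As a rough sketch of the endpoint side, this is standard NServiceBus configuration; the endpoint and queue names (“Team1.Sales”, “error-team1”, “audit-team1”) are invented examples, not prescribed names:

using NServiceBus;

// Sketch of a team-owned endpoint; names are examples only.
var endpointConfiguration = new EndpointConfiguration("Team1.Sales");

// v7-style transport setup; connection string elided.
endpointConfiguration.UseTransport<AzureServiceBusTransport>()
    .ConnectionString("Endpoint=sb://...");

// Failed messages go to this team's dedicated error queue,
// ingested by that team's primary ServiceControl instance.
endpointConfiguration.SendFailedMessagesTo("error-team1");

// Audit copies go to this team's audit queue (or a centralized one).
endpointConfiguration.AuditProcessedMessagesTo("audit-team1");

var endpointInstance = await Endpoint.Start(endpointConfiguration);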

Does that make sense?

Regards
Daniel

Hi Daniel, thanks for your answer.

I still need this clarified a bit more, so let me loosely describe the current infrastructure:

The organization has several (DevOps) teams that each monitor many different systems. Some of these systems use separate Azure Service Bus (ASB) instances, but many share a single common bus. Each of these systems has one or more endpoints, all with distinct queues. However, they all share the same error and audit queues (and the same ServiceControl and ServicePulse instances).

But to what you are saying: we can then have multiple error (and audit) queues on the same ASB instance, e.g. one queue per system (or team), defined on each endpoint. Then, for each of these queues, we configure a separate primary ServiceControl instance designated for a specific error and audit queue (all on the same ASB instance). So let’s say we have 10 systems and decide to have one queue per system; then we would have 10 primary ServiceControl instances and a ServicePulse instance for each of them.

Follow up questions:

  1. Any documentation that describes this setup scenario? How to point a ServiceControl instance at a specific bus and at specific error and audit queues?
  2. All the ServiceControl instances can be configured on the same machine (VM), right? We don’t need 10 different VMs, one for each ServiceControl instance?

TIA
André Sørhus

Hi Andre

A single ServicePulse instance can be used to connect to multiple ServiceControl instances by switching the connection.

Here is a setup with two teams where each team has its own dedicated error and audit queues.

The only thing I made sure of during installation was to set appropriate service names and to configure each instance’s queue settings accordingly.

(Screenshot: the error queue configuration of team 1.)

(Screenshot: the audit queue configuration of team 1.)
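In config-file terms (the installer writes these for you), the relevant entries per instance boil down to roughly the following. This is a sketch assuming the ServiceBus/ErrorQueue and ServiceBus/AuditQueue keys from the ServiceControl settings documentation; the queue names are examples:

<!-- Team 1 error instance (its exe.config) -->
<appSettings>
  <add key="ServiceBus/ErrorQueue" value="error-team1" />
</appSettings>

<!-- Team 1 audit instance -->
<appSettings>
  <add key="ServiceBus/AuditQueue" value="audit-team1" />
</appSettings>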

If you want a centralized audit instance but dedicated error instances, you need to follow the remote instance guidance and do part of the work either manually or using the PowerShell scripts.

A quick example of how this can be achieved manually: when setting up the team 1 instances, I configure the queue configuration as before.

The audit instance gets a clear name that indicates it is the centralized one.

Its queue configuration points to the centralized audit queue.
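Sketched as config again, under the same assumptions as above (“audit” as the centralized queue name is just an example):

<!-- Centralized audit instance: ingests the single shared audit queue -->
<appSettings>
  <add key="ServiceBus/AuditQueue" value="audit" />
</appSettings>

<!-- Team 1 error instance keeps its team-specific error queue -->
<appSettings>
  <add key="ServiceBus/ErrorQueue" value="error-team1" />
</appSettings>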

Then I set up the Team 2 instances.

The audit instance of Team 2 is just temporary, so I add an arbitrary queue name (we are going to delete that instance soon). Normally you could set the queue configuration to !disable, but it seems there is a current regression that we have to address first.

Once installed, remove the unnecessary Team 2 audit instance. Then stop the Team 2 error instance, edit its config, and adjust the remote instance location to point at the centralized audit instance. In my example it was:

<add key="ServiceControl/RemoteInstances" value="[{&quot;api_uri&quot;:&quot;http://localhost:44444/api/&quot;}]" />

Restart.

To your other questions

Any documentation that describes this setup scenario?

Unfortunately this is an uncommon scenario that we don’t have documentation for. It is basically a mixture of a regular installation with some remote instance modifications.

How to point a ServiceControl instance at a specific bus and at specific error and audit queues?

That is just a matter of setting up the right connection string and filling in the correct queue configuration, as shown above.
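For example, for Azure Service Bus the transport connection string lives in the instance’s config under the NServiceBus/Transport name; the namespace and key below are placeholders:

<connectionStrings>
  <add name="NServiceBus/Transport"
       connectionString="Endpoint=sb://your-namespace.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=..." />
</connectionStrings>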

All the ServiceControl instances can be configured on the same machine (VM), right?

They can, yes, as long as the machine accounts for the hardware requirements of each instance. Also be aware that on a shared machine the ingestion speed of the instances might be affected by “noisy neighbour” effects.

I hope that helps

Regards
Daniel