Clarity on expected/default setup, and operational implications

(These are multiple, semi-related questions; I can break them out if necessary. Some I’m 98% sure I know the answer to after poking at both the docs and the products, and some I’m very unsure about.)

Overall the documentation looks good and comprehensive, especially from the dev perspective. However, from the ops/infra perspective, I’m still unclear on:

  • Is ServicePulse meant to aggregate multiple environments/instances into a single ServicePulse install, or is it expected/required to have a separate instance of ServicePulse for each environment?

  • What ServiceControl-related operations, if any, (temporarily) block the processing of NSB messages? That is, if an SC instance (Error or Audit) is unavailable (service not running, network issue, whatever), do the regular “real” messages still get sent/processed by whatever client code is running, while the Error/Audit messages pile up in their queues? Said another way: if a production ServiceControl server/VM/instance is down, how bad is it?

  • How the monitoring queue/ServicePulse is expected to be set up. We have separate Azure Service Bus namespaces for each environment (Dev1, Test3, etc.), each with the same queue names. Seems straightforward. The “Configuring endpoints for monitoring” page (ServicePulse • Particular Docs) says “Forward audit data to a single audit and error queue that is monitored by a ServiceControl instance.” Is that a single queue for both types, per environment? What’s the default name of the queue?

  • The use of the word “Endpoint” is ambiguous: are these the endpoints (queues) on whatever transport is in use, or are they the “clients” on which the NSB-specific code is executing?

  • For the no-downtime Audit upgrades, if there is an instance with the old RavenDB3 format on one server, can it be live migrated/replaced with a new RavenDB5-backed instance on another server created with a newer version? (4.26 → 4.31, or even a 5.x?)

Hi @OranguTech

Thanks for your feedback regarding the docs. Regarding your questions, please see the responses below.

  • ServicePulse (SP) does not aggregate data from different environments such as Dev, Staging, or Prod. However, you could configure ServiceControl (SC) error/audit instances for each environment and point SP at them one at a time. If your endpoints scale out, SP will be able to show all of those instances as part of its monitoring.
  • The actual processing of messages is not affected by the SC service being down. NServiceBus endpoints must be configured to send data about their operations to a set of centralized queues that are unique to the system. ServiceControl monitors these queues, then collects and processes the data from the NServiceBus endpoints.
    The data is sent to those queues even when SC is down. Once SC is up and running again, it will start picking up the audit and error messages from the queues. Sometimes you could see high CPU utilization, and here is a doc that explains that behavior.
  • SC and SP are server applications. They should be deployed in each environment, for example Dev, QA, or Prod. Usually, audit messages are sent to ‘audit’ queues and failed messages to ‘error’ queues, with each environment having its own set of queues. You can change the queue names (a configuration sketch follows this list). Here are some additional docs on auditing and error handling.
  • Please see this doc for further clarification on what an endpoint refers to.
  • With regard to upgrades of ServiceControl audit instances, please refer to the following docs and let us know if you run into any issues during the upgrade.
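To make the queue setup concrete, here is a minimal sketch of how one environment’s endpoint could be pointed at its audit, error, and monitoring queues. This assumes an NServiceBus 7-style C# configuration; the endpoint name, the learning transport, and the 10-second interval are illustrative assumptions rather than recommendations, and the metrics call requires the NServiceBus.Metrics.ServiceControl package.

```csharp
using System;
using System.Threading.Tasks;
using NServiceBus;

class Program
{
    static async Task Main()
    {
        // Hypothetical endpoint name for one environment.
        var endpointConfiguration = new EndpointConfiguration("Sales.Dev1");

        // Learning transport only for this sketch; in the scenario above each
        // environment would instead use its own Azure Service Bus namespace.
        endpointConfiguration.UseTransport<LearningTransport>();

        // Copies of processed messages go to the audit queue ingested by the
        // ServiceControl Audit instance for this environment.
        endpointConfiguration.AuditProcessedMessagesTo("audit");

        // Failed messages go to the error queue ingested by the ServiceControl
        // (error) instance for this environment.
        endpointConfiguration.SendFailedMessagesTo("error");

        // Metrics for ServicePulse monitoring, sent to the monitoring
        // instance's input queue (typically named "Particular.Monitoring").
        var metrics = endpointConfiguration.EnableMetrics();
        metrics.SendMetricDataToServiceControl(
            "Particular.Monitoring",
            TimeSpan.FromSeconds(10));

        var endpointInstance = await Endpoint.Start(endpointConfiguration);
        await endpointInstance.Stop();
    }
}
```

In a multi-environment setup like the one described, each environment’s endpoints would keep the same logical queue names but live in that environment’s own namespace, with a matching set of ServiceControl, ServiceControl Audit, and Monitoring instances watching those queues.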

Hope this helps.

Thank you @Jayanthi_Sourirajan - (sorry, I didn’t see the notification earlier).

Yes I had already read, I think, every document you linked to - multiple times. :slight_smile:

Having now had time to process things a bit more (and having worked through the developer quickstart and re-read the upgrade tips), I think I understand. (My main concern with upgrading is that our current DB storage has grown to, IMO, ridiculous sizes*; while having ServicePulse or whatever offline for a while is no big deal, I didn’t want the queues to back up since maintenance might take a long time, so I’m trying to schedule it for when there is other system downtime.)

*The Prod Error queue is 99 GB and the Audit is 55 GB, but I’m 90% sure the actual data is much, much smaller; the DBs just need maintenance.