ServicePulse unable to connect to endpoint instances

Hi

The Monitoring view in ServicePulse can’t consistently connect to the endpoint queues.

I’m getting the following:
[screenshot]
with the message “Unable to connect to instance” when hovering over the error icon.

For some of the endpoints this warning is permanent, for a few it goes away for a second every once in a while. So basically we only get ? most of the time on the monitoring overview.

We’re using SQL Server Transport for all of our endpoints. The rest of ServicePulse appears to work fine (failed messages showing up, retries, heartbeats etc).

Do you have any suggestions on how I could investigate and solve this issue? Any additional information that you require?

Thanks,
Philipp

Hi Philipp,

I’m assuming you’ve already gone through the troubleshooting guide for SP?

Hi Sean,

yes, I did. None of the mentioned guides refer to the issue we’re experiencing.

Edit: also, none of the log files (Monitoring, ServiceControl, Audit, and the IIS log of SP) surfaced any clues. Can you confirm that there’s no dedicated ServicePulse log, apart from the IIS one?

Could it be related to the relatively long interval we use to send metrics? It’s currently configured as 1 minute, using

MetricsOptions options = endpointConfiguration.EnableMetrics();
options.SendMetricDataToServiceControl("Monitoring", TimeSpan.FromMinutes(1));

ServicePulse is only a bunch of static HTML files hosted in a simple webserver. It is a UI on top of the ServiceControl API.

@philippdolder That is very likely. Just go for something below 30 seconds.

Although there is nothing wrong with 1 minute, it will take up to a minute to see new updates if the endpoint is idle. The interval only controls when the internal buffer is flushed if it has not yet reached its limit.

5, 15, or 30 seconds are good values to use. Lower isn’t very useful.

@ramonsmits
Today we could finally deploy this change to production. After changing the metrics interval to 15 seconds we see reliable data in the Monitoring tab, so I imagine that 1 minute is just too long an interval to get data reliably, i.e. we didn’t see any data most of the time before going down to 15 seconds.
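
For anyone landing here later, the change amounts to roughly the following. This is only a sketch: it assumes the same "Monitoring" queue name as in my earlier snippet, and the endpoint name "Sales" is a placeholder, not our real one.

```csharp
using System;
using NServiceBus;

// "Sales" is a hypothetical endpoint name used for illustration only.
var endpointConfiguration = new EndpointConfiguration("Sales");

// Enable metrics and report them to the monitoring instance queue.
MetricsOptions options = endpointConfiguration.EnableMetrics();

// 15 seconds instead of the previous 1 minute; 5, 15, or 30 seconds
// were the values suggested above.
options.SendMetricDataToServiceControl("Monitoring", TimeSpan.FromSeconds(15));
```

The only difference from the original configuration is the interval passed as the second argument.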

I believe including the suggested values in the docs would help other developers in the future. I’m happy to create a PR for the doc change.